javascript - Scraping text hidden behind(?) a JavaScript object

I am using scrapy and want to extract a text element. This is the page I am scraping: http://www.idealo.de/preisvergleich/OffersOfProduct/3131289_-vitodens-222-f-13-kw-viessmann.html

I am using the following XPath expression:

for sel in response.xpath('//tr'):
    sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/text()').extract()

This works for some of the products (table rows) in the HTML. In some cases, however, a JavaScript block is embedded directly before the text:

<td class="title">
  <a class="offer-title link-2 webtrekk wt-prompt" ... >
    <script type="text/javascript"> ... </script>
    text I need 
  </a>
</td>

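In these cases I cannot retrieve "text I need".

I have also searched for and tried several other XPath options, such as selecting all child nodes. These are the variants I tried: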

        # item['longtitle'] = sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/script[@type="text/javascript"]/following-sibling::*').extract()
        # item['longtitle'] = sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/script[@type="text/javascript"]/node()').extract()
        item['longtitle'] = sel.xpath('td[@class="title"]/text()[0]').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/node()').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/text()').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/node()').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/text()').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/a[2]').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/*').extract()
        ## item['longtitle'] = sel.xpath('td[@class="title"]/a[@class="offer-title link-2 webtrekk wt-prompt"]/script[@type="text/javascript"]/text()').extract()

But every one of them fails.

I would appreciate any help. Thanks.

Best answer

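It looks like, for the cells that contain a <script> tag, there is no text in the HTML node at all. A quick look at the page's JavaScript (which happens to be unminified) suggests that these cells are populated with text when the JS runs. So you are not going crazy: the raw HTML for those cells really contains no text.

To get that text, you either need to follow the link and take the title from the next page (this would have to be conditional in some way, since not every link points to the same site), or you need to load the page with something that runs the JS, e.g. Selenium ( pip install selenium ):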
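
As a rough illustration of why the static XPath attempts come back empty, the sketch below parses two hand-written cells with Python's standard-library xml.etree.ElementTree. The direct_text helper and both sample strings are made up for this example: the first mirrors the simplified markup from the question, the second is only an assumption about what the downloaded HTML looks like for the problematic cells (a script and nothing after it).

```python
import xml.etree.ElementTree as ET

# Simplified markup from the question: text follows the <script> inside <a>.
WITH_TEXT = """
<td class="title">
  <a class="offer-title">
    <script type="text/javascript">var x = 1;</script>
    text I need
  </a>
</td>
"""

# Assumed shape of the raw HTML for the problematic cells: only the script.
SCRIPT_ONLY = """
<td class="title">
  <a class="offer-title">
    <script type="text/javascript">var x = 1;</script>
  </a>
</td>
"""


def direct_text(cell_markup):
    """Return the non-blank text nodes that are direct children of <a>."""
    link = ET.fromstring(cell_markup.strip()).find("a")
    # Direct text nodes of <a> are its own .text plus each child's .tail;
    # the script's source lives in script.text and is deliberately skipped.
    pieces = [link.text] + [child.tail for child in link]
    return [p.strip() for p in pieces if p and p.strip()]


print(direct_text(WITH_TEXT))    # ['text I need']
print(direct_text(SCRIPT_ONLY))  # []
```

If the text were present in the downloaded HTML, a/text() would already return it, which is consistent with the conclusion that it is injected at runtime.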

>>> from selenium import webdriver
>>> my_driver = webdriver.PhantomJS()
>>> my_driver.get(response.url)
>>> results = my_driver.find_elements_by_xpath('//table[contains(@class, "modular")]//tr[.//a]')
>>> for row in results:
...     print(row.find_element_by_xpath('./td[@class="title"]/a').text)
Viessmann Vitodens 222-F Kompakt-Brennwerttherme, 13 kW, VT100, HE ohne Abgaspaket, ohne Anschluss-Set Viessmann
Viessmann Vitodens 222-F wahlweise 13,19, 26 oder 35 kW + Vitotronic 100 oder 200 (Regelung: Vitotronic 100, max. Wärmeleistung (KW): 13)
Paket Vitodens 222-W 13KW mit Vitotronic 200, Ladespeicher und Montagehilfe AP
Viessmann Vitodens 222-F nach Wahl, 13, 19 & 26 kW, Gas-Brennwert-Kompaktgeräte (Abgaspaket: Ohne, Anschluss-Set: Ohne, Regelung: Vitotronic 100, Heizkreispumpe: Hocheffizient, Leistung: 13kW)
Vitodens 222-F 13 kW mit Vitotronic 100 HC1B, hocheffizient
Viessmann Paket Vitodens 222-F 13 kW Vitotronic
Viessmann Paket Vitodens 222-F 13 kW Vitotronic
Viessmann Vitodens 222-F B2TA mit Vitotronic 100 3,2 - 13,0 kW
Viessmann Vitodens 222-F B2SA mit Vitotronic 100 3,2 - 13,0 kW
Viessmann 222-F Gastherme, 13 kW, B2SA010, Speicher innenbeheizt, Aufputz l/r Viessmann
Viessmann 222-F Gastherme, 13 kW, B2SA007, Speicher innenbeheizt, Aufputz oben Viessmann
Vitodens 222-F mit Vitotronic 200, Ladespeicher 3,2 - 13,0 kW
Viessmann Vitodens 222-F BS2A mit Vitotronic 200 3,2 - 13,0 kW
Vitodens 222-F 13KW mit Speicher mit Vitotronic 200 Kompaktgerät
Vitotronic 200 HO1B, Montagehilfe AP 3,2 - 13,0 kW, Aufputz-Montage

That is what you are after: 15 results.
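
The session above relies on PhantomJS and the find_elements_by_xpath helpers, which recent Selenium releases have, to my knowledge, dropped. A minimal sketch of the same idea against the Selenium 4 style API might look like the following. Headless Chrome (with a matching driver installed) is an assumption, not what the answer used, and the names clean_titles and fetch_offer_titles are hypothetical.

```python
def clean_titles(raw_texts):
    """Collapse whitespace and drop empty entries from scraped link texts."""
    cleaned = (" ".join(text.split()) for text in raw_texts)
    return [title for title in cleaned if title]


def fetch_offer_titles(url):
    """Render `url` in headless Chrome and return the offer titles."""
    # Imported lazily so that clean_titles() is usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Same XPath expressions as in the session above.
        rows = driver.find_elements(
            By.XPATH, '//table[contains(@class, "modular")]//tr[.//a]'
        )
        links = [
            row.find_element(By.XPATH, './td[@class="title"]/a') for row in rows
        ]
        return clean_titles(link.text for link in links)
    finally:
        driver.quit()  # always release the browser process
```

Inside a spider callback this would be called as fetch_offer_titles(response.url), which, like the session above, loads the page a second time in the browser.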

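Note: this would clearly be better placed in a downloader middleware, so that the same URL is not requested more than once, but I will leave that to you ;)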
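
One way the downloader-middleware idea could be sketched is shown below. This is not the answerer's code: the class name JsRenderMiddleware, the render_js meta key, and the choice of headless Chrome are all assumptions made for illustration. The middleware still has to be enabled through Scrapy's DOWNLOADER_MIDDLEWARES setting, and shutting the browser down when the spider closes is left out.

```python
class JsRenderMiddleware:
    """Hypothetical downloader middleware: renders flagged requests in a browser."""

    def __init__(self, driver_factory):
        # driver_factory is any zero-argument callable returning a Selenium
        # WebDriver; the browser is only started when it is first needed.
        self._driver_factory = driver_factory
        self._driver = None

    @classmethod
    def from_crawler(cls, crawler):
        def make_driver():
            # Assumption: headless Chrome with a matching driver installed.
            from selenium import webdriver
            from selenium.webdriver.chrome.options import Options

            options = Options()
            options.add_argument("--headless")
            return webdriver.Chrome(options=options)

        return cls(make_driver)

    def process_request(self, request, spider):
        # Only intercept requests that opt in via request.meta; returning
        # None lets Scrapy download everything else as usual.
        if not request.meta.get("render_js"):
            return None

        # Imported lazily so the pass-through path does not need Scrapy.
        from scrapy.http import HtmlResponse

        if self._driver is None:
            self._driver = self._driver_factory()
        self._driver.get(request.url)
        # Returning a Response here short-circuits Scrapy's own download,
        # so the URL is fetched once, by the browser.
        return HtmlResponse(
            url=self._driver.current_url,
            body=self._driver.page_source,
            encoding="utf-8",
            request=request,
        )
```

A spider would then opt in per request, for example with scrapy.Request(url, meta={"render_js": True}), and keep using response.xpath(...) on the rendered HTML in its callback.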

A similar question to "javascript - Scraping text hidden behind(?) a JavaScript object" can be found on Stack Overflow: https://stackoverflow.com/questions/31135541/
