I built a spider to crawl a single site: www.docteur.ch/generalistes/generalistes_k_ag.html
It scrapes the td cells of tables in the following format:
<table class="novip">
  <tr class="novip">
    <td class="novip-portrait-picture"
        rowspan="5">
      <a class="novip-portrait-picture"
         href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html">
        <img class="novip-portrait-picture"
             src="/customer_controlled/pictures/65903/portrait/65903.png"
             alt="Pas d'image encore"
             onError="portrait_m_image_failover(this)" />
      </a>
    </td>
    <td class="novip-left">
      <a class="novip-firmen-name"
         href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html"
         target="_top">
        Baumberger Hans Rudolf
      </a>
    </td>
    <td class="novip-right"
        width="25%">
      <a class="novip"
         href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html"
         target="_top">
        rating info: <img class="novip-inforating"
                          src="/img/general/stars/stars3 "
                          alt="rating info"
                          width="70" height="14" align="bottom" border="0" />
      </a>
    </td>
  </tr>
  <tr class="novip">
    <td class="novip-left">
      Dr. med. Facharzt FMH für Allgemeine Innere Medizin
    </td>
  </tr>
  <tr class="novip">
    <td class="novip-left">
      Bahnhofstrasse 92, 5000 Aarau
    </td>
    <td class="novip-right-telefon">
      tél: 062 822 46 28
    </td>
  </tr>
  <tr class="novip">
    <td class="novip-left-email">
      e-mail:
      <a class="novip-left-send-message-button-inactive"
         href="/eintrag/fr_keine_mitteilung_moeglich.html">
        Envoyer un message
      </a>
      <a class="novip-left-make_appointment-button-inactive"
         href="/eintrag/fr_kein_termin_moeglich.html">
        prendre un rendez-vous
      </a>
    </td>
    <td class="novip-right-fax">
      fax: 062 822 35 20
    </td>
  </tr>
</table>
I only want to extract each doctor's name, using the following code:
import scrapy

from docteur.items import DocteurItem


class DocteurGeneralistSpider(scrapy.Spider):
    name = "docteur_generalist"
    allowed_domains = ["docteur.ch"]
    start_urls = [
        'http://www.docteur.ch/generalistes/generalistes_k_ag.html',
    ]

    def parse(self, response):
        for sel in response.xpath('//table/tr[@class="novip"]'):
            item = DocteurItem()
            item['name'] = sel.xpath('.//td[2]/a[@class="novip-firmen-name"]/text()[normalize-space()]').extract_first(default='not-found')
            #item['phone'] = sel.xpath('.//td[@class="novip-right-telefon"]/text()[normalize-space()]').extract_first()
            yield item
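The names are extracted, but every entry is also followed by empty results (three "not-found" items each in the output below), even though the page source contains no empty td: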
[{"name": "\n Baumberger\u00a0Hans Rudolf\n "},
{"name": "not-found"},
{"name": "not-found"},
{"name": "not-found"},
{"name": "\n Bettschart\u00a0Robert\n "},
{"name": "not-found"},
{"name": "not-found"},
{"name": "not-found"},
....]
What is wrong with my code? How can I extract only the cells that have a value?
Best answer
This gets all the names:
names = response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]//text()').extract()
It matches just the 467 names and nothing else:
In [14]: names = response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]')
In [15]: len(names)
Out[15]: 467
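Your loop checks every tr, but in each table only the first tr contains the anchor with class="novip-firmen-name". For every other tr the XPath matches nothing, so extract_first() falls back to the default value.
Taking the first few rows shows what is happening:
In [23]: for sel in response.xpath('//table/tr[@class="novip"]')[:5]:
   ....:     print(sel.xpath('.//td[2]/a[@class="novip-firmen-name"]'))
   ....: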
[<Selector xpath='.//td[2]/a[@class="novip-firmen-name"]' data=u'<a class="novip-firmen-name" href="/mede'>]
[]
[]
[]
[<Selector xpath='.//td[2]/a[@class="novip-firmen-name"]' data=u'<a class="novip-firmen-name" href="/mede'>]
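The empty matches can also be reproduced offline. The following is only a minimal sketch, not the Scrapy code above: it swaps in the standard library's xml.etree.ElementTree (which supports a limited XPath subset) and a trimmed, well-formed copy of the table from the question. The names SAMPLE and names_per_row are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of one doctor's table from the question.
SAMPLE = """
<table class="novip">
  <tr class="novip">
    <td class="novip-portrait-picture" rowspan="5"></td>
    <td class="novip-left">
      <a class="novip-firmen-name"
         href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html">
        Baumberger Hans Rudolf
      </a>
    </td>
  </tr>
  <tr class="novip">
    <td class="novip-left">Dr. med. Facharzt FMH für Allgemeine Innere Medizin</td>
  </tr>
  <tr class="novip">
    <td class="novip-left">Bahnhofstrasse 92, 5000 Aarau</td>
    <td class="novip-right-telefon">tél: 062 822 46 28</td>
  </tr>
  <tr class="novip">
    <td class="novip-left-email">e-mail:</td>
    <td class="novip-right-fax">fax: 062 822 35 20</td>
  </tr>
</table>
"""


def names_per_row(markup, default="not-found"):
    """Mimic the question's loop: produce one result per <tr class="novip">."""
    table = ET.fromstring(markup)
    results = []
    for row in table.findall("./tr[@class='novip']"):
        anchor = row.find(".//a[@class='novip-firmen-name']")
        if anchor is None:
            # Rows holding the title, address or e-mail have no name anchor.
            results.append(default)
        else:
            # Collapse the newlines and indentation around the name.
            results.append(" ".join(anchor.text.split()))
    return results


row_results = names_per_row(SAMPLE)
print(row_results)
```

Only the first of the four rows carries the name anchor, so the other three fall back to the default, which matches the pattern in the question's output.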
If you search only for the anchor tags with class="novip-firmen-name", you get exactly what you want:
In [38]: for sel in response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]')[:5]:
   ....:     print(sel.xpath('.//text()').extract_first().strip())
   ....:
Baumberger Hans Rudolf
Bettschart Robert
Bock Andreas
Brändli Heinrich
Buchser Marcel
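Note that strip() only trims the ends of the string. The raw value in the question's output contains a non-breaking space (\u00a0) between surname and first name, and that character survives strip(). A minimal sketch of one way to normalize it, assuming Python 3 and a hypothetical helper named clean_name:

```python
def clean_name(raw):
    """Collapse every whitespace run, including U+00A0, into a single space."""
    # str.split() without arguments splits on any Unicode whitespace,
    # so the non-breaking space is handled like an ordinary space.
    return " ".join(raw.split())


# Raw value as it appears in the question's JSON output.
raw = "\n Baumberger\u00a0Hans Rudolf\n "

stripped = raw.strip()     # surrounding whitespace removed, \u00a0 still inside
cleaned = clean_name(raw)  # surrounding and interior whitespace normalized

print(repr(stripped))
print(repr(cleaned))
```

Whether to keep the non-breaking space is a design choice; normalizing it makes later string comparisons and exports more predictable.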
Alternatively, select the tds that contain an anchor with that class and read the name from each of those tds:
In [39]: for sel in response.xpath('//table/tr[@class="novip"]/td[a[@class="novip-firmen-name"]]')[:5]:
   ....:     print(sel.xpath('./a/text()').extract_first().strip())
   ....:
Baumberger Hans Rudolf
Bettschart Robert
Bock Andreas
Brändli Heinrich
Buchser Marcel
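If the phone number from the commented-out line in the question is wanted as well, one option is to build one item per table instead of one per row, since the name and the phone sit in different rows of the same table. The sketch below assumes each doctor has exactly one table with class="novip", as in the sample HTML. It again uses xml.etree.ElementTree on trimmed markup rather than Scrapy, and PAGE, text_of and doctors are hypothetical names. In the spider, the analogous idea would be to loop over response.xpath('//table[@class="novip"]') and run the relative XPaths against each table.

```python
import xml.etree.ElementTree as ET

# Two trimmed tables; the second one deliberately has no phone row.
PAGE = """
<body>
  <table class="novip">
    <tr class="novip">
      <td class="novip-left"><a class="novip-firmen-name">Baumberger Hans Rudolf</a></td>
    </tr>
    <tr class="novip">
      <td class="novip-left">Bahnhofstrasse 92, 5000 Aarau</td>
      <td class="novip-right-telefon">tél: 062 822 46 28</td>
    </tr>
  </table>
  <table class="novip">
    <tr class="novip">
      <td class="novip-left"><a class="novip-firmen-name">Bettschart Robert</a></td>
    </tr>
  </table>
</body>
"""


def text_of(element):
    """Return the element's whitespace-normalized text, or None if missing."""
    if element is None or element.text is None:
        return None
    return " ".join(element.text.split())


def doctors(markup):
    """Build one dict per table, so rows without a name never become items."""
    root = ET.fromstring(markup)
    items = []
    for table in root.iter("table"):
        name = text_of(table.find(".//a[@class='novip-firmen-name']"))
        if name is None:
            continue  # skip tables that are not doctor entries
        phone = text_of(table.find(".//td[@class='novip-right-telefon']"))
        items.append({"name": name, "phone": phone})
    return items


items = doctors(PAGE)
for item in items:
    print(item)
```

A missing phone cell yields None for that field instead of producing an extra placeholder item.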
Source: the Stack Overflow question "python - Scrapy extracts empty td values although there are no empty values in the table": https://stackoverflow.com/questions/37100827/