python - lxml etree.parse.xpath() returns items that contain only tabs and newlines

Tags: python xpath lxml

For a typical eBay search results page, such as this one, I use lxml to extract the price of each result:

import urllib2
from lxml import etree

url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Mizuno+Pants+Baseball&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
xpathselector="//span[@class ='bold bidsold']/text()"
tree.xpath(xpathselector)

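Although there are 50 search results (and therefore 50 prices), tree.xpath(xpathselector) returns a list of length 100. It contains all the prices, but also items that consist only of newlines and tabs. (Ignore the difference between these prices and the ones shown on the web page; that is due to my geographic location.) Why does this happen?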

['\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 u' 1\xc2\xa0049.27',
 '\n\t\t\t\t\t',
 ' 965.31',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 ' 883.56',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 ' 827.21',
 '\n\t\t\t\t\t',
 ' 827.21',
 '\n\t\t\t\t\t',
 ' 827.21',
 '\n\t\t\t\t\t',
 ' 827.21',
 '\n\t\t\t\t\t',
 ' 800.97',
 '\n\t\t\t\t\t',
 ' 799.59',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 ' 716.73',
 '\n\t\t\t\t\t',
 ' 716.73',
 '\n\t\t\t\t\t',
 ' 716.73',
 '\n\t\t\t\t\t',
 ' 690.22',
 '\n\t\t\t\t\t',
 ' 662.60',
 '\n\t\t\t\t\t',
 ' 662.60',
 '\n\t\t\t\t\t',
 ' 635.25',
 '\n\t\t\t\t\t',
 ' 606.25',
 '\n\t\t\t\t\t',
 ' 606.25',
 '\n\t\t\t\t\t',
 ' 552.39',
 '\n\t\t\t\t\t',
 ' 552.39',
 '\n\t\t\t\t\t',
 ' 552.39',
 '\n\t\t\t\t\t',
 ' 552.39',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 ' 551.01',
 '\n\t\t\t\t\t',
 ' 551.01',
 '\n\t\t\t\t\t',
 ' 517.59',
 '\n\t\t\t\t\t',
 ' 497.16',
 '\n\t\t\t\t\t',
 ' 496.88',
 '\n\t\t\t\t\t',
 ' 496.88',
 '\n\t\t\t\t\t',
 ' 496.60',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 ' 469.26',
 '\n\t\t\t\t\t',
 '\n\t\t\t\t\t\t\t\t',
 '\n\t\t\t\t\t',
 ' 468.15',
 '\n\t\t\t\t\t',
 ' 414.30',
 '\n\t\t\t\t\t',
 ' 414.02',
 '\n\t\t\t\t\t',
 ' 414.02',
 '\n\t\t\t\t\t',
 ' 414.02',
 '\n\t\t\t\t\t',
 ' 414.02',
 '\n\t\t\t\t\t',
 ' 386.68']

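Best answer

Newlines and other whitespace located directly inside the target span are text nodes too, so the span[...]/text() part of your XPath selects them as well. You can use the XPath normalize-space() function in a predicate to filter out the whitespace-only text nodes: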

xpathselector="//span[@class ='bold bidsold']/text()[normalize-space()]"

Output:

['506,533.33', '506,000.00', '466,000.00', '399,333.33', '399,333.33', '399,333.33', '399,333.33', '399,333.33', '386,666.67', '386,000.00', '346,000.00', '346,000.00', '346,000.00', '333,200.00', '333,200.00', '333,066.67', '319,866.67', '319,866.67', '306,666.67', '293,066.67', '292,666.67', '292,666.67', '266,666.67', '266,666.67', '266,666.67', '266,666.67', '266,533.33', '266,533.33', '266,533.33', '266,000.00', '266,000.00', '253,200.00', '249,866.67', '240,000.00', '239,866.67', '239,866.67', '239,733.33', '226,533.33']
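
To see why the whitespace-only items appear, here is a minimal sketch that can be run without fetching the live page. The HTML string is hypothetical markup, not eBay's real page: it only assumes that each price span contains indentation, then a child element, then the price. The names html, base, all_nodes, prices and stripped are illustrative.

```python
from lxml import etree

# Hypothetical markup (an assumption, not eBay's actual HTML): each price span
# starts with indentation, then a child <b> element, then the price text.
html = (
    "<html><body>"
    "<span class='bold bidsold'>\n\t\t<b>EUR</b> 965.31</span>"
    "<span class='bold bidsold'>\n\t\t<b>EUR</b> 883.56</span>"
    "</body></html>"
)
root = etree.fromstring(html, etree.HTMLParser())

base = "//span[@class='bold bidsold']/text()"

# Without a predicate, the indentation before <b> comes back as its own text node.
all_nodes = root.xpath(base)

# normalize-space() yields an empty string for whitespace-only nodes, which is
# false inside a predicate, so those nodes are dropped.
prices = root.xpath(base + "[normalize-space()]")

# The same filtering done in Python after the query, also trimming each value.
stripped = [t.strip() for t in all_nodes if t.strip()]

print(all_nodes)
print(prices)
print(stripped)
```

With this sample markup the unfiltered query returns four text nodes (two per span), while the filtered query returns only the two prices. The predicate keeps the text exactly as it appears in the document, so leading spaces survive; the list comprehension is an alternative when the XPath cannot be changed or trimmed values are wanted.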

Source: a similar question on Stack Overflow about "python - lxml etree.parse.xpath() returns items that contain only tabs and newlines": https://stackoverflow.com/questions/33048471/
