xml - How to get the text of a node and its child nodes with XPath

Tags: xml xpath web-scraping

I want an XPath that gets all of the text contained in a specific node and in its child nodes.

In the example below, I am trying to get: "Neil Carmichael (Stroud) (Con):"

<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
    <b><b>Neil Carmichael</b>
     "(Stroud) (Con):"
    </b>
    "What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>

So far, the following code only gets me one of the parts:

from lxml import html 
import requests 
page = requests.get('http://www.publications.parliament.uk/pa/cm201516/cmhansrd/cm160210/debtext/160210-0001.htm') 
tree = html.fromstring(page.content) 

test2 = tree.xpath('//div[@id="content-small"]/p[(a[@name[starts-with(.,"st_o")]] or a[@name[starts-with(.,"qn_")]])]/b/text()')

Any help is welcome!

Best answer

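Stop your XPath at /b so that it returns the <b> elements rather than the text nodes inside <b>. You can then call text_content() on each element to get the expected text output, for example: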

from lxml import html

raw = '''<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
    <b><b>Neil Carmichael</b>
     "(Stroud) (Con):"
    </b>
    "What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>'''

root = html.fromstring(raw)
result = root.xpath('//p/b')
print(repr(result[0].text_content()))  # repr() makes the whitespace visible

# output :
# 'Neil Carmichael\n     "(Stroud) (Con):"\n    '
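
To see why the expression in the question returned only one part, it may help to compare what each XPath step selects. The following is a minimal sketch, assuming a trimmed-down copy of the markup from the question (the `snippet` variable and the other names are illustrative, not from the original post):

```python
from lxml import html

# Trimmed-down copy of the markup from the question (illustrative only).
snippet = (
    '<p>1. <b><b>Neil Carmichael</b> (Stroud) (Con):</b> '
    'What assessment he has made of the value to the economy.</p>'
)
root = html.fromstring(snippet)

# text() selects only the text nodes that are direct children of the outer <b>,
# so the name inside the nested <b> is missed.
direct = root.xpath('//p/b/text()')

# //text() selects text nodes at any depth below the outer <b>.
nested = root.xpath('//p/b//text()')

# text_content() concatenates all descendant text in document order.
joined = root.xpath('//p/b')[0].text_content()

print(direct)
print(nested)
print(joined)
```

Joining the `//text()` results by hand would give the same string as `text_content()`; selecting the element and calling `text_content()` is simply less work.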

As an alternative to text_content(), you can use the XPath string() function, optionally combined with normalize-space():

print(result[0].xpath('string(normalize-space())'))
# output :
# Neil Carmichael "(Stroud) (Con):"
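
The same idea can be applied to every matching paragraph on a page rather than just the first result. This is a rough sketch, assuming a page structure like the one the question's XPath targets; the `sample` markup, the speaker names, and the variable names are made-up placeholders, not data from the real page:

```python
from lxml import html

# Hypothetical sample mimicking the structure targeted in the question.
sample = '''<div id="content-small">
<p><a class="anchor" name="qn_o0"> </a>1. <b><b>Speaker One</b>
 (Place A) (Party X):</b> First question text.</p>
<p><a class="anchor" name="st_o1"> </a><b><b>Speaker Two</b>
 (Place B) (Party Y):</b> Reply text.</p>
<p>No anchor here, <b>so this is skipped</b>.</p>
</div>'''
root = html.fromstring(sample)

# Stop at /b so that elements, not text nodes, are returned.
bold_nodes = root.xpath(
    '//div[@id="content-small"]'
    '/p[a[starts-with(@name, "st_o") or starts-with(@name, "qn_")]]/b'
)

# Evaluate normalize-space() relative to each <b>: it takes the element's
# full string value and collapses runs of whitespace.
speakers = [b.xpath('normalize-space()') for b in bold_nodes]
print(speakers)
```

Evaluating `normalize-space()` per element keeps one string per speaker, whereas a single `//text()` query would return a flat list of fragments that then has to be regrouped.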

Source: a similar question on Stack Overflow, "xml - How to get the text of a node and its child nodes with XPath": https://stackoverflow.com/questions/35536808/
