python - How to retrieve only visible node text with lxml

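How do you use the Python package lxml to retrieve the visible text in a node, excluding any child nodes or hidden elements?

All I can find in the docs is node.text_content(), but all that does is strip out the HTML tags and return all text at all depths, regardless of visibility. I also tried node.text, but that seems to just return None for all nodes.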

Best answer

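You can use XPath to extract the text you need. Because you want only the node's own text, add /text() to your XPath expression: it selects the text nodes that are direct children of the node. Keep in mind that XPath selects by document structure and has no notion of CSS visibility.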
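
To make the difference between node.text, node.text_content() and /text() concrete, here is a minimal sketch. The HTML fragment and the variable names are made up for illustration and are not taken from the question; only standard lxml calls are used.

```python
from lxml import html

# Hypothetical fragment, invented for this illustration
box = html.fromstring(
    '<div id="box">Hello <b>bold</b> world'
    '<span style="display:none">secret</span></div>'
)

# .text holds only the text that appears before the first child element
leading_text = box.text

# text_content() concatenates the text of the node and all of its descendants
all_text = box.text_content()

# ./text() returns every text node that is a direct child of the node
direct_text = box.xpath('./text()')

print(repr(leading_text))
print(repr(all_text))
print(direct_text)
```

This also suggests why node.text can come back as None: it covers only the text before the first child element, so a node that starts with a child tag has no leading text of its own.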

For example, to extract the paragraphs of your question:

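import requests
from lxml import etree

# Download the page and parse it into an element tree
r = requests.get("https://stackoverflow.com/questions/32029681/how-to-retrieve-only-visible-node-text-with-lxml")
html = etree.HTML(r.text)
# XPath copied from the browser; it depends on the page layout at the time
list_of_text = html.xpath('//*[@id="question"]/div[2]/div[2]/div[1]/p//text()')
''.join(list_of_text)  # in an interactive session this displays the joined string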

You will get:

'How do you use the Python package lxml to retrieve visible text in a node, excluding any child nodes or hidden elements?All I can find in the docs is node.text_content(), but all that does is strip out html tags, returning all text at all depths regardless of visibility. I also tried node.text, but that seems to just return None for all nodes.'

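Note that //text() differs slightly from /text(): /text() selects only the text nodes that are direct children of the node, while //text() selects text nodes at any depth beneath it.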
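
The following sketch compares the two forms and shows one possible way to skip hidden subtrees. It rests on an assumption: "hidden" here means only that an element or one of its ancestors carries the hidden attribute or an inline display:none style. lxml does not apply stylesheets, so text hidden through CSS classes or external rules would not be detected. The fragment and names are hypothetical.

```python
from lxml import html

# Hypothetical fragment, invented for this illustration
post = html.fromstring(
    '<div class="post">'
    'Intro '
    '<p>Visible <em>nested</em> text.</p>'
    '<p style="display: none">Hidden by inline style.</p>'
    '<p hidden>Hidden by attribute.</p>'
    '</div>'
)

# Text nodes that are direct children of the <div>
direct = post.xpath('./text()')

# Text nodes at any depth below the <div>
everything = post.xpath('.//text()')

# Assumption: a text node counts as hidden when any ancestor element has the
# hidden attribute or an inline style containing display:none (spaces removed,
# match is case-sensitive)
visible = post.xpath(
    './/text()[not(ancestor::*[@hidden or '
    'contains(translate(@style, " ", ""), "display:none")])]'
)

print(direct)
print(everything)
print(''.join(visible))
```

Filtering on ancestor::* rather than on the parent alone means text nested several levels inside a hidden element is dropped as well.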

This question, "python - How to retrieve only visible node text with lxml", comes from Stack Overflow: https://stackoverflow.com/questions/32029681/
