python - Unable to create a suitable selector to parse a certain string

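I created a selector to scrape a specific string from some HTML elements. The element contains two strings. With the selector in the script below I can parse both of them, whereas I want only the latter one, in this case I wanna be scraped alone. How can I write a selector that keeps the first string from being parsed?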

Here is the HTML element:

html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

What I tried:

from lxml.html import fromstring

root = fromstring(html_elem)
for item in root.cssselect(".expected-content"):
    print(item.text_content())

The output I get:

 I shouldn't be parsed
 I wanna be scraped alone

Expected output:

I wanna be scraped alone

By the way, I also tried root.cssselect(".expected-content:not(.undesirable-content)"), but that is definitely not the right approach. Any help would be appreciated.

Best answer

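For the specific example in this question, the simplest answer is to select the inner span and read its tail:

for item in root.cssselect(".expected-content .undesirable-content"):
    print(item.tail)

This works because element.tail holds the text that follows an element's closing tag, so the tail of the span is the text that comes after it inside the link. (The tail of the a element itself would be whatever follows </a>, which is not the text we want.) However, this does not work if the desired text sits before or between the child nodes. A more robust solution therefore has to deal with the children explicitly.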
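To make the text/tail distinction concrete, here is a minimal sketch. The markup in `snippet` is a made-up example (it is not from the question), chosen so that text appears before, between, and after the child elements:

```python
from lxml.html import fromstring

# Hypothetical markup: text sits before, between, and after the children.
snippet = "<p>before<b>bold</b>between<i>italic</i>after</p>"
p = fromstring(snippet)
b, i = p  # unpack the two child elements

print(repr(p.text))  # text before the first child
print(repr(b.tail))  # text that follows </b>
print(repr(i.tail))  # text that follows </i>
print(repr(p.tail))  # text that follows </p>; nothing in this snippet
```

Reading only one tail picks up only one of these pieces, which is why the tail-based approach is tied to the exact layout of the markup.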

According to the documentation, item.text_content():

Returns the text content of the element, including the text content of its children, with no markup.

So if you do not want the children's text, remove the children first:

from lxml.html import fromstring

html_elem = """
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

root = fromstring(html_elem)
for item in root.cssselect(".expected-content"):
    # Iterate over a copy, because drop_tree() removes children from item as we go.
    for child in list(item):
        # drop_tree() removes the child and its text but keeps its tail text.
        child.drop_tree()
    print(item.text_content())

Note that this example also returns some surrounding whitespace, which is easy to clean up, for example with str.strip().
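If modifying the tree is undesirable, one possible alternative is to collect the element's own text pieces directly and normalize the whitespace at the same time. The sketch below is only an illustration: `own_text` is a hypothetical helper name, not part of lxml, and it uses `find_class` instead of `cssselect` so that it needs no extra package.

```python
from lxml.html import fromstring


def own_text(element):
    """Return the text that belongs directly to `element`, skipping its children.

    `own_text` is a hypothetical helper for illustration, not an lxml function.
    """
    # element.text is the text before the first child; each child's tail is
    # the text between that child and the next one (or the closing tag).
    pieces = [element.text or ""]
    pieces.extend(child.tail or "" for child in element)
    # split() followed by join() collapses whitespace runs and trims both ends.
    return " ".join("".join(pieces).split())


html_elem = """
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

root = fromstring(html_elem)
for item in root.find_class("expected-content"):
    print(own_text(item))
```

Because it reads `text` and every child's `tail`, this version also covers text placed before or between the children, and the tree is left unchanged for any later processing.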

Regarding "python - Unable to create a suitable selector to parse a certain string", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46889274/
