python - When parsing HTML, why do I sometimes need item.text and other times item.text_content()

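I'm still learning lxml. I've found that sometimes item.text does not give me an element's text from the tree, while item.text_content() does. I'm not sure I understand why. Any hints would be appreciated.

I'm not sure how to provide an example without having you work through the file itself.

Here is some code I wrote to try to figure out why I wasn't getting the text I expected: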

from lxml import html

# notmatched is the asker's list of file paths (defined elsewhere)
with open(notmatched[0]) as f:
    theTree = html.fromstring(f.read())

text = []
text_content = []
notText = []
hasText = []
for each in theTree.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)   # elements whose .text is non-empty
    else:
        notText.append(each)   # elements whose .text is None or ''
    text_content.append(each.text_content())  # text of the element and all its descendants

After running this, I look at the results:

>>> len(notText)
3612
>>> notText[40]
<Element b at 26ab650>
>>> notText[40].text_content()
'(I.R.S. Employer'
>>> notText[40].text

Accepted answer

According to the docs, the text_content method:

Returns the text content of the element, including the text content of its children, with no markup.

For example,

import lxml.html as lh
data = """<a><b><c>blah</c></b></a>"""
doc = lh.fromstring(data)
print(doc)
# <Element a at b76eb83c>

doc is the Element a. There is no text immediately after the opening a tag (that is, between <a> and <b>), so doc.text is None:

print(doc.text)
# None

But there is text after the opening c tag, and c is a descendant of a, so doc.text_content() is not None:

print(doc.text_content())
# blah
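
To see where the text actually lives, it can help to walk the same tree and compare both values for every element. This is only a small sketch reusing the example markup above; the rows variable is a name introduced here for illustration.

```python
import lxml.html as lh

doc = lh.fromstring("<a><b><c>blah</c></b></a>")

# For each element, record its tag, its own leading text (.text),
# and the combined text of the element and all of its descendants.
rows = [(el.tag, el.text, el.text_content()) for el in doc.iter()]

for tag, own_text, full_text in rows:
    print(tag, repr(own_text), repr(full_text))
# a None 'blah'
# b None 'blah'
# c 'blah' 'blah'
```

Only c holds the string directly. a and b report None for .text because another element starts before any text does, which appears to be the same situation as the b element in the question.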

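P.S. There is a clear description of the meaning of the text attribute here. Although it is part of the documentation for lxml.etree.Element, I believe the meaning of the text and tail attributes applies equally to lxml.html.Element objects.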
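
As a rough illustration of how text and tail split up mixed content, here is a minimal sketch. The HTML fragment is made up for this example and is not taken from the asker's file.

```python
import lxml.html as lh

# Hypothetical fragment with text before, inside, and after nested tags.
p = lh.fromstring("<p>Hello <b><i>big</i> wide</b> world</p>")
b = p.find("b")
i = b.find("i")

print(repr(p.text))            # 'Hello '  (text before the first child of p)
print(repr(b.text))            # None      (b starts directly with <i>)
print(repr(i.text))            # 'big'
print(repr(i.tail))            # ' wide'   (text after </i>, still inside b)
print(repr(b.tail))            # ' world'  (text after </b>, still inside p)
print(repr(b.text_content()))  # 'big wide'
print(repr(p.text_content()))  # 'Hello big wide world'
```

Note that text_content() gathers the text and tail strings of everything inside the element, but not the element's own tail, so b.text_content() leaves out ' world'.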

This question, "python - When parsing HTML, why do I sometimes need item.text and other times item.text_content()", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/3517461/
