python - Removing an img tag in lxml

Tags: python html html-parsing lxml lxml.html

I have this code:

from lxml.html import fromstring, tostring

html = "<p><img src='some_pic.jpg' />Here is some text</p>"

doc = fromstring(html)
img = doc.find('.//img')
doc.remove(img)

print(tostring(doc, encoding='unicode'))

The output is: <p></p>

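Why does removing the img tag also remove the text that follows it? In other words, why is the printed result not <p>Here is some text</p>? How can I remove only the tag and keep the text? Note that I get the same result even when I give the img an explicit closing tag, i.e.: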

html = "<p><img src='some_pic.jpg'></img>Here is some text</p>"

Best answer

The text "Here is some text" is the tail of the img element. In lxml the tail is stored on the element itself, so it is removed together with the element.
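To see where lxml stores each piece of text, it can help to inspect the `text` and `tail` attributes directly. The following is a minimal sketch that reuses the markup from the question; the variable names are only illustrative.

```python
from lxml.html import fromstring

html = "<p><img src='some_pic.jpg' />Here is some text</p>"

# For a fragment with a single top-level element, fromstring() returns that element.
doc = fromstring(html)
img = doc.find('.//img')

print(doc.tag)         # the root of the parsed fragment is the <p> element
print(repr(doc.text))  # text inside <p> before its first child (there is none here)
print(repr(img.text))  # text inside <img> itself (there is none here)
print(repr(img.tail))  # text that follows <img>, up to the next tag
```

The string after the image belongs to `img.tail`, not to `doc.text`, which is why it disappears when the img element is removed.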

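To keep the tail, assign it to the text of the img's parent. This works here because the img is the first child of <p> and <p> has no text of its own: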

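from lxml.html import fromstring, tostring

html = "<p><img src='some_pic.jpg' />Here is some text</p>"

doc = fromstring(html)
img = doc.find('.//img')
parent = img.getparent()
# Move the tail onto the parent before the element (and its tail) is removed.
parent.text = img.tail
# Remove via the actual parent, so this also works when img is not a direct child of doc.
parent.remove(img)

print(tostring(doc, encoding='unicode'))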

This prints:

<p>Here is some text</p>

Regarding "python - Removing an img tag in lxml", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24666712/
