python - Remove all HTML tags along with their content from text

Tags: python html html-parsing beautifulsoup

How can I use BeautifulSoup to remove all HTML tags together with their content?

Input:

... text <strong>ha</strong> ... text

Output:

... text ... text

Best answer

Use replace_with() (or its older alias, replaceWith()):

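from bs4 import BeautifulSoup

text = "text <strong>ha</strong> ... text"

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(text, "html.parser")

# Replace each <strong> element, including its contents, with an empty string.
for tag in soup.find_all('strong'):
    tag.replace_with('')

print(soup.get_text())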

Prints:

text  ... text

Alternatively, as @mata suggested, you can call tag.decompose() instead of replacing the tag with an empty string. The result is the same, but decompose() is the more fitting call.
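The answer above targets only 'strong' tags, while the question asks about every tag. A minimal sketch of a more general version follows, assuming the input is an HTML fragment parsed with the built-in "html.parser". The helper name strip_tags_with_content and its collapse_whitespace flag are hypothetical, introduced here for illustration only.

```python
from bs4 import BeautifulSoup


def strip_tags_with_content(html, collapse_whitespace=False):
    """Return the text of `html` with every element and its contents removed.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # recursive=False limits the search to top-level elements; decomposing
    # each one also discards everything nested inside it.
    for tag in soup.find_all(True, recursive=False):
        tag.decompose()
    text = soup.get_text()
    if collapse_whitespace:
        # Optional: squeeze the gap left behind by a removed element.
        text = " ".join(text.split())
    return text


print(strip_tags_with_content("text <strong>ha</strong> ... text"))
print(strip_tags_with_content("a <div>b <em>c</em></div> d", collapse_whitespace=True))
```

Only top-level elements are decomposed here, because removing a parent already removes its descendants. With collapse_whitespace=True, the double space left where an element used to be is reduced to a single space.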

For more on this topic, see the similar question on Stack Overflow: https://stackoverflow.com/questions/18453176/
