python - How can I remove HTML tags?

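I'm taking the first paragraph from a page and trying to extract words that would work as tags or keywords. Some paragraphs contain links, and I want to remove the tags:

For example, if the text is

A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ...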

I want to remove

<b></b><a href="/wiki/Byte" title="Byte"></a>

and end up with this

A hex triplet is a six-digit, three-byte ...

A regular expression like this does not work:

>>> import re
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>
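
The pattern above fails for two reasons: `.+` is greedy, so a match runs from the first `<` to the last `>` on a line, and `.` does not match a newline, so the `<a ...>` tag that spans two lines is never matched. Below is a minimal sketch of a pattern that avoids both problems on this particular input. It assumes simple markup (no `>` inside attribute values, no comments or scripts) and is not a general HTML parser:

```python
import re

text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ..."""

# [^>]+ matches any character except '>', newlines included,
# so each match stops at the first '>' and may span lines.
tags = re.findall(r'<[^>]+>', text)
stripped = re.sub(r'<[^>]+>', '', text)

print(tags)
print(stripped)
```

On anything less regular than this fragment, a real parser is the safer choice.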

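What is the best way to do this?

I found several similar questions, but I don't think any of them solves this particular problem.

Update with a BeautifulSoup extract example (extract removes the tag together with the text it contains, and has to be run separately for each tag):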

>>> from bs4 import BeautifulSoup
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> soup = BeautifulSoup(text, "html.parser")
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A  is a six-digit, three-<a href="/wiki/Byte" title="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" title="Byte">byte</a>]
>>> soup
A  is a six-digit, three- ...
>>>

Update

For anyone with the same problem: as Brendan Long mentioned, this answer, which uses HTMLParser, works best.
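
A minimal sketch of the HTMLParser approach, assuming Python 3's standard-library `html.parser` module. The names `TagStripper` and `strip_tags` are hypothetical and not taken from the linked answer:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ..."""
print(strip_tags(text))
```

Because the parser only reports text through `handle_data`, tags and their attributes are dropped while the text inside them is kept, which is the behavior asked for in the question.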

Best answer

Beautiful Soup is the answer to your problem! Give it a try, it's excellent.

HTML parsing becomes very easy once you use it.

>>> from bs4 import BeautifulSoup
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... title="Byte">byte</a> ..."""
>>> soup = BeautifulSoup(text, "html.parser")
>>> ''.join(soup.find_all(string=True))
'A hex triplet is a six-digit, three-byte ...'

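If all the text you want to extract is enclosed in some outer tag, such as <body> ... </body> or a <div id="X"> ... </div>, you can extract text selectively from just that tag. The illustration that follows assumes all the text you want is inside the <body> tag.

(Have a look at the documentation and examples; you will find many ways to navigate the DOM.)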

>>> from bs4 import BeautifulSoup
>>> text = """<body>A <b>hex triplet</b> is a six-digit,
... three-<a href="/wiki/Byte"
... title="Byte">byte</a>
... </body>"""
>>> soup = BeautifulSoup(text, "html.parser")
>>> ''.join(soup.body.find_all(string=True))
'A hex triplet is a six-digit,\nthree-byte\n'
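
The same idea can be narrowed to a single container such as <div id="X">. This is a minimal sketch, assuming the `bs4` package (Beautiful Soup 4) is installed and using Python's built-in `html.parser`. The sample markup, the `nav` block, and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: only the text inside <div id="X"> is wanted.
html = (
    '<div id="nav">Home | About</div>'
    '<div id="X">A <b>hex triplet</b> is a six-digit, '
    'three-<a href="/wiki/Byte" title="Byte">byte</a> ...</div>'
)

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
container = soup.find("div", id="X")
extracted = "".join(container.find_all(string=True)) if container else ""

print(extracted)
```

Text from the `nav` block is left out because the search for strings starts at the selected container rather than at the document root.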

For more on "python - How can I remove HTML tags?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/7775800/
