python - How to use a regex to remove strings inside certain HTML tags, but only when the string contains a space

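I am trying to clean up some HTML data with a regex in Python. Given an input string with HTML tags, I want to remove a tag together with its content if that content contains a space. The requirement looks like this: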

inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print(outputString)

>>I want to remove not sole <code>word</code>

The regex re.sub("<code>.+?</code>", " ", inputString) removes every <code> tag, including the ones I want to keep. How can I improve it, or is there another approach?

Thanks in advance.

Best answer

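Using regular expressions on HTML is fraught with problems, so you should be aware of the possible consequences. In particular, your <code>.+?</code> regex only works when the <code> and </code> tags are on the same line and there are no nested <code> tags inside them.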
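Both limitations can be seen in a short sketch. The sample strings below are made up for illustration and are not from the question:

```python
import re

pattern = "<code>.+?</code>"

# Limitation 1: without re.S, "." does not match a newline, so a tag
# whose content spans two lines is not matched at all.
multi_line = "a <code>b\nc</code> d"
unchanged = re.sub(pattern, " ", multi_line)
fixed = re.sub(pattern, " ", multi_line, flags=re.S)

# Limitation 2: with nested tags, the lazy match stops at the first
# closing tag and leaves the outer closing tag behind.
nested = "a <code>b <code>c</code> d</code> e"
leftover = re.sub(pattern, " ", nested)

print(repr(unchanged))  # same as multi_line
print(repr(fixed))      # 'a   d'
print(repr(leftover))   # 'a   d</code> e'
```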

Assuming there are no nested code tags, you can extend your current approach:

import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)

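The re.S flag makes . match newlines as well, and the lambda runs a check on every match: any code tag whose content contains a space is replaced with a single regular space, and any other match is kept unchanged.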
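Note that replacing a tag with a space leaves three consecutive spaces where the tag used to be, while the expected output in the question has only one. A minimal sketch that wraps the substitution and then collapses runs of spaces is shown here; the helper name remove_spaced_code is hypothetical, and the collapsing step is an assumption about the desired result, since it also affects double spaces elsewhere in the string:

```python
import re

# Same pattern as above, compiled once; re.S lets "." match newlines.
CODE_TAG = re.compile(r"<code>(.+?)</code>", flags=re.S)


def remove_spaced_code(html):
    """Drop <code> elements whose content contains a space, then tidy spacing."""
    replaced = CODE_TAG.sub(
        lambda m: " " if " " in m.group(1) else m.group(), html
    )
    # Collapse the runs of spaces left behind by the replacement.
    return re.sub(r" {2,}", " ", replaced)


result = remove_spaced_code(
    "I want to remove <code>tag with space</code> not sole <code>word</code>"
)
print(result)  # I want to remove not sole <code>word</code>
```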

A more common way to handle HTML in Python is to use BeautifulSoup. First parse the HTML, then get all code tags, and replace each code tag whose node contains a space:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove   not sole <code>word</code>
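If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's html.parser module. This is a swapped-in technique, not part of the original answer, and the names CodeFilter and strip_spaced_code are hypothetical. Unlike the regex, it tracks nesting depth, so nested code tags are treated as one outer element. It is only a sketch: character references are decoded on the way through, and comments and doctype declarations are dropped.

```python
from html.parser import HTMLParser


class CodeFilter(HTMLParser):
    """Copy the input through, replacing each outermost <code> element
    whose text contains a space with a single space."""

    def __init__(self):
        super().__init__()
        self.out = []    # pieces of the final output
        self.raw = []    # markup of the <code> element currently open
        self.text = []   # text seen inside that element
        self.depth = 0   # how many <code> tags are currently open

    def _emit(self, piece):
        # Inside a <code> element, buffer the markup until it closes.
        (self.raw if self.depth else self.out).append(piece)

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.depth += 1
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "code" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                keep = " " not in "".join(self.text)
                self.out.append("".join(self.raw) if keep else " ")
                self.raw, self.text = [], []

    def handle_data(self, data):
        self._emit(data)
        if self.depth:
            self.text.append(data)


def strip_spaced_code(html):
    parser = CodeFilter()
    parser.feed(html)
    parser.close()
    # Any unclosed <code> element is flushed unchanged at the end.
    return "".join(parser.out + parser.raw)


print(strip_spaced_code(
    "I want to remove <code>tag with space</code> not sole <code>word</code>"
))
```

On the question's input this produces the same string as the BeautifulSoup version, including the three spaces where the removed tag used to be.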

Regarding "python - How to use a regex to remove strings inside certain HTML tags, but only when the string contains a space", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41440271/
