python - Selecting <p> tags separated by \n

I am using Python to parse and clean an HTML document, but the document is badly formatted. For example:

<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>

I want to convert <p>\n<p> to <p>, but I can't seem to target the \n or an arbitrary amount of whitespace between the <p> tags.

So far I have tried:

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))

However, this fails.

Best answer

Use the following approach:

import re

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
# First alternative collapses doubled opening tags, second collapses doubled closing tags
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)

print(html)

Output:

<p>Python initially inherited its parsing from C.  While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>

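The replacement r"<\1p>" inserts the closing-tag slash / from the first capture group <(\/)p> when that alternative matches. When the opening-tag alternative matches instead, the group is empty and the result is a plain <p> (on Python 3.5 and later, an unmatched group is replaced with an empty string).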
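
To see how the single replacement string serves both alternatives, here is a minimal sketch that wraps the same pattern in a helper and runs it on each case separately. The function name collapse_nested_p and the short test strings are made up for illustration and are not part of the original answer. It assumes Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string.

```python
import re

# Same pattern as in the answer: group 1 captures "/" only in the
# closing-tag alternative.
PATTERN = re.compile(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*')


def collapse_nested_p(html):
    """Collapse doubled <p> or </p> tags that are separated by whitespace."""
    # Hypothetical helper: \1 is "/" for closing tags and empty for opening tags.
    return PATTERN.sub(r"<\1p>", html)


opening = collapse_nested_p("<p>\n<p>\n    text")
closing = collapse_nested_p("text</p>\n</p>")
print(repr(opening))  # '<p>text'
print(repr(closing))  # 'text</p>'
```

Since \s already matches newlines, [\n\s] behaves the same as \s here; the pattern is kept as written in the answer so that it lines up with the explanation above.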

Regarding python - Selecting <p> tags separated by \n, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42179469/
