python - 选择用\n 分隔的 <p> 标记

标签 python regex

我正在使用Python来解析/清理一个html文档,但它的格式很糟糕。例如

<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>

我想转换 <p>\n<p><p>但我似乎无法瞄准 \n<p> 之间任意数量的空格标签。

到目前为止我已经尝试过

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))

但是,这失败了。

最佳答案

使用以下方法:

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)

print(html)

输出:

<p>Python initially inherited its parsing from C.  While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>

替代品r"<\1p>"暗示结束标签符号 /来自第一个捕获组 <(\/)p>如果匹配的话

关于python - 选择用\n 分隔的 <p> 标记,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42179469/

相关文章:

python - Windows - 执行不带 .exe 扩展名的可执行文件 (.exe)

Python - 如果列表中的任何元素在行中

java - 提取文件名的正则表达式?

java - 我可以进一步改善此正则表达式的性能吗

java - 请解释此正则表达式的输出(以正面前瞻开头)

Python 正则表达式或 |贪心

python - GTK3:听主题变化

python - 将 BoxLayout 与 GridLayout 内的按钮居中

python - 如何找到脚本的目录?

ruby - 如何在 ruby​​ 中替换字符串中的\r