I'm using Python to parse and clean up an HTML document, but the document is badly formatted. For example:
<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>
I want to convert <p>\n<p> to <p>, but I can't seem to target the \n or any amount of whitespace between the <p> tags.
So far I have tried:
html = "<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))
However, this fails.
Best answer
Use the following approach:
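One likely reason for the failure is the misplaced closing parenthesis: all three arguments end up inside re.compile(), which accepts only a pattern and optional flags, so re.sub() is never called with a replacement or a string. The sketch below reproduces the error and shows the call with the parenthesis moved; the names error and fixed are introduced here purely for illustration.

```python
import re

html = "<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"

error = None
try:
    # As written in the question: re.compile() receives three positional
    # arguments, but it only takes a pattern and optional flags.
    re.sub(re.compile("<p>\\n+<p>", "<p>", html))
except TypeError as exc:
    error = exc

# Closing re.compile() right after the pattern lets re.sub() receive
# the pattern, the replacement, and the input string.
fixed = re.sub(re.compile("<p>\\n+<p>"), "<p>", html)
```

Even with the parenthesis fixed, this pattern only merges the two opening tags: the whitespace after the second <p> and the doubled closing tags are left in place.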
import re
html = "<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)
print(html)
Output:
<p>Python initially inherited its parsing from C. While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>
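The replacement r"<\1p>" inserts the closing-tag slash / taken from the first capture group in <(\/)p>, if that group matched. When the opening-tag alternative matches instead, the group is empty and the replacement is simply <p>.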
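To make the role of the capture group concrete, here is a minimal sketch that inspects group 1 for each alternative of the pattern above. The helper show_group and the COMPACT variant are illustrative names, not part of the original answer. Note that relying on an unmatched group in the replacement assumes Python 3.5 or later, where such groups are replaced with an empty string.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*')

def show_group(text):
    """Return (matched text, group 1) for the first match, or None."""
    m = PATTERN.search(text)
    return (m.group(0), m.group(1)) if m else None

# First alternative (opening tags): group 1 does not participate in the match.
opening = show_group("<p>\n<p>\n text")
# Second alternative (closing tags): group 1 captures the slash.
closing = show_group("text</p>\n</p>")

# Hypothetical variant: capture an optional slash and reuse it with a
# back-reference, so group 1 always participates (it may be empty).
# \s already covers \n, so [\n\s] is not needed.
COMPACT = re.compile(r'<(/?)p>\s+<\1p>\s*')

def collapse(html):
    """Collapse doubled <p> or </p> tags separated by whitespace."""
    return COMPACT.sub(r'<\1p>', html)
```

For well-formed nesting, collapse() should give the same result as the answer's pattern. For anything beyond simple cleanup like this, an HTML parser is usually more robust than regular expressions.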
Regarding "python - selecting <p> tags separated by \n", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42179469/