python - How to handle <br> and </br> in Beautiful Soup 4?

Tags: python html parsing web-scraping beautifulsoup

I'm trying to replace every break tag in some HTML with a newline, using Python and Beautiful Soup 4.

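The document contains <br>, <br/> and </br> tags, but because of the way Beautiful Soup handles them, whenever it finds a <br> it deletes everything between that tag and the next </br> it sees.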

Is there a way around this?

Best answer

Try using HTMLParserTreeBuilder as the builder class:

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

html_doc = """
<html>this is a test<br> ...between a start and end br... </br> a blank br: <br/> something else
"""

# Force the html.parser-based builder instead of letting bs4 pick one
soup = BeautifulSoup(html_doc, builder=HTMLParserTreeBuilder())
print(soup.prettify())

Compare this with the output you get when the builder= argument is omitted.
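
Once the document is parsed this way, the replacement the question asks for could look roughly like the sketch below. It assumes a reasonably recent bs4 release, uses the "html.parser" feature string (which selects the same standard-library builder as HTMLParserTreeBuilder), and swaps each <br> element for a newline with replace_with. The variable name text is only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>this is a test<br> ...between a start and end br... </br> a blank br: <br/> something else
"""

# "html.parser" picks the stdlib-based builder, so the text that
# follows <br> is kept in the tree instead of being discarded.
soup = BeautifulSoup(html_doc, "html.parser")

# Replace every <br> element with a newline character.
for br in soup.find_all("br"):
    br.replace_with("\n")

# Collect the remaining text; the newlines now mark the old break tags.
text = soup.get_text()
print(text)
```

How a stray </br> is treated differs between parsers and bs4 versions, so it is worth checking the result against your own input before relying on it.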

You can check which builder bs4 is using: