python - Scraping with Beautiful Soup while preserving entities

Tags: python web-scraping beautifulsoup html-parsing html-entities

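I want to scrape a table from the web and keep the &nbsp; entities intact so that I can republish it as HTML later. BeautifulSoup seems to be converting them to spaces. Example: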

from bs4 import BeautifulSoup

html = "<html><body><table><tr>"
html += "<td>&nbsp;hello&nbsp;</td>"
html += "</tr></table></body></html>"

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table')[0]
row = table.find_all('tr')[0]
cell = row.find_all('td')[0]

print(cell)

Observed result:

<td> hello </td>

Desired result:

<td>&nbsp;hello&nbsp;</td>

Best answer

In bs4, the convertEntities parameter of the BeautifulSoup constructor is no longer supported. HTML entities are always converted to the corresponding Unicode characters (see the docs).
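To see what that conversion means for the example in the question, here is a minimal sketch (assuming Python 3 and the built-in "html.parser"; the variable names are only illustrative). The characters that print like spaces are actually U+00A0 no-break spaces, so the information needed to restore &nbsp; on output has not been lost.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>&nbsp;hello&nbsp;</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# The parsed text holds the Unicode character for &nbsp;, not an ASCII space
text = soup.find("td").get_text()
print(repr(text))  # '\xa0hello\xa0'
```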

According to the docs, you need to use an output formatter, like this:

print(soup.find_all('td')[0].prettify(formatter="html"))
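Note that prettify() also re-indents the markup, so the cell's text ends up on its own line. If the exact single-line result from the question is what you are after, one option is Tag.decode() with the same formatter. This is only a sketch, assuming Python 3 and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr><td>&nbsp;hello&nbsp;</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# formatter="html" turns Unicode characters back into named HTML entities
# where one exists, without the re-indentation that prettify() applies
preserved = cell.decode(formatter="html")
print(preserved)  # <td>&nbsp;hello&nbsp;</td>
```

The same formatter argument is accepted by encode() if you need bytes rather than a string.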

Regarding "python - Scraping with Beautiful Soup while preserving entities", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16135951/
