python - 如何处理utf-8编码的String和BeautifulSoup？

标签 python beautifulsoup

如何用正确的 unicode 替换 unicode 字符串中的 HTML 实体？

u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln'

至

u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln'

编辑
事实上，实体是错误的。看起来 BeautifulSoup 已经满足了它。

所以问题是:如何处理utf-8编码的String和BeautifulSoup？

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
            l += [r.string] # here things seem to go wrong
    allArticles+=[l]

Ü -> Ÿ 而不是 Ü 但实际上我不想更改编码。

>>> soup.originalEncoding
'utf-8'

但我无法生成正确的 unicode 字符串

最佳答案

我想你需要的是ICU transliterators 。我认为有一种方法可以将 HTML 实体音译为 Unicode。

尝试使用您想要的音译器ID Hex/XML-Any。在演示页面上，您可以选择“插入示例:化合物”，然后在“化合物1”框中输入Hex/XML-Any，在框中添加一些输入数据，然后按“转换”。是this有帮助吗？

有一个 Python ICU 绑定(bind)，但我认为它没有得到很好的处理。

关于python - 如何处理utf-8编码的String和BeautifulSoup？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/4054551/

上一篇：python - 使用 Django 缩放 disqus，水平和垂直分区辅助方法，请解释

下一篇：c# - Python XML Parse(使用模式生成数据集)

相关文章：

python - 将包含数据的行保留在Python的列列表中

python - setup.py bdist_egg 没有将文件放入 Egg 中

python - Openshift绑定(bind)TCP端口

python - 从 bs4.element.Tag 获取项目

python - 在Python中将SRC属性与汤返回隔离

python - 如何解析表格中的行，这些行不仅由 <td> 单元格组成，而且偶尔还由 <th> 单元格组成？

python - 尝试使用 BeautifulSoup 在 HTML 文档中查找特定表格

python - 替换列表中的字符串值

python - 在 pip 包构建上编译 Cython

python - 在一个类中抓取一个类

©2024 IT工具网联系我们