I'm using Python 3.3 on Windows 7.
if "iso-8859-1" in str(source):
    source = source.decode('iso-8859-1')
if "utf-8" in str(source):
    source = source.decode('utf-8')
So at the moment my application only works for those two character sets... but I want to cover every possible one.
Actually, I found those character sets by hand in the websites' source, and in my experience the websites out there use far more than just these two. Sometimes a website doesn't declare its character set in its HTML source at all, and then my application can't proceed!
What should I do to detect the character set automatically and decode accordingly? If possible, please give me some insight with examples. Suggestions for useful links are also welcome.
Best answer
BeautifulSoup provides a class called UnicodeDammit.
It performs a number of steps¹ to determine the encoding of any string you give it and converts it to unicode. It's quite simple to use:
from bs4 import UnicodeDammit
unicode_string = UnicodeDammit(encoded_string).unicode_markup
If you use BeautifulSoup to process your HTML, it will automatically use UnicodeDammit to convert it to unicode for you.
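To illustrate, here is a minimal sketch; the byte string is a hypothetical stand-in for HTML fetched from a website (the 0xe9 byte for é is valid Latin-1 but not valid UTF-8, so the declared/guessed encoding matters):

```python
from bs4 import BeautifulSoup

# Hypothetical raw bytes, e.g. as returned by urllib; encoded in ISO-8859-1.
raw = "<html><body><p>café</p></body></html>".encode("iso-8859-1")

# UnicodeDammit runs under the hood; no manual charset handling needed.
soup = BeautifulSoup(raw, "html.parser")

print(soup.p.string)           # already decoded to unicode: café
print(soup.original_encoding)  # the encoding UnicodeDammit settled on
```

`soup.original_encoding` tells you which encoding was ultimately used, which is handy for debugging pages that decode incorrectly.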
¹ According to the documentation for BeautifulSoup 3, these are the steps UnicodeDammit takes:
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
- An encoding you pass in as the fromEncoding argument to the soup constructor.
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
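The priority list above can be sketched in plain Python. This is a simplified illustration of the strategy, not BeautifulSoup's actual implementation (the function name `guess_decode` is mine, and the META-tag sniffing here is a crude regex rather than a real parser):

```python
import re

def guess_decode(data: bytes) -> str:
    """Decode HTML bytes using a priority list similar to UnicodeDammit's."""
    # 1. BOM sniffing: the first few bytes identify UTF-* unambiguously.
    for bom, enc in ((b'\xef\xbb\xbf', 'utf-8'),
                     (b'\xff\xfe', 'utf-16-le'),
                     (b'\xfe\xff', 'utf-16-be')):
        if data.startswith(bom):
            return data[len(bom):].decode(enc)
    # 2. An encoding declared in the document itself (e.g. a META tag).
    m = re.search(rb'charset=["\']?([\w-]+)', data[:1024])
    if m:
        try:
            return data.decode(m.group(1).decode('ascii'))
        except (LookupError, UnicodeDecodeError):
            pass  # bogus declaration; fall through to the next step
    # 3. The chardet library, if it happens to be installed.
    try:
        import chardet
        enc = chardet.detect(data)['encoding']
        if enc:
            return data.decode(enc)
    except ImportError:
        pass
    # 4./5. Last resorts: try UTF-8, then fall back to Windows-1252
    # (errors='replace' so this final step can never raise).
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('windows-1252', errors='replace')
```

For example, `guess_decode(b'<meta charset="iso-8859-1">caf\xe9')` honors the declared encoding and decodes the trailing byte as é.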
This explanation doesn't seem to appear in the BeautifulSoup 4 documentation, but presumably BS4's UnicodeDammit works in much the same way (though I haven't checked the source to make sure).
A similar Stack Overflow question on charset support in Python 3: https://stackoverflow.com/questions/14910121/