I'm using Python 3.3 on Windows 7.
if "iso-8859-1" in str(source):
    source = source.decode('iso-8859-1')
if "utf-8" in str(source):
    source = source.decode('utf-8')
So at the moment my application only works for those two character sets... but I want to cover every possible one.
Actually, I found those character sets by hand in the websites' source, and in my experience the websites out there use far more than just these two. Sometimes a website doesn't declare its character set in its HTML source at all, and then my application can't proceed!
What should I do to detect the character set automatically and decode accordingly? If possible, please give me some insight with examples. Suggestions for useful links are also welcome.
Best answer
BeautifulSoup provides a class called UnicodeDammit.
It performs a number of steps¹ to determine the encoding of any string you give it and converts it to unicode. It's quite simple to use:
from bs4 import UnicodeDammit
unicode_string = UnicodeDammit(encoded_string).unicode_markup
If you use BeautifulSoup to process your HTML, it will automatically use UnicodeDammit to convert it to unicode for you.
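To illustrate, here is a minimal sketch; the byte string is a hypothetical stand-in for HTML fetched from a website (the 0xe9 byte for é is valid Latin-1 but not valid UTF-8, so the declared/guessed encoding matters):

```python
from bs4 import BeautifulSoup

# Hypothetical raw bytes, e.g. as returned by urllib; encoded in ISO-8859-1.
raw = "<html><body><p>café</p></body></html>".encode("iso-8859-1")

# UnicodeDammit runs under the hood; no manual charset handling needed.
soup = BeautifulSoup(raw, "html.parser")

print(soup.p.string)           # already decoded to unicode: café
print(soup.original_encoding)  # the encoding UnicodeDammit settled on
```

`soup.original_encoding` tells you which encoding was ultimately used, which is handy for debugging pages that decode incorrectly.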
¹ According to the documentation for BeautifulSoup 3, these are the steps UnicodeDammit takes:
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
- An encoding you pass in as the fromEncoding argument to the soup constructor.
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
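The priority list above can be sketched in plain Python. This is a simplified illustration of the strategy, not BeautifulSoup's actual implementation (the function name `guess_decode` is mine, and the META-tag sniffing here is a crude regex rather than a real parser):

```python
import re

def guess_decode(data: bytes) -> str:
    """Decode HTML bytes using a priority list similar to UnicodeDammit's."""
    # 1. BOM sniffing: the first few bytes identify UTF-* unambiguously.
    for bom, enc in ((b'\xef\xbb\xbf', 'utf-8'),
                     (b'\xff\xfe', 'utf-16-le'),
                     (b'\xfe\xff', 'utf-16-be')):
        if data.startswith(bom):
            return data[len(bom):].decode(enc)
    # 2. An encoding declared in the document itself (e.g. a META tag).
    m = re.search(rb'charset=["\']?([\w-]+)', data[:1024])
    if m:
        try:
            return data.decode(m.group(1).decode('ascii'))
        except (LookupError, UnicodeDecodeError):
            pass  # bogus declaration; fall through to the next step
    # 3. The chardet library, if it happens to be installed.
    try:
        import chardet
        enc = chardet.detect(data)['encoding']
        if enc:
            return data.decode(enc)
    except ImportError:
        pass
    # 4./5. Last resorts: try UTF-8, then fall back to Windows-1252
    # (errors='replace' so this final step can never raise).
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('windows-1252', errors='replace')
```

For example, `guess_decode(b'<meta charset="iso-8859-1">caf\xe9')` honors the declared encoding and decodes the trailing byte as é.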
This explanation doesn't seem to appear in the BeautifulSoup 4 documentation, but presumably BS4's UnicodeDammit works in much the same way (though I haven't checked the source to make sure).
A similar Stack Overflow question on charset support in Python 3: https://stackoverflow.com/questions/14910121/