python - How to determine the encoding of text

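I received some encoded text, but I don't know which charset was used. Is there a way to determine the encoding of a text file using Python? The question "How can I detect the encoding/codepage of a text file" covers the same problem for C#.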

Best answer

EDIT: chardet appears to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.
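As a rough sketch of what using that alternative might look like, the helper below asks charset-normalizer for its best guess. The function name `guess_with_charset_normalizer` and the sample text are made up for illustration, and the import is guarded so the snippet still runs (returning `None`) when the third-party package is not installed.

```python
def guess_with_charset_normalizer(raw: bytes):
    """Return the encoding name charset-normalizer considers most likely.

    Returns None if the package is missing or no plausible match is found.
    """
    try:
        # Third-party package: pip install charset-normalizer
        from charset_normalizer import from_bytes
    except ImportError:
        return None
    best = from_bytes(raw).best()  # best() returns None when nothing fits
    return best.encoding if best is not None else None


# Hypothetical sample input; any bytes read from a file would do.
sample = "Je ne sais pas quel encodage a été utilisé ici.".encode("utf-8")
guessed = guess_with_charset_normalizer(sample)
print(guessed)
```

The result is still a guess, so it is worth checking the decoded text before relying on it.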

Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
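A small illustration of why the guess has to lean on what plausible text looks like: the two bytes below decode without error under several encodings, and nothing in the bytes themselves says which result was intended. This is only a sketch; the sample string and the list of candidate encodings are arbitrary choices, not part of the original answer.

```python
# The UTF-8 encoding of "é" is the two bytes 0xC3 0xA9.
raw = "é".encode("utf-8")

# Every one of these codecs accepts those bytes without raising an error.
candidates = ("utf-8", "cp1252", "latin-1")
decodings = {enc: raw.decode(enc) for enc in candidates}

for enc, text in decodings.items():
    print(f"{enc:8} -> {text!r}")
# utf-8 yields 'é', while cp1252 and latin-1 both yield 'Ã©'
```

All three decodings are technically valid, so a detector can only rank them by how likely the resulting text is.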

The chardet library uses that research to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
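A minimal sketch of calling chardet follows. `chardet.detect` takes bytes and returns a dictionary that includes the guessed `encoding` and a `confidence` value; the wrapper name `guess_encoding` and the sample text are assumptions made for this example, and the import is guarded so the snippet degrades to `(None, 0.0)` when chardet is not installed.

```python
def guess_encoding(raw: bytes):
    """Return (encoding, confidence) as guessed by chardet.

    Returns (None, 0.0) when chardet is not installed.
    """
    try:
        import chardet  # third-party package: pip install chardet
    except ImportError:
        return None, 0.0
    result = chardet.detect(raw)
    return result["encoding"], result["confidence"]


# Hypothetical sample; in practice read the bytes with open(path, "rb").
sample = "Ceci est un texte accentué : é à ç ù".encode("utf-8")
encoding, confidence = guess_encoding(sample)
print(encoding, confidence)
```

Longer samples generally give the detector more to work with, and a low confidence value is a hint to verify the result by hand.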

You can also use UnicodeDammit. It tries the following methods, in order:

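An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252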

Source: the Stack Overflow question "python - How to determine the encoding of text": https://stackoverflow.com/questions/436220/
