python - Reading a file with Python 3.6

Hello, I am using Allen Downey's O'Reilly book to learn Python 3.x. Chapter 9 has an example that uses a word list from the Moby Project files.

https://en.wikipedia.org/wiki/Moby_Project

https://web.archive.org/web/20170930060409/http://icon.shef.ac.uk/Moby/

I read the file german.txt with the following lines of Python.

with open("german.txt") as log:
    for line in log:
        word = line.strip()
        if len(word) > 20:
            print(word)

A few words are read, but then the program breaks off and I get the following output.

Amtsueberschreitungen
Traceback (most recent call last):
  File "einlesen.py", line 8, in <module>
    for line in log:
  File "/home/alexander/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 394: invalid start byte

Which character is this? How can I handle the problem in Python code?

Thanks

Best answer

According to the documentation of open():

if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

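So the way the file is read can differ from one system to another. To make sure the file is read correctly, you need to specify the correct encoding.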
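
To see which default your own system would use, a quick check along these lines may help. It is only a sketch: the variable name default_encoding is illustrative, and the printed value differs between platforms, so no expected output is shown.

```python
import locale

# The encoding that open() falls back to when no encoding argument is given,
# as described in the documentation quoted above.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)
```

If this prints a UTF-8 variant, that would be consistent with the 'utf-8' codec named in the traceback.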

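According to the documentation of the Moby Project on Wikipedia, "some non-ASCII accented characters remain, represented using Mac OS Roman encoding". In the documentation of the Python codecs module you can find the proper name of that codec, which is "mac_roman". You can therefore use the following code, which does not raise a decoding error: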

with open("german.txt", 'rt', encoding='mac_roman') as log:
    for line in log:
        word = line.strip()
        if len(word) > 20:
            print(word)

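Update

Despite the documentation, the file does not appear to be encoded in Mac OS Roman. I decoded the file with all possible encodings and compared the results. There are only 9 non-ASCII words in the list and, as pointed out in another answer, the word "André" appears to be correct. Below are the possible encodings (those that decode without failing and produce the word "André"), together with the 9 non-ASCII words as decoded by each of them: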

encodings: cp437, cp860, cp861, cp863, cp865
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p≥ange

encodings: cp720
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, pٌange

encodings: cp775
words: André, Attaché, Chāteau, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhōnetal, p“ange

encodings: cp850, cp858
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p‗ange

encodings: cp852
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p˛ange

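For the encodings above, the first 8 words decode identically, apart from cp775, which yields "Chāteau" and "Rhōnetal". The last word alone comes out in 5 different ways.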
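
The comparison above can be approximated with a short script. This is a minimal sketch and not necessarily how the list was produced: the function name candidate_encodings, the marker argument and the sample bytes are assumptions made for illustration. The codec names come from Python's alias table, which may not cover every codec that ships with the interpreter (codecs without an alias would have to be added by hand).

```python
import encodings.aliases


def candidate_encodings(data, marker, codec_names=None):
    """Return the codec names that decode data without error and yield marker."""
    if codec_names is None:
        # Canonical codec names known to the alias table (possibly incomplete).
        codec_names = sorted(set(encodings.aliases.aliases.values()))
    matches = []
    for name in codec_names:
        try:
            text = data.decode(name)
        except (LookupError, ValueError):
            # LookupError: codec unavailable on this platform or not a text codec.
            # ValueError: covers UnicodeDecodeError for bytes the codec rejects.
            continue
        if marker in text:
            matches.append(name)
    return matches


# Hypothetical sample: "Andr" followed by the byte 0x82 from the traceback.
sample = b"Andr\x82\n"
matches = candidate_encodings(sample, "Andr\u00e9")
print(matches)
```

For the real file, the bytes can be obtained with open("german.txt", "rb").read() and passed in place of sample.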

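Based on this result, I think the cp720 encoding was used. However, I do not recognize the last word in the list, so I cannot be sure. It is up to you to decide which decoding suits you best.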
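
Whichever candidate you settle on, it is passed to open() the same way. The sketch below wraps the loop from the question in a hypothetical helper, long_words, and runs it on a small temporary file that stands in for german.txt. It uses cp437 as one of the candidates listed above; this is an illustrative choice, not a claim about the file's true encoding.

```python
import os
import tempfile


def long_words(path, encoding, min_length=20):
    """Return the words longer than min_length, decoding the file with encoding."""
    with open(path, "rt", encoding=encoding) as log:
        words = (line.strip() for line in log)
        return [word for word in words if len(word) > min_length]


# Stand-in for german.txt: one ASCII word and one word containing byte 0x82.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Amtsueberschreitungen\nAndr\x82\n")
    sample_path = tmp.name

try:
    # min_length is lowered here only so that the short sample word is kept.
    found = long_words(sample_path, "cp437", min_length=4)
finally:
    os.remove(sample_path)

print(found)
```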

Regarding "python - Reading a file with Python 3.6", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56818268/
