Python: File encoding errors

For a few days now I have been fighting an annoying file encoding problem in a small program I wrote in Python.

I work with MediaWiki a lot, and recently I have been converting documents from .doc to Wikisource.

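I open the Microsoft Word document in Libre Office and then export it to a .txt file in Wikisource format. My program searches for [[Image:]] tags and replaces each one with an image name taken from a list. That mechanism works really well (many thanks to brjaga for the help!). When I ran some tests on .txt files I had created myself, everything worked fine, but when I used the .txt file with Wikisource, the whole thing stopped being so funny :D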

Python gave me this message:

Traceback (most recent call last):
  File "C:\Python33\final.py", line 15, in <module>
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
  File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>

Here is my Python code:

li = [
    "[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
    ]


with open ("C:\\124_BPP_PL_PL.txt") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')

for item in li:
     s = s.replace("[[Image:]]", item, 1)

dest.write(s)
dest.close()

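OK, so I did some research and found out that this is an encoding problem. I installed Notepad++, changed the encoding of my .txt file with Wikisource to UTF-8, and saved it. Then I made a small change to my code: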

with open("C:\\124_BPP_PL_PL.txt", encoding="utf8") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

But then I got this new error message:

Traceback (most recent call last):
  File "C:\Python33\final.py", line 22, in <module>
    dest.write(s)
  File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

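I am really stuck on this one. I thought that once I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would be fine.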

Please help. Thank you in advance.

Best answer

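When Python 3 opens a text file without an explicit encoding argument, it uses the system's default encoding to decode the file contents, so that it can give you full Unicode text (the str type is fully Unicode-aware). It does the same when writing such Unicode text back out. On your system that default is cp1250, as both tracebacks show.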
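
If you want to check which default a given machine would use, something along these lines may help. It is only a small sketch, and the helper name default_text_encoding is made up for illustration:

```python
import locale


def default_text_encoding():
    # open() in text mode falls back to this encoding when no
    # encoding= argument is given.
    return locale.getpreferredencoding(False)


print(default_text_encoding())
```

On the Windows setup from the question this would presumably report cp1250, which is the codec named in both tracebacks.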

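You already solved the input side: you specified an encoding when reading. Do the same when writing: specify a codec for the output file that can handle Unicode, including the zero-width no-break space at code point U+FEFF (the byte order mark that sits at position 0 of your text, according to the traceback). UTF-8 is usually a good default choice: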

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
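
Both errors can be reproduced without any file at all, which may make the cause easier to see. The following is a small sketch and not part of the original program; the variable names are made up for illustration:

```python
# Byte 0x81 has no character assigned in cp1250, so decoding it fails
# the same way the first traceback did.
try:
    b"\x81".decode("cp1250")
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False

# U+FEFF has no byte assigned in cp1250 either, so encoding it fails
# the same way the second traceback did.
try:
    "\ufeff".encode("cp1250")
    encode_ok = True
except UnicodeEncodeError:
    encode_ok = False

# UTF-8 can represent U+FEFF, so the same character encodes cleanly.
utf8_bytes = "\ufeff".encode("utf8")

print(decode_ok, encode_ok, utf8_bytes)  # False False b'\xef\xbb\xbf'
```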

You can use the with statement when writing too, which saves you the .close() call:

for item in li:
    s = s.replace("[[Image:]]", item, 1)

with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
    dest.write(s)
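
One further option, which goes slightly beyond the fix above: because the file was saved with a byte order mark, you could read it with Python's utf-8-sig codec, which drops a leading U+FEFF so that it never ends up in s. The sketch below runs the whole read, replace, and write cycle on a temporary file. The sample text and the two image names are invented for illustration, so treat it as a pattern rather than a drop-in replacement:

```python
import os
import tempfile

# Hypothetical image names, standing in for the real list.
li = ["[[Image:a.jpg]]", "[[Image:b.jpg]]"]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    out = os.path.join(tmp, "processed.txt")

    # Create a sample input file that starts with a BOM.
    with open(src, "w", encoding="utf-8-sig") as f:
        f.write("Zażółć [[Image:]]\ngęślą [[Image:]]\n")

    # utf-8-sig strips the leading BOM while decoding.
    with open(src, encoding="utf-8-sig") as myfile:
        s = " ".join(line.rstrip("\n") for line in myfile)

    # Replace each empty tag with the next image name, one at a time.
    for item in li:
        s = s.replace("[[Image:]]", item, 1)

    # Write plain UTF-8 (no BOM) so every character can be encoded.
    with open(out, "w", encoding="utf8") as dest:
        dest.write(s)

    # Read the raw bytes back to inspect what was written.
    with open(out, "rb") as f:
        raw = f.read()

print(s)
```

Whether you want the BOM stripped depends on what consumes the output file. If the BOM should be preserved, reading with plain utf8 as in the answer above keeps it.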

This question and answer about "Python: File encoding errors" come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/20164611/
