python - Character encoding from Chinese to Latin1 in Python

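I am trying to convert a localization file that contains Chinese characters so that the Chinese characters are converted to latin1 encoding.

However, when I run the Python script I get this error...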

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb9 in position 0: ordinal not in range(128)

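Here is my Python script. It simply takes user input for the file to convert, then converts the file (any line that starts with [ or is empty should be skipped)... The part that needs converting is always at index 1 of the list.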

# coding: utf8

# Enter File Name
file_name = raw_input('Enter File Path/Name To Convert: ')

# Open the File we Write too...
write_file = open(file_name + "_temp", 'w+')

# Open the File we Read From...
read_file = open(file_name)

with open(file_name) as file_to_write:
    for line in file_to_write:
        # We ignore any line that starts with [] or is empty...
        if line and line[0:1] != '[':
            split_string = line.split("=")
            if len(split_string) == 2:
                write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")
            else:
                write_file.write(line)
        else:
            write_file.write(line)



# Close File we Write too..
write_file.close()

# Close File we read too..
read_file.close()

An example configuration file is...

[Example]
Password=密碼

The output should be converted to...

[Example]
Password=±K½X

Best answer

The Latin1 encoding cannot represent Chinese characters. If your output has to be latin1, the best you can get is escape sequences.

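You are using Python 2.x. Python 3.x opens files in text mode by default and automatically decodes the bytes it reads into (unicode) strings.

In Python 2, when you read a file you get bytes. You are responsible for decoding those bytes into text (unicode objects in Python 2.x), processing them, and then re-encoding them into the desired encoding when you write the information to another file.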
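The decode, process, re-encode cycle described above can be sketched as follows. This is only an illustration written in Python 3 syntax: the helper name `convert_line` and the choice of GBK as source and UTF-8 as target are assumptions, not part of the original script.

```python
def convert_line(raw_line, source_encoding="gbk", target_encoding="utf-8"):
    """Hypothetical helper: bytes in, bytes out, with text in between."""
    # 1. Decode: raw bytes from the file become unicode text.
    text = raw_line.decode(source_encoding)
    # 2. Process: work on text, never on raw bytes.
    #    Here we only normalize the line ending as a stand-in for real work.
    text = text.rstrip("\r\n") + "\n"
    # 3. Re-encode: text becomes bytes in the encoding the output needs.
    return text.encode(target_encoding)


# Simulate one line read from a GBK-encoded file.
raw = "Password=密碼\r\n".encode("gbk")
converted = convert_line(raw)
```

Keeping the three steps separate makes it obvious which objects are bytes and which are text, which is exactly the distinction Python 2 blurs.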

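So, this line:

write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")

should be:

write_file.write(split_string[0] + "=" + split_string[1].decode('gbk').encode('latin1', errors="backslashreplace") + "\n")

instead. Calling .encode() on a byte string makes Python 2 first decode it implicitly with the ascii codec, which is where the UnicodeDecodeError comes from.

Now, note that I added the parameter errors="backslashreplace" to the encode call, because, as I said above, latin1 is a character set of only 256 code points (about 191 of them printable). It does contain the Latin letters and the most common accented characters ("á é í ó ú ç ã ñ" and so on), some punctuation and mathematical symbols, but no characters from other scripts.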
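To make the effect of the errors argument concrete, here is a small comparison of the standard error handlers when encoding Chinese text to latin1. It is a sketch in Python 3 syntax, and the sample string is taken from the example file in the question.

```python
text = "Password=密碼"

# Default handler ("strict"): characters outside latin1 raise an error.
try:
    text.encode("latin1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "replace": each unencodable character becomes a literal "?" (lossy).
replaced = text.encode("latin1", errors="replace")

# "backslashreplace": each unencodable character becomes a \uXXXX escape.
escaped = text.encode("latin1", errors="backslashreplace")

# "xmlcharrefreplace": each unencodable character becomes a &#NNNN; reference.
xml_refs = text.encode("latin1", errors="xmlcharrefreplace")
```

Only the last two handlers keep enough information to reconstruct the original characters later. `replace` silently discards them.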

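If you have to represent these characters as text, you should use the utf-8 encoding and configure whatever software consumes the generated file to read it with that encoding.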

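That said, what you are doing is simply a terrible practice. Unless you are opening a truly nightmarish file that is known to contain text in different encodings, you should decode all of the text to unicode and then re-encode all of it, not just the part of the data that is supposed to contain non-ASCII characters. Just don't do this if the original file has other characters that are incompatible with gbk; otherwise, your inner loop could just as well be: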

with open(file_name) as read_file, open(file_name + "_temp", "wt") as write_file:
    for line in read_file:
        # Python 2: decode the raw bytes as GBK, then re-encode the whole line as UTF-8
        write_file.write(line.decode("gbk").encode("utf-8"))
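For comparison, the same whole-file conversion can be written so that the decoding and encoding happen inside the file objects. This is a sketch, not the answer's own code: it relies on `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3, and the function name `convert_file` and the throwaway demo file are assumptions made for illustration.

```python
import io
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding="gbk", dst_encoding="utf-8"):
    """Hypothetical helper: re-encode a whole text file line by line."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
         io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:          # each line is already decoded text
            dst.write(line)       # encoded to dst_encoding on write


# Demo on a temporary GBK-encoded file mirroring the example config.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "example.ini")
dst_path = src_path + "_temp"
with io.open(src_path, "wb") as f:
    f.write(u"[Example]\nPassword=密碼\n".encode("gbk"))

convert_file(src_path, dst_path)

with io.open(dst_path, "rb") as f:
    converted = f.read()
```

With this approach the loop body never touches bytes, so there is no opportunity for an implicit ascii decode.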

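As for your "example output": that is the very same file, that is, the same bytes as in the first file. The program that displays the line "Password=密碼" is viewing the file with the GBK encoding, while the other program is viewing exactly the same bytes but interpreting them as latin1. You should not have to convert one into the other.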
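The "same bytes, two interpretations" point can be checked directly. The sketch below uses Python 3 syntax and assumes the bytes were produced with GBK, as the answer does. The exact garbled text a latin1 viewer shows depends on which encoding the file was really saved in, so no particular rendering is claimed here.

```python
# The bytes as they sit on disk (assumed here to be GBK-encoded).
raw = "Password=密碼".encode("gbk")

# Two programs reading the same bytes with different assumptions.
as_gbk = raw.decode("gbk")        # the readable Chinese text
as_latin1 = raw.decode("latin1")  # mojibake: one character per byte

# latin1 maps every byte to exactly one code point, so re-encoding the
# mojibake as latin1 gives back the original bytes unchanged.
round_tripped = as_latin1.encode("latin1")
```

Because `round_tripped` equals `raw`, a file "converted" this way is byte-for-byte identical to the original, which is why no conversion step is needed.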

Regarding "python - Character encoding from Chinese to Latin1 in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34930301/
