Python + PostgreSQL + strange ascii = UTF8 encoding error

I have ascii strings which contain the character "\x80" to represent the euro symbol:

>>> print "\x80"
€

When I insert string data containing this character into my database, I get:

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

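I'm a unicode newbie. How can I convert my strings containing "\x80" to valid UTF-8 containing that same euro symbol? I've tried calling .encode and .decode on various strings, but I run into errors: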

>>> "\x80".encode("utf-8")
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    "\x80".encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

Best answer

The question starts from a false premise:

I have ascii strings which contain the character "\x80" to represent the euro symbol.

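ASCII characters are in the range "\x00" to "\x7F" inclusive, so a string containing "\x80" is not ASCII.

The previously accepted but now deleted answer operated under two gross misapprehensions: (1) that locale == encoding, and (2) that the latin1 encoding maps "\x80" to the euro character.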

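In fact, all of the ISO-8859-x encodings map "\x80" to U+0080, which is one of the C1 control characters, not the euro character. Only 3 of those encodings (x in (7, 15, 16)) provide the euro character at all, and they put it at "\xA4". See the Wikipedia article on ISO/IEC 8859.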
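
The claim above can be spot-checked with a minimal sketch. It is written in Python 3 syntax (the sessions quoted in this thread are Python 2), it samples only four of the ISO-8859-x codecs that ship with Python, and the variable names are purely illustrative:

```python
# Python 3 sketch: what do some ISO-8859-x codecs make of 0x80 and 0xA4?
EURO = "\u20ac"

# A sample of the ISO-8859-x codecs shipped with Python (not the full family).
sampled = ["iso8859_1", "iso8859_7", "iso8859_15", "iso8859_16"]

# Byte 0x80 decodes to the C1 control character U+0080 in each of them.
decoded_80 = {name: b"\x80".decode(name) for name in sampled}

# Byte 0xA4 is the euro sign only in parts 7, 15 and 16.
euro_at_a4 = [name for name in sampled if b"\xa4".decode(name) == EURO]

for name in sampled:
    print(name, repr(decoded_80[name]), name in euro_at_a4)
```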

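You need to KNOW what encoding your data is in. What machine was it created on? How? The locale it was created in (not necessarily your locale) may give you a clue.

Note that "My data is encoded in latin1" is up there with "The check's in the mail" and "Of course I'll love you in the morning". Your data is probably encoded in one of the cp125x encodings used on Windows platforms. Note that all of them except cp1251 (Windows Cyrillic) map "\x80" to the euro character: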

>>> ['\x80'.decode('cp125' + str(x), 'replace') for x in range(9)]
[u'\u20ac', u'\u0402', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac']

Update in response to the OP's comment

I'm reading this data from a file, e.g. open(fname).read(). It contains strings with \x80 in them that represents the euro character. it's just a plain text file. it is generated by another program, but I don't know how it goes about generating the text. what would be a good solution? I'm thinking I can assume that it outputs "\x80" for a euro character, meaning I can assume it's encoded with a cp125x that has that char as the euro.

This is a bit confusing: first you say

It contains strings with \x80 in them that represents the euro character

but then you say

I'm thinking I can assume that it outputs "\x80" for a euro character

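Please explain.

Choosing the appropriate cp125x encoding: Where (geographically) was the file created? What language(s) is the text written in? Are there any characters with values > "\x7f" other than the presumed euro? If so, which ones, and in what context are they used?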
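
One way to answer the last two questions is to list every byte above "\x7f" in the data and see how each candidate encoding would render it. The following is only a rough sketch in Python 3 syntax; `survey_high_bytes`, its default candidate list and the sample bytes are all made up for illustration:

```python
from collections import Counter


def survey_high_bytes(raw, candidates=("cp1250", "cp1251", "cp1252")):
    """Count each byte above 0x7F in raw and show how each candidate decodes it."""
    counts = Counter(b for b in raw if b > 0x7F)
    report = {}
    for byte, n in sorted(counts.items()):
        decoded = {
            # 'replace' yields U+FFFD where a codec leaves the byte undefined.
            enc: bytes([byte]).decode(enc, "replace")
            for enc in candidates
        }
        report[byte] = (n, decoded)
    return report


sample = b"Price: \x80 10, caf\xe9"  # made-up sample data
for byte, (n, decoded) in survey_high_bytes(sample).items():
    print(hex(byte), n, decoded)
```

If a candidate turns the surrounding text into plausible words in the expected language, that is evidence in its favour; it is not proof.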

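Update 2: If you "don't know how the program is written", neither you nor we can form an opinion on whether it always uses "\x80" for the euro character. It would be monumentally silly for it not to, but the possibility can't be ruled out.

If the text is written in English and/or was produced in the USA and/or on a Windows platform, then it is reasonably certain that cp1252 is the way to go ... until you get evidence to the contrary, in which case you will need to guess the encoding yourself or answer the questions above (what language, what locale).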
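
Assuming cp1252 does turn out to be the right guess, the conversion the question asks for is a decode followed by an encode. This is a minimal sketch in Python 3 syntax; `to_utf8` and the sample bytes are hypothetical, and the default encoding is exactly the assumption discussed above:

```python
def to_utf8(raw, source_encoding="cp1252"):
    """Decode bytes from the assumed source encoding and re-encode as UTF-8."""
    # Raises UnicodeDecodeError for bytes the source codec leaves undefined,
    # which is a useful early sign that the guess was wrong.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


raw = b"Total: \x80 25"  # hypothetical content, e.g. from open(fname, "rb").read()
utf8_bytes = to_utf8(raw)
print(utf8_bytes)
```

In the Python 2 session shown in the question, the equivalent would be "\x80".decode("cp1252").encode("utf-8"). The original .encode("utf-8") call failed because Python 2 first tried to decode the byte string implicitly as ASCII.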

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/2991660/
