python - UTF-8 "inconsistency" when storing unicode keys/values from a CSV file in a dict

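I always get very confused when dealing with Unicode and UTF-8 characters and encodings in Python. There is probably a simple explanation for what I describe below, but so far I haven't been able to figure it out.

Suppose I have a very, very simple .csv file that contains non-ASCII characters: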

tildes.csv:

Año,Valor
2001,Café
2002,León

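I want to read that file with csv.DictReader and store its keys/values as unicode strings in a Python dict, with tildes and the like handled properly (unescaped). I have seen Tornado and Django handle unicode key/value sets correctly, so I said to myself: yes, I can do this too!... But no... it looks like I can't.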

import csv

with open('tildes.csv', 'r') as csv_f:
    reader = csv.DictReader(csv_f)
    for dct in reader:
        print "dct (original): %s" % dct
        for k, v in dct.items():
            print '%s: %s' % (unicode(k, 'utf-8'), unicode(v, 'utf-8'))
        utf_dct = dict((unicode(k, 'utf-8'), unicode(v, 'utf-8')) \
                  for k, v in dct.items())
        print utf_dct

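So I thought: OK, I read a dict from the file (its keys are Año and Valor), which gets loaded as ASCII with escaped characters, but then I can convert them to unicode values and use those as keys... Wrong!

This is what I see when I run the code above: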

dct (original): {'A\xc3\xb1o': '2001', 'Valor': 'Caf\xc3\xa9'}
Año: 2001
Valor: Café
{u'A\xf1o': u'2001', u'Valor': u'Caf\xe9'}
dct (original): {'A\xc3\xb1o': '2002', 'Valor': 'Le\xc3\xb3n'}
Año: 2002
Valor: León
{u'A\xf1o': u'2002', u'Valor': u'Le\xf3n'}

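So the first line shows the dict "as is" (escaped). Fine, nothing weird here. Then I print all the keys/values converted to unicode. It shows the characters the way I want. Also good. But then, using exactly the same instruction I used to convert the strings when printing them, I try to create a dict (the utf_dct variable), and when I print it, I get the escaped values again.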


Edit 1:

Actually, I don't think I even need the csv file to show what I mean. I just tried this in the console:

>>> print "Año"
Año                      # Yeey!! There's hope!
>>> print {"Año": 2001}
{'A\xc3\xb1o': 2001}     # 2 chars --> Ascii, I think I get this part 
>>> print {u"Año": 2001}
{u'A\xf1o': 2001}        # What happened here? 
                         # Why am I seeing the 0x00F1 UTF-8 code 
                         # from the Latin-1 Supplement (wiki:
                         # http://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
                         # instead of an ñ?

Why can't I just print a dict that shows {u'Año': 2001}? My terminal clearly accepts it. What is going on here?

Best answer

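When you print a string by itself, it is printed "nicely" using its str() representation. When you print a dict, its contents are printed using their repr() representation, which in Python 2 always escapes non-ASCII characters. The contents of the string are the same in both cases; Python just displays them differently. It is the same reason why no quotes are printed around Año in the first case but quotes are printed around 'A\xc3\xb1o' in the second. They are simply two different display formats.

Here is a short example that may help illustrate the situation: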
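The split between the two display formats can be checked directly. The following is a minimal sketch that assumes Python 3, not the Python 2 used in the question: there repr() no longer escapes non-ASCII characters, so the built-in ascii() is used to stand in for the escaped form that Python 2's repr() produced. The names word, record and raw are only illustrative.

```python
# Python 3 sketch: a container formats its items with repr(), not str().
word = "Año"

print(str(word))    # plain text, no quotes
print(repr(word))   # quoted form; Python 3 leaves the ñ readable
print(ascii(word))  # escaped form, similar to what Python 2's repr() returned

record = {word: 2001}

# str() of a dict is assembled from the repr() of each key and value,
# which is why the key appears in quotes even though print() was used.
print(str(record))
print(ascii(record))

# The UTF-8 encoded bytes are a different object from the text string:
# the single character ñ becomes the two bytes 0xC3 0xB1.
raw = word.encode("utf-8")
print(repr(raw))
```

In each case the underlying data is unchanged; only the formatting function differs.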

>>> import unicodedata
>>> unicodedata.name(u'\u00f1') # U+00F1 is the Unicode code point for this character
'LATIN SMALL LETTER N WITH TILDE'
>>> print u'\u00f1' # printing the string itself gives a displayable character (assuming a UTF-8 terminal)
ñ
>>> print repr(u'\u00f1') # repr() gives an escaped representation
u'\xf1'
>>> print repr(u'\u00f1'.encode('utf-8')) # repr() of the UTF-8 byte string shows the two bytes of the encoding; this is what happens when showing a dict of str values
'\xc3\xb1'
>>> len(u'\u00f1'.encode('utf-8')) # the UTF-8 encoded byte string is two bytes long
2
>>> len(repr(u'\u00f1')) # the repr() is 7 characters long (`u`, `'`, `\`, `x`, `f`, `1`, `'`)
7

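There is a related bug report proposing to change this behavior so that repr does not escape non-ASCII characters. According to that bug report, the change was made in Python 3, so the tools you have seen doing this may be running on Python 3.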
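To see what that looks like for the CSV case, here is a rough Python 3 sketch. It is an assumption about how the question's script might be ported, not code from the question: the file content is held in memory as UTF-8 bytes (csv_bytes) so the example is self-contained, and with a real file the stream would instead come from open('tildes.csv', newline='', encoding='utf-8').

```python
import csv
import io

# Same content as tildes.csv, encoded as UTF-8 to stand in for the file on disk.
csv_bytes = "Año,Valor\n2001,Café\n2002,León\n".encode("utf-8")

# Decoding happens once, when the byte stream is wrapped as text;
# DictReader then yields keys and values that are already text strings.
with io.TextIOWrapper(io.BytesIO(csv_bytes), encoding="utf-8", newline="") as csv_f:
    rows = [dict(row) for row in csv.DictReader(csv_f)]

for row in rows:
    # The dict is still shown through repr() of its items (note the quotes),
    # but Python 3's repr() leaves the non-ASCII characters unescaped.
    print(row)
```

No per-key unicode(k, 'utf-8') conversion is needed in this version, because the text layer has already decoded the bytes.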

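Individual tools are also free to display things however they like. A tool does not have to just call str(someDict) and show the result; if it wants, it can call str on the dict's contents "by hand" and build its own displayable version from them.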
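As an illustration of building a display string by hand, here is a small sketch. The helper name readable_dict is hypothetical and not part of any library, and the code is written for Python 3; under Python 2 the same idea would need unicode() rather than str() for non-ASCII text.

```python
def readable_dict(mapping):
    """Build a display string by calling str() on each key and value.

    Hypothetical helper for illustration only. The result is meant for
    people to read; it is not valid Python syntax, since strings are
    shown without quotes or escapes.
    """
    pairs = ("%s: %s" % (str(key), str(value)) for key, value in mapping.items())
    return "{" + ", ".join(pairs) + "}"


row = {"Año": "2001", "Valor": "Café"}
print(readable_dict(row))
```

The trade-off is that the output is no longer unambiguous: a key containing ", " or ": " would be indistinguishable from a separator, which is one reason the default dict display relies on repr().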

Regarding python - UTF-8 "inconsistency" when storing unicode keys/values from a CSV file in a dict, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23048430/
