python - How can I decode this UTF-8 string, picked up from a random website and saved by the Django ORM, using Python?

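I parsed a file with Django and saved its content in a database. The website is 100% English, so I naively assumed it would be ASCII all the way through and happily saved the text as unicode.

You can guess the rest of the story :-)

When I print it, I get the usual encoding error: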

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 48: ordinal not in range(128)

A quick search told me that u'\u2019' is the UTF-8 representation of ’.
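
A note on terminology that matters for the rest of this thread: U+2019 is a Unicode code point, and UTF-8 is one way of serializing that code point into bytes. The following is a minimal sketch of the distinction, written in Python 3 syntax (an assumption; the session in this post is Python 2, where the literal would be written u'\u2019'):

```python
import unicodedata

ch = "\u2019"  # one abstract code point, independent of any encoding

name = unicodedata.name(ch)        # official Unicode character name
code_point = "U+%04X" % ord(ch)    # the code point number, in hex
utf8_bytes = ch.encode("utf-8")    # the bytes produced by the UTF-8 encoding

print(name)        # RIGHT SINGLE QUOTATION MARK
print(code_point)  # U+2019
print(utf8_bytes)  # b'\xe2\x80\x99'
```

So \u2019 names the character itself, while e2 80 99 is what that character looks like once encoded as UTF-8.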

repr(string) shows:

"u'his son\\u2019s friend'"

Then of course I tried django.utils.encoding.smart_str and the more direct approach of string.encode('utf-8'), and I finally got something printable. Unfortunately, this is how it prints in my (Linux, UTF-8) terminal:

In [76]: repr(string.encode('utf-8'))
Out[76]: "'his son\\xe2\\x80\\x99s friend '"

In [77]: print string.encode('utf-8')
his son�s friend

Not what I expected. I suspect I double-encoded something or missed an important point.
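
One way to test the double-encoding suspicion is sketched below, in Python 3 syntax (an assumption, since the session above is Python 2). Bytes that were encoded once decode straight back to the original text, while double-encoded bytes decode to mojibake. The helper name `double_encode` is hypothetical, and Latin-1 is assumed as the wrongly guessed intermediate encoding.

```python
text = "his son\u2019s friend"

# Encoded once: the three bytes e2 80 99 stand for U+2019.
once = text.encode("utf-8")

def double_encode(s):
    """Simulate the classic bug: UTF-8 bytes misread as Latin-1, then encoded again."""
    return s.encode("utf-8").decode("latin-1").encode("utf-8")

twice = double_encode(text)

print(once)                           # b'his son\xe2\x80\x99s friend'
print(twice)                          # b'his son\xc3\xa2\xc2\x80\xc2\x99s friend'
print(once.decode("utf-8") == text)   # True: the data round-trips cleanly
print(twice.decode("utf-8") == text)  # False: this is what double encoding looks like
```

The bytes shown in Out[76] match the single-encoded form, so by this check the stored data is not double-encoded.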

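Of course, the file's original encoding was not shipped with the file. I suppose I could read the HTTP headers or ask the webmaster, but since \u2019s looks like UTF-8, I assumed it is UTF-8. I may be terribly wrong, so please tell me if I am.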
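
Reading the declared charset from the HTTP Content-Type header could look like the following sketch. It is written in Python 3 syntax and uses the standard library's `email.message.Message` to parse the header value. The function names `charset_from_content_type` and `decode_body`, the sample header, and the UTF-8 fallback are all assumptions for illustration, and a server can still declare the wrong charset.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)

def decode_body(raw_bytes, header_value):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(header_value))

body = b"his son\xe2\x80\x99s friend"
text = decode_body(body, "text/html; charset=UTF-8")

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(text == "his son\u2019s friend")                         # True
```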

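A solution would obviously be appreciated, but a thorough explanation of the cause, and of how to keep this from happening again, would matter even more. I get bitten by encodings often, which tells me I have not fully mastered the subject yet.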

Best answer

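You're fine. You have the correct data. Yes, the original data is UTF-8 (based on context, u2019 makes perfect sense as the apostrophe between "son" and "s"). The weird � error character probably just means that the font your terminal is configured with has no glyph for this character (a fancy apostrophe). No big deal. The data is correct where it counts. If you are nervous, try a few different terminal/OS combinations (I am using iTerm on OS X). I have spent a lot of time explaining to my QA people that the scary � question mark characters just mean they don't have Chinese fonts installed on their Windows boxes (in my case, we are testing with Chinese data).

Here is some commentary: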

#Create a Python Unicode object
#(abstract code points, independent of any encoding)
#single backslash tells python we want to represent
#a code point by its unicode code point number, typed out with ASCII numbers
>>> s1 = u'his son\u2019s friend'

#If you just type it at the prompt,
#the interpreter does the equivalent of `print repr(s1)`
#and since repr means "show it like a string typed into a python source file",
#you get your ASCII escaped version back
>>> s1
u'his son\u2019s friend'
>>> print repr(s1)
u'his son\u2019s friend'

#This isn't ASCII, so encoding into ASCII generates your original
#error as expected
>>> s1.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

#Encode in UTF-8 and now we have a byte string (a Python 2 str),
#which gets displayed with hex escapes.
#Unicode code point 2019 looks like it gets 3 bytes in UTF-8 (yup, it does)
>>> s1.encode('utf-8')
'his son\xe2\x80\x99s friend'

#My terminal DOES have a different glyph (symbol) to use here,
#so it displays OK for me.
#Note that my terminal has a different glyph for a normal ASCII apostrophe
#(straight vertical)
>>> print s1
his son’s friend
>>> repr(s1)
"u'his son\\u2019s friend'"
>>> str(s1.encode('utf-8'))
'his son\xe2\x80\x99s friend'
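
For comparison, here is a rough Python 3 rendering of the session above (an assumption; the original is Python 2). In Python 3, text is `str` and encoded data is `bytes`, and the `errors` argument of `str.encode` controls what happens when a character does not fit the target encoding. This is a sketch of the general pattern, not a fix specific to Django.

```python
s1 = "his son\u2019s friend"  # text: a sequence of code points

# Strict ASCII encoding fails, just like the original error.
try:
    s1.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Alternative error handlers trade fidelity for robustness.
replaced = s1.encode("ascii", "replace")          # b'his son?s friend'
escaped = s1.encode("ascii", "backslashreplace")  # b'his son\\u2019s friend'

# UTF-8 can represent the character, and decoding restores the text.
utf8 = s1.encode("utf-8")                         # b'his son\xe2\x80\x99s friend'
round_trip = utf8.decode("utf-8")

print(ascii_failed, replaced, escaped, utf8, round_trip == s1)
```

The habit this suggests is to decode bytes to text once, as soon as they enter the program, keep text everywhere inside it, and encode only when writing out.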

See also: http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html

See also character 2019 (e28099 in hex; search that page for "2019"): http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8000

See also: http://www.joelonsoftware.com/articles/Unicode.html

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/6606291/
