python - In what world would \\u00c3\\u00a9 become é?

I have a JSON document, probably incorrectly encoded, from a source I have no control over. It contains the following strings:

d\u00c3\u00a9cor

business\u00e2\u20ac\u2122 active accounts 

the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label

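From this, I gather they intended \u00c3\u00a9 to become é, which in UTF-8 is the byte pair C3 A9. That makes some sense. For the others, I assume we are dealing with some kind of directional (curly) quotation marks.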
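One way to check this reading is to reproduce the damage on purpose. The sketch below assumes the producer encoded its text as UTF-8 and then decoded those bytes as Windows-1252 before serializing to JSON; that is a guess consistent with the samples, not something the source confirms. The helper name `simulate_mojibake` is made up for illustration.

```python
import json


def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252.

    This is the assumed failure mode, not a confirmed description of
    what the upstream system does.
    """
    return text.encode("utf-8").decode("cp1252")


broken_e = simulate_mojibake("é")                # UTF-8 bytes C3 A9 read as two characters
broken_apostrophe = simulate_mojibake("\u2019")  # right single quote, UTF-8 bytes E2 80 99

# json.dumps escapes non-ASCII characters by default, which yields the
# same escape sequences seen in the document.
print(json.dumps(broken_e))           # "\u00c3\u00a9"
print(json.dumps(broken_apostrophe))  # "\u00e2\u20ac\u2122"
```

Both outputs match the escapes in the sample strings, which supports the misdecoded-UTF-8 explanation.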

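My theory is that this either uses some encoding I have never encountered before, or that it has somehow been double-encoded. I am fine with writing some code to convert their broken input into something I can understand, since it is highly unlikely they would fix the system even if I brought it to their attention.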
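If the double-encoding theory holds, the damage can be reversed with the standard library alone: turn each character back into the byte it was misread from, then decode the bytes as UTF-8. The following is a minimal sketch under that assumption. The function name `undo_mojibake` is hypothetical, and the Latin-1 fallback covers characters such as \u009d, which Python's `cp1252` codec cannot encode because byte 0x9D is unassigned in that code page.

```python
def undo_mojibake(text: str) -> str:
    """Reverse UTF-8 text that was wrongly decoded as Windows-1252.

    Returns the input unchanged when it does not fit that pattern.
    """
    raw = bytearray()
    try:
        for ch in text:
            try:
                raw += ch.encode("cp1252")
            except UnicodeEncodeError:
                # Bytes left unassigned by cp1252 (e.g. 0x9D) show up as
                # C1 control characters; Latin-1 maps them back 1:1.
                raw += ch.encode("latin-1")
        return raw.decode("utf-8")
    except UnicodeError:
        # Not mojibake of the assumed kind, so leave the text alone.
        return text


print(undo_mojibake("d\u00c3\u00a9cor"))
print(undo_mojibake("business\u00e2\u20ac\u2122 active accounts"))
print(undo_mojibake("the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label"))
```

This handles the three samples, but it is all-or-nothing per string: text that mixes damaged and undamaged non-ASCII characters comes back unchanged, which is one reason a dedicated library is a safer choice.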

Any ideas on how to coerce their input into something I can understand? For the record, I am working in Python.

Best answer

You should try the ftfy module:

>>> import ftfy  # third-party package: pip install ftfy
>>> print(ftfy.fix_text("d\u00c3\u00a9cor"))
décor
>>> print(ftfy.fix_text("business\u00e2\u20ac\u2122 active accounts"))
business' active accounts
>>> print(ftfy.fix_text("the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label"))
the "Made in the USA" label
>>> print(ftfy.fix_text("the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label", uncurl_quotes=False))
the “Made in the USA” label
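Since the input is a whole JSON document rather than isolated strings, the repair has to be applied to every string inside the parsed structure. Here is a minimal sketch of such a walker. The names `fix_strings` and `latin1_roundtrip` are hypothetical, and the simple Latin-1 round trip only stands in for a real fixer so that the example runs without third-party packages.

```python
import json
from typing import Any, Callable


def fix_strings(node: Any, fix: Callable[[str], str]) -> Any:
    """Return a copy of parsed JSON data with `fix` applied to every string."""
    if isinstance(node, str):
        return fix(node)
    if isinstance(node, list):
        return [fix_strings(item, fix) for item in node]
    if isinstance(node, dict):
        # Keys are strings too and may carry the same damage.
        return {fix_strings(k, fix): fix_strings(v, fix) for k, v in node.items()}
    return node  # numbers, booleans and None pass through untouched


def latin1_roundtrip(s: str) -> str:
    """Stand-in fixer: undo UTF-8 misread as Latin-1, else return s as-is."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return s


doc = json.loads('{"title": "d\\u00c3\\u00a9cor", "tags": ["caf\\u00c3\\u00a9", 3]}')
fixed = fix_strings(doc, latin1_roundtrip)
print(fixed)  # {'title': 'décor', 'tags': ['café', 3]}
```

With ftfy installed, passing `ftfy.fix_text` as the `fix` argument should work the same way, because it accepts a single string and returns the repaired string.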

Regarding "python - In what world would \\u00c3\\u00a9 become é?", the original question is on Stack Overflow: https://stackoverflow.com/questions/26614323/
