python - Decode if it's not unicode

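I want my function to accept an argument that may be either a unicode object or a UTF-8 encoded string. Inside the function, I want to convert the argument to unicode. I have something like this: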

def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')

    ...

Is it possible to avoid using isinstance? I'm looking for something more duck-typing friendly.

While experimenting with decoding, I ran into some odd behavior in Python. For example:

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)

Or:

>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported

By the way, I'm using Python 2.6.

Best answer

You can simply try to decode it with the 'utf-8' codec, and return the object unchanged if that doesn't work.

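def myfunction(text):
    try:
        # unicode() raises TypeError when text is already a unicode object
        text = unicode(text, 'utf-8')
    except TypeError:
        return text
    # Also return the decoded result when decoding succeeds
    return text

print(myfunction(u'cer\xf3n'))
# cerón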

When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then calls that string object's decode('utf-8') method.

Sometimes the conversion from unicode object to string object fails, because Python 2 uses the ascii codec by default.
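The two-step behavior described above can be sketched explicitly. This is only an illustration: `implicit_decode` is a hypothetical helper name, not part of Python, and it assumes the implicit conversion uses the 'ascii' codec, as the answer states.

```python
def implicit_decode(text, encoding='utf-8'):
    """Hypothetical helper imitating Python 2's unicode.decode()."""
    # Step 1: the implicit conversion to a byte string, using 'ascii'.
    raw = text.encode('ascii')
    # Step 2: the decode that was actually requested.
    return raw.decode(encoding)


print(implicit_decode(u'hello'))
# hello

try:
    implicit_decode(u'cer\xf3n')
except UnicodeEncodeError as exc:
    # Step 1 fails because u'\xf3' is outside the ASCII range.
    print(type(exc).__name__)
# UnicodeEncodeError
```

This is why a call named decode can end in a UnicodeEncodeError: the failure happens in the hidden encode step, before any decoding takes place.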

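So, in general, never try to decode unicode objects. Or, if you must try, wrap it in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.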
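As a rough sketch of the try..except approach without isinstance, the following calls decode and falls back to the original object. The name `safe_decode` is an assumption made for illustration. It is written so that it should also run on Python 3, where text strings have no decode method at all.

```python
def safe_decode(value, encoding='utf-8'):
    """Return value decoded to text, or value itself if it is already text."""
    try:
        return value.decode(encoding)
    except UnicodeEncodeError:
        # Python 2: value was a unicode object with non-ASCII characters,
        # so the implicit ascii encode step failed.
        return value
    except AttributeError:
        # Python 3: text strings have no decode() method.
        return value


print(safe_decode(b'cer\xc3\xb3n') == u'cer\xf3n')
# True
print(safe_decode(u'cer\xf3n') == u'cer\xf3n')
# True
```

Byte strings that are not valid UTF-8 still raise UnicodeDecodeError here, which is probably what you want: that case signals bad input rather than text that is already decoded.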

See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum's blog:

"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
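To illustrate the last point of the quote, here is a small sketch of how those conversions can still be reached outside the str/bytes encode and decode methods. It assumes a reasonably recent Python 3 (the `rot_13` transform goes through `codecs.encode`); the variable names are only for illustration.

```python
import base64
import bz2
import codecs

data = b'hello'

# base64: bytes in, bytes out, via the base64 module.
encoded = base64.b64encode(data)
print(base64.b64decode(encoded) == data)
# True

# bz2: bytes in, bytes out, via the bz2 module.
compressed = bz2.compress(data)
print(bz2.decompress(compressed) == data)
# True

# rot13: text in, text out, via codecs.encode rather than str.encode.
rotated = codecs.encode(u'hello', 'rot_13')
print(rotated)
# uryyb
```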

Regarding python - Decode if it's not unicode, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3857763/
