Python 2.7 Unicode decode error: 'ascii' codec can't decode byte

I have been parsing some docx files (UTF-8 encoded XML) that contain special characters (Czech letters). Printing to stdout works fine, but I cannot write the data to a file:

Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)

I explicitly convert the word variable to the unicode type (type(word) returns unicode), and I have also tried .encode('utf-8'), but I am still stuck with this error.

Here is the current code:

for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

I also tried the following:

for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...

And even a combination of the two:

word = unicode(word)
word = word.encode('utf-8')

I was getting desperate, so I even tried encoding the word variable inside the ofile.write() call:

ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')

I would appreciate any hints about what I am doing wrong.

Best answer

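ofile is a byte stream, and you are writing a character (unicode) string to it. Python 2 therefore tries to compensate for the mistake by implicitly encoding the string to a byte string with the default 'ascii' codec. That is generally safe only for ASCII characters. Since word contains non-ASCII characters, it fails: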

>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
                    ordinal not in range(128)
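The same failure can be reproduced without any file. The following is a minimal sketch, assuming the implicit conversion behaves like an explicit .encode('ascii'); the helper name try_ascii and the sample word are hypothetical and not part of the original code:

```python
from __future__ import print_function


def try_ascii(text):
    """Return the ASCII-encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        return str(exc)


# Plain ASCII text encodes without trouble.
print(try_ascii(u'abc'))

# u'\xed' (i with acute accent) is outside ASCII, so the error message is returned.
print(try_ascii(u'p\xedsmo'))
```

The second call reports the same "ordinal not in range(128)" message as the traceback in the question.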

Make ofile a text stream by opening the file with io.open, using a mode such as 'wt' and an explicit encoding:

>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L
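Applied to the loop from the question, this might look roughly like the sketch below. The question does not show how ofile was opened, so the output path and the sample contents of word_list are assumptions made up for illustration:

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the asker's word_list.
word_list = [u'p\xedsmo', u'\u010de\u0161tina']

# Hypothetical output path; the real script would use its own file name.
out_path = os.path.join(tempfile.mkdtemp(), 'out.xml')

with io.open(out_path, 'wt', encoding='utf-8') as ofile:
    for word in word_list:
        # Every operand is unicode, so nothing is implicitly converted via ASCII.
        ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# Reading back with the same encoding returns the original characters.
with io.open(out_path, 'rt', encoding='utf-8') as infile:
    content = infile.read()
```

With a text stream, word should stay a unicode object all the way to write(); the stream does the UTF-8 encoding itself.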

Alternatively, you can use codecs.open, which has almost the same interface, or encode all strings manually with encode.
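A rough sketch of both alternatives follows. The file names and the sample word are hypothetical, and each file is written to a temporary directory only to keep the example self-contained:

```python
import codecs
import os
import tempfile

word = u'p\xedsmo'  # hypothetical sample word
line = u'<feat att="writtenForm" val="' + word + u'"/>\n'
tmp_dir = tempfile.mkdtemp()

# Option 1: codecs.open returns a wrapper that encodes unicode on write.
codecs_path = os.path.join(tmp_dir, 'codecs.xml')
with codecs.open(codecs_path, 'w', encoding='utf-8') as ofile:
    ofile.write(line)

# Option 2: keep a binary file and encode the whole line just before writing.
manual_path = os.path.join(tmp_dir, 'manual.xml')
with open(manual_path, 'wb') as ofile:
    ofile.write(line.encode('utf-8'))

# Both approaches should produce identical bytes on disk.
with open(codecs_path, 'rb') as f:
    codecs_bytes = f.read()
with open(manual_path, 'rb') as f:
    manual_bytes = f.read()
```

Note that the manual variant encodes the complete line. Encoding only word and then concatenating it with u'...' literals, as in the question, makes Python 2 implicitly decode the bytes back through ASCII, which raises a UnicodeDecodeError instead.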

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/13512443/
