python - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

Tags: python file-io io output


Traceback (most recent call last):
  File "", line 83, in <module>
    outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)


from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)

with open("lexic.txt", "w") as outf:
    for word,count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
Given the test files, assign each word its most probable tag according to the model.
Save the result in files that have this format on each line:
Word  Prediction
file=open("lexic.txt", "r") # open the lexic file (our model) (try with this one)
diccionario = {}

In this portion of the code we iterate over the lines of the .txt document and build a dictionary with a word as the key and a list as the value.
Key: word
Value: List ([tag, #occurrencesWithTheTag])
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Split the string into a list: [word, tag, occurrences], word=sintagma[0], tag=sintagma[1], occurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): # Skip the header line of the file
        if (diccionario.has_key(sintagma[0])): # Check whether the word is already in the dictionary
            aux_list = diccionario.get(sintagma[0]) # The word already exists, so fetch its current list
            aux_list.append([sintagma[1], sintagma[2]]) # Append the tag and the occurrences for this word
            diccionario.update({sintagma[0]:aux_list}) # Store the extended list (previous list + newly appended element)
        else: # The key is not in the dictionary yet, so start its list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else}) # Store the new entry; without this line the list is never saved

Here we create a new dictionary based on the one built above. In this new dictionary (diccionario2) we want to keep the following:
Key: word
Value: List ([suggestedTag, #occurrencesOfTheWordInTheDocument, probability])

To retrieve the information from diccionario, keep the following in mind:

When a word (keyword) has more than one tag associated with it, the first tag is at diccionario.get(keyword)[0] and its occurrencesWithTheTag at diccionario.get(keyword)[1];
from the second tag onward, the information is accessed this way:

diccionario.get(keyword)[2][0] -> the second tag
diccionario.get(keyword)[2][1] -> the second occurrencesWithTheTag
diccionario.get(keyword)[3][0] -> the third tag
diccionario2 = dict.fromkeys(diccionario.keys()) # Create a dictionary with the keys from diccionario and set all the values to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword)[0].decode('utf-8')) # tagSugerido is the tag with the most occurrences for this keyword
        maximo = float(diccionario.get(keyword)[1]) # maximo is the highest number of occurrences seen for this keyword
        if ((len(diccionario.get(keyword))) > 2): # The word has more than one tag
            suma = float(diccionario.get(keyword)[1])
            for i in range (2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (float(diccionario.get(keyword)[i][1]) > maximo): # Compare as numbers, not as strings
                    tagSugerido = unicode(diccionario.get(keyword)[i][0].decode('utf-8'))
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma)
            diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})
        else: # Only one tag, so its probability is 1
            diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})

        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))
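The selection logic above can be sketched more directly if each word maps to a dictionary of tag counts instead of a flat list. This is only a sketch, not the code from the question: the function name `most_probable_tag` and the sample counts are assumptions made up for illustration.

```python
def most_probable_tag(tag_counts):
    """Return (tag, total, probability) for a {tag: count} mapping."""
    total = float(sum(tag_counts.values()))
    # max() with key= picks the tag whose count is largest
    best_tag = max(tag_counts, key=tag_counts.get)
    return best_tag, total, tag_counts[best_tag] / total


# Hypothetical counts for one word seen with two different tags
counts = {u"NC": 3, u"V": 1}
tag, total, probability = most_probable_tag(counts)
print(tag, total, probability)
```

Keeping the counts as numbers from the start avoids the repeated float() conversions and the index arithmetic on nested lists.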


keyword(String)  tagSugerido(String):
Hello    NC
Friend   N
Run      V


outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))



As zmo suggested, this line:

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

should be:

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
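To see where the message comes from: when Python 2 has to combine a byte str with a unicode object, it first decodes the bytes with the default ascii codec. The sketch below performs that decode step explicitly, so it behaves the same on Python 2 and Python 3. The sample word is an assumption, not data from the question.

```python
# "cancion" with an acute accent on the o (U+00F3), encoded as UTF-8;
# the accented letter becomes the two bytes 0xc3 0xb3
utf8_bytes = u"canci\u00f3n".encode("utf-8")

try:
    # The explicit form of the implicit conversion Python 2 attempts
    utf8_bytes.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding with the codec the bytes were written in succeeds
text = utf8_bytes.decode("utf-8")
print(len(utf8_bytes), len(text))
```

Byte 0xc3 is the first byte of the UTF-8 encoding of the characters U+00C0 to U+00FF, which covers most accented Latin letters, so this error usually means that UTF-8 bytes are being decoded as ASCII somewhere.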

A note on unicode in Python 2

Your software should work only with unicode strings internally, converting to a particular encoding on output.
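A minimal sketch of that rule: decode bytes as soon as they enter the program, do all formatting on unicode objects, and encode exactly once when writing. The function name `format_line` and the sample bytes are assumptions. The input is taken to be latin_1, as in the question's corpus.

```python
def format_line(word, tag):
    """Build one output line from unicode arguments only."""
    return u"{}\t{}\n".format(word, tag)


# Input boundary: raw latin_1 bytes become unicode immediately
raw_word = b"ni\xf1o"  # "nino" with n-tilde, latin_1 encoded
word = raw_word.decode("latin_1")

# Internal work happens on unicode objects
line = format_line(word, u"NC")

# Output boundary: encode once, just before writing
encoded = line.encode("utf-8")
print(repr(encoded))
```

Because no byte string ever reaches `format_line`, there is no point at which an implicit ascii conversion can be triggered.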

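To avoid making the same mistake over and over, make sure you understand the difference between the ascii and utf-8 encodings, and also the difference between str and unicode objects in Python.


ASCII needs just one byte to represent every possible character in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.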

ascii (default)
1    If the code point is < 128, each byte is the same as the value of the code point.
2    If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

utf-8 (unicode transformation format)
1    If the code point is <128, it’s represented by the corresponding byte value.
2    If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3    Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
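Both sets of rules can be checked directly. This sketch encodes one sample character from each range; the characters chosen are arbitrary examples.

```python
samples = [
    u"A",           # U+0041, below 128
    u"\u00e9",      # U+00E9, between 128 and 0x7ff
    u"\u20ac",      # U+20AC, above 0x7ff
    u"\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

# Number of bytes each character needs in UTF-8
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# ASCII can represent only the first sample
try:
    u"\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```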


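You could say that str is basically a byte string and unicode is a unicode string. A str can hold text in different encodings such as ascii or utf-8, and a unicode object can be encoded into any of them.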

str vs. unicode
1   str     = byte string (8-bit) - uses \x and two digits
2   unicode = unicode string      - uses \u and four digits
3   basestring
       /  \
    str    unicode
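A short sketch of the distinction. It is written to run on both Python 2 and Python 3; note that in Python 3 the byte string type is called bytes and the unicode type is called str. The sample word is an assumption.

```python
text = u"caf\u00e9"          # unicode object: a sequence of code points
data = text.encode("utf-8")  # byte string: a sequence of 8-bit values

print(len(text))  # 4 code points
print(len(data))  # 5 bytes, because U+00E9 takes two bytes in UTF-8

# The two types are distinct, and the round trip is lossless
print(type(text) is type(data))      # False
print(data.decode("utf-8") == text)  # True
```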

If you follow a few simple rules, you should be fine handling str/unicode objects in different encodings such as ascii or utf-8 or whatever encoding you have to use:

1    encode(): Gets you from Unicode -> bytes
     encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2    decode(): Gets you from bytes -> Unicode
     decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3    codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4    u"": Makes your string literals into Unicode objects rather than byte sequences.
5    unicode(string[, encoding, errors]) 
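The first three rules can be sketched as one round trip. The file name and the sample line are assumptions, and the file is written to a temporary directory so that nothing existing is overwritten.

```python
import codecs
import os
import tempfile

text = u"ma\u00f1ana\tNC\n"

# Rule 1: encode() goes from Unicode to bytes
data = text.encode("utf-8")

# Rule 2: decode() goes from bytes back to Unicode
assert data.decode("utf-8") == text

# Rule 3: codecs.open() encodes on write and decodes on read
path = os.path.join(tempfile.mkdtemp(), "estimation.txt")
with codecs.open(path, "w", encoding="utf-8") as outfile:
    outfile.write(text)
with codecs.open(path, "r", encoding="utf-8") as infile:
    read_back = infile.read()

print(read_back == text)
```

`io.open(path, "w", encoding="utf-8")` does the same job and is the same function as the built-in open() in Python 3.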

Warning: don't use encode() on bytes or decode() on Unicode objects.
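One way to follow that warning is to convert through small helpers that check the type first. This is a sketch: `to_text` and `to_bytes` are hypothetical helper names, not part of the standard library.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings; return unicode objects unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Encode unicode objects; return byte strings unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


raw = b"a\xc3\xb1o"
print(to_text(raw) == u"a\u00f1o")
print(to_text(u"a\u00f1o") == u"a\u00f1o")
print(to_bytes(u"a\u00f1o") == raw)
print(to_bytes(raw) == raw)
```

In Python 2 the name bytes is an alias for str, so the same checks work there.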

And again: software should work only with Unicode strings internally, converting to a particular encoding on output.
