I have the following function in Python, which takes a string as a parameter and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):

def filt(word):
    dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new

It is supposed to "filter" all the strings in a list that I read from a file using the following:

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.strip())

lines = [filt(l) for l in lines]

But I get this:

filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert 
  both arguments to Unicode - interpreting them as being unequal 
  new = new + dic.get(l, l)

and the filtered strings contain characters like "\xc3\xb4" instead of ASCII characters. What should I do?

Best answer

You are mixing and matching Unicode strings and regular (byte) strings.
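
To make the mismatch concrete, here is a minimal sketch of what the plain "r" read hands back on Python 2 compared with what the dictionary keys expect. The sample word is an assumption chosen for illustration (it is not taken from the question's file), and the file is assumed to be UTF-8 encoded.

```python
from __future__ import print_function

# Illustrative only: the sample word is an assumption, not from the question's file.
text = u"av\u00f4"            # the Unicode string "avô": 3 characters
raw = text.encode("utf-8")    # the bytes a UTF-8 file contains for that word

# In UTF-8, "ô" occupies two bytes (\xc3\xb4), so the byte string is one element longer.
print(len(text), len(raw))
print(repr(raw))

# Decoding the bytes gives back text that compares equal to the original.
decoded = raw.decode("utf-8")
print(decoded == text)
```

Iterating over the byte string yields \xc3 and \xb4 separately, and neither matches a Unicode key such as u'ô', which is consistent with both the warning and the unfiltered output in the question.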

Use the io module to open the text file so that it is decoded to Unicode as it is read:

with io.open("to-filter.txt","r", encoding="utf-8") as f:

This assumes that your to-filter.txt file is UTF-8 encoded.

You can also shorten reading the file into a list to just:

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines is now a list of Unicode strings.
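
A self-contained way to check this is sketched below. It writes a throwaway file standing in for to-filter.txt (the contents are hypothetical) and reads it back with io.open; every element comes back as decoded text rather than raw bytes.

```python
from __future__ import print_function

import io
import os
import tempfile

# Create a throwaway UTF-8 file standing in for to-filter.txt (hypothetical contents).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"al\u00e7ap\u00e3o\nav\u00f4\n")

# Read it back as text: io.open decodes the bytes while reading.
with io.open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

os.remove(path)
print(lines)
```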

Optional

It looks like you are trying to convert non-ASCII characters to their closest ASCII equivalents. The easy way is:

import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

What this does is:

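Decomposes each character into its component parts. For example, ã can be expressed as a single Unicode character (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
Encodes the component parts to ASCII. Non-ASCII parts (those with code points greater than U+007F) are ignored.
Decodes back to a Unicode string for convenience.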
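
The three steps can be traced one at a time, as in the sketch below. The intermediate variable names are mine, introduced only for illustration. The last lines show a caveat worth knowing: a letter with no decomposition, such as ø (U+00F8), is dropped entirely rather than mapped to "o".

```python
from __future__ import print_function

import unicodedata

word = u"\u00e3"                                     # "ã" as a single code point
decomposed = unicodedata.normalize("NFKD", word)     # step 1: "a" followed by U+0303
ascii_bytes = decomposed.encode("ascii", "ignore")   # step 2: the combining tilde is dropped
result = ascii_bytes.decode("ascii")                 # step 3: back to a text string

print([hex(ord(c)) for c in decomposed])   # ['0x61', '0x303']
print(result)                              # a

# Caveat: U+00F8 has no decomposition, so nothing ASCII survives the encode step.
stripped = unicodedata.normalize("NFKD", u"\u00f8").encode("ascii", "ignore").decode("ascii")
print(repr(stripped))
```

If characters like ø or ß matter for your data, a small explicit mapping for just those cases could be applied before normalizing.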

tl;dr

Your code is now:

import io
import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines = [filt(l) for l in lines]

Python 3.x

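Although not strictly necessary, drop the io prefix and call the built-in open() directly; in Python 3 it accepts the encoding argument itself.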
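
As a rough Python 3 sketch of the same pipeline: the temporary file and its contents are assumptions standing in for to-filter.txt, and filt is the function from the answer above.

```python
import os
import tempfile
import unicodedata


def filt(word):
    # Decompose, drop the non-ASCII parts, and return a text string.
    return unicodedata.normalize("NFKD", word).encode("ascii", errors="ignore").decode("ascii")


# Throwaway file standing in for to-filter.txt (hypothetical contents: "alçapão", "avô").
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("al\u00e7ap\u00e3o\nav\u00f4\n")
    path = tmp.name

# Python 3: the built-in open() takes encoding= directly, so io.open is not needed.
with open(path, "r", encoding="utf-8") as f:
    lines = [filt(line) for line in f.read().splitlines()]

os.remove(path)
print(lines)   # ['alcapao', 'avo']
```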

Regarding "python - Error reading UTF-8 characters with Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42302383/
