python - Converting UTF-8 to real letters

Tags: python python-3.x utf

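I need help with one of my projects. I am cleaning a large amount of data for bulk insertion into Microsoft SQL. The data is about 10 million rows. I wrote a script that extracted only the first 1000 rows for cleaning, assuming the rest would look the same. I noticed a lot of UTF-8 characters, so I converted them to the closest real characters. But after extracting the first 100000 rows, I noticed that many more UTF-8 conversions are needed. I am converting them by hand, which is exhausting. Is there an easier way to do this than typing every replacement manually? Here is my code: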

import re

infile = r"C:\Users\Dave\Desktop\database\page-links_en.txt"
outfile = r"C:\Users\Dave\Desktop\database\Complete\cleanedpagelinks_file.txt"

fin = open(infile)
fout = open(outfile, "w+")

rex = re.compile(r'/([^/>]+)>')

for line in fin:
    #for word in delete_list:
    #    line = line.replace(word, "")
    line = line.replace("%C3%A9","e")
    line = line.replace("%C3%B3","o")
    line = line.replace("%E2%80%93","-")
    line = line.replace("%C3%A6","e")
    line = line.replace("%C3%A8","e")
    line = line.replace("_"," ")
    line = line.replace("%C3%A0","e")
    line = line.replace("%C3%A1","i")
    line = line.replace("%C5%82","l")
    line = line.replace("%C5%84","n")
    line = line.replace("%C3%BF", "y")
    line = line.replace("%C3%BE", "p")
    line = line.replace("%C3%BD", "y")
    line = line.replace("%C3%BC", "u")
    line = line.replace("%C3%BB", "u")
    line = line.replace("%C3%BA", "u")
    line = line.replace("%C3%B9", "o")
    line = line.replace("%C3%B6", "o")
    line = line.replace("%C3%B5", "o")
    line = line.replace("%C3%B4", "o")
    line = line.replace("%C3%B3", "o")
    line = line.replace("%C3%B2", "o")
    line = line.replace("%C3%B1", "n")
    line = line.replace("%C3%B0", "e")
    line = line.replace("%C3%AC", "i")
    line = line.replace("%C3%AD", "i")
    line = line.replace("%C3%AE", "i")
    line = line.replace("%C3%AF", "i")
    line = line.replace("%C3%81","A")
    line = line.replace("%C3%82","A")
    line = line.replace("%C3%83","A")
    line = line.replace("%C3%84","A")
    line = line.replace("%C3%85","A")
    line = line.replace("%C3%86","AE")
    line = line.replace("%C3%87","C")
    line = line.replace("%C3%88","E")
    line = line.replace("%C3%89","E")
    line = line.replace("%C3%8A","E")
    line = line.replace("%C3%8B","E")
    line = line.replace("%C3%8C","I")
    line = line.replace("%C3%8D","I")
    line = line.replace("%C3%8E","I")
    line = line.replace("%C3%8F","I")
    line = line.replace("%C3%90","D")
    line = line.replace("%C3%91","N")
    line = line.replace("%C3%92","O")
    line = line.replace("%C3%93","O")
    line = line.replace("%C3%94","O")
    line = line.replace("%C3%95","O")
    line = line.replace("%C3%96","O")
    line = line.replace("%C3%98","O")
    line = line.replace("%C3%99","U")
    line = line.replace("%C3%9A","U")
    line = line.replace("%C3%9B","U")
    line = line.replace("%C3%9C","U")
    line = line.replace("%C3%9D","Y")
    line = line.replace("%C3%9F","B")
    line = line.replace("%C3%a0","a")
    line = line.replace("%C3%a1","a")
    line = line.replace("%C3%a2","a")
    line = line.replace("%C3%a3","a")
    line = line.replace("%C3%a4","a")
    line = line.replace("%C3%a5","a")
    line = line.replace("%C3%a6","ae")
    line = line.replace("%C3%a7","c")
    line = line.replace("%C3%a8","e")
    line = line.replace("%C3%a9","e")
    line = line.replace("%C3%aa","e")
    line = line.replace("%C3%ab","e")

    match = rex.search(line)
    if match:
        newline = match.group(1)
    else:
        newline = ''
    fout.write(newline + '\n')
fin.close()
fout.close()

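As you can see in my code, I am manually replacing each sequence with a real character value. Here is a sample line from the text file that I realized still needs converting: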

B%E1%BA%A3o %C4%90%E1%BA%A1i

Best answer

You can use unidecode together with urllib.parse.unquote:

In [8]: from unidecode import  unidecode

In [9]: from urllib.parse import unquote

In [10]: unidecode(unquote("Gotterd%C3%A4mmerung"))
Out[10]: 'Gotterdammerung'

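unquote first decodes the percent-escapes (such as %C3%A4) into the characters they encode; unidecode then converts the non-ASCII characters to their closest ASCII equivalents.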
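To show how this could replace the long chain of replace() calls in the question, here is a minimal sketch of the per-line cleanup. It is an illustration under assumptions, not the answerer's code: the names clean_line, clean_file and strip_accents are hypothetical, and strip_accents is a standard-library stand-in (unicodedata NFKD normalization) rather than unidecode, so the block runs without third-party packages. The regular expression is the one from the question; it is applied before decoding so that a decoded "/" or ">" cannot change what gets matched.

```python
import re
import unicodedata
from urllib.parse import unquote

# Same pattern as in the question: the last path segment before '>'.
REX = re.compile(r'/([^/>]+)>')


def strip_accents(text):
    """Stdlib stand-in for unidecode (an assumption of this sketch).

    Decomposes characters with NFKD and drops the combining marks.
    Letters with no decomposition (for example 'Đ' or 'ł') are left as-is.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_line(line, to_ascii=strip_accents):
    """Extract the page name, percent-decode it, and transliterate it."""
    match = REX.search(line)
    if not match:
        return ""
    name = unquote(match.group(1))        # '%C3%A4' -> 'ä'
    return to_ascii(name).replace("_", " ")


def clean_file(infile, outfile, to_ascii=strip_accents):
    """Stream the input line by line so 10 million rows never sit in memory."""
    with open(infile, encoding="utf-8") as fin, \
         open(outfile, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(clean_line(line, to_ascii) + "\n")
```

If unidecode is installed, passing to_ascii=unidecode (after from unidecode import unidecode) swaps in the transliteration from the answer, which also maps letters that have no Unicode decomposition, such as Đ or ł, to plain ASCII.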

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/35817952/
