python - Is there an existing mapping from utf8 to latin-1?

Tags: python unicode encoding utf-8 latin1

Is there an existing mapping from utf8 to latin-1, and to normalized unaccented letters in utf8?

I get errors like this:

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u010d' in position 4: ordinal not in range(256)

I am currently working around each error by hand with the code below. Is there a better way?

def prehunpos(sentence):
    sentence = sentence.replace(u'\u2018',"'") # left single quote mark
    sentence = sentence.replace(u'\u2019',"'") # right single quote mark
    sentence = sentence.replace(u'\u201C','"') # left double quote mark
    sentence = sentence.replace(u'\u201D','"') # right double quote mark
    sentence = sentence.replace(u'\u2010',"-") # hyphen
    sentence = sentence.replace(u'\u2011',"-") # non-break hyphen
    sentence = sentence.replace(u'\u2012',"-") # figure dash
    sentence = sentence.replace(u'\u2013',"-") # en dash
    sentence = sentence.replace(u'\u2014',"-") # em dash
    sentence = sentence.replace(u'\u2015',"-") # horizontal bar
    sentence = sentence.replace(u'\u2017',"_") # double low line
    sentence = sentence.replace(u'\u2016',"|") # double vertical line
    sentence = sentence.replace(u'\u2024',".") # one dot leader
    sentence = sentence.replace(u'\u2025',"..") # two dot leader
    sentence = sentence.replace(u'\u2026',"...") # horizontal ellipsis
    sentence = sentence.replace(u'\u039d\u0391\u03a4\u039f',u'NATO') # NATO spelled with Greek letters

    sentence = sentence.replace(u'\u0391',"A") # Greek Capital Alpha
    sentence = sentence.replace(u'\u0392',"B") # Greek Capital Beta
    #sentence = sentence.replace(u'\u0393',"") # Greek Capital Gamma
    #sentence = sentence.replace(u'\u0394',"") # Greek Capital Delta
    sentence = sentence.replace(u'\u0395',"E") # Greek Capital Epsilon
    sentence = sentence.replace(u'\u0396',"Z") # Greek Capital Zeta
    sentence = sentence.replace(u'\u0397',"H") # Greek Capital Eta
    #sentence = sentence.replace(u'\u0398',"") # Greek Capital Theta
    sentence = sentence.replace(u'\u0399',"I") # Greek Capital Iota
    sentence = sentence.replace(u'\u039a',"K") # Greek Capital Kappa
    #sentence = sentence.replace(u'\u039b',"") # Greek Capital Lambda
    sentence = sentence.replace(u'\u039c',"M") # Greek Capital Mu
    sentence = sentence.replace(u'\u039d',"N") # Greek Capital Nu
    #sentence = sentence.replace(u'\u039e',"") # Greek Capital Xi
    sentence = sentence.replace(u'\u039f',"O") # Greek Capital Omicron
    sentence = sentence.replace(u'\u03a1',"P") # Greek Capital Rho
    #sentence = sentence.replace(u'\u03a3',"") # Greek Capital Sigma
    sentence = sentence.replace(u'\u03a4',"T") # Greek Capital Tau
    sentence = sentence.replace(u'\u03a5',"Y") # Greek Capital Upsilon
    #sentence = sentence.replace(u'\u03a6',"") # Greek Capital Phi
    sentence = sentence.replace(u'\u03a7',"X") # Greek Capital Chi
    #sentence = sentence.replace(u'\u03a8',"") # Greek Capital Psi
    #sentence = sentence.replace(u'\u03a9',"") # Greek Capital Omega

    sentence = sentence.replace(u'\u03b1',"a") # Greek small alpha
    sentence = sentence.replace(u'\u03b2',"b") # Greek small beta
    #sentence = sentence.replace(u'\u03b3',"") # Greek small gamma
    #sentence = sentence.replace(u'\u03b4',"") # Greek small delta
    sentence = sentence.replace(u'\u03b5',"e") # Greek small epsilon
    #sentence = sentence.replace(u'\u03b6',"") # Greek small zeta
    #sentence = sentence.replace(u'\u03b7',"") # Greek small eta
    #sentence = sentence.replace(u'\u03b8',"") # Greek small theta
    sentence = sentence.replace(u'\u03b9',"i") # Greek small iota
    sentence = sentence.replace(u'\u03ba',"k") # Greek small kappa
    #sentence = sentence.replace(u'\u03bb',"") # Greek small lamda
    sentence = sentence.replace(u'\u03bc',"u") # Greek small mu
    sentence = sentence.replace(u'\u03bd',"v") # Greek small nu
    #sentence = sentence.replace(u'\u03be',"") # Greek small xi
    sentence = sentence.replace(u'\u03bf',"o") # Greek small omicron
    #sentence = sentence.replace(u'\u03c0',"") # Greek small pi
    sentence = sentence.replace(u'\u03c1',"p") # Greek small rho
    sentence = sentence.replace(u'\u03c2',"c") # Greek small final sigma
    #sentence = sentence.replace(u'\u03c3',"") # Greek small sigma
    sentence = sentence.replace(u'\u03c4',"t") # Greek small tau
    sentence = sentence.replace(u'\u03c5',"u") # Greek small upsilon
    #sentence = sentence.replace(u'\u03c6',"") # Greek small phi
    sentence = sentence.replace(u'\u03c7',"x") # Greek small chi
    sentence = sentence.replace(u'\u03c8',"x") # Greek small psi
    sentence = sentence.replace(u'\u03c9',"w") # Greek small omega


    sentence = sentence.replace(u'\u0103',"a") # Latin a with breve
    sentence = sentence.replace(u'\u0107',"c") # Latin c with acute
    sentence = sentence.replace(u'\u010d',"c") # Latin c with caron
    sentence = sentence.replace(u'\u0161',"s") # Latin s with caron

    return sentence.strip()

Best answer

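If you need a general way to convert non-Latin scripts to Latin script, ICU transforms are the best option. ICU has a Python wrapper, PyICU ( http://pypi.python.org/pypi/PyICU ). However, if you are only targeting a single script (it looks like you are particularly interested in Greek?), a mapping table is the quickest solution. You can write it more concisely, though: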

#!/usr/bin/python
# -*- coding: utf-8 -*-

greek_to_latin = {u"Α": u"A", u"Β": u"B", u"Γ": u"G"}  # ...
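greek_string = u"ΑΒΓ"  # sample input
# Characters missing from the table are passed through unchanged.
latin_string = u"".join(greek_to_latin.get(c, c) for c in greek_string)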
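
A variation on the mapping-table idea, offered as a sketch only: the dictionary can be keyed by code point and handed to `translate`, which replaces mapped characters in one pass and leaves everything else alone. The names `GREEK_TO_LATIN` and `greek_to_latin_text` are illustrative and not from the original answer, and the table is deliberately incomplete.

```python
# -*- coding: utf-8 -*-
# Illustrative, incomplete table (hypothetical names); extend as needed.
GREEK_TO_LATIN = {
    u"\u0391": u"A",  # Alpha
    u"\u0392": u"B",  # Beta
    u"\u0393": u"G",  # Gamma
    u"\u039d": u"N",  # Nu
    u"\u039f": u"O",  # Omicron
    u"\u03a4": u"T",  # Tau
}

# translate() expects a mapping keyed by code point (integer ordinal).
_TABLE = {ord(src): dst for src, dst in GREEK_TO_LATIN.items()}


def greek_to_latin_text(text):
    """Replace mapped Greek letters; leave every other character as-is."""
    return text.translate(_TABLE)


print(greek_to_latin_text(u"\u039d\u0391\u03a4\u039f summit"))  # NATO summit
```

Compared with a chain of `replace` calls, this scans the string once, and adding a letter means adding one dictionary entry.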

You can also look at the unicodedata module, which reports each character's category and can therefore be used to identify non-ASCII punctuation.
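
For the accented Latin letters in the question (such as `\u010d`), one common use of `unicodedata` is to decompose each character with NFKD normalization and drop the combining marks. The following is a sketch, not the answer's own code: `strip_accents`, `to_latin1`, and `non_ascii_punctuation` are hypothetical helper names, and replacing whatever is still unencodable with `?` is an assumed fallback policy.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_latin1(text):
    """Encode to Latin-1, keeping characters it supports directly.

    Characters outside Latin-1 are stripped of accents first; anything
    still unencodable becomes '?' (an assumed fallback policy).
    """
    out = []
    for ch in text:
        try:
            ch.encode("latin-1")
            out.append(ch)
        except UnicodeEncodeError:
            out.append(strip_accents(ch))
    return u"".join(out).encode("latin-1", "replace")


def non_ascii_punctuation(text):
    """Return the non-ASCII characters whose category starts with 'P'."""
    return [
        ch for ch in text
        if ord(ch) > 127 and unicodedata.category(ch).startswith("P")
    ]
```

Greek letters have no decomposition into Latin letters, so they pass through `strip_accents` unchanged and still need a mapping table or an ICU transform.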

Source: this question on Stack Overflow: https://stackoverflow.com/questions/14681809/
