python - Get all Unicode variations of a Latin character

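For example, for the character "a", I want to get a string (a list of characters) like "aàáâãäåāăą" (I'm not sure whether that example list is complete...), basically all Unicode characters whose name is "Latin Small Letter A with *".

Is there a generic way to get this?

I'm asking for Python, but a more generic answer is fine too, although I would appreciate a Python code snippet in any case. Python >=3.5 is fine. I guess you need access to the Unicode database, e.g. the Python module unicodedata, which I would prefer over other external data sources.

I could imagine a solution like this: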

def get_variations(char):
   import unicodedata
   name = unicodedata.name(char)
   chars = char
   for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
      try: 
          chars += unicodedata.lookup("%s %s" % (name, variation))
      except KeyError:
          pass
   return chars

Best answer

First, get the collection of Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))

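Now define a function that tries to compose each of them with a base ASCII character; when the composed normal form has length 1 (meaning the ASCII character plus the combining character became a single Unicode ordinal), save it: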

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)
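
To make the length check concrete, here is a minimal sketch of what happens for a single base letter and two combining marks. The two marks are picked only for illustration: U+0300 has a precomposed form with "a", while U+033D is assumed not to have one.

```python
import unicodedata

base = "a"
grave = "\u0300"    # combining grave accent
x_above = "\u033d"  # combining mark assumed to have no precomposed form with "a"

# "a" + grave composes into the single code point U+00E0
composed = unicodedata.normalize("NFKC", base + grave)
print(len(composed), unicodedata.name(composed))

# No precomposed character exists here, so the pair stays two code points
not_composed = unicodedata.normalize("NFKC", base + x_above)
print(len(not_composed))
```

Only the first pair passes the len(normalized) == 1 test, so only that result would be kept.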

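The get_unicode_variations approach has the advantage of not trying to manually perform string lookups in the unicodedata database, and of not needing to hardcode every possible description of the combining characters. Anything that composes to a single character gets included. The runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate it with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing the result every time).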
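
A cached variant might look like the sketch below. The names COMBINING_CHARS and get_unicode_variations_cached are made up for this example. The sketch also skips repeated results, which is an addition not in the original function: some combining marks (U+0340 and U+0341, as far as I can tell) normalize to the same character as another mark in the range.

```python
import functools
import unicodedata

# Same range as above: combining diacritical marks, 768 to 879 inclusive
COMBINING_CHARS = ''.join(map(chr, range(768, 880)))

@functools.lru_cache(maxsize=None)
def get_unicode_variations_cached(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    seen = set()  # assumption: repeats are unwanted, so keep first occurrences only
    for combiner in COMBINING_CHARS:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1 and normalized not in seen:
            seen.add(normalized)
            variations.append(normalized)
    return ''.join(variations)

first = get_unicode_variations_cached('a')   # computed
second = get_unicode_variations_cached('a')  # served from the cache
print(first)
print(get_unicode_variations_cached.cache_info())
```

maxsize=None keeps every result for the life of the process, which seems acceptable here because the set of possible single-character arguments is bounded.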

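If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it will take longer (functools.lru_cache would be nearly mandatory unless it's only ever called once per argument):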

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter): 
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = [] 
    for testlet in map(chr, range(sys.maxunicode + 1)):  # + 1 so the last code point is checked too
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter: 
            variations.append(testlet) 
    return ''.join(variations)

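This looks for any character that decomposes into a form that includes the target letter. It does mean the first search takes roughly a third of a second, and the result includes things that aren't really just modified versions of the character (e.g. the result for 'L' will include ℡, which isn't really a "modified 'L'"), but it's as exhaustive as you can get.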
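
If the extra matches are a problem, one alternative is to lean on character names instead, as the question itself suggests. The following is only a sketch, and get_named_variations is a hypothetical helper, not part of the answer above. It keeps every code point whose name starts with the base letter's name followed by " WITH ", so it may also pick up characters that do not decompose to the base letter at all.

```python
import sys
import unicodedata

def get_named_variations(letter):
    # e.g. "LATIN SMALL LETTER A WITH " for letter == "a"
    prefix = unicodedata.name(letter) + " WITH "
    return ''.join(
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        # name() with a default avoids ValueError for unnamed code points
        if unicodedata.name(ch, '').startswith(prefix)
    )

result = get_named_variations("a")
print(result)
```

Like the exhaustive search, this scans the whole code point range, so caching the result is worth considering if it is called more than once.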

Regarding python - Get all Unicode variations of a Latin character, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57169516/
