python - Regex to get a list of all words with specific letters (Unicode graphemes)

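I'm writing a Python script for a FOSS language-learning initiative. Suppose I have an XML file (or, to keep it simple, a Python list) containing a list of words in a particular language (in my case the words are in Tamil, which uses a Brahmi-based Indic script).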

I need to extract the subset of those words that can be spelled using only a given set of letters.

An English example:

words = ["cat", "dog", "tack", "coat"] 

get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]

A Tamil example:

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]

get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]

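Neither the order in which the words are returned nor the order in which the letters are given should make any difference.

Although I understand the difference between Unicode code points and graphemes, I'm not sure how they are handled in regular expressions.

In this case, I want to match only those words made up of the specific graphemes in the input list and nothing else (that is, a mark that follows a letter should only ever follow that letter, but the graphemes themselves may appear in any order).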
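To make the code point versus grapheme distinction concrete, here is a minimal sketch (the variable names are illustrative and not part of the question's code). It shows that the Tamil letter ம் from the example above is stored as two code points, so `len()` and a naive regex see two items where a reader sees one letter.

```python
import unicodedata

# The grapheme ம் : the consonant MA followed by the virama sign.
grapheme = u"\u0bae\u0bcd"

# len() counts code points, not user-perceived characters.
print(len(grapheme))

# List each code point with its official Unicode name.
code_points = [(hex(ord(ch)), unicodedata.name(ch)) for ch in grapheme]
for cp, name in code_points:
    print(cp, name)
```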

Best answer

To support characters that can span multiple Unicode code points:

# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial

NFKD = partial(unicodedata.normalize, 'NFKD')

def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters) # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]

print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்

It assumes that the same character can be used zero or more times in a word.
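As a rough illustration of how that pattern is assembled and why repeated letters are accepted, the sketch below builds the same kind of regex for plain English letters. The helper name `build_pattern` is hypothetical and only used here for illustration; normalization is left out because the sample letters are ASCII.

```python
import re

def build_pattern(letters):
    # Escape each letter so regex metacharacters are treated literally,
    # then join the alternatives with "|".
    alternatives = "|".join(map(re.escape, letters))
    # (?:...) groups without capturing, + requires one or more repetitions,
    # and $ anchors the end so the whole word must be consumed.
    return r"(?:%s)+$" % alternatives

pattern = build_pattern(["o", "c", "a", "t"])
print(pattern)  # (?:o|c|a|t)+$

# "tact" matches because a letter may repeat; "tack" fails because of "k".
results = {w: bool(re.match(pattern, w)) for w in ["cat", "coat", "tact", "tack"]}
print(results)
```

Because `re.match` already anchors at the start of the string, only the trailing `$` is needed to force a match over the whole word.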

If you want only words that contain exactly the given characters:

import regex # $ pip install regex

chars = regex.compile(r"\X").findall # get all characters

def match(word, letters):
    return sorted(chars(word)) == sorted(letters)

words = ["cat", "dog", "tack", "coat"]
get_words = lambda letters: [w for w in words if match(w, letters)]

print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack

Note: in this case cat is not in the output, because cat does not use all of the given characters.
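A possible middle ground between the two variants, sketched here as an assumption rather than as part of the answer: accept a word when each given letter is used at most once, without requiring that all of them be used. The helpers `split_clusters` and `spellable` are hypothetical names. `split_clusters` is only an approximation of `\X` that uses the standard library: it attaches every mark (Unicode category starting with `M`) to the preceding base character, which is enough for the sample words but is not full grapheme cluster segmentation. It also assumes that words and letters are in the same normalization form.

```python
import unicodedata
from collections import Counter

def split_clusters(text):
    """Approximate grapheme split: attach each mark to the preceding base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def spellable(word, letters):
    # True when every cluster of the word is available,
    # with each given letter used at most once.
    needed = Counter(split_clusters(word))
    available = Counter(letters)
    return all(available[c] >= n for c, n in needed.items())

tamil_words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
tamil_letters = [u"ம", u"ப", u"ட", u"ம்"]
tamil_result = [w for w in tamil_words if spellable(w, tamil_letters)]

english_words = ["cat", "dog", "tack", "coat"]
english_result = [w for w in english_words if spellable(w, ["o", "c", "a", "t"])]
print(english_result)
```

With these inputs, `tamil_result` holds the second and third Tamil words, and `english_result` holds cat and coat, which is what the question's examples ask for.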

What does it mean to normalize? And could you please explain the syntax of the re.match() regex?

>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç

Without normalization, c and cc do not match. The characters are taken from the unicodedata.normalize() docs.
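To expand on what normalizing means, here is a small sketch using the same two spellings of Ç. It compares the canonical forms (NFD and NFC) and shows one thing the compatibility form NFKD does in addition; the ligature example is an added illustration, not something from the answer.

```python
import unicodedata

composed = u"\u00c7"          # Ç as a single precomposed code point
decomposed = u"\u0043\u0327"  # C followed by COMBINING CEDILLA

# Different code point sequences, so a plain comparison fails.
print(composed == decomposed)  # False

# NFD splits the precomposed character; NFC recombines the pair.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFKD also replaces compatibility characters, e.g. the "fi" ligature,
# which the canonical form NFD leaves unchanged.
ligature = u"\ufb01"
print(unicodedata.normalize("NFD", ligature) == ligature)  # True
print(unicodedata.normalize("NFKD", ligature) == u"fi")    # True
```

Any of the four forms makes the two spellings comparable, as long as the word and the letters are normalized the same way before matching.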

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/14544071/
