python - 在 Python 文本语料库中优化正则表达式搜索

这是我在 python 中的要求:

我从字面上加载字典 - 比如说从/usr/share/dict/words (不要与字典类型混淆)并用它来搜索有效的单词。目前我正在做如下:

dict_list  = open('dictionary', 'r').read().split()

def search_dictionary(key):
    p=re.compile(key)
    # Comment: Is 'key' a prefix for a valid word in dictionary?
    # If yes, return True, else return False
    tmp_list = [x for x in dict_list if bool(p.match(x))]
    if not tmp_list:
        ....
    else:
        ....

注意 search_dictionary 可以被调用很多次，这是目前的瓶颈。有没有更有效的方法来执行此字符串搜索？就像说预编译字典。人们可以想到一个字典攻击用例。我比较入门。

编辑:我已经用注释更新了代码。正如评论中所建议的，我可能做的工作比需要的多。

最佳答案

您的算法运行时间复杂度为 O(n)，且常量很大。这似乎是错误的，当一个简单的二进制搜索可以做 O(lg n) 时。如果您的正则表达式不包含特殊字符，为什么不:

import bisect
with open('dictionary') as f:
    dictionary = f.read().split()

    # .sort is slow, so better to sort
    # words on disk! And/or run many searches
    # for one invocation
    dictionary.sort()

def bisect_search(key):
    i = bisect.bisect_left(dictionary, key)
    if i != len(dictionary):
        return dictionary[i].startswith(key)

    return False

对数组进行排序，然后找到 word >= key 的字典序“最小”单词，并查看它是否是给定键的前缀。

针对 Padraic 的字典中的第一个字母进行速度测试，然后进行线性搜索:

In [1]: %timeit bisect_search('thu')
1000000 loops, best of 3: 1.07 µs per loop

In [2]: %timeit search_dictionary('thu')
1000 loops, best of 3: 595 µs per loop

146880个单词，5760个以t开头的单词。

关于python - 在 Python 文本语料库中优化正则表达式搜索，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/28532487/

python - 在 Python 文本语料库中优化正则表达式搜索

上一篇：python - 在 Python 中的文本文件中间插入一行

下一篇：python - 检测悬停在 PyQt4 中的特定小部件上