Python-查找可以在单词内部找到的所有子词

最终，我想找出英语词典中哪个单词包含最多三个字母的子词。我写了这个算法，但它太慢了，没有用。想知道如何优化它

def subWords(word):
    return set((word[0:i] for i in range(2, len(word)+1))) #returns all subWords of length 2 or greater

def checkDict(wordList, dictList):
    return set((word for word in wordList if word in dictList))

def main():
    dictList = [i.strip() for i in open('wordlist.txt').readlines()]
    allwords = list()
    maximum = (0, list())

    for dictWords in dictList:
        for i in range (len(dictWords)):
            for a in checkDict(subWords(dictWords[i: len(dictWords) + 1]), dictList):
                allwords.append(a)

        if len(allwords) > maximum[0]:
            maximum = (len(allwords), allwords)

        print maximum
        allwords = list()

    print maximum 
main()

最佳答案

您的算法的主要弱点是，对于每个子词，您需要将其与字典中的每个其他词进行比较。你不需要这样做，真的——如果你的单词以“a”开头，你真的不需要查看它是否与以“b”开头的单词匹配。如果下一个字母是“c”，那么你并不真的在意将它与以“d”开头的单词进行比较。那么问题就变成了:“我如何有效地实现这个想法？”

为此，我们可以创建一棵树来表示字典中的所有单词。我们通过获取字典中的每个单词并用它扩展树并在最后一个节点中添加阴影来构建这棵树。

Example Tree

当我们想要测试一个子词是否在这棵树中时，我们只需逐个字母地检查该词并使用这些字母来确定树中的下一个位置(从顶部开始)。如果我们发现我们无处可去，或者在遍历整个子词后我们落在了一个没有阴影的树节点上，那么它就不是一个词。否则，如果我们落在阴影节点上，它就是一个词。这样做的效果是我们可以一次搜索整个词典，而不是一次搜索一个词。当然，这样做的代价是一开始需要进行一些设置，但如果字典中有很多单词，那么付出的代价并不是很大。

嗯，这一切都太棒了!让我们尝试实现它:

class Node:
    def __init__( self, parent, valid_subword ):
        self.parent = parent
        self.valid_subword = valid_subword
        self.children = {}

    #Extend the tree with a new node
    def extend( self, transition, makes_valid_word ):
        next_node = None
        if transition in self.children:
            if makes_valid_word:
                self.children[transition].makes_valid_word = True
        else:
            self.children[transition] = Node( self, makes_valid_word )
        return self.children[transition]

def generateTree( allwords ):
  tree = Node( None, False )
    for word in allwords:
      makes_valid_word = False
      current_node = tree
      for i in range(len(word)):
        current_node = current_node.extend( word[i], True if i == len(word) - 1 else False )
  return tree

def checkDict( word, tree ):
    current_node = tree
    for letter in word:
        try:
            current_node = current_node.children[letter]
        except KeyError:
            return False

    return current_node.valid_subword

然后，稍后:

for word in allWords:
  for subword in subWords(word):
    checkDict(subword)
    #Code to keep track of the number of words found, like you already have

此算法允许您在 O(m) 时间内检查一个单词是否在您的字典中，其中 m 是字典中最长单词的长度。请注意，对于包含任意数量单词的字典，这大致保持不变。您的原始算法是每次检查 O(n)，其中 n 是字典中的单词数。

关于Python-查找可以在单词内部找到的所有子词，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/6933741/

Python-查找可以在单词内部找到的所有子词

上一篇：python - 使用全局标志进行 python RegExp 编译

下一篇：python - 如何在 Python 中比较数组中的值 - 找出 2 个值是否相同