Python: need a fast algorithm that removes all words in a file that are derivatives of other words

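We have a file named wordlist that contains 1,876 KB of alphabetized words. Every word is longer than 4 letters. Words are separated by line breaks, and there is one extra blank line between each new two-letter group (ab, ac, ad, etc.):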

 wfile = open("wordlist.txt", "r+")

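I want to create a new file that contains only the words that are not derivatives of other, smaller words. For example, the word list contains the words ["abuser", "abused", "abusers", "abuse", "abuses", etc.]. The new file should keep only the word "abuse", because it is the "lowest common denominator" (if you will) among all of those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".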

I tried this implementation:

def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            print base
            base=word
    result.append(base)
    return result;


def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")

    for line in wfile:
        wordlist.append(line[:-1])

    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")    
    newfile.write(wordlist)

But it always freezes my computer. Any solutions?

Best answer

I would do something like this:

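def bases(words):
    # `words` must be an iterator over a sorted, non-empty word sequence.
    base = next(words)
    yield base
    for word in words:
        # Skip blank lines and anything that extends the current base word.
        if word and not word.startswith(base):
            yield word
            base = word


def get_bases(infile, outfile):
    with open(infile) as f_in:
        # Generator expression: lines are stripped lazily, one at a time.
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))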

This runs through the corncob list of 58,000 words in about a fifth of a second on my rather old laptop. It is old enough to have only 1 GB of memory.

$ time python words.py

real        0m0.233s
user        0m0.180s
sys         0m0.012s
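As a quick sanity check, the same prefix rule can be run on the question's own examples. This is a minimal sketch, not part of the original answer: the `sample` list is made up for illustration, and the `iter()` call is an added assumption so that the function accepts a plain list as well as a generator.

```python
def bases(words):
    """Yield each word that does not start with the most recently kept word."""
    words = iter(words)  # assumption: allow lists, not only iterators
    base = next(words)
    yield base
    for word in words:
        # Blank entries and extensions of the current base are skipped.
        if word and not word.startswith(base):
            yield word
            base = word


# Hypothetical, already-sorted sample built from the words in the question.
sample = ["abuse", "abused", "abuser", "abusers", "abuses",
          "rode", "rodent", "rodeo"]
result = list(bases(sample))
print(result)  # ['abuse', 'rode']
```

Because the input is sorted, every word that extends "abuse" sits directly after it, so a single pass comparing against the last kept word is enough.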

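The code uses iterators wherever it can to go easy on memory. You could probably improve performance by slicing off the line ending instead of using strip to remove the newline.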
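The slicing idea might look like the sketch below. The helper names `strip_newline` and `read_words` are hypothetical, and `io.StringIO` stands in for a real file. Whether this is measurably faster than `strip()` is worth timing on your own data. Note also that slicing removes only a trailing `\n`, so a `\r` from Windows line endings would be left in place, whereas `strip()` removes all surrounding whitespace.

```python
import io


def strip_newline(line):
    # Slice off one trailing newline; the last line of a file may lack it.
    return line[:-1] if line.endswith("\n") else line


def read_words(f_in):
    # Lazily yield one cleaned word per line of the open file object.
    return (strip_newline(line) for line in f_in)


# In-memory stand-in for wordlist.txt, with no newline after the last word.
fake_file = io.StringIO("abuse\nabused\nrode\nrodeo")
words = list(read_words(fake_file))
print(words)  # ['abuse', 'abused', 'rode', 'rodeo']
```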

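Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions, though, so I don't feel too bad about it ;)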
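If those preconditions cannot be guaranteed, a more defensive variant is possible. The sketch below is an assumption-laden alternative rather than the answer's method: the name `bases_any_order` is hypothetical, it sorts the input itself, and it tolerates empty input. The trade-off is that `sorted` loads every word into memory, which gives up the streaming behaviour of the original.

```python
def bases_any_order(words):
    """Yield prefix-free base words from input in any order, possibly empty."""
    base = None
    # Drop blank entries, then sort so each word's extensions follow it directly.
    for word in sorted(w for w in words if w):
        if base is None or not word.startswith(base):
            yield word
            base = word


print(list(bases_any_order(["rodeo", "abuses", "rode", "", "abuse"])))
# ['abuse', 'rode']
print(list(bases_any_order([])))
# []
```

This assumes consistent casing, since the default string sort and `startswith` are both case-sensitive.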

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/4791567/
