python - Remove duplicate words from text file input?

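I'm playing with a function that takes 3 arguments: the name of a text file, substring1, and substring2. It searches the text file and returns the words that contain both substrings: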

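def myfunction(filename, substring1, substring2):
    result = ""
    # Read the whole file and split it into whitespace-separated words
    with open(filename) as f:
        text = f.read().split()
    for word in text:
        # Keep only the words that contain both substrings
        if substring1 in word and substring2 in word:
            result += word + " "
    return result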

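This function works, but I'd like to remove duplicate results. For example, with my particular text file, if substring1 is "at" and substring2 is "wh", it returns "what". However, because my text file contains "what" 3 times, it returns all three of them. I'm looking for a way to return only unique words, with no duplicates. I'd also like to preserve the order, so does that rule out a set?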

I think doing something to `text` might work, somehow removing the duplicates before the loop.

Best answer

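Here is a solution that uses very little memory (it iterates over the lines of the file instead of reading it all at once) and has good time complexity (which matters when the list of returned words is large, as when substring1 is "a" and substring2 is "e" in English text):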

import collections

def find_words(file_path, substring1, substring2):
    """Return a string with the words from the given file that contain both substrings."""
    # The keys record each matching word once, in first-seen order
    matching_words = collections.OrderedDict()
    with open(file_path) as text_file:
        # Iterate line by line so the whole file is never held in memory
        for line in text_file:
            for word in line.split():
                if substring1 in word and substring2 in word:
                    matching_words[word] = True
    return " ".join(matching_words)

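An OrderedDict preserves the order in which keys are first inserted, so the words stay in the order in which they were found. Because it is a mapping, each word can appear only once as a key, so there are no duplicates. The good time complexity comes from the fact that inserting a key into an OrderedDict takes constant time on average, as opposed to the linear time of the repeated `if word in result_list` checks used in other solutions.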
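As a side note, the same idea can be sketched with a plain dict, assuming Python 3.7 or later, where the language guarantees that dicts keep insertion order. This is only an illustrative variant, not the answer's code: the function name `find_unique_words` and the sample text are made up for the example.

```python
import os
import tempfile


def find_unique_words(file_path, substring1, substring2):
    """Return the unique matching words, in first-seen order, as one string."""
    with open(file_path) as text_file:
        # Lazily yield every word that contains both substrings
        matches = (
            word
            for line in text_file
            for word in line.split()
            if substring1 in word and substring2 in word
        )
        # dict.fromkeys keeps one key per distinct word, in insertion order
        # (assumption: Python 3.7+, where dict ordering is guaranteed)
        return " ".join(dict.fromkeys(matches))


# Hypothetical sample file, created only to try the function out
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as sample:
    sample.write("what is whatever\nwhat a whale\nso what\n")

print(find_unique_words(sample.name, "at", "wh"))  # prints: what whatever
os.remove(sample.name)
```

The generator is consumed inside the `with` block, so the file is still read line by line, and each duplicate check is a constant-time dict lookup on average.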

A similar question about "python - Remove duplicate words from text file input?" can be found on Stack Overflow: https://stackoverflow.com/questions/23539537/
