python - Finding 4k words in a list of 100k words

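I have a list of words in a txt file, and a few thousand words in a Python list. How can I search the file for those words and add each one to a new list if it is found?

The words.txt file contains 100k words. my_list contains 4k words.

This is what I am doing at the moment.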

    my_list = ["hello", "hi", "hey", "ho", "wow"]  # ... plus the rest of the 4k words

    with open("words.txt") as f:
        lines = [line.rstrip() for line in f]
    
    words_in_lines = []

    for i in my_list: 
        if i in lines:
            words_in_lines.append(i)

This never finishes. It does not run to completion because there are too many words in the word list.

Best answer

Convert my_list from a list to a set to make lookups faster.
Instead of looking up each word of my_list in lines, look up each word of a line in my_list.

    my_list = {"hello", "hi", "hey", "ho", "wow"}  # ... plus the rest of the 4k words
    words_in_lines = []

    with open("words.txt") as f:
        for line in f:
            # A line may hold one word or several; split() handles both.
            for word in line.split():
                if word in my_list:  # set lookup, O(1) on average
                    words_in_lines.append(word)

The time complexity should be O(number of words in the file), since each set lookup takes O(1) time on average.
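To see the difference between list and set membership tests, here is a rough timing sketch. The data is synthetic and scaled down (20k random words standing in for words.txt, 1k for my_list) so that the slow version still finishes quickly; random_word, with_list and with_set are illustrative names, not part of the question or the answer.

```python
import random
import string
import timeit

random.seed(0)  # reproducible synthetic data


def random_word(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))


# Synthetic, scaled-down stand-ins for the contents of words.txt and my_list.
lines = [random_word() for _ in range(20_000)]
my_list = random.sample(lines, 500) + [random_word() for _ in range(500)]


def with_list():
    # Every `in lines` scans the list, so the work grows with
    # len(my_list) * len(lines).
    return [w for w in my_list if w in lines]


def with_set():
    # Building the set is a single pass; each lookup is then O(1) on average.
    line_set = set(lines)
    return [w for w in my_list if w in line_set]


assert with_list() == with_set()  # both approaches find the same words
print("list:", timeit.timeit(with_list, number=1))
print("set: ", timeit.timeit(with_set, number=1))
```

The absolute numbers depend on the machine, but the gap between the two should widen as either collection grows.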

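Edit: as @greybeard pointed out, this approach differs from the original code in a few ways:

The words in words_in_lines end up in a different order (file order rather than my_list order).
my_list is changed from a list to a set.
The 100k-word list is never created.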
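If the my_list order of the question's code matters, one possible variant is to put the file's words in the set instead and keep iterating over my_list. This is a sketch of a different arrangement, not the answer's code: words_found_in_file is a hypothetical helper name, and io.StringIO stands in for open("words.txt") so the example runs on its own.

```python
import io


def words_found_in_file(my_list, f):
    """Return the words of my_list that occur in f, in my_list order."""
    # Collect every word in the file once; this set replaces the 100k-item list.
    file_words = set()
    for line in f:
        file_words.update(line.split())
    # Walk my_list in its original order; my_list itself stays a list.
    return [word for word in my_list if word in file_words]


# io.StringIO stands in for open("words.txt") in this sketch.
sample_file = io.StringIO("wow\nhello\nbye\nhello\n")
print(words_found_in_file(["hello", "hi", "hey", "ho", "wow"], sample_file))
# prints ['hello', 'wow']
```

Unlike the loop over the file, this reports each word of my_list at most once per occurrence in my_list, no matter how often it appears in the file. The trade-off is that the set now holds all of the file's words rather than the 4k words of my_list.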

This question, "python - Finding 4k words in a list of 100k words", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/71375382/
