python - Fastest way to compare text file contents

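I have a question that could help simplify my program. I have a file, text.txt, and I want to go through it and compare it against a list of words, words, adding 1 to an integer every time one of the words is found.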

words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
ints = []
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            for word in words:
                if word in part:
                    ints.append(1)

I was just wondering whether there is a faster way to do this? The text file may get much bigger, and so may the word list.

Best answer

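You can convert words to a set so that lookups are faster. This should give your program a good performance boost: looking up a value in a list means walking the list one element at a time (O(n) time), whereas a set uses hashing to find elements, so a lookup takes O(1) (constant) time on average.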

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

Then you can use the sum function to count the matches:

with open('text.txt') as file:
    print(sum(part in words for line in file for part in line.split()))
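To get a rough feel for the list-versus-set difference described above, here is a minimal timing sketch using timeit. The 10,000-entry word list and the target word are made up for illustration, and the absolute numbers will vary from machine to machine.

```python
import timeit

# Hypothetical word list, made large so the difference is visible.
word_list = [f"word{i}" for i in range(10_000)]
word_set = set(word_list)

# Worst case for the list: the target is the last element.
target = "word9999"

# Time 1,000 membership tests against each container.
list_time = timeit.timeit(lambda: target in word_list, number=1_000)
set_time = timeit.timeit(lambda: target in word_set, number=1_000)

print(f"list lookup: {list_time:.4f} s")
print(f"set lookup:  {set_time:.4f} s")
```

The set lookup should come out far faster, and the gap grows as the word list grows.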

bool values and their integer equivalents

In Python, a bool expression evaluates to False or True, which are equal to 0 and 1 respectively.

>>> True == 1
True
>>> False == 0
True
>>> int(True)
1
>>> int(False)
0
>>> sum([True, True, True])
3
>>> sum([True, False, True])
2

So each time you check part in words, the result is 0 or 1, and we sum all of those values.

The code above is functionally equivalent to

result = 0
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            if part in words:
                result += 1
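One thing worth checking before switching: part in words against a set matches whole whitespace-separated tokens, while the loop in the question uses word in part, which is a substring test. The sketch below compares the two on a made-up sample; count_exact and count_substring are hypothetical helper names, and io.StringIO stands in for the open file.

```python
import io

def count_exact(lines, words):
    """Count tokens that are exactly equal to one of the words."""
    return sum(part in words for line in lines for part in line.split())

def count_substring(lines, words):
    """Count (token, word) pairs where the word occurs inside the token,
    which is what the loop in the question does."""
    return sum(word in part
               for line in lines
               for part in line.split()
               for word in words)

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
sample = "the word can help\nnobody candidly helped\n"

print(count_exact(io.StringIO(sample), words))      # 3
print(count_substring(io.StringIO(sample), words))  # 7
```

Here the substring version also counts 'or' inside "word", 'can' and 'and' inside "candidly", and 'help' inside "helped". If that is the behaviour you need, the set lookup is not a drop-in replacement.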

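Note: If you really want a list with a 1 for each match (and a 0 for each non-match), you can simply turn the generator expression into a list comprehension, like this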

with open('text.txt') as file:
    print([int(part in words) for line in file for part in line.split()])

Word frequency

If you actually want the frequency of each individual word in words, you can use collections.Counter like this

from collections import Counter
with open('text.txt') as file:
    c = Counter(part for line in file for part in line.split() if part in words)

This counts how many times each word in words occurs in the file.
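As a small illustration of what the resulting Counter gives you, here is a sketch that runs on an in-memory sample instead of text.txt. The sample text is made up, and io.StringIO is only a stand-in for the open file.

```python
import io
from collections import Counter

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# Made-up sample text; iterating over StringIO yields lines, like a file.
sample = io.StringIO("the cat and the dog\nit can help the one\n")

c = Counter(part for line in sample for part in line.split() if part in words)

print(c['the'])           # 3
print(c['two'])           # 0, a missing key counts as zero
print(c.most_common(1))   # [('the', 3)]
print(sum(c.values()))    # 8, the total number of matches
```

Summing the counts gives the same total as the sum(...) version, so the Counter covers both use cases.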

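As per the comment, you can keep a dictionary that stores positive words with a positive score and negative words with a negative score, and add the scores up like this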

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}
with open('text.txt') as file:
    print(sum(words.get(part, 0) for line in file for part in line.split()))

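Here we use the dictionary's words.get method to fetch the score stored for a word. If the word is not in the dictionary (it is neither a positive nor a negative word), get returns the default value 0.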
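To make the scoring concrete, here is a minimal sketch on an in-memory sample. The function name sentiment_score and the sample sentences are made up for illustration, and the dictionary is named scores here rather than words.

```python
import io

scores = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}

def sentiment_score(lines, scores):
    """Sum the score of every token; tokens not in the dict contribute 0."""
    return sum(scores.get(part, 0) for line in lines for part in line.split())

# 'great' and 'happy' add 1 each, 'hate' subtracts 1, the rest add 0.
sample = io.StringIO("a great and happy day\nI hate rain\n")
print(sentiment_score(sample, scores))  # 1
```

Because the lookup is an exact match, a token such as "Great" or "happy." would score 0 unless the text is normalized first.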

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/30694928/
