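python - Efficiently removing lines that contain given strings from a file

FileA contains lines; FileB contains words.

How can I efficiently remove the lines from FileA that contain words found in FileB?

I tried the following approaches, but I'm not even sure they work, because they take too long to run.

Tried grep: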

grep -v -f <(awk '{print $1}' FileB.txt) FileA.txt > out

Also tried Python:

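import sys

# argv[1]: word list (FileB), argv[2]: output file
with open(sys.argv[1], 'r') as f:
  bad_words = f.read().splitlines()

with open('FileA') as master_lines, open(sys.argv[2], 'w') as out:
  for line in master_lines:
    # Substring test against every word for every line, hence the long run time
    if not any(bad_word in line for bad_word in bad_words):
      out.write(line)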

FileA:

abadan refinery is one of the largest in the world.
a bad apple spoils the barrel.
abaiara is a city in the south region of brazil.
a ban has been imposed on the use of faxes

FileB:

abadan
abaiara

Expected output:

a bad apple spoils the barrel.
a ban has been imposed on the use of faxes

Best answer

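I refuse to believe that Python can't at least match Perl's performance on this one. Here is my quick attempt at a more efficient Python solution. I'm using sets to optimize the search part of the problem. The & operator returns a new set with the elements common to both sets; the code below uses isdisjoint instead, which only checks whether the two sets share any element, without building that intersection.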
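As a small illustration of the set operations mentioned above, here is a sketch using the sample data from the question. The variable names are illustrative only and are not part of the original answer.

```python
# Illustrative names; the data comes from the question's sample files.
bad_words = {"abadan", "abaiara"}

line_words = set("abadan refinery is one of the largest in the world.".split())
other_words = set("a bad apple spoils the barrel.".split())

# & builds a new set holding the elements common to both sets.
common = bad_words & line_words

# isdisjoint only answers yes/no and can stop at the first shared element.
keep_first = bad_words.isdisjoint(line_words)    # shares "abadan": drop the line
keep_second = bad_words.isdisjoint(other_words)  # shares nothing: keep the line
```

For a pure keep-or-drop decision the intersection itself is never needed, which is why a yes/no test is enough here.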

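On my machine, this solution takes 12 seconds on a fileA with 3M lines and a fileB with 200k words, versus 9 seconds for the Perl version. The biggest slowdown seems to be re.split, even though it appears to be faster than str.split in this case.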
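To check the split timing on your own machine, something like the following sketch could be used. The helper names (tokens_re, tokens_str, compare) and the sample line are assumptions made for illustration; results will vary with Python version and input, so no expected timings are given.

```python
import re
import timeit

splitter = re.compile(r"\s")
sample_line = "abadan refinery is one of the largest in the world.\n"

def tokens_re(text):
    # Splits on every single whitespace character, so a trailing
    # newline leaves an empty string in the result.
    return set(splitter.split(text))

def tokens_str(text):
    # Splits on runs of whitespace and never produces empty strings.
    return set(text.split())

def compare(number=100_000):
    """Return the total seconds each tokenizer took for `number` calls."""
    return {
        fn.__name__: timeit.timeit(lambda: fn(sample_line), number=number)
        for fn in (tokens_re, tokens_str)
    }
```

Calling compare() returns a dict of timings keyed by function name. Note that the two tokenizers are not quite equivalent: with re.split, a line ending in a newline yields an empty-string token, so a blank line in fileB (which strips to an empty string) would match every such line.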

If you have any suggestions for improving the speed, please comment on this answer.

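import re

# Load the words to filter on into a set for fast membership tests.
with open('Downloads/fileB.txt') as fileb:
    bad_words = set(line.strip() for line in fileb)

splitter = re.compile(r"\s")  # raw string, so \s is not treated as a string escape
with open('Downloads/fileA.txt') as filea, open('output.txt', 'w') as output:
    for line in filea:
        line_words = set(splitter.split(line))
        # Keep the line only if it shares no token with bad_words.
        if bad_words.isdisjoint(line_words):
            output.write(line)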

Source: a similar question on Stack Overflow, "python - Efficiently removing lines that contain given strings from a file": https://stackoverflow.com/questions/22493577/
