python - Search for text in a large file and write the results to a file

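My file 1 has 2.4 million lines (256 MB), and file 2 has 32,000 lines (1.5 MB).

I need to go through file 2 line by line and print the matching lines from file 1.

Pseudocode: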

open file 1, read
open file 2, read
open results, write

for line2 in file 2:
    for line1 in file 1:
        if line2 in line1:
            write line1 to results
            stop inner loop

My code:

p = open("file1.txt", "r")
d = open("file2.txt", "r")
o = open("results.txt", "w")

for hash1 in p:
    hash1 = hash1.strip('\n')
    for data in d:
        hash2 = data.split(',')[1].strip('\n')
        if hash1 in hash2:
            o.write(data)

o.close()
d.close()
p.close()

I expect 32k results.

Best answer

Your file2 is not very large, so loading it into memory is perfectly fine.

Load file2.txt into a set to speed up lookups and remove duplicates;
Remove the empty line from the set;
Scan file1.txt line by line and write the matches found to results.txt.

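# Load file2.txt into a set: membership tests are O(1) on average
# and duplicate lines collapse into one entry.
with open("file2.txt", "r") as f:
    lines = set(f.readlines())

# Drop the empty-line entry, if present, so blank lines never match.
lines.discard("\n")

# Stream file1.txt and keep every line that also appears in file2.txt.
# Lines are compared whole, including the trailing newline.
with open("results.txt", "w") as o:
    with open("file1.txt", "r") as f:
        for line in f:
            if line in lines:
                o.write(line)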
If file2 were larger, we could split it into chunks and repeat the same operation for each chunk, but in that case combining the results would be more difficult.
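
The chunked approach could look roughly like the following. It is a sketch under my own assumptions, not code from the answer: the helper name `match_in_chunks` and the `chunk_size` parameter are hypothetical, and matching is by whole line with the trailing newline stripped. The larger file is rescanned once per chunk, so the output is grouped by chunk instead of following the larger file's order, and a line whose key occurs in more than one chunk is written more than once. That is part of why combining the results is harder.

```python
from itertools import islice


def match_in_chunks(small_path, big_path, out_path, chunk_size=10000):
    """Scan big_path once per chunk of small_path; return the number of lines written."""
    written = 0
    with open(small_path, "r") as small, open(out_path, "w") as out:
        while True:
            # Read at most chunk_size lines of the smaller file into a set.
            chunk = {line.rstrip("\n") for line in islice(small, chunk_size)}
            if not chunk:
                break  # end of the smaller file
            chunk.discard("")  # ignore empty lines
            if not chunk:
                continue  # this chunk held only empty lines
            # Rescan the big file against the current chunk.
            with open(big_path, "r") as big:
                for line in big:
                    if line.rstrip("\n") in chunk:
                        out.write(line)
                        written += 1
    return written
```

Memory use is bounded by `chunk_size` entries, at the cost of one full pass over the big file per chunk.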

Regarding "python - Search for text in a large file and write the results to a file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56192724/
