python - Speeding up matching of two columns with Python

If I have a FILE1 like this:

1    56903
1    293943
1    320022
2    24050
2    404999
2    1003093

and a FILE2 like this:

1    rs40209    56903
1    rs209485   79382
1    rs392392   320022
2    rs30302    504922
2    rs3202309  707899
2    rs39339    1003093

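If the first and second columns of FILE1 match the first and third columns of FILE2, I want to pull out the second column of FILE2. I can do this with a nested loop that reads FILE1 and runs this awk command for each line: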

# For every line of FILE1, scan all of FILE2 for a matching row
while read -r COL1 COL2; do
    awk -v COL1="$COL1" -v COL2="$COL2" '($1==COL1 && $3==COL2){print $2}' "$FILE2"
done < "$FILE1"

which returns

rs40209
rs392392
rs39339

as output.

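However, doing this with a nested loop is very, very slow, because FILE2 is rescanned once for every line of FILE1.

Most of the data is sorted, but not all of it, and I can't sort it because other files depend on the current order of these files.

Using Python, what is a fast way to do this for a FILE1 with ~2M entries and a FILE2 with ~1M entries?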

Best answer

In Python, you would read the contents of the first file into a set. A set is backed by a hash table, which gives O(1) (constant-time) lookups on average:

with open('FILE1') as file:
    file1_contents = { tuple(line.split()) for line in file }

Then filter the second file:

with open('FILE2') as file2:
    for line in file2:
        c1, c2, c3 = line.split()
        if (c1, c3) in file1_contents:
            print(c2)

With FILE1 and FILE2 holding the contents from the question, this produces the output

rs40209
rs392392
rs39339
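
For anyone who wants to try the approach without creating files first, here is a minimal self-contained sketch of the same idea. The function name matching_ids and the use of io.StringIO to stand in for the two files are my own choices for illustration, not part of the original answer; the sample data is copied from the question.

```python
import io

# Sample data copied from the question.
FILE1_TEXT = """\
1    56903
1    293943
1    320022
2    24050
2    404999
2    1003093
"""

FILE2_TEXT = """\
1    rs40209    56903
1    rs209485   79382
1    rs392392   320022
2    rs30302    504922
2    rs3202309  707899
2    rs39339    1003093
"""


def matching_ids(file1, file2):
    """Yield column 2 of file2 for rows whose (col1, col3) appear in file1.

    Both arguments are iterables of lines (open files work too).
    """
    # Build the lookup set once; blank lines are skipped.
    wanted = {tuple(line.split()) for line in file1 if line.strip()}
    for line in file2:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        c1, c2, c3 = fields
        if (c1, c3) in wanted:  # average O(1) hash lookup
            yield c2


result = list(matching_ids(io.StringIO(FILE1_TEXT), io.StringIO(FILE2_TEXT)))
print("\n".join(result))
```

Note that the columns are compared as strings, so a value such as 056903 would not match 56903. If that matters for your data, convert the fields with int() before building the set.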

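With 2 million entries the set consumes quite a lot of memory (close to 1 GB), but this approach has the best asymptotic time complexity. On my laptop, the total time for 2 million entries in FILE1 and 1 million entries in FILE2 was about 5 seconds. It is also about 4 times faster than Fabricator's AWK script: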

% time awk -f script.awk FILE2 > /dev/null
awk -f script.awk FILE2 > /dev/null  17.47s user 0.14s system 99% cpu 17.606 total
% time python filter.py > /dev/null     
python filter.py > /dev/null  4.32s user 0.20s system 99% cpu 4.526 total
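
One thing to be aware of: the set-based filter prints matches in FILE2 order, whereas the original shell loop prints them in FILE1 order. If the FILE1 order matters, a possible variant (my own sketch, not from the original answer) is to index the smaller FILE2 in a dict and stream FILE1 instead. The name ids_in_file1_order is hypothetical, and the sketch assumes each (column 1, column 3) pair occurs at most once in FILE2; with duplicates, only the last ID would be kept.

```python
def ids_in_file1_order(file1, file2):
    """Yield matching IDs from file2, in the order the keys appear in file1.

    Both arguments are iterables of lines (open files work too).
    Assumption: each (col1, col3) pair is unique within file2.
    """
    # Index file2 by (col1, col3) -> col2.
    id_by_key = {}
    for line in file2:
        fields = line.split()
        if len(fields) == 3:
            c1, c2, c3 = fields
            id_by_key[(c1, c3)] = c2

    # Stream file1 and look each row up in the index.
    for line in file1:
        key = tuple(line.split())
        if key in id_by_key:
            yield id_by_key[key]


# Small example: file1 is deliberately in a different order than file2.
file1_lines = ["2    1003093", "1    293943", "1    56903"]
file2_lines = ["1    rs40209    56903", "2    rs39339    1003093"]
ordered = list(ids_in_file1_order(file1_lines, file2_lines))
print(ordered)
```

Since only the 1M-entry file is held in memory here, this may also need less memory than a set of 2M tuples, though I have not measured it.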

For more on python - Speeding up matching of two columns with Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/24979655/
