perl - Finding common and differing lines between two large files based on the first column

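I have two tab-delimited files with more than a million lines each. Based on the first column, I need to find how many values are common to both files and how many are specific to one of them.

I tried to do this in Perl with the code below, but it does not work correctly.

Given the file sizes, I also need to keep the run time in mind.

Can someone help me fix this, or suggest a more efficient approach?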

left.txt

K00134:78_1 272 1   3057610
K00134:78_0 272 1   3057610
K00134:78_2 272 1   3057610
K00134:78_3 272 1   3057610

right.txt

K00134:78_1 272 1   3057610
K00134:78_5 272 1   3057610
K00134:78_6 272 1   3057610
K00134:78_3 272 1   3057610

Perl code

use strict;
use warnings;

my %Set;

open (SET1, "<", "left.txt") or die "cannot open file";

while (<SET1>) {
    my @line = split (/\t/, $_);
    $Set{$line[0]} = $line[1];
}

my @k = keys %Set;
foreach my $key (@k) {
    print "$key, $Set{$key}\n";
}
close SET1;

open (SET2, "<", "right.txt") or die "cannot open file";
print "common:\n";

while (<SET2>) {
    chomp;

    if ( exists $Set{"$_"} ) {
        print "$Set{$_}\n";
    }
}

close SET2;

The output should look like this, listing the common lines based on the first column -

common lines - 
K00134:78_1 272 1   3057610
K00134:78_3 272 1   3057610

uncommon lines - left.txt

K00134:78_0 272 1   3057610
K00134:78_2 272 1   3057610

uncommon lines - right.txt

K00134:78_5 272 1   3057610
K00134:78_6 272 1   3057610

I would also like to output the unmatched lines from each file, but I am not sure that is feasible given the file sizes. Thanks!

Best answer

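Your second read loop is wrong: it looks up the whole line in the hash, while the hash keys are first-column values only. It should split each line on tabs and check the first field. Change it to: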

while (<SET2>) {
    my @line = split (/\t/, $_);
    print $_ if exists $Set{$line[0]};
}

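That will work. Your approach is fine. Since you only want to compare the first column, you do not need to store the second column ($line[1]) as the value in %Set; you can store '' instead to save some memory. Also to save memory, make sure left.txt is the smaller of the two files, because it is the one loaded into the hash. Here is a working example: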

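use strict;
use warnings;

my %Set;

# Pass 1: remember every first-column key of left.txt.
open (my $set1, "<", "left.txt") or die "cannot open left.txt: $!";

while (<$set1>) {
    my @line = split (/\t/, $_);
    $Set{$line[0]} = '';    # only the key matters, so keep the value empty
}

close $set1;

# Pass 2: stream right.txt and print the lines whose key was seen in left.txt.
open (my $set2, "<", "right.txt") or die "cannot open right.txt: $!";
print "common:\n";

while (<$set2>) {
    my @line = split (/\t/, $_);
    print $_ if exists $Set{$line[0]};
}

close $set2;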
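
The advice to hash the smaller file can be automated rather than checked by hand. The following is only a rough sketch: `order_by_size` is a hypothetical helper name (not from the original answer), and it assumes that size in bytes is a good enough proxy for the number of distinct keys. It relies on Perl's built-in `-s` and `-e` file tests.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two file names ordered (smaller, larger)
# by size in bytes, so the caller can hash the first and stream the second.
sub order_by_size {
    my ($file_a, $file_b) = @_;
    for my $file ($file_a, $file_b) {
        die "no such file: $file\n" unless -e $file;
    }
    my $size_a = (-s $file_a) || 0;   # -s yields a false value for an empty file
    my $size_b = (-s $file_b) || 0;
    return $size_a <= $size_b ? ($file_a, $file_b) : ($file_b, $file_a);
}

# Usage: perl script.pl left.txt right.txt
if (@ARGV == 2) {
    my ($small, $large) = order_by_size(@ARGV);
    print "hash the keys of $small, then stream $large\n";
}
```

Note that if the files end up swapped, the common lines printed are those of whichever file is streamed, so columns other than the first may differ from what the unswapped run would show.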

Edit - here is another approach that does what you need

use strict;
use warnings;

my %Set;

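# Record each first-column key with a bitmask of the files it occurs in
# (1 = left.txt, 2 = right.txt, 3 = both) and the first line seen for it.
sub readFile {
    my ($fn, $bit) = @_;
    open (my $fh, "<", $fn) or die "can't open $fn: $!";
    while (<$fh>) {
        my ($k) = split (/\t/, $_);
        $Set{$k} ||= [0, $_];
        $Set{$k}[0] |= $bit;
    }
    close $fh;
}

# Print the stored line of every key whose bitmask equals $bit.
sub showByBit {
    my ($k, $bit) = @_;
    foreach my $key (@{$k}) {
        my $entry = $Set{$key};    # avoid "my $a", which shadows sort's $a
        print $entry->[1] if $entry->[0] == $bit;
    }
}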

readFile('left.txt', 1);
readFile('right.txt', 2);
my @k = keys %Set;

print "common lines -\n";
showByBit(\@k, 3);

print "uncommon lines - left.txt\n";
showByBit(\@k, 1);

print "uncommon lines - right.txt\n";
showByBit(\@k, 2);
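
The question also asks how many keys are common and how many are specific to one file. If only those counts are needed, the same bitmask idea can be used without keeping the lines themselves, which should reduce memory use. This is a minimal sketch under that assumption; `mark_keys` and `count_by_mask` are hypothetical names, and the `//` operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: OR $bit into the mask of every first-column key in $file.
sub mark_keys {
    my ($masks, $file, $bit) = @_;
    open(my $in, '<', $file) or die "cannot open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($key) = split /\t/, $line;
        next unless defined $key;                 # skip empty lines
        $masks->{$key} = ($masks->{$key} // 0) | $bit;
    }
    close $in;
    return;
}

# Hypothetical helper: tally keys by mask (1 = first file only,
# 2 = second file only, 3 = present in both).
sub count_by_mask {
    my ($masks) = @_;
    my %count = (1 => 0, 2 => 0, 3 => 0);
    $count{$_}++ for values %$masks;
    return \%count;
}

# Usage: perl script.pl left.txt right.txt
if (@ARGV == 2) {
    my %masks;
    mark_keys(\%masks, $ARGV[0], 1);
    mark_keys(\%masks, $ARGV[1], 2);
    my $count = count_by_mask(\%masks);
    print "common keys: $count->{3}\n";
    print "only in $ARGV[0]: $count->{1}\n";
    print "only in $ARGV[1]: $count->{2}\n";
}
```

Because each key is counted once, a key that is repeated within one file still contributes a single entry; the counts are of distinct first-column values, not of lines.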

For more on "perl - Finding common and differing lines between two large files based on the first column", see the similar question on Stack Overflow: https://stackoverflow.com/questions/40516701/
