Comparing two files. It sounds simple, but comparing two files where one piece of information can be flexible has been very challenging for me.
fileA
4 "dup" 37036335 37044984
3 "dup" 100146708 100147504
7 "del" 100 203
2 "dup" 34 89
fileB
4 "dup" 37036335 37036735
3 "dup" 100146708 100147504
4 "dup" 68 109
Expected output:
output_file1 (matching hits)
fileA: 4 "dup" 37036335 37044984
fileB: 4 "dup" 37036335 37036735
fileA: 3 "dup" 100146708 100147504
fileB: 3 "dup" 100146708 100147504
output_file2 (found in fileA, but not in FileB including non-overlap)
7 "del" 100 203
2 "dup" 34 89
output_file3 (found in fileB, but not in FileA including non-overlap)
4 "dup" 68 109
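The catch is... I need fields 1 and 2 from the first file to match the second file exactly, and the coordinates in fields 3 and 4 to be identical or overlapping.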
This would mean these are the same.
fileA :4 "dup" 37036335 37044984
fileB :4 "dup" 37036335 37036735
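I also need to find the differences between the two files (no overlap, a line present in one file but not in the other, etc.).
Here is the gist of what I have tried. I have written this code in maybe 4 different ways and, alas, still no success. I have put both files into arrays (I have also tried hashes... idk).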
## if no hits in original, but hits in calculated
if((! @ori) && (@calc)){}
## if CNV calls in original, but none in calculated
if((@ori) && (! @calc)){}
## if CNV calls in both
if((@ori) && (@calc)){
    ## compare calls with double 'for' loop
    foreach my $l (@ori){
        my @l = split(/\s/,$l);
        my $Ochromosome = $l[0];
        my $Ostart = $l[2];
        my $Oend = $l[3];
        my $Otype = $l[1];
        foreach my $l (@calc){
            my @l = split(/\s/,$l);
            my $Cchromosome = $l[0];
            my $Cstart = $l[2];
            my $Cend = $l[3];
            my $Ctype = $l[1];
            ## check chromosome and type here
            if(($Ochromosome eq $Cchromosome) && ($Otype eq $Ctype)){ ## what if there are two duplications on the same chromosome?
                ## check coordinates
                if(($Ostart <= $Cend) && ($Cstart <= $Oend)){
                    ## overlap
                }else{
                    ## noOverlap
                }
            }else{
                ## what if there is something found in one, but not in the other and they both have calls?
                ## ahhhh
            }
        }
    }
}
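Best answer
Here is a straightforward solution that is also reasonably efficient.
Iterate over the lines of one file, checking each line against all lines of the other file (until a match is found). Given all the information that needs to be collected, this is about the least work we can do in terms of complexity.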
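If a line in A is not found in B, it is added to @not_in_B. To determine which lines of B are not in A, we prepare a hash in which each element of B is a key with the value 0. Once (and if) an element of B is found, its value in the hash is set to 1. Elements whose value is still not 1 at the end were never found by any element of A, so they are the surplus ones. They go into @not_in_A.
For simplicity, both files are first read into arrays (the inner loop needs this anyway).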
use warnings;
use strict;
use feature 'say';
my $f1 = 'f1.txt';
my $f2 = 'f2.txt';
# Read both files into arrays, one element per line
open my $fh, '<', $f1 or die "Can't open $f1: $!";
my @a1 = <$fh>;
chomp(@a1);
open $fh, '<', $f2 or die "Can't open $f2: $!";
my @a2 = <$fh>;
chomp(@a2);
close $fh;
my (@not_in_A, @not_in_B);
# Every line of B starts out flagged 0 ("not found by any line of A")
my %Bs_in_A = map { $_ => 0 } @a2;
foreach my $e1 (@a1)
{
    my $match = 0;
    foreach my $e2 (@a2)
    {
        if ( lines_match($e1, $e2) ) {
            $match = 1;
            say "Match:\n\tf1: $e1\n\tf2: $e2";
            $Bs_in_A{$e2} = 1;
            last;    # stop at the first matching line of B
        }
    }
    push @not_in_B, $e1 if not $match;
}
# Lines of B whose flag is still 0 were never matched
@not_in_A = grep { $Bs_in_A{$_} == 0 } keys %Bs_in_A;
say '---';
say "Elements of A that are not in B:";
say "\t$_" for @not_in_B;
say "Elements of B that are not in A:";
say "\t$_" for @not_in_A;
sub lines_match
{
    my ($l1, $l2) = @_;
    my @t1 = split ' ', $l1;
    my @t2 = split ' ', $l2;
    # First two fields must be the same
    return if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];
    # Third-to-fourth-field ranges must overlap
    return
        if ($t1[2] < $t2[2] and $t1[3] < $t2[2])
        or ($t1[2] > $t2[3] and $t1[3] > $t2[3]);
    return 1;  # match
}
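The range test in lines_match rejects a pair when one range lies entirely before or entirely after the other. Assuming each line has its start no greater than its end, this is equivalent to the shorter condition used in the question's own attempt. A minimal sketch of that condition as a standalone helper follows; the name ranges_overlap is made up for illustration and is not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): true if the closed
# ranges [s1, e1] and [s2, e2] share at least one position.
# Assumes start <= end within each range.
sub ranges_overlap {
    my ($s1, $e1, $s2, $e2) = @_;
    return $s1 <= $e2 && $s2 <= $e1;
}

# Ranges taken from the sample data in the question
print ranges_overlap(37036335, 37044984, 37036335, 37036735)
    ? "overlap\n" : "no overlap\n";
print ranges_overlap(68, 109, 37036335, 37044984)
    ? "overlap\n" : "no overlap\n";
```

With the sample ranges, the first call reports an overlap and the second does not. Ranges that merely touch at an endpoint count as overlapping here, which matches the behavior of lines_match.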
Output
Match:
	f1: 4 "dup" 37036335 37044984
	f2: 4 "dup" 37036335 37036735
Match:
	f1: 3 "dup" 100146708 100147504
	f2: 3 "dup" 100146708 100147504
---
Elements of A that are not in B:
	7 "del" 100 203
	2 "dup" 34 89
Elements of B that are not in A:
	4 "dup" 68 109
Note that in the code I used 1 in place of A and 2 in place of B (as in $f1, @a1 and $f2, @a2).
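Two properties of the approach above may matter on other data. Because the inner loop stops at the first match, a line of B that overlaps a line of A which has already matched an earlier line of B can end up reported as missing from A. And because the hash is keyed by the line text, duplicate lines in B collapse into one key and the original order is lost. The following is only a sketch of one possible variant, not the answer's method: it groups B by the first two fields, keeps one flag per line position, and does not stop early. The names parse_line and compare_calls are hypothetical, and every line is assumed to have four whitespace-separated fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into its four fields.
# Assumes: chromosome, type, start, end, separated by whitespace.
sub parse_line {
    my ($line) = @_;
    my ($chr, $type, $start, $end) = split ' ', $line;
    return { line => $line, key => "$chr\t$type", start => $start, end => $end };
}

# Compare two lists of lines. Returns three array refs: matched pairs,
# lines of A with no partner in B, and lines of B with no partner in A.
sub compare_calls {
    my ($lines_a, $lines_b) = @_;

    # Group positions of B records by "chromosome<TAB>type", so each A line
    # is compared only against candidates that can possibly match.
    my @recs_b = map { parse_line($_) } @$lines_b;
    my %bucket;
    push @{ $bucket{ $recs_b[$_]{key} } }, $_ for 0 .. $#recs_b;

    my (@pairs, @only_a);
    my @seen_b = (0) x @recs_b;    # one flag per B line, by position

    for my $line (@$lines_a) {
        my $ra    = parse_line($line);
        my $found = 0;
        for my $i (@{ $bucket{ $ra->{key} } || [] }) {
            my $rb = $recs_b[$i];
            next unless $ra->{start} <= $rb->{end} && $rb->{start} <= $ra->{end};
            push @pairs, [ $ra->{line}, $rb->{line} ];
            $seen_b[$i] = 1;
            $found = 1;    # no 'last': every overlapping B line gets flagged
        }
        push @only_a, $ra->{line} unless $found;
    }

    my @only_b = map { $recs_b[$_]{line} } grep { !$seen_b[$_] } 0 .. $#recs_b;
    return (\@pairs, \@only_a, \@only_b);
}

# Sample data from the question
my @file_a = ('4 "dup" 37036335 37044984', '3 "dup" 100146708 100147504',
              '7 "del" 100 203', '2 "dup" 34 89');
my @file_b = ('4 "dup" 37036335 37036735', '3 "dup" 100146708 100147504',
              '4 "dup" 68 109');

my ($pairs, $only_a, $only_b) = compare_calls(\@file_a, \@file_b);
print "fileA: $_->[0]\nfileB: $_->[1]\n" for @$pairs;
print "only in A: $_\n" for @$only_a;
print "only in B: $_\n" for @$only_b;
```

On the sample data this gives the same two matched pairs and the same leftover lines as the output shown above. The three returned lists correspond to the three output files described in the question, so each could be printed to its own filehandle instead of STDOUT.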
Regarding "perl - comparing two files where one piece of information can be flexible", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/41494747/