arrays - Perl: print the keys of two hashes whose value arrays intersect

I have two files, as shown below:

File 1: all columns except the last are tab-separated

space   start   end width   names   score.data  
1   1   1873    24409   22537   DDX11L1 NA
2   1   4361    39370   35010   WASH7P  NA
23  1   690244  724068  33825   LOC100288069    NA
24  1   742750  765214  22465   FAM87B  "rs1;rs2;rs3,"
25  1   751585  772902  21318   LINC00115   "rs3;rs4"
26  1   752970  804826  51857   LINC01128   "rs5;rs6;rs7;rs8;rs9"
27  1   793450  822182  28733   FAM41C  "rs9;rs10;rs11"
28  1   842197  865072  22876   LOC100130417    "rs12;rs13;rs14;rs15;rs16"
29  1   851120  889961  38842   SAMD11  "rs14;rs15;rs16;rs17"
30  1   869582  904679  35098   NOC2L   "rs13;rs17;rs20;rs25;rs27"  
31  1   885966  911099  25134   KLHL17  "rs23;rs25;rs34;rs49"
78  1   1582938 1634243 51306   SLC35E2B    rs45

File 2: all columns except the last are tab-separated

space   start   end width   names   score.data  
1   1   1096679 1097517 839     DMR1 rs2;rs3  
2   1   1229025 1229590 566     DMR2 rs4  
3   1   1267955 1269432 1478    DMR3 rs7;rs8;rs9  
4   1   1279248 1279795 548     DMR4 rs9;rs10  
5   1   1372628 1374653 2026    DMR5 rs11;rs12;rs14;rs18  
6   1   1842116 1842456 341     DMR6 NA  
7   1   1896556 1897211 656     DMR7 rs13;rs17;rs20

Desired output: all columns are tab-separated

DMR1 FAM87B LINC00115   
DMR2 LINC00115    
DMR3 LINC01128 FAM41C    
DMR4 LINC01128 FAM41C
DMR5 FAM41C LOC100130417 SAMD11
DMR7 SAMD11 NOC2L

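So basically, I need to check whether any score.data entry in file2 (rs2, rs3, ...) intersects with the score.data entries in file1. If it does, I should print the key from the names column of file2, followed by the corresponding keys from the names column of file1.

For example, DMR1 in file2 has score.data rs2;rs3, which intersects with the score.data of FAM87B (rs1;rs2;rs3) and of LINC00115 (rs3;rs4) in file1.

Most of the code I have written so far deals with cleaning the quotes ("") and NA entries out of the first file and building the hashes: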
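The intersection test described here can be sketched in core Perl with a lookup hash. This is only an illustration: the sub name `shares_snp` is a hypothetical helper, and the SNP lists are typed in by hand from the example rows rather than read from the files.

```perl
use strict;
use warnings;

# Hypothetical helper: returns how many entries of the second list
# also occur in the first list (0 means the lists are disjoint).
sub shares_snp {
    my ($left, $right) = @_;              # two array references
    my %seen  = map { $_ => 1 } @$left;   # index the first list for fast lookup
    my $count = grep { $seen{$_} } @$right;
    return $count;
}

my @dmr1      = qw(rs2 rs3);       # DMR1 from file2
my @fam87b    = qw(rs1 rs2 rs3);   # FAM87B from file1
my @linc00115 = qw(rs3 rs4);       # LINC00115 from file1
my @samd11    = qw(rs14 rs15 rs16 rs17);

for my $gene ([FAM87B => \@fam87b], [LINC00115 => \@linc00115], [SAMD11 => \@samd11]) {
    my ($name, $snps) = @$gene;
    my $verdict = shares_snp(\@dmr1, $snps) ? 'intersects' : 'does not intersect';
    print "DMR1 $verdict $name\n";
}
```

Indexing one list in a hash keeps each comparison linear in the size of the two lists, instead of comparing every pair of entries.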

use 5.014;
use warnings;

my $file1 = '/path/to/file1';
my $file2 = '/path/to/file2';

#Open files
open my $fh1 , '<', $file1 or die $!;
open my $fh2, '<', $file2 or die $!;


#Read file1
my %gene_hash;
while(<$fh1>){
    chomp;
    my @arr = split; 
    next if $arr[0] eq "space";
    next if $arr[6] eq 'NA';

    my $key = $arr[5]; #Hash key

    my @snps = split /;/, $arr[6]; #to be used as value in hash
    my $first_snp = shift @snps; #remove 1st element from start

    my @first_snp = split /"/, $first_snp; #remove " from start
    unshift @snps, $first_snp[1]; #add 1st element back to beginning


    my $last_snp = pop @snps; #remove last element
    my @last_snp = split /"/, $last_snp; #remove " from end

    push @snps, $last_snp[0];# add last element back to the end
    push @snps, $arr[6] if $arr[6] =~/^rs.*/; #add element even if there are no "" eg SLC35E2B

    push @{ $gene_hash{$key} }, @snps; #assign values to hash
}



my %dmr_hash;
while(<$fh2>){
   chomp;
   my @arr = split;

   next if $arr[0] eq "space";
   next if $arr[6] eq 'NA';

   my $key = $arr[5]; #Hash key

   my @snps = split /;/, $arr[6];#to be used as value in hash
   push @{ $dmr_hash{$key} }, @snps; #assign values to hash

}

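I searched Stack Overflow for other hash-comparison questions, but all of them had the same keys in both hashes. I also found the Array::Utils module, which can intersect two arrays, but I am really not sure how to apply it to my problem.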

Thank you for taking the time to look at my problem; I would appreciate your ideas and solutions.
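For hashes shaped like %gene_hash and %dmr_hash above (name mapped to an array of SNP ids), one possible way to compare them without any extra module is a nested loop over both hashes. The sketch below assumes the hashes are already clean; the sample values are typed in from the example files, and `overlapping_genes` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample data in the shape the code above builds: name => [SNP ids].
my %gene_hash = (
    FAM87B    => [qw(rs1 rs2 rs3)],
    LINC00115 => [qw(rs3 rs4)],
    LINC01128 => [qw(rs5 rs6 rs7 rs8 rs9)],
);
my %dmr_hash = (
    DMR1 => [qw(rs2 rs3)],
    DMR2 => [qw(rs4)],
    DMR3 => [qw(rs7 rs8 rs9)],
);

# Hypothetical helper: for every DMR, collect the genes that share
# at least one SNP id with it. Returns a hash reference.
sub overlapping_genes {
    my ($dmrs, $genes) = @_;
    my %result;
    for my $dmr (sort keys %$dmrs) {
        my %wanted = map { $_ => 1 } @{ $dmrs->{$dmr} };
        for my $gene (sort keys %$genes) {
            # grep in boolean context is true if any SNP id matches
            push @{ $result{$dmr} }, $gene
                if grep { $wanted{$_} } @{ $genes->{$gene} };
        }
    }
    return \%result;
}

my $overlap = overlapping_genes(\%dmr_hash, \%gene_hash);
print join("\t", $_, @{ $overlap->{$_} }), "\n" for sort keys %$overlap;
```

This compares every DMR with every gene, which is fine for small inputs. For large files, building an index from SNP id to gene names avoids the inner loop.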

Best answer

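This does what you ask. It builds a hash, %mapping, that maps each score entry to all the names it appears with in File1. While reading File2, it then looks up each score entry in that hash to build the list of File1 names that share an entry with the File2 name.

The program expects the paths to the two input files as command-line arguments, for example

perl find_joined.pl path/to/file1 path/to/file2

Note that I have simply split each record on whitespace, because the separators in the sample data are inconsistent.

The output for DMR7 includes LOC100130417, which is missing from your desired output. That is correct, because DMR7 in File2 has the score entry rs13, which also appears on the LOC100130417 line of File1.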
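As a small illustration of the %mapping step, the sketch below builds the same kind of index (score entry mapped to File1 names) from three in-memory lines instead of a file. The @file1_lines array is sample data copied from File1 and is an assumption for the demo only.

```perl
use strict;
use warnings;

# Three sample records from File1, held in memory for the demo.
my @file1_lines = (
    qq{1\t1\t1873\t24409\t22537\tDDX11L1\tNA},
    qq{26\t1\t752970\t804826\t51857\tLINC01128\t"rs5;rs6;rs7;rs8;rs9"},
    qq{27\t1\t793450\t822182\t28733\tFAM41C\t"rs9;rs10;rs11"},
);

my %mapping;
for my $line (@file1_lines) {
    my @fields = split ' ', $line;       # split on any run of whitespace
    my $name   = $fields[-2];            # second-to-last column: names
    # Every run of characters that is neither a quote nor a semicolon
    # is one score entry, so the surrounding quotes are dropped.
    my @entries = $fields[-1] =~ /[^";]+/g;
    push @{ $mapping{$_} }, $name for @entries;
}
delete $mapping{NA};                     # NA is a placeholder, not an entry

for my $snp (sort keys %mapping) {
    print "$snp => @{ $mapping{$snp} }\n";
}
```

Here rs9 ends up mapped to both LINC01128 and FAM41C, which is what later links a File2 row containing rs9 to both names.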

use strict;
use warnings;
use v5.10.1;
use autodie;

my %mapping;

{
    open my $fh, '<', $ARGV[0];
    <$fh>; # Drop the header line

    while ( <$fh> ) {
        my @fields = split;
        my $name = $fields[-2];
        my @entries = $fields[-1] =~ /[^";]+/g;
        push @{ $mapping{$_} }, $name for @entries;
    }

    delete $mapping{NA};
}

open my $fh, '<', $ARGV[1];
<$fh>; # Drop the header line

while ( <$fh> ) {
    my @fields = split;
    my $name = $fields[-2];
    my @entries = $fields[-1] =~ /[^";]+/g;

    my %matching;
    @matching{@$_} = () for grep defined, @mapping{@entries};

    if ( keys %matching ) {
        print join(' ', $name, sort keys %matching), "\n"
    }
}

Output

DMR1 FAM87B LINC00115
DMR2 LINC00115
DMR3 FAM41C LINC01128
DMR4 FAM41C LINC01128
DMR5 FAM41C LOC100130417 SAMD11
DMR7 LOC100130417 NOC2L SAMD11
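The line `@matching{@$_} = () for grep defined, @mapping{@entries};` packs two hash slices into one statement. The sketch below unpacks it step by step for the DMR7 row. The %mapping contents are typed in by hand from File1 (only the three relevant entries), and rs99 is a made-up id added to show how a missing key behaves.

```perl
use strict;
use warnings;

# Part of the index the program builds: score entry => names from File1.
my %mapping = (
    rs13 => [qw(LOC100130417 NOC2L)],
    rs17 => [qw(SAMD11 NOC2L)],
    rs20 => [qw(NOC2L)],
);

# DMR7's entries, plus rs99, a made-up id that is absent from %mapping.
my @entries = qw(rs13 rs17 rs20 rs99);

# @mapping{@entries} is a hash slice: one value per key, undef for
# keys that do not exist. grep drops the undef values.
my @hits = grep { defined $_ } @mapping{@entries};

# Assigning an empty list to a slice of %matching creates each name as
# a key with an undef value, so repeated names collapse into one key.
my %matching;
@matching{@$_} = () for @hits;

my @names = sort keys %matching;
print join(' ', 'DMR7', @names), "\n";   # prints "DMR7 LOC100130417 NOC2L SAMD11"
```

Using hash keys as a set is what removes the duplicate NOC2L, and the final `sort` explains why the names come out in alphabetical order rather than file order.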

The original question, "arrays - Perl: print the keys of two hashes whose value arrays intersect", is on Stack Overflow: https://stackoverflow.com/questions/32050898/
