Perl:对其他散列使用引用 has

首先，我什至不确定要搜索什么，因为我知道如何在 excel 中执行此操作，但找不到一种简单的方法(以我有限的知识)在 perl 中执行此操作。我需要对谱系文件(超过 140 万条记录)重新编号，不幸的是，由于 PC 的能力和 excel 电子表格的能力，excel vlookup 是不够的。

文件需要重新编号，这样个人的编号就不会比 parent 的编号低，所以我的测试文件看起来是这样的:

Ani | Sire | Dam
----------------
15  |   1  | 2
12  |   1  | 2
30  |  15  | 12
18  |  15  | 2
26  |  15  | 30
48  |  18  | 30
32  |  26  | 48
50  |  26  | 30

1 和 2 表示未知的父代(我将它们保留为 1/2)，重新编号从 10 开始，因此“新 ID”如下:

Old_ID | New_ID
---------------
 15    | 10
 12    | 11
 30    | 12
 18    | 13
 26    | 14
 48    | 15
 32    | 16
 50    | 17

所以我希望看到的输出是

new_ani | new_sire | new_dam
----------------------------
   10   | 1        | 2
   11   | 1        | 2
   12   | 10       | 11
   13   | 10       | 2
   14   | 10       | 12
   15   | 13       | 30
   16   | 14       | 15
   17   | 14       | 12

使用两个散列，我尝试(未成功)首先将第一列链接到新 ID(我可以这样做)，然后是父亲和母亲列(我不能这样做)。

为了稍微减少代码，我省略了计算新水坝 ID 的 block ，因为它是父本的复制品。到目前为止，我的代码如下:

use strict;
use warnings;

my $input_file = .../pedigree.csv;
open (INPUT, "<", $input_file) or die "Cant open $input_file: $!";

my new_id = 0;

my %old_ped = ();
my %new_id = ();

while (<INPUT>){

        my $line = $_;
           $line =~ s/\s*$//g;

        my ($ani,$sire,$dam) = split('\,',$line);

        next if $ani eq 'db_animal' or !$ani or $ani eq 'ani';

        $old_ped{$ani}[0] = $ani;
        $old_ped{$ani}[1] = $sire;
        $old_ped{$ani}[2] = $dam;

        $new_id++;

        $new_id{$ani}[0] = $ani;
        $new_id{$ani}[1] = $new_id;

}
close INPUT;

foreach my $tt (sort keys %old_ped){

        #animal
        if ($old_ped{$tt}[0] == $new_id{$tt}[0]){
                print "$new_id{$tt}[1],";

                #sires
                if ($old_ped{$tt}[1] == 1){
                       print " 1,";
                }
                else{
                        foreach my $tt (sort keys %new_id) {
                                if ($old_ped{$tt}[1] == $nuwe_id{$tt}[0]){
                                       print "$new_id{$tt}[1],";                                           
                                }
                        }
                }
        }

# AND REPEAT SIRE BLOCK FOR DAM

print "\n";
}

但是...我显然得到了错误的输出，因为引用文献没有连接，所以没有父系(或母系)的匹配项。

我尝试创建 2 个额外的哈希值，一个用于父亲和母亲，使用父亲和母亲的 ID 作为引用:

$sire{$sire}[0] = $sire;
$sire{$sire}[1] = $dierid;

$dam{$dam}[0] = $dam;
$dam{$dam}[1] = $dierid;

并在 foreach 中使用它们如下:

foreach my $tt (sort keys %old_ped){

        #animal
        if ($old_ped{$tt}[0] == $new_id{$tt}[0]){
                print "$new_id{$tt}[1],";

                #sires
                if ($old_ped{$tt}[1] == 1){
                       print " 1,";
                }
                else{
                        foreach my $tt (sort keys %sire) {
                                if ($sire{$tt}[0] == $nuwe_id{$tt}[0]){
                                       print "$new_id{$tt}[1],";                                           
                                }

                        }
                }
        }

# AND REPEAT SIRE BLOCK FOR DAM

print "\n";
}

我猜我没有正确使用哈希值，或者我可能需要使用不同的循环？但是，我的 perl 知识仍然非常基础和缺乏。

任何帮助将不胜感激!!

最佳答案

您的方法很复杂。我将首先关注一种不同的方法，我将对此进行解释。

您需要对数据进行两次传递。在第一遍中，您生成旧 ID 和新 ID 的映射。创建新 id 的算法只是从 10 开始递增，所以很简单。我们可以使用以旧 ID 为键、以新 ID 为值的常规哈希。

在我的方法中，我们还将第一次传递中的行数据保存到数组引用数组中。这样我就可以在第二遍中重复使用它。如果您有很多记录，那可能不明智，因为它会占用大量内存。在那种情况下，您将重新读取数据并打印，而不是像我那样更改值。

在第二遍中，我们迭代行并简单地从查找哈希中替换所有行。

ani 的值(value)很简单。获取当前值并查找它。
sire 的值只有在不是 1 或 2 时才应该被替换。在 Perl 中，它可以转换为 unless 它小于 3。在这种情况下查找它，否则什么也不做。
dam 的值以同样的方式工作。

use strict;
use warnings;
use Data::Printer;

my $new_id = 10;

my %new_ids;
my @rows;
while (my $line = <DATA>) {
    $line =~ s/\s*$//g;

    my ( $ani, $sire, $dam ) = split( '\,', $line );

    # map old -> new
    $new_ids{$ani} = $new_id;

    # save row
    push @rows, [$ani, $sire, $dam];

    ++$new_id;
}

# iterate all rows and replace the ids
foreach my $row (@rows) {
    $row->[0] = $new_ids{$row->[0]};
    $row->[1] = $new_ids{$row->[1]} unless $row->[1] < 3;
    $row->[2] = $new_ids{$row->[2]} unless $row->[2] < 3;
}

p @rows;
__DATA__
15,1,2
12,1,2
30,15,12
18,15,2
26,15,30
48,18,30
32,26,48
50,26,30

我的程序用 Data::Printer 打印结果.

[
    [0] [
        [0] 10,
        [1] 1,
        [2] 2
    ],
    [1] [
        [0] 11,
        [1] 1,
        [2] 2
    ],
    [2] [
        [0] 12,
        [1] 10,
        [2] 11
    ],
    [3] [
        [0] 13,
        [1] 10,
        [2] 2
    ],
    [4] [
        [0] 14,
        [1] 10,
        [2] 12
    ],
    [5] [
        [0] 15,
        [1] 13,
        [2] 12
    ],
    [6] [
        [0] 16,
        [1] 14,
        [2] 15
    ],
    [7] [
        [0] 17,
        [1] 14,
        [2] 12
    ]
]

在执行时间方面，我用这个程序随机创建了一个包含 1.5M 记录的文件。

$ perl -E 'say join ",", int rand 10000, int rand 10000, int rand 10000 for 1 .. 1_500_000' > animals.csv

在我的 Core i7 四核笔记本电脑和 Perl 5.20.1 上运行我的代码(更改为打开文件)大约需要 8 秒。

$ time perl scratch.pl 
real    0m7.863s
user    0m7.260s
sys     0m0.436s

关于Perl:对其他散列使用引用 has，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/44678548/

Perl:对其他散列使用引用 has

上一篇：image-processing - 测量图像中弯管长度的特征检测技术

下一篇：.NET - 使用 .NET react 的多线程生产者消费者