arrays - How can I process a large text file more efficiently with Perl?


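Question:

I need to search a huge text file (about 1.5 million lines of data) and extract the lines that match a set of unique IDs. I have stored my unique IDs in an array, and I loop over the entire file once for each array element.

This approach works for small arrays, but when the array is large it slows my program down considerably, because the whole file is reread for every identifier.

My array can contain up to 10,000 unique identifiers of the following form: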

DC888U1
DC888U2
DC888U3 
... 
...

Each line in my data file always starts with a unique identifier.

DC888U1 Apples 0.99 75
DC888U2 Oranges 0.75 1002
DC888U3 Bread 1.35 100
... ... ... ...
... ... ... ...

My code is as follows:

# Array containing the identifiers
open (IDENTIFIERS, "< keywords.txt") or die "Cannot open file: $!";
    chomp(my @keywords = <IDENTIFIERS>);
close (IDENTIFIERS);

# Iterate through the array element by element
foreach my $element (@keywords) {
    open (FH, "< inventory.txt") or die "cannot open file: $!";
    while (<FH>) {
        if ($_ =~ /^\Q$element\E/) {
            print $_;
        }
    }
    close (FH);
}

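I looked into Tie::File to see whether it could speed up my processing, but without success. I was wondering whether there is a way to cache the lines that have already been printed, so that each later pass through the file has less data to search.

Is there?

Best answer

The key is to turn your O(N*M) code into O(N+M):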

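use strict;
use warnings;
use v5.10;  # For autodie
use autodie;

# Only the identifiers file is required; the inventory file is optional
# because <> falls back to STDIN when no file arguments remain.
die <<ERROR unless @ARGV;
Identifiers file missing.
Usage: $0 identifiers_file [ inventory_file ]
ERROR

my $keywords_re = do {
    my $keywords_file = shift;
    open my $fh, '<', $keywords_file;
    my @keywords = <$fh>;
    chomp @keywords;
    # Escape each identifier and join them all into one alternation
    my $re = join '|', map quotemeta, @keywords;
    qr/$re/;
};

# Single pass over the data; \s ensures the whole identifier was matched
while (<>) { print if /^$keywords_re\s/ }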
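
To put numbers on the difference: with 10,000 identifiers and 1.5 million lines, the nested loop in the question performs up to 10,000 × 1,500,000 = 1.5 × 10^10 line tests, whereas the single-pass version reads each identifier and each data line only once. The following is a minimal self-contained sketch of the same technique. The sample identifiers, the extra DC888U10 line, and the in-memory filehandle are assumptions made for illustration and stand in for the real keywords and inventory files.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for keywords.txt and inventory.txt.
my @keywords  = qw(DC888U1 DC888U3);
my $inventory = <<'END';
DC888U1 Apples 0.99 75
DC888U2 Oranges 0.75 1002
DC888U3 Bread 1.35 100
DC888U10 Milk 1.10 40
END

# Build one alternation from all identifiers; quotemeta escapes any
# regex metacharacters so each identifier is matched literally.
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $keywords_re = qr/$alternation/;

# Read the data exactly once, here through an in-memory filehandle.
open my $fh, '<', \$inventory or die "Cannot open in-memory file: $!";
my @matched;
while (my $line = <$fh>) {
    # The trailing \s stops DC888U1 from also matching DC888U10.
    push @matched, $line if $line =~ /^$keywords_re\s/;
}
close $fh;

print @matched;
```

Run as written, this prints the Apples and Bread lines only. The DC888U10 line is skipped because the identifier must be followed by whitespace.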

If you are sure that your keywords cannot contain whitespace, an alternative approach is possible, inspired by Sinan Ünür's solution.

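my %keywords;
{
    my $keywords_file = shift;
    open my $fh, '<', $keywords_file;
    # Strip all whitespace (including the newline) from each identifier and
    # store it as a hash key; the /r flag needs Perl 5.14 or later
    @keywords{ map s/\s//gr, <$fh> } = (); #/ make syntax highlight happy
};

# Look up the first whitespace-delimited field of each line in the hash
while (<>) { print if /^(\S+)/ and exists $keywords{$1} }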
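
The hash variant can be sketched in self-contained form as well. The helper name lines_with_wanted_id and the sample data below are hypothetical and are not part of the original answer. The sketch only illustrates the idea: take the first field of each line and look it up in a hash, instead of scanning a list of identifiers.

```perl
use strict;
use warnings;

# Return the lines whose first whitespace-delimited field is a key of
# the %$wanted hash. Hypothetical helper, named for illustration only.
sub lines_with_wanted_id {
    my ($wanted, $lines) = @_;
    return grep { /^(\S+)/ && exists $wanted->{$1} } @$lines;
}

# Hash slice: only the keys matter, so every value is left undefined.
my %wanted;
@wanted{qw(DC888U1 DC888U3)} = ();

# Hypothetical sample lines standing in for the inventory file.
my @lines = (
    "DC888U1 Apples 0.99 75\n",
    "DC888U2 Oranges 0.75 1002\n",
    "DC888U3 Bread 1.35 100\n",
    "DC888U10 Milk 1.10 40\n",
);

my @hits = lines_with_wanted_id(\%wanted, \@lines);
print @hits;
```

Because the whole first field is compared against the hash keys, DC888U10 is not mistaken for DC888U1, and each lookup takes roughly constant time however many identifiers are loaded.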

For more on arrays - How can I process a large text file more efficiently with Perl?, see the similar question on Stack Overflow: https://stackoverflow.com/questions/31557392/
