Perl: counting non-printable characters

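I have 100,000 files to analyze. Specifically, I want to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of the files come from mainframes, Windows, Unix and so on, so they are likely to contain binary and control characters.

I started with the Linux "file" command, but it does not give enough detail for my purposes. The following code conveys what I am trying to do, but it does not always work.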

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

Here is a test invocation that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

Here is how I intend to call it; this works for a single file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

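Instead of executing the script once for each line returned by find, it executes once for all of the results.

Thanks in advance.

Research so far:

Pipes, xargs and separators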

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem

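Clarifications:
1.) Desired output: if a directory contains 932 files, the output should be a 932-line list giving, for each file, the file name, the total number of bytes read from it, and the percentage of printable characters.
2.) Many of the files are binary. The script needs to handle embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I only want to read the first/last xx bytes. I have been trying head -c 256 and tail -c 128 to read the first 256 bytes or the last 128 bytes respectively. A solution can either work in the pipeline or limit the bytes inside the Perl script.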

Best answer

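The -n option wraps your whole code in a while (defined($_ = <ARGV>)) { ... } block. This means that your my $cnt_print and the other variable declarations are repeated for every line of input, essentially resetting all of the variable values.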
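The effect can be reproduced without -n by writing the loop out by hand. This is only an illustrative sketch: the @lines array and the names $cnt and @seen are made up here to stand in for input read from <ARGV>.

```perl
use strict;
use warnings;

# Stand-in for the lines that -n would read from <ARGV> (hypothetical data).
my @lines = ("abc\n", "de\n", "f\n");

my @seen;
for my $line (@lines) {        # -n supplies a loop like this implicitly
    my $cnt = 0;               # declared inside the loop, so reset every iteration
    $cnt++ while $line =~ /[[:print:]]/g;
    push @seen, $cnt;          # record what the counter held for this line
}

print "@seen\n";               # prints "3 2 1": per-line counts, never a running total
```

Each value is the count for one line only, because the counter starts again from 0 on every pass through the loop.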

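A workaround is to use global variables (declare them with our if you want to keep using use strict) and not to initialize them to 0, since they would be reinitialized for every line of input. You could say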

    our $cnt_print //= 0;

if you don't want $cnt_print and its friends to be undefined for the first line of input.

See this recent question for a similar problem.
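For the per-file output described in the clarifications, one possible alternative is to drop -n and head altogether and let the script open each file named on its command line. The following is a rough sketch under stated assumptions, not the answerer's method: the sub name sample_stats, the $LIMIT of 2000 bytes (borrowed from the head -c 2000 calls in the question) and the name|bytes|fraction output format are all chosen here for illustration, and "printable" is assumed to mean the ASCII byte range 0x20 to 0x7E.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $LIMIT = 2000;    # assumed sample size, mirroring head -c 2000 in the question

# Return (bytes read, fraction printable) for the first $limit bytes of $path,
# or an empty list if the file cannot be opened or read.
sub sample_stats {
    my ($path, $limit) = @_;
    open my $fh, '<:raw', $path or return;   # raw mode: bytes are read untranslated
    my $buf = '';
    my $n = read($fh, $buf, $limit);         # fixed-size read, so embedded EOLs do not matter
    close $fh;
    return unless defined $n;                # read error
    return (0, 0) if $n == 0;                # empty file: avoid dividing by zero
    # tr/// with an empty replacement list counts matching bytes without changing $buf.
    my $printable = ($buf =~ tr/\x20-\x7E//);
    return ($n, $printable / $n);
}

# One output line per file name given on the command line.
for my $path (@ARGV) {
    my @stats = sample_stats($path, $LIMIT);
    unless (@stats) {
        warn "cannot read $path\n";
        next;
    }
    printf "%s|%d|%.4f\n", $path, @stats;
}
```

Assuming the script is saved under a hypothetical name such as count_printable.pl, it could be fed by the same find command, with xargs passing the file names as arguments instead of piping file contents:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 /home/user/scripts/prl/count_printable.pl

Because every file is handled independently, it does not matter if xargs splits a long file list across several invocations of the script.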

Regarding counting non-printable characters in Perl, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/13483290/
