perl - Why can't I use the map function in Perl to create a good hash from a simple data file?

Tags: perl hash byte-order-mark

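The post has been updated. If you have already read the posted question, please skip to the Solution section. Thanks!

Here is minimal code that demonstrates my problem:

The input data file used for testing was saved by Windows' built-in Notepad as UTF-8. It has the following three lines: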

abacus  æbәkәs
abalone æbәlәuni
abandon әbændәn

The Perl script file was also saved by Windows' built-in Notepad as UTF-8. It contains the following code:

#!perl -w

use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";

In the output, the hash looks fine:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

But it is actually not, because I only get two values instead of three:

æbәlәuni
әbændәn

Perl gives the following warning message:

Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.

Where is the problem? Can someone kindly explain? Thanks.

The Solution

Many thanks to all of you :) The culprit has finally been found and the problem is fixable :) As @Sinan insightfully pointed out, I'm now 100% sure that the culprit is the three-byte BOM (EF BB BF), which Notepad added to my data file when it saved the file as UTF-8 and which Perl does not strip automatically. Although many suggested that I should use "<:utf8" and ">:utf8" to read and write the files, those layers alone do not solve the problem: as the Note below shows, the BOM still ends up in the first hash key and the warning remains.

To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:

#!perl -w

use Data::Dumper;
use strict;
use autodie;

open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

Now the output is exactly what I expected:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };
æbәkәs
æbәlәuni
әbændәn

Please note that the script is saved as UTF-8, and the code does not need any :utf8 layers because the input file is already UTF-8 and its bytes are written to the output file unchanged.
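One caveat with `seek $in,3,0` is that it skips three bytes unconditionally, so a data file saved without a BOM would lose the start of its first key. The following is only a minimal sketch of a more defensive variant: the helper name `skip_utf8_bom` and the sample data are made up for illustration, and the file is read as raw bytes, as in the script above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: skip a UTF-8 BOM only if the file really starts with one.
sub skip_utf8_bom {
    my ($fh) = @_;
    my $got = read($fh, my $head, 3);
    # Rewind unless the first three bytes are exactly EF BB BF.
    seek($fh, 0, 0)
        unless defined $got && $got == 3 && $head eq "\xEF\xBB\xBF";
    return;
}

# Build a small sample file that starts with a BOM, as Notepad would write it.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF", "abacus\tone\n", "abalone\ttwo\n";
close $tmp;

open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in;
skip_utf8_bom($in);

# chomp each line so the values do not carry a trailing newline.
my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
close $in;

print exists $hash{abacus} ? "abacus found\n" : "abacus missing\n";
```

The `chomp` is an optional extra; it also removes the stray newlines that show up inside the Dumper output above.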

Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.

Note: To clarify a little more, if I use:

open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

The output looks like this:

$VAR1 = {
          'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
          'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
          "\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
        };
æbәlәuni
әbændәn

And there is a warning message:

Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.

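Best Answer

I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3, whereas it should be at line 4 after the last line has been read.

When I tried your code, I saved the input file using GVim, which is configured on my system to save as UTF-8, and I did not see the problem. Now that I have tried it with Notepad and looked at the output file, I see: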

"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"

where \x{feff} is the BOM.

In your Dumper output, there is a spurious blank before abacus (you did not specify :utf8 for the output handle).
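Because the BOM bytes are invisible in most editors, the dump can look correct even when the key is wrong. A small sketch of one way to make them visible, using Data::Dumper's `$Data::Dumper::Useqq` setting so that non-printable bytes are shown as escapes. The hash contents here are made up to mimic what the question's script reads from a Notepad-saved file:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq    = 1;   # escape non-printable and high-bit bytes
$Data::Dumper::Sortkeys = 1;   # stable key order for easier reading

# Sample data: the first key carries the three raw BOM bytes in front of it.
my %hash = (
    "\xEF\xBB\xBFabacus" => "one",
    "abalone"            => "two",
);

my $dump = Dumper(\%hash);
print $dump;

# The plain key is not there, which is why $hash{abacus} is undefined.
print "plain 'abacus' key exists: ",
    (exists $hash{abacus} ? "yes" : "no"), "\n";
```

With this setting the first key is printed with the BOM bytes as visible escapes in front of abacus rather than as something that looks like a blank.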

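As I originally mentioned (and it got lost in the countless edits of this post; thanks for the reminder, hobbs), specify '<:utf8' when you open the input file.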
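Decoding the input does not remove the BOM by itself; it only turns the three bytes into the single character \x{feff}, as the asker's Note shows. A minimal sketch, assuming you then strip that character from the first line yourself. The temporary file, the sample data, and the use of the stricter `:encoding(UTF-8)` layer in place of `:utf8` are my own choices for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a sample UTF-8 file with a BOM, as Notepad would (raw bytes).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} "\xEF\xBB\xBF",
    "abacus\t\xC3\xA6b\n",     # value is U+00E6 followed by "b"
    "abandon\t\xD3\x99b\n";    # value is U+04D9 followed by "b"
close $tmp;

# Decode while reading, so the BOM arrives as one character, U+FEFF.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my @lines = <$in>;
close $in;

# Strip the BOM from the first line only, then drop the line endings.
$lines[0] =~ s/\A\x{FEFF}// if @lines;
chomp @lines;

my %hash = map { split /\t/, $_, 2 } @lines;

print join(",", sort keys %hash), "\n";
```

If the values are printed afterwards, the output handle needs a matching encoding layer (for example '>:encoding(UTF-8)'), otherwise Perl may warn about wide characters.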

Source: a similar question found on Stack Overflow: https://stackoverflow.com/questions/1762977/
