json - Perl: Uncaught exception: malformed UTF-8 character in JSON string

Tags: json perl unicode utf-8

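Related to this question and this answer (to another question), I am still unable to process UTF-8 with JSON.

I have tried to make sure that all the required voodoo is invoked, following the recommendations of the very best experts, and as far as I can see the string is as valid, marked and labelled as UTF-8 as it can be. But perl still dies with either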

Uncaught exception: malformed UTF-8 character in JSON string

or
Uncaught exception: Wide character in subroutine entry

What am I doing wrong here?
(hlovdal) localhost:/work/2011/perl_unicode>cat json_malformed_utf8.pl 
#!/usr/bin/perl -w -CSAD

### BEGIN ###
# Apparently the very best perl unicode boiler template code that exist,
# https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
# Slightly modified.

use v5.12; # minimal for unicode string feature
#use v5.14; # optimal for unicode string feature

use utf8;                                                 # Declare that this source unit is encoded as UTF‑8. Although
                                                          # once upon a time this pragma did other things, it now serves
                                                          # this one singular purpose alone and no other.
use strict;
use autodie;

use warnings;                                             # Enable warnings, since the previous declaration only enables
use warnings    qw< FATAL  utf8     >;                    # strictures and features, not warnings. I also suggest
                                                          # promoting Unicode warnings into exceptions, so use both
                                                          # these lines, not just one of them. 

use open        qw( :encoding(UTF-8) :std );              # Declare that anything that opens a filehandles within this
                                                          # lexical scope but not elsewhere is to assume that that
                                                          # stream is encoded in UTF‑8 unless you tell it otherwise.
                                                          # That way you do not affect other module’s or other program’s code.

use charnames   qw< :full >;                              # Enable named characters via \N{CHARNAME}.
use feature     qw< unicode_strings >;

use Carp                qw< carp croak confess cluck >;
use Encode              qw< encode decode >;
use Unicode::Normalize  qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) { 
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$| = 1;

binmode(DATA, ":encoding(UTF-8)");                        # If you have a DATA handle, you must explicitly set its encoding.

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stackdumped exceptions
#   *unless* we're in an try block, in which 
#   case just generate a clucking stackdump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" } 
    else     { confess "Deadly warning: @_"  }
};

### END ###


use JSON;
use Encode;

use Getopt::Long;
use Encode;

my $use_nfd = 0;
my $use_water = 0;
GetOptions("nfd" => \$use_nfd, "water" => \$use_water );

print "JSON->backend->is_pp = ", JSON->backend->is_pp, ", JSON->backend->is_xs = ", JSON->backend->is_xs, "\n";

sub check {
        my $text = shift;
        return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}

my $json_text = "{ \"my_test\" : \"hei på deg\" }\n";
if ($use_water) {
        $json_text = "{ \"water\" : \"水\" }\n";
}
if ($use_nfd) {
        $json_text = NFD($json_text);
}

print check($json_text), "\$json_text = $json_text";

# test from perluniintro(1)
if (eval { decode_utf8($json_text, Encode::FB_CROAK); 1 }) {
        print "string is valid utf8\n";
} else {
        print "string is not valid utf8\n";
}

my $hash_ref1 = JSON->new->utf8->decode($json_text);
my $hash_ref2 = decode_json( $json_text );

__END__

Running this gives
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl 
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei på deg" }
string is valid utf8
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl | ./uniquote 
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei p\N{U+E5} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -nfd | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei pa\N{U+30A} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water 
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "水" }
string is valid utf8
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water --nfd | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>rpm -q perl perl-JSON perl-JSON-XS
perl-5.12.4-159.fc15.x86_64
perl-JSON-2.51-1.fc15.noarch
perl-JSON-XS-2.30-2.fc15.x86_64
(hlovdal) localhost:/work/2011/perl_unicode>

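uniquote is from http://training.perl.com/scripts/uniquote

Update:

Thanks to brian for highlighting the solution. Updating the source to use json_text for all normal strings and json_bytes for whatever is going to be passed to JSON, as shown below, now works as expected: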
my $json_bytes = encode('UTF-8', $json_text);
my $hash_ref1 = JSON->new->utf8->decode($json_bytes);
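The following is a minimal sketch of why that change helps, not a definitive reference. It assumes the core module JSON::PP as a stand-in for the JSON/JSON::XS pair used in the question (its decode interface is the same, but exact error messages may differ between backends). It shows that a decoder with ->utf8 enabled rejects character strings, and that either encoding to bytes first or leaving ->utf8 off gives the same result.

```perl
use strict;
use warnings;
use utf8;                         # literals below are character strings
use Encode qw(encode);
use JSON::PP;                     # assumption: stands in for JSON / JSON::XS

my @samples = (
    qq({ "my_test" : "hei på deg" }),
    qq({ "water" : "水" }),
);

my @results;
for my $json_text (@samples) {
    # Characters handed to a ->utf8 decoder: this is the failing call.
    my $chars_ok = eval { JSON::PP->new->utf8->decode($json_text); 1 } ? 1 : 0;

    # Fix 1: encode to UTF-8 bytes first, as in the update above.
    my $json_bytes = encode('UTF-8', $json_text);
    my $from_bytes = decode_json($json_bytes);

    # Fix 2: keep the characters and leave ->utf8 switched off.
    my $from_chars = JSON::PP->new->decode($json_text);

    my ($key) = keys %$from_bytes;
    my $same  = $from_bytes->{$key} eq $from_chars->{$key} ? 1 : 0;
    push @results, [$chars_ok, $same];
    print "characters into ->utf8 decoder ok: $chars_ok, both fixes agree: $same\n";
}
```

Which fix to pick depends on where the JSON comes from: data read as raw bytes from a file or socket suits decode_json, while a string that has already been decoded to characters suits a decoder without ->utf8.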

I must say that I find the documentation of the JSON module extremely unclear and partially misleading.

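The word "text" (at least to me) implies a string of characters.
So when reading $perl_scalar = decode_json $json_text I have an
expectation that json_text is a UTF-8 encoded string of characters.
Thoroughly re-reading the documentation, knowing what to look for,
I now see that it says: "decode_json ... expects an UTF-8 (binary) string and tries to parse
that as an UTF-8 encoded JSON text", but in my opinion that is still not clear.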

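Coming from a background of using a language with some additional non-ASCII
characters, I remember the days when you had to guess which code
page was in use, when email used to cripple text by stripping
the 8th bit, and so on. "Binary" in the context of strings meant strings
containing characters outside the 7-bit ASCII domain. But what is
"binary" really? Aren't all strings binary at the core level?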
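As a rough illustration of the distinction the documentation is drawing, the sketch below contrasts a character string with its UTF-8 encoded byte string using only the core Encode module. The variable names are hypothetical; the sample text is the one from the question.

```perl
use strict;
use warnings;
use utf8;                                  # the literal below is a character string
use Encode qw(encode decode);

my $chars = "hei på deg";                  # 10 characters
my $bytes = encode('UTF-8', $chars);       # 11 bytes: å becomes 0xC3 0xA5

my $char_count = length $chars;
my $byte_count = length $bytes;
my $round_trip = decode('UTF-8', $bytes) eq $chars ? "yes" : "no";

printf "characters: %d\n", $char_count;
printf "bytes:      %d\n", $byte_count;
printf "round trip: %s\n", $round_trip;
```

In this sense a "binary" string is simply one whose elements are to be read as bytes (values 0 to 255) rather than as Unicode code points, and decode_json wants the former.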

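The documentation also says "simple and fast interfaces (expect/generate UTF-8)" and "correct unicode handling" as the first point under "Features", both without mentioning anywhere nearby that it wants a byte sequence rather than a string. I will ask
the author to at least make this clearer.

Best answer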

If your data contains malformed UTF-8 characters, you can remove them as follows (assuming the data is in data.txt):

iconv -f utf-8 -t utf-8 -c < data.txt > clean-data.txt
The -c option of iconv silently discards all malformed characters.
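A comparable clean-up can be sketched in Perl itself. This is an illustration under stated assumptions rather than an exact equivalent of iconv -c: strip_malformed_utf8 is a hypothetical helper, it relies on Encode's default behaviour of substituting U+FFFD for malformed input, and it therefore also removes any U+FFFD that was legitimately present in the data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes a byte string, returns valid UTF-8 bytes.
sub strip_malformed_utf8 {
    my ($bytes) = @_;
    # With no CHECK argument, malformed sequences become U+FFFD.
    my $text = decode('UTF-8', $bytes);
    # Drop the substitution characters (and any genuine U+FFFD as well).
    $text =~ s/\x{FFFD}//g;
    return encode('UTF-8', $text);
}

my $dirty = "hei p\xE5 deg";          # Latin-1 byte 0xE5 is not valid UTF-8 here
my $clean = strip_malformed_utf8($dirty);
print "$clean\n";                     # prints "hei p deg"
```

Note that this only helps when the input really is a damaged byte stream; it does nothing for a character string passed to a byte-oriented decoder.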

Source: this question on Stack Overflow, "Perl: Uncaught exception: malformed UTF-8 character in JSON string": https://stackoverflow.com/questions/6905164/