perl - Text::SpellChecker module and Unicode

Tags: perl unicode utf-8 perl-module

#!/usr/local/bin/perl
use strict;
use warnings;

use Text::SpellChecker;

my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );

while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}

Output: Bad word is rdinator
Desired: Bad word is coördinator
The module breaks if $text contains Unicode. Any idea how to fix this?

I have Aspell 0.50.5 installed, which is what the module is using. I think that might be the culprit.

Edit: Since Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell and Text::Hunspell, then ran:

$ hunspell -d en_US -l < badword.txt
coördinator

This shows the correct result, which means the problem is in either my code or Text::SpellChecker.

Taking Miller's suggestion into account, I tried the following:

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text =  "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}

Output:

Flag is 1
Text is coördinator
Bad word is rdinator

Does this mean the module cannot handle UTF-8 characters properly?

Best answer

This is a bug in Text::SpellChecker: the current version assumes that words contain only ASCII letters.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

#
# next_word
# 
# Get the next misspelled word. 
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {

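IMHO the best fix would be to use a word-splitting regex per language/locale, or to leave word splitting to the underlying library. aspell list reports coördinator as a single word.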
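To make the effect of that pattern concrete, here is a minimal sketch that runs the ASCII-only regex from next_word next to a Unicode-aware variant. The helper names ascii_words and unicode_words are hypothetical, and the \p{L}/\p{M} pattern is only one possible replacement, not what the module ships. It also assumes the text is already a decoded character string (here via use utf8).

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8 encoded

# Tokenize with the ASCII-only pattern quoted from the module source.
sub ascii_words {
    my ($text) = @_;
    my @words;
    while ( $text =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g ) {
        push @words, $1;
    }
    return @words;
}

# Hypothetical Unicode-aware variant (an assumption, not the module's code):
# \p{L} matches any Unicode letter, \p{M} any combining mark.
sub unicode_words {
    my ($text) = @_;
    my @words;
    while ( $text =~ m/([\p{L}\p{M}]+(?:'[\p{L}\p{M}]+)?)/g ) {
        push @words, $1;
    }
    return @words;
}

binmode STDOUT, ':encoding(UTF-8)';
my $text = "coördinator";
print join( '|', ascii_words($text) ),   "\n";    # co|rdinator
print join( '|', unicode_words($text) ), "\n";    # coördinator
```

The ASCII pattern splits the word at ö into co and rdinator. Presumably co passes the dictionary check, which would explain why only rdinator is reported as misspelled. A letter class like \p{L} still encodes no language-specific rules, which is why handing tokenization to the underlying library may be the more robust option.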

Source: a similar question about perl - Text::SpellChecker module and Unicode on Stack Overflow: https://stackoverflow.com/questions/26707917/
