perl - Using "sort" with utf8 strings in Perl

Tags: perl utf-8

I'm trying to work out how to sort an array alphabetically in Perl. Here is what I have, which works fine for English:

    # List of countries (kept like this to keep it clean, as it's re-used in other places)
    my $countries = {
        'AT' => "íAustria",
        'AU' => "Australia",
        'BE' => "Belgium",
        'BG' => "Bulgaria",
        'CA' => "Canada",
        'CY' => "Cyprus",
        'CZ' => "Czech Republic",
        'DK' => "Denmark",
        'EN' => "England",
        'EE' => "Estonia",
        'FI' => "Finland",
        'FR' => "France",
        'DE' => "Germany",
        'GB' => "Great Britain",
        'GR' => "Greece",
        'HU' => "Hungary",
        'IE' => "Ireland",
        'IT' => "Italy",
        'LV' => "Latvia",
        'LT' => "Lithuania",
        'LU' => "Luxembourg",
        'MT' => "Malta",
        'NZ' => "New Zealand",
        'NL' => "Netherlands",
        'PL' => "Poland",
        'PT' => "Portugal",
        'RO' => "Romania",
        'SK' => "Slovakia",
        'SI' => "Slovenia",
        'ES' => "Spain",
        'SE' => "Sweden",
        'CH' => "Switzerland",
        'SC' => "Scotland",
        'UK' => "United Kingdom",
        'US' => "USA",
        'TK' => "Turkey",
        'NO' => "Norway",
        'MX' => "Mexico",
        'IL' => "Israel",
        'IN' => "India",
        'IS' => "Iceland",
        'CN' => "China",
        'JP' => "Japan",
        'VN' => "áVietnamí"
    };
    # Populate the original loop with "name" and "code"
    my @country_loop_orig;
    print $IN->header;
    foreach (keys %{$countries}) {
      push @country_loop_orig, {
        name => $countries->{$_},   # $countries is a flat code => name hash
        code => $_
      };
    }

   # sort it alphabetically
   my @country_loop = sort { lc($a->{name}) cmp lc($b->{name})  } @country_loop_orig;

This works for the English version:

Australia
Austria
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
Vietnam

...but when you try it with utf8 characters such as íéó, it doesn't work:

Australia
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
áVietnam
íAustria

How do you do this? I found Sort::Naturally::XS, but couldn't get it to work.

Best answer

Unicode::Collate should help with this.

A simple example that sorts the last list, read from a file:

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

open my $fh, '<', "country_list.txt" or die "Can't open country_list.txt: $!";
my @list = <$fh>;
chomp @list;

my $uc  = Unicode::Collate->new();
my @sorted = $uc->sort(@list);

say for @sorted;

However, in some languages non-ASCII characters may have a very specific accepted position, and the question gives no details. In that case Unicode::Collate::Locale may help.

See (study) this perl.com article, this post (T. Christiansen), and this Effective Perler article.
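
As a rough illustration of why a locale can matter, here is a minimal sketch comparing the default collation with a Swedish tailoring. The word list and the choice of the `sv` locale are assumptions made for the example and are not taken from the question. It relies on `use utf8` because the strings are literals in the source file.

```perl
use strict;
use warnings;
use utf8;                                 # the literals below are UTF-8 in the source
use open ":std", ":encoding(UTF-8)";      # encode output as UTF-8

use Unicode::Collate;
use Unicode::Collate::Locale;

# Hypothetical sample data, chosen only to show the difference
my @words = ("zon", "öst", "ost");

# Default Unicode collation: "ö" differs from "o" only at the accent level
my @default = Unicode::Collate->new->sort(@words);

# Swedish tailoring: "ö" is a separate letter that sorts after "z"
my @swedish = Unicode::Collate::Locale->new(locale => 'sv')->sort(@words);

print "default: @default\n";
print "swedish: @swedish\n";
```

With the default collator, "öst" lands next to "ost"; with the Swedish collator it moves to the end of the list, after "zon".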


If the data to be sorted is a complex data structure, the cmp method can be used for the individual comparisons inside sort:

my @sorted = sort { $uc->cmp($a, $b) } @list;

For $a and $b, you would extract whatever needs to be compared from the complex data structure.
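
Applied to the array of hashes from the question, this might look like the sketch below. It uses a shortened version of the country hash and assumes the script is saved as UTF-8 with `use utf8` enabled, so that the literals are decoded into characters before they are compared.

```perl
use strict;
use warnings;
use utf8;                                 # the literals below are UTF-8 in the source
use open ":std", ":encoding(UTF-8)";      # encode output as UTF-8

use Unicode::Collate;

# Shortened version of the question's hash (code => name)
my $countries = {
    'AT' => "íAustria",
    'AU' => "Australia",
    'BE' => "Belgium",
    'VN' => "áVietnamí",
};

# Build the list of { name, code } records, as in the question
my @country_loop_orig =
    map { +{ name => $countries->{$_}, code => $_ } } keys %$countries;

# Compare the "name" field of each record with the collator
my $uc = Unicode::Collate->new;
my @country_loop =
    sort { $uc->cmp($a->{name}, $b->{name}) } @country_loop_orig;

print "$_->{code}\t$_->{name}\n" for @country_loop;
```

Here the accented names sort alongside their base letters (Australia, áVietnamí, Belgium, íAustria) instead of being pushed to the end of the list.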

Regarding perl - Using "sort" with utf8 strings in Perl, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46617113/