postgresql - Is there a multibyte-aware Postgresql Levenshtein?

When I use the fuzzystrmatch levenshtein function with diacritics, it returns wrong, multibyte-ignorant results:

select levenshtein('ą', 'x');
levenshtein 
-------------
       2

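(Note: the first character is an 'a' with a diacritic below it; it does not display correctly after I copied it here.)

The fuzzystrmatch documentation ( https://www.postgresql.org/docs/9.1/fuzzystrmatch.html ) warns: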

At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).

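But since it does not name the levenshtein function, I wonder whether there is a multibyte-aware version of levenshtein.

I know I could use the unaccent function as a workaround, but I need to keep the diacritics.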

Best answer

Note: This solution was suggested by @Nick Barnes in his answer to a related question.

The 'a' with a diacritic is a character sequence, i.e. the letter a followed by a combining character, the diacritic ̨ : E'a\u0328'

There is an equivalent precomposed character ą: E'\u0105'
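
To make the difference between the two forms concrete, here is a minimal Python sketch that lists the code points in each string. The helper name describe is made up for this illustration; only the standard unicodedata module is used.

```python
import unicodedata

decomposed = "a\u0328"   # letter a followed by a combining character
precomposed = "\u0105"   # single precomposed character

def describe(s):
    """Return (code point, Unicode name) pairs for each character in s.

    Hypothetical helper, written only for this illustration.
    """
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in s]

print(describe(decomposed))
print(describe(precomposed))

# The two strings render the same but differ in length when
# counted in code points.
print(len(decomposed), len(precomposed))
```

Both strings look identical on screen, yet the first contains two code points and the second only one, which is why a per-character edit distance treats them differently.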

The solution is to normalise the Unicode strings, i.e. to convert combining character sequences into precomposed characters before comparing them.
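
As a rough sketch of what normalisation does, the following plain Python snippet (run outside the database) composes and decomposes the character with unicodedata.normalize. The wrapper name to_nfc is an assumption made for this example.

```python
import unicodedata

def to_nfc(s):
    """Compose combining sequences into precomposed characters (NFC).

    Hypothetical wrapper, named only for this illustration.
    """
    return unicodedata.normalize("NFC", s)

decomposed = "a\u0328"
precomposed = "\u0105"

# Without normalisation the two spellings compare as different strings.
print(decomposed == precomposed)

# After NFC normalisation they are the same string.
print(to_nfc(decomposed) == precomposed)

# NFD goes the other way and splits the precomposed character again.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Normalising both inputs to the same form before comparing is what matters; NFC is the natural choice here because it yields one code point per accented letter wherever a precomposed character exists.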

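Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.

For example: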

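-- plpythonu is the Python 2 variant of PL/Python
create extension plpythonu;

create or replace function unicode_normalize(str text) returns text as $$
  # Under Python 2 the text argument arrives as a byte string,
  # so decode it first (this assumes a UTF-8 database encoding).
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;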

Now, since unicode_normalize maps the character sequence E'a\u0328' to the equivalent precomposed character E'\u0105', the edit distance is correct:

select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein
-------------
           1
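
The same effect can be reproduced outside the database. The sketch below is a textbook dynamic-programming Levenshtein over Unicode code points with unit costs; it is an illustration only, not the fuzzystrmatch implementation, and the function name levenshtein is chosen merely to mirror the SQL call.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance between a and b, counted in code points.

    Illustrative sketch with unit costs for insert, delete and
    substitute; not the fuzzystrmatch source code.
    """
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

raw = "a\u0328"
normalized = unicodedata.normalize("NFC", raw)

print(levenshtein(raw, "x"))         # 2: substitute a, delete the combining mark
print(levenshtein(normalized, "x"))  # 1: a single substitution
```

The unnormalised input costs two edits because the combining mark counts as its own character, which matches the result reported in the question.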

Regarding "postgresql - Is there a multibyte-aware Postgresql Levenshtein?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56676187/
