javascript - In which cases does the normalize('NFKC') method work?

Tags: javascript unicode normalization

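I tried the normalize('NFKC') method with various characters, but it does not seem to work. The same cannot be said of NFC: whenever possible, normalize('NFC') replaces multiple code points with a single code point. For example: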

let t1 = `\u00F4`; //ô
let t2 = `\u006F\u0302`; //ô
console.log(t2.normalize('NFC') == t1); //true

Here is an NFKC example, which never works:

let s1 = '\uFB00'; //"ﬀ" (single ligature code point)
let s2 = '\u0066\u0066'; //"ff"
console.log(s2.normalize('NFKC') == s1); //false

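I used to think that NFKC replaces multiple code points with a single code point that represents a compatibility character. Put simply, I expected NFKC to replace \u0066\u0066 with \uFB00.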

If NFKC does not work that way, then... how does it work?

Best answer

The point is that NFKC (like NFKD) normalizes under both compatibility equivalence and canonical equivalence.

The Unicode Standard says:

The type of full decomposition chosen depends on which Unicode Normalization Form is involved. For NFC or NFD, one does a full canonical decomposition, which makes use of only canonical Decomposition_Mapping values. For NFKC or NFKD, one does a full compatibility decomposition, which makes use of canonical and compatibility Decomposition_Mapping values.

This makes sense, because as MDN says:

All canonically equivalent sequences are also compatible, but not vice versa.

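It is also worth noting that NFKC handles canonical equivalence and compatibility equivalence differently. For canonically equivalent sequences, NFKC produces the same result as NFC. For example: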

//"ô" (U+00F4) -> "o" (U+006F) + " ̂" (U+0302) -> "ô" (U+00F4)
let c1 = `\u006F\u0302`; //ô
console.log(c1.normalize('NFKC').length); //1
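
To make the comparison with NFC explicit, here is a minimal sketch that prints the code points produced by both forms for the same decomposed input. The helper name codePoints is an assumption introduced only for this illustration; it is not part of any API.

```javascript
// Hypothetical helper (not a built-in): list a string's code points as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const decomposed = '\u006F\u0302'; // "o" followed by a combining circumflex

// Both forms decompose canonically and then recompose the same way.
const nfcPoints = codePoints(decomposed.normalize('NFC'));
const nfkcPoints = codePoints(decomposed.normalize('NFKC'));

console.log(nfcPoints);  // [ 'U+00F4' ]
console.log(nfkcPoints); // [ 'U+00F4' ]
```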

Compatibility normalization with this form, however, works differently. The spec says:

Normalization Form KC does not attempt to map character sequences to compatibility composites. For example, a compatibility composition of “office” does not produce “o\uFB03ce”, even though “\uFB03” is a character that is the compatibility equivalent of the sequence of three characters “ffi”. In other words, the composition phase of NFC and NFKC are the same—only their decomposition phase differs, with NFKC applying compatibility decompositions.

For example:

//"ﬀ" (U+FB00) -> "f" (U+0066) + "f" (U+0066) -> "f" (U+0066) + "f" (U+0066)
let c2 = '\u0066\u0066'; //ff
console.log(c2.normalize('NFKC').length); //2
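
The one-way nature of compatibility normalization can be checked directly. The following is a sketch rather than a definitive recipe: it runs the ligature through all four forms and then replays the "office" example from the quoted spec text. The names toHex, resultsByForm and sameText are assumptions made up for this illustration.

```javascript
// Hypothetical helper (not a built-in): space-separated hex code points of a string.
const toHex = str => Array.from(str, ch =>
  ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')).join(' ');

const ligature = '\uFB00'; // "ﬀ", LATIN SMALL LIGATURE FF

// Only the K forms apply the compatibility decomposition.
const resultsByForm = {};
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  resultsByForm[form] = toHex(ligature.normalize(form));
}
console.log(resultsByForm);
// { NFC: 'FB00', NFD: 'FB00', NFKC: '0066 0066', NFKD: '0066 0066' }

// The ligature is taken apart, but plain letters are never fused into one.
console.log('o\uFB03ce'.normalize('NFKC') === 'office'); // true
console.log('office'.normalize('NFKC') === 'o\uFB03ce'); // false

// To treat both spellings as equal, normalize both sides before comparing.
const sameText = (a, b) => a.normalize('NFKC') === b.normalize('NFKC');
console.log(sameText('\uFB00', '\u0066\u0066')); // true
```

So the comparison in the question fails because it expects NFKC to produce the ligature; normalizing s1 instead of s2 (or both) gives a match.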

Regarding "javascript - In which cases does the normalize('NFKC') method work?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/69058397/
