unicode - Why were the code points in the range U+D800 to U+DFFF removed from the Unicode character set?

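I'm learning about the UTF-16 encoding, and I've read that to represent code points in the range U+10000 to U+10FFFF you have to use surrogate pairs, which are taken from the range U+D800 to U+DFFF.

So, suppose I want to encode the following code point: U+10123 (10000000100100011 in binary):

First, I lay out the bits in this pattern: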

110110xxxxxxxxxx 110111xxxxxxxxxx

Then I fill the x positions with the binary form of the code point:

1101100001000000 1101110100100011 (D840 DD23 in hexadecimal)
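
For reference, the standard UTF-16 encoding step can be sketched in Python as below. The name `to_surrogate_pair` is a hypothetical helper, not a library function. Note that the standard algorithm subtracts 0x10000 before splitting the bits, so for U+10123 it yields D800 DD23. The D840 DD23 computed above skips that subtraction and is what U+20123 encodes to.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    v = cp - 0x10000              # 20-bit value in the range 0x00000..0xFFFFF
    high = 0xD800 | (v >> 10)     # 110110xxxxxxxxxx: top 10 bits
    low = 0xDC00 | (v & 0x3FF)    # 110111xxxxxxxxxx: bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10123)
print(f"{high:04X} {low:04X}")  # D800 DD23

# Cross-check against Python's built-in big-endian UTF-16 codec.
print(chr(0x10123).encode("utf-16-be").hex().upper())  # D800DD23
```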

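I've also read that the code points in the range U+D800 to U+DFFF were removed from the Unicode character set, but I don't understand why this range was removed!

I mean, this range could easily be encoded in 4 bytes. For example, here is a UTF-16 encoding of the code point U+D812 (1101100000010010 in binary):

1101100000110110 1101110000010010 (D836 DC12 in hexadecimal)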
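
A decoder that follows the standard would not read D836 DC12 as U+D812. Below is a minimal sketch of the decoding step, with `from_surrogate_pair` as a hypothetical helper name. Because 0x10000 is added back, every surrogate pair maps to a code point at or above U+10000, and this particular pair is the encoding of U+1D812.

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high (leading) surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low (trailing) surrogate")
    # Take 10 payload bits from each unit, then add back the 0x10000 offset.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = from_surrogate_pair(0xD836, 0xDC12)
print(f"U+{cp:04X}")  # U+1D812

# Python's built-in codec agrees with the helper.
decoded = bytes.fromhex("D836DC12").decode("utf-16-be")
print(f"U+{ord(decoded):04X}")  # U+1D812
```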

Note: I'm using UTF-16 Big Endian in my examples.

Best answer

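The code points U+D800 to U+DFFF are reserved specifically for use by UTF-16. Because they lie outside the range U+10000 to U+10FFFF, UTF-16 does not encode them individually with surrogate pairs, so there is no way to represent these code points on their own in a UTF-16 sequence.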
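
One way to observe this is with Python's built-in codecs, which refuse a lone surrogate by default in both directions. The function names `can_encode_utf16` and `can_decode_utf16` below are hypothetical helpers written only for this sketch.

```python
def can_encode_utf16(text: str) -> bool:
    """Return True if the strict UTF-16 (big-endian) codec accepts the string."""
    try:
        text.encode("utf-16-be")
        return True
    except UnicodeEncodeError:
        return False


def can_decode_utf16(data: bytes) -> bool:
    """Return True if the strict UTF-16 (big-endian) codec accepts the bytes."""
    try:
        data.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# A Python str can hold a lone surrogate code point, but it cannot be encoded.
print(can_encode_utf16("\ud812"))         # False
# A supplementary character is encoded as a surrogate pair without complaint.
print(can_encode_utf16(chr(0x10123)))     # True
# A high surrogate followed by 'A' (0041) is not well-formed UTF-16.
print(can_decode_utf16(b"\xd8\x12\x00\x41"))  # False
```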

According to the Unicode.org UTF-16 FAQ:

1:Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.

2:Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
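
The pairing rule quoted above can be sketched as a scan over 16-bit code units. This is an illustrative check under the assumption that the input is already a sequence of integers in the range 0 to 0xFFFF. `find_unpaired_surrogates` is a hypothetical name, not part of any library.

```python
from typing import List, Sequence


def find_unpaired_surrogates(units: Sequence[int]) -> List[int]:
    """Return the indexes of code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate reached here was not preceded by a high surrogate.
            bad.append(i)
        i += 1
    return bad


print(find_unpaired_surrogates([0xD800, 0xDD23]))          # []
print(find_unpaired_surrogates([0xD812]))                  # [0]
print(find_unpaired_surrogates([0x0041, 0xDC12, 0xD836]))  # [1, 2]
```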

Regarding "unicode - Why were the code points in the range U+D800 to U+DFFF removed from the Unicode character set?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40184882/
