unicode - Why are Unicode code points always written with at least 2 bytes?

Why are Unicode code points always written with 2 bytes (4 hex digits), even when that is not necessary?

$ -> U+0024
¢ -> U+00A2

Best answer

TL;DR This is simply a convention of the Unicode Consortium.

This is the formal definition, which can be found in Appendix A: Notational Conventions of the Unicode standard (I've referenced the latest at this time, version 11):

In running text, an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
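As a minimal sketch of that convention (the helper name `format_code_point` is made up for illustration, not something the standard defines), zero-padding to a minimum width of four uppercase hex digits reproduces the notation:

```python
def format_code_point(cp: int) -> str:
    """Format an integer code point in U+n notation (four to six hex digits)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # 04X pads with leading zeros up to four digits; longer values are left as-is.
    return f"U+{cp:04X}"


for ch in "$¢€😀":
    print(ch, "->", format_code_point(ord(ch)))
# $ -> U+0024
# ¢ -> U+00A2
# € -> U+20AC
# 😀 -> U+1F600
```

The padding is purely a display width; it says nothing about how many bytes any encoding uses for the character.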

<hr/>

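They are hexadecimal digits that represent the numeric value of a Unicode code point. Originally only the first plane, called the "Basic Multilingual Plane", was defined, covering the range U+0000 to U+FFFF. So initially the U+ notation always had exactly 4 hexadecimal digits.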

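However, this allowed only 64 Ki (65,536) code points for characters (not counting some reserved values). So the single plane was later extended to 17 planes. For values of U+10000 and above, leading zeros are suppressed, so the first code point after the BMP is written U+10000, not U+010000. There are currently 17 planes of 64 Ki code points each (some of which may be reserved), starting at U+0000, U+10000, ..., U+F0000 and finally U+100000.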
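The plane arithmetic can be checked with a short sketch. The names `PLANE_SIZE`, `MAX_CODE_POINT` and `plane_of` are illustrative choices, not terms from the standard:

```python
PLANE_SIZE = 0x10000        # 64 Ki = 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point of the last plane


def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) that a code point belongs to."""
    return cp >> 16


# First code point of each plane, written in U+n notation.
plane_starts = [f"U+{p * PLANE_SIZE:04X}" for p in range(17)]

print((MAX_CODE_POINT + 1) // PLANE_SIZE)  # 17
print(plane_starts[0], plane_starts[1], plane_starts[16])  # U+0000 U+10000 U+100000
print(plane_of(0x00B1), plane_of(0x1F600))  # 0 1
```

Only plane 0 starts with a value short enough to need padding, which is why four digits is the minimum and six the maximum.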

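The U+xxxx notation does not follow the UTF-8 encoding. Nor does it follow UTF-16, UTF-32, or the deprecated UCS encodings (whether big-endian or little-endian). However, for characters within the Basic Multilingual Plane the digits are the same as the character's UTF-16(BE) encoding written in hexadecimal. Note that UTF-16 may contain surrogate code units, which are used as an escape mechanism to encode characters in the other planes. The ranges of these code units are not mapped to characters, so they do not appear in the textual code point notation for a character.

For example, take the plus-minus sign, ±: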

Unicode code point: U+00B1 (as a textual string)
UTF-8             : 0xC2 0xB1 (as two bytes)
UTF-16            : 0x00B1
UTF-16BE          : 0x00B1 as 0x00 0xB1 (as two bytes)
UTF-16LE          : 0x00B1 as 0xB1 0x00 (as two bytes)

https://www.fileformat.info/info/unicode/char/00b1/index.htm
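The byte values above can be reproduced with Python's built-in codecs. This is only a sketch: the helper `hex_bytes` is a made-up name, and the emoji U+1F600 is an added example (not from the answer) to show a surrogate pair for a character outside the BMP:

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in data)


for ch in ("±", "😀"):
    print(f"U+{ord(ch):04X}")
    for enc in ("utf-8", "utf-16-be", "utf-16-le"):
        print(f"  {enc:<9}: {hex_bytes(ch.encode(enc))}")
# U+00B1
#   utf-8    : 0xC2 0xB1
#   utf-16-be: 0x00 0xB1
#   utf-16-le: 0xB1 0x00
# U+1F600
#   utf-8    : 0xF0 0x9F 0x98 0x80
#   utf-16-be: 0xD8 0x3D 0xDE 0x00
#   utf-16-le: 0x3D 0xD8 0x00 0xDE
```

For ± the UTF-16BE bytes match the digits of U+00B1, while for U+1F600 the surrogate code units 0xD83D 0xDE00 look nothing like the code point's digits.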

<hr/>

Most of this information can be found at sil.org.

Regarding "unicode - Why are Unicode code points always written with at least 2 bytes?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52042056/
