unicode - If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

Tags: unicode encoding utf-8

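The Unicode site says that UTF-8 can be represented with 1-4 bytes. From this question, https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings, I understood that UTF-8 is an 8-bit encoding.
So which is true?
If it is an 8-bit encoding, what is the difference between ASCII and UTF-8?
If it is not, why is it called UTF-8, and why do we need UTF-16 and the others if they all take up the same amount of memory?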

Best answer

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky - Wednesday, October 08, 2003

An excerpt from the above:

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
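To make the "8 bit bytes" point concrete: the 8 in UTF-8 is the size of one code unit, not the size of a whole character. A code point takes between one and four of those 8-bit units. The excerpt's "up to 6 bytes" reflects the older design, while the current definition (RFC 3629) stops at 4, which matches the 1-4 bytes mentioned in the question. Here is a minimal Python sketch. The sample characters and the helper name `utf8_length` are illustrative choices, not part of the article.

```python
# Sketch: how many 8-bit code units UTF-8 needs for different code points.
# The sample characters and the helper name are illustrative assumptions.

def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

samples = {
    "A": "ASCII range",
    "é": "Latin-1 Supplement",
    "€": "elsewhere in the Basic Multilingual Plane",
    "😀": "outside the Basic Multilingual Plane",
}

for ch, note in samples.items():
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} ({note}): {len(encoded)} byte(s): {hex_bytes}")

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
hello_utf8 = "Hello".encode("utf-8")
hello_ascii = "Hello".encode("ascii")
assert hello_utf8 == hello_ascii == bytes.fromhex("48656C6C6F")
```

So ASCII and UTF-8 agree on code points 0-127 and differ only in that UTF-8 can continue past them by using more bytes.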

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
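The byte-order issue can be seen directly with Python's built-in codecs. The sketch below encodes the same string as big-endian and little-endian UTF-16, and also shows that a code point above U+FFFF takes two 16-bit units (a surrogate pair) in UTF-16, so UTF-16 is not fixed-width either. The variable names are illustrative.

```python
# Sketch: UTF-16 byte order and surrogate pairs, using Python's built-in codecs.
text = "Hello"

big_endian = text.encode("utf-16-be")     # high byte first: 00 48 00 65 ...
little_endian = text.encode("utf-16-le")  # low byte first:  48 00 65 00 ...
with_bom = text.encode("utf-16")          # native byte order, prefixed with a byte order mark

assert big_endian[:2] == b"\x00\x48"
assert little_endian[:2] == b"\x48\x00"
# The byte order mark tells a decoder which order follows.
assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")

# A code point outside the Basic Multilingual Plane needs two 16-bit units.
emoji_utf16 = "😀".encode("utf-16-be")
assert len(emoji_utf16) == 4
print(" ".join(f"{b:02X}" for b in emoji_utf16))
```

This is also why the encodings do not all "take up the same memory": the same text has a different size under each of them.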

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
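As a rough illustration of those two properties, the following sketch uses Python's `utf-7` and `utf-32-le` codecs (UTF-32 is the modern name for the 4-bytes-per-code-point form). It checks that UTF-7 output never sets the high bit and that UTF-32 spends exactly four bytes on every code point. The sample strings are arbitrary.

```python
# Sketch: UTF-7 stays within 7 bits; UTF-32 is fixed-width at 4 bytes per code point.
accented = "héllo"

utf7 = accented.encode("utf-7")
# Every byte has the high bit clear, so it survives a 7-bit-only channel.
assert all(byte < 0x80 for byte in utf7)
assert utf7.decode("utf-7") == accented

mixed = "A€😀"  # three code points that need 1, 3 and 4 bytes in UTF-8
utf32 = mixed.encode("utf-32-le")
utf8 = mixed.encode("utf-8")
assert len(utf32) == 4 * len(mixed)

print(f"UTF-8: {len(utf8)} bytes, UTF-32: {len(utf32)} bytes")
```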

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
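The question-mark behavior is easy to reproduce. This sketch uses Python's `errors="replace"` option, which substitutes `?` for any character the target encoding cannot represent. Other tools may substitute something different, and the helper name `encode_lossy` is hypothetical.

```python
# Sketch: a legacy encoding loses characters it has no slot for; UTF-8 does not.

def encode_lossy(text: str, encoding: str) -> bytes:
    """Encode text, replacing unrepresentable characters with '?'."""
    return text.encode(encoding, errors="replace")

russian = "Привет"
french = "café"

# Latin-1 has a slot for é but none for Cyrillic letters.
assert encode_lossy(french, "latin-1") == b"caf\xe9"
assert encode_lossy(russian, "latin-1") == b"??????"

# Without a replacement policy, the encoder refuses instead of guessing.
try:
    russian.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
assert strict_failed

# UTF-8 can store any code point, so the round trip is exact.
assert russian.encode("utf-8").decode("utf-8") == russian
```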

Source: this question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/6338944/