c++ - Is this UTF-8 implementation implementation-defined or well-defined?

I was just browsing around looking for implementations of UTF-8 code points (no, not to plagiarize) and stumbled upon this:

typedef unsigned char char8_t;
typedef std::basic_string<unsigned char> u8string;

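Does this code ignore the fact that CHAR_BIT is only required to be at least 8 and may be larger? Or is that irrelevant in this context, so the code is fine? If so, why?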

Also, someone (presumably SO member @NicolBolas?) wrote:

const char *str = u8"This is a UTF-8 string.";
This is pretty much how UTF-8 will be used in C++ for string literals.

I thought a code unit in UTF-8 was always exactly eight bits!
From the Unicode Standard 8.0.0, Chapter 2.5:

In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively.

(Line breaks removed, hyphens at line breaks removed, emphasis added.)

So why does he claim that const char* is used rather than const uint8_t* (or the proposed, hypothetical const char8_t*)?

Best answer

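uint8_t exists only on systems that can access memory in exactly 8-bit units. UTF-8 has no such requirement. It uses values that fit in 8 bits, but it imposes no requirement on how those values are actually stored. Each 8-bit value can be stored in 16 bits, 32 bits, or whatever makes sense for the system it runs on; the only requirement is that the value itself is correct.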
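To make that concrete, here is a minimal sketch of an encoder that stores each UTF-8 code unit in an unsigned char. The names encode_utf8 and all_units_fit_in_8_bits are made up for this illustration and are not from the question or the linked code. The sketch assumes the input is a valid Unicode scalar value; it does not reject surrogates. The point is that every value produced is in the range 0x00..0xFF, so it fits in a char on any conforming implementation, whether CHAR_BIT is 8, 16, or 32.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// The C and C++ standards guarantee at least 8 bits per char; it may be more.
static_assert(CHAR_BIT >= 8, "char must hold at least 8 bits");

// Hypothetical helper: encode one code point as UTF-8 code units.
// Assumes cp is a valid scalar value (surrogates are not rejected here).
std::vector<unsigned char> encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::vector<unsigned char> out;
    if (cp < 0x80) {
        // 1 unit: 0xxxxxxx
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {
        // 2 units: 110xxxxx 10xxxxxx
        out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 units: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// True if every stored value fits in 8 bits, whatever CHAR_BIT is.
// On a CHAR_BIT == 8 platform this is trivially true; on a wider
// platform it checks that the upper bits of each char are unused.
bool all_units_fit_in_8_bits(const std::vector<unsigned char>& units) {
    for (unsigned char u : units) {
        if (u > 0xFF) {
            return false;
        }
    }
    return true;
}
```

On a platform where char is wider than 8 bits, the same code still produces the same sequence of values; the extra bits in each char are simply zero. That is why const char* is enough for UTF-8 data, while uint8_t might not exist at all on such a platform.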

Original question on Stack Overflow: https://stackoverflow.com/questions/34560711/
