c++ - C/C++ Unicode character encoding sizes and default formats

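I just realized (thanks to my university classes) that a lot of what I thought I knew about Unicode was wrong. So I started reading to fix my knowledge, and the following doubts came up immediately while playing with a simple "Hello world" C++ program in MSVC2012: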

#include <iostream>
#include <string.h>
using namespace std;

int main(void) {

    char arr1[] = "I am a nice boy"; // Is this stored as UTF-8 (multi-byte) or ASCII?
    char arr[] = "I'm a nice èboi"; // All characters should be ASCII except the 'è' one, which encoding is used for this?
    cout << strlen(arr); // Returns 15 as ASCII, why?

    // If I choose "multi-byte character set" in my VS project configuration instead of "unicode", what does this mean and what
    // will this affect?

    char arr2[] = "I'm a niße boy"; // And what encoding is it used here?
    cout << strlen(arr2); // Returns 1514, what does this mean?

    // If UTF-32 usually use 4 bytes to encode a character (even if they're not needed), how can a unicode code point like U+FFFF
    // (FFFF hexadecimal is 65535 in decimal) represent any possible unicode character if the maximum is FFFF ? (http://inamidst.com/stuff/unidata/)

    return 0;
}

The above was compiled with "Multi-byte character set", but since multi-byte is (I guess?) a Unicode encoding, even that is unclear to me.

Could someone help me with clear explanations of the questions above?

Best answer

    char arr1[] = "I am a nice boy"; // Is this stored as UTF-8 (multi-byte) or ASCII?

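This is stored in the compiler's execution character set. The compiler gets to choose what that is and should document it. GCC lets you set the execution encoding with the flag -fexec-charset=charset, but I think it uses UTF-8 by default; MSVC uses the "encoding for non-Unicode applications" configured in the machine's system language settings (which, for MSVC 2012 as used in the question, can never be UTF-8; later MSVC releases added a /utf-8 option); and clang unconditionally uses UTF-8.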
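One way to see which execution character set a compiler actually used is to dump the bytes of a literal. The following is only a minimal sketch: `hex_bytes` is a hypothetical helper name, not part of any library, and the sample outputs mentioned below assume an ASCII-compatible execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: renders every byte of a NUL-terminated narrow
// string as two upper-case hex digits, separated by single spaces.
std::string hex_bytes(const char* s) {
    assert(s != nullptr);
    std::string out;
    const std::size_t n = std::strlen(s);
    for (std::size_t i = 0; i < n; ++i) {
        char buf[4];
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `hex_bytes("è")` on a literal typed directly into the source would be expected to give `C3 A8` if the compiler stored it as UTF-8 and `E8` if it used a Latin-1 style single-byte code page such as Windows-1252.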

char arr[] = "I'm a nice èboi"; // All characters should be ASCII except the 'è' one, which encoding is used for this?
cout << strlen(arr); // Returns 15 as ASCII, why?

The compiler execution character set does not actually have to be ASCII-compatible at all. It could, for example, be EBCDIC.

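strlen(arr) returns 15 because the string literal, encoded in the compiler execution character set, is 15 bytes long. Since the string literal is also 15 characters long, that probably means the execution character set uses a single byte for each character, including 'è'. (And since UTF-8 cannot possibly encode that string in just 15 bytes, this conclusively shows that your compiler is not using UTF-8 as its execution character set.)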
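The byte counts in that argument can be checked with a small sketch. The two arrays below spell out the bytes explicitly with hex escapes so the result does not depend on the compiler's settings; `utf8_code_points`, `kUtf8`, and `kLatin1` are names made up for this illustration, and the single-byte value 0xE8 for 'è' assumes a Latin-1 style code page such as Windows-1252.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The question's string with 'è' written as explicit bytes.
// The literal is split after the escape so that 'b' is not parsed
// as part of the hex escape sequence.
const char kUtf8[]   = "I'm a nice \xC3\xA8" "boi";  // 'è' as UTF-8 (2 bytes)
const char kLatin1[] = "I'm a nice \xE8" "boi";      // 'è' as one byte (assumed code page)

// Hypothetical helper: counts code points in a valid UTF-8 string by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
std::size_t utf8_code_points(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Here `std::strlen(kUtf8)` is 16 while `utf8_code_points(kUtf8)` is 15, and `std::strlen(kLatin1)` is 15, which matches the value reported in the question.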

char arr2[] = "I'm a niße boy"; // And what encoding is it used here?
cout << strlen(arr2); // Returns 1514, what does this mean?

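The encoding does not change depending on the content of the string. The compiler will always use the execution character set. I assume "1514" is a typo (most likely the output of the two cout statements, 15 and 14, run together with no separator) and strlen(arr2) actually returns 14, since there are 14 characters in that string and the earlier string also appears to use one byte per character.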
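A short sketch of how "1514" can arise from two separate results: writing two numbers to the same stream without a separator simply concatenates their digits. `concat_lengths` is a hypothetical helper that mirrors the two `cout << strlen(...)` statements from the question, and the sample strings assume a single-byte code page where 'è' is 0xE8 and 'ß' is 0xDF.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical helper: streams both lengths back-to-back, with no
// space or newline in between, exactly like the code in the question.
std::string concat_lengths(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    std::ostringstream out;
    out << std::strlen(a);
    out << std::strlen(b);
    return out.str();
}
```

With the question's two strings in a single-byte encoding the result is "1514"; inserting `'\n'` or `std::endl` between the two outputs would make the two values 15 and 14 visible separately.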

If I choose "multi-byte character set" in my VS project configuration instead of "unicode", what does this mean and what will this affect?

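That setting is unrelated to the encoding used by the compiler. It just sets macros in the Microsoft headers to different things: TCHAR, all the macros that choose between the *W and *A functions, and so on.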

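In fact, it is entirely possible to write a program using multi-byte character strings when you have 'unicode' enabled, and it is equally possible to use Unicode when you have 'multi-byte character set' enabled.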
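The mechanism can be imitated in portable C++ without the Windows headers. Everything below is a simplified stand-in, not the real header contents: `DEMO_UNICODE` plays the role of the project setting's macro, `DEMO_TCHAR` and `DEMO_TEXT` stand in for TCHAR and the text-literal macro, and `DemoLengthA` / `DemoLengthW` are hypothetical functions representing an *A / *W API pair.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical narrow ("A") and wide ("W") versions of the same API.
std::size_t DemoLengthA(const char* s)    { assert(s != nullptr); return std::strlen(s); }
std::size_t DemoLengthW(const wchar_t* s) { assert(s != nullptr); return std::wcslen(s); }

// Simplified imitation of the header logic: one macro decides which
// character type and which function the generic names resolve to.
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_TEXT(s) L##s
#define DemoLength DemoLengthW
#else
typedef char DEMO_TCHAR;
#define DEMO_TEXT(s) s
#define DemoLength DemoLengthA
#endif
```

The generic call `DemoLength(DEMO_TEXT("abc"))` follows the setting, while `DemoLengthA("abc")` and `DemoLengthW(L"abc")` remain callable either way, which is why the setting does not restrict which kind of string a program can use.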

If UTF-32 usually use 4 bytes to encode a character (even if they're not needed), how can a unicode code point like U+FFFF (FFFF hexadecimal is 65535 in decimal) represent any possible unicode character if the maximum is FFFF ? (http://inamidst.com/stuff/unidata/)

That question doesn't make sense. Perhaps if you rephrase it...
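If the question is read as asking whether U+FFFF is the largest code point, a small sketch may help: Unicode code points run up to U+10FFFF, so U+FFFF is not the maximum, and the number of code units needed depends on the encoding form. The function and constant names below are made up for this illustration, and surrogate values (U+D800 to U+DFFF) are deliberately not rejected to keep it short.

```cpp
#include <cassert>

// Largest Unicode code point.
constexpr char32_t kMaxCodePoint = 0x10FFFF;

// Number of 8-bit code units (bytes) UTF-8 needs for one code point.
int utf8_bytes(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of 16-bit code units UTF-16 needs: code points above U+FFFF
// are represented by a surrogate pair.
int utf16_units(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one 32-bit code unit per code point.
int utf32_units(char32_t cp) {
    assert(cp <= kMaxCodePoint);
    return 1;
}
```

For example, 'è' (U+00E8) takes 2 bytes in UTF-8 and one unit in UTF-16, while U+1F600 takes 4 bytes in UTF-8, two units in UTF-16, and still a single unit in UTF-32.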

Regarding "c++ - C/C++ Unicode character encoding sizes and default formats", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21798158/
