c++ - How do I parse part of a UTF-8 string into a C++ string?

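I am reading strings from a JSON file in C++. When a string contains a special character such as "ç" or "á", the character is converted to "\u00e7" or "\u00e1".

The word I am reading is "Praça"; once it has been read from the JSON file it becomes "Pra\u00e7a". The special character "ç" has turned into "\u00e7". How can I parse it back into "Praça"?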

Best answer

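You're parsing it correctly. When a string is encoded as UTF-8, you don't want to change anything when you read it. The string is already correct in memory.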
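One way to check what is actually in memory is to dump the raw bytes. The following is only a minimal sketch; `hex_dump` is a hypothetical helper name, not something from the question or from any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of `s` as two uppercase hex
// digits, separated by single spaces.
std::string hex_dump(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s)   // unsigned char avoids sign extension
    {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(byte));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

For a correctly decoded "Praça" the dump should read `50 72 61 C3 A7 61`: the two bytes `C3 A7` are the UTF-8 encoding of "ç". If the dump instead shows `5C 75 30 30 65 37` in the middle, the string really contains the six characters `\u00e7`.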

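The problem is that whatever system is displaying the string doesn't handle UTF-8, and is instead showing it in C/C++/Java source-code escape notation.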
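If it turns out that the escapes are literally present in the string rather than being a display artifact, they would have to be decoded. A JSON library normally does this for you; the sketch below is only an illustration under stated assumptions. The names `decode_u_escapes`, `append_utf8` and `hex_value` are made up for this example, only `\uXXXX` escapes are handled, surrogate pairs (code points above U+FFFF) are not combined, and other JSON escapes such as `\\` or `\n` are passed through untouched.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of a code point below 0x10000 to `out`.
static void append_utf8(std::string& out, unsigned cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Returns the value of one hex digit, or -1 if `ch` is not a hex digit.
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Replaces each literal \uXXXX escape in `in` with its UTF-8 bytes.
// Anything that is not a well-formed escape is copied unchanged.
std::string decode_u_escapes(const std::string& in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        if (in[i] == '\\' && i + 6 <= in.size() && in[i + 1] == 'u')
        {
            unsigned cp = 0;
            bool ok = true;
            for (std::size_t k = i + 2; k < i + 6; ++k)
            {
                int v = hex_value(in[k]);
                if (v < 0) { ok = false; break; }
                cp = cp * 16 + static_cast<unsigned>(v);
            }
            if (ok)
            {
                append_utf8(out, cp);
                i += 6;
                continue;
            }
        }
        out += in[i++];
    }
    return out;
}
```

With this sketch, `decode_u_escapes("Pra\\u00e7a")` yields the bytes `50 72 61 C3 A7 61`, which is "Praça" in UTF-8.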

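The best fix is to make the system that processes and displays the string handle UTF-8. If you can't do that and are restricted to single-byte text, you can map some Unicode characters back to single bytes (strictly speaking Latin-1 rather than pure ASCII). But you won't be able to handle most characters (all Asian characters, for example). You're basically back in the pre-Unicode world.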

If mapping back to single bytes is the best you can do, here is your mapping:

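#include <cassert>

// Converts UTF-8 to single-byte Latin-1 (ISO-8859-1), writing '?' for every
// character that does not fit. Returns false if anything was replaced.
// clear() and write() are members of KxnUTF8Code, which is not shown here.
bool KxnUTF8Code::decode( const char* str )
{
   assert( str );
   clear();

   bool ret = true;

   // Read the bytes as unsigned. With a signed char, comparisons such as
   // c >= 0x80 or c == 0xc2 would never be true.
   const unsigned char* cstr = reinterpret_cast<const unsigned char*>( str );

   while ( *cstr )
   {
      unsigned char c = cstr[0];
      if ( c >= 0x80 )
      {
         // this is a high bit byte - it starts a multi-byte sequence. Try converting to a single byte
         if ( c == 0xc2 )
         {
            // decode a character in the 0xC2 range (U+0080..U+00BF).
            c = cstr[1];
            if (( c >= 0x80 ) && ( c <= 0xBF ))
            {
               // Valid continuation byte: the Latin-1 value equals the byte itself
               char out = static_cast<char>( c );
               write( &out, 1 );
               cstr += 2;

            } else {

               // truncated or malformed sequence - you're boned.
               char out = '?';
               write( &out, 1 );
               ret = false;
               cstr += 1; // skip only the lead byte so we never step past the terminator
            }

         } else if ( c == 0xc3 ) {

            // decode a character in the 0xC3 range (U+00C0..U+00FF).
            c = cstr[1];
            if (( c >= 0x80 ) && ( c <= 0xBF ))
            {
               // Valid continuation byte: shift into the 0xC0-0xFF Latin-1 range
               char out = static_cast<char>( c + 0x40 );
               write( &out, 1 );
               cstr += 2;

            } else {

               // truncated or malformed sequence - screwed again...
               char out = '?';
               write( &out, 1 );
               ret = false;
               cstr += 1;
            }

         } else {

            // none of the longer codes fit in a single byte either...
            int codeLen = 1;

                 if (( c >= 194 ) && ( c <= 223 )) codeLen = 2;
            else if (( c >= 224 ) && ( c <= 239 )) codeLen = 3;
            else if (( c >= 240 ) && ( c <= 247 )) codeLen = 4; // not all of these are yet allocated.
            else if (( c >= 248 ) && ( c <= 251 )) codeLen = 5; // none of these are allocated
            else if (( c >= 252 ) && ( c <= 253 )) codeLen = 6; // none of these are allocated

            // Skip the lead byte, then at most codeLen - 1 continuation bytes.
            // Checking each byte stops the loop at the terminator on truncated input.
            ++cstr;
            for ( int i = 1; i < codeLen && ( *cstr & 0xC0 ) == 0x80; ++i )
               ++cstr;

            char out = '?';
            write( &out, 1 );
            ret = false;
         }

      } else {

         // unmodified ascii character.
         char out = static_cast<char>( c );
         write( &out, 1 );
         cstr++;
      }
   }

   return ret;
}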
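Since the class around `decode` isn't shown, here is a self-contained sketch of the same mapping that can be compiled on its own. It is not the answer's code: `utf8_to_latin1` and its optional `ok` parameter are hypothetical names chosen for this example, and it returns a `std::string` instead of calling `write()`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical standalone variant: converts UTF-8 to Latin-1 and replaces
// every character above U+00FF (or any malformed sequence) with '?'.
// If `ok` is non-null, *ok is set to false when anything was replaced.
std::string utf8_to_latin1(const std::string& in, bool* ok = nullptr)
{
    std::string out;
    bool clean = true;
    std::size_t i = 0;

    while (i < in.size())
    {
        unsigned char c = static_cast<unsigned char>(in[i]);

        if (c < 0x80)
        {
            // Plain ASCII passes through unchanged.
            out += static_cast<char>(c);
            ++i;
            continue;
        }

        // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
        if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size())
        {
            unsigned char next = static_cast<unsigned char>(in[i + 1]);
            if ((next & 0xC0) == 0x80)
            {
                out += static_cast<char>(((c & 0x03) << 6) | (next & 0x3F));
                i += 2;
                continue;
            }
        }

        // Everything else does not fit: emit '?', then skip the lead byte
        // and any continuation bytes that follow it.
        out += '?';
        clean = false;
        ++i;
        while (i < in.size() &&
               (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80)
            ++i;
    }

    if (ok) *ok = clean;
    return out;
}
```

Feeding it the UTF-8 bytes of "Praça" (`50 72 61 C3 A7 61`) gives `50 72 61 E7 61`, the Latin-1 form, while something like the euro sign (`E2 82 AC`) collapses to a single `?` and sets the flag to false.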

A similar question to "c++ - How do I parse part of a UTF-8 string into a C++ string?" can be found on Stack Overflow: https://stackoverflow.com/questions/26044298/
