c++ - Getting the "char" of a multi-byte character on linux/mac

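I have a std::string holding UTF-8 text on linux and mac (some Latin characters, some non-Latin).

As we know, UTF-8 characters do not have a fixed size, and some of them take more than 1 byte (unlike regular Latin characters).

The question is: how do I get the character at offset i?

It makes sense to store the character in an int32, but how do I extract it?

For example: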

std::string str = read_utf8_text();
int c_can_be_more_than_one_byte = str[i]; // <-- obviously this code is wrong

It is important to point out that I do not know the size of the character at offset i.

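Best answer

It is quite simple.

First, you have to understand that you cannot compute the position without iterating over the string (which is obvious for variable-length characters).

Second, keep in mind that in UTF-8 a character takes 1 to 4 bytes, and when it takes more than one byte, every trailing byte has 10 as its two most significant bits. So you just count bytes, ignoring those for which (byte_val & 0xC0) == 0x80.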
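To make the byte layout concrete, here is a minimal sketch of the two bit tests involved. The helper names is_continuation and sequence_length are hypothetical, not from any library, and the sketch only classifies a byte by its top bits. It does not reject overlong encodings or lead bytes such as 0xC0 and 0xC1 that never occur in valid UTF-8.

```cpp
#include <cstddef>

// True if `byte` is a UTF-8 continuation (trailing) byte: bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Number of bytes in the sequence introduced by `lead`, judged only by its
// top bits. Returns 0 for a continuation byte and for 0xF8 and above,
// neither of which can start a sequence.
inline std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: single-byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}
```

Taking the argument as unsigned char avoids surprises on platforms where plain char is signed.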

Unfortunately I do not have a compiler at hand right now, so please be kind about any mistakes in the code:

int desired_index = 19;
int index = 0;
const char* p = my_str.c_str(); // c_str() returns const char*, so p must be const
while ( *p && index < desired_index ){
  if ( (*p & 0xC0) != 0x80 ) // if it is first byte of next character
    index++;
  p++;
}

// p may now point at the trailing (2nd to 4th) bytes of the previous character; skip them
while ( (*p & 0xC0) == 0x80 )
  p++;

if ( *p ){
  // here p points to your desired char
} else {
  // we reached the end of the string while searching
}
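The snippet above only locates the first byte of the character. Since the question asks for the character as an int32, the following is a sketch of one way to also decode it. The function name code_point_at and the -1 error value are assumptions made for illustration. The sketch treats index as a 0-based character index and checks for truncated sequences, but it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the code point of the character at 0-based
// character index `index`, or -1 if the string is too short or the sequence
// at that position is truncated or malformed.
std::int32_t code_point_at(const std::string& s, std::size_t index) {
    std::size_t pos = 0;   // byte offset into s
    std::size_t seen = 0;  // number of lead bytes passed so far

    // Walk forward until pos sits on the lead byte of the requested character.
    while (pos < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[pos]);
        if ((b & 0xC0) != 0x80) {  // not a continuation byte, so a lead byte
            if (seen == index) break;
            ++seen;
        }
        ++pos;
    }
    if (pos >= s.size()) return -1;  // fewer than index + 1 characters

    // The lead byte gives the sequence length and the topmost payload bits.
    unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len;
    std::int32_t cp;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return -1;                  // 0xF8 and above cannot start a sequence

    if (pos + len > s.size()) return -1;  // sequence runs past the end

    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k < len; ++k) {
        unsigned char c = static_cast<unsigned char>(s[pos + k]);
        if ((c & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

As in the answer, every lookup is a linear scan from the start of the string, so code that needs many lookups would probably do better to decode the whole string once into a std::u32string or similar.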

Regarding c++ - getting the "char" of a multi-byte character on linux/mac, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53959490/
