c++ - Decoding a UTF-8 std::string to a std::u32string?

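Is there a way in C++17, using only the C++17 standard library, to efficiently decode a std::string containing a valid sequence of UTF-8 code units into a std::u32string containing the corresponding sequence of code points (UTF-32 code units), so that both represent the same text?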

In other words, how can I implement the following function:

std::u32string decode_utf8(const std::string& utf8_string) {
    ???
}

For context, here is my current solution:

#include <cstdint>
#include <stdexcept>
#include <string>

inline std::u32string decode_utf8(const std::string& utf8_string) {
  std::u32string result;
  result.resize(utf8_string.size());
  size_t output_pos = 0;

  const char* next_code_unit_ptr = &utf8_string[0];

  auto get_next_code_unit = [&] { return uint8_t(*next_code_unit_ptr++); };

  auto mask_match = [](uint8_t code_unit, uint8_t mask, uint8_t value) {
    return ((code_unit & mask) == value);
  };

  auto write_code_point = [&](uint32_t code_point) {
    result[output_pos] = char32_t(code_point);
    output_pos++;
  };

  while (true) {
    uint8_t starting_code_unit = get_next_code_unit();

    if (mask_match(starting_code_unit, 0b1000'0000, 0b0000'0000)) {
      if (starting_code_unit == 0) break;
      write_code_point(starting_code_unit);
      continue;
    }

    uint32_t code_point = 0;

    auto accumulate_trailing_code_unit = [&] {
      uint8_t trailing_code_unit = get_next_code_unit();
      if (!mask_match(trailing_code_unit, 0b1100'0000, 0b1000'0000))
        throw std::runtime_error("Invalid UTF-8");
      code_point <<= 6;
      code_point |= (trailing_code_unit & 0b0011'1111);
    };

    if (mask_match(starting_code_unit, 0b1110'0000, 0b1100'0000)) {
      code_point = (starting_code_unit & 0b0001'1111);
      accumulate_trailing_code_unit();
      write_code_point(code_point);
    } else if (mask_match(starting_code_unit, 0b1111'0000, 0b1110'0000)) {
      code_point = (starting_code_unit & 0b0000'1111);
      accumulate_trailing_code_unit();
      accumulate_trailing_code_unit();
      write_code_point(code_point);
    } else if (mask_match(starting_code_unit, 0b1111'1000, 0b1111'0000)) {
      code_point = (starting_code_unit & 0b0000'0111);
      accumulate_trailing_code_unit();
      accumulate_trailing_code_unit();
      accumulate_trailing_code_unit();
      write_code_point(code_point);
    } else
      throw std::runtime_error("Invalid UTF-8");
  }

  result.resize(output_pos);

  return result;
}

Is there a simpler or faster way?

Best answer

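The requested decode_utf8 function can be implemented in C++17 using deprecated standard facilities. However, efficiency is limited by the use of std::codecvt facets and their virtual interface.

The following example uses the deprecated std::wstring_convert class, but avoids the deprecated codecvt_utf8 facet.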

#include <cassert>
#include <locale>
#include <string>

std::u32string decode_utf8(const std::string& utf8_string) {
  struct destructible_codecvt : public std::codecvt<char32_t, char, std::mbstate_t> {
    using std::codecvt<char32_t, char, std::mbstate_t>::codecvt;
    ~destructible_codecvt() = default;
  };
  std::wstring_convert<destructible_codecvt, char32_t> utf32_converter;
  return utf32_converter.from_bytes(utf8_string);
}

int main() {
  bool cmp = std::u32string(U"\U0001F64A") == decode_utf8(u8"\U0001F64A");
  assert(cmp);
  return !cmp;
}

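The code above will not compile in C++20, because u8"" string literals have type const char8_t[] there; this can be mitigated to some extent with the techniques discussed and implemented at https://github.com/tahonermann/char8_t-remediation. Changing std::string to std::u8string and char to char8_t is not enough to make it work in C++20, because std::wstring_convert only works with char-based types; a (user-provided) replacement for std::wstring_convert is needed to port the code above to C++20.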
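As a rough illustration of the literal-type problem, the sketch below copies the bytes of a u8"" literal into a std::string whether the literal's element type is char (C++17) or char8_t (C++20). The helper name as_char_string is hypothetical, and this is only one simple possibility, not necessarily the approach taken in the repository linked above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies a NUL-terminated UTF-8 literal into a
// std::string. Char is deduced as char in C++17 and as char8_t in C++20.
template <class Char>
std::string as_char_string(const Char* utf8_literal) {
  static_assert(sizeof(Char) == 1, "expected a one-byte code unit type");
  const std::size_t length = std::char_traits<Char>::length(utf8_literal);
  // Inspecting the bytes through const char* is permitted for any object type.
  return std::string(reinterpret_cast<const char*>(utf8_literal), length);
}

// Usage sketch: U+1F64A is encoded in UTF-8 as the bytes F0 9F 99 8A.
inline void as_char_string_example() {
  assert(as_char_string(u8"\U0001F64A") == "\xF0\x9F\x99\x8A");
}
```

With such a helper, the comparison in main could be written as decode_utf8(as_char_string(u8"\U0001F64A")). That only addresses the literal type; the deprecated std::wstring_convert is still involved.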

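C++20 does not provide an efficient way to perform the requested conversion. This is a problem that SG16 is well aware of and is working on (see P1629). Experimental implementations will become available within the C++23 time frame, but it remains to be seen whether a solution can gain consensus and make it through the committee process in time to be adopted for C++23.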
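In the absence of a standard facility, a hand-written decoder is one option. The following is a minimal sketch, not a tuned implementation: the name decode_utf8_strict is hypothetical, and it assumes that malformed input (bad lead bytes, truncated sequences, overlong encodings, surrogates, values above U+10FFFF) should raise std::runtime_error. Unlike the loop in the question, it iterates over size() bytes rather than stopping at the first NUL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical strict decoder; compiles as C++17 and uses no deprecated facilities.
inline std::u32string decode_utf8_strict(const std::string& utf8) {
  std::u32string out;
  out.reserve(utf8.size());  // upper bound: at most one code point per byte

  const std::size_t n = utf8.size();
  std::size_t i = 0;
  while (i < n) {
    const std::uint8_t lead = static_cast<std::uint8_t>(utf8[i++]);
    if (lead < 0x80) {  // single-byte (ASCII) fast path
      out.push_back(static_cast<char32_t>(lead));
      continue;
    }

    std::uint32_t cp = 0;      // code point being assembled
    std::size_t trailing = 0;  // number of continuation bytes expected
    std::uint32_t min_cp = 0;  // smallest value allowed for this length
    if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F; trailing = 1; min_cp = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F; trailing = 2; min_cp = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      cp = lead & 0x07; trailing = 3; min_cp = 0x10000;
    } else {
      throw std::runtime_error("Invalid UTF-8: bad lead byte");
    }

    if (n - i < trailing)
      throw std::runtime_error("Invalid UTF-8: truncated sequence");

    for (std::size_t k = 0; k < trailing; ++k) {
      const std::uint8_t b = static_cast<std::uint8_t>(utf8[i++]);
      if ((b & 0xC0) != 0x80)
        throw std::runtime_error("Invalid UTF-8: bad continuation byte");
      cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong forms, UTF-16 surrogates, and out-of-range values.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      throw std::runtime_error("Invalid UTF-8: invalid code point");

    out.push_back(static_cast<char32_t>(cp));
  }
  return out;
}

// Usage sketch: the same comparison as in the answer, with the UTF-8 bytes
// spelled out so the literal has type const char[] in both C++17 and C++20.
inline void decode_utf8_strict_example() {
  assert(decode_utf8_strict("\xF0\x9F\x99\x8A") == std::u32string(U"\U0001F64A"));
}
```

Whether this is faster than the std::wstring_convert version depends on the implementation and the input, so it is worth measuring before relying on it.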

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/63050079/
