winapi - Should the MB_ERR_INVALID_CHARS flag be used for UTF-8 conversions with MultiByteToWideChar?

When converting from Unicode UTF-8 to Unicode UTF-16 with the Win32 API MultiByteToWideChar(), should the MB_ERR_INVALID_CHARS flag be used?

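In other words, when the input contains invalid UTF-8, which of the following is the better behavior, and why?

Make the MultiByteToWideChar() call fail (by passing the MB_ERR_INVALID_CHARS flag)

Simply replace the invalid UTF-8 input characters with REPLACEMENT CHARACTER U+FFFD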

Best answer

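From a security perspective, using MB_ERR_INVALID_CHARS when converting from UTF-8 to UTF-16 seems to be the best practice, especially with regard to the problem of ill-formed UTF-8 subsequences (as described in "Unicode Technical Report #36: Unicode Security Considerations"):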

3.1.1 Ill-Formed Subsequences

Suppose that a UTF-8 converter is iterating through input UTF-8 bytes, converting to an output character encoding. If the converter encounters an ill-formed UTF-8 sequence it can treat it as an error in a number of different ways, including substituting a character like U+FFFD, SUB, "?", or SPACE. However, it must not consume any valid successor bytes. For example, suppose we have the following sequence:

X = <... 41 C2 3E 42 ... >

This sequence overall is ill-formed, because it contains an ill-formed substring, namely the <C2> [...]

The UTF-8 converter can stop at the C2 byte, or substitute a character or sequence like U+FFFD and continue. However, it must not consume the 3E byte if it continues. [...]

Consuming a subsequent byte (such as 3E above) is not only non-conformant; it can lead to security breaches. [...]
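The rule quoted above can be illustrated with a small, portable sketch. It is not how MultiByteToWideChar() is implemented; DecodeUtf8Lenient is a hypothetical helper written only for this illustration. It decodes UTF-8 into code points, substitutes U+FFFD for each ill-formed subsequence, and never consumes a byte that is not a valid continuation of the current sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): decode UTF-8 into code points,
// replacing each ill-formed subsequence with U+FFFD. A byte that cannot
// continue the current sequence is NOT consumed; it is re-examined as the
// start of the next sequence, as TR #36 section 3.1.1 requires.
std::u32string DecodeUtf8Lenient(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) {                    // ASCII
            out.push_back(b0);
            continue;
        }

        int need = 0;                        // continuation bytes expected
        char32_t cp = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the next byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;       // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;       // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;       // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;       // reject values above U+10FFFF
        } else {
            out.push_back(0xFFFD);           // stray continuation or invalid lead byte
            continue;
        }

        bool ok = true;
        for (int k = 0; k < need; ++k) {
            if (i >= in.size()) { ok = false; break; }   // truncated input
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; } // leave b unconsumed
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;            // later bytes use the full range
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

For the sequence from the report, <41 C2 3E 42>, this sketch yields U+0041, U+FFFD, U+003E, U+0042: the C2 byte is replaced and the 3E byte survives as ">".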

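In fact, passing the MB_ERR_INVALID_CHARS flag makes the MultiByteToWideChar() API fail when the input contains an invalid UTF-8 sequence, so there is no risk that later code (for example, the calling code) consumes the bytes that follow the invalid subsequence.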
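A minimal sketch of such a strict conversion might look like the following. Utf8ToUtf16Strict is a hypothetical helper name, not part of the Win32 API, and the bool-plus-output-parameter interface is just one possible design. The code is wrapped in #ifdef _WIN32 because the API exists only on Windows.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <climits>
#include <string>

// Hypothetical helper (illustration only): strict UTF-8 -> UTF-16 conversion.
// Returns false if the input is not well-formed UTF-8 or the API call fails.
bool Utf8ToUtf16Strict(const std::string& utf8, std::wstring& utf16)
{
    utf16.clear();
    if (utf8.empty()) return true;

    // MultiByteToWideChar takes the source length as an int.
    if (utf8.size() > static_cast<size_t>(INT_MAX)) return false;
    const int srcLen = static_cast<int>(utf8.size());

    // First call: ask for the required length, in wchar_t units.
    const int dstLen = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), srcLen, nullptr, 0);
    if (dstLen == 0) {
        // For invalid input, GetLastError() is ERROR_NO_UNICODE_TRANSLATION.
        return false;
    }

    // Second call: perform the conversion into the sized buffer.
    utf16.resize(static_cast<size_t>(dstLen));
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), srcLen, &utf16[0], dstLen);
    if (written == 0) {
        utf16.clear();
        return false;
    }
    return true;
}
#endif
```

With this helper, the sequence <41 C2 3E 42> from the report is rejected as a whole, and the caller decides how to handle the error. According to Microsoft's documentation, passing 0 instead of MB_ERR_INVALID_CHARS on Windows Vista and later makes the function replace illegal sequences with U+FFFD rather than fail, which corresponds to the second option in the question.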

Source: the corresponding question on Stack Overflow: https://stackoverflow.com/questions/22824537/
