When converting from Unicode UTF-8 to Unicode UTF-16 with the Win32 API MultiByteToWideChar(), should the MB_ERR_INVALID_CHARS flag be used?
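In other words, when the input contains invalid UTF-8, which behavior is best, and why: having the MultiByteToWideChar() call fail (by passing the MB_ERR_INVALID_CHARS flag), or having the invalid sequences converted to U+FFFD?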
Best answer
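From a security standpoint, using MB_ERR_INVALID_CHARS when converting from UTF-8 to UTF-16 appears to be the best practice, particularly with regard to the problem of ill-formed UTF-8 subsequences (as described in "Unicode Technical Report #36: Unicode Security Considerations"):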
3.1.1 Ill-Formed Subsequences
Suppose that a UTF-8 converter is iterating through input UTF-8 bytes, converting to an output character encoding. If the converter encounters an ill-formed UTF-8 sequence it can treat it as an error in a number of different ways, including substituting a character like U+FFFD, SUB, "?", or SPACE. However, it must not consume any valid successor bytes. For example, suppose we have the following sequence:
X = <... 41 C2 3E 42 ... >
This sequence overall is ill-formed, because it contains an ill-formed substring, namely the <C2> [...]
The UTF-8 converter can stop at the C2 byte, or substitute a character or sequence like U+FFFD and continue. However, it must not consume the 3E byte if it continues. [...]
Consuming a subsequent byte (such as 3E above) is not only non-conformant; it can lead to security breaches. [...]
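The rule quoted above can be illustrated with a small, portable sketch. It is not how MultiByteToWideChar() is implemented; decode_utf8 and OnError are hypothetical names introduced here, and the replacement policy shown (one U+FFFD per offending byte) is only one of several conformant choices. The point is that on an ill-formed sequence the decoder either stops or substitutes U+FFFD, and in both cases it resumes at the very next byte instead of swallowing it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical policy switch for this sketch: stop on bad input, or substitute.
enum class OnError { Fail, Replace };

// Decodes UTF-8 bytes into code points. Returns false under OnError::Fail as
// soon as an ill-formed sequence is seen. Under OnError::Replace it emits
// U+FFFD and skips exactly one byte, so valid successor bytes are never consumed.
bool decode_utf8(const std::string& in, OnError policy, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // expected sequence length, 0 = invalid lead byte
        char32_t cp = 0;       // code point being assembled
        char32_t min = 0;      // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        bool ok = (len != 0) && (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;        // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) { out.push_back(cp); i += len; continue; }
        if (policy == OnError::Fail) return false;     // strict: report the error
        out.push_back(0xFFFD);                         // lenient: substitute
        i += 1;                                        // skip only the bad byte
    }
    return true;
}
```

For the sequence from the report, 41 C2 3E 42, the strict policy returns false, while the replacement policy yields U+0041 U+FFFD U+003E U+0042: the 3E byte survives as ">" rather than being absorbed into the broken two-byte sequence.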
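In fact, with the MB_ERR_INVALID_CHARS flag, the MultiByteToWideChar() API fails when the input contains an invalid UTF-8 sequence, so there is no risk that subsequent code (e.g. the calling code) consumes the bytes that follow the invalid substring.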
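A minimal sketch of such a strict conversion might look like the following. Utf8ToUtf16Strict is a hypothetical helper name, not part of the Win32 API, and reporting failure by throwing an exception is an assumption of this example; a caller could just as well return an error code. The code is Windows-only, hence the _WIN32 guard.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16 and throws on ill-formed input
// instead of letting invalid sequences through as U+FFFD.
std::wstring Utf8ToUtf16Strict(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();

    // MultiByteToWideChar takes the source length as an int.
    if (utf8.size() > static_cast<size_t>(INT_MAX))
        throw std::overflow_error("UTF-8 input too long for MultiByteToWideChar");
    const int srcLen = static_cast<int>(utf8.size());

    // First call: ask for the required length in wchar_t units.
    // With MB_ERR_INVALID_CHARS the call returns 0 on invalid UTF-8.
    const int dstLen = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), srcLen, nullptr, 0);
    if (dstLen == 0) {
        const DWORD err = ::GetLastError();
        throw std::runtime_error(err == ERROR_NO_UNICODE_TRANSLATION
            ? "invalid UTF-8 sequence in input"
            : "MultiByteToWideChar failed while computing the length");
    }

    // Second call: perform the conversion into a buffer of that size.
    std::wstring utf16(static_cast<size_t>(dstLen), L'\0');
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), srcLen, &utf16[0], dstLen);
    if (written == 0)
        throw std::runtime_error("MultiByteToWideChar failed during conversion");

    return utf16;
}
#endif  // _WIN32
```

With this helper, Utf8ToUtf16Strict("A>B") returns L"A>B", whereas the bytes 41 C2 3E 42 from the report make it throw, so the caller never receives a partially converted string.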
Source: the Stack Overflow question "winapi - Should the MB_ERR_INVALID_CHARS flag be used with MultiByteToWideChar for UTF-8 conversion?": https://stackoverflow.com/questions/22824537/