unicode - Given a UTF-8 string, can I treat it as a byte string when searching for ASCII characters?

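Since strings in many modern languages are now sequences of Unicode characters, a single character may span multiple bytes. However, if I only care about certain ASCII characters, is it safe to treat the string as a sequence of bytes (assuming the given string is a valid Unicode character sequence)?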

Best answer

Yes.

[...] ASCII bytes do not occur when encoding non-ASCII code points into UTF-8 [...]

Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code-point. [...] Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many [programs] will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences.

From utf8everywhere.org:

By design of this encoding, UTF-8 guarantees that an ASCII character value or a substring will never match a part of a multi-byte encoded character.
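As a minimal sketch of what this guarantee means in practice, the Python snippet below scans the raw UTF-8 bytes for an ASCII character without decoding anything. The helper name `ascii_positions` and the sample text are made up for illustration, and the input is assumed to be valid UTF-8.

```python
def ascii_positions(data: bytes, ch: str) -> list[int]:
    """Return the byte offsets of the ASCII character `ch` in UTF-8 `data`.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    if len(ch) != 1 or ord(ch) > 0x7F:
        raise ValueError("ch must be a single ASCII character")
    target = ord(ch)
    # A byte equal to `target` can only be that ASCII character itself,
    # because every byte of a multi-byte sequence is >= 0x80.
    return [i for i, b in enumerate(data) if b == target]


text = "naïve,café,日本語"
data = text.encode("utf-8")
offsets = ascii_positions(data, ",")

# Splitting the raw bytes on the ASCII comma yields the same fields
# as splitting the decoded string.
fields = [part.decode("utf-8") for part in data.split(b",")]
assert fields == text.split(",")
```

Note that the offsets returned are byte offsets, not character indices, so they only line up with the decoded string when everything before them is ASCII.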

This table from Wikipedia illustrates it well:

Number of bytes   Byte 1     Byte 2     Byte 3     Byte 4
1                 0xxxxxxx             
2                 110xxxxx   10xxxxxx        
3                 1110xxxx   10xxxxxx   10xxxxxx    
4                 11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

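Every ASCII character, viewed as an 8-bit byte, has its most significant bit (MSB) set to 0. In a multi-byte encoded character, by contrast, every byte has its MSB set to 1.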
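The bit patterns in the table can be checked directly. The following is a small sketch in Python; `classify` is a hypothetical helper, and it assumes the bytes come from valid UTF-8 (it does not reject byte values that never occur in UTF-8).

```python
def classify(byte: int) -> str:
    """Classify one byte of a (presumed valid) UTF-8 stream by its top bits."""
    if (byte & 0x80) == 0x00:   # 0xxxxxxx: a complete ASCII character
        return "ascii"
    if (byte & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        return "continuation"
    return "lead"               # 11xxxxxx: first byte of a multi-byte sequence


for ch in ["a", "é", "€"]:
    encoded = ch.encode("utf-8")
    print(ch, [f"{b:08b}" for b in encoded], [classify(b) for b in encoded])
```

Only the `"ascii"` case has the MSB cleared, which is why a byte-wise comparison against an ASCII value cannot match inside a multi-byte sequence.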

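Note that UTF-8 is an encoding of Unicode. They are not the same thing! My answer is about UTF-8 encoded strings (which, fortunately, is the most important encoding).

Another thing to be aware of is Unicode normalization, combining characters, and other characters that "sort of" contain an ASCII character. Take the umlaut ä as an example: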

ä      0xC3A4    LATIN SMALL LETTER A WITH DIAERESIS
ä      0x61CC88  LATIN SMALL LETTER A  +  COMBINING DIAERESIS

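If you search for the ASCII character "a", you will find it in the second line but not in the first, even though both lines logically contain the same "user-perceived character". You can work around this, at least partially, by normalizing the strings beforehand.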
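A short sketch of the normalization issue, using Python's standard `unicodedata` module. The helper `contains_ascii` is hypothetical, and whether NFD or NFC is the right form depends on what a "match" should mean for your application.

```python
import unicodedata

composed = "\u00e4"      # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# The two spellings encode to the byte sequences listed above.
assert composed.encode("utf-8") == b"\xc3\xa4"
assert decomposed.encode("utf-8") == b"a\xcc\x88"

# A plain byte search sees "a" only in the decomposed form.
assert b"a" not in composed.encode("utf-8")
assert b"a" in decomposed.encode("utf-8")


def contains_ascii(text: str, ch: str, form: str = "NFD") -> bool:
    """Normalize `text` to `form`, then search its UTF-8 bytes for ASCII `ch`.

    Hypothetical helper for illustration only.
    """
    normalized = unicodedata.normalize(form, text)
    return ch.encode("ascii") in normalized.encode("utf-8")
```

With NFD both spellings report a match for "a"; with NFC neither does. Either way the result is consistent across the two spellings, which is the point of normalizing first.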

Source: this question on Stack Overflow: https://stackoverflow.com/questions/50982263/
