java - Unicode escape behavior in Java programs

A few days ago, someone asked me what this program prints:

public static void main(String[] args) {
    // \u0022 is the Unicode escape for double quote (")
    System.out.println("a\u0022.length() + \u0022b".length());
}

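My first thought was that this program should print the length of a\u0022.length() + \u0022b, which is 16 if each \u0022 counts as a single " character, but surprisingly it printed 2. I know that \u0022 is the Unicode escape for ", but I thought this " would be escaped and would stand for a literal " with no special meaning. In fact, Java somehow parsed the string as follows: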

System.out.println("a".length() + "b".length());

I can't understand this strange behavior. Why don't Unicode escapes behave like normal escape sequences?
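The competing readings of the string can be compared side by side. The following is a minimal sketch, not part of the original question; the class name EscapeLengths and its method names are made up for illustration. It contrasts the Unicode escape with the ordinary string escape \" and with a doubled backslash, which keeps the six characters of the escape in the string.

```java
public class EscapeLengths {
    // Unicode escapes: each one becomes a real double quote before the
    // compiler tokenizes the line, leaving "a".length() + "b".length().
    static int withUnicodeEscapes() {
        return "a\u0022.length() + \u0022b".length();
    }

    // Ordinary string escapes: each backslash-quote pair is one quote
    // character inside a single 16-character string literal.
    static int withStringEscapes() {
        return "a\".length() + \"b".length();
    }

    // Doubled backslashes: no Unicode escape is recognized, so the six
    // characters backslash, u, 0, 0, 2, 2 stay in the string (26 in total).
    static int withEscapedBackslashes() {
        return "a\\u0022.length() + \\u0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(withUnicodeEscapes());     // 2
        System.out.println(withStringEscapes());      // 16
        System.out.println(withEscapedBackslashes()); // 26
    }
}
```

Only the first variant changes the token structure of the line; the other two stay inside one string literal.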

Update: Apparently this is one of the brain teasers from Java Puzzlers: Traps, Pitfalls, and Corner Cases, the book by Joshua Bloch and Neal Gafter. More specifically, the question relates to Puzzle 14: Escape Rout.

Best answer

Why don't Unicode escapes behave like normal escape sequences?

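Basically, if I have my terminology right, they are handled at a different point in reading the input: during lexing rather than parsing. They are not escape sequences within character literals or string literals; they are escape sequences for the whole source file. Any character that is not part of a Unicode escape sequence can be replaced with a Unicode escape sequence. So you can write a program entirely in ASCII that actually has non-ASCII variable, method, and class names...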
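As a rough illustration of escapes applying to the whole source file rather than only to literals, here is a small sketch. The class AsciiOnlySource and both method names are hypothetical and not taken from the answer. The file contains only ASCII characters, yet one method name ends in a non-ASCII letter, and one keyword is partly spelled with an escape.

```java
public class AsciiOnlySource {
    // Escaped identifier: the hex digits 00e9 encode e-acute, so the
    // method's real name ends in that non-ASCII letter.
    static String caf\u00e9() {
        return "called";
    }

    // Escapes can also stand in for ASCII characters, even inside a
    // keyword: hex 0073 is the letter s, so the next line starts with
    // the keyword "static".
    \u0073tatic int answer() {
        return 42;
    }

    public static void main(String[] args) {
        System.out.println(caf\u00e9()); // called
        System.out.println(answer());    // 42
    }
}
```

The compiler never sees the backslash forms as such; by the time it builds tokens, the identifier and the keyword are already made of the translated characters.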

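Fundamentally, I think this was a design mistake in Java, because it can lead to some very strange effects (for example, if you put the escape sequence for a line break inside a // comment...), but it is what it is...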
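The line-break case can be sketched as follows. This is an illustrative example with a hypothetical class name, CommentEscape, not code from the answer. The escape \u000a is translated to a line feed before comments are recognized, so the // comment ends in the middle of the line and the rest of the line is compiled as code.

```java
public class CommentEscape {
    static int value() {
        int x = 1;
        // This looks like one comment, but the escape ends it here: \u000a x = 2;
        return x;
    }

    public static void main(String[] args) {
        System.out.println(value()); // 2
    }
}
```

Editors and syntax highlighters may still render the whole line as a comment, which is what makes this effect easy to miss.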

This is described in detail in section 3.3 of the JLS:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
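To illustrate the sentence about supplementary characters, here is a short sketch under the assumption that U+1F600 is a convenient example code point; the class and constant names are invented for this example. The character is written as a surrogate pair, that is, two consecutive Unicode escapes, and occupies two UTF-16 code units while counting as one code point.

```java
public class SupplementaryEscape {
    // U+1F600 lies outside the Basic Multilingual Plane, so it is written
    // as a surrogate pair: two consecutive Unicode escapes.
    static final String GRINNING_FACE = "\ud83d\ude00";

    public static void main(String[] args) {
        // Number of UTF-16 code units in the string.
        System.out.println(GRINNING_FACE.length()); // 2
        // Number of Unicode code points in the string.
        System.out.println(
                GRINNING_FACE.codePointCount(0, GRINNING_FACE.length())); // 1
        // The code point itself, in hexadecimal.
        System.out.println(
                Integer.toHexString(GRINNING_FACE.codePointAt(0))); // 1f600
    }
}
```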

...

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
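The extra-u form described above is accepted by the compiler like any other Unicode escape. A minimal sketch, with a hypothetical class name and method names, showing that escapes with one, two, or several u's denote the same character, and that the puzzle program behaves the same with \uu0022:

```java
public class MultipleU {
    // Hex 0041 is the letter A; the number of u's does not change that.
    static String oneU() {
        return "\u0041";
    }

    static String twoU() {
        return "\uu0041";
    }

    static String fourU() {
        return "\uuuu0041";
    }

    // The puzzle expression with doubled u's still turns into
    // "a".length() + "b".length().
    static int quoteWithTwoU() {
        return "a\uu0022.length() + \uu0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(oneU().equals(twoU()));  // true
        System.out.println(oneU().equals(fourU())); // true
        System.out.println(quoteWithTwoU());        // 2
    }
}
```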

Source: a similar question on Stack Overflow, "Unicode escape behavior in Java programs": https://stackoverflow.com/questions/35901247/
