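Update: in Java 11 the bug described below appears to be fixed
(it may have been fixed even earlier, but I don't know in exactly which version; the bug report for a similar problem, linked in nhahtdh's answer, suggests Java 9).
TL;DR (before the fix):
Why do [^\\D2], [^[^0-9]2] and [^2[^0-9]] give different results in Java?
Code used for the tests. You can skip it for now.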
String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };
System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
test.matches(regexes[0]), test.matches(regexes[1]),
test.matches(regexes[2]), test.matches(regexes[3]),
test.matches(regexes[4]), test.matches(regexes[5]));
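Let's say I need a regex that accepts any character that is not a digit, with the exception of 2, which should also be accepted. Such a regex should therefore match every character except 0, 1, 3, 4, ..., 9. I can write it in at least two ways, each being the union of everything that is not a digit with 2:
[[^0-9]2]
[\\D2]
Both of these regexes work as expected: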
match , [[^0-9]2] , [\D2]
--------------------------
x , true , true
1 , false , false
2 , true , true
3 , false , false
^ , true , true
[ , true , true
] , true , true
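Now let's say I want to invert the set of accepted characters (so I want to accept every digit except 2). I could write a regex that explicitly lists all accepted characters, such as
[013-9]
or try to negate the two regexes described above by wrapping them in another [^...], giving
[^\\D2]
[^[^0-9]2]
or even
[^2[^0-9]]
but to my surprise, only the first two versions ([013-9] and [^\\D2]) work as expected: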
match | [[^0-9]2] , [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]]
------+--------------------+-------------------------------------------
x | true , true | false , false , true , true
1 | false , false | true , true , false , true
2 | true , true | false , false , false , false
3 | false , false | true , true , false , true
^ | true , true | false , false , true , true
[ | true , true | false , false , true , true
] | true , true | false , false , true , true
So my question is: why don't [^[^0-9]2] and [^2[^0-9]] behave like [^\D2]? Can I somehow correct these regexes so that I can still use [^0-9] inside them?
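To re-run the comparison above on whichever JDK is at hand, a small probe along the following lines may help. It is only a sketch: the class name NegationProbe and the method mismatches are made up for illustration, and the sample inputs are the ones from the test code in the question. It reports the inputs on which a candidate regex disagrees with the explicit reference [013-9].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class NegationProbe {
    // Sample inputs taken from the test code in the question.
    static final String[] SAMPLES = { "x", "1", "2", "3", "^", "[", "]" };

    /** Returns the samples on which the two regexes disagree (empty if they behave alike). */
    static List<String> mismatches(String referenceRegex, String candidateRegex) {
        List<String> result = new ArrayList<>();
        for (String s : SAMPLES) {
            if (s.matches(referenceRegex) != s.matches(candidateRegex)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String reference = "[013-9]"; // explicit "any digit except 2"
        String[] candidates = { "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
        for (String candidate : candidates) {
            // The result for the last two candidates depends on the JDK version in use.
            System.out.println(System.getProperty("java.version") + "  " + candidate
                    + "  disagrees on: " + mismatches(reference, candidate));
        }
    }
}
```

An empty list means the candidate behaves like [013-9] on every sample. For [^\\D2] the list should be empty regardless of the JDK version, since that form works as expected both before and after the fix.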
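Best answer
There is some strange voodoo going on in the character-class parsing code of Oracle's implementation of the Pattern class, which is the one that comes with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM implementations (notably GNU Classpath) parse the regexes in the question.
From this point on, any reference to the Pattern class and its internal workings is strictly restricted to Oracle's implementation (the reference implementation).
It would take some time to read and understand how the Pattern class parses nested negation as shown in the question. However, I have written a program [1] to extract information from a Pattern object (with the Reflection API) to look at the result of compilation. The output below is from running my program on Java HotSpot Client VM version 1.7.0_51.
[1]: Currently, the program is an embarrassing mess. I will update this post with a link once I have finished and refactored it.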
[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Nothing strange here.
[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
The last 2 cases above compile to the same program as [^0-9], which is counter-intuitive.
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
Nothing strange in the 2 cases above, as stated in the question.
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0030][U+0031]
01
Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
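These 2 cases work as expected, as stated in the question. However, note how, for [^\D2], the engine takes the complement of the first character class (\D) and then applies set difference to remove the character class consisting of the remaining characters.
[^[^0-9]2]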
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
As confirmed by Keppil's testing in the comments, the output above shows that all 3 regexes above compile to the same program!
[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
Instead of NOT(UNION(2, NOT(0-9))), which is 0-13-9, we get UNION(NOT(2), NOT(0-9)), which is equivalent to NOT(2).
[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
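To make the gap between the intended set NOT(UNION(2, NOT(0-9))) and the compiled set UNION(NOT(2), NOT(0-9)) for [^2[^0-9]] concrete, the two expressions can be modelled with plain predicates. This is only an illustrative sketch: it uses java.util.function.IntPredicate rather than the JDK's internal node classes, and the class name NegationAlgebra and its members are invented for the example.

```java
import java.util.function.IntPredicate;

// Hypothetical model of the two set expressions; it does not touch java.util.regex at all.
final class NegationAlgebra {
    static final IntPredicate DIGIT = c -> c >= '0' && c <= '9';
    static final IntPredicate TWO = c -> c == '2';

    // Intended meaning of [^2[^0-9]]: NOT(UNION(2, NOT(0-9)))
    static final IntPredicate INTENDED = TWO.or(DIGIT.negate()).negate();

    // What the node dump shows: UNION(NOT(2), NOT(0-9))
    static final IntPredicate COMPILED = TWO.negate().or(DIGIT.negate());

    /** Collects the printable ASCII characters (U+0020 to U+007E) accepted by a predicate. */
    static String accepted(IntPredicate p) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0x20; c < 0x7F; c++) {
            if (p.test(c)) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("intended: " + accepted(INTENDED));
        System.out.println("compiled: " + accepted(COMPILED));
    }
}
```

The intended expression accepts only 013456789, while the compiled one accepts every printable ASCII character except 2, which is the behaviour recorded for [^2[^0-9]] in the question's table.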
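Due to the same bug, the regex [^2[^[^0-9]]] compiles to the same program as [^2[^0-9]]. There is a bug report, unresolved at the time of writing, that seems to be of the same nature: JDK-6609854.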
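For code that has to run on an affected JDK, one option is to avoid nesting a class inside [^...] altogether. The sketch below lists three spellings of "any ASCII digit except 2" that do not rely on nested negation. The class and constant names are invented for this example, and these are suggested workarounds rather than anything prescribed by the JDK.

```java
// Hypothetical holder for three equivalent patterns; names are illustrative only.
final class DigitExceptTwo {
    // Explicit list of the accepted characters.
    static final String EXPLICIT = "[013-9]";
    // Intersection with a negated class, the "subtraction" form shown in the Pattern Javadoc.
    static final String INTERSECTION = "[0-9&&[^2]]";
    // Negative lookahead rejects 2 first, then a plain digit class matches.
    static final String LOOKAHEAD = "(?!2)[0-9]";

    public static void main(String[] args) {
        String[] tests = { "x", "1", "2", "3", "^", "[", "]" };
        for (String t : tests) {
            System.out.printf("%s | %b , %b , %b%n", t,
                    t.matches(EXPLICIT), t.matches(INTERSECTION), t.matches(LOOKAHEAD));
        }
    }
}
```

All three accept 0, 1 and 3 through 9 and reject 2 and any non-digit. None of them keeps [^0-9] inside the pattern, so they sidestep the question rather than repair the original regexes.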
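Explanation
Preliminary
Below are implementation details of the Pattern class that one should know before reading further:
The Pattern class compiles a String into a chain of nodes; each node has a small, well-defined responsibility and delegates the rest of the work to the next node in the chain. The Node class is the base class of all nodes.
The CharProperty class is the base class of all character-class related Nodes.
The BitClass class is a subclass of CharProperty that uses a boolean[] array to speed up matching of Latin-1 characters (code point <= 255). It has an add method, which allows characters to be added during compilation.
CharProperty.complement, Pattern.union and Pattern.intersection are methods corresponding to set operations. What they do is self-explanatory.
Pattern.setDifference is the asymmetric set difference.
Parsing a character class at first glance
Before looking at the full code of the CharProperty clazz(boolean consume) method, which is responsible for parsing a character class, let us look at an extremely simplified version to understand the flow of the code:
private CharProperty clazz(boolean consume) {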
// [Declaration and initialization of local variables - OMITTED]
BitClass bits = new BitClass();
int ch = next();
for (;;) {
switch (ch) {
case '^':
// Negates if first char in a class, otherwise literal
if (firstInClass) {
// [CODE OMITTED]
ch = next();
continue;
} else {
// ^ not first in class, treat as literal
break;
}
case '[':
// [CODE OMITTED]
ch = peek();
continue;
case '&':
// [CODE OMITTED]
continue;
case 0:
// [CODE OMITTED]
// Unclosed character class is checked here
break;
case ']':
// [CODE OMITTED]
// The only return statement in this method
// is in this case
break;
default:
// [CODE OMITTED]
break;
}
node = range(bits);
// [CODE OMITTED]
ch = peek();
}
}
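The code basically reads the input (the input String converted to a null-terminated int[] of code points) until it hits ] or the end of the String (unclosed character class).
The code is a bit confusing because continue and break are mixed together inside the switch block. However, once you realize that continue belongs to the outer for loop and break belongs to the switch block, the code is easy to understand:
Cases ending in continue never execute the code after the switch statement.
Cases ending in break may execute the code after the switch statement (if they have not returned already).
With the observation above, we can see that whenever a character is found to be non-special and should be included in the character class, the code after the switch statement is executed, in which node = range(bits); is the first statement.
If you check the source code, the method CharProperty range(BitClass bits) parses "a single character or a character range in a character class". The method either returns the same BitClass object that was passed in (with the new character added) or returns a new instance of the CharProperty class.
The gory details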
Next, let us look at the full version of the code (with the part that parses the character class intersection && omitted):
private CharProperty clazz(boolean consume) {
CharProperty prev = null;
CharProperty node = null;
BitClass bits = new BitClass();
boolean include = true;
boolean firstInClass = true;
int ch = next();
for (;;) {
switch (ch) {
case '^':
// Negates if first char in a class, otherwise literal
if (firstInClass) {
if (temp[cursor-1] != '[')
break;
ch = next();
include = !include;
continue;
} else {
// ^ not first in class, treat as literal
break;
}
case '[':
firstInClass = false;
node = clazz(true);
if (prev == null)
prev = node;
else
prev = union(prev, node);
ch = peek();
continue;
case '&':
// [CODE OMITTED]
// There are interesting things (bugs) here,
// but it is not relevant to the discussion.
continue;
case 0:
firstInClass = false;
if (cursor >= patternLength)
throw error("Unclosed character class");
break;
case ']':
firstInClass = false;
if (prev != null) {
if (consume)
next();
return prev;
}
break;
default:
firstInClass = false;
break;
}
node = range(bits);
if (include) {
if (prev == null) {
prev = node;
} else {
if (prev != node)
prev = union(prev, node);
}
} else {
if (prev == null) {
prev = node.complement();
} else {
if (prev != node)
prev = setDifference(prev, node);
}
}
ch = peek();
}
}
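Looking at the code in case '[': of the switch statement and the code after the switch statement:
The node variable stores the result of parsing a unit (a standalone character, a character range, a shorthand character class, a POSIX/Unicode character class or a nested character class).
The prev variable stores the compilation result so far, and is always updated right after a unit is compiled into node.
Since the local variable boolean include, which records whether the character class is negated, is never passed to any method call, it can only be acted upon inside this method. The only place where include is read and processed is after the switch statement.
Under construction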
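The two observations above (case '[' merges a nested class into prev by union without consulting include, and include is applied only to units handled after the switch) can be tried out in a toy model. The sketch below is not the JDK code: ToyClassParser is an invented class that handles only a leading ^, nested brackets, literal characters and simple ranges, and it represents sets as IntPredicate instead of CharProperty nodes. Under those assumptions it follows the same control flow as the listing above.

```java
import java.util.function.IntPredicate;

/** Toy model of the parsing loop shown above; a simplified illustration, NOT the JDK implementation. */
final class ToyClassParser {
    private final String src;
    private int pos = 1; // index just past the opening '['

    private ToyClassParser(String src) {
        this.src = src;
    }

    /** Parses a well-formed class such as "[^2[^0-9]]" (no escapes, no &&). */
    static IntPredicate parse(String characterClass) {
        return new ToyClassParser(characterClass).clazz();
    }

    private IntPredicate clazz() {
        IntPredicate prev = null;
        boolean include = true;
        if (src.charAt(pos) == '^') { // a leading ^ flips include for this nesting level only
            include = false;
            pos++;
        }
        for (;;) {
            char ch = src.charAt(pos);
            if (ch == ']' && prev != null) {
                pos++;
                return prev;
            }
            if (ch == '[') {
                pos++;
                IntPredicate nested = clazz();
                // Mirrors case '[': merged by union, include is never consulted here.
                prev = (prev == null) ? nested : prev.or(nested);
                continue;
            }
            // Literal character or simple range such as 0-9.
            final int lo = ch;
            int end = lo;
            pos++;
            if (src.charAt(pos) == '-' && src.charAt(pos + 1) != ']') {
                end = src.charAt(pos + 1);
                pos += 2;
            }
            final int hi = end;
            IntPredicate node = c -> c >= lo && c <= hi;
            if (include) {
                prev = (prev == null) ? node : prev.or(node);
            } else {
                // Mirrors node.complement() and setDifference(prev, node).
                prev = (prev == null) ? node.negate() : prev.and(node.negate());
            }
        }
    }

    public static void main(String[] args) {
        for (String regex : new String[] { "[^[^0-9]]", "[^[^0-9]2]", "[^2[^0-9]]" }) {
            IntPredicate p = parse(regex);
            System.out.printf("%-12s x=%b 1=%b 2=%b%n", regex, p.test('x'), p.test('1'), p.test('2'));
        }
    }
}
```

With this model, [^[^0-9]] and [^[^0-9]2] both accept x and reject 1 and 2, while [^2[^0-9]] accepts x and 1 and rejects only 2. That agrees with the table in the question and with the node dumps above, which suggests that the missing handling of include for nested classes is enough to explain the observed behaviour.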
Source: "java - Bug in double negation of regex character classes?" on Stack Overflow: https://stackoverflow.com/questions/21934168/