Java regex does not match outside the ASCII range and behaves differently from Python regex

I want to filter strings out of documents the same way sklearn's CountVectorizer does. It uses the following regex: (?u)\b\w\w+\b. This Java code should behave the same way:

Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b");
Matcher matcher = regex.matcher("this is the document.!? äöa m²");

while(matcher.find()) {
    String match = matcher.group();
    System.out.println(match);
}

But it does not produce the desired output, which is what Python gives:

this
is
the
document
äöa
m²

Instead it outputs:

this
is
the
document

What can I do to include non-ASCII characters, as the Python regex does?

Best answer

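As Wiktor suggested in the comments, you can use (?U) to turn on the UNICODE_CHARACTER_CLASS flag. (In Java the lowercase (?u) is UNICODE_CASE, which only affects case-insensitive matching and leaves \w ASCII-only.) While this does make äöa match, it still does not match m². That is because, even with UNICODE_CHARACTER_CLASS, \w does not recognize ² as a valid alphanumeric character.

As a replacement for \w, you can use [\pN\pL_]. This matches Unicode numbers \pN and Unicode letters \pL (plus _). The \pN class includes the \p{No} (Other Number) category, which contains the superscripts ²³¹ from the Latin-1 Supplement block (listed there under Latin-1 punctuation and symbols). Alternatively, you can add just \p{No} to a character class alongside \w. Note that a multi-letter property name needs braces in Java: \pNo is read as \pN followed by a literal o. This means the following regexes match the words in your string: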

[\pN\pL_]{2,}         # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,}    # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
                      # Adds \p{No} to additionally match ²³¹
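
To see the effect of each flag and each replacement side by side, here is a minimal sketch that runs four patterns over the string from the question. The class name UnicodeTokenDemo and the helper findAll are made up for this illustration; only Pattern and Matcher come from the JDK. The non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokenDemo {
    // Same text as in the question, with a-umlaut, o-umlaut and superscript two escaped.
    static final String TEXT = "this is the document.!? \u00e4\u00f6a m\u00b2";

    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // (?u) is UNICODE_CASE in Java, so \w stays ASCII-only.
        System.out.println(findAll("(?u)\\b\\w\\w+\\b", TEXT));
        // (?U) is UNICODE_CHARACTER_CLASS: \w now covers non-ASCII letters,
        // but still not superscript two.
        System.out.println(findAll("(?U)\\b\\w\\w+\\b", TEXT));
        // The two replacements suggested in this answer.
        System.out.println(findAll("[\\pN\\pL_]{2,}", TEXT));
        System.out.println(findAll("(?U)[\\w\\p{No}]{2,}", TEXT));
    }
}
```

The first pattern should find only this, is, the, document; the second adds äöa; the last two also find m². The \b anchors are dropped in the replacements because the greedy {2,} already consumes each whole run of word characters.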

So why does \w not match ² in Java, while it does in Python?

The Java explanation

Looking at OpenJDK 8-b132's Pattern implementation, we get the following information (I removed the parts that are not relevant to answering the question):

Unicode support

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

\w A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]

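Great! Now we have the definition of \w when the (?U) flag is used. Plugging these Unicode character classes into this amazing tool will tell you exactly what each of them matches. To keep this post from getting too long, I'll just tell you that none of the following classes matches ²: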

\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
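
If you would rather check this locally than in an online tool, the following sketch tests each component class against ² with UNICODE_CHARACTER_CLASS enabled. The class name WordClassProbe and the helper matches are invented for this example; the property names are the ones quoted from the Pattern documentation above.

```java
import java.util.regex.Pattern;

class WordClassProbe {
    // The components of \w under UNICODE_CHARACTER_CLASS, as quoted above.
    static final String[] WORD_COMPONENTS = {
        "\\p{Alpha}", "\\p{gc=Mn}", "\\p{gc=Me}", "\\p{gc=Mc}",
        "\\p{Digit}", "\\p{gc=Pc}", "\\p{IsJoin_Control}"
    };

    // Hypothetical helper: true if the class matches the whole input with (?U) on.
    static boolean matches(String charClass, String input) {
        return Pattern.compile(charClass, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        String superscriptTwo = "\u00b2";
        for (String component : WORD_COMPONENTS) {
            System.out.println(component + " -> " + matches(component, superscriptTwo));
        }
        // For contrast: the general category that does contain superscript two.
        System.out.println("\\p{No} -> " + matches("\\p{No}", superscriptTwo));
        System.out.println("OTHER_NUMBER: "
                + (Character.getType('\u00b2') == Character.OTHER_NUMBER));
    }
}
```

Every component should report false, while \p{No} reports true, which is why adding \p{No} (or using \pN) fixes the match.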

The Python explanation

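So why does Python match ²³¹ when the u flag is combined with \w? This one was hard to track down, but I dug into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After stripping away a lot of fluff about how it gets called, basically the following happens: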

\w is defined as CATEGORY_UNI_WORD, which then gets the prefix SRE_. SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch).
SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which in turn is defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
The numeric checks are what matter here. Taking the first one as an example, Py_UNICODE_ISDECIMAL(ch) is defined as _PyUnicode_IsDecimalDigit(ch).

Now let's look at the function _PyUnicode_IsDecimalDigit(ch):

int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
    if (_PyUnicode_ToDecimalDigit(ch) < 0)
        return 0;
    return 1;
}

As we can see, this function returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0 and 1 otherwise. So what does _PyUnicode_ToDecimalDigit look like?

int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}

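So if the type record that gettyperecord(ch) looks up for the character has the DECIMAL_MASK flag set, the function returns the character's decimal value, which is greater than or equal to 0; otherwise it returns -1.

Note that the flags come from Python's Unicode database entry for the character, not from the bits of its code point, so the code point 0xb2 of ² is never itself ANDed with DECIMAL_MASK (0x02). For ² (U+00B2, general category No) the database records no decimal value, but it does record a digit value and a numeric value of 2. Py_UNICODE_ISDECIMAL is therefore false for ², while Py_UNICODE_ISDIGIT and Py_UNICODE_ISNUMERIC are true (in Python, '²'.isdecimal() is False but '²'.isdigit() is True). Since Py_UNICODE_ISALNUM ORs all four checks together, ² counts as a valid Unicode alphanumeric character in Python, and \w with the u flag matches ².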
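
To make the difference between the two definitions concrete, here is a rough Java sketch of a Python-style word test next to Java's own (?U)\w. It is an approximation under stated assumptions, not a port of CPython: Py_UNICODE_ISALPHA is approximated by Character.isLetter, and the decimal, digit and numeric checks are approximated by the general categories Nd, Nl and No. The class and method names are invented for this illustration.

```java
import java.util.regex.Pattern;

class PythonLikeWordChar {
    private static final Pattern JAVA_UNICODE_WORD =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // Approximation (assumption) of Python's SRE_UNI_IS_WORD:
    // a letter, any kind of number, or an underscore.
    static boolean isPythonLikeWordChar(int codePoint) {
        if (codePoint == '_' || Character.isLetter(codePoint)) {
            return true;
        }
        switch (Character.getType(codePoint)) {
            case Character.DECIMAL_DIGIT_NUMBER: // Nd, e.g. 0-9
            case Character.LETTER_NUMBER:        // Nl, e.g. Roman numeral signs
            case Character.OTHER_NUMBER:         // No, e.g. superscript two
                return true;
            default:
                return false;
        }
    }

    // What Java's \w accepts when UNICODE_CHARACTER_CLASS is on.
    static boolean isJavaUnicodeWordChar(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return JAVA_UNICODE_WORD.matcher(single).matches();
    }

    public static void main(String[] args) {
        int[] samples = { 'a', 0x00e4, '5', '_', 0x00b2, '.' };
        for (int cp : samples) {
            System.out.printf("U+%04X python-like=%b java=%b%n",
                    cp, isPythonLikeWordChar(cp), isJavaUnicodeWordChar(cp));
        }
    }
}
```

Of these samples, the two tests should disagree only on U+00B2, which is the gap that [\pN\pL_] or (?U)[\w\p{No}] closes.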

Source: this question and its answer come from Stack Overflow: https://stackoverflow.com/questions/49409074/
