php - UTF-8 模式正则表达式中的非 ASCII 字符

问题

尽管 PHP 手册指出:

"In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

为什么波斯语数字在“UTF-8 模式”下匹配 \d 或 [[:digit:]]？

阐述

在 non-related question 中的回答者评论中提到在正则表达式中，\d 不仅匹配 ASCII 数字 0 到 9，而且还匹配波斯数字 (0 1 2 3 4 5 6 7).

上面提到的问题被标记为java但这种行为也可以在 PHP 中观察到。考虑到这一点，我编写了以下“测试”:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);

生成的数组 $capture 包含 5 仅的匹配项。

使用 u 修饰符打开“UTF-8 模式”并运行:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);

$capture 的结果包含 3 和 5 上的匹配项。

注意事项

此问题涉及 PHP 5.6.22(最新)
这两个测试都是在明确使用 C 语言环境时执行的。

最佳答案

因为文档坏了。不幸的是，这并不是唯一的地方。

PHP 使用 PCRE在幕后实现其 preg_* 功能。因此，PCRE 的文档在那里是权威的。 PHP 的文档是基于 PCRE 的，但看起来您又发现了一个错误。

以下是您可以在 PCRE's docs 中阅读的内容(强调我的):

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:
[:alnum:]  becomes  \p{Xan}
[:alpha:]  becomes  \p{L}
[:blank:]  becomes  \h
[:digit:]  becomes  \p{Nd}
[:lower:]  becomes  \p{Ll}
[:space:]  becomes  \p{Xps}
[:upper:]  becomes  \p{Lu}
[:word:]   becomes  \p{Xwd}

如果您深入研究 PHP 文档，您会发现 the following :

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

不幸的是，这是一个谎言。 PHP 中的 u 修饰符表示 PCRE_UTF8 | PCRE_UCP(UCP 代表 Unicode 字符属性)。 PCRE_UCP 标志改变了 \d、\w 等的含义，正如您从上面的文档中看到的那样。您的测试证实了这一点。

作为旁注，不要从一种正则表达式风格推断另一种风格的属性。它并不总是有效(嘿，甚至 this chart 忘记了 PCRE_UCP 选项)。

关于php - UTF-8 模式正则表达式中的非 ASCII 字符，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/37658139/

php - UTF-8 模式正则表达式中的非 ASCII 字符

问题

阐述

注意事项

u (`PCRE_UTF8`)

上一篇：php - 没有数据库的 Laravel 中的 JWT 身份验证

下一篇：php - 如何在亚马逊 MWS 中发布产品？

php - UTF-8 模式正则表达式中的非 ASCII 字符

问题

阐述

注意事项

u (PCRE_UTF8)

上一篇：php - 没有数据库的 Laravel 中的 JWT 身份验证

下一篇：php - 如何在亚马逊 MWS 中发布产品？

u (`PCRE_UTF8`)