php - The regex metacharacters \w and \pL in PHP

Are \w and \pL equivalent in PHP versions 5.3.4 through 5.5.0beta1?

<?php
preg_match_all('#\w#u','سیب',$f);
var_dump($f);

preg_match_all('#\pL#u','سیب',$f);
var_dump($f);

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "س"
    [1]=>
    string(2) "ی"
    [2]=>
    string(2) "ب"
  }
}
array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "س"
    [1]=>
    string(2) "ی"
    [2]=>
    string(2) "ب"
  }
}

Try the above snippet in the Online PHP shell

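Best answer

It looks like when you use the u modifier in a PCRE regular expression, PHP sets the PCRE_UCP flag in addition to the PCRE_UTF8 flag. As a result, \w and the other shorthand and POSIX character classes are classified by Unicode properties rather than by the default ASCII-only rules. From the PCRE man page: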

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters.

This is confirmed in the PHP source code (lines 366-372), where we see:

        case 'u':   coptions |= PCRE_UTF8;
/* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
   characters, even in UTF-8 mode. However, this can be changed by setting
   the PCRE_UCP option. */
#ifdef PCRE_UCP
                    coptions |= PCRE_UCP;
#endif
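The effect of that flag can be observed from PHP itself. The following is only a minimal sketch, assuming the script file is saved as UTF-8 and that PHP was linked against a PCRE build with Unicode property support; the helper name matches_digit is made up for illustration. It checks whether \d accepts an Arabic-Indic digit with and without the u modifier.

```php
<?php
// Hypothetical helper: does $subject contain something \d accepts?
// With $unicode = true the pattern carries the u modifier, which
// (per the source excerpt above) enables UTF-8 mode and PCRE_UCP.
function matches_digit(string $subject, bool $unicode): bool
{
    $pattern = $unicode ? '/\d/u' : '/\d/';
    return preg_match($pattern, $subject) === 1;
}

$arabicThree = '٣'; // U+0663 ARABIC-INDIC DIGIT THREE, two bytes in UTF-8

// With the u modifier, \d follows \p{Nd}, so this is expected to match.
var_dump(matches_digit($arabicThree, true));

// Without it, the subject is treated as raw bytes and \d should only
// accept ASCII 0-9 under PCRE's default tables.
var_dump(matches_digit($arabicThree, false));

// A plain ASCII digit matches either way.
var_dump(matches_digit('3', false));
```

The second call depends on the character tables in use, so treat its result as the likely outcome under the default "C" tables rather than a guarantee.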

So, from the same man page linked above, you can see that when PCRE_UCP is set, the character classes become:

\d any character that \p{Nd} matches (decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

\w any character that \p{L} or \p{N} matches, plus underscore
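Given those definitions, \w and \pL are not equivalent under the u modifier: \w also accepts numbers and the underscore, while \pL accepts letters only. The question's test string contains only letters, which is why both patterns returned the same result. A small sketch that makes the difference visible, assuming a UTF-8 source file and a PCRE build with Unicode property support (count_matches is a hypothetical helper, not part of PHP):

```php
<?php
// Hypothetical helper: number of matches of $pattern in $subject.
function count_matches(string $pattern, string $subject): int
{
    $count = preg_match_all($pattern, $subject, $unused);
    return $count === false ? 0 : $count;
}

// Three Persian letters, an underscore, an Arabic-Indic digit,
// an ASCII letter, and an ASCII digit.
$subject = 'سیب_٣a1';

// \w under the u modifier: letters, numbers, and underscore.
$wordChars = count_matches('/\w/u', $subject);

// \pL: letters only.
$letters = count_matches('/\pL/u', $subject);

var_dump($wordChars); // expected to be larger than $letters
var_dump($letters);

// A closer hand-written approximation of \w in this mode:
var_dump(count_matches('/[\pL\pN_]/u', $subject));
```

If an exact letters-only match is what you need, \pL is the safer choice; use \w only when digits and underscores are acceptable too.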

This Q&A on the regex metacharacters \w and \pL in PHP is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/15696801/
