regex - 为什么在这个 Perl 正则表达式中 `\s+` 比 `\s\s*` 快得多？

为什么用 \s+ 替换 \s*(甚至 \s\s*)会导致此输入的加速？

use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
    '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
    '/\s+\n/' => sub { $x =~ /\s+\n/ },
});

Link to online version

我注意到我的代码中的正则表达式 s/\s*\n\s*/\n/g 很慢 - 当给定一个由大量空格和一些非空格组成的 450KB 输入文件时到处都有空格，最后还有一个换行符 - 正则表达式挂起并且从未完成。

我直观地用 s/\s+\n/\n/g 替换了正则表达式； s/\n\s+/\n/g; 一切都很好。

但是为什么速度这么快呢？使用 re Debug => "EXECUTE" 后，我注意到 \s+ 版本以某种方式优化为仅在一次迭代中运行: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)
   0 <> <       _%n>         |  1:STAR(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   1 < > <      _%n>         |  1:STAR(3)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   2 <  > <     _%n>         |  1:STAR(3)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   3 <   > <    _%n>         |  1:STAR(3)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   4 <    > <   _%n>         |  1:STAR(3)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   5 <     > <  _%n>         |  1:STAR(3)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   6 <      > < _%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 <       _> <%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 <       _> <%n>         |  3:  EXACT <\n>(5)
   9 <       _%n> <>         |  5:  END(0)
Match successful!

Matching REx "\s+\n" against "       _%n"
Matching stclass SPACE against "       _" (8 bytes)
   0 <> <       _%n>         |  1:PLUS(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...

我知道如果不存在换行符，Perl 5.10+ 将立即使正则表达式失败(不运行它)。我怀疑它正在使用换行符的位置来减少搜索量。对于上述所有情况，它似乎巧妙地减少了所涉及的回溯(通常 /\s*\n/ 针对一串空格将花费指数时间)。谁能深入了解为什么 \s+ 版本速度如此之快？

另请注意，\s*? 不提供任何加速。

最佳答案

首先，即使生成的正则表达式不会保持相同的含义，我们也可以将正则表达式简化为 \s*0 和 \s+0 并使用 ( “”×4)。 “_0” 作为输入。对于怀疑者，你可以看到here滞后仍然存在。

现在让我们考虑以下代码:

$x = (" " x 4) . "_ 0";
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

使用use re debugcolor;挖掘一下，我们得到以下输出:

Guessing start of match in sv for REx "\s*0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
   0 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
   1 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 3 times out of 2147483647...
                                  failed...
   2 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 2 times out of 2147483647...
                                  failed...
   3 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 1 times out of 2147483647...
                                  failed...
   5 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 0 times out of 2147483647...
   5 <    _0>|  3:  EXACT <0>(5)
   6 <    _0>|  5:  END(0)
Match successful!

-----------------------

Guessing start of match in sv for REx "\s+0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes)
   0 <    _0>|  1:PLUS(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed

Perl 似乎 be optimized for failure 。它首先会查找常量字符串(仅消耗 O(N))。在这里，它将查找 0 :在偏移量 5 处找到 float 子字符串“0”...

然后它将从正则表达式的变量部分开始，分别是\s*和\s+，针对整个最小字符串检查:

Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

之后，它将查找第一个满足 stclass 要求的位置，此处为位置 0。

\s*0:
- 从0开始，找到4个空格则失败；
- 从1开始，找到3个空格则失败；
- 从2开始，找到2个空格则失败；
- 从3开始，找到1个空格则失败；
- 从 4 开始，找到 0 个空格，然后不会失败；
- 查找精确的0
\s+0:
- 从0开始，找到4个空格就失败。由于未匹配最小空格数，正则表达式立即失败。

如果您想享受 Perl 正则表达式优化的乐趣，可以考虑以下正则表达式 /*\n 和 /*\n。乍一看，它们看起来相同，具有相同的含义......但是如果您针对 (""x 40000) 运行它。 "_\n" 第一个将检查所有可能性，而第二个将查找 "\n" 并立即失败。

在普通的、非优化的正则表达式引擎中，两个正则表达式都可能导致灾难性的回溯，因为它们需要在模式碰撞时重试该模式。但是，在上面的示例中，第二个 Perl 不会失败，因为它已经过优化以查找 float 子字符串“0%n”

<小时/>

您可以在 Jeff Atwood's blog 上看到另一个示例.

另请注意，问题不在于 \s 考虑因素，而是使用 xx* 而不是 x+ 的任何模式，请参阅 example with 0s还有regex explosive quantifiers

通过如此简短的示例，行为是“可找到的”，但如果您开始使用复杂的模式，则很难发现，例如:Regular expression hangs program (100% CPU usage)

关于regex - 为什么在这个 Perl 正则表达式中 `\s+` 比 `\s\s*` 快得多？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/38431931/

regex - 为什么在这个 Perl 正则表达式中 `\s+` 比 `\s\s*` 快得多？

上一篇：haskell - 将 Haskell 用于大型实时系统(: how (if?)？

下一篇：haskell - 为 n 维网格类型编写 cojoin 或 cobind