regex - 为什么正则表达式可以有指数级的运行时间?

标签 regex

可以编写在某些情况下需要指数运行时间的正则表达式。这样的例子是 (aa|aa)* .如果有一个奇数输入a s 它需要指数运行时间。

这很容易测试。如果输入仅包含 a s 并且长度为 51,正则表达式需要几秒钟来计算(在我的机器上)。相反,如果输入长度为 52,则其计算时间并不明显(我使用 JavaRE 的内置 Regex 解析器对此进行了测试)。

我写了一个正则表达式解析器来找出这种行为的原因,但我没有找到。我的解析器可以构建一个 ASTNFA基于正则表达式。之后它可以将 NFA 转换为 DFA .为此,它使用 powerset construction algorithm .

当我解析上面提到的 Rgex 时,解析器会创建一个具有 7 个状态的 NFA - 转换后 DFA 中只剩下 3 个状态。 DFA 代表更明智的 Regex (aa)* ,可以非常快地解析。

因此,我不明白为什么解析器会如此缓慢。这是什么原因?他们不把 NFA 翻译成 DFA 吗?如果是,为什么不呢?他们计算如此缓慢的技术原因是什么?

最佳答案

Russ Cox has a very detailed article about why this is and the history of regexes ( part 2part 3 )。

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.



很大程度上,这归结为“正则”表达式中非正则特性的激增,例如反向引用,以及大多数程序员(继续)无知,不包含这些特性的正则表达式有更好的替代品(其中很多) .

While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.

关于regex - 为什么正则表达式可以有指数级的运行时间?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/8887724/

相关文章:

regex - 在 shell 脚本中进行字符串操作的最佳方法是什么?

regex - 如何以 1 美元的价格使用 notepad++ 的正则表达式

java - String.replaceAll 的正则表达式

正则表达式匹配字符串中的 4 个字母

javascript - 无法转义 "?"但 "\"在 javascript 正则表达式中工作正常

Java 正则表达式 - 试图从以特定字符串开头的行中分离出文本?

regex - powershell中将字符串解析成hash最简洁的方法

java - 使用 Lookahead Regex 时出现模式异常

javascript - 允许逗号和空格分隔的数字列表的正则表达式

java - 用正则表达式分割许多单个字符(java)