下面的代码包含一个正则表达式,旨在提取 C# 字符串文字,但正则表达式匹配超过几个字符的输入字符串的性能很差。
class Program
{
private static void StringMatch(string s)
{
// regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
if (m.Success)
Trace.WriteLine(m.Value);
else
Trace.WriteLine("no match");
}
public static void Main()
{
// this first string is unterminated (so the match fails), but it returns instantly
StringMatch("\"OK");
// this string is terminated (the match succeeds)
StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");
// this string is unterminated (so the match will fail), but it never returns
StringMatch("\"This is another unterminated string and takes FOREVER to match");
}
}
我可以将正则表达式重构为不同的形式,但是谁能解释为什么性能如此糟糕?
最佳答案
你遇到了 catastrophic backtracking:
让我们稍微简化一下正则表达式(没有转义引号和第二个可选组,因为在您的评论中,它与测试的字符串无关):
"(([^\\"]*))*"
([^\\"]*)
匹配除引号或反斜杠之外的任何字符串。这 again 包含在一个可选组中,该组可以重复任意次数。
现在输入字符串 "ABC
,正则表达式引擎需要尝试以下排列:
-
"
,ABC
-
"
,ABC
,<empty string>
-
"
,AB
,C
-
"
,AB
,C
,<empty string>
-
"
,AB
,<empty string>
,C
-
"
,AB
,<empty string>
,C
,<empty string>
-
"
,<empty string>
,AB
,C
-
"
,<empty string>
,AB
,C
,<empty string>
-
"
,<empty string>
,AB
,<empty string>
,C
,<empty string>
-
"
,<empty string>
,AB
,<empty string>
,C
-
"
,A
,BC
-
"
,A
,BC
,<empty string>
-
"
,A
,<empty string>
,BC
-
"
,<empty string>
,A
,BC
- 等
-
"
,A
,B
,C
-
"
,A
,B
,C
,<empty string>
-
"
,A
,B
,<empty string>
,C
- 等等
然后每一个都失败了,因为没有后面的 "
.
此外,您只是测试子字符串,而不是强制正则表达式匹配整个字符串。而且您通常希望使用正则表达式的逐字字符串来减少所需的反斜杠数量。这个怎么样:
foundMatch = Regex.IsMatch(subjectString,
@"\A # Start of the string
"" # Match a quote
(?: # Either match...
\\. # an escaped character
| # or
[^\\""] # any character except backslash or quote
)* # any number of times
"" # Match a quote
\Z # End of the string",
RegexOptions.IgnorePatternWhitespace);
关于c# - 正则表达式性能缓慢,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/9687596/