.net - How does RegexOptions.Compiled work?

Tags: .net regex

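What happens behind the scenes when you mark a regular expression as one to be compiled? How does this differ from a cached regular expression?

Using this information, how do you determine when the cost of compilation is negligible compared with the performance gain?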

Best answer

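RegexOptions.Compiled instructs the regular expression engine to compile the regular expression into IL using lightweight code generation (LCG). This compilation happens while the Regex object is being constructed and makes construction much slower. In return, matches that use the regular expression are faster.

If you do not specify this flag, your regular expression is considered "interpreted".

Take this example: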

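// Requires: using System; using System.Diagnostics; using System.Text.RegularExpressions;
// Both methods are assumed to live inside a class (for example, Program).
public static void TimeAction(string description, int times, Action func)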
{
    // warmup
    func();

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < times; i++)
    {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

static void Main(string[] args)
{
    var simple = "^\\d+$";
    var medium = @"^((to|from)\W)?(?<url>http://[\w\.:]+)/questions/(?<questionId>\d+)(/(\w|-)*)?(/(?<answerId>\d+))?";
    var complex = @"^(([^<>()[\]\\.,;:\s@""]+"
      + @"(\.[^<>()[\]\\.,;:\s@""]+)*)|("".+""))@"
      + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
      + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
      + @"[a-zA-Z]{2,}))$";


    string[] numbers = new string[] {"1","two", "8378373", "38737", "3873783z"};
    string[] emails = new string[] { "sam@sam.com", "sss@s", "sjg@ddd.com.au.au", "onelongemail@oneverylongemail.com" };

    foreach (var item in new[] {
        new {Pattern = simple, Matches = numbers, Name = "Simple number match"},
        new {Pattern = medium, Matches = emails, Name = "Simple email match"},
        new {Pattern = complex, Matches = emails, Name = "Complex email match"}
    })
    {
        int i = 0;
        Regex regex;

        TimeAction(item.Name + " interpreted uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        i = 0;
        TimeAction(item.Name + " compiled uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern, RegexOptions.Compiled);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern);
        i = 0;
        TimeAction(item.Name + " prepared interpreted match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern, RegexOptions.Compiled);
        i = 0;
        TimeAction(item.Name + " prepared compiled match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

    }
}

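It performs 4 tests on 3 different regular expressions. First, it tests a single one-off match (compiled vs. non-compiled). Second, it tests repeated matches that reuse the same regular expression.

The results on my machine (compiled in release, no debugger attached)

1,000 single matches (construct Regex, Match and dispose)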

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |    4 ms        |    26 ms           |    31 ms
Interpreted | x64      |    5 ms        |    29 ms           |    35 ms
Compiled    | x86      |  913 ms        |  3775 ms           |  4487 ms
Compiled    | x64      | 3300 ms        | 21985 ms           | 22793 ms

1,000,000 matches - reusing the Regex object

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |  422 ms        |   461 ms           |  2122 ms
Interpreted | x64      |  436 ms        |   463 ms           |  2167 ms
Compiled    | x86      |  279 ms        |   166 ms           |  1268 ms
Compiled    | x64      |  281 ms        |   176 ms           |  1180 ms

These results show that compiled regular expressions can be up to 60% faster in cases where you reuse the Regex object. However, in some cases they can be hundreds of times slower to construct, approaching 3 orders of magnitude.

It also shows that the x64 version of .NET can be roughly 4 to 6 times slower when it comes to compilation of regular expressions.


The recommendation would be to use the compiled version in cases where either

  1. You do not care about object initialization cost and need the extra performance boost. (note we are talking fractions of a millisecond here)
  2. You care a little bit about initialization cost, but are reusing the Regex object so many times that it will compensate for it during your application life cycle.
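
To put a rough number on the second case, the figures from the tables above can be turned into a break-even estimate. The sketch below is only arithmetic on the x86 "Simple Email Check" column; the class and the helper name `BreakEvenMatches` are made up for illustration, and the result will differ per pattern, machine, and runtime version.

```csharp
using System;

public static class CompiledBreakEven
{
    // Hypothetical helper: number of matches after which the one-time
    // compilation cost is paid back by the faster per-match time.
    public static double BreakEvenMatches(
        double compiledConstructMs, double interpretedConstructMs,
        double interpretedMatchMs, double compiledMatchMs)
    {
        double extraConstructionCost = compiledConstructMs - interpretedConstructMs;
        double savingPerMatch = interpretedMatchMs - compiledMatchMs;
        return extraConstructionCost / savingPerMatch;
    }

    public static void Main()
    {
        // x86 "Simple Email Check" column from the tables above.
        double n = BreakEvenMatches(
            compiledConstructMs: 3775.0 / 1000,     // per construction (includes one match)
            interpretedConstructMs: 26.0 / 1000,
            interpretedMatchMs: 461.0 / 1000000,    // per match, reused object
            compiledMatchMs: 166.0 / 1000000);
        Console.WriteLine("Break-even after roughly {0:F0} matches", n);
    }
}
```

With these particular figures, the compiled regex pays for itself after roughly 12,700 matches on a single Regex instance; below that, the interpreted version is cheaper overall.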

Spanner in the works, the Regex cache

The regular expression engine contains an LRU cache which holds the last 15 regular expressions that were tested using the static methods on the Regex class.

For example: Regex.Replace, Regex.Match, etc. all use the Regex cache.

The size of the cache can be increased by setting Regex.CacheSize. It accepts changes in size any time during your application's life cycle.
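
As a minimal sketch of adjusting the cache, the snippet below reads and sets `Regex.CacheSize` before using a static helper. The value 50 and the class name are arbitrary choices for illustration, not a recommendation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheSizeDemo
{
    public static void Main()
    {
        Console.WriteLine("Cache size before: {0}", Regex.CacheSize);

        // Assumed value for illustration; choose a size based on how many
        // distinct patterns actually go through the static helpers.
        Regex.CacheSize = 50;

        // Static helpers such as Regex.IsMatch go through the cache.
        bool isNumber = Regex.IsMatch("8378373", @"^\d+$");
        Console.WriteLine("Cache size after: {0}, isNumber: {1}",
            Regex.CacheSize, isNumber);
    }
}
```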

New regular expressions are only cached by the static helpers on the Regex class. If you construct Regex objects yourself, the cache is checked (for reuse, and a matching entry is bumped); however, the regular expression you construct is not added to the cache.

This cache is a trivial LRU cache, implemented using a simple doubly linked list. If you happen to increase it to 5000 and use 5000 different calls on the static helpers, every regular expression construction will crawl the 5000 entries to see if it has previously been cached. There is a lock around the check, so the check can decrease parallelism and introduce thread blocking.
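
To make the cost of that design concrete, here is a toy linked-list LRU written in the same spirit. It is a hypothetical illustration, not the BCL implementation; the names `TinyRegexLru` and `GetOrAdd` are invented. It only shows why every lookup is a linear scan performed while holding a lock.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration only; this is NOT the BCL implementation.
public sealed class TinyRegexLru
{
    private readonly LinkedList<KeyValuePair<string, Regex>> _entries =
        new LinkedList<KeyValuePair<string, Regex>>();
    private readonly object _gate = new object();
    private readonly int _capacity;

    public TinyRegexLru(int capacity) { _capacity = capacity; }

    public int Count { get { lock (_gate) { return _entries.Count; } } }

    public Regex GetOrAdd(string pattern)
    {
        lock (_gate) // every lookup serializes callers
        {
            // Linear scan: O(n) in the number of cached patterns.
            for (var node = _entries.First; node != null; node = node.Next)
            {
                if (node.Value.Key == pattern)
                {
                    // Hit: bump the entry to the front (most recently used).
                    _entries.Remove(node);
                    _entries.AddFirst(node);
                    return node.Value.Value;
                }
            }

            // Miss: construct, insert at the front, evict the oldest if full.
            var regex = new Regex(pattern);
            _entries.AddFirst(new KeyValuePair<string, Regex>(pattern, regex));
            if (_entries.Count > _capacity)
            {
                _entries.RemoveLast();
            }
            return regex;
        }
    }
}

public static class TinyRegexLruDemo
{
    public static void Main()
    {
        var cache = new TinyRegexLru(2);
        Regex a = cache.GetOrAdd("a+");
        cache.GetOrAdd("b+");
        bool reused = ReferenceEquals(a, cache.GetOrAdd("a+")); // hit, bumps "a+"
        cache.GetOrAdd("c+");                                    // evicts "b+"
        Console.WriteLine("reused: {0}, entries: {1}", reused, cache.Count);
    }
}
```

With a capacity in the thousands, the scan inside the lock is what every caller pays, whether the lookup ends in a hit or a miss.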

The number is set quite low to protect you from cases like this, though in some cases you may have no choice but to increase it.

My strong recommendation would be to never pass the RegexOptions.Compiled option to a static helper.

For example:

// WARNING: bad code
Regex.IsMatch("10000", @"\d+", RegexOptions.Compiled);

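The reason is that you run a serious risk of a miss on the LRU cache, which will trigger a very expensive compile. Additionally, you have no idea what the libraries you depend on are doing, so you have little ability to control or predict the best possible cache size.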
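
One way to follow this advice, sketched here with made-up class and member names, is to construct the compiled Regex once, keep it in a static field, and reuse that instance, so the compilation cost is paid a single time regardless of what happens to the static-helper cache.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberValidator
{
    // Compiled once, when the type is first used, then reused for every call.
    private static readonly Regex Digits =
        new Regex(@"^\d+$", RegexOptions.Compiled);

    public static bool IsNumber(string input)
    {
        return Digits.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsNumber("10000"));    // all digits
        Console.WriteLine(IsNumber("3873783z")); // trailing letter
    }
}
```

This fits the second recommendation above: the instance lives for the whole application life cycle, so a large number of matches can amortize the construction cost.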

See also: BCL team blog


Note: this is relevant for .NET 2.0 and .NET 4.0. There are some expected changes in 4.5 that may cause this to be revised.

Source: the Stack Overflow question ".net - How does RegexOptions.Compiled work?": https://stackoverflow.com/questions/513412/
