Ruby string.scan(/#{regexp_pattern}/) - execution time

Question:

Ruby's .scan with a regexp pattern takes up to 5 minutes. The time depends on the string being scanned.

The tests were run on Ruby "2.5.1" and Ruby "2.4.2".

Example:

def time_regexp_test(string)
    start = Time.now
    puts "parse start: #{start}"

    regexp_pattern = "[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?"

    email = string.scan(/#{regexp_pattern}/).flatten.last
    finish = Time.now
    puts "parse finish: #{finish}"
    puts "total #{(finish-start).to_s}"
    email
end

strings = [
'"Test test Real Estate - Test\'s International Real Estate" <test@test.com>',
'"Test test Real Estate - Christie\'s International Real Estate" <test@test.com>',
'"Test test Real Estate - Christie\'s International Real Estate"',
'"Test test Real Estate - Christie\'s International Real Estate" t@',
'"Test test Real Estate - testtesttest\'s International Real Estate" <test@test.com>',
'"testtesttest\'s" <test@test.com>',
'testtesttest\'s <test@test.com>'
]
strings.each_with_index do |string, n|
  puts "Test # #{n}"
  puts "Input: #{string}"
  time_regexp_test(string)
end

Results:

Test # 0
Input: "Test test Real Estate - Test's International Real Estate" <test@test.com>
parse start: 2018-04-19 17:43:26 +0200
parse finish: 2018-04-19 17:43:29 +0200
total 3.630606
Test # 1
Input: "Test test Real Estate - Christie's International Real Estate" <test@test.com>
parse start: 2018-04-19 17:43:29 +0200
parse finish: 2018-04-19 17:43:54 +0200
total 24.119056
Test # 2
Input: "Test test Real Estate - Christie's International Real Estate"
parse start: 2018-04-19 17:43:54 +0200
parse finish: 2018-04-19 17:43:54 +0200
total 0.000256
Test # 3
Input: "Test test Real Estate - Christie's International Real Estate" t@
parse start: 2018-04-19 17:43:54 +0200
parse finish: 2018-04-19 17:44:06 +0200
total 12.093272
Test # 4
Input: "Test test Real Estate - testtesttest's International Real Estate" <test@test.com>
parse start: 2018-04-19 17:44:06 +0200
parse finish: 2018-04-19 17:46:51 +0200
total 165.338206
Test # 5
Input: "testtesttest's" <test@test.com>
parse start: 2018-04-19 17:46:51 +0200
parse finish: 2018-04-19 17:46:51 +0200
total 0.000385
Test # 6
Input: testtesttest's <test@test.com>
parse start: 2018-04-19 17:46:51 +0200
parse finish: 2018-04-19 17:46:51 +0200
total 0.000369

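As we can see, parsing some strings takes a very long time (test #4). The time grows if we add an @ character with some part of an email address, and it also grows if we add characters to the word that contains the ' character.

Testing this regexp at https://regexr.com/3o721 - everything runs fast.

Where is the problem?

Update:

Trying to delete characters showed that removing the "-" character makes parsing much faster (165.338206 -> 0.578216).

But why?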

string = '"Test test Real Estate - testtesttest\'s International Real Estate" <test@test.com>'
time_regexp_test(string.delete("-"))
parse start: 2018-04-19 18:17:21 +0200
parse finish: 2018-04-19 18:17:22 +0200
total 0.578216

Best answer

You need to escape the dots properly.

Either

regexp_pattern = '[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?'
email = string.scan(/#{regexp_pattern}/).flatten.last

Or

regexp_pattern = /[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?/
email = string.scan(regexp_pattern).flatten.last

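Otherwise, Ruby parses the "\." in your double-quoted string as just a ., which in the Onigmo regex engine matches any character except a newline, and you trip over classic catastrophic backtracking. With the dots unescaped, . can match the same characters as the bracket expressions around it, so the nested quantifiers can split one run of text in exponentially many ways before the attempt fails. This is likely also why deleting the "-" helps: it leaves two consecutive spaces, which the unescaped pattern cannot cross, so the run that the engine backtracks over becomes much shorter.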
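The escaping difference can be checked directly. The following is a minimal sketch, not the asker's code: the short pattern a\.b and the variable names are made up for illustration, and only the quoting behaviour is the point.

```ruby
# In a double-quoted string, "\." is not a recognised escape sequence,
# so Ruby drops the backslash and keeps only the dot.
double_quoted = "a\.b"
# In a single-quoted string the backslash is kept as-is.
single_quoted = 'a\.b'

puts double_quoted.length   # 3 characters: a . b
puts single_quoted.length   # 4 characters: a \ . b

# Interpolating into a regexp literal, as in the question.
loose  = /#{double_quoted}/  # dot matches any character except newline
strict = /#{single_quoted}/  # dot matches only a literal "."

puts loose.source            # a.b
puts strict.source           # a\.b
puts loose.match?("aXb")     # true
puts strict.match?("aXb")    # false
puts strict.match?("a.b")    # true
```

Printing the regexp's source like this is a quick way to see what pattern the engine actually receives.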

If you want to reproduce in a regex tester the same behavior you see in your Ruby code, just remove the backslash before the dots in your pattern.
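To confirm that the properly escaped pattern no longer stalls, one could time it on the slowest input from the question (test #4). This is only a sketch: the constant EMAIL_PATTERN and the variable names are illustrative, and the measured time will vary by machine and Ruby version.

```ruby
require 'benchmark'

# The regexp-literal form of the pattern, with the dots escaped.
EMAIL_PATTERN = /[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?/

# Input of test #4, the slowest case in the question.
input = '"Test test Real Estate - testtesttest\'s International Real Estate" <test@test.com>'

email = nil
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  email = input.scan(EMAIL_PATTERN).flatten.last
end

puts "email: #{email}"
puts format('total %.6f', elapsed)
```

Benchmark.realtime replaces the manual Time.now subtraction from the question; either approach works for a rough measurement.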

For more on Ruby string.scan(/#{regexp_pattern}/) execution time, see the similar question on Stack Overflow: https://stackoverflow.com/questions/49925688/
