Question:
Ruby's .scan with a regular expression pattern takes up to 5 minutes. The time depends on the string being scanned.
The tests were run on Ruby 2.5.1 and Ruby 2.4.2.
Example:
def time_regexp_test(string)
  start = Time.now
  puts "parse start: #{start}"
  regexp_pattern = "[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?"
  email = string.scan(/#{regexp_pattern}/).flatten.last
  finish = Time.now
  puts "parse finish: #{finish}"
  puts "total #{(finish - start).to_s}"
  email
end
strings = [
  '"Test test Real Estate - Test\'s International Real Estate" <test@test.com>',
  '"Test test Real Estate - Christie\'s International Real Estate" <test@test.com>',
  '"Test test Real Estate - Christie\'s International Real Estate"',
  '"Test test Real Estate - Christie\'s International Real Estate" t@',
  '"Test test Real Estate - testtesttest\'s International Real Estate" <test@test.com>',
  '"testtesttest\'s" <test@test.com>',
  'testtesttest\'s <test@test.com>'
]
strings.each_with_index do |string, n|
  puts "Test # #{n}"
  puts "Input: #{string}"
  time_regexp_test(string)
end
Results:
Test # 0
Input: "Test test Real Estate - Test's International Real Estate" <test@test.com>
parse start: 2018-04-19 17:43:26 +0200
parse finish: 2018-04-19 17:43:29 +0200
total 3.630606
Test # 1
Input: "Test test Real Estate - Christie's International Real Estate" <test@test.com>
parse start: 2018-04-19 17:43:29 +0200
parse finish: 2018-04-19 17:43:54 +0200
total 24.119056
Test # 2
Input: "Test test Real Estate - Christie's International Real Estate"
parse start: 2018-04-19 17:43:54 +0200
parse finish: 2018-04-19 17:43:54 +0200
total 0.000256
Test # 3
Input: "Test test Real Estate - Christie's International Real Estate" t@
parse start: 2018-04-19 17:43:54 +0200
parse finish: 2018-04-19 17:44:06 +0200
total 12.093272
Test # 4
Input: "Test test Real Estate - testtesttest's International Real Estate" <test@test.com>
parse start: 2018-04-19 17:44:06 +0200
parse finish: 2018-04-19 17:46:51 +0200
total 165.338206
Test # 5
Input: "testtesttest's" <test@test.com>
parse start: 2018-04-19 17:46:51 +0200
parse finish: 2018-04-19 17:46:51 +0200
total 0.000385
Test # 6
Input: testtesttest's <test@test.com>
parse start: 2018-04-19 17:46:51 +0200
parse finish: 2018-04-19 17:46:51 +0200
total 0.000369
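As we can see, parsing some strings takes a very long time (test #4). The time increases if we add an @ character to the part containing the email address, and it also increases if we add characters to the word containing the ' character.
Testing this regexp at https://regexr.com/3o721 - everything runs fast.
Where is the problem?
Update:
Experimenting with deleting characters shows that removing the "-" character makes parsing much faster (165.338206 -> 0.578216).
But why?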
string = '"Test test Real Estate - testtesttest\'s International Real Estate" <test@test.com>'
time_regexp_test(string.delete("-"))
parse start: 2018-04-19 18:17:21 +0200
parse finish: 2018-04-19 18:17:22 +0200
total 0.578216
Best answer
You need to escape the dots properly.
Either
regexp_pattern = '[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?'
email = string.scan(/#{regexp_pattern}/).flatten.last
or
regexp_pattern = /[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?/
email = string.scan(regexp_pattern).flatten.last
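Otherwise, the "\." in your double-quoted string literal is parsed by Ruby as a mere . which, in the Onigmo regex engine, matches any character except line break characters, and you stumble into classic catastrophic backtracking.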
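The difference between the two kinds of string literal can be checked directly. The following is a minimal sketch; the variable names and the short a.b pattern are illustrative and not taken from the question.

```ruby
# In a double-quoted literal, "\." is an unknown escape, so Ruby keeps only the dot.
double_quoted = "a\.b"
# In a single-quoted literal the backslash survives.
single_quoted = 'a\.b'

puts double_quoted.length  # 3
puts single_quoted.length  # 4

loose  = /#{double_quoted}/  # behaves like /a.b/: the dot matches any character
strict = /#{single_quoted}/  # behaves like /a\.b/: only a literal dot matches

puts loose.match?("axb")   # true
puts strict.match?("axb")  # false
puts strict.match?("a.b")  # true
```

Printing the source of an interpolated regexp (for example `puts loose.source`) is a quick way to see what the engine actually received.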
If you want to reproduce in the regex tester the same behavior as in your Ruby code, just remove the backslash before each dot in your pattern.
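To check the fix in Ruby itself rather than in a tester, one option is to time the regex-literal variant on the slowest input from the question (test #4). This is only a sketch: the constant name EMAIL_PATTERN and the use of Benchmark.realtime are assumptions made here for illustration, and the measured time will vary by machine.

```ruby
require "benchmark"

# Pattern copied from the regex-literal variant in the answer;
# the constant name is hypothetical.
EMAIL_PATTERN = /[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#\$%&'*+\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?/

# Input of test #4, the slowest case reported in the question.
input = '"Test test Real Estate - testtesttest\'s International Real Estate" <test@test.com>'

email = nil
# Benchmark.realtime returns the elapsed wall-clock time in seconds as a Float.
elapsed = Benchmark.realtime { email = input.scan(EMAIL_PATTERN).last }

puts "email: #{email}"
puts "elapsed: #{elapsed.round(6)} s"
```

Because the pattern has no capture groups, scan returns a flat array of matched strings, so the flatten call from the original code is not needed here.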
Source: the question "Ruby string.scan(/#{regexp_pattern}/) - execution time" on Stack Overflow: https://stackoverflow.com/questions/49925688/