URL 的正则表达式如下。
(?i)\b((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))
过去 6 个月我在我的项目中使用了同样的方法,效果很好。但今天,对于以下输入之一,它挂起代码。
'$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'
为了找出原因,我分隔了每个字符,在其后面附加了该字符后面的部分输入,并对形成的子字符串应用了正则表达式。我得到了以下输出
Scanned text: $
Time taken: 9.05990600586e-06
Scanned text: $.
Time taken: 5.96046447754e-06
Scanned text: $.g
Time taken: 4.05311584473e-06
Scanned text: $.ge
Time taken: 4.05311584473e-06
Scanned text: $.get
Time taken: 3.81469726562e-06
Scanned text: $.getS
Time taken: 5.00679016113e-06
Scanned text: $.getSc
Time taken: 5.00679016113e-06
Scanned text: $.getScr
Time taken: 5.96046447754e-06
Scanned text: $.getScri
Time taken: 1.00135803223e-05
Scanned text: $.getScrip
Time taken: 1.09672546387e-05
Scanned text: $.getScript
Time taken: 1.21593475342e-05
Scanned text: $.getScript(
Time taken: 1.4066696167e-05
Scanned text: $.getScript("
Time taken: 1.50203704834e-05
Scanned text: $.getScript("h
Time taken: 1.59740447998e-05
Scanned text: $.getScript("ht
Time taken: 1.59740447998e-05
Scanned text: $.getScript("htt
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http:
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http:/
Time taken: 1.90734863281e-05
Scanned text: $.getScript("http://
Time taken: 1.97887420654e-05
Scanned text: $.getScript("http://w
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://ww
Time taken: 2.31266021729e-05
Scanned text: $.getScript("http://www
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.g
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.go
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.goo
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.goog
Time taken: 2.00271606445e-05
Scanned text: $.getScript("http://www.googl
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.google.
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.c
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.co
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://www.google.com
Time taken: 2.90870666504e-05
Scanned text: $.getScript("http://www.google.com/
Time taken: 1.8835067749e-05
Scanned text: $.getScript("http://www.google.com/r
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://www.google.com/re
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.com/rec
Time taken: 2.31266021729e-05
Scanned text: $.getScript("http://www.google.com/reca
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.com/recap
Time taken: 2.121925354e-05
Scanned text: $.getScript("http://www.google.com/recapt
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.com/recaptc
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.com/recaptch
Time taken: 2.40802764893e-05
Scanned text: $.getScript("http://www.google.com/recaptcha
Time taken: 1.71661376953e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/
Time taken: 1.31130218506e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/a
Time taken: 1.28746032715e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/ap
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api
Time taken: 1.31130218506e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/
Time taken: 1.28746032715e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/j
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/r
Time taken: 2.40802764893e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/re
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/rec
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/reca
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recap
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recapt
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptc
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptch
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_a
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aj
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aja
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.j
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js"
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",f
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fu
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fun
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",func
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",funct
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functi
Time taken: 1.90734863281e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functio
Time taken: 1.47819519043e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function()
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){R
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Re
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Rec
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Reca
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recap
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recapt
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptc
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptch
Time taken: 1.71661376953e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.c
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cr
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cre
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.crea
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.creat
Time taken: 1.71661376953e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(
Time taken: 1.81198120117e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(S
Time taken: 2.00271606445e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(St
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Sta
Time taken: 2.69412994385e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stac
Time taken: 3.50475311279e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stack
Time taken: 5.19752502441e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackE
Time taken: 8.48770141602e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackEx
Time taken: 0.000190019607544
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExc
Time taken: 0.00028395652771
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExch
Time taken: 0.000540971755981
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExcha
Time taken: 0.00104284286499
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchan
Time taken: 0.00213289260864
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchang
Time taken: 0.00400710105896
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange
Time taken: 0.00786900520325
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.
Time taken: 0.0186932086945
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.o
Time taken: 0.0350210666656
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.op
Time taken: 0.0638809204102
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opt
Time taken: 0.126477003098
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opti
Time taken: 0.257136106491
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.optio
Time taken: 0.5178399086
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.option
Time taken: 1.06005501747
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options
Time taken: 2.07899594307
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.
Time taken: 4.15297198296
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.s
Time taken: 8.27355504036
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.si
Time taken: 16.3851959705
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.sit
Time taken: 32.8552730083
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site
Time taken: 66.1672520638
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.
Time taken: 132.832139015
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.r
Time taken: 266.925510168
Sent a keyboard interrupt, to kill the process, as it was taking a lot of time.
可以观察到,附加每个字符后,运行 re.findall() 所需的时间增加了一倍。在整行上运行正则表达式将花费大量时间。所以实际上它并没有处于挂起状态,而是需要非常非常长的时间。
我不知道为什么 re.findall 花了那么多时间。避免这种情况的一种方法可能是限制 re.findall() 的结果运行特定的持续时间。我猜想可能有一种更简单的方法来避免这种情况。如果是的话请提出建议。
如果需要,代码如下。
import re
import time
url = re.compile('(?i)((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
buggy = '$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'
buggy_part = ''
for each_char in buggy:
buggy_part += each_char
start = time.time()
re.findall(url, buggy_part)
done = time.time()
elapsed = done - start
print 'Scanned text: ' + buggy_part
print 'Time taken: ' + str(elapsed) + '\n'
最佳答案
我发现你的表情难以理解,但很明显你正在遭受catastrophic backtracking的困扰;由于贪婪模式,性能呈指数下降。
事实上,用 +?
非贪婪量词替换 +
量词可以让你的测试脚本几乎立即运行:
>>> url = re.compile('(?i)((?:[a-z][\w-]+?://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+?[.][a-z]{2,4}/)(?:[^\s()<>]+?|\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\))+?(?:\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
[.. your script ..]
[('http://www', '', '', '', ''), ('.google.com/re', '', '', '', '')]
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"
Time taken: 0.000473022460938
这可能不是正确的匹配;也许其中一种模式现在还不够贪婪。足够。尝试将模式限制为仅一两个贪婪匹配,寻找更具体的模式和 anchor 以限制贪婪搜索的范围。您很可能不需要那么多贪婪量词来完成工作。
关于用于特定输入挂起的 Python 正则表达式,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21280953/