用于特定输入挂起的 Python 正则表达式

标签 python regex

URL 的正则表达式如下。

(?i)\b((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))

过去 6 个月我在我的项目中使用了同样的方法,效果很好。但今天,对于以下输入之一,它挂起代码。

'$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'

为了找出原因,我分隔了每个字符,在其后面附加了该字符后面的部分输入,并对形成的子字符串应用了正则表达式。我得到了以下输出

Scanned text: $
Time taken: 9.05990600586e-06

Scanned text: $.
Time taken: 5.96046447754e-06

Scanned text: $.g
Time taken: 4.05311584473e-06

Scanned text: $.ge
Time taken: 4.05311584473e-06

Scanned text: $.get
Time taken: 3.81469726562e-06

Scanned text: $.getS
Time taken: 5.00679016113e-06

Scanned text: $.getSc
Time taken: 5.00679016113e-06

Scanned text: $.getScr
Time taken: 5.96046447754e-06

Scanned text: $.getScri
Time taken: 1.00135803223e-05

Scanned text: $.getScrip
Time taken: 1.09672546387e-05

Scanned text: $.getScript
Time taken: 1.21593475342e-05

Scanned text: $.getScript(
Time taken: 1.4066696167e-05

Scanned text: $.getScript("
Time taken: 1.50203704834e-05

Scanned text: $.getScript("h
Time taken: 1.59740447998e-05

Scanned text: $.getScript("ht
Time taken: 1.59740447998e-05

Scanned text: $.getScript("htt
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http:
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http:/
Time taken: 1.90734863281e-05

Scanned text: $.getScript("http://
Time taken: 1.97887420654e-05

Scanned text: $.getScript("http://w
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://ww
Time taken: 2.31266021729e-05

Scanned text: $.getScript("http://www
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.g
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.go
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.goo
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.goog
Time taken: 2.00271606445e-05

Scanned text: $.getScript("http://www.googl
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.google.
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.c
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.co
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://www.google.com
Time taken: 2.90870666504e-05

Scanned text: $.getScript("http://www.google.com/
Time taken: 1.8835067749e-05

Scanned text: $.getScript("http://www.google.com/r
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://www.google.com/re
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.com/rec
Time taken: 2.31266021729e-05

Scanned text: $.getScript("http://www.google.com/reca
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.com/recap
Time taken: 2.121925354e-05

Scanned text: $.getScript("http://www.google.com/recapt
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.com/recaptc
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.com/recaptch
Time taken: 2.40802764893e-05

Scanned text: $.getScript("http://www.google.com/recaptcha
Time taken: 1.71661376953e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/
Time taken: 1.31130218506e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/a
Time taken: 1.28746032715e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/ap
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api
Time taken: 1.31130218506e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/
Time taken: 1.28746032715e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/j
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/r
Time taken: 2.40802764893e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/re
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/rec
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/reca
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recap
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recapt
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptc
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptch
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_a
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aj
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aja
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.j
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js"
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",f
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fu
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fun
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",func
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",funct
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functi
Time taken: 1.90734863281e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functio
Time taken: 1.47819519043e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function()
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){R
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Re
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Rec
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Reca
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recap
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recapt
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptc
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptch
Time taken: 1.71661376953e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.c
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cr
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cre
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.crea
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.creat
Time taken: 1.71661376953e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(
Time taken: 1.81198120117e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(S
Time taken: 2.00271606445e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(St
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Sta
Time taken: 2.69412994385e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stac
Time taken: 3.50475311279e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stack
Time taken: 5.19752502441e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackE
Time taken: 8.48770141602e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackEx
Time taken: 0.000190019607544

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExc
Time taken: 0.00028395652771

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExch
Time taken: 0.000540971755981

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExcha
Time taken: 0.00104284286499

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchan
Time taken: 0.00213289260864

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchang
Time taken: 0.00400710105896

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange
Time taken: 0.00786900520325

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.
Time taken: 0.0186932086945

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.o
Time taken: 0.0350210666656

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.op
Time taken: 0.0638809204102

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opt
Time taken: 0.126477003098

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opti
Time taken: 0.257136106491

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.optio
Time taken: 0.5178399086

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.option
Time taken: 1.06005501747

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options
Time taken: 2.07899594307

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.
Time taken: 4.15297198296

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.s
Time taken: 8.27355504036

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.si
Time taken: 16.3851959705

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.sit
Time taken: 32.8552730083

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site
Time taken: 66.1672520638

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.
Time taken: 132.832139015

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.r
Time taken: 266.925510168

Sent a keyboard interrupt, to kill the process, as it was taking a lot of time.

可以观察到,附加每个字符后,运行 re.findall() 所需的时间增加了一倍。在整行上运行正则表达式将花费大量时间。所以实际上它并没有处于挂起状态,而是需要非常非常长的时间。

我不知道为什么 re.findall 花了那么多时间。避免这种情况的一种方法可能是限制 re.findall() 的结果运行特定的持续时间。我猜想可能有一种更简单的方法来避免这种情况。如果是的话请提出建议。

如果需要,代码如下。

import re
import time

url = re.compile('(?i)((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
buggy = '$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'

buggy_part = ''
for each_char in buggy:
    buggy_part += each_char
    start = time.time()
    re.findall(url, buggy_part)
    done = time.time()
    elapsed = done - start
    print 'Scanned text: ' + buggy_part
    print 'Time taken: ' + str(elapsed) + '\n'

最佳答案

我发现你的表情难以理解,但很明显你正在遭受catastrophic backtracking的困扰;由于贪婪模式,性能呈指数下降。

事实上,用 +? 非贪婪量词替换 + 量词可以让你的测试脚本几乎立即运行:

>>> url = re.compile('(?i)((?:[a-z][\w-]+?://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+?[.][a-z]{2,4}/)(?:[^\s()<>]+?|\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\))+?(?:\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
[.. your script ..]
[('http://www', '', '', '', ''), ('.google.com/re', '', '', '', '')]
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"
Time taken: 0.000473022460938

这可能不是正确的匹配;也许其中一种模式现在还不够贪婪。足够。尝试将模式限制为仅一两个贪婪匹配,寻找更具体的模式和 anchor 以限制贪婪搜索的范围。您很可能不需要那么多贪婪量词来完成工作。

关于用于特定输入挂起的 Python 正则表达式,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21280953/

相关文章:

python - 两个数字并打印从第一个数字到第二个数字的列表

python - Pandas 等于神秘行为

python - makemessages 命令如何为 *.po 文件寻址项目文件夹外的模板

regex - 单词中的撇号不被识别为字符串替换

regex - 将最后一个八位字节与 IP 地址隔离并将其放入变量中

python - 通过切断循环来缩短元组列表?

正则表达式以字符串开头和/或结尾 - 如何简化?

regex - Linux 用包含双引号的变量替换

python - 使用 Regexp 过滤器的 Elasticsearch Bool 查询

python - 无法使用 python 3.5 安装 opencv 3.1,仅适用于 2.7