python - 在某些字符串上匹配正则表达式的 URL 非常慢

这是我在某些字符串中查找 URL 的正则表达式(我需要域的组，因为进一步的操作是基于域的)我注意到在这个例子中对于一些字符串 'fffffffff' 它非常慢，有一些明显的东西我不见了？

>>> URL_ALLOWED = r"[a-z0-9$-_.+!*'(),%]"
>>> URL_RE = re.compile(
...     r'(?:(?:https?|ftp):\/\/)?'  # protocol
...     r'(?:www.)?' # www
...     r'('  # host - start
...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'(?:'
...                 r'[a-z0-0-]*'  #  characters in the middle of domain
...                 r'[a-z0-9]' #  last character of domain('-' not allowed)
...             r')*'
...             r'\.'  # dot before next part of domain name
...         r')+'
...         r'[a-z]{2,10}'  # TLD
...         r'|'  # OR
...         r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'  # IP address
...     r')' # host - end
...     r'(?::[0-9]+)?'  # port
...     r'(?:\/%(allowed_chars)s+/?)*'  # path
...     r'(?:\?(?:%(allowed_chars)s+=%(allowed_chars)s+&)*'  # GET params
...     r'%(allowed_chars)s+=%(allowed_chars)s+)?'  # last GET param
...     r'(?:#[^\s]*)?' % {  # anchor
...         'allowed_chars': URL_ALLOWED
...     },
...     re.IGNORECASE
... )
>>> from time import time
>>> strings = [
...     'foo bar baz',
...     'blah blah blah blah blah blah',
...     'f' * 10,
...     'f' * 20,
...     'f' * 30,
...     'f' * 40,
... ]
>>> def t():
...     for string in strings:
...             t1 = time()
...             URL_RE.findall(string)
...             print string, time() - t1
... 
>>> t()
foo bar baz 3.91006469727e-05
blah blah blah blah blah blah 6.98566436768e-05
ffffffffff 0.000313997268677
ffffffffffffffffffff 0.183916091919
ffffffffffffffffffffffffffffff 178.445468903

是的，我知道还有另一种解决方案可以使用非常简单的正则表达式(例如包含点的单词)并稍后使用 urlparse 来获取域，但是当我们在 URL 中没有协议(protocol)时，urlparse 无法按预期工作:

>>> urlparse('example.com')
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='')
>>> urlparse('http://example.com')
ParseResult(scheme='http', netloc='example.com', path='', params='', query='', fragment='')
>>> urlparse('example.com/test/test')
ParseResult(scheme='', netloc='', path='example.com/test/test', params='', query='', fragment='')
>>> urlparse('http://example.com/test/test')
ParseResult(scheme='http', netloc='example.com', path='/test/test', params='', query='', fragment='')
>>> urlparse('example.com:1234/test/test')
ParseResult(scheme='example.com', netloc='', path='1234/test/test', params='', query='', fragment='')
>>> urlparse('http://example.com:1234/test/test')
ParseResult(scheme='http', netloc='example.com:1234', path='/test/test', params='', query='', fragment='')

是的，在 http://前面加上也是一个解决方案(我仍然不是 100% 确定是否没有其他 urlparse 问题)但我很好奇这个正则表达式有什么问题

最佳答案

我认为这是因为这部分

...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'(?:'
...                 r'[a-z0-0-]*'  #  characters in the middle of domain
...                 r'[a-z0-9]' #  last character of domain('-' not allowed)
...             r')*'
...             r'\.'  # dot before next part of domain name
...         r')+'

如果 set_of_symbols#1 和 set_of_symbols#2 具有相同的符号，则不应使用这样的结构 ([set_of_symbols#1]*[set_of_symbols#2])*。

请尝试使用以下代码:

...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'[a-z0-0-]*'  #  characters in the middle of domain
...             r'(?<=[a-z0-9])' #  last character of domain('-' not allowed)
...             r'\.'  # dot before next part of domain name
...         r')+'

它应该工作得更好。

关于python - 在某些字符串上匹配正则表达式的 URL 非常慢，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/12005945/

python - 在某些字符串上匹配正则表达式的 URL 非常慢

上一篇：python - 使用 pcapy 或 scapy 监控自生(HTTP)网络流量

下一篇：python - 覆盖 os.walk 以返回生成器对象作为第三项