python - Why is regex search slower with capturing groups in Python?

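I have application code that dynamically generates regexes from a config for some parsing. When timing the two variants, the one that captures each part of an OR (alternation) regex is noticeably slower than the plain regex. I assume the reason is the overhead of certain operations inside the regex module.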

>>> import timeit
>>> setup = '''
... import re
... '''   

#no capture group 
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.922958850861

#with capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.44321084023

#no capture group
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.913202047348

# capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.41544604301

Question: What causes the considerable drop in performance when using capture groups?

Best answer

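The reason is pretty simple: a capturing group tells the engine to keep the matched content in memory, while a non-capturing group tells it not to save anything. In other words, you are asking the engine to perform more operations.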
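To make "keeping the content in memory" concrete, here is a minimal sketch that uses the sample string from the question. It only shows what each match object carries; it does not measure anything. The variable names are illustrative.

```python
import re

text = 'some say hello,some say bye, or ola or cheers!'

# Same alternation, with and without a capture group around each word.
capturing = re.compile(r'(hello)|(bye)|(ola)|(cheers)')
plain = re.compile(r'hello|bye|ola|cheers')

m_cap = capturing.search(text)
m_plain = plain.search(text)

# Both find the same overall match ...
print(m_cap.group(0), m_plain.group(0))

# ... but the capturing pattern also tracks one slot per group,
# including the groups that did not take part in the match.
print(capturing.groups, m_cap.groups(), m_cap.span(1))
print(plain.groups, m_plain.groups())
```

The capturing pattern reports four groups and a tuple with one entry per group, while the plain pattern reports zero groups and an empty tuple. That per-group bookkeeping is the extra work the answer refers to.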

For instance, a regex such as (hello|bye|ola|cheers) or (hello)|(bye)|(ola)|(cheers) costs considerably more than an atomic group or a non-capturing one like (?:hello|bye|ola|cheers).

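When you write a regex, you know whether you need to capture content, as in the case above. If you want to capture any of those words, you will lose some performance; if you don't need the captured content, you can save that cost by using non-capturing groups instead.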
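Since the question builds its regexes from a config, a rough way to check this on your own data is to generate each variant from the same word list and time them side by side. The sketch below assumes a hypothetical WORDS list in place of the real config, and the helper name time_patterns is made up for illustration. Patterns are precompiled so that the timing covers matching only. Atomic groups are left out because Python's re module supports them only from version 3.11.

```python
import re
import timeit

# Hypothetical word list standing in for values read from a config.
WORDS = ['hello', 'bye', 'ola', 'cheers']
TEXT = 'some say hello,some say bye, or ola or cheers!'

# Escape each word so config values are matched literally.
escaped = [re.escape(w) for w in WORDS]

# Precompile so the timing covers matching only, not pattern-cache lookups.
PATTERNS = {
    'no group': re.compile('|'.join(escaped)),
    'non-capturing group': re.compile('(?:' + '|'.join(escaped) + ')'),
    'one group per word': re.compile('|'.join('(' + w + ')' for w in escaped)),
}


def time_patterns(number=100000):
    """Return the seconds taken by `number` searches for each pattern."""
    return {
        label: timeit.timeit(lambda p=pattern: p.search(TEXT), number=number)
        for label, pattern in PATTERNS.items()
    }


if __name__ == '__main__':
    for label, seconds in time_patterns().items():
        print('%-22s %.4f s' % (label, seconds))
```

Absolute numbers depend on the Python version and the machine, so treat the output as a relative comparison between the variants rather than a fixed result.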

I know you tagged python, but I have prepared an online benchmark for javascript to show how capturing and non-capturing groups affect the JS regex engine.

https://jsperf.com/capturing-groups-vs-non-capturing-groups

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/41444807/
