python - Working around the negative lookbehind limitation in Python regular expressions

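I'm writing a regular expression to identify problems mentioned in text, but only when there is no negation within n=3 words before the phrase I'm interested in. Here is what I have so far: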

regex = r'''(?ix)     # case insensitive, verbose mode
\s+?
(?<!(not|no|never)){1,3}  # if this is within 3 words, you do not match, negative lookbehind
\s+?
(a|the|any|my|your)  # articles
\s+?
(issue|issues|problem|problems) # words of interest
'''

Should match:

matches = [
"a problem",
"the issue",
"any of the issues",
"not even close to being your issue",
]

Should not match:

non_matches = [
  "not a problem",
  "never your problem",
  "not the issue",
  "not overwhelmingly your issue",
  "not too close your issue"
]

If I run it without the negative lookbehind:

regex2 = r'''(?ix)  # case insensitive, verbose
(a|the|any|my|your)    # articles
\s+?
(issue|issues|problem|problems) # words of interest
'''

I get the correct positive matches.

>>> import re
>>> for p in matches:
...   print(re.findall(regex2, p))
[('a', 'problem')]
[('the', 'issue')]
[('the', 'issue')]
[('your', 'issue')]

However, if I include the negative lookbehind I need in order to exclude the negated cases, I get:

re.error: look-behind requires fixed-width pattern
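
The limitation can be reproduced in isolation. The following is a minimal sketch, not part of the original question: `compiles` is a hypothetical helper, and the three patterns are illustrative. It shows that the standard `re` module rejects a lookbehind whose alternatives have different lengths, while a chain of separate fixed-width lookbehinds is accepted.

```python
import re


def compiles(pattern):
    """Return True if `re` can compile the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Alternatives of width 3, 2 and 5 inside one lookbehind: rejected.
variable_alternation = r'(?<!not|no|never)\s'

# One fixed-width lookbehind per negation word: accepted.
chained_fixed_width = r'(?<!not)(?<!no)(?<!never)\s'

# A quantified group inside a lookbehind has no fixed width: rejected.
quantified_lookbehind = r'(?<!(?:\w+\s+){1,3})the'

for name, pattern in [
    ("variable_alternation", variable_alternation),
    ("chained_fixed_width", chained_fixed_width),
    ("quantified_lookbehind", quantified_lookbehind),
]:
    print(name, compiles(pattern))
```

Chaining fixed-width lookbehinds only covers a negation word immediately before the current position, so on its own it does not handle the "within 3 words" requirement.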

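I understand this is just a limitation of Python's regex engine, but what is the usual workaround in a case like this? Is there an easy way to combine patterns for 0, 1, 2, and 3 words to handle it? Or something else?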

Best answer

Since Python's re module does not support variable-length lookbehind, you can use this workaround:

regex = r'''(?ixm)
^
(?!.*
   \b(?:not?|never)\s+
   (?:\w+\s+){0,2}
   (?:a|the|any|my|your)\s+
   (?:issues?|problems?)
)
.*\b(a|the|any|my|your)
\s+
(issues?|problems?)
'''

Here, a negative lookahead at the start of the regex makes the match fail if the disallowed pattern occurs anywhere in the input:

(?!.*
   \b(?:not?|never)\s+
   (?:\w+\s+){0,2}
   (?:a|the|any|my|your)\s+
   (?:issues?|problems?)
)

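This makes the match fail when no, not, or never appears 1 to 3 words before the article (that is, with at most two words in between), followed by one of your words of interest.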
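
To check the workaround against the examples from the question, a small harness like the one below can be used. It is a sketch: the pattern and the two lists are copied from this page, while `has_unnegated_issue` is a hypothetical helper name introduced only for illustration.

```python
import re

# The answer's pattern, unchanged.
regex = r'''(?ixm)
^
(?!.*
   \b(?:not?|never)\s+
   (?:\w+\s+){0,2}
   (?:a|the|any|my|your)\s+
   (?:issues?|problems?)
)
.*\b(a|the|any|my|your)
\s+
(issues?|problems?)
'''

matches = [
    "a problem",
    "the issue",
    "any of the issues",
    "not even close to being your issue",
]

non_matches = [
    "not a problem",
    "never your problem",
    "not the issue",
    "not overwhelmingly your issue",
    "not too close your issue",
]

pattern = re.compile(regex)


def has_unnegated_issue(text):
    """Hypothetical helper: True if the pattern matches somewhere in text."""
    return pattern.search(text) is not None


for text in matches + non_matches:
    print(f"{text!r}: {has_unnegated_issue(text)}")
```

One consequence of anchoring the lookahead at the start of the line is that a single negated phrase rejects the whole line, even when an un-negated phrase also appears in it. For example, "a problem, but not my problem" produces no match. If each occurrence needs to be judged separately, a different approach would be required.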

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/68135405/
