Python: Why does a regex compiled with re.IGNORECASE drop the first characters?

I have the following regular expression in Python to parse listings in a text:

re.compile('(.*,?) and (.*)')

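Everything works as expected, except that when the regex is used with the re.IGNORECASE flag, the first two characters are not returned.

Output of group 1 without the IGNORECASE flag: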

the 1689 London Baptist Confession of Faith , 1742 Philadelphia Baptist Confession , the 1833 New Hampshire Baptist Confession of Faith , the Southern Baptist Convention 's Baptist Faith and Message ,

Output of group 1 with the IGNORECASE flag:

e 1689 London Baptist Confession of Faith , 1742 Philadelphia Baptist Confession , the 1833 New Hampshire Baptist Confession of Faith , the Southern Baptist Convention 's Baptist Faith and Message ,

The documentation says this about the flag:

Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale. To get this effect on non-ASCII Unicode characters such as ü and Ü, add the UNICODE flag.

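So there is no hint of this behaviour there. Is there anything that explains it, or am I missing something obvious here?

Edit: As requested in the comments, a complete code example (Python 3.6.5):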

import re

listing_re = re.compile('(.*,?) and (.*)')
def parse_listing(txt):
    listing_search = listing_re.search(txt, re.IGNORECASE)
    if listing_search:
        separated_by_comma = listing_search.group(1)    # listing of expressions, separated by ','
        parts = separated_by_comma.split(',')           # split string at ','
        parts.append(listing_search.group(2))           # append the single expression after 'and'
        return [x.strip() for x in parts if x.strip()]  # return list of stripped expressions
    return list()

print(parse_listing("the 1689 London Baptist Confession of Faith , 1742 Philadelphia Baptist Confession , the 1833 New Hampshire Baptist Confession of Faith , the Southern Baptist Convention 's Baptist Faith and Message , and written church covenants"))

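The code above returns the wrong result; removing the re.IGNORECASE flag returns the correct result.

Best answer

The problem is that you are passing the re.IGNORECASE flag in the wrong place. Since listing_re is a compiled regular expression, listing_re.search has the following signature (docs):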

Pattern.search(string[, pos[, endpos]])

[...]

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

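As you can see, you have passed re.IGNORECASE as the value of the pos parameter. Because re.IGNORECASE is an integer flag whose value happens to be 2, the search starts at index 2, so the first 2 characters are skipped.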

>>> re.IGNORECASE
<RegexFlag.IGNORECASE: 2>
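
A minimal sketch of the effect, using a short made-up input string rather than the text from the question: passing the flag positionally to the compiled pattern's search gives the same result as passing pos=2 directly.

```python
import re

pattern = re.compile('(.*,?) and (.*)')
text = "the first , the second , and the third"  # illustrative input, not from the question

# re.IGNORECASE is an integer flag, so here it is interpreted as pos=2
with_flag = pattern.search(text, re.IGNORECASE)
with_pos = pattern.search(text, 2)

print(int(re.IGNORECASE))   # 2
print(with_flag.group(1))   # e first , the second ,
print(with_pos.group(1))    # e first , the second ,
```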

The correct usage is to pass the flag to re.compile:

listing_re = re.compile('(.*,?) and (.*)', re.IGNORECASE)
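
For completeness, here is a sketch of the function from the question with the flag moved to re.compile. The sample input is made up for illustration; it uses an upper-case "AND" to show that case-insensitive matching now takes effect and that nothing is cut off at the start.

```python
import re

# The flag belongs to re.compile; search() then takes only the string
listing_re = re.compile('(.*,?) and (.*)', re.IGNORECASE)

def parse_listing(txt):
    match = listing_re.search(txt)
    if not match:
        return []
    parts = match.group(1).split(',')   # items before the final "and"
    parts.append(match.group(2))        # the single item after "and"
    return [p.strip() for p in parts if p.strip()]

print(parse_listing("apples , pears , AND plums"))
# ['apples', 'pears', 'plums']
```

If you do not want to precompile the pattern, the module-level function re.search(pattern, string, flags) accepts flags as its third argument, which is probably where the confusion comes from.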

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/52026997/
