python - Capturing possessives and prefixes with a Python regular expression

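I'm trying to write a regular expression for Python that captures the various forms of "archipelago" that appear in a corpus.

Here is a test string:

This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'

I'd like to capture the following from the string: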

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

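Attempt 1

Using the regex (archipelag.*?)\b and testing it with Pythex, I capture at least part of all six forms. But there are problems:

archipelago's is captured only as archipelago. I want the possessive.
meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
protoarchipelagic is captured only as archipelagic. I want to be able to capture non-hyphenated prefixes.

Attempt 2

If I try the regex (archipelag.*?)\s (see Pythex), the possessive archipelago's is now captured, but the comma after the first instance is captured as well (i.e., archipelagos,). It also fails to capture the final 'archipelagoes.'.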
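The two attempts can be reproduced outside Pythex with a short script. This is only a sketch: the variable names attempt_1 and attempt_2 are made up for illustration, and the corpus is the test string from the question with no trailing newline, which is why the second pattern finds nothing for the last word.

```python
import re

# The test string from the question (no trailing whitespace or newline).
corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

# Attempt 1: the lazy match stops at the first word boundary, so the
# possessive 's and any prefix in front of "archipelag" are left out.
attempt_1 = re.findall(r"(archipelag.*?)\b", corpus)

# Attempt 2: the lazy match runs up to the next whitespace character, so
# trailing punctuation is included and the last word (followed by no
# whitespace) is not matched at all.
attempt_2 = re.findall(r"(archipelag.*?)\s", corpus)

print(attempt_1)
# ['archipelagos', 'archipelagic', 'archipelago', 'archipelagic', 'archipelagic', 'archipelagoes']
print(attempt_2)
# ['archipelagos,', 'archipelagic', "archipelago's", 'archipelagic', 'archipelagic']
```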

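Best answer

The regex ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. If you have additional requirements, you may want to modify it further.

Note the use of non-capturing groups, (?:...), to group parts of the expression so that ? can make a whole group optional (match it zero or one time).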
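Why the groups are non-capturing matters with re.findall, which returns the groups rather than the whole match as soon as a pattern contains capturing groups. The comparison below is a small sketch on a shortened, made-up sample string, with the word-boundary assertions left out to keep it short:

```python
import re

sample = "meta-archipelagic and archipelago's"  # shortened sample, not the full corpus

# With capturing groups, findall returns one tuple of groups per match,
# using '' for an optional group that did not take part in the match.
with_capturing = re.findall(r"(\w+-)?\w*archipelag\w*('s)?", sample)

# With non-capturing groups, findall returns the full matched text.
with_non_capturing = re.findall(r"(?:\w+-)?\w*archipelag\w*(?:'s)?", sample)

print(with_capturing)      # [('meta-', ''), ('', "'s")]
print(with_non_capturing)  # ['meta-archipelagic', "archipelago's"]
```

In the answer's pattern the only capturing group is the outer pair of parentheses, so findall returns that single group, which is the whole match.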

import re

# Optional hyphenated prefix, then any word containing "archipelag", then an optional 's
pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

for match in pat.findall(corpus):
    print(match)

This prints:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Here it is on regex101
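For readers who want to see what each piece of the pattern contributes, the same expression can be spelled out with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is just one possible way to lay it out; the names verbose_pat and matches are illustrative, and the redundant outer capturing group is dropped since findall returns the whole match when there are no groups.

```python
import re

verbose_pat = re.compile(r"""
    (?:\b\w+\b-)?          # optional hyphenated prefix, e.g. "meta-"
    \b\w*archipelag\w*\b   # a word containing "archipelag", with any
                           # unhyphenated prefix ("proto") or suffix ("ic", "oes")
    (?:'s)?                # optional possessive ending
""", re.VERBOSE)

corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

matches = verbose_pat.findall(corpus)
for match in matches:
    print(match)
```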

A similar question about capturing possessives and prefixes with a Python regular expression can be found on Stack Overflow: https://stackoverflow.com/questions/49439828/
