python - Using RegEx for phrase patterns in EntityRuler

I am trying to find FRT entities with an EntityRuler like this:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

Then I got this result:

[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('is', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]

Could you tell me how to fix my code so that I get this result instead?

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

Thanks in advance.

Best answer

You need to replace the patterns declaration with this one to fix the code:

patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

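There are two things: 1) the REGEX operator does not work on its own; it has to be nested under a top-level token attribute such as TEXT or LOWER, and 2) the regex you used is broken, because you used a character class instead of a grouping construct.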
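The second point can be checked with Python's standard re module alone. The sketch below is only an illustration and does not use spaCy; the word list and variable names are made up for the demonstration. It shows that [e|es] matches a single character out of e, |, and s, while a group (or simply an optional s) matches "e" or "es". It also shows that the original string cannot be compiled as a regex at all because of the stray closing parenthesis.

```python
import re

# Character class: [e|es] matches exactly ONE character from {'e', '|', 's'}.
char_class = re.compile(r"[Aa]ppl[e|es]")
# Grouping with alternation: (?:e|es) matches "e" or "es".
grouped = re.compile(r"[Aa]ppl(?:e|es)")
# Shorter equivalent used in the answer: an optional trailing "s".
optional_s = re.compile(r"[Aa]pples?")

# Illustrative inputs (not from the question).
words = ["Apple", "apples", "Appl|", "Appls"]

class_hits = [w for w in words if char_class.fullmatch(w)]
group_hits = [w for w in words if grouped.fullmatch(w)]
optional_hits = [w for w in words if optional_s.fullmatch(w)]

print(class_hits)     # ['Apple', 'Appl|', 'Appls']
print(group_hits)     # ['Apple', 'apples']
print(optional_hits)  # ['Apple', 'apples']

# The regex string from the question has an unbalanced ")" and
# is rejected by the re module.
try:
    re.compile("[Aa]ppl[e|es])")
    original_compiles = True
except re.error:
    original_compiles = False

print(original_compiles)  # False
```

Since the question's code ran without a regex error, the bare top-level REGEX key was presumably never compiled; that would be consistent with every token being labeled FRT in the output shown in the question.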

Also, compare these scenarios:

[{"TEXT": {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but not APPLES
[{"LOWER": {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc., and also stapples (a misspelling of staples)
[{"TEXT": {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but not APPLES or stapples, because \b is a word boundary.
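The three scenarios can be approximated without spaCy. The sketch below assumes that the REGEX operator behaves like re.search applied to the chosen token attribute (the verbatim text for TEXT, the lowercased text for LOWER); that is an assumption about the matcher's behavior, and the helper regex_hits is hypothetical, not part of spaCy.

```python
import re

def regex_hits(attr, pattern, tokens):
    """Hypothetical stand-in for a single-token REGEX pattern.

    Assumes search semantics: the regex may match anywhere inside the
    attribute value, not only the whole token.
    """
    hits = []
    for tok in tokens:
        # LOWER compares against the lowercased token, TEXT against the raw text.
        value = tok.lower() if attr == "LOWER" else tok
        if re.search(pattern, value):
            hits.append(tok)
    return hits

# Illustrative tokens (one string per token).
tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples"]

text_hits = regex_hits("TEXT", r"[Aa]pples?", tokens)
lower_hits = regex_hits("LOWER", r"apples?", tokens)
bounded_hits = regex_hits("TEXT", r"\b[Aa]pples?\b", tokens)

print(text_hits)     # ['Apple', 'apples', 'stapples']
print(lower_hits)    # ['Apple', 'apples', 'APPLES', 'aPPleS', 'stapples']
print(bounded_hits)  # ['Apple', 'apples']
```

Under the search assumption, the unanchored TEXT pattern would also hit stapples, not just the LOWER one; the \b anchors (or ^ and $) are what rule out partial matches inside a longer token.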

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/57667710/
