python - spaCy - Adding an extension function to the pipeline causes a stack overflow

Tags: python spacy

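I'm trying to add a Matcher rule-based function to my spaCy pipeline. However, adding it to the pipeline causes a StackOverflow error. It may well be user error. Any suggestions or ideas would be greatly appreciated.

Running the function without adding it to the pipeline works fine.

Code example: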

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def extend_matcher_entities(doc):
    matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
    matcher.add("TIME", None, nlp("0305Z"), nlp("1315z"),nlp("0830Z"),nlp("0422z"))

    new_ents = []
    for match_id, start, end in matcher(doc):
        new_ent = Span(doc, start, end, label=nlp.vocab.strings[match_id])
        new_ents.append(new_ent)
    
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe(extend_matcher_entities, after='ner')

doc = nlp("At 0560z, I walked over to my car and got in to go to the grocery store.")

# extend_matcher_entities(doc)
print([(ent.text, ent.label_) for ent in doc.ents])

This example, taken from the spaCy code examples, runs fine:

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                print(new_ent)
                new_ents.append(new_ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    print(new_ents)
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe(expand_person_entities, after='ner')

doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

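What am I missing?

Best answer

The offending line, which causes the circular reference, is the one below. It calls nlp(...) inside the component, and because the component is itself part of the nlp pipeline, every such call runs the component again: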

matcher.add("TIME", None, nlp("0305Z"), nlp("1315z"),nlp("0830Z"),nlp("0422z"))
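To make the recursion concrete, here is a minimal sketch that does not use spaCy at all. `TinyPipeline` is a hypothetical stand-in written for this illustration (it is not spaCy's API); it only assumes that calling the pipeline runs every registered component in order, which is the behaviour that matters here.

```python
# Hypothetical toy pipeline, for illustration only (not spaCy's implementation).
class TinyPipeline:
    def __init__(self):
        self.components = []

    def add_pipe(self, component):
        self.components.append(component)

    def __call__(self, text):
        doc = {"text": text}
        # Calling the pipeline runs every registered component in order.
        for component in self.components:
            doc = component(doc)
        return doc


nlp = TinyPipeline()


def bad_component(doc):
    # Calling nlp(...) here re-enters the pipeline, which runs
    # bad_component again, which calls nlp(...) again, and so on.
    nlp("0305Z")
    return doc


nlp.add_pipe(bad_component)

try:
    nlp("At 0560z, I walked over to my car.")
    recursed = False
except RecursionError:
    recursed = True

# Same idea with the pattern built once, before the component is registered.
fixed = TinyPipeline()
pattern = fixed("0305Z")


def good_component(doc):
    # Only reads the prebuilt pattern; never calls the pipeline.
    doc["pattern"] = pattern["text"]
    return doc


fixed.add_pipe(good_component)
result = fixed("At 0560z, I walked over to my car.")

print(recursed, result["pattern"])
```

The first pipeline never returns normally because each call to the component triggers another full pipeline run; the second one returns because the pattern is created outside the component.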

Move the nlp(...) calls out of the function definition and it works:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
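# Build the pattern Docs once, before the component is added to the pipeline,
# so the component itself never calls nlp(...)
pattern = [nlp(t) for t in ("0305Z","1315z","0830Z","0422z")]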


def extend_matcher_entities(doc):
    matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
    matcher.add("TIME", None, *pattern)

    new_ents = []
    for match_id, start, end in matcher(doc):
        new_ent = Span(doc, start, end, label=nlp.vocab.strings[match_id])
        new_ents.append(new_ent)
    
    doc.ents = new_ents
    # doc.ents = list(doc.ents) + new_ents
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe(extend_matcher_entities, after='ner')

doc = nlp("At 0560z, I walked over to my car and got in to go to the grocery store.")

# extend_matcher_entities(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
[('0560z', 'TIME')]

Also note that doc.ents = new_ents overwrites any entities that were extracted earlier in the pipeline.
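If the earlier entities should be kept, one option is to combine both lists before assigning them. The sketch below assumes spaCy is installed and builds a Doc by hand, with a PERSON span standing in for the recognizer's output (both are assumptions made for illustration). It uses `spacy.util.filter_spans` to drop overlapping spans, since assigning overlapping spans to doc.ents raises an error.

```python
from spacy.tokens import Doc, Span
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Hand-built Doc so that no trained model is needed for this illustration.
doc = Doc(Vocab(), words=["Alex", "Smith", "left", "at", "0560z"])

# Stand-in for an entity that an earlier component (e.g. the NER) produced.
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# Stand-in for the spans the matcher-based component would add.
new_ents = [Span(doc, 4, 5, label="TIME")]

# Keep the existing entities and add the new ones; filter_spans removes
# overlaps, preferring the longest span.
doc.ents = filter_spans(list(doc.ents) + new_ents)

print([(ent.text, ent.label_) for ent in doc.ents])
```

Inside the component from the answer, the same idea would replace the plain `doc.ents = new_ents` assignment.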

Source: a similar question on Stack Overflow, "python - spaCy - Adding an extension function to the pipeline causes a stack overflow": https://stackoverflow.com/questions/65039857/
