python - Problems extracting NER subject + verb with spaCy and the Matcher

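I am working on an NLP project in which I have to use spaCy and the spaCy Matcher to extract every named entity that is an nsubj (subject), together with the verb it relates to: the governing verb of my NE nsubj. Examples: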

Georges and his friends live in Mexico City
"Hello !", says Mary

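I need to extract "Georges" and "live" from the first sentence and "Mary" and "says" from the second, but I don't know how many words will separate my named entity from the verb it relates to. So I decided to explore the spaCy Matcher further, and I am trying to write a Matcher pattern that extracts my two words. When the NE subject comes before the verb I get good results, but I don't know how to write a pattern that matches an NE subject that comes after the word it relates to. According to the guidelines I could also do this task with "regular spaCy", but I don't know how. The problem with the Matcher is that I cannot control the type of dependency between the NE and the VERB, so I cannot reliably pick out the right VERB. I am new to spaCy; I have always used NLTK or Jieba (for Chinese). I don't even know how to split a text into sentences with spaCy, but I chose to split the whole text into sentences to avoid bad matches spanning two sentences. Here is my code: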

import spacy
from nltk import sent_tokenize
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_md')

matcher = Matcher(nlp.vocab)

def get_entities_verbs():

    try:

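        # subject before verb: a PER entity that is an nsubj, then any number of
        # tokens that are neither verbs nor subjects, then the verb
        pattern_subj_verb = [
            {'ENT_TYPE': 'PER', 'DEP': 'nsubj'},
            {"POS": {'NOT_IN': ['VERB']}, "DEP": {'NOT_IN': ['nsubj']}, 'OP': '*'},
            {'POS': 'VERB'},
        ]
        # subject after verb
        # this pattern is not good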

        matcher.add('ent-verb', [pattern_subj_verb])

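        # the text is French, so tell NLTK to use its French sentence model
        for sent in sent_tokenize(open('Le_Ventre_de_Paris-short.txt', encoding='utf-8').read(), language='french'):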
            sent = nlp(sent)
            matches = matcher(sent)
            for match_id, start, end in matches:
                span = sent[start:end]
                print(span)

    except Exception as error:
        print(error)


def main():

    get_entities_verbs()

if __name__ == '__main__':
    main()

Even though the text is in French, I can assure you that I get good results:

Florent regardait
Lacaille reparut
Florent baissait
Claude regardait
Florent resta
Florent, soulagé
Claude s’était arrêté
Claude en riait
Saget est matinale, dit
Florent allait
Murillo peignait
Florent accablé
Claude entra
Claude l’appelait
Florent regardait
Florent but son verre de punch ; il le sentit
Alexandre, dit
Florent levait
Claude était ravi
Claude et Florent revinrent
Claude, les mains dans les poches, sifflant

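I get some wrong results, but about 90% are good. I only need to capture the first and last word of each line to get my NE/verb pairs. So my question is: how do I extract an NE with the Matcher when it is the subject of the verb it relates to, or simply how do I do this with spaCy (without the Matcher)? There are many factors to consider. Even if 100% is not achievable, do you have a way to get the best possible results? I need a pattern that matches the governing VERB + NER subject, along the lines of this pattern: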

pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

All credit for this pattern goes to polm23.

Best answer

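This is a perfect use case for the DependencyMatcher. It also makes things easier if you merge entities into single tokens before running it. This code should do what you need: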

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

# merge entities to simplify this
nlp.add_pipe("merge_entities")


pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

texts = [
        "John Smith and some other guy live there",
        '"Hello!", says Mary.',
        ]

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)

    for match in matches:
        # token_ids holds one token index per pattern node, in pattern order,
        # so the nsubj ("person") comes first and its verb second
        match_id, (person_i, verb_i) = match
        print(doc[person_i], "::", doc[verb_i])
    print()
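
To see why the dependency pattern does not care about word order, here is a minimal sketch that runs the same pattern on two hand-annotated Doc objects, one with the subject before the verb and one with it after. It assumes spaCy v3. The heads, dependency labels, POS tags and entity tags are written by hand for illustration (a trained pipeline may label real text differently), and subject_verb_pairs is a hypothetical helper name, not part of spaCy.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# A blank pipeline is enough here: only its vocab is used, no trained model.
nlp = spacy.blank("en")

pattern = [
    # anchor node: a PERSON entity token that is a nominal subject
    {"RIGHT_ID": "person", "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"}},
    # "person < verb": person is an immediate dependent of a VERB token
    {"LEFT_ID": "person", "REL_OP": "<", "RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])


def subject_verb_pairs(doc):
    """Hypothetical helper: return (subject, verb) text pairs found in doc."""
    pairs = []
    for match_id, token_ids in matcher(doc):
        person_i, verb_i = token_ids  # same order as the pattern nodes
        pairs.append((doc[person_i].text, doc[verb_i].text))
    return pairs


# Hand-annotated parse, subject AFTER the verb ("Hello, says Mary").
inverted = Doc(
    nlp.vocab,
    words=["Hello", ",", "says", "Mary"],
    heads=[2, 2, 2, 2],  # absolute index of each token's head
    deps=["ccomp", "punct", "ROOT", "nsubj"],
    pos=["INTJ", "PUNCT", "VERB", "PROPN"],
    ents=["O", "O", "O", "B-PERSON"],
)

# Hand-annotated parse, subject BEFORE the verb ("Mary says hello").
direct = Doc(
    nlp.vocab,
    words=["Mary", "says", "hello"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
    pos=["PROPN", "VERB", "INTJ"],
    ents=["B-PERSON", "O", "O"],
)

print(subject_verb_pairs(inverted))
print(subject_verb_pairs(direct))
```

Both calls should yield the pair Mary / says, because the pattern only constrains the dependency arc between the two tokens and says nothing about their positions in the sentence.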

See the docs for the DependencyMatcher.
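
Since the question also asks how to do this with "regular spaCy", the following is a rough sketch of the same idea without any matcher: walk over the tokens and follow each subject's head attribute to its governing verb. It assumes spaCy v3; person_subject_verbs is a hypothetical helper name, and the Doc is annotated by hand so the example runs without a trained model. With a French model such as fr_core_news_md, the label would be "PER", as in the question's pattern.

```python
import spacy
from spacy.tokens import Doc


def person_subject_verbs(doc, label="PERSON"):
    """Hypothetical helper: (subject, verb) pairs for entity subjects of verbs."""
    pairs = []
    for token in doc:
        is_entity_subject = token.dep_ == "nsubj" and token.ent_type_ == label
        # token.head is the governor of the token in the dependency tree
        if is_entity_subject and token.head.pos_ == "VERB":
            pairs.append((token.text, token.head.text))
    return pairs


nlp = spacy.blank("en")

# Hand-annotated parse of "Hello, says Mary" (illustrative labels only).
doc = Doc(
    nlp.vocab,
    words=["Hello", ",", "says", "Mary"],
    heads=[2, 2, 2, 2],
    deps=["ccomp", "punct", "ROOT", "nsubj"],
    pos=["INTJ", "PUNCT", "VERB", "PROPN"],
    ents=["O", "O", "O", "B-PERSON"],
)

print(person_subject_verbs(doc))  # [('Mary', 'says')]
```

With a real pipeline you would pass nlp(text) instead of the hand-built Doc. For multi-token names, only the token carrying the nsubj label is returned unless entities are merged first, which is one reason the merge_entities component is useful here.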

This question, "python - Problems extracting NER subject + verb with spaCy and the Matcher", comes from Stack Overflow: https://stackoverflow.com/questions/67259823/
