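python - How to extract sentences containing key phrases in spaCy

Tags: python nlp spacy

I have been using spaCy and find it very intuitive and powerful for NLP. I am trying to build a sentence search over text that works in two ways, word-based and content-type-based, but so far I have not found a spaCy solution.

I have text like this: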

In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".[2]

As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.[3] A quip in Tesler's Theorem says "AI is whatever hasn't been done yet."[4] For instance, optical character recognition is frequently excluded from things considered to be AI,[5] having become a routine technology.[6] Modern machine capabilities generally classified as AI include successfully understanding human speech,[7] competing at the highest level in strategic game systems (such as chess and Go),[8] autonomously operating cars, intelligent routing in content delivery networks, and military simulations[9].

Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism,[10][11] followed by disappointment and the loss of funding (known as an "AI winter"),[12][13] followed by new approaches, success and renewed funding.[11][14] For most of its history, AI research has been divided into sub-fields that often fail to communicate with each other.[15] These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"),[16] the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences.[17][18][19] Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).[15]

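Now I want to extract the complete sentences that match one or more words or strings. For example, I want to search for intelligent and machine learning, and print every complete sentence that contains either or both of those strings.

Is there a way to load a spaCy model and have it do this kind of phrase matching, i.e. find every sentence that contains intelligent and machine learning and print it? As a further option, when I search for machine learning, can it also suggest related terms such as deep learning, artificial intelligence, pattern recognition, etc.?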

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)

phrases = ['machine learning', 'intelligent', 'human']

patterns = [nlp(phrase) for phrase in phrases]

# Register all patterns under the key 'AI'
# (list form; older spaCy 2.x releases used add('AI', None, *patterns))
phrase_matcher.add('AI', patterns)

# processed_article holds the article text shown above
sentence = nlp(processed_article)

matched_phrases = phrase_matcher(sentence)

for match_id, start, end in matched_phrases:
    string_id = nlp.vocab.strings[match_id]  # pattern key, here 'AI'
    span = sentence[start:end]               # matched tokens only, not the whole sentence
    print(match_id, string_id, start, end, span.text)

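I tried this, but it does not return the complete sentences, only the matched words along with their match ID.

In short:

  1. I am trying to search with multiple input words and find the complete sentences that contain any one of the input strings, or all of them.
  2. I am trying to use a trained model to find suggested sentences based on the input.

Best answer

Part 1:

I want to search for intelligent and machine learning, and print all complete sentences that contain either or both of the given strings.

This is how you can find the complete sentences that contain the keywords you are looking for. Keep in mind that sentence boundaries are determined statistically, so this works well if the incoming paragraphs come from news or Wikipedia, but it will not work as well if the data comes from social media.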

import spacy
from spacy.matcher import PhraseMatcher

text = """I like tomtom and I cannot lie. In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.  Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its  environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially,  the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive"  functions that humans associate with the human mind, such as "learning" and "problem solving".[2] """

nlp = spacy.load("en_core_web_sm")

phrase_matcher = PhraseMatcher(nlp.vocab)
phrases = ['machine learning', 'artificial intelligence']
# Use a separate loop variable so the article text above is not shadowed
patterns = [nlp(phrase) for phrase in phrases]
# List form; older spaCy 2.x releases used add('AI', None, *patterns)
phrase_matcher.add('AI', patterns)

doc = nlp(text)

for sent in doc.sents:
    # as_doc() copies the sentence into its own Doc without re-running the pipeline
    matches = phrase_matcher(sent.as_doc())
    # Print each matching sentence once, even if it contains several matches
    if any(nlp.vocab.strings[match_id] == "AI" for match_id, start, end in matches):
        print(sent.text)

Output

In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.  
Colloquially,  the term "artificial intelligence" is often used to describe machines (or computers)
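
The loop above prints a sentence as soon as any phrase matches. The question also asks for sentences that contain all of the given phrases, so here is a minimal sketch of one way to support both modes. It assumes spaCy v3 (for nlp.add_pipe("sentencizer")) and uses a blank English pipeline with the rule-based sentencizer instead of en_core_web_sm, so the sentence splits can differ from the statistical model's. The function name sentences_with_phrases and the require_all parameter are made up for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher


def sentences_with_phrases(text, phrases, require_all=False):
    """Return sentences containing any (or, with require_all, every) phrase."""
    # Assumption: a blank pipeline plus the rule-based sentencizer is enough;
    # swap in spacy.load("en_core_web_sm") for statistical sentence boundaries.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")

    # attr="LOWER" makes the matching case-insensitive
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for phrase in phrases:
        # One key per phrase, so we can tell which phrase produced a match
        matcher.add(phrase, [nlp.make_doc(phrase)])

    doc = nlp(text)
    found = {}  # sentence start index -> (sentence span, set of matched phrases)
    for match_id, start, end in matcher(doc):
        sent = doc[start:end].sent
        entry = found.setdefault(sent.start, (sent, set()))
        entry[1].add(nlp.vocab.strings[match_id])

    wanted = set(phrases)
    results = []
    for sent_start in sorted(found):
        sent, matched = found[sent_start]
        if not require_all or matched == wanted:
            results.append(sent.text)
    return results


if __name__ == "__main__":
    sample = ("Machine learning is a subfield. "
              "Intelligent agents use machine learning. Cats sleep.")
    for sentence in sentences_with_phrases(sample, ["machine learning", "intelligent"]):
        print(sentence)
```

Running the matcher once over the whole document and looking up each match's sentence through the span's .sent attribute avoids building a new Doc per sentence.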

Part 2:

Can it also, when searching for machine learning, suggest deep learning, artificial intelligence, pattern recognition, etc.?

Yes. That is very much possible; you need to use word2vec or sense2vec to do that.

This post is based on a similar question found on Stack Overflow, "python - How to extract sentences containing key phrases in spaCy": https://stackoverflow.com/questions/62776477/
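
To make the idea concrete, here is a rough sketch of how vector-based suggestion works in principle: each term has a vector, and the terms whose vectors are closest to the query (by cosine similarity) become the suggestions. The three-dimensional vectors below are invented purely for illustration and stand in for real word2vec or sense2vec embeddings. The names TOY_VECTORS, cosine and suggest are hypothetical and are not part of either library.

```python
import math

# Hypothetical toy vectors, made up for this example only.
# Real word2vec / sense2vec vectors have hundreds of dimensions.
TOY_VECTORS = {
    "machine learning": [0.9, 0.8, 0.1],
    "deep learning": [0.85, 0.9, 0.15],
    "pattern recognition": [0.7, 0.6, 0.3],
    "artificial intelligence": [0.8, 0.7, 0.2],
    "banana": [0.05, 0.1, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest(query, vectors, top_n=3):
    """Return the top_n terms most similar to the query, excluding the query itself."""
    query_vec = vectors[query]
    scored = [(cosine(query_vec, vec), term)
              for term, vec in vectors.items() if term != query]
    scored.sort(reverse=True)
    return [term for score, term in scored[:top_n]]


if __name__ == "__main__":
    print(suggest("machine learning", TOY_VECTORS))
```

The suggested terms could then be fed back into the PhraseMatcher as extra patterns, so that a search for machine learning also returns sentences that mention the related terms.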
