gensim - Find the most similar sentences in a doc using spaCy

Tags: gensim similarity spacy doc2vec sentence-similarity

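I am looking for a solution like Gensim's most_similar(), but using spaCy.
I want to use NLP to find the most similar sentences in a list of sentences.

I tried calling spaCy's similarity() (see https://spacy.io/api/doc#similarity) on the sentences one by one in a loop, but it takes a very long time.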

Going deeper:

I would like to put all of these sentences in a graph (like this) to find clusters of sentences.

Any ideas?

Best answer

Here is a simple built-in solution you can use:

import spacy

nlp = spacy.load("en_core_web_lg")
text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)
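doc = nlp(text)
# Materialise the sentences once; doc.sents is a generator.
sentences = list(doc.sents)
# Start below any possible cosine similarity so negative scores are handled too.
max_similarity = float("-inf")
most_similar = None, None
# Compare each unordered pair of sentences exactly once.
for i, sent in enumerate(sentences):
    for other in sentences[i + 1:]:
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other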
print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")

(Text from Wikipedia)
It produces the following output:
Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
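
If the pairwise loop is too slow for a long list of sentences, one option is to collect the sentence vectors once and compute all cosine similarities with a single matrix product, which gives something close to Gensim's most_similar(). The following is only a sketch: the most_similar helper and the toy vectors are hypothetical, and it assumes you have already built an array with one vector per row, for example from sent.vector for each sentence in doc.sents.

```python
import numpy as np


def most_similar(query_index, vectors, topn=3):
    """Return (index, cosine similarity) pairs for the rows closest to one row.

    `vectors` is assumed to hold one sentence vector per row, e.g.
    np.array([sent.vector for sent in doc.sents]) built once up front.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Normalise each row once so one matrix product yields cosine similarities.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_index]
    # Drop the query itself, then keep the rest from most to least similar.
    order = [int(i) for i in np.argsort(-scores) if i != query_index]
    return [(i, float(scores[i])) for i in order[:topn]]


# Toy 3-dimensional vectors standing in for real sentence vectors (illustrative only).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
print(most_similar(0, toy_vectors, topn=2))
```

With the toy data, row 1 ranks first and row 3 second for query row 0. With real data the returned indices map back to positions in your list of sentences.
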
Note the following information from spacy.io:

To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:

- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg

See also Document similarity in Spacy vs Word2Vec for suggestions on how to improve the similarity scores.
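
For the graph and cluster part of the question, one simple approach (not the only one) is to link every pair of sentences whose similarity reaches a threshold and treat each connected component of that graph as a cluster. The sketch below is an assumption-laden illustration: cluster_by_threshold, the similarity callback, the 0.8 threshold and the toy scores are all hypothetical. With spaCy, the callback could wrap sentences[i].similarity(sentences[j]).

```python
from itertools import combinations


def cluster_by_threshold(n_items, similarity, threshold=0.8):
    """Group item indices into clusters via connected components.

    `similarity(i, j)` is a hypothetical callback returning a score for a pair.
    """
    # Build an undirected graph: an edge links every pair scoring >= threshold.
    neighbours = {i: set() for i in range(n_items)}
    for i, j in combinations(range(n_items), 2):
        if similarity(i, j) >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    # Depth-first search collects each connected component as one cluster.
    clusters, seen = [], set()
    for start in range(n_items):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Made-up similarity scores for four sentences (illustrative only).
toy_scores = {(0, 1): 0.95, (0, 2): 0.40, (0, 3): 0.30,
              (1, 2): 0.85, (1, 3): 0.20, (2, 3): 0.10}
print(cluster_by_threshold(4, lambda i, j: toy_scores[(i, j)]))
```

Here sentences 0, 1 and 2 end up in one cluster because 0 links to 1 and 1 links to 2, while sentence 3 stays on its own. The threshold controls how coarse the clusters are and needs tuning for real data.
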

Regarding "gensim - Find the most similar sentences in a doc using spaCy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56150678/
