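python-3.x - Semantic similarity between phrases using GenSim

Tags: python-3.x nltk gensim

Background

I am trying to use Gensim to determine whether a phrase is semantically related to other words found in a corpus. For example, here is the pre-tokenized corpus of documents: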

 **Corpus**
 Car Insurance
 Car Insurance Coverage
 Auto Insurance
 Best Insurance
 How much is car insurance
 Best auto coverage
 Auto policy
 Car Policy Insurance

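My code (based on this gensim tutorial) uses cosine similarity to judge how semantically related a phrase is to every string in the corpus.

Problem

It seems that if a query contains ANY term found in my dictionary, the phrase is judged to be semantically similar to the corpus (for example, Giraffe Poop Car Murderer gets a cosine similarity of 1 even though it is not semantically related). I'm not sure how to solve this.

Code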
# Tokenize the corpus; drop stop words and any token that occurs only once
texts = [[word for word in document if word not in stoplist]
        for document in documents]
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
        for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]  
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

#convert the query to LSI space
vec_lsi = lsi[vec_bow]              
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

Best answer

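First of all, you are not directly comparing the cosine similarity of bag-of-words vectors. You first reduce the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). That is fine, but I wanted to emphasise it. It is usually assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. LSA therefore applies a truncated singular value decomposition (closely related to principal component analysis) to your vector space and keeps only the directions that contain the most variance (i.e. the directions along which the data varies most, which are therefore assumed to carry the most information). How many directions are kept is controlled by the num_topics parameter you pass to the LsiModel constructor.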
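For contrast, here is a minimal sketch of what comparing raw bag-of-words vectors directly would look like, with no LSI step. The helper name bow_cosine is made up for this illustration and is not part of gensim; it only uses the standard library.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


query = ["car"]
print(bow_cosine(query, ["car", "insurance"]))   # the documents share "car"
print(bow_cosine(query, ["best", "insurance"]))  # no shared token
```

In raw bag-of-words space, two documents with no token in common always score 0. LSI can assign such pairs a non-zero similarity because it compares documents in the reduced topic space instead.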

Secondly, I cleaned up your code a bit and embedded the corpus:

# Tokenize the corpus; drop stop words and any token
# that occurs only once

from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',  # doc_id 0
    'Car Insurance Coverage',  # doc_id 1
    'Auto Insurance',  # doc_id 2
    'Best Insurance',  # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',  # doc_id 5
    'Auto policy',  # doc_id 6
    'Car Policy Insurance',  # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)

If you run the above, you get the following output:
[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]

Each entry in this list is a (doc_id, cosine_similarity) pair, sorted by cosine similarity in descending order.

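The only word in your query document (giraffe poop car murderer) that is actually part of your vocabulary (which is built from your corpus) is car, so all other tokens are dropped. Your query to the model therefore consists of the singleton document car. Consequently, every document that contains car comes out as very similar to your input query.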
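One way to notice this situation before trusting the similarity scores is to check how much of the query the vocabulary actually covers. The sketch below is an illustration only: vocabulary_coverage and the MIN_COVERAGE threshold are hypothetical names, and the plain dict stands in for the token-to-id mapping (with gensim, dictionary.token2id could be passed instead; the ids here are arbitrary).

```python
def vocabulary_coverage(query, token2id):
    """Return (fraction of query tokens in the vocabulary, list of known tokens)."""
    tokens = query.lower().split()
    if not tokens:
        return 0.0, []
    known = [t for t in tokens if t in token2id]
    return len(known) / len(tokens), known


# Stand-in for the vocabulary left after the stop-word and frequency filters.
token2id = {"car": 0, "insurance": 1, "coverage": 2,
            "auto": 3, "best": 4, "policy": 5}

MIN_COVERAGE = 0.5  # hypothetical threshold, tune for your data

coverage, known = vocabulary_coverage("giraffe poop car murderer", token2id)
print(coverage, known)
if coverage < MIN_COVERAGE:
    print("Most of the query is out of vocabulary; treat the scores with caution.")
```

A low coverage value means the model only ever sees a small fragment of the query, so a high cosine similarity says little about the phrase as a whole.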

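The reason document #3 (Best Insurance) also ranks highly is that the token insurance frequently co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).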
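The co-occurrence claim can be checked by hand. The following sketch counts, for each pair of tokens, how many documents contain both. It uses the example corpus after stop-word removal and the frequency filter, written out by hand here instead of being produced by gensim.

```python
from collections import Counter
from itertools import combinations

# The example corpus after filtering; "how", "is" and the one-off
# token "much" have been removed from document 4.
texts = [
    ["car", "insurance"],
    ["car", "insurance", "coverage"],
    ["auto", "insurance"],
    ["best", "insurance"],
    ["car", "insurance"],
    ["best", "auto", "coverage"],
    ["auto", "policy"],
    ["car", "policy", "insurance"],
]

# Count the documents in which each unordered token pair appears together.
cooccurrence = Counter()
for text in texts:
    for pair in combinations(sorted(set(text)), 2):
        cooccurrence[pair] += 1

print(cooccurrence[("car", "insurance")])  # documents containing both tokens
print(cooccurrence[("best", "car")])       # these two never appear together
```

Every document that contains car also contains insurance, so a low-dimensional LSI space tends to place the two tokens close together, which pulls documents containing only insurance towards a car query.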

Source: Stack Overflow, https://stackoverflow.com/questions/31821821/
