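In a blog post I read that the following "naive implementation" of cosine similarity should never be used in production. The post did not explain why, and I am really curious. Can anyone explain?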
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# the counts we computed above
sentence_m = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
sentence_h = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])
sentence_w = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1])
# We should expect sentence_m and sentence_h to be more similar
print(cos_sim(sentence_m, sentence_h)) # 0.5
print(cos_sim(sentence_m, sentence_w)) # 0.25
Best answer
The cos_sim function is as it should be. The problem is representing the sentences with raw counts. Consider using tf-idf instead.
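To illustrate the suggestion, here is a minimal sketch that reweights the count vectors from the question with tf-idf before calling the same cos_sim. It assumes the plain idf = log(N / df) variant, which is only one of several common definitions (libraries often add smoothing), and the names counts, doc_freq, idf and tfidf are made up for this example.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity, same definition as in the question."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rows are sentence_m, sentence_h, sentence_w from the question.
counts = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 1, 1],
], dtype=float)

n_docs = counts.shape[0]
# Number of sentences that contain each term.
doc_freq = np.count_nonzero(counts, axis=0)
# Assumed idf variant: log(N / df). Requires every term to occur in
# at least one sentence, which holds for this data.
idf = np.log(n_docs / doc_freq)
tfidf = counts * idf

sim_mh = cos_sim(tfidf[0], tfidf[1])
sim_mw = cos_sim(tfidf[0], tfidf[2])
print(sim_mh)
print(sim_mw)
```

With this weighting, the one term shared by all three sentences gets an idf of zero, so it no longer contributes to any similarity. sentence_m and sentence_w share only that term, so their similarity falls to 0, while sentence_m and sentence_h still share one less common term and keep a small positive score.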
Source: a similar question on Stack Overflow, "python - What is wrong with a simple implementation of cosine similarity?": https://stackoverflow.com/questions/53775465/