python - Understanding np.zeros in clustering

Tags: python python-3.x vector cluster-analysis similarity

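I'm learning about clustering, and in several tutorials I've come across a part of the similarity-measure code that I don't quite understand: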

tfidf_vector = TfidfVectorizer()
tfidf_matrix = tfidf_vector.fit_transform(file)

#and/or

count_vector = CountVectorizer()
count_matrix = count_vector.fit_transform(file)

#AND HERE
file_size = len(file)
x = np.zeros((file_size, file_size))
#and here the similarity measures like cosine_similarity, jaccard...

for elm in range(file_size):
    x[elm] = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix)

# dtype=float replaces np.float, which was removed in NumPy 1.24
y = np.subtract(np.ones((file_size, file_size), dtype=float), x)

new_file = np.asarray(y)
w = new_file.reshape((1,file_size,file_size))

Why do we need np.zeros? Aren't tfidf_matrix/count_matrix enough for computing the similarity measures?

Best answer

This code does the same thing (I changed i to elm, since it looks like a typo):

x = []
for elm in range(file_size):
    # cosine_similarity returns a (1, file_size) array; [0] takes its single row
    x.append(cosine_similarity(tfidf_matrix[elm:elm+1], tfidf_matrix)[0])
x = np.asarray(x)

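You could also replace np.zeros with np.empty, since every row gets overwritten anyway. Creating the array up front and then filling in each element is slightly more efficient than appending to a list and converting it to a numpy array afterwards. Many other programming languages require arrays to be preallocated, as numpy does, which is why a lot of people fill arrays this way.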
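To make the comparison concrete, here is a minimal sketch of the three ways of filling the matrix. It uses plain numpy only; `cosine_row` is a hypothetical helper written for this illustration (it is not part of numpy or scikit-learn), and the `vectors` array holds made-up numbers rather than real TF-IDF output.

```python
import numpy as np


def cosine_row(matrix, i):
    """Hypothetical helper: cosine similarity of row i against every row."""
    norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ matrix[i]) / (norms * norms[i])


# Toy dense "document vectors" (made-up numbers, not real TF-IDF output).
vectors = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 1.0],
                    [2.0, 1.0, 0.0]])
n = len(vectors)

# 1) Preallocate with zeros, then overwrite every row.
x_zeros = np.zeros((n, n))
for elm in range(n):
    x_zeros[elm] = cosine_row(vectors, elm)

# 2) Preallocate with empty: the initial contents are arbitrary, so this is
#    only safe because the loop overwrites every row.
x_empty = np.empty((n, n))
for elm in range(n):
    x_empty[elm] = cosine_row(vectors, elm)

# 3) Append each row to a list and convert once at the end.
rows = []
for elm in range(n):
    rows.append(cosine_row(vectors, elm))
x_list = np.asarray(rows)
```

All three variants should end up with the same (n, n) array; the zeros (or empty) array is just a container for the results, not an input to the similarity computation.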

However, since this is Python, you should go with whichever approach you think is easiest for you and others to read.
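On readability: scikit-learn's cosine_similarity, when given a single matrix, computes the similarity between every pair of its rows, so the loop and the preallocated array may not be needed at all. The sketch below assumes a tiny made-up corpus (`docs` is illustrative, standing in for `file` from the question) and compares the one-call result with the row-by-row loop.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus for illustration; stands in for `file` in the question.
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "clustering groups similar documents"]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# One call: pairwise similarity between all rows of tfidf_matrix.
similarity = cosine_similarity(tfidf_matrix)

# Distance matrix, as in the question: 1 - similarity.
distance = 1.0 - similarity

# Row-by-row version from the question, for comparison.
looped = np.zeros((len(docs), len(docs)))
for elm in range(len(docs)):
    looped[elm] = cosine_similarity(tfidf_matrix[elm:elm + 1], tfidf_matrix)
```

Here `similarity` and `looped` should hold the same values, so the choice between them comes down to which reads better.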

Regarding python - Understanding np.zeros in clustering, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51501712/
