python - How do I compute a word-word co-occurrence matrix with sklearn?

I am looking for a module in sklearn that lets you derive a word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to go from there to a word-word co-occurrence matrix.

Best answer

Here is my example solution using CountVectorizer from scikit-learn. As described in this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.

from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))  # default unigram model
X = count_model.fit_transform(docs)  # sparse document-term count matrix
# X[X > 0] = 1  # run this line if you don't want extra within-text co-occurrence (see below)
Xc = X.T @ X  # word-word co-occurrence matrix, still sparse
Xc.setdiag(0)  # optionally zero out each word's co-occurrence with itself
print(Xc.todense())  # print the matrix in dense format
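
To see why the product of the transposed document-term matrix with itself yields co-occurrence counts, here is a minimal pure-Python sketch of the same sum. It assumes plain whitespace tokenization, which happens to give the same tokens as CountVectorizer for these particular documents; the helper name `cooccurrence` is hypothetical and not part of scikit-learn.

```python
from collections import Counter
from itertools import product

docs = ['this this this book',
        'this cat good',
        'cat good shit']

def cooccurrence(docs):
    """Return {(w1, w2): sum over documents of count(w1) * count(w2)}."""
    totals = Counter()
    for doc in docs:
        counts = Counter(doc.split())  # assumption: whitespace tokenization
        # Every ordered pair of words in the document, including (w, w)
        for w1, w2 in product(counts, repeat=2):
            totals[(w1, w2)] += counts[w1] * counts[w2]
    return totals

co = cooccurrence(docs)
print(co[('this', 'book')])  # within-document counts multiplied together
print(co[('cat', 'good')])   # one contribution per shared document
print(co[('this', 'this')])  # diagonal entry: sum of squared counts
```

The ('this', 'book') entry comes out larger than 1 even though the two words share only one document, because 'this' is repeated three times in it. That is the within-text effect the commented-out line in the answer is meant to suppress.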

You can also look up the vocabulary stored in count_model:

count_model.vocabulary_
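
As a sketch of how the vocabulary can be used, `vocabulary_` maps each word to its column index in the document-term matrix, and the same index addresses the rows and columns of the co-occurrence matrix. The variable names `vocab` and `cat_good` below are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']

count_model = CountVectorizer()
X = count_model.fit_transform(docs)
Xc = X.T @ X  # word-word co-occurrence matrix

vocab = count_model.vocabulary_  # dict: word -> column index
i, j = vocab['cat'], vocab['good']
cat_good = int(Xc[i, j])  # co-occurrence count of 'cat' and 'good'
print(sorted(vocab))
print(cat_good)
```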

Alternatively, you can normalize by the diagonal entries (following the answer in the post referenced above).

import scipy.sparse as sp
Xc = X.T @ X  # recompute so the diagonal has not been zeroed
g = sp.diags(1. / Xc.diagonal())  # diagonal matrix of reciprocal self-counts
Xc_norm = g @ Xc  # normalized co-occurrence matrix
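
To make the effect of this normalization concrete, the sketch below repeats it on a hand-typed count matrix for the three example documents (the column order book, cat, good, shit, this is an assumption made here so the block runs without scikit-learn). Left-multiplying by the diagonal matrix divides each row by that word's own diagonal entry.

```python
import numpy as np
import scipy.sparse as sp

# Document-term counts typed in by hand for the three example documents.
# Assumed column order: book, cat, good, shit, this.
X = sp.csr_matrix(np.array([[1, 0, 0, 0, 3],
                            [0, 1, 1, 0, 1],
                            [0, 1, 1, 1, 0]]))

Xc = X.T @ X
g = sp.diags(1. / Xc.diagonal())
Xc_norm = (g @ Xc).toarray()  # dense copy for easy inspection

book, this = 0, 4
print(Xc_norm[this, book])  # count(this, book) / count(this, this)
print(Xc_norm[book, this])  # count(this, book) / count(book, book)
```

Note that the normalized matrix is generally not symmetric, since each row is scaled by a different diagonal entry. The division also requires a non-zero diagonal, so it should be applied before any call to `setdiag(0)`.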

Also note @Federico Caccia's answer: if you do not want spurious co-occurrences caused by repeats within a single text, set every count greater than 1 to 1, e.g.

X[X > 0] = 1  # do this first, before computing the co-occurrence matrix
Xc = X.T @ X
...
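
A possible alternative to overwriting the matrix afterwards is the `binary=True` option of CountVectorizer, which sets all non-zero counts to 1 when the matrix is built. This is a sketch of that variant, not the method from the answer above; the names `binary_model`, `this_book`, and `this_this` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']

# binary=True records presence/absence instead of raw counts
binary_model = CountVectorizer(binary=True)
X_bin = binary_model.fit_transform(docs)
Xc_bin = X_bin.T @ X_bin  # entries now count shared documents

vocab = binary_model.vocabulary_
this_book = int(Xc_bin[vocab['this'], vocab['book']])
this_this = int(Xc_bin[vocab['this'], vocab['this']])
print(this_book)  # number of documents containing both words
print(this_this)  # number of documents containing 'this'
```

With binary counts, each off-diagonal entry is the number of documents in which both words appear, and each diagonal entry is the word's document frequency.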

This answer to "python - How do I compute a word-word co-occurrence matrix with sklearn?" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/35562789/
