python - L2 normalization of rows in a scipy sparse matrix

Tags: python numpy scipy

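Since I only want to use numpy and scipy (I don't want to use scikit-learn), I would like to know how to perform L2 normalization of the rows of a huge scipy csc_matrix (2,000,000 x 500,000). The operation must consume as little memory as possible, because everything has to fit in memory.

What I have so far is: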

import numpy as np
import scipy.sparse as sp

tf_idf_matrix = sp.lil_matrix((n_docs, n_terms), dtype=np.float16)
# ... perform several operations and fill up the matrix

tf_idf_matrix = tf_idf_matrix / l2_norm(tf_idf_matrix)
# l2_norm() is what I want

def l2_norm(sparse_matrix):
    pass

Best answer

Since I could not find an answer anywhere, I will post here how I solved the problem.

def l2_norm(sparse_csc_matrix):
    # first, convert a copy of the csc_matrix to csr_matrix (linear time),
    # so that squaring the values does not touch the original data
    norm = sparse_csc_matrix.tocsr(copy=True)

    # compute the inverse of the L2 norm of every non-empty row
    norm.data **= 2
    norm = np.asarray(norm.sum(axis=1)).ravel()
    nonzero = norm > 0
    norm[nonzero] = 1.0 / np.sqrt(norm[nonzero])

    # modify sparse_csc_matrix in place (returns None; assumes a float dtype).
    # In CSC format, indices holds the row index of each stored value, so every
    # value is multiplied by the factor of its own row. This replaces the call
    # to sp.sparsetools.csr_scale_rows, which expects CSR arrays and is not
    # part of the public scipy API.
    sparse_csc_matrix.data *= norm[sparse_csc_matrix.indices]
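
As a possible alternative that avoids copying the whole matrix to CSR, the row sums of squares can be accumulated directly from the CSC arrays. This is only a sketch under a few assumptions: the helper name l2_normalize_rows_csc is made up for illustration, the matrix is assumed to have a floating-point dtype, and the temporary arrays are of size nnz and n_rows rather than a full matrix copy.

```python
import numpy as np
import scipy.sparse as sp


def l2_normalize_rows_csc(matrix):
    """Scale each row of a CSC matrix to unit L2 norm, in place (hypothetical helper)."""
    # Merge duplicate entries first, so each stored value is a full matrix element.
    matrix.sum_duplicates()
    # In CSC format, matrix.indices stores the row index of every stored value,
    # so bincount accumulates the squared values row by row.
    sq_sums = np.bincount(matrix.indices,
                          weights=np.square(matrix.data, dtype=np.float64),
                          minlength=matrix.shape[0])
    # Inverse norms; all-zero rows keep a factor of 0 (they hold no stored values).
    inv_norms = np.zeros_like(sq_sums)
    nonzero = sq_sums > 0
    inv_norms[nonzero] = 1.0 / np.sqrt(sq_sums[nonzero])
    # Multiply each stored value by the factor of the row it belongs to.
    matrix.data *= inv_norms[matrix.indices]
    return matrix


# Small check on a 3 x 2 matrix with one empty row.
m = sp.csc_matrix(np.array([[3.0, 4.0],
                            [0.0, 0.0],
                            [0.0, 2.0]]))
l2_normalize_rows_csc(m)
row_norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())
```

After the call, every non-empty row of m should have norm 1 and the empty row should stay at 0. Whether this uses less memory than the CSR copy in practice depends on the matrix and the scipy version, so it is worth measuring on real data.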

If anyone has a better approach, please post it.

Regarding python - L2 normalization of rows in a scipy sparse matrix, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22122035/
