python - Adding a very repetitive matrix to a sparse one in numpy/scipy?

Tags: python matrix numpy scipy sparse-matrix

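I'm trying to implement a function in NumPy/SciPy to compute the Jensen-Shannon divergence between a single (training) vector and a large number of other (observation) vectors. The observation vectors are stored in a very large (500,000x65536) SciPy sparse matrix (a dense matrix won't fit into memory).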

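As part of the algorithm I need to compute T+Oi for each observation vector Oi, where T is the training vector. I wasn't able to find a way to do this using NumPy's usual broadcasting rules, since sparse matrices don't seem to support them: if T is left as a dense array, SciPy tries to make the sparse matrix dense first, which runs out of memory; if I make T a sparse matrix, T+Oi fails because the shapes are inconsistent.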
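
For context on why T+Oi is needed: the Jensen-Shannon divergence compares each distribution against the mixture M = (T+Oi)/2. The following is a minimal dense sketch for a single pair of vectors, assuming both are already normalized to sum to 1. The function names `kl_divergence` and `jensen_shannon` are illustrative and do not come from the code in this question.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) with natural log; terms where p == 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(t, o):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (t + o)  # the mixture needs the elementwise sum T + Oi
    return 0.5 * kl_divergence(t, m) + 0.5 * kl_divergence(o, m)

t = np.array([0.5, 0.5, 0.0])
o = np.array([0.0, 0.5, 0.5])
jsd = jensen_shannon(t, o)
```

Wherever `t` or `o` is positive the mixture `m` is positive too, so the masked division is safe. With natural logarithms the value lies between 0 (identical vectors) and ln 2 (disjoint supports).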

Currently I'm taking the grossly inefficient step of tiling the training vector into a 500,000x65536 sparse matrix:

import numpy as np
import scipy.sparse as sp

training = sp.csr_matrix(training.astype(np.float32))
# Every row of the tiled matrix holds the same len(training.indices) entries.
tindptr = np.arange(0, len(training.indices)*observations.shape[0]+1, len(training.indices), dtype=np.int32)
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)

But this takes up a huge amount of memory (around 6GB) when it's only storing about 1500 "real" elements. It's also pretty slow to build.
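
The 6GB figure is consistent with a back-of-the-envelope estimate. The sketch below assumes about 1500 stored values per row, 4-byte float32 data and 4-byte int32 indices; the variable names are only for illustration.

```python
rows = 500_000
nnz_per_row = 1500   # assumed number of non-zeros in the training vector
bytes_data = 4       # float32
bytes_index = 4      # int32

stored = rows * nnz_per_row
data_bytes = stored * bytes_data
indices_bytes = stored * bytes_index
indptr_bytes = (rows + 1) * bytes_index

total_bytes = data_bytes + indices_bytes + indptr_bytes
print(f"{stored:,} stored elements, about {total_bytes / 1e9:.1f} GB")
```

Almost all of the memory goes to the `data` and `indices` arrays, which hold the same 1500 values repeated once per row.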

I tried to get clever by using stride_tricks to make the CSR matrix's indices and data members not use extra memory for the repeated data:

training = sp.csr_matrix(training)
mtraining = sp.csr_matrix(observations.shape, dtype=np.int32)
tdata = training.data
vdata = np.lib.stride_tricks.as_strided(tdata, (mtraining.shape[0], tdata.size), (0, tdata.itemsize))
indices = training.indices
vindices = np.lib.stride_tricks.as_strided(indices, (mtraining.shape[0], indices.size), (0, indices.itemsize))
mtraining.indptr = np.arange(0, len(indices)*mtraining.shape[0]+1, len(indices), dtype=np.int32)
mtraining.data = vdata
mtraining.indices = vindices

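But this doesn't work because the strided views mtraining.data and mtraining.indices are the wrong shape (and according to this answer there's no way to make them the right shape). Trying to make them look flat with the .flat iterator fails because it doesn't look enough like an array (for example, it has no dtype member), and using the flatten() method ends up making a copy.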
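
The copy can be observed on a small array. This is a minimal sketch with made-up values; `n_rows` is a hypothetical repetition count standing in for the number of observations.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

base = np.array([1.0, 2.0, 3.0], dtype=np.float32)
n_rows = 4  # hypothetical number of repetitions

# Zero stride along axis 0: every "row" aliases the same three values.
view = as_strided(base, shape=(n_rows, base.size), strides=(0, base.itemsize))
view_shares = np.shares_memory(view, base)

# A flat 1-D array needs one constant stride, which this layout cannot
# provide, so reshape has to materialize a copy.
flat = view.reshape(-1)
flat_shares = np.shares_memory(flat, base)

print(view.shape, view_shares)
print(flat.shape, flat_shares)
```

The 2-D view shares memory with `base`, while the flattened array does not. That is why the repeated data cannot be handed to the CSR constructor as a flat array without paying for the full copy.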



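Another option, which I hadn't even considered, is to implement the sum in sparse format yourself, so that you can take full advantage of the periodic nature of your array. This is very easy to do if you abuse this peculiar behavior of scipy's sparse matrices: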

>>> import numpy as np
>>> import scipy.sparse as sps
>>> a = sps.csr_matrix([1,2,3,4])
>>> a.data
array([1, 2, 3, 4])
>>> a.indices
array([0, 1, 2, 3])
>>> a.indptr
array([0, 4])

>>> b = sps.csr_matrix((np.array([1, 2, 3, 4, 5]),
...                     np.array([0, 1, 2, 3, 0]),
...                     np.array([0, 5])), shape=(1, 4))
>>> b
<1x4 sparse matrix of type '<type 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>
>>> b.todense()
matrix([[6, 2, 3, 4]])
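
The session above suggests that entries sharing a column index within a row are summed when the matrix is converted (column 0 holds 1 + 5 = 6). Adding a vector to every row can therefore be done by appending the vector's entries to each row's data, without looking for matching columns. A tiny hand-built sketch of that idea, with made-up values:

```python
import numpy as np
import scipy.sparse as sps

# Matrix [[1, 0, 2, 0], [0, 3, 0, 0]] in raw CSR form.
mat_data = np.array([1, 2, 3])
mat_indices = np.array([0, 2, 1])
mat_indptr = np.array([0, 2, 3])

# Vector [10, 0, 20, 0] in raw form.
vec_data = np.array([10, 20])
vec_indices = np.array([0, 2])

# Append the vector's entries to the end of each row, duplicates and all.
new_data = np.concatenate([mat_data[0:2], vec_data, mat_data[2:3], vec_data])
new_indices = np.concatenate(
    [mat_indices[0:2], vec_indices, mat_indices[2:3], vec_indices])
new_indptr = np.array([0, 4, 7])

summed = sps.csr_matrix((new_data, new_indices, new_indptr), shape=(2, 4))
dense = summed.toarray()
print(dense)
```

The matrix stores 7 entries even though only 6 positions are non-zero; the duplicates in row 0 are combined only when the values are read out.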



import numpy as np
import scipy.sparse as sps

def csr_add_sparse_vec(sps_mat, sps_vec):
    """Adds a sparse vector to every row of a sparse matrix"""
    # No checks done, but both arguments should be sparse matrices in CSR
    # format, both should have the same number of columns, and the vector
    # should be a vector and have only one row.

    rows, cols = sps_mat.shape
    nnz_vec = len(sps_vec.data)
    nnz_per_row = np.diff(sps_mat.indptr)
    longest_row = np.max(nnz_per_row)

    # Pad every row to the same length. The padding entries are explicit
    # zeros in column 0, so they do not change any sum.
    old_data = np.zeros((rows * longest_row,), dtype=sps_mat.data.dtype)
    old_cols = np.zeros((rows * longest_row,), dtype=sps_mat.indices.dtype)

    data_idx = np.arange(longest_row) < nnz_per_row[:, None]
    data_idx = data_idx.reshape(-1)
    old_data[data_idx] = sps_mat.data
    old_cols[data_idx] = sps_mat.indices
    old_data = old_data.reshape(rows, -1)
    old_cols = old_cols.reshape(rows, -1)

    # Append the vector's entries to the end of every padded row.
    new_data = np.zeros((rows, longest_row + nnz_vec,),
                        dtype=sps_mat.data.dtype)
    new_data[:, :longest_row] = old_data
    del old_data
    new_cols = np.zeros((rows, longest_row + nnz_vec,),
                        dtype=sps_mat.indices.dtype)
    new_cols[:, :longest_row] = old_cols
    del old_cols
    new_data[:, longest_row:] = sps_vec.data
    new_cols[:, longest_row:] = sps_vec.indices
    new_data = new_data.reshape(-1)
    new_cols = new_cols.reshape(-1)
    new_pointer = np.arange(0, (rows + 1) * (longest_row + nnz_vec),
                            longest_row + nnz_vec)

    ret = sps.csr_matrix((new_data, new_cols, new_pointer),
                         shape=sps_mat.shape)

    return ret

It's not as fast as before, but it can do 10,000 rows in about 1 second:
In [2]: a
<10000x65536 sparse matrix of type '<type 'numpy.float64'>'
    with 15000000 stored elements in Compressed Sparse Row format>

In [3]: b
<1x65536 sparse matrix of type '<type 'numpy.float64'>'
    with 1500 stored elements in Compressed Sparse Row format>

In [4]: csr_add_sparse_vec(a, b)
<10000x65536 sparse matrix of type '<type 'numpy.float64'>'
    with 30000000 stored elements in Compressed Sparse Row format>

In [5]: %timeit csr_add_sparse_vec(a, b)
1 loops, best of 3: 956 ms per loop
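
The same duplicate-summing idea can also be written with COO triplets, which avoids the padding step. This is an alternative sketch, not the function timed above. It relies on the COO to CSR conversion summing duplicate (row, col) entries, and the name `add_vec_to_rows_coo` is made up here. No timing claim is made for it.

```python
import numpy as np
import scipy.sparse as sps

def add_vec_to_rows_coo(sps_mat, sps_vec):
    """Add a 1 x n sparse vector to every row of an m x n sparse matrix."""
    mat = sps_mat.tocoo()
    vec = sps_vec.tocoo()
    rows = mat.shape[0]
    nnz_vec = vec.data.size

    # Repeat the vector's entries once per row of the matrix.
    vec_rows = np.repeat(np.arange(rows), nnz_vec)
    vec_cols = np.tile(vec.col, rows)
    vec_vals = np.tile(vec.data, rows)

    all_rows = np.concatenate([mat.row, vec_rows])
    all_cols = np.concatenate([mat.col, vec_cols])
    all_vals = np.concatenate([mat.data, vec_vals])

    # Duplicate (row, col) pairs are summed during the COO -> CSR conversion.
    tiled = sps.coo_matrix((all_vals, (all_rows, all_cols)), shape=mat.shape)
    return tiled.tocsr()

small_mat = sps.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]]))
small_vec = sps.csr_matrix(np.array([[0.0, 5.0, 1.0]]))
result = add_vec_to_rows_coo(small_mat, small_vec)
print(result.toarray())
```

Like the padded version, this still builds temporary arrays with one copy of the vector per row, so peak memory grows with rows times the vector's non-zero count.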

EDIT: The code below is very, very slow:

def csr_add_sparse_vec(sps_mat, sps_vec):
    """Adds a sparse vector to every row of a sparse matrix"""
    # No checks done, but both arguments should be sparse matrices in CSR
    # format, both should have the same number of columns, and the vector
    # should be a vector and have only one row.

    rows, cols = sps_mat.shape

    new_data = sps_mat.data
    new_pointer = sps_mat.indptr.copy()
    new_cols = sps_mat.indices

    aux_idx = np.arange(rows + 1)

    # Each iteration appends one vector entry to the end of every row.
    for value, col in zip(sps_vec.data, sps_vec.indices):
        new_data = np.insert(new_data, new_pointer[1:], [value] * rows)
        new_cols = np.insert(new_cols, new_pointer[1:], [col] * rows)
        new_pointer += aux_idx

    return sps.csr_matrix((new_data, new_cols, new_pointer),
                          shape=sps_mat.shape)
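
One pass of the loop above can be traced on a toy example. `np.insert` takes positions relative to the original array and returns a new array, so every iteration copies all stored elements, which is a plausible reason for the slowness. The values below are made up.

```python
import numpy as np

# Two rows with two stored values each: indptr = [0, 2, 4].
data = np.array([1.0, 2.0, 3.0, 4.0])
indptr = np.array([0, 2, 4])
rows = 2

# Insert one vector value (99.0) at the end of every row.
new_data = np.insert(data, indptr[1:], [99.0] * rows)

# Pointer k moves by k, since each of the k rows before it grew by one.
new_indptr = indptr + np.arange(rows + 1)

print(new_data)
print(new_indptr)
```

With 1500 vector entries, the loop would repeat this full copy 1500 times for both the data and the column indices.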
