python - 重复索引 : np. add.at/sparse.csr_matrix 的最快添加？

假设我有一个 num_indices * n 索引矩阵(在 range(m) 中)和一个 num_indices * n 值矩阵，即

m, n = 100, 50
num_indices = 100000
indices = np.random.randint(0, m, (num_indices, n))
values = np.random.uniform(0, 1, (num_indices, n))

我想获得一个 m * n 形状的结果矩阵，这样每一列都被索引并在索引和值矩阵中的相应列上求和。以下是一些实现:

基本 for 循环

tic = time.time()
res1 = np.zeros((m, n))
for i in range(num_indices):
    for j in range(n):
        res1[indices[i, j], j] += values[i, j]
print(f'element-wise for loop used {time.time() - tic:.5f}s')

np.add.at，多维

tic = time.time()
res2 = np.zeros((m, n))
np.add.at(res2, (indices, np.arange(n)[None, :]), values)
print(f'np.add.at used {time.time() - tic:.5f}s')

np.add.at，带 for 循环

tic = time.time()
reslst = []
for colid in range(n):
    rescol = np.zeros((m, 1))
    np.add.at(rescol, (indices[:, colid], np.zeros(num_indices, dtype=int)), values[:, colid])
    reslst.append(rescol)
res3 = np.hstack(reslst)
print(f'np.add.at with loop used {time.time() - tic:.5f}s')

scipy.sparse，多维

from scipy import sparse
tic = time.time()
res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
print(f'sparse.csr used {time.time() - tic:.5f}s')

scipy.sparse，带有 for 循环

tic = time.time()
reslst = []
for colid in range(n):
    rescol = sparse.csr_matrix((values[:, colid], (indices[:, colid], np.zeros(num_indices, dtype=int))), shape=(m, 1)).toarray()
    reslst.append(rescol)
res5 = np.hstack(reslst)
print(f'sparse.csr with loop used {time.time() - tic:.5f}s')

结果都是一样的:

assert np.isclose(res2, res1).all()
assert np.isclose(res3, res1).all()
assert np.isclose(res4, res1).all()
assert np.isclose(res5, res1).all()

但是，速度很奇怪，而且不令人满意:

for loop used 1.69889s
np.add.at used 0.17287s
np.add.at with loop used 0.47767s
sparse.csr used 0.16847s
sparse.csr with loop used 0.09845s

我的基本问题是:

为什么np.add.at这么慢 - 甚至比scipy.sparse还慢？我认为 numpy 会受益匪浅，尤其是在多维情况下。
对于scipy.sparse，为什么多维比for循环还要慢？没有使用并发吗？ (为什么对于 numpy 来说，多维更快？)
总的来说，有没有更快的方法来实现我想要的功能？

最佳答案

令人惊讶的是，np.add.at 在 Numpy 中的实现效率非常低。

关于scipy.sparse，我无法在我的机器上使用 Scipy 1.7.1 重现相同的性能改进:它几乎没有更快。就像 Numpy 的 np.add.at 一样，它的速度还远远不够。

您可以使用 Numba 高效地实现此操作:

import numba as nb

@nb.njit('(int_[:,::1], float64[:,::1], int_, int_)')
def compute(indices, values, n, m):
    res = np.zeros((m, n), dtype=np.float64)
    for i in range(num_indices):
        for j in range(n):
            #assert 0 <= indices[i, j] < m
            res[indices[i, j], j] += values[i, j]
    return res

tic = time.time()
result = compute(indices, values, n, m)
print(f'Numba used {time.time() - tic:.5f}s')

请注意，该函数假设indices 包含有效值(即没有越界)。否则，该函数可能会崩溃或默默地损坏程序。如果您不确定，您可以启用断言，但代价是计算速度较慢(在我的机器上慢两倍)。

请注意，只要 8 * n * m 足够小，输出数组适合 L1 缓存(通常从 16 KB 到 64知识库)。否则，由于随机访问模式效率低下，它可能会慢一些。如果输出数组不适合 L2 缓存(通常几百 KB)，那么速度会明显变慢。

结果

element-wise for loop used 2.45158s
np.add.at used 0.28600s
sparse.csr used 0.19600s
sparse.csr with loop used 0.18900s
Numba used 0.00500s

因此，Numba 比最快的实现快约 40 倍，比最慢的实现快约 500 倍。 numba 代码速度要快得多，因为代码被编译为优化的 native 二进制文件，与 Numpy 和 Scipy 相比，没有额外的开销(即边界检查、临时数组、函数调用)。

关于python - 重复索引 : np. add.at/sparse.csr_matrix 的最快添加？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/72040697/

python - 重复索引 : np. add.at/sparse.csr_matrix 的最快添加？

结果

上一篇：php - REGEXP 与 CONCAT 中的“重复运算符操作数无效”

下一篇：R 绘制如何将文本链接到气泡以便在单击图例时将其删除