This question is a follow-up to my answer in "Efficient way to compute the Vandermonde matrix".
Here is the setup:
x = np.arange(5000) # an integer array
N = 4
Now, I compute the Vandermonde matrix in two different ways:
m1 = (x ** np.arange(N)[:, None]).T
and,
m2 = x[:, None] ** np.arange(N)
Sanity check:
np.array_equal(m1, m2)
True
Both methods produce the same result, but their performance differs:
%timeit m1 = (x ** np.arange(N)[:, None]).T
42.7 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit m2 = x[:, None] ** np.arange(N)
150 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So the first method, despite requiring a transposition at the end, is still about 3x faster than the second.
The only difference is that in the first case the smaller array is broadcast, whereas in the second case it is the larger array.
So, with a fairly decent understanding of how numpy works, I can guess that the answer would involve the cache. The first method is a lot more cache friendly than the second. However, I'd like an official word from someone with more experience than me.
What could be the reason for this stark contrast in timings?
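For reference, the comparison above can be collected into one self-contained script (absolute timings will vary by machine):

```python
import numpy as np
from timeit import timeit

x = np.arange(5000)
N = 4

# Method 1: broadcast the small arange over x, transpose at the end.
m1 = (x ** np.arange(N)[:, None]).T
# Method 2: broadcast x over the small arange.
m2 = x[:, None] ** np.arange(N)
assert np.array_equal(m1, m2)  # same result either way

t1 = timeit(lambda: (x ** np.arange(N)[:, None]).T, number=1000)
t2 = timeit(lambda: x[:, None] ** np.arange(N), number=1000)
print(f"method 1: {t1 * 1e3:.1f} us/loop, method 2: {t2 * 1e3:.1f} us/loop")
```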
Best answer
I also tried looking at broadcast_arrays:
In [121]: X,Y = np.broadcast_arrays(np.arange(4)[:,None], np.arange(1000))
In [122]: timeit X+Y
10.1 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: X,Y = np.broadcast_arrays(np.arange(1000)[:,None], np.arange(4))
In [124]: timeit X+Y
26.1 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [125]: X.shape, X.strides
Out[125]: ((1000, 4), (4, 0))
In [126]: Y.shape, Y.strides
Out[126]: ((1000, 4), (0, 4))
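A 0 in strides means that stepping along that axis re-reads the same memory, so broadcasting produces views without copying any data. A small sketch, using int32 explicitly so the 4-byte strides match the session above:

```python
import numpy as np

# int32 gives 4-byte strides, matching the values shown above.
X, Y = np.broadcast_arrays(np.arange(1000, dtype=np.int32)[:, None],
                           np.arange(4, dtype=np.int32))
assert X.shape == Y.shape == (1000, 4)
assert X.strides == (4, 0)  # axis 1 re-reads the same 1000 values
assert Y.strides == (0, 4)  # axis 0 re-reads the same 4 values
# Neither view owns its data; both are backed by the tiny originals.
assert not X.flags.owndata and not Y.flags.owndata
```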
np.ascontiguousarray converts the 0-strided dimensions into full ones:
In [132]: Y1 = np.ascontiguousarray(Y)
In [134]: Y1.strides
Out[134]: (16, 4)
In [135]: X1 = np.ascontiguousarray(X)
In [136]: X1.shape
Out[136]: (1000, 4)
Operating on the full arrays is faster:
In [137]: timeit X1+Y1
4.66 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So there is some kind of time penalty for operating on 0-strided arrays, even though broadcasting never explicitly expands the arrays first. The cost appears to depend on the shapes, and possibly on which dimension is being expanded.
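Putting the pieces together, a short sketch verifying that materializing the broadcast views gives the same result while restoring ordinary, non-zero strides:

```python
import numpy as np

X, Y = np.broadcast_arrays(np.arange(1000)[:, None], np.arange(4))

# Materialize the broadcast views into full, contiguous buffers.
X1 = np.ascontiguousarray(X)
Y1 = np.ascontiguousarray(Y)

# Same values, but now every axis has a real (non-zero) stride,
# so the addition walks memory linearly instead of re-reading it.
assert np.array_equal(X + Y, X1 + Y1)
assert 0 in X.strides and 0 not in X1.strides
```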
Regarding "python - NumPy arithmetic with broadcasting - why is one method's performance so much better?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48253210/