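How does the memory order of a NumPy array affect multiplication speed, and how can I choose it automatically based on the size of the matrices?
The question originally came from code that uses cudamat: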
import time

import numpy as np
import cudamat as cm


def test_mat():
    # need to init cublas?
    # cm.cublas_init()
    n = 1024
    for i in range(1, 20):  # 2^15 max or python fails
        m = 2 ** i
        # print(m)
        print(i)
        try:
            # CPU: time array creation plus the product
            t0 = time.time()
            # cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
            # cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='C')
            # CUDA needs Fortran order of the array for speed?
            cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
            cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F')
            c = np.dot(cpum2.T, cpum1.T)
            print(time.time() - t0)

            # GPU: time the upload, the product, and the download
            t0 = time.time()
            gpum1 = cm.CUDAMatrix(cpum1)
            gpum2 = cm.CUDAMatrix(cpum2)
            gm = cm.dot(gpum2.T, gpum1.T)
            gm.copy_to_host()
            print(time.time() - t0)
        except Exception as exc:
            # report the failure instead of silently ignoring it
            print(exc)
    # cm.cublas_shutdown()
    print('done')
Here are some tests I ran, but I would like a theoretical explanation.
def test_order(m, n):
    # default order for both arrays
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32)
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32)
    t0 = time.time()
    c = np.dot(a, b)
    print(time.time() - t0)
    # 1: a in C order, b in C order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
    t0 = time.time()
    c = np.dot(a, b)
    print(time.time() - t0)
    # 2: a in C order, b in Fortran order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
    t0 = time.time()
    c = np.dot(a, b)
    print(time.time() - t0)
    # 3: a in Fortran order, b in C order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
    t0 = time.time()
    c = np.dot(a, b)
    print(time.time() - t0)
    # 4: a in Fortran order, b in Fortran order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
    t0 = time.time()
    c = np.dot(a, b)
    print(time.time() - t0)
    print('done')
Results in seconds, printed in the order default, #1, #2, #3, #4:

m = 1024*10
n = 1024*1
7.125
7.14100003242
6.95299983025
8.14100003242
7.15600013733

m = 1024*1
n = 1024*10
0.718999862671
0.734000205994
0.641000032425
0.656000137329
0.655999898911
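Each case above is timed only once, so the numbers are easily disturbed by noise. As a rough sketch (not part of the original tests), the same four order combinations can be run in a loop with a fixed seed, a warm-up call, and the minimum over several repeats. The function name `bench_orders`, the seed, and the matrix sizes are illustrative assumptions.

```python
import itertools
import timeit

import numpy as np


def bench_orders(m, n, repeat=3):
    """Time np.dot(a, b) for every C/F order combination of a (m x n) and b (n x m)."""
    rng = np.random.RandomState(0)  # fixed seed: every run uses the same data
    base_a = (rng.rand(m, n) * 10).astype(np.float32)
    base_b = (rng.rand(n, m) * 10).astype(np.float32)
    results = {}
    for order_a, order_b in itertools.product("CF", repeat=2):
        # np.array copies the data into the requested memory order
        a = np.array(base_a, order=order_a)
        b = np.array(base_b, order=order_b)
        np.dot(a, b)  # warm-up call, not timed
        times = timeit.repeat(lambda: np.dot(a, b), number=1, repeat=repeat)
        results[order_a + order_b] = min(times)  # keep the best of the repeats
    return results


if __name__ == "__main__":
    for orders, seconds in bench_orders(256, 512).items():
        print(orders, "%.6f s" % seconds)
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, because slower runs usually reflect interference from other processes rather than the code under test.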
Here is the code I used to test peak memory usage:
import numpy as np
import time
from memory_profiler import profile


@profile
def test_order_():
    m = 1024*1
    n = 1024*10
    # what is used by default when c = np.dot(a, b)?
    c = np.array(np.zeros((m, m)), dtype=np.float32, order='C')
    # c = np.array(np.zeros((m, m)), dtype=np.float32, order='F')
    # 1: a in C order, b in C order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # only for C-ordered arrays!
    print(time.time() - t0)
    del a
    del b
    # del c
    # 2: a in C order, b in Fortran order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # only for C-ordered arrays!
    print(time.time() - t0)
    del a
    del b
    # del c
    # 3: a in Fortran order, b in C order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # only for C-ordered arrays!
    print(time.time() - t0)
    del a
    del b
    # del c
    # 4: a in Fortran order, b in Fortran order
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # only for C-ordered arrays!
    print(time.time() - t0)
    del a
    del b
    # del c
    print('done')


if __name__ == '__main__':
    test_order_()
I also found some information about numpy.dot copies and fast_dot:
The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order
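To make the terms in that quote concrete, here is a minimal sketch of how to inspect an array's memory order and see when a change of order needs a copy. It only uses small illustrative arrays (the names `a_c`, `a_f`, `a_t` are assumptions) and says nothing about what a particular BLAS build does internally.

```python
import numpy as np

a_c = np.zeros((4, 3), dtype=np.float32, order="C")  # row-major (C order)
a_f = np.asfortranarray(a_c)                         # column-major (Fortran order)

print(a_c.flags["C_CONTIGUOUS"], a_c.flags["F_CONTIGUOUS"])  # True False
print(a_f.flags["C_CONTIGUOUS"], a_f.flags["F_CONTIGUOUS"])  # False True

# Changing the order of this 2-D array allocates a new buffer (a real copy).
print(np.shares_memory(a_c, a_f))  # False

# Transposing only swaps the strides: the result is a Fortran-ordered view
# of the same buffer, so no data is copied.
a_t = a_c.T
print(a_t.flags["F_CONTIGUOUS"], np.shares_memory(a_c, a_t))  # True True

# Strides in bytes for float32 (4 bytes per element).
print(a_c.strides, a_f.strides, a_t.strides)  # (12, 4) (4, 16) (4, 12)
```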
There are also some performance tips. Strangely, I cannot reproduce the results every time I run the examples. (Perhaps some data stays cached between runs?)
Best answer
Performance depends on the underlying linear algebra library you have.
# ORDER C-C
In [6]: %timeit a.dot(b)
10 loops, best of 3: 87.6 ms per loop
# ORDER C-F
In [8]: %timeit a.dot(b)
10 loops, best of 3: 87.8 ms per loop
# ORDER F-C
In [10]: %timeit a.dot(b)
10 loops, best of 3: 90.1 ms per loop
# ORDER F-F
In [12]: %timeit a.dot(b)
10 loops, best of 3: 90 ms per loop
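I am using ATLAS compiled natively with SSE3, as shown by np.show_config(). Rerunning the computation shows no statistically significant difference between the orders. In fact, there is no difference, because the library copies the arrays before performing the product. That copy takes 650 µs (including Python overhead), which is well below your timings. As the matrices grow, the dot product dominates and you cannot see the effect of the copy. If you use smaller matrices, the Python overhead masks any effect.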
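To get a feel for the relative magnitudes, a rough sketch like the following compares the cost of an explicit order-changing copy with the cost of the product itself. The 512 x 512 size and the helper `best` are illustrative assumptions, and the snippet does not detect whether your BLAS library copies internally; that depends on the NumPy version and the BLAS build.

```python
import timeit

import numpy as np

rng = np.random.RandomState(0)
# Two Fortran-ordered float32 matrices (size chosen only for illustration).
a = np.asfortranarray(rng.rand(512, 512).astype(np.float32))
b = np.asfortranarray(rng.rand(512, 512).astype(np.float32))


def best(func, repeat=5):
    """Return the fastest of `repeat` single executions of func, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


# Explicit conversion from Fortran order to C order (forces a copy).
copy_time = best(lambda: np.ascontiguousarray(a))
# The matrix product itself.
dot_time = best(lambda: a.dot(b))

print("copy: %.6f s" % copy_time)
print("dot:  %.6f s" % dot_time)
```

If the copy time is a small fraction of the product time on your machine, the memory order will have little visible effect on the total, which matches the reasoning above.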
If you monitor memory usage while working with very large arrays, you can see the copies actually being made.
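One way to monitor this without extra packages is the standard-library tracemalloc module, used here in place of memory_profiler. This is a sketch under the assumption that your NumPy version reports its buffer allocations to tracemalloc (recent releases do, as far as I know); buffers allocated inside the BLAS library itself would not be visible this way. The helper `peak_bytes` and the array sizes are illustrative.

```python
import tracemalloc

import numpy as np


def peak_bytes(func):
    """Run func and return the peak traced memory in bytes while it ran."""
    tracemalloc.start()
    try:
        result = func()  # keep the result alive so it counts toward the peak
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak


# Two Fortran-ordered float32 matrices of 4 MiB each.
a = np.ones((1024, 1024), dtype=np.float32, order="F")
b = np.ones((1024, 1024), dtype=np.float32, order="F")

# An explicit order change must allocate a full new buffer.
copy_peak = peak_bytes(lambda: np.ascontiguousarray(a))
# The product allocates at least the result array; any extra memory beyond
# that would point to temporary copies made on the NumPy side.
dot_peak = peak_bytes(lambda: np.dot(a, b))

print("array size:", a.nbytes)
print("peak during explicit copy:", copy_peak)
print("peak during dot:", dot_peak)
```

Comparing `dot_peak` with the size of the result array gives a hint about whether additional temporary arrays were created during the product.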
Source: the Stack Overflow question "python - How does the order of a NumPy array affect multiplication speed?": https://stackoverflow.com/questions/24016207/