python - How does the order of a numpy array affect multiplication speed?

Tags: python arrays numpy cuda matrix-multiplication

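How does the memory order (C or Fortran) of a numpy array affect multiplication speed? How can I choose it automatically based on the size of the matrices?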

The question originally came from code that uses cudamat:

import time

import numpy as np
import cudamat as cm

def test_mat():
    #need to init cublas?
    # cm.cublas_init()

    n = 1024

    for i in range(1, 20):  # 2^15 max or python fails
        m = 2**i
        # print(m)
        print(i)
        try:
            t0= time.time()
            # cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
            # cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='C')
            #CUDA need fortran order of array for speed?
            cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
            cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F')
            c = np.dot(cpum2.T, cpum1.T)
            print (time.time()-t0)

            t0= time.time()
            gpum1 = cm.CUDAMatrix(cpum1)
            gpum2 = cm.CUDAMatrix(cpum2)
            gm = cm.dot(gpum2.T, gpum1.T)
            gm.copy_to_host()
            print (time.time()-t0)
        except Exception as exc:
            # Report the failure instead of silently ignoring it.
            print(exc)

    # cm.cublas_shutdown()

    print('done')

Here are some tests I have run, but I would like some theoretical insight.

def test_order(m,n):            
    #default
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32)
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32)

    t0= time.time()
    c = np.dot(a,b)
    print (time.time()-t0)

    #1
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0= time.time()
    c = np.dot(a,b)
    print (time.time()-t0)

    #2
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0= time.time()
    c = np.dot(a,b)
    print (time.time()-t0)

    #3
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0= time.time()
    c = np.dot(a,b)
    print (time.time()-t0)

    #4
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0= time.time()
    c = np.dot(a,b)
    print (time.time()-t0)


    print('done')

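Results (seconds per np.dot call, in the order printed by test_order):

m = 1024*10, n = 1024*1
default     7.125
#1 (C, C)   7.14100003242
#2 (C, F)   6.95299983025
#3 (F, C)   8.14100003242
#4 (F, F)   7.15600013733

m = 1024*1, n = 1024*10
default     0.718999862671
#1 (C, C)   0.734000205994
#2 (C, F)   0.641000032425
#3 (F, C)   0.656000137329
#4 (F, F)   0.655999898911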
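
The four-layout comparison above can also be written more compactly. What follows is a minimal sketch rather than the original test: the helper name `time_dot_orders` and the best-of-`repeat` measurement with `timeit` are assumptions introduced for illustration, and the matrices are generated once so that every layout multiplies the same values.

```python
import itertools
import timeit

import numpy as np


def time_dot_orders(m, n, repeat=3):
    """Time np.dot(a, b) for every C/F layout pair (hypothetical helper)."""
    rng = np.random.default_rng(0)
    base_a = (rng.random((m, n)) * 10).astype(np.float32)
    base_b = (rng.random((n, m)) * 10).astype(np.float32)

    timings = {}
    for order_a, order_b in itertools.product("CF", repeat=2):
        # np.array(..., order=...) copies the data into the requested layout.
        a = np.array(base_a, order=order_a)
        b = np.array(base_b, order=order_b)
        # Keep the best of `repeat` runs to reduce noise from other processes.
        best = min(timeit.repeat(lambda: np.dot(a, b), number=1, repeat=repeat))
        timings[order_a + "-" + order_b] = best
    return timings


if __name__ == "__main__":
    for label, seconds in time_dot_orders(256, 512).items():
        print(label, seconds)
```

Taking the minimum over several runs makes differences between layouts easier to separate from run-to-run noise than a single `time.time()` measurement does.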

Here is the code for testing peak memory usage:

import numpy as np
import time
from memory_profiler import profile

@profile    
def test_order_():

    m= 1024*1
    n= 1024*10

    #what used by default when c= np.dot(a,b)
    c = np.array(np.zeros((m, m)), dtype=np.float32, order='C')
    #c = np.array(np.zeros((m, m)), dtype=np.float32, order='F')

    #1
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0= time.time()
    c[:]= np.dot(a,b)
    # np.dot(a,b,out= c) # only for C-Array !
    print (time.time()-t0)

    del a
    del b
    # del c

    #2
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0= time.time()
    c[:]= np.dot(a,b)
    # np.dot(a,b,out= c) # only for C-Array !
    print (time.time()-t0)

    del a
    del b
    # del c

    #3
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0= time.time()
    c[:]= np.dot(a,b)
    # np.dot(a,b,out= c) # only for C-Array !
    print (time.time()-t0)

    del a
    del b
    # del c

    #4
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0= time.time()
    c[:]= np.dot(a,b)
    # np.dot(a,b,out= c) # only for C-Array !
    print (time.time()-t0)

    del a
    del b
    # del c

    print('done')

if __name__ == '__main__':
    test_order_()

I also found some information about numpy.dot copies and fast_dot:

The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order
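
To make the terms in that quote concrete, here is a small sketch of how to inspect an array's layout and tell a view from a copy. It uses only standard NumPy attributes and functions; the array shape is arbitrary and chosen for illustration.

```python
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)  # C order by default

# Transposing only swaps the strides: no data is copied, and the
# transpose of a C-ordered array is Fortran-ordered.
a_t = a.T
print(a.flags["C_CONTIGUOUS"], a_t.flags["F_CONTIGUOUS"])
print(np.shares_memory(a, a_t))

# Asking for Fortran order of a C-ordered 2-D array requires a real copy.
a_f = np.asfortranarray(a)
print(a_f.flags["F_CONTIGUOUS"], np.shares_memory(a, a_f))

# Same values, different memory layout.
print(a.strides, a_f.strides)
```

A copy of this kind is the sort of extra work the quote refers to when an input is not already in the layout a BLAS routine expects.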

There are also some performance tips. Strangely, I cannot reproduce the same results every time I run the examples. (Perhaps some data is cached between runs?)

Best answer

Performance depends on the underlying linear algebra library you have.

# ORDER C-C    
In [6]: %timeit a.dot(b)
10 loops, best of 3: 87.6 ms per loop

# ORDER C-F
In [8]: %timeit a.dot(b)
10 loops, best of 3: 87.8 ms per loop

# ORDER F-C
In [10]: %timeit a.dot(b)
10 loops, best of 3: 90.1 ms per loop

# ORDER F-F
In [12]: %timeit a.dot(b)
10 loops, best of 3: 90 ms per loop

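I am using ATLAS compiled natively with SSE3, as shown by np.show_config(). Rerunning the computation shows no statistically significant difference between the four cases. In fact, there is no real difference, because the library copies the arrays before performing the product. That copy takes 650 µs (including Python overhead), which is far below your timings. As the matrices grow, the dot product dominates and you cannot see the effect of the copy. If you use smaller matrices, the Python overhead masks any effect.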
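
To compare the cost of a copy with the cost of the product on your own machine, a rough sketch is to time a layout conversion and the product separately. The helper name `copy_vs_dot` and the matrix size are assumptions; `np.asfortranarray` is used only as a stand-in for whatever copy the BLAS wrapper may make internally, which this code does not observe directly.

```python
import timeit

import numpy as np


def copy_vs_dot(m, n, repeat=5):
    """Return (best copy time, best dot time) in seconds (hypothetical helper)."""
    rng = np.random.default_rng(0)
    a = rng.random((m, n), dtype=np.float32)
    b = rng.random((n, m), dtype=np.float32)

    # Converting a C-ordered matrix to Fortran order forces a full copy.
    copy_time = min(
        timeit.repeat(lambda: np.asfortranarray(a), number=1, repeat=repeat)
    )
    dot_time = min(timeit.repeat(lambda: a.dot(b), number=1, repeat=repeat))
    return copy_time, dot_time


if __name__ == "__main__":
    np.show_config()  # reports which BLAS/LAPACK build NumPy is linked against
    copy_time, dot_time = copy_vs_dot(1024, 1024)
    print("copy: %.6f s, dot: %.6f s" % (copy_time, dot_time))
```

The absolute numbers will differ from those quoted above, since they depend on the hardware and on the BLAS library reported by `np.show_config()`.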

If you monitor memory usage while working with very large arrays, you can see the copies actually being made.
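
One way to watch for such copies without an external profiler is the standard-library `tracemalloc` module. This is a minimal sketch, assuming a NumPy release that reports its array allocations to `tracemalloc`; the helper name `peak_bytes_during_dot` is hypothetical. Whether any temporary shows up depends on the NumPy version and the BLAS library in use.

```python
import tracemalloc

import numpy as np


def peak_bytes_during_dot(a, b):
    """Return (peak traced bytes, result size in bytes) for one np.dot call."""
    tracemalloc.start()
    try:
        c = np.dot(a, b)
        # get_traced_memory() returns (current, peak) since tracing started.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, c.nbytes


if __name__ == "__main__":
    a = np.ones((1024, 2048), dtype=np.float32, order="F")
    b = np.ones((2048, 1024), dtype=np.float32, order="C")
    peak, result_bytes = peak_bytes_during_dot(a, b)
    print("peak traced bytes:", peak, "result bytes:", result_bytes)
```

If the reported peak is noticeably larger than the size of the result array, some temporary array was allocated during the call.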

Regarding "python - How does the order of a numpy array affect multiplication speed?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24016207/
