python - CUDA error when using a large array as input data

Tags: python cuda numba numba-pro

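I have code that runs some computations on the GPU using numba and CUDA 8.0 under Python 3.5. With an input array of size (50, 27) it runs successfully and gives the correct result. When I changed the input data to size (200, 340), I got an error.

I use shared memory in the code. Is there not enough shared memory? Or are the grid size and block size badly chosen? I don't know how to tell which it is, or how to choose appropriate sizes for the grid and block.

I tried a small grid size and block size, and the error is the same.

What should I do to solve this problem? Thanks for any suggestions.

I simplified my code, and it has the same error. The input size is easy to set here: df = np.random.random_sample((300, 200)) + 10

Code: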

import os,sys,time,math
import pandas as pd
import numpy as np

from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = 8
tpb = (4,32) 

tsize = (3,4) 
hsize = (1,4)

@cuda.jit
def calcu_T(D, T):

    gw = cuda.gridDim.x
    bx = cuda.blockIdx.x
    tx = cuda.threadIdx.x
    bw = cuda.blockDim.x
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y

    c_num = D.shape[1]
    c_index = bx

    while c_index<c_num*c_num:
        c_x = int(c_index/c_num)
        c_y = c_index%c_num

        if c_x==c_y:
            T[c_x,c_y] = 0.0
        else:
            X = D[:,c_x]
            Y = D[:,c_y]

            hbuf = cuda.shared.array(hsize, float32)

            h = tx

            Xi = X[h:]
            Xi1 = X[:-h]
            Yih = Y[:-h]

            sbuf = cuda.shared.array(tsize, float32)

            L = len(Xi)

            #mean
            if ty==0:
                Xi_m = 0.0
                Xi1_m = 0.0
                Yih_m = 0.0
                for i in range(L):
                    Xi_m += Xi[i]
                    Xi1_m += Xi1[i]
                    Yih_m += Yih[i]
                Xi_m = Xi_m/L
                Xi1_m = Xi1_m/L
                Yih_m = Yih_m/L
                sbuf[0,tx] = Xi_m
                sbuf[1,tx] = Xi1_m
                sbuf[2,tx] = Yih_m

            cuda.syncthreads()

            sl = cuda.shared.array(tpb, float32)

            r_index = ty
            s_l = 0.0
            while r_index<L:
                s1 = 0.0
                for i in range(L):
                    s1 += (Xi[r_index]+Xi1[i])/sbuf[0,tx]

                s_l += s1
                r_index +=bh
            sl[tx,ty] = s_l
            cuda.syncthreads()

            #
            if ty==0:
                ht = 0.0
                for i in range(bh):
                    ht += sl[tx,i]
                hbuf[0,tx] = ht/L
            cuda.syncthreads()

            #max
            if tx==0 and ty==0:
                m_t = 0.0
                for index,ele in enumerate(hbuf[0]):
                    if index==0:
                        m_t = ele
                    elif ele>m_t:
                        m_t = ele

                T[c_x,c_y] = m_t

        c_index +=gw



df = np.random.random_sample((300, 200)) + 10
D = np.array(df, dtype=np.float32)
r,c = D.shape

T = np.empty([c,c])

dD = cuda.to_device(D)
dT = cuda.device_array_like(T)

calcu_T[bpg, tpb](dD,dT)
dT.copy_to_host(T)

Error:

Traceback (most recent call last):
      File "G:\myworkspace\python3.5\forte\forte170327\test10fortest8.py", line 118, in <module>
        dT.copy_to_host(T)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\devicearray.py", line 198, in copy_to_host
        _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 1481, in device_to_host
        fn(host_pointer(dst), device_pointer(src), size, *varargs)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 259, in safe_cuda_api_call
        self._check_error(fname, retcode)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 296, in _check_error
        raise CudaAPIError(retcode, msg)
    numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR

My device information:

Device 0: 
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes

Best answer

There is nothing wrong with your code. If I run it on my GTX970, I get:

In [11]: main??
Signature: main()
Source:   
def main():

    df = np.random.random_sample((300, 200)) + 10
    D = np.array(df, dtype=np.float32)
    r,c = D.shape

    T = np.empty([c,c])

    dD = cuda.to_device(D)
    dT = cuda.device_array_like(T)

    calcu_T[bpg, tpb](dD,dT)
    dT.copy_to_host(T)
File:      ~/SO/crash.py
Type:      function

In [12]: %timeit -n 3 -r 3 main()
3 loops, best of 3: 6.61 s per loop

That is, there is no runtime error, but the Python code, including the kernel, takes 6.6 seconds to run. If I profile the code with the CUDA profiler:

$ nvprof python crash.py

==13828== NVPROF is profiling process 13828, command: python crash.py
All finished
==13828== Profiling application: python crash.py
==13828== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  6.59109s         1  6.59109s  6.59109s  6.59109s  cudapy::__main__::calcu_T$241(Array<float, int=2, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>)
  0.00%  26.271us         1  26.271us  26.271us  26.271us  [CUDA memcpy DtoH]
  0.00%  21.279us         1  21.279us  21.279us  21.279us  [CUDA memcpy HtoD]

==13828== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.51%  6.59118s         1  6.59118s  6.59118s  6.59118s  cuMemcpyDtoH
  1.42%  94.890ms         1  94.890ms  94.890ms  94.890ms  cuDevicePrimaryCtxRetain
  0.05%  3.4116ms         1  3.4116ms  3.4116ms  3.4116ms  cuModuleLoadDataEx
  0.01%  417.96us         1  417.96us  417.96us  417.96us  cuLinkCreate
  0.00%  227.57us         1  227.57us  227.57us  227.57us  cuLinkAddData
  0.00%  195.72us         2  97.859us  95.710us  100.01us  cuMemAlloc
  0.00%  190.10us         1  190.10us  190.10us  190.10us  cuLinkComplete
  0.00%  139.04us         1  139.04us  139.04us  139.04us  cuMemGetInfo
  0.00%  53.193us         1  53.193us  53.193us  53.193us  cuMemcpyHtoD
  0.00%  29.538us         1  29.538us  29.538us  29.538us  cuDeviceGetName
  0.00%  17.895us         1  17.895us  17.895us  17.895us  cuLaunchKernel
  0.00%  2.0250us         1  2.0250us  2.0250us  2.0250us  cuCtxPushCurrent
  0.00%  2.0150us         5     403ns     255ns     752ns  cuFuncGetAttribute
  0.00%  1.6260us         2     813ns     547ns  1.0790us  cuDeviceGetCount
  0.00%  1.1430us         1  1.1430us  1.1430us  1.1430us  cuModuleGetFunction
  0.00%     951ns         2     475ns     372ns     579ns  cuDeviceGet
  0.00%     796ns         1     796ns     796ns     796ns  cuLinkDestroy
  0.00%     787ns         1     787ns     787ns     787ns  cuDeviceComputeCapability

You can see that the kernel you posted takes about 6.6 seconds to run.
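
As a quick check on the question's worry about shared memory and block size, the arithmetic below adds up what the posted kernel asks for per block and compares it with the limits in the posted device information. It is only a sketch: the shapes and limits are copied from the question, all three shared arrays are float32 (4 bytes each), and the helper name `shared_bytes` is made up for illustration.

```python
from math import prod

BYTES_PER_FLOAT32 = 4

# Shapes copied from the question's code.
tpb = (4, 32)    # threads per block, also the shape of the shared array sl
tsize = (3, 4)   # shape of the shared array sbuf
hsize = (1, 4)   # shape of the shared array hbuf

# Limits copied from the device information posted in the question.
MAX_SHARED_BYTES_PER_BLOCK = 49152
MAX_THREADS_PER_BLOCK = 1024


def shared_bytes(*shapes):
    """Total bytes used by float32 shared arrays of the given shapes."""
    return sum(prod(shape) for shape in shapes) * BYTES_PER_FLOAT32


threads_per_block = prod(tpb)
shared_used = shared_bytes(tsize, hsize, tpb)  # sbuf, hbuf, sl

print(f"threads per block: {threads_per_block} of {MAX_THREADS_PER_BLOCK}")
print(f"shared memory:     {shared_used} of {MAX_SHARED_BYTES_PER_BLOCK} bytes")
```

Both figures are far below the device limits, and neither depends on the size of the input array, which is consistent with the launch configuration not being the cause of the error.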

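You have not provided any details, but I would guess that you are running on Windows, your GPU is a display GPU, and your code runs slowly enough to hit the WDDM display manager watchdog timeout limit. This is very well documented and has been asked about hundreds of times before, for example here.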
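
A rough operation count suggests why the small input finishes and the large one does not. The sketch below counts iterations of the kernel's innermost loop (the `for i in range(L)` nested inside `while r_index<L`), read off the posted code. It assumes that run time scales roughly with this count and ignores memory effects; the function name `inner_iterations` is made up for illustration.

```python
def inner_iterations(rows, cols, lags=4):
    """Count innermost-loop iterations of calcu_T for a (rows, cols) input.

    lags is blockDim.x in the question (tpb[0] == 4). Thread column tx works
    on slices of length L = rows - tx, and the nested loops over r_index and
    i run L * L times in total for that tx.
    """
    off_diagonal_pairs = cols * (cols - 1)   # the c_x == c_y case is skipped
    per_pair = sum((rows - h) ** 2 for h in range(lags))
    return off_diagonal_pairs * per_pair


small = inner_iterations(50, 27)     # input size that worked
large = inner_iterations(300, 200)   # input size from the simplified example
print(f"small: {small:.3g}  large: {large:.3g}  ratio: {large / small:.0f}x")
```

Under these assumptions the (300, 200) input does roughly two thousand times more work than the (50, 27) input, all inside a single kernel launch.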

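Your search engine of choice and the CUDA Windows getting started guide will give you information about alternatives for improving the situation from the operating system and hardware side. However, the most obvious one is to improve your code so that it runs faster.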
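
One option not spelled out above, offered only as a sketch, is to split the work into several shorter kernel launches so that no single launch runs long enough to trip the watchdog. The helper `chunk_ranges`, the constant `PAIRS_PER_LAUNCH`, and the extra `start`/`stop` kernel arguments are all hypothetical; the posted `calcu_T` would have to be changed to accept them.

```python
def chunk_ranges(total, chunk):
    """Yield (start, stop) pairs covering range(total) in pieces of at most chunk."""
    for start in range(0, total, chunk):
        yield start, min(start + chunk, total)


c = 200                    # number of columns in the question's example
pairs = c * c              # column pairs the kernel loops over via c_index
PAIRS_PER_LAUNCH = 2000    # assumed value; tune so one launch stays well under the timeout

launches = list(chunk_ranges(pairs, PAIRS_PER_LAUNCH))

# Hypothetical launch loop. It assumes calcu_T is modified to take start and
# stop, and to iterate c_index from start + blockIdx.x up to stop:
#
# for start, stop in launches:
#     calcu_T[bpg, tpb](dD, dT, start, stop)
#     cuda.synchronize()   # wait for this launch before queuing the next
print(len(launches), launches[0], launches[-1])
```

This does not make the computation faster; it only bounds the duration of each launch. Making the kernel itself faster remains the better fix.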

Regarding python - CUDA error when using a large array as input data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43366740/
