CUDA matrix multiplication writes to the wrong memory location

Tags: c matrix cuda matrix-multiplication

The idea behind the simple program I have been trying to write is to take input from the user specifying how large the matrices to be multiplied are.

dd@cuda-Linux:~/Desktop/multi$ ./program
What is the rowSize of a? 33
What is the colSize of a? 33
What is the rowSize of b? 33
What is the colSize of b? 33
Would you like to write the results to a file?(y or n)
y
Creating the random numbers now
Writing Matrix A to file now...
Writing Matrix B to file now...
Starting it on the device
Writing Matrix C to file now...
Finish

The problem, however, is in my thread calculations. With a 32x32 matrix the program runs fine and gives the correct results. However, as soon as I run a 33x33, I get results like the following:
[Matrix A] x [Matrix B] = [Matrix C] (linked rather than pasting several huge matrices into this post). With matrix C you can see that halfway through it starts writing out the wrong numbers. My graphics card has a limit of 1024 threads per block, which corresponds to a 32x32 matrix. Also, when I go to run a 100x100 matrix, matrix C is all zeros.
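(As a quick sanity check on that thread limit, the CUDA runtime can report the per-block maximum directly. A minimal sketch, compiled with nvcc; querying device 0 is an assumption:)

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0); // query the first CUDA device
        // For a square block, the side length is bounded by
        // sqrt(maxThreadsPerBlock), e.g. 32 when the limit is 1024.
        printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
        return 0;
    }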

Let mem_size_X be sizeof(float) * size_X, where size_X is the matrix's height * width. For now the height and width have to be the same, e.g. 32x32. Also, "block_size" is just the height, so for a 32x32 matrix the block size corresponds to 32.
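(For concreteness, the bookkeeping above looks roughly like this; the Matrix struct fields are my assumption, since the post does not show the type definition:)

    #include <stdio.h>

    typedef struct {
        int rowSize;
        int colSize;
        float *elements;
    } Matrix;

    int main()
    {
        Matrix a = { 32, 32, NULL };               // a square 32x32 matrix
        int size_A     = a.rowSize * a.colSize;    // height * width
        int mem_size_A = sizeof(float) * size_A;   // bytes to allocate
        int block_size = a.rowSize;                // "block_size" is just the height
        printf("size_A=%d mem_size_A=%d block_size=%d\n", size_A, mem_size_A, block_size);
        return 0;
    }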
Host code (the launch):

    float* deviceMatrixA;
    float* deviceMatrixB;
    cudaMalloc((void**) &deviceMatrixA, mem_size_A); // allocate mem_size_A bytes on the device
    cudaMalloc((void**) &deviceMatrixB, mem_size_B);


    cudaMemcpy(deviceMatrixA, a.elements, mem_size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceMatrixB, b.elements, mem_size_B, cudaMemcpyHostToDevice);



    int size_C = c.rowSize * c.colSize;
    int mem_size_C = sizeof(float) * size_C;
    c.elements = (float*) malloc(mem_size_C);


    float* deviceMatrixC;
    cudaMalloc((void**) &deviceMatrixC, mem_size_C);


    dim3 threads(block_size, block_size);
    dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);



    matrixMul<<< grid, threads, 2*block_size*block_size*sizeof(float) >>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize, block_size); // dynamic shared memory for two block_size x block_size tiles
    cudaThreadSynchronize();

Kernel code:

// CUDA Kernel
__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int aBegin = wA * block_size * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = block_size;

    int bBegin = block_size * bx;

    int bStep  = block_size * wB;
    float Csub=0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        extern __shared__ float As[];
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        smem[ty*block_size+tx] = A[a + wA * ty + tx];

        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];

        __syncthreads();

        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;

        __syncthreads();
    }

    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

Thanks

Best Answer

As I told you on your earlier, almost identical question, this matrix multiplication code is only designed to work on matrices whose dimensions are a round multiple of block_size. If you choose block_size=32, then it can only be used correctly for 32x32, 64x64, 96x96, 128x128, and so on. Nothing you have done with dynamically allocated shared memory changes this.
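(Incidentally, the three extern __shared__ declarations in your kernel, As, Bs and smem, all alias the same allocation: in CUDA, every extern __shared__ array in a kernel is just another name for the single buffer whose size is given by the third launch parameter. Your kernel only ever touches smem, which is why it runs at all. A minimal sketch of the aliasing:)

    __global__ void aliasDemo(float *out)
    {
        extern __shared__ float X[];
        extern __shared__ float Y[];
        // X and Y both point at the start of the dynamic shared memory
        // buffer, so a store through one is visible through the other.
        X[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = Y[threadIdx.x]; // reads back threadIdx.x
    }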

To verify that this is the case, let's start with a complete, compilable repro case which will run your kernel, check whether it executed, and compare its output against a simple reference calculation done on the host. This code is your posted kernel, plus the core of your launch parameter calculations. It takes a size from the command line and then runs that case. If the results differ by more than a certain tolerance, an assertion error is raised. Here is the code; it should compile on CUDA 3.0 or later and run on any CUDA compatible GPU:

#include <assert.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime> // for time(), used to seed rand()

inline void GPUassert(cudaError_t code, const char * file, int line, bool Abort=true)
{
    if (code != 0) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file,line);
        if (Abort) exit(code);
    }       
}

#define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int aBegin = wA * block_size * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = block_size;
    int bBegin = block_size * bx;
    int bStep  = block_size * wB;

    float Csub=0.f;
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        extern __shared__ float smem[];

        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];

        __syncthreads();

        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;

        __syncthreads();
    }

    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;
}

inline float frand(){
    return (float)rand()/(float)RAND_MAX;
}

void matmul(float *C, const float *A, const float *B,  int wA, int wB)
{
    for(int k=0; k<wB; k++) {
        for(int j=0; j<wB; j++) {
            float dotp = 0.f;
            for(int i=0; i<wA; i++) {
                dotp += A[j*wA+i] * B[i*wB+k];
            }
            C[j*wB+k] = dotp;
        }
    }
}

int main(int argc, char ** argv)
{
    int val = 128;

    if ( argc == 2 ) {
        val = atoi(argv[1]);
    }

    int m = val, n = val, mn = m*n;
    size_t sz = size_t(mn) * sizeof(float);

    srand(time(NULL));

    float * A = new float[mn], * B = new float[mn], * C= new float[mn];
    float * A_, * B_, * C_;

    for(int i=0; i<mn; i++) {
        A[i] = frand(); B[i] = frand();
    }

    GPUerrchk( cudaMalloc((void **)&A_, sz) );
    GPUerrchk( cudaMalloc((void **)&B_, sz) );
    GPUerrchk( cudaMalloc((void **)&C_, sz) );

    GPUerrchk( cudaMemcpy(A_, A, sz, cudaMemcpyHostToDevice) );
    GPUerrchk( cudaMemcpy(B_, B, sz, cudaMemcpyHostToDevice) );

    // Launch configuration
    // Note that the input matrix sizes *must* be a round
    // multiple of blocksize for this code to work correctly.
    const int blocksize=16;
    const int shmsz = size_t(2*blocksize*blocksize) * sizeof(float);
    dim3 block=dim3(blocksize,blocksize), grid = dim3(m/block.x,m/block.y); 

    matrixMul<<<grid,block,shmsz>>>(C_,A_,B_,m,n,blocksize);
    GPUerrchk( cudaPeekAtLastError() );

    GPUerrchk( cudaMemcpy(C, C_, sz, cudaMemcpyDeviceToHost) );

    // Verification on host
    float * Cref = new float[mn];
    matmul(Cref,A,B,m,n); 
    const float tol = 5e-5f;
    for(int i=0; i<mn; i++) {
        assert(fabs(C[i]-Cref[i])/C[i] < tol);
    }

    GPUerrchk( cudaThreadExit() ); // CUDA 3.2 compatible

    return 0;
}

Now, let's run this code for different sizes. To verify that the code on the GPU is not doing anything wrong, I will run it under the cuda-memcheck utility, which can detect out-of-bounds memory accesses. All the following tests were done on an OS X 10.6 machine with a compute capability 1.2 card and CUDA 3.2, using blocksize=16:

$ nvcc -arch=sm_12 -Xcompiler="-Wall" -Xptxas="-v" -o matmul2 matmul2.cu
ptxas info    : Compiling entry function '_Z9matrixMulPfS_S_iim' for 'sm_12'
ptxas info    : Used 16 registers, 32+16 bytes smem, 4 bytes cmem[1]

Let's first try a case where the matrix is smaller than blocksize:

$ cuda-memcheck ./matmul2 4
========= CUDA-MEMCHECK
GPUassert: invalid configuration argument matmul2.cu 101
========= ERROR SUMMARY: 0 errors

Here the kernel failed to launch with an invalid configuration argument error. Why? Because of this:

    dim3 block=dim3(blocksize,blocksize), grid = dim3(m/block.x,m/block.y); 

which yields a grid dimension of 0 whenever m, n < blocksize, because the integer division truncates toward zero.
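(You can see the truncation, and the usual ceiling-division remedy, in a tiny host-only sketch; note that the posted kernel would also need per-thread bounds guards before a rounded-up grid could actually be used:)

    #include <cstdio>

    int main()
    {
        const int blocksize = 16;
        const int sizes[] = { 4, 17, 33 };
        for (int i = 0; i < 3; i++) {
            int m = sizes[i];
            int trunc = m / blocksize;                    // what the code does
            int ceil_ = (m + blocksize - 1) / blocksize;  // rounding up instead
            printf("m=%2d truncating grid=%d ceiling grid=%d\n", m, trunc, ceil_);
        }
        return 0;
    }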

Next let's try the smallest round multiple of the block size, in this case 16:

$ cuda-memcheck ./matmul2 16
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

That runs without error and without any assertion failure. Now let's increase the size to 17:

$ cuda-memcheck ./matmul2 17
========= CUDA-MEMCHECK
GPUassert: unspecified launch failure matmul2.cu 103
========= Invalid __global__ read of size 4
=========     at 0x000001f8 in matrixMul
=========     by thread (0,2,0) in block (0,0)
=========     Address 0x001009c8 is out of bounds
=========
========= ERROR SUMMARY: 1 error

We get the expected out-of-bounds memory access detection and launch failure error. Now let's try 64, 96 and 128:

$ cuda-memcheck ./matmul2 64
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

$ cuda-memcheck ./matmul2 96
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

$ cuda-memcheck ./matmul2 128
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

Finally, let's try 129:

$ cuda-memcheck ./matmul2 129
========= CUDA-MEMCHECK
GPUassert: unspecified launch failure matmul2.cu 103
========= Invalid __global__ read of size 4
=========     at 0x000001f8 in matrixMul
=========     by thread (0,1,0) in block (0,0)
=========     Address 0x00120904 is out of bounds
=========
========= ERROR SUMMARY: 1 error

Even if you don't understand why the out-of-bounds errors occur, are you at least willing to accept that this code really does only work correctly for matrices whose dimensions are a round multiple of the block size?
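(For completeness: if you wanted to keep this kernel unchanged and still handle arbitrary sizes, one standard approach is to zero-pad both inputs up to the next round multiple of the block size; with zero padding, the top-left n x n corner of the padded product equals the true product. A host-side sketch, using my own hypothetical helper name padMatrix:)

    #include <cstdlib>
    #include <cstring>

    // Copy an n x n matrix into a zeroed np x np buffer, where np is n
    // rounded up to a multiple of block_size. The unmodified kernel can
    // then be launched on the np x np matrices, and only the first n
    // entries of each of the first n rows of the result are copied back.
    float * padMatrix(const float * M, int n, int block_size, int * np_out)
    {
        int np = ((n + block_size - 1) / block_size) * block_size;
        float * P = (float *)calloc((size_t)np * np, sizeof(float));
        for (int row = 0; row < n; row++)
            memcpy(&P[row * np], &M[row * n], n * sizeof(float));
        *np_out = np;
        return P;
    }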
