CUDA matrix multiplication

Tags: cuda matrix-multiplication

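I'm trying to write matrix multiplication code in CUDA that closely follows the example in Nvidia's CUDA programming guide, but it doesn't work. It is supposed to compute C = alpha*A*B + beta*C, but for every A and B, C remains unchanged.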

__global__ void MatMulKernel(int m,int n,int k,double *A,double *B,double *C,double alpha,double beta)
{
    double Ctemp = 0.0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int ind;
    for (ind = 0; ind < k; ++ind)
    {
       Ctemp += A[row+ind*m]*B[ind+col*k];
    }

   C[row+m*col] = alpha*Ctemp+beta*C[row+m*col];
//C[row+m*col] = Ctemp;
   __syncthreads();
}

extern "C" void
local_mm_cuda (const int m, const int n, const int k, const double alpha,
  const double *A, const int lda, const double *B, const int ldb,
  const double beta, double *C, const int ldc)
{

 int row, col;

  /* Verify the sizes of lda, ldb, and ldc */
  assert (lda >= m);
  assert (ldb >= k);
  assert (ldc >= m);

  // allocating memory for device array
  double *dA,*dB,*dC;
  size_t sizeA = sizeof(double)*m*k;
  size_t sizeB = sizeof(double)*n*k;
  size_t sizeC = sizeof(double)*m*n;

  cudaMalloc((void**)&dA,sizeA);
  cudaMalloc((void**)&dB,sizeB);
  cudaMalloc((void**)&dC,sizeC);

  cudaMemcpy(dA, A, sizeA, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
  cudaMemcpy(dC, C, sizeC, cudaMemcpyHostToDevice);

  // calling matrix multiplication kernel
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid( n/dimBlock.x, m/dimBlock.y);
  MatMulKernel<<<dimGrid, dimBlock>>>(m,n,k,dA,dB,dC,alpha,beta);
  cudaThreadSynchronize();

  // saving C calculated back in C
  cudaMemcpy(dC,C, sizeC,cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
}

Best answer

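Try changing

"dim3 dimGrid( n/dimBlock.x, m/dimBlock.y);"

to

"dim3 dimGrid( (n+dimBlock.x-1)/dimBlock.x, (m+dimBlock.y-1)/dimBlock.y); "

Integer division truncates, so when n or m is not a multiple of the block size the original grid leaves the last rows or columns of C uncomputed, and when n or m is smaller than the block size the grid dimension becomes 0 and the launch fails. Rounding up launches some threads past the matrix edge, so the kernel also needs a check such as row < m && col < n before it reads or writes memory.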

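A minimal sketch of how the rounded-up grid might be combined with a bounds check, assuming lda == m, ldb == k and ldc == m as the question's indexing does. The names ceil_div, kBlockSize, MatMulKernelGuarded and gemm_guarded are hypothetical, and the block size of 16 is an assumption because the question does not show its BLOCK_SIZE. The sketch also passes the host pointer C as the destination of the final cudaMemcpy, since cudaMemcpy takes the destination first; the question's last copy names dC as the destination, so the host C would not be expected to receive the result.

```cuda
#include <cuda_runtime.h>

// Hypothetical tile width; the question's BLOCK_SIZE value is not shown.
constexpr int kBlockSize = 16;

// Integer ceiling division: number of blocks needed to cover `extent` threads.
inline int ceil_div(int extent, int block) { return (extent + block - 1) / block; }

// Column-major C = alpha*A*B + beta*C with A (m x k), B (k x n), C (m x n).
// Assumes lda == m, ldb == k, ldc == m, matching the question's indexing.
__global__ void MatMulKernelGuarded(int m, int n, int k,
                                    const double *A, const double *B, double *C,
                                    double alpha, double beta)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // The rounded-up grid launches threads past the matrix edge; skip them.
    if (row >= m || col >= n) return;

    double acc = 0.0;
    for (int ind = 0; ind < k; ++ind)
        acc += A[row + ind * m] * B[ind + col * k];

    C[row + m * col] = alpha * acc + beta * C[row + m * col];
}

// Hypothetical host wrapper. Returns the first CUDA error encountered, if any.
cudaError_t gemm_guarded(int m, int n, int k, double alpha,
                         const double *A, const double *B,
                         double beta, double *C)
{
    double *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t sizeA = sizeof(double) * m * k;
    size_t sizeB = sizeof(double) * k * n;
    size_t sizeC = sizeof(double) * m * n;

    cudaError_t err = cudaMalloc(&dA, sizeA);
    if (err == cudaSuccess) err = cudaMalloc(&dB, sizeB);
    if (err == cudaSuccess) err = cudaMalloc(&dC, sizeC);
    if (err == cudaSuccess) err = cudaMemcpy(dA, A, sizeA, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dC, C, sizeC, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 dimBlock(kBlockSize, kBlockSize);
        // Round up so partial tiles at the right and bottom edges are covered.
        dim3 dimGrid(ceil_div(n, dimBlock.x), ceil_div(m, dimBlock.y));
        MatMulKernelGuarded<<<dimGrid, dimBlock>>>(m, n, k, dA, dB, dC, alpha, beta);
        err = cudaGetLastError();  // reports launch-configuration errors
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // reports execution errors

    // Destination first: copy the device result dC back into the host array C.
    if (err == cudaSuccess) err = cudaMemcpy(C, dC, sizeC, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

For example, with m = 100 and a 16-wide block, m/16 gives 6 blocks covering only 96 rows, while ceil_div(100, 16) gives 7 blocks covering 112 threads, 12 of which fall outside the matrix and are stopped by the guard. Checking each returned cudaError_t, as the wrapper does, is one way a wrong copy direction or a failed launch could be noticed early.
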
Regarding CUDA matrix multiplication, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10327726/
