c++ - CUDA kernel does not update the output data

Tags: c++ image cuda

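OK, the main idea of this task is to compute the average of multiple images. I have it working the normal way, so I thought I would try it with CUDA, but unfortunately what I get in the output is the first image instead of the average. (In the kernel I also tried setting some pixels to 0 to make sure something was happening, but no luck.)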

////My kernel:
//nImages - number of images in the memory
//nBytes - number of pixels*color per image (also it's a size of dataOut)
//nImages*nBytes gives us the size of dataIn 
//nBatch - dataIn has 1 million bytes per image, we run in 6144 threads, so we need 163 batches to calc the whole dataOut
__global__ 
void avg_arrays(unsigned char* cuDataIn, unsigned char* cuDataOut, int nImages, int nBytes, int nBatch) 
{
   //get the position of the correct byte
   int j = threadIdx.x +  nBatch;
   //if we're outside of image then give up
   if(j >= nBytes) return;
   //proceed averaging
   long lSum = 0;
   for(int i=0; i < nImages; ++i) 
      lSum += cuDataIn[i*nBytes + j];
   lSum = lSum / nImages;
   cuDataOut[j] = lSum;
}
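
For comparison, the per-byte average that the kernel is meant to compute can be sketched on the CPU. This is only a reference sketch, not the asker's original CPU code: the function name averageImagesCpu and the use of std::vector are assumptions made for illustration. It assumes the same layout as the kernel, with nImages images of nBytes bytes each stored back to back in dataIn.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not from the original post): averages nImages
// images of nBytes bytes each, stored back to back in dataIn, byte by byte.
std::vector<unsigned char> averageImagesCpu(const std::vector<unsigned char>& dataIn,
                                            int nImages, int nBytes)
{
    std::vector<unsigned char> dataOut(static_cast<std::size_t>(nBytes), 0);
    if (nImages <= 0) return dataOut;  // nothing to average

    for (int j = 0; j < nBytes; ++j) {
        long lSum = 0;
        // Same indexing as the kernel: byte j of image i lives at i*nBytes + j.
        for (int i = 0; i < nImages; ++i)
            lSum += dataIn[static_cast<std::size_t>(i) * nBytes + j];
        // Integer division, as in the kernel.
        dataOut[j] = static_cast<unsigned char>(lSum / nImages);
    }
    return dataOut;
}
```

Comparing this result byte by byte with the buffer copied back from the GPU shows quickly whether the kernel wrote anything at all.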

Memory allocation, etc.:

unsigned char* dataIn = 0;
unsigned char* dataOut= 0;

// Allocate and transfer memory to the device
gpuErrchk( cudaMalloc((void**)&dataIn, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char)));                                  //dataIn
gpuErrchk( cudaMalloc((void**)&dataOut, nPixelCountBGR * sizeof(unsigned char)));                               //dataOut
gpuErrchk( cudaMemcpy(dataIn, bmps,  nPixelCountBGR * nNumberOfImages * sizeof(unsigned char), cudaMemcpyHostToDevice ));           //dataIn
gpuErrchk( cudaMemcpy(dataOut, basePixels, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyHostToDevice ));   //dataOut

// Perform the array addition
dim3 dimBlock(N);  
dim3 dimGrid(1);

//do it in batches, unless it's possible to run more threads at once, anyway N is a number of max threads
for(int i=0; i<nPixelCountBGR; i+=N){
   cout << "Running with: nImg: "<< nNumberOfImages << ", nPixBGR " << nPixelCountBGR << ", and i = " << i << endl;
   avg_arrays<<<dimGrid, dimBlock>>>(dataIn, dataOut, nNumberOfImages, nPixelCountBGR, 0);
}
// Copy the Contents from the GPU
gpuErrchk(cudaMemcpy(basePixels, dataOut, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyDeviceToHost)); 

gpuErrchk(cudaFree(dataOut));
gpuErrchk(cudaFree(dataIn));

Error checking produced no messages, all the code ran smoothly, and in the end all I get is an exact copy of the first image.

In case anyone needs it, here is some console output:

Running with: nImg: 29, nPixBGR 1228800, and i = 0
...
Running with: nImg: 29, nPixBGR 1228800, and i = 1210368
Running with: nImg: 29, nPixBGR 1228800, and i = 1216512
Running with: nImg: 29, nPixBGR 1228800, and i = 1222656
Time of averaging: 0.219

Best answer

If N is greater than 512 or 1024 (depending on which GPU you are running on, which you didn't mention), then this is invalid:

dim3 dimBlock(N); 

because you cannot launch a kernel with more than 512 or 1024 threads per block:

 avg_arrays<<<dimGrid, dimBlock>>>(...
                          ^
                          |
                     this is limited to 512 or 1024
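
One common way to stay under that limit is to keep the block size at or below it and cover the rest of the data with more blocks. The following is a minimal host-side sketch of that arithmetic, not code from the question: LaunchConfig, makeLaunchConfig and globalIndex are hypothetical names, and the actual limit for a given GPU is assumed to be passed in as maxThreadsPerBlock.

```cpp
#include <stdexcept>

// Hypothetical helper type (illustrative names, not from the original post).
struct LaunchConfig {
    int threadsPerBlock;  // would be passed as dimBlock
    int blocks;           // would be passed as dimGrid
};

// Picks a block size no larger than maxThreadsPerBlock and enough blocks
// to cover nElements (the last block may be only partly used).
LaunchConfig makeLaunchConfig(int nElements, int maxThreadsPerBlock)
{
    if (nElements <= 0 || maxThreadsPerBlock <= 0)
        throw std::invalid_argument("sizes must be positive");

    LaunchConfig cfg;
    cfg.threadsPerBlock = maxThreadsPerBlock;
    // Ceiling division: round up so every element gets a thread.
    cfg.blocks = (nElements + maxThreadsPerBlock - 1) / maxThreadsPerBlock;
    return cfg;
}

// The element a given thread would handle. Inside a kernel this corresponds
// to blockIdx.x * blockDim.x + threadIdx.x.
int globalIndex(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}
```

With the 1228800 bytes per image from the console output and a limit of 1024, this gives 1200 blocks of 1024 threads. The kernel would still need its `j >= nBytes` guard, because the last block can extend past the end of the image when the size is not a multiple of the block size.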

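If you learn proper CUDA error checking and apply it to your kernel launch, you will catch this type of error. The launch itself returns no status, so this means checking cudaGetLastError() right after it.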

The original question, "c++ - CUDA kernel does not update the output data", is on Stack Overflow: https://stackoverflow.com/questions/19934798/
