CUDA kernel and 2D arrays - how does it work?

Tags: c cuda nvidia

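I am writing an image rotation routine. It takes two matrices and a rotation angle in degrees, rotates the original matrix by that angle, and stores the result in the rotated matrix. I have the following "normal" code (for the CPU, taken from this site: http://sinepost.wordpress.com/2012/07/24/image-rotation/ ) and it works correctly: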

static void RotateImage(unsigned char original[RAW_HEIGHT][RAW_WIDTH] , unsigned char rotated[RAW_HEIGHT][RAW_WIDTH] , int degrees)
{
    double centerX = RAW_WIDTH/2;
    double centerY = RAW_HEIGHT/2;

    for(int x = 0; x< RAW_HEIGHT;x++)
    {
        for (int y = 0; y < RAW_WIDTH; y++)
        {
            double dir = calculateDirection(x-centerX,y-centerY);
            double mag = calculateMagnitude(x-centerX,y-centerY);

            dir-=degrees;

            int origX = (int)(centerX + calculateX(dir,mag));
            int origY = (int)(centerY + calculateY(dir,mag));

            if (origX >= 0 && origX < RAW_HEIGHT && origY >= 0 && origY < RAW_WIDTH)
            {
                    rotated[x][y] = original[origX][origY];
            }
        }
    }
}

I want to convert this code to CUDA. This is my version:

#define RAW_WIDTH 1600*3
#define RAW_HEIGHT 1200

unsigned char *dev_original_image;
unsigned char *dev_rotated_image;

__global__ void rotatePicture(unsigned char *original, unsigned char *rotated, int degrees)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int offset_rotated = x + y * blockDim.x * gridDim.x;

    double centerX = 2400.0;
    double centerY = 600.0;

    double dir = (atan2(y-centerY,x-centerX))*180/3.14159265;
    double mag = sqrt((x-centerX)*(x-centerX) + (y-centerY)*(y-centerY));

    dir = dir - degrees;

    int origX = (int)(centerX + cos((dir*3.14159265/180)) * mag);
    int origY = (int)(centerY + sin((dir*3.14159265/180)) * mag);
    int offset_original = origX + origY * blockDim.x * gridDim.x;

    if(offset_original > 0 && offset_original < RAW_HEIGHT*RAW_WIDTH)
        *(rotated + offset_rotated) = *(original + offset_original);
}

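But it does not give the same result as the CPU version. I think the problem is in how I pass the arguments to the CUDA kernel. I pass them as 2D arrays; is that OK? Can someone explain this to me? Here is my kernel configuration and call: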

dim3 BlockPerGrid(450,400,1);
dim3 ThreadsPerGrid(8,4,1);

cudaMalloc((void**)&dev_original_image,sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH);
cudaMalloc((void**)&dev_rotated_image,sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH);

cudaMemcpy(dev_original_image, raw_image2D, sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH,cudaMemcpyHostToDevice);
cudaMemcpy(dev_rotated_image, raw_image2D_rotated, sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH, cudaMemcpyHostToDevice);

rotatePicture<<<BlockPerGrid,ThreadsPerGrid>>>(dev_original_image,dev_rotated_image, deg);

Thanks for any advice!

Note: I modified the code and it works better now, but it is still not correct.

Best answer

Here is the solution, for anyone else lurking in these waters with the same problem. This is my corrected kernel:

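__global__ void rotatePicture(unsigned char *original, unsigned char *rotated, int degrees)
{
    // Destination pixel handled by this thread: x is the column, y is the row.
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    // The row stride is the total thread count in x; it must equal RAW_WIDTH.
    int offset_rotated = x + y * blockDim.x * gridDim.x;

    double centerX = 2400.0;   // RAW_WIDTH / 2
    double centerY = 600.0;    // RAW_HEIGHT / 2

    // Polar coordinates of the destination pixel relative to the center.
    double dir = (atan2(x-centerX,y-centerY))*180/3.14159265;
    double mag = sqrt((x-centerX)*(x-centerX) + (y-centerY)*(y-centerY));

    dir = dir - degrees;

    // Source pixel that lands on (x, y) after the rotation.
    int origX = (int)(centerX + sin((dir*3.14159265/180)) * mag);
    int origY = (int)(centerY + cos((dir*3.14159265/180)) * mag);
    int offset_original = origX + origY * blockDim.x * gridDim.x;

    // Copy only if the source pixel is inside the image (index 0 is valid, as in the CPU version).
    if(origX >= 0 && origX < RAW_WIDTH && origY >= 0 && origY < RAW_HEIGHT)
        *(rotated + offset_rotated) = *(original + offset_original);
}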
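
On the question of passing 2D arrays: a statically sized array such as unsigned char image[RAW_HEIGHT][RAW_WIDTH] is stored contiguously in row-major order, so copying it into a cudaMalloc buffer and indexing it as column + row * width reaches the same elements. The following is a minimal host-side sketch in plain C that illustrates this. The names flat_index, read_flat and flat_matches_2d, and the small demo dimensions, are hypothetical and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dimensions only (assumption); the post uses 4800 x 1200. */
enum { DEMO_HEIGHT = 3, DEMO_WIDTH = 4 };

/* Row-major offset of element (row, col) in an image with `width` columns. */
static size_t flat_index(size_t row, size_t col, size_t width)
{
    assert(col < width);
    return row * width + col;
}

/* Reads element (row, col) through a flat pointer, the way the kernel does. */
static unsigned char read_flat(const unsigned char *flat,
                               size_t row, size_t col, size_t width)
{
    return flat[flat_index(row, col, width)];
}

/* Fills a 2D array with distinct values and returns 1 if every element
   read through the flat pointer matches the 2D access, 0 otherwise. */
static int flat_matches_2d(void)
{
    unsigned char image[DEMO_HEIGHT][DEMO_WIDTH];
    for (size_t r = 0; r < DEMO_HEIGHT; r++)
        for (size_t c = 0; c < DEMO_WIDTH; c++)
            image[r][c] = (unsigned char)(10 * r + c);

    /* The same bytes, viewed as one flat buffer. */
    const unsigned char *flat = (const unsigned char *)image;
    for (size_t r = 0; r < DEMO_HEIGHT; r++)
        for (size_t c = 0; c < DEMO_WIDTH; c++)
            if (read_flat(flat, r, c, DEMO_WIDTH) != image[r][c])
                return 0;
    return 1;
}
```

In the kernel above, x plays the role of the column and y the role of the row, and the expression blockDim.x * gridDim.x stands in for the width. That only matches the row-major layout when the launch produces exactly RAW_WIDTH threads along x.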

In addition, I changed the kernel launch dimensions like this (to fit my 1600*3 width and 1200 height):

dim3 BlockPerGrid(600,300,1);
dim3 ThreadsPerGrid(8,4,1);
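
These numbers follow from dividing each image dimension by the block size: 4800 / 8 = 600 blocks in x and 1200 / 4 = 300 blocks in y. The launch in the question, 450 x 400 blocks of 8 x 4 threads, produced 3600 x 1600 threads, so the row stride blockDim.x * gridDim.x was 3600 instead of 4800. A small sketch of that arithmetic in plain C follows; the helper names blocks_needed and threads_along_axis are hypothetical.

```c
#include <assert.h>

/* Blocks needed to cover `extent` elements with `block` threads per block,
   rounding up when `extent` is not a multiple of `block`. */
static unsigned blocks_needed(unsigned extent, unsigned block)
{
    assert(block > 0);
    return (extent + block - 1) / block;
}

/* Total number of threads a launch creates along one axis. */
static unsigned threads_along_axis(unsigned blocks, unsigned block)
{
    return blocks * block;
}
```

Here the image dimensions are exact multiples of the block size, so no thread falls outside the image. If they were not, rounding up would launch extra threads, and the kernel would presumably need a guard on x and y before writing, as well as an explicit width for the row stride.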

With that, it does the same thing as the CPU version above, but uses GPU resources. Enjoy!
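
One way to check that the GPU output really matches the CPU output is to copy the rotated image back to the host and compare it byte by byte against the CPU result. The helper below is a minimal sketch of that comparison in plain C. The name count_mismatches is hypothetical, and it assumes both results are available on the host as flat buffers of the same length.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the positions at which the two buffers differ.
   A return value of 0 means the outputs are identical. */
static size_t count_mismatches(const unsigned char *cpu,
                               const unsigned char *gpu,
                               size_t count)
{
    assert(cpu != NULL && gpu != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < count; i++)
        if (cpu[i] != gpu[i])
            mismatches++;
    return mismatches;
}
```

The GPU buffer would first be copied back with cudaMemcpy using cudaMemcpyDeviceToHost, and count would be RAW_HEIGHT * RAW_WIDTH. A small number of mismatches could come from floating-point differences in the trigonometric functions rather than from an indexing bug.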

Source: the Stack Overflow question "CUDA kernel and 2D arrays - how does it work?": https://stackoverflow.com/questions/14942330/
