image - CUDA average filter for images

Tags: image image-processing matrix cuda gpu

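The average filter is a windowed filter of the linear class that smooths a signal (image). It acts as a low-pass filter. The basic idea is to replace each element of the signal (image) with the average of its neighborhood.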

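Suppose we have an m x n matrix and want to apply an average filter of size k to it. Then for each point p:(i,j) in the matrix, the new value of the point is the average of all points in the square around it.
[Figure: Square Kernel]
The figure shows the square kernel for a filter of size 2: the yellow box is the pixel being averaged, the whole grid is the square of neighboring pixels, and the pixel's new value is their average.
The problem is that this algorithm is very slow, especially on large images, so I thought about using GPGPU.
The question now is: how can this be implemented in CUDA, if it is possible?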
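
For reference, the averaging described above can be sketched on the CPU as follows. This is a minimal sketch and not part of the original question: the function name `mean_filter_cpu`, the reading of size k as a radius (a (2k+1) x (2k+1) window), and the choice to skip pixels outside the image are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// Reference (CPU) mean filter for an 8-bit grayscale image stored row-major.
// Assumption: k is a radius, so the window is (2k+1) x (2k+1), centered on (x, y).
// Assumption: pixels outside the image are skipped, so border pixels
// average fewer samples than interior pixels.
std::vector<unsigned char> mean_filter_cpu(const std::vector<unsigned char>& src,
                                           int width, int height, int k)
{
    std::vector<unsigned char> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            int count = 0;
            for (int dy = -k; dy <= k; ++dy) {
                for (int dx = -k; dx <= k; ++dx) {
                    const int yy = y + dy;
                    const int xx = x + dx;
                    if (yy >= 0 && yy < height && xx >= 0 && xx < width) {
                        sum += src[static_cast<std::size_t>(yy) * width + xx];
                        ++count;
                    }
                }
            }
            // count is at least 1 because the center pixel is always in range.
            dst[static_cast<std::size_t>(y) * width + x] =
                static_cast<unsigned char>(sum / count);
        }
    }
    return dst;
}
```

Every output pixel depends only on the input image, which is why the work per pixel is independent and the loop nest costs roughly width x height x (2k+1)^2 operations on a single core.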

Best answer

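This is a classic case of an embarrassingly parallel image processing problem that maps very easily to the CUDA framework. In the image processing domain, the averaging filter is known as a Box Filter.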

The easiest approach is to use CUDA textures for the filtering, because textures make the boundary conditions very easy to handle.
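
To make the boundary handling concrete, here is a CPU-side sketch of two addressing behaviors for out-of-range reads: returning zero (border) and returning the nearest edge pixel (clamp). The helper names `read_border` and `read_clamp` are hypothetical; they only emulate the idea and are not CUDA API.

```cpp
#include <algorithm>
#include <vector>

// "Border"-style read: coordinates outside the image return zero.
unsigned char read_border(const std::vector<unsigned char>& img,
                          int width, int height, int x, int y)
{
    if (x < 0 || x >= width || y < 0 || y >= height) {
        return 0;
    }
    return img[y * width + x];
}

// "Clamp"-style read: coordinates outside the image snap to the nearest edge pixel.
unsigned char read_clamp(const std::vector<unsigned char>& img,
                         int width, int height, int x, int y)
{
    const int cx = std::min(std::max(x, 0), width - 1);
    const int cy = std::min(std::max(y, 0), height - 1);
    return img[cy * width + cx];
}
```

With a texture, this choice is a setting rather than code in the kernel, so the filter loop can read neighbors without explicit range checks.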

Assume you have the source and destination pointers allocated on the host. The procedure would be something like this:

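  • Allocate enough memory on the device to hold the source and destination images.
  • Copy the source image from the host to the device.
  • Bind the source image device pointer to a texture.
  • Specify an appropriate block size and a grid large enough to cover every pixel of the image.
  • Launch the filtering kernel with the specified grid and block size.
  • Copy the results back to the host.
  • Unbind the texture.
  • Free the device pointers.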
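
The "grid large enough to cover every pixel" step comes down to a ceiling division per dimension. A minimal host-side sketch, where the helper name `blocks_for` is hypothetical:

```cpp
// Number of blocks needed so that blocks * block_dim >= extent (ceiling division).
// Assumes block_dim > 0 and that extent + block_dim - 1 does not overflow.
unsigned int blocks_for(unsigned int extent, unsigned int block_dim)
{
    return (extent + block_dim - 1) / block_dim;
}
```

For a 1920 x 1080 image and 16 x 16 blocks this gives a 120 x 68 grid. The last row of blocks extends past the image, which is why the kernel has to check that each thread's coordinates are inside the image bounds.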

Sample implementation of the box filter

Kernel:

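    //Texture reference (legacy API). Texture references were removed in CUDA 12;
    //newer code uses texture objects (cudaTextureObject_t) instead.
    texture<unsigned char, cudaTextureType2D> tex8u;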
    
    //Box Filter Kernel For Gray scale image with 8bit depth
    __global__ void box_filter_kernel_8u_c1(unsigned char* output,const int width, const int height, const size_t pitch, const int fWidth, const int fHeight)
    {
        int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
        int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    
        const int filter_offset_x = fWidth/2;
        const int filter_offset_y = fHeight/2;
    
        float output_value = 0.0f;
    
        //Make sure the current thread is inside the image bounds
        if(xIndex<width && yIndex<height)
        {
            //Sum the window pixels
            for(int i= -filter_offset_x; i<=filter_offset_x; i++)
            {
                for(int j=-filter_offset_y; j<=filter_offset_y; j++)
                {
                    //No need to worry about Out-Of-Range access. tex2D automatically handles it.
                    output_value += tex2D(tex8u,xIndex + i,yIndex + j);
                }
            }
    
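            //Average over the number of samples actually summed.
            //This equals fWidth * fHeight when both are odd; for even sizes the loops above cover one extra row/column.
            output_value /= ((2 * filter_offset_x + 1) * (2 * filter_offset_y + 1));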
    
            //Write the averaged value to the output.
            //Transform 2D index to 1D index, because image is actually in linear memory
            int index = yIndex * pitch + xIndex;
    
            output[index] = static_cast<unsigned char>(output_value);
        }
    }
    

Wrapper function:

    void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight)
    {
    
        /*
         * 2D memory is allocated as strided linear memory on GPU.
         * The terminologies "Pitch", "WidthStep", and "Stride" are exactly the same thing.
         * It is the size of a row in bytes.
         * It is not necessary that width = widthStep.
         * Total bytes occupied by the image = widthStep x height.
         */
    
        //Declare GPU pointer
        unsigned char *GPU_input, *GPU_output;
    
        //Allocate 2D memory on GPU. Also known as Pitch Linear Memory
        size_t gpu_image_pitch = 0;
        cudaMallocPitch<unsigned char>(&GPU_input,&gpu_image_pitch,width,height);
        cudaMallocPitch<unsigned char>(&GPU_output,&gpu_image_pitch,width,height);
    
        //Copy data from host to device.
        cudaMemcpy2D(GPU_input,gpu_image_pitch,CPUinput,widthStep,width,height,cudaMemcpyHostToDevice);
    
        /*
         * Set the behavior of tex2D for out-of-range image reads.
         * cudaAddressModeBorder = Read Zero
         * cudaAddressModeClamp  = Read the nearest border pixel
         * This step is optional. If it is skipped, the default mode (Clamp) is used.
         * It is set before binding, following the order used in NVIDIA's texture reference examples.
         */
        tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder;
    
        //Bind the image to the texture. Now the kernel will read the input image through the texture cache.
        //Use tex2D function to read the image
        cudaBindTexture2D(NULL,tex8u,GPU_input,width,height,gpu_image_pitch);
    
        /*
         * Specify a block size. 256 threads per block are sufficient.
         * It can be increased, but keep in mind the limitations of the GPU.
         * Older GPUs allow maximum 512 threads per block.
         * Current GPUs allow maximum 1024 threads per block
         */
    
        dim3 block_size(16,16);
    
        /*
         * Specify the grid size for the GPU.
         * Make it generalized, so that the size of grid changes according to the input image size
         */
    
        dim3 grid_size;
        grid_size.x = (width + block_size.x - 1)/block_size.x;  /*< Greater than or equal to image width */
        grid_size.y = (height + block_size.y - 1)/block_size.y; /*< Greater than or equal to image height */
    
        //Launch the kernel
        box_filter_kernel_8u_c1<<<grid_size,block_size>>>(GPU_output,width,height,gpu_image_pitch,filterWidth,filterHeight);
    
        //Copy the results back to CPU
        cudaMemcpy2D(CPUoutput,widthStep,GPU_output,gpu_image_pitch,width,height,cudaMemcpyDeviceToHost);
    
        //Release the texture
        cudaUnbindTexture(tex8u);
    
        //Free GPU memory
        cudaFree(GPU_input);
        cudaFree(GPU_output);
    }
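
The pitch described in the wrapper's comments can be illustrated with a small host-side sketch of how a pixel is addressed in pitched memory. The helper name `pixel_at` is hypothetical, and the sketch assumes an 8-bit, single-channel image as in the code above.

```cpp
#include <cstddef>

// Address of pixel (x, y) in a pitched 8-bit, single-channel image.
// pitch_bytes is the distance in bytes between the starts of consecutive rows;
// it may be larger than the image width because rows can be padded.
inline unsigned char* pixel_at(unsigned char* base, std::size_t pitch_bytes, int x, int y)
{
    return base + static_cast<std::size_t>(y) * pitch_bytes + x;
}
```

This is the same arithmetic as `yIndex * pitch + xIndex` in the kernel, and it is why the copies use cudaMemcpy2D with separate host and device pitches rather than a single flat copy.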
    

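The good news is that you don't have to implement the filter yourself. The CUDA Toolkit ships with a free signal and image processing library called NVIDIA Performance Primitives (NPP), developed by NVIDIA. NPP uses CUDA-enabled GPUs to accelerate processing. The averaging filter is already implemented in NPP. The version of NPP current at the time of writing (5.0) supports 8-bit, 1-channel and 4-channel images.
The functions are:
  • nppiFilterBox_8u_C1R for 1-channel images.
  • nppiFilterBox_8u_C4R for 4-channel images.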

Source: a similar question on Stack Overflow regarding "image - CUDA average filter for images": https://stackoverflow.com/questions/14334251/
