c - 为什么 clEnqueueMapBuffer 返回随机数据?

标签 c memory opencl

我试图通过编写一个内核来了解 OpenCL 中的 clEnqueueMapBuffer,该内核在输入缓冲区中找到值的平方,但使用 clEnqueueMapBuffer 在输出缓冲区中一次只返回两个项目。据我了解,此函数返回主机内存中的指针,该指针指向设备中的缓冲内存。然后 clEnqueueUnmapMemObject 必须取消映射此缓冲区以允许内核继续其计算。现在,当我调用 clEnqueueMapBuffer 时,它正在返回随机数据。

这是我的内核

__kernel void testStream(
    __global int *input_vector,
    __global int *output_vector,
    __global int *mem_flag) // informs the host when the workload is finished
{
    mem_flag[0] = 1;
}

和我的来源

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <CL/opencl.h>

#include "utils.h"

int main(void)
{
    unsigned int n = 24;
    int BUFF_SIZE = 2;
    // Input and output vectors
    int num_bytes = sizeof(int) * n;
    int *output_buffer = (int *) malloc(num_bytes);
    int output_buffer_offset = 0;
    int *mapped_data = NULL;
    // use mapped_flag for determining if the job on the device is finished
    int *mapped_flag = NULL;
    int *host_in = (int *) malloc(num_bytes);
    int *host_out = (int *) malloc(num_bytes);
    // Declare cl variables
    cl_mem device_in;
    cl_mem device_out;
    cl_mem device_out_flag;

    // Declare cl boilerplate
    cl_platform_id   platform = NULL;
    cl_device_id     device   = NULL;
    cl_command_queue queue    = NULL;
    cl_context       context  = NULL;
    cl_program       program  = NULL;
    cl_kernel        kernel   = NULL;

    // Located in utils.c -- the source is irrelevant here
    char *kernel_source = read_kernel("kernels/test.cl");

    // Initialize host_in
    int i;
    for (i = 0; i < n; i++) {
        host_in[i] = i + 1;
    }

    // Set up opencl
    cl_int error;
    error   = clGetPlatformIDs(1, &platform, NULL);
    printf("clGetPlatformIDs: %d\n", (int) error);
    error   = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    printf("clGetDeviceIDs: %d\n", (int) error);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &error);
    printf("clCreateContext: %d\n", (int) error);
    queue   = clCreateCommandQueue(context, device, 0, &error);
    printf("clCreateCommandQueue: %d\n", (int) error);
    program = clCreateProgramWithSource(context, 1,
        (const char**)&kernel_source, NULL, &error);
    printf("clCreateProgramWithSource: %d\n", (int) error);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "testStream", &error);
    printf("clCreateKernel: %d\n", (int) error);
    // Create the buffers
    device_in = clCreateBuffer(context, CL_MEM_READ_ONLY,
        num_bytes, NULL, NULL);
    device_out = clCreateBuffer(context,
        CL_MEM_WRITE_ONLY,
        sizeof(int) * BUFF_SIZE, NULL, NULL);
    device_out_flag = clCreateBuffer(context,
        CL_MEM_WRITE_ONLY,
        sizeof(int) * 2, NULL, NULL);

    // Write the input buffer
    clEnqueueWriteBuffer(
        queue, device_in, CL_FALSE, 0, num_bytes,
        host_in, 0, NULL, NULL);

    // Set the kernel arguments
    error = clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_in);
    error = clSetKernelArg(kernel, 1, sizeof(cl_mem), &device_out);
    error = clSetKernelArg(kernel, 2, sizeof(cl_mem), &device_out_flag);

    // Execute the kernel over the entire range of data
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, 
        (const size_t *) &n, NULL, 0, NULL, NULL);

    // Map and unmap until the flag is set to true
    int break_flag = 0;
    while(1) {
        // Map the buffers
        mapped_data = (int *) clEnqueueMapBuffer(
            queue, device_out, CL_TRUE, CL_MAP_READ, 0,
            sizeof(int) * BUFF_SIZE, 0, NULL, NULL, &error);
        mapped_flag = (int *) clEnqueueMapBuffer(
            queue, device_out_flag, CL_TRUE, CL_MAP_READ, 0,
            sizeof(int) , 0, NULL,NULL, &error);
        // Extract the data out of the buffer
        printf("mapped_flag[0] = %d\n", mapped_flag[0]);
        // Set the break_flag
        break_flag = mapped_flag[0];
        // Unmap the buffers
        error = clEnqueueUnmapMemObject(queue, device_out, mapped_data, 0,
            NULL, NULL);
        error = clEnqueueUnmapMemObject(queue, device_out_flag, mapped_flag,
            0, NULL, NULL);
        if (break_flag == 1) {break;}
        usleep(1000*1000);
    }

   return 0;
}

当我运行程序时,我得到类似的输出

clGetPlatformIDs: 0
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
clCreateProgramWithSource: 0
clCreateKernel: 0
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144
mapped_flag[0] = 45366144

为什么会这样?


编辑

我在 HP dm1z 上运行此代码,内核为 3.13.7-100.fc19.x86_64,fedora 19 64 位。这是 clinfo 的输出

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 1.2 AMD-APP (1214.3)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Device ID:                     4098
  Board name:                    AMD Radeon HD 6310 Graphics
  Device Topology:               PCI[ B#0, D#1, F#0 ]
  Max compute units:                 2
  Max work items dimensions:             3
    Max work items[0]:               256
    Max work items[1]:               256
    Max work items[2]:               256
  Max work group size:               256
  Preferred vector width char:           16
  Preferred vector width short:          8
  Preferred vector width int:            4
  Preferred vector width long:           2
  Preferred vector width float:          4
  Preferred vector width double:         0
  Native vector width char:          16
  Native vector width short:             8
  Native vector width int:           4
  Native vector width long:          2
  Native vector width float:             4
  Native vector width double:            0
  Max clock frequency:               492Mhz
  Address bits:                  32
  Max memory allocation:             134217728
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    None
  Cache line size:               0
  Cache size:                    0
  Global memory size:                201326592
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 32768
  Kernel Preferred work group size multiple:     32
  Error correction support:          0
  Unified memory for Host and Device:        1
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Platform ID:                   0x00007fd434852fc0
  Name:                      Loveland
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                1214.3
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 AMD-APP (1214.3)
  Extensions:                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_amd_image2d_from_buffer_read_only 

此外,可能值得注意的是,当我开始使用 OpenCL 时,我运行了一个测试程序来计算内积,但结果很奇怪。最初我以为是程序错误而忘记了它,但 OpenCL 实现是否有可能有问题?如果有帮助,OpenGL 实现有多个错误,导致随机数据 block 出现在我的桌面背景上,但这也可能是 Linux 问题。

最佳答案

您正在将 NULL 作为全局工作大小传递给您的 clEnqueueNDRangeKernel 调用:

// Execute the kernel over the entire range of data
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, NULL, NULL, 0, NULL, NULL);

如果您正在检查此调用返回的错误代码(您总是应该这样做),您会得到对应于 CL_INVALID_GLOBAL_WORK_SIZE 的错误代码。您始终需要指定全局工作大小,因此您的调用应如下所示:

// Execute the kernel over the entire range of data
size_t global[1] = {1};
error = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);
// check error == CL_SUCCESS!

您对映射和取消映射缓冲区的调用没有问题;我已经使用上述修复程序测试了这段代码,它对我有用。


修复上述问题的更新代码如下所示:

unsigned int n = 24;
...
// Execute the kernel over the entire range of data
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, 
    (const size_t *) &n, NULL, 0, NULL, NULL);

这不是将全局工作大小参数传递给内核的安全方法。例如,unsigned int n 变量可能占用 32 位,而 size_t 可能占用 64 位。这意味着当您传递 n 的地址并转换为 const size_t* 时,该实现将读取一个 64 位值,该值将包含 的 32 位code>n 加上 32 个具有任意值的其他位。您应该在将 n 传递给 clEnqueueNDRangeKernel 之前将其分配给 size_t 变量,或者将其更改为 size_t 本身。

这可能与您遇到的问题有关,也可能无关。例如,您可能不小心启动了大量 工作项,这可能解释了为什么代码似乎在 CPU 上阻塞。

关于c - 为什么 clEnqueueMapBuffer 返回随机数据?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/23416649/

相关文章:

OpenCL 2.0 设备命令队列不断填​​满并停止执行

c - 隐藏设备应用程序执行失败

java - 使用 Z3 的 Java 中的段错误

python - 如何将一个大文件分割成多个文件?

c++ - 存储分配器 - 它是什么?

c - 如何在内存中查找返回指令

c - MQ C API - 如何在多线程环境中中止等待的 `MQGET`

c - 在 linux 中模块化编程和编译 C 程序

opencl - GPU 上的矢量化代码

macos - 如何使用 OSX 在 AMD GPU 上获取 OpenCL 内核的汇编代码