c - Strange behavior of basic CUDA code

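I can't make sense of the output of the following simple CUDA code. All it does is allocate two integer arrays, one on the host and one on the device, each of size 16. It then sets the device array elements to the integer value 3 and copies those values into host_array, whose elements are then all printed out.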

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
  int num_elements = 16;
  int num_bytes = num_elements * sizeof(int);

  int *device_array = 0;
  int *host_array = 0;

  // malloc host memory
  host_array = (int*)malloc(num_bytes);

  // cudaMalloc device memory
  cudaMalloc((void**)&device_array, num_bytes);

  // fill the device array with cudaMemset
  cudaMemset(device_array, 3, num_bytes);

  // copy the contents of the device array to the host
  cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);

  // print out the result element by element
  for(int i = 0; i < num_elements; ++i)
    printf("%i\n", *(host_array+i));

  // use free to deallocate the host array
  free(host_array);

  // use cudaFree to deallocate the device array
  cudaFree(device_array);

  return 0;
}

The output of this program is 50529027, printed 16 times, one per line.

Where does this number come from? When I replace 3 with 0 in the cudaMemset call, I get the correct behavior, i.e. 0 printed 16 times, one per line.

I compiled the code with nvcc test.cu on Ubuntu 10.10 with CUDA 4.0.

Best answer

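I'm not a CUDA expert, but 50529027 is 0x03030303 in hexadecimal. That means cudaMemset set every byte of the array to 3, not every int. Given the signature of cudaMemset (you pass the number of bytes to set) and the general semantics of memset operations, this is not surprising.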
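The byte-wise fill can be reproduced on the host with plain memset, which makes the arithmetic easy to check. The sketch below is only an illustration: the helper name word_after_byte_fill is made up for this example and is not part of CUDA or the C library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a CUDA or libc function): returns the value a
   32-bit word holds after each of its four bytes has been set to `byte`,
   which is what a memset-style fill does. */
uint32_t word_after_byte_fill(unsigned char byte)
{
    uint32_t word;
    memset(&word, byte, sizeof word);   /* writes `byte` into all 4 bytes */
    return word;
}
```

With byte 3 the word becomes 0x03030303, i.e. 3*2^24 + 3*2^16 + 3*2^8 + 3 = 50529027. With byte 0 every byte is zero, so the int is 0 as well, which is why the 0 case looked correct.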

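Edit: as for your (I guess) implicit question of how to achieve what you intended, I think you have to write a loop and initialize each array element.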
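One way to read the loop suggestion, sketched on the host side: fill an int buffer element by element, so that each element (not each byte) receives the value. The helper name fill_ints is hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: assigns `value` to every int element of `array`.
   Unlike memset, it writes whole ints rather than individual bytes. */
void fill_ints(int *array, size_t count, int value)
{
    assert(array != NULL || count == 0);   /* nothing to write through NULL */
    for (size_t i = 0; i < count; ++i)
        array[i] = value;
}
```

In the question's program, this could presumably be applied as fill_ints(host_array, num_elements, 3), followed by cudaMemcpy(device_array, host_array, num_bytes, cudaMemcpyHostToDevice) if the values are needed on the device. Running the same per-element assignment inside a small kernel would be an alternative that avoids the copy.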

Regarding "c - Strange behavior of basic CUDA code", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8290037/
