CUDA - large array initialization

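What is the best (most efficient) way to initialize a large array of integers for the GPU? I need to assign 1 to the first two elements and 0 to all the others (for the Sieve of Eratosthenes).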

cudaMemcpy
cudaMemset + setting the values of the first two elements in a kernel
initializing directly in the kernel
something else

Note: the array size is dynamic (n is passed as a parameter).

My current version:

int *array = (int *) malloc(array_size);
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
    array[i] = 0;
}
HANDLE_ERROR(cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice));
kernel<<<10, 10>>>(dev_array);

I would appreciate any help.

Best answer

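One possibility is to initialize a __device__ array directly on the GPU (if it has a constant size) by adding the following declaration at file scope (that is, outside of any function):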

__device__ int dev_array[SIZE] = {1, 1};

The remaining elements are initialized with zeros (you can check the PTX assembly to confirm this).

It can then be used in a kernel like this:

__global__ void kernel(void)
{
    int tid = ...;
    int elem = dev_array[tid];
    ...
}
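To make the file-scope approach concrete, here is a minimal sketch that counts the nonzero elements of such an array from a kernel. SIZE = 1024, the array name sieve_flags, and the helpers count_nonzero and count_initial_ones are all made up for illustration; none of them come from the answer.

```cuda
#include <cuda_runtime.h>

#define SIZE 1024  // assumed compile-time size, for illustration only

// File-scope device array: the first two elements are 1, the rest are zero.
__device__ int sieve_flags[SIZE] = {1, 1};

// Hypothetical kernel: each thread checks one element and counts nonzero ones.
__global__ void count_nonzero(int *count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < SIZE && sieve_flags[tid] != 0) {
        atomicAdd(count, 1);
    }
}

// Launches the kernel and returns the count, or -1 on any CUDA error.
int count_initial_ones(void)
{
    int *dev_count = NULL;
    int host_count = -1;

    if (cudaMalloc((void **) &dev_count, sizeof(int)) != cudaSuccess) {
        return -1;
    }
    if (cudaMemset(dev_count, 0, sizeof(int)) == cudaSuccess) {
        count_nonzero<<<(SIZE + 255) / 256, 256>>>(dev_count);
        // cudaMemcpy waits for the kernel, which runs in the same default stream.
        if (cudaMemcpy(&host_count, dev_count, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            host_count = -1;
        }
    }
    cudaFree(dev_count);
    return host_count;
}
```

If the initializer behaves as described, count_initial_ones() should report exactly two nonzero elements.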

For a variable size, you can combine cudaMalloc() with cudaMemset():

int array_size = ...;
int *dev_array;

cudaMalloc((void **) &dev_array, array_size * sizeof(int));
cudaMemset(dev_array, 0, array_size * sizeof(int));

Then set the first two elements like this:

int helper_array[2] = {1, 1};
cudaMemcpy(dev_array, helper_array, 2 * sizeof(int), cudaMemcpyHostToDevice);
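Putting the two snippets together with error checking might look like the sketch below. The CHECK_CUDA macro and the make_sieve_array function are hypothetical names introduced here (the question uses its own HANDLE_ERROR macro, whose definition is not shown), and the function assumes n >= 2.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error and abort if a runtime call fails.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Allocates n ints on the device, zero-fills them, then sets the first
// two elements to 1. The caller releases the array with cudaFree().
int *make_sieve_array(size_t n)
{
    int *dev_array = NULL;
    const int ones[2] = {1, 1};

    assert(n >= 2);  // the two-element copy below needs at least two slots
    CHECK_CUDA(cudaMalloc((void **) &dev_array, n * sizeof(int)));
    // cudaMemset fills bytes; 0 works because an int of all-zero bytes is 0.
    CHECK_CUDA(cudaMemset(dev_array, 0, n * sizeof(int)));
    CHECK_CUDA(cudaMemcpy(dev_array, ones, sizeof(ones),
                          cudaMemcpyHostToDevice));
    return dev_array;
}
```

Note that cudaMemset() works per byte, so this trick only covers values such as 0 whose bytes are all identical; the two 1s still need the small copy.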

Starting with compute capability 2.0, you can also allocate the whole array directly inside the kernel via the device-side malloc() function:

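__global__ void kernel(int array_size)
{
    // __shared__, so that every thread in the block sees the pointer
    // written by thread 0; a plain local variable would be per-thread.
    __shared__ int *dev_array;
    int tid = ...;  // assumed here to be the thread's index within its block

    if (tid == 0) {
        dev_array = (int *) malloc(array_size * sizeof(int));
        if (dev_array == NULL) {
            ...
        }
        memset(dev_array, 0, array_size * sizeof(int));
        dev_array[0] = dev_array[1] = 1;
    }
    // Wait until thread 0 of this block has finished the initialization.
    __syncthreads();

    ...
}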

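Note that __syncthreads() is a barrier only for the threads of one block; threads from different blocks are not synchronized by it.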

From the CUDA C Programming Guide:

The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.

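Unfortunately, the calloc() function is not implemented, so you need to memset the memory yourself anyway. The allocated memory has the lifetime of the CUDA context, but you can explicitly call free() at any point, from this kernel or a subsequent one: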

The memory allocated by a given CUDA thread via malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA threads even from subsequent kernel launches.
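As a rough illustration of that lifetime rule, the sketch below allocates the array in one kernel and then reads and frees it in a second launch. The global pointer g_sieve and the functions alloc_sieve, read_and_free_sieve, and heap_roundtrip are hypothetical names. It also raises the device heap limit with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...), on the assumption (from my reading of the Programming Guide) that the heap defaults to 8 MB and has to be sized before the first kernel that uses malloc() runs.

```cuda
#include <cuda_runtime.h>

// Hypothetical global that carries the heap pointer across kernel launches.
__device__ int *g_sieve = NULL;

// A single thread allocates from the device heap and initializes the array.
__global__ void alloc_sieve(size_t n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        g_sieve = (int *) malloc(n * sizeof(int));
        if (g_sieve != NULL) {
            memset(g_sieve, 0, n * sizeof(int));
            g_sieve[0] = g_sieve[1] = 1;
        }
    }
}

// A later launch reads the same allocation and then releases it.
__global__ void read_and_free_sieve(int *out3)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        if (g_sieve == NULL) {
            out3[0] = out3[1] = out3[2] = -1;  // allocation had failed
            return;
        }
        out3[0] = g_sieve[0];
        out3[1] = g_sieve[1];
        out3[2] = g_sieve[2];
        free(g_sieve);
        g_sieve = NULL;
    }
}

// Host driver: copies the first three elements into host_out (assumes n >= 3).
// Returns false on any CUDA runtime error.
bool heap_roundtrip(size_t n, int host_out[3])
{
    int *dev_out = NULL;

    // Size the heap for the array plus 1 MB of slack (the slack is a guess).
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                           n * sizeof(int) + (1 << 20)) != cudaSuccess) {
        return false;
    }
    if (cudaMalloc((void **) &dev_out, 3 * sizeof(int)) != cudaSuccess) {
        return false;
    }
    alloc_sieve<<<1, 1>>>(n);
    read_and_free_sieve<<<1, 1>>>(dev_out);
    bool ok = cudaMemcpy(host_out, dev_out, 3 * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev_out);
    return ok;
}
```

Having one thread memset a large array is slow; the sketch only demonstrates that the pointer stays valid between launches.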

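All that said, I wouldn't worry much about the extra cudaMemcpy(), since it copies only two elements and probably takes less than 0.01% of the total execution time (this is easy to profile). Choose whichever way keeps your code clear. Anything beyond that is premature optimization.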
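For completeness, the question's third option (initializing directly in a kernel) is not shown in the answer. One possible sketch, using a grid-stride loop so that any launch configuration covers all n elements, could look like this. The names init_sieve and make_sieve_array_on_device and the 64 x 256 launch configuration are arbitrary choices for illustration, not tuned values.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes 1 to the first two elements and 0 elsewhere.
__global__ void init_sieve(int *dev_array, size_t n)
{
    size_t stride = (size_t) gridDim.x * blockDim.x;
    for (size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        dev_array[i] = (i < 2) ? 1 : 0;
    }
}

// Allocates n ints on the device and initializes them in a kernel.
// Returns NULL on failure; the caller releases the array with cudaFree().
int *make_sieve_array_on_device(size_t n)
{
    int *dev_array = NULL;

    if (cudaMalloc((void **) &dev_array, n * sizeof(int)) != cudaSuccess) {
        return NULL;
    }
    init_sieve<<<64, 256>>>(dev_array, n);
    // Check both the launch itself and the kernel's execution.
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(dev_array);
        return NULL;
    }
    return dev_array;
}
```

This avoids the host-side buffer and the host-to-device transfer of the whole array, at the cost of one extra kernel.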

This question and answer on CUDA large array initialization come from Stack Overflow: https://stackoverflow.com/questions/31034373/
