c++ - CUDA AtomicCAS 死锁

我有一个数组matrix值为 0，我想将其中一些元素增加 1。 matrix 的索引我想要增加的存储在数组 indices 中。我需要多次增加一些元素，因此我尝试为 matrix 中的每个元素使用互斥体数组。。但是当我启动代码时，程序挂起并陷入僵局。

我被这个问题困扰了。我最终想要做的是使用 CUDA 绘制一个自身重叠的连续笔触，因此我需要并行访问 Canvas 图像的相同像素。

这是我的代码:

#include <iostream>
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // thread id
    int ind = indices[index]; // indices of target array A to increment    

    if (index < nof_indices) {
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0);
        __syncthreads();
    }
}

int main()
{
    int nof_indices = 6; // length of an array B
    int indices[6] = { 0,1,2,3,4,1 }; // array B; stores indices of an array A which to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with individual mutexes for each of array A elements

    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all array A elements to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all array A elements to 0    

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, &canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, &indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, &semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel << <1, 6 >> > (d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(&canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);

    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    cudaFree(d_canvas);
    cudaFree(d_indices);
    cudaFree(d_semaphores);

    return 0;
}

在此示例中，结果数组 matrix应该看起来像这样:{1, 2 ,1 ,1,1,0} ，但只有当我运行尺寸为 << 6,1 >> 的内核时才得到它.

我使用的是 CUDA 12.1、Geforce RTX 3060

谢谢

(仅当我将每个 block 的线程大小设置为 1 时才有效，但这不是我想要的)

最佳答案

在 volta 之前的执行模型中，这行代码是有问题的:

    while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);

该主题一般在 this blog 中讨论。 “独立线程调度”以及各种SO问题，例如this one和 this one .

但是，正如博客(和其他地方)所示，volta 执行模型应该允许更灵活的范例。我相信这里的问题是由于 feature 引起的nvcc:

To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

如果您针对 pre-volta 架构进行编译，则向编译器表明您需要 pre-volta 语义。这可能会影响代码的执行行为，例如，您在 volta 或更新的架构上执行，但针对 volta 之前的目标进行编译。

根据我的测试，如果我使用 CUDA 12.1 上的默认开关(默认情况下选择 sm_52 目标(包括 PTX))进行编译，sm_75 上的代码会出现死锁。但是，如果我针对 sm_75 目标进行编译，则代码会“正常”运行。

我认为，如果您针对 Volta 或更新的目标进行编译，您的代码不会在 RTX 3060 上死锁。除非您有理由不这样做，否则一般建议是编译时指定您希望运行的目标。

关于c++ - CUDA AtomicCAS 死锁，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/76243426/

c++ - CUDA AtomicCAS 死锁

上一篇：reactjs - React 没有被使用，但仍然被合并到一个包中

下一篇：python - 确定数据框中行之间的转换和持续时间