c++ - 如何在不使用原子的情况下同步 CUDA 中的线程

我正在通过在线 UDACITY 类(class)学习 CUDA 编程。第二节课给出了一个示例代码，它有两个基本的内核，第一个 __global__ void increment_naive(int *g) 简单地将 1 添加到数组 *g 驻留在全局内存中。

根据UDACITY的完整代码如下:

#include <stdio.h>
#include "gputimer.h"

#define NUM_THREADS 1000000
#define ARRAY_SIZE  100

#define BLOCK_WIDTH 1000

void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i < size; i++)  { printf("%d ", array[i]); }
    printf("}\n");
}

__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 

    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;  
    g[i] = g[i] + 1;
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 

    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;  
    atomicAdd(& g[i], 1);
}

int main(int argc,char **argv)
{   
    GpuTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);

    // declare and allocate host memory
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero out GPU memory
    int * d_array;
    cudaMalloc((void **) &d_array, ARRAY_BYTES);
    cudaMemset((void *) d_array, 0, ARRAY_BYTES); 

    // launch the kernel - comment out one of these
    timer.Start();


    increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    //increment_atomic<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.Stop();

    // copy back the array of sums from GPU and print
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);
    printf("Time elapsed = %g ms\n", timer.Elapsed());

    // free GPU memory allocation and exit
    cudaFree(d_array);
    return 0;
}

根据程序，1000个 block 的100万个线程正在写入10个数组元素。因此，每个数组元素的结果都是 100000。

第一个内核无法产生所需的输出，因为线程不是同步访问产生不良结果。这可以使用诸如 __syncthreads 之类的屏障或使用原子操作来解决。

第二个 kerell 工作正常并产生如下正确的输出:

1000000 total threads in 1000 blocks writing into 100 array elements
{ 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 }
Time elapsed = 0.367648 ms

如前所述，第一个 kerell 每次都会产生错误的输出。

1000000 total threads in 1000 blocks writing into 100 array elements
{ 75 75 75 75 78 78 78 78 73 73 73 73 82 82 82 82 85 85 85 85 92 92 92 92 104 104 104 104 107 107 107 107 89 89 89 89 88 88 88 88 95 95 95 95 103 103 103 103 106 106 106 106 107 107 107 107 105 105 105 105 113 113 113 113 96 96 96 96 95 95 95 95 95 95 95 95 100 100 100 100 98 98 98 98 104 104 104 104 110 110 110 110 126 126 126 126 90 90 90 90 }
Time elapsed = 0.23392 ms

我试图通过在计算的不同阶段放置障碍来修复第一个内核，但未能获得必要的输出。我修复第一个内核的尝试如下:

    __global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    __syncthreads();
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    //i = i % ARRAY_SIZE;
    int temp = i%ARRAY_SIZE;
    __syncthreads();
    i = temp;
    __syncthreads();
    //g[i] = g[i] + 1;
    int temp1 = g[i]+1;
    __syncthreads();
    g[i] = temp1;
     __syncthreads();

}

我希望有人能指导我解决这个问题，因为这个问题困扰着我很多，阻碍了我进一步进步的信心。

最佳答案

__syncthreads() 函数确保 block 中的所有线程都在代码中的同一位置。使用那些不会达到你想要的。更糟糕的是——假设 CUDA 是一个完美的并行机器，所有线程都在同步工作。您将永远不需要任何 __syncthreads。尽管如此，你还是会得到不同的结果。考虑以下伪代码和对正在发生的事情的解释:

__perfect_parallel_machine__ void increment_naive(int *g)
{
    int idx = thisThreadIdx % ARRAY_SIZE;
    int local = g[idx];
                               //*all* threads load the initial value of g[idx]
                               //each thread holds a separate copy of 'local' variable
                               //local=0 in each thread
    local = local + 1;
                               //each thread increment its own private copy of 'local'
                               //local=1 for all threads
    g[idx] = local;
                               //each thread stores the same value (1) into global array
                               //g = {1, 1, 1, 1, 1, ...., 1}
}

由于 CUDA 不是完美的并行机器，所以事情会乱序发生，您最终会在数组中获得更高的值。设置更多同步障碍将使您更接近理想的 {1, 1, ... , 1} 结果。

还有其他的屏障函数，比如__threadfence()。这会停止当前线程(仅当前线程!)，直到保证对全局数组的存储对其他线程可见。这个和L1/L2缓存有关，和线程同步无关。例如，将 __threadfence 与 atomics 结合使用来标记您已完成某些数据的填充是很常见的。

我想你和导师之间一定有什么误会。我建议与他交谈以澄清...

关于c++ - 如何在不使用原子的情况下同步 CUDA 中的线程，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/35955753/

c++ - 如何在不使用原子的情况下同步 CUDA 中的线程

上一篇：c++ - fatal error LNK1104 : cannot open file

下一篇：c++ - 打印不同的字符串以读取不同的对象变量