cuda - Does __syncthreads() synchronize all threads in the grid?

Does __syncthreads() synchronize all threads in the grid, or only the threads in the current warp or block?

Also, when the threads in a particular block encounter the following line (in the kernel)

__shared__  float srdMem[128];

will they declare this space only once (per block)?

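They obviously all operate asynchronously, so if thread 23 in block 22 is the first to reach that line and thread 69 in block 22 is the last to reach it, will thread 69 know that the space has already been declared?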

Best answer

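__syncthreads() is a block-level synchronization barrier: no thread in the block proceeds past it until every thread in that block has reached it. It does not synchronize threads in other blocks. __syncthreads() may also be used in conditional code, but only if the condition evaluates identically for all threads in the block; otherwise execution may hang or produce unintended side effects [4].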
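The rule about conditional code can be illustrated with a minimal sketch. The names `uniformBarrier`, `runUniformBarrier`, and `BLOCK_SIZE` are hypothetical and chosen only for this illustration. The branch depends on a kernel argument, so all threads in the block agree on it. The divergent variant appears only as a comment because it must not be run.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // assumed block size for this sketch

// Safe: the condition depends only on a kernel argument, so every thread in
// the block takes the same branch and either all reach the barrier or none do.
__global__ void uniformBarrier(int *out, int flag)
{
    __shared__ int cache[BLOCK_SIZE];
    cache[threadIdx.x] = threadIdx.x;

    if (flag) {            // same value for every thread in the block
        __syncthreads();   // barrier reached by the whole block
        // reading a neighbour's slot is safe only after the barrier
        out[blockIdx.x * blockDim.x + threadIdx.x] =
            cache[(threadIdx.x + 1) % BLOCK_SIZE];
    }
}

// Unsafe pattern, shown as a comment only:
//
//     if (threadIdx.x < 64) {
//         __syncthreads();   // only part of the block reaches the barrier
//     }

// Hypothetical host helper: launches one block and verifies the output.
bool runUniformBarrier()
{
    int *d_out = nullptr;
    int h_out[BLOCK_SIZE];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;

    uniformBarrier<<<1, BLOCK_SIZE>>>(d_out, 1);

    // cudaMemcpy waits for the kernel to finish before copying
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK_SIZE; ++i)
        if (h_out[i] != (i + 1) % BLOCK_SIZE) return false;
    return true;
}
```

Calling `runUniformBarrier()` from `main` should return `true` on a working device, since each thread reads the value its neighbour wrote before the barrier.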

An example of using __syncthreads() (source):

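// Assumed value: THREADS_PER_BLOCK must match the block size used at launch (blockDim.x).
#define THREADS_PER_BLOCK 128

__global__ void globFunction(int *arr, int N)
{
    __shared__ int local_array[THREADS_PER_BLOCK];  // per-block shared memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // ...calculate results (elided in the original; `results` is this thread's value)
    local_array[threadIdx.x] = results;

    // wait until every thread in the block has written to the shared cache
    __syncthreads();

    // read the result written by another thread of the same block
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];

    // write the value back to global memory; the bounds check comes after
    // the barrier so that every thread in the block still reaches it
    if (idx < N)
        arr[idx] = val;
}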

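To synchronize all threads in the grid, there was no native API call when this answer was written (newer CUDA releases, starting with CUDA 9, offer grid-wide synchronization through cooperative groups). One way to synchronize threads at grid level is to use consecutive kernel calls: a kernel launched into the same stream starts only after the previous one has finished, so all threads end and restart from the same point and are therefore synchronized. This is commonly called CPU synchronization or implicit synchronization.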

An example of this technique (source):

[Figure: CPU synchronization]
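As a rough sketch of the consecutive-kernel-launch technique, the work can be split into two kernels so that the launch boundary acts as the grid-wide barrier. The kernel names `phaseOne` and `phaseTwo`, the helper `runTwoPhases`, and the sizes used are assumptions made for this illustration. They are not taken from the original source.

```cuda
#include <cuda_runtime.h>

// Phase 1: every thread writes one element.
__global__ void phaseOne(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = idx;
}

// Phase 2: every thread reads an element written by a different block in
// phase 1. This is safe because phase 1 has completed for the whole grid.
__global__ void phaseTwo(const int *data, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = data[(idx + blockDim.x) % n];
}

// Hypothetical host helper: runs both phases and verifies the output.
bool runTwoPhases()
{
    constexpr int threads = 128;           // assumed launch configuration
    constexpr int blocks  = 4;
    constexpr int n       = threads * blocks;

    int *d_data = nullptr, *d_out = nullptr;
    int h_out[n];
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_data);
        return false;
    }

    // Both kernels go into the default stream, so phaseTwo does not start
    // until every block of phaseOne has finished.
    phaseOne<<<blocks, threads>>>(d_data, n);
    phaseTwo<<<blocks, threads>>>(d_data, d_out, n);

    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != (i + threads) % n) return false;
    return true;
}
```

Splitting at the kernel boundary costs a launch per synchronization point and loses the contents of shared memory and registers, so anything needed by the second phase has to go through global memory.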

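Regarding the second question: yes, the specified amount of shared memory is declared once per block. The declaration is not something each thread executes on arrival; the storage is reserved for the block when it is launched, so every thread in the block sees the same array regardless of the order in which threads reach that line. Keep in mind that the amount of available shared memory is measured per SM, so one should be careful about how shared memory usage combines with the launch configuration.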
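To see how per-block shared memory relates to the per-SM budget on a given device, the runtime can be queried. The following is a minimal sketch: the kernel `usesSharedMem` and the helper `reportSharedMemory` are hypothetical names, and a block size of 128 threads on device 0 is assumed.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Kernel with the same static declaration as in the question:
// 128 floats = 512 bytes of shared memory per block.
__global__ void usesSharedMem(float *out)
{
    __shared__ float srdMem[128];
    srdMem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = srdMem[(threadIdx.x + 1) % 128];
}

// Hypothetical host helper: prints the shared memory limits of device 0 and
// how many blocks of this kernel can be resident on one SM.
bool reportSharedMemory()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, usesSharedMem) != cudaSuccess) return false;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, usesSharedMem, 128 /* threads per block */,
            0 /* dynamic shared memory bytes */) != cudaSuccess)
        return false;

    printf("static shared memory used by kernel: %zu bytes per block\n",
           attr.sharedSizeBytes);
    printf("shared memory limit per block:       %zu bytes\n",
           prop.sharedMemPerBlock);
    printf("shared memory available per SM:      %zu bytes\n",
           prop.sharedMemPerMultiprocessor);
    printf("resident blocks per SM (128 threads): %d\n", blocksPerSM);

    // The kernel must need at least the 128 floats it declares, and no more
    // than the per-block limit.
    return attr.sharedSizeBytes >= 128 * sizeof(float) &&
           attr.sharedSizeBytes <= prop.sharedMemPerBlock;
}
```

The printed numbers differ between GPUs. The general pattern is that the more shared memory each block requests, the fewer blocks can be resident on an SM at once.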

Regarding "cuda - Does __syncthreads() synchronize all threads in the grid?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/15240432/
