cuda - How does the CUDA compiler know the divergence behavior of warps?

The CUDA Programming Guide (v4.1) says the following about predicated instructions in section 5.4.2:

The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is less or equal to a certain threshold: If the compiler determines that the condition is likely to produce many divergent warps, this threshold is 7, otherwise it is 4.

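How can a single condition produce many divergent warps? A given condition can only split one warp into two parts. What does "many" mean here?
Even if the above makes sense, how can the compiler know the run-time divergence behavior of warps?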

Best answer

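Warps never "split". They either need "conditional execution" (that is, masking off the threads that do not participate) to service code paths on which the condition differs between threads, or they do not.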

As for how one condition can produce multiple divergent warps, consider the following contrived example:

if (threadIdx.x < 128) {
   // Only first four warps process here
   int modthirtytwo = threadIdx.x % 32;

   if (modthirtytwo == 0) {
      // Action A only first thread in the warp
   } else {
      // Action B for the other threads in the warp
   }
}

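Here the code can produce multiple divergent warps, and the compiler should be able to model that behavior at compile time. It helps further if launch bounds are specified to the compiler for the kernel. Contrast this with a shared memory reduction that uses only one warp: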

if (threadIdx.x < 32) {
   if (threadIdx.x < 16)  shm[threadIdx.x] += shm[threadIdx.x+16];
   if (threadIdx.x < 8)   shm[threadIdx.x] += shm[threadIdx.x+8];
   if (threadIdx.x < 4)   shm[threadIdx.x] += shm[threadIdx.x+4];
   if (threadIdx.x < 2)   shm[threadIdx.x] += shm[threadIdx.x+2];
   if (threadIdx.x == 0)  shm[0] += shm[1];
}

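Here the divergence is confined to a single warp per block. What the passage means is that the compiler's behavior may differ between the two cases.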
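To make the contrast between the two snippets concrete, here is a minimal host-side sketch in plain C++ (not device code) that counts how many warps of one block would diverge on a branch. It assumes a one-dimensional block and a warp size of 32; the names `count_divergent_warps`, `divergent_warps_nested`, and `divergent_warps_reduction` are hypothetical helpers made up for this illustration, and nothing here reflects how the compiler actually performs its analysis.

```cpp
#include <cassert>

// Host-side model only: this is plain C++, not device code.
constexpr int kWarpSize = 32;  // assumed warp size

// Counts the warps of one 1D block in which the threads that reach a branch
// (reaches(tid) is true) disagree on the branch condition (cond(tid)).
template <typename Reaches, typename Cond>
int count_divergent_warps(int block_size, Reaches reaches, Cond cond) {
    assert(block_size > 0);
    int divergent = 0;
    for (int first = 0; first < block_size; first += kWarpSize) {
        bool taken = false, not_taken = false;
        for (int tid = first; tid < first + kWarpSize && tid < block_size; ++tid) {
            if (!reaches(tid)) continue;  // thread never gets to this branch
            if (cond(tid)) taken = true; else not_taken = true;
        }
        if (taken && not_taken) ++divergent;  // both paths needed in this warp
    }
    return divergent;
}

// Inner branch of the first example:
// (threadIdx.x % 32 == 0), reached only when (threadIdx.x < 128).
int divergent_warps_nested(int block_size) {
    return count_divergent_warps(block_size,
        [](int tid) { return tid < 128; },
        [](int tid) { return tid % 32 == 0; });
}

// First step of the reduction:
// (threadIdx.x < 16), reached only when (threadIdx.x < 32).
int divergent_warps_reduction(int block_size) {
    return count_divergent_warps(block_size,
        [](int tid) { return tid < 32; },
        [](int tid) { return tid < 16; });
}
```

For a 256-thread block, `divergent_warps_nested(256)` returns 4 and `divergent_warps_reduction(256)` returns 1. Both conditions depend only on `threadIdx.x` and constants, which is why this kind of counting is possible without running the kernel.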

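It seems that the "new" compiler (which has already been used for OpenCL for a couple of years) has a heuristic for how many predicated instructions it will use before branching becomes more economical. A large number of branches in the instruction pipeline also appears to be bad for performance, so when the compiler can work out that the code will produce a higher "branch density", it prefers more predicated instructions over branches.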
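As a rough illustration of the trade-off, the sketch below models the two forms in plain host-side C++. Real predication happens at the instruction level (PTX guards an instruction with a predicate such as `@p`), so this only mimics the effect: the branching form runs one path, while the predicated form evaluates both short paths and lets the predicate select the result. The function names and the arithmetic standing in for "Action A" and "Action B" are hypothetical.

```cpp
// Branching form: each thread executes only the path its condition selects.
int with_branch(int tid, int x) {
    if (tid % 32 == 0) {
        x = x * 2 + 1;   // stand-in for Action A
    } else {
        x = x - 3;       // stand-in for Action B
    }
    return x;
}

// Predicated form: both short paths are evaluated by every thread and the
// predicate decides which result is kept, so no jump is needed.
int with_predication(int tid, int x) {
    const bool p = (tid % 32 == 0);
    const int a = x * 2 + 1;  // Action A, computed unconditionally
    const int b = x - 3;      // Action B, computed unconditionally
    return p ? a : b;
}
```

Both functions return the same value for every input. The predicated form does extra work per thread, which is why it only pays off when the guarded paths are short, in line with the instruction-count thresholds quoted from the programming guide.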

Regarding "cuda - How does the CUDA compiler know the divergence behavior of warps?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10119796/
