cuda - How to understand "All threads in a warp execute the same instruction at the same time." in a GPU?

Tags: cuda nvidia gpu multiple-gpu

I am reading Professional CUDA C Programming, and in the GPU Architecture Overview section it says:

CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.

The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.

The first paragraph says "All threads in a warp execute the same instruction at the same time.", while the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior.". This confuses me: the two statements seem to contradict each other. Can anyone explain?

Best Answer

There is no contradiction. All threads in a warp always execute the same instruction in lock-step. To support conditional execution and branching, CUDA introduces two concepts into the SIMT model:

  1. Predicated execution (see here)
  2. Instruction replay/serialization (see here)

Predicated execution means that the result of a conditional instruction can be used to mask off threads so that they do not execute subsequent instructions, without any branch. Instruction replay is how classic conditional branching is handled: all threads execute every branch of the conditional code by replaying the instructions, and threads that do not follow a particular execution path are masked off and effectively perform a NOP. This is the so-called branch divergence penalty in CUDA, and it can have a significant impact on performance.
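As a minimal sketch of the divergence described above (kernel and variable names are illustrative, not from the book): when a branch condition depends on the lane index, the hardware serializes the two paths, executing each branch body for the whole warp while masking off the threads on the inactive side.

```cuda
#include <cstdio>

// Illustrative kernel: within a single 32-thread warp, lanes 0-15 take
// one path and lanes 16-31 the other. In lock-step SIMT execution, the
// warp runs BOTH branch bodies in sequence; in each pass, the threads
// on the inactive side are masked off (effectively a NOP).
__global__ void divergentKernel(int *out)
{
    int tid = threadIdx.x;
    if (tid < 16) {
        out[tid] = tid * 2;      // pass 1: lanes 0-15 active, 16-31 masked
    } else {
        out[tid] = tid + 100;    // pass 2: lanes 16-31 active, 0-15 masked
    }
    // The warp reconverges here and resumes full lock-step execution.
}

int main()
{
    int h_out[32];
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));

    divergentKernel<<<1, 32>>>(d_out);   // one block of exactly one warp

    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("lane 0 -> %d, lane 31 -> %d\n", h_out[0], h_out[31]);

    cudaFree(d_out);
    return 0;
}
```

Note that the divergence penalty arises only because `tid < 16` splits threads *within* a warp; a condition that is uniform across the warp (for example, one based on `blockIdx.x`) does not diverge.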

This is how lock-step execution supports branching.

A similar question about cuda - How to understand "All threads in a warp execute the same instruction at the same time." in a GPU? can be found on Stack Overflow: https://stackoverflow.com/questions/41009824/
