CUDA - Multiprocessors, warp size and maximum threads per block: What is the exact relationship?

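I know that CUDA GPUs have multiprocessors that contain CUDA cores. In my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors and has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, which work on exactly the same code in the same warp. Finally, the maximum number of threads per block is 1024.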

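My question is how exactly the block size, the multiprocessor count, and the warp size are related. Let me describe my understanding of the situation: for example, I allocate N blocks with the maximum threadPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case, 16 of the N blocks are assigned to different multiprocessors. Each block contains 1024 threads, and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor. The threads in the same multiprocessor (warp) process the same line of code and use the shared memory of the current multiprocessor. If the current 32 threads encounter an off-chip operation such as a memory read or write, they are replaced with another group of 32 threads from the current block. So there are actually 32 threads in a single block that are running in parallel on a multiprocessor at any given time, not all 1024. Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of N thread blocks is plugged into the current multiprocessor. And finally, there are a total of 512 threads running in parallel on the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor, then it is divided to work on two multiprocessors, but let's assume that each block fits into a single multiprocessor in our case.)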

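So, is my model of CUDA parallel execution correct? If not, what is wrong or missing? I want to fine-tune the project I am currently working on, so I need the most accurate working model of the whole thing.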

Best answer

In my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors and has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, which work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

A GTX 590 contains twice the numbers you mentioned, since there are 2 GPUs on the card. Below, I focus on a single chip.

Let me describe my understanding of the situation: for example, I allocate N blocks with the maximum threadPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case, 16 of the N blocks are assigned to different multiprocessors.

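Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while others sit idle. I don't know why.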

Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor.

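The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM: the ones that handle single-precision floating point and most 32-bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction currently being processed is a load/store, it must be scheduled twice. And there are only 4 special function units, which do things such as trigonometry, so these instructions must be scheduled 32 / 4 = 8 times.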
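The scheduling counts in the paragraph above follow from a ceiling division of the warp size by the number of available units. A minimal host-side C++ sketch of that arithmetic, using only the unit counts stated in the answer; the helper name `issue_passes` is hypothetical, and the model is simplified (it ignores how the warp schedulers actually dispatch work):

```cpp
#include <cassert>

// Per-SM figures as stated in the answer above.
const int kWarpSize = 32;
const int kAluUnits = 32;             // "basic" ALUs
const int kLoadStoreUnits = 16;       // load/store units
const int kSpecialFunctionUnits = 4;  // special function units (trigonometry etc.)

// Hypothetical helper: number of passes needed to issue one warp-wide
// instruction when only `units` lanes of the required unit exist per SM.
inline int issue_passes(int warp_size, int units) {
    assert(warp_size > 0 && units > 0);
    return (warp_size + units - 1) / units;  // ceiling division
}
```

Under these numbers, `issue_passes(kWarpSize, kAluUnits)` is 1, `issue_passes(kWarpSize, kLoadStoreUnits)` is 2, and `issue_passes(kWarpSize, kSpecialFunctionUnits)` is 8, matching the 32 / 4 = 8 in the answer.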

The threads in the same multiprocessor (warp) process the same line of the code and use shared memory of the current multiprocessor.

No, there can be many more than 32 threads "in flight" at the same time in a single SM.
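To put a rough number on "more than 32", here is a sketch in host-side C++ that counts resident threads per SM from the warp and block limits alone. The limits in `kAssumedFermi` (48 resident warps and 8 resident blocks per SM) are an assumption based on compute capability 2.0 and do not come from the answer; check the figures for your own device. Register and shared memory limits are ignored here.

```cpp
#include <algorithm>
#include <cassert>

// ASSUMED per-SM limits for a compute capability 2.0 device.
// These values are not from the answer; verify them for your GPU.
struct SmLimits {
    int warp_size;          // threads per warp
    int max_warps_per_sm;   // resident warps per SM
    int max_blocks_per_sm;  // resident blocks per SM
};

const SmLimits kAssumedFermi = {32, 48, 8};

// Warps needed to hold one block (a partially filled warp still counts).
inline int warps_per_block(int threads_per_block, int warp_size) {
    assert(threads_per_block > 0 && warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Resident threads on one SM when only the warp and block limits apply.
inline int resident_threads_per_sm(const SmLimits& sm, int threads_per_block) {
    int warps = warps_per_block(threads_per_block, sm.warp_size);
    int blocks = std::min(sm.max_warps_per_sm / warps, sm.max_blocks_per_sm);
    return blocks * threads_per_block;
}
```

With these assumed limits, a 1024-thread block leaves room for only one block per SM (1024 resident threads), while 256-thread blocks allow six blocks (1536 resident threads). Either way, far more than 32 threads are resident at once.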

If the current 32 threads encounter an off-chip operation like memory read-writes, they are replaced with another group of 32 threads from the current block. So, there are actually 32 threads in a single block which are exactly running in parallel on a multiprocessor at any given time, not the whole of the 1024.

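No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps are swapped in whenever a data dependency arises on a value that is still in the pipeline. So, if the code contains two instructions where the second one uses the output of the first, the warp is put on hold while the value from the first instruction makes its way through the pipeline.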
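A back-of-the-envelope way to see why this matters: if every instruction in a warp depends on the previous one, the scheduler needs roughly enough other ready warps to fill the pipeline latency. The sketch below (host-side C++) is a simplified model, not a description of the actual scheduler; the helper name and the cycle counts used with it are illustrative assumptions.

```cpp
#include <cassert>

// Hypothetical helper: how many ready warps keep the pipeline busy when each
// warp must wait `latency_cycles` for its previous result and issuing one
// warp instruction occupies the scheduler for `issue_cycles` cycles.
inline int warps_to_hide_latency(int latency_cycles, int issue_cycles) {
    assert(latency_cycles > 0 && issue_cycles > 0);
    return (latency_cycles + issue_cycles - 1) / issue_cycles;  // ceiling division
}
```

For example, assuming a 20-cycle pipeline and one issue cycle per warp instruction, `warps_to_hide_latency(20, 1)` gives 20 warps, i.e. 640 threads that have to be resident on the SM just to cover arithmetic dependencies.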

Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor.

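A multiprocessor can process many blocks at a time, but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.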
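The core of that calculation is a minimum over per-SM resource limits. The following host-side C++ sketch shows the idea only; it is not the Occupancy Calculator itself. The limits in `kAssumedSm` are assumed compute capability 2.0 values, the two example kernels are hypothetical, and allocation granularity is ignored, so real figures can be lower.

```cpp
#include <algorithm>
#include <cassert>

// ASSUMED per-SM resources (compute capability 2.0 values, not from the answer).
struct SmResources {
    int registers;         // 32-bit registers per SM
    int shared_mem_bytes;  // shared memory per SM
    int max_threads;       // resident threads per SM
    int max_blocks;        // resident blocks per SM
};

const SmResources kAssumedSm = {32768, 49152, 1536, 8};

// Hypothetical per-kernel usage, e.g. as reported by the compiler.
struct KernelUsage {
    int threads_per_block;
    int registers_per_thread;
    int shared_mem_per_block;  // bytes
};

const KernelUsage kLightKernel = {256, 20, 4096};  // hypothetical example
const KernelUsage kHeavyKernel = {1024, 63, 0};    // hypothetical example

// Resident blocks per SM: the tightest of the four limits wins.
inline int blocks_per_sm(const SmResources& sm, const KernelUsage& k) {
    assert(k.threads_per_block > 0 && k.registers_per_thread > 0);
    int by_threads = sm.max_threads / k.threads_per_block;
    int by_regs = sm.registers / (k.registers_per_thread * k.threads_per_block);
    int by_smem = k.shared_mem_per_block > 0
                      ? sm.shared_mem_bytes / k.shared_mem_per_block
                      : sm.max_blocks;
    return std::min(std::min(by_threads, by_regs),
                    std::min(by_smem, sm.max_blocks));
}
```

Under these assumptions `blocks_per_sm(kAssumedSm, kLightKernel)` is 6 (limited by both threads and registers), while `blocks_per_sm(kAssumedSm, kHeavyKernel)` is 0, meaning not even one block of that kernel fits on an SM.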

And finally there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor then it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)

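No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block of your kernel, you will get a kernel launch error and your program won't run at all.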
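A pre-launch sanity check along these lines can be sketched in host-side C++. The 1024-thread limit comes from the question; the register and shared memory limits are assumed compute capability 2.0 values, and the enum and helper names are hypothetical. In a real program the CUDA runtime makes this decision and reports a launch error (typically `cudaErrorLaunchOutOfResources`).

```cpp
#include <cassert>

// Hypothetical result type naming the first limit a block exceeds.
enum FitResult { kFits, kTooManyThreads, kTooManyRegisters, kTooMuchSharedMemory };

const int kMaxThreadsPerBlock = 1024;   // from the question
const int kRegistersPerSm = 32768;      // ASSUMED, compute capability 2.0
const int kSharedMemPerBlock = 49152;   // ASSUMED, bytes

// A whole block must fit on one SM; it is never split across two.
inline FitResult check_block_fits(int threads_per_block,
                                  int registers_per_thread,
                                  int shared_mem_bytes) {
    assert(threads_per_block > 0 && registers_per_thread >= 0 &&
           shared_mem_bytes >= 0);
    if (threads_per_block > kMaxThreadsPerBlock) return kTooManyThreads;
    if (threads_per_block * registers_per_thread > kRegistersPerSm)
        return kTooManyRegisters;
    if (shared_mem_bytes > kSharedMemPerBlock) return kTooMuchSharedMemory;
    return kFits;
}
```

For instance, under these assumed limits a 1024-thread block at 32 registers per thread just fits, while the same block at 40 registers per thread needs 40960 registers and would fail to launch.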

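It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.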

See @harrism's answer to this question: CUDA: How many concurrent threads in total?
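The gap between the two definitions of "running" can be made concrete with a small host-side C++ sketch. The SM and core counts are from the question; the limit of 1536 resident threads per SM is an assumed compute capability 2.0 value, and the helper names are hypothetical.

```cpp
#include <cassert>

const int kSmCount = 16;                   // from the question (one chip)
const int kCoresPerSm = 32;                // from the question
const int kAssumedMaxThreadsPerSm = 1536;  // ASSUMED, compute capability 2.0

// Threads that can have an ALU instruction executing at the same instant.
inline int alu_lanes(int sm_count, int cores_per_sm) {
    assert(sm_count > 0 && cores_per_sm > 0);
    return sm_count * cores_per_sm;
}

// Threads that can be resident (holding registers, waiting or executing).
inline int max_resident_threads(int sm_count, int max_threads_per_sm) {
    assert(sm_count > 0 && max_threads_per_sm > 0);
    return sm_count * max_threads_per_sm;
}
```

With these numbers, `alu_lanes(kSmCount, kCoresPerSm)` is 512, while `max_resident_threads(kSmCount, kAssumedMaxThreadsPerSm)` is 24576, which is the sense in which many more than 512 threads occupy the chip at once.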

Original question on Stack Overflow: https://stackoverflow.com/questions/11564608/
