cuda - How do concurrent blocks run on a single GPU streaming multiprocessor?

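I have been studying the CUDA programming structure, and my understanding so far is this: once blocks and threads are created, each block is assigned to a streaming multiprocessor (for example, I am using a GeForce 560 Ti, which has 14 streaming multiprocessors, so 14 blocks can be assigned across all the streaming multiprocessors at a time). But when I went through some online material, such as this: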

http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf

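it says that several blocks can run concurrently on one multiprocessor. I am basically very confused about how threads and blocks execute on the streaming multiprocessors. I know that the assignment of blocks and the execution order of threads are arbitrary, but I would like to know how blocks and threads are actually mapped so that concurrent execution can happen.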

Best answer

A streaming multiprocessor (SM) can execute more than one block at a time by using hardware multithreading, a mechanism akin to Hyper-Threading.

The CUDA C Programming Guide describes this in section 4.2:

4.2 Hardware Multithreading

The execution context (program counters, registers, etc) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.

The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
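The resource limits described in the quoted section can be sketched as a small host-side calculation. This is a simplified model in plain C++, not part of the CUDA runtime: the names `SmLimits`, `KernelUsage`, and `residentBlocksPerSm` are hypothetical, the default limits are illustrative values for a compute capability 2.x device, and the model ignores the register and shared memory allocation granularity that real hardware applies. On an actual device, the runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the more reliable way to obtain this number.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits. The defaults are illustrative values for a compute
// capability 2.x device (an assumption; look up the real device's limits).
struct SmLimits {
    int maxResidentBlocks = 8;
    int maxResidentWarps  = 48;
    int registers         = 32768;  // 32-bit registers per SM
    int sharedMemBytes    = 49152;  // 48 KB of shared memory per SM
    int warpSize          = 32;
};

// Resources a hypothetical kernel uses per block and per thread.
struct KernelUsage {
    int threadsPerBlock;
    int registersPerThread;
    int sharedMemPerBlock;  // bytes
};

// Returns how many blocks of the kernel can be resident on one SM at the
// same time under this simplified model. A result of 0 means not even one
// block fits, which corresponds to the launch failure the guide mentions.
int residentBlocksPerSm(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threadsPerBlock > 0 && k.registersPerThread > 0);
    assert(k.sharedMemPerBlock >= 0);

    // Threads are scheduled in whole warps, so round up.
    int warpsPerBlock = (k.threadsPerBlock + sm.warpSize - 1) / sm.warpSize;

    // Limit imposed by the maximum number of resident warps.
    int byWarps = sm.maxResidentWarps / warpsPerBlock;

    // Limit imposed by the register file, which is partitioned among warps.
    int regsPerBlock = warpsPerBlock * sm.warpSize * k.registersPerThread;
    int byRegs = sm.registers / regsPerBlock;

    // Limit imposed by shared memory, which is partitioned among blocks.
    int byShared = k.sharedMemPerBlock > 0
                       ? sm.sharedMemBytes / k.sharedMemPerBlock
                       : sm.maxResidentBlocks;

    // The tightest limit wins.
    return std::min({sm.maxResidentBlocks, byWarps, byRegs, byShared});
}
```

Under these assumed limits, a kernel launched with 256 threads per block, 20 registers per thread, and 4 KB of shared memory per block gives `residentBlocksPerSm(SmLimits{}, KernelUsage{256, 20, 4096}) == 6`: both the warp limit (48 / 8) and the register limit (32768 / 5120) allow six blocks, so six blocks share one SM concurrently.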

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/12212003/
