cuda - What are static and dynamic scheduling on a GPU?

Tags: cuda gpu nvidia

The GTX 4xx and 5xx (Fermi) used dynamic scheduling, while the GTX 6xx (Kepler) switched to static scheduling.

  • What are static and dynamic scheduling in the context of a GPU?
  • How does the static vs. dynamic design choice affect performance on real-world compute workloads?
  • Is there anything that can be done in code to optimize for a static or dynamic scheduling algorithm?

Best answer

I assume you are referring to static/dynamic instruction scheduling in hardware.

Dynamic instruction scheduling means the processor can reorder individual instructions at runtime. This usually involves some hardware that tries to predict the best order for whatever is in the instruction pipeline. On the GPUs you mention, this refers to reordering the instructions within each individual warp.
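As a toy illustration (not a model of NVIDIA's actual hardware), dynamic scheduling can be sketched as a scoreboard that records when each register becomes valid, plus issue logic that each cycle picks any instruction whose sources are ready, regardless of program order. All instruction names and latencies below are made up:

```python
def dynamic_issue(instrs):
    """instrs: list of (dst, srcs, latency) in program order.
    Returns the issue order as a list of (cycle, dst)."""
    ready = {}               # scoreboard: register -> cycle its value is valid
    pending = list(instrs)
    order = []
    cycle = 0
    while pending:
        # Issue logic: scan the window for any instruction whose
        # source registers are all marked ready on the scoreboard.
        for i, (dst, srcs, lat) in enumerate(pending):
            if all(ready.get(s, 0) <= cycle for s in srcs):
                ready[dst] = cycle + lat   # result valid after `lat` cycles
                order.append((cycle, dst))
                del pending[i]
                break                      # one issue slot per cycle
        cycle += 1
    return order

# r1 is a long-latency op; r2 depends on r1; r3 is independent.
program = [("r1", ["r0"], 4), ("r2", ["r1"], 1), ("r3", ["r0"], 1)]
print(dynamic_issue(program))
# → [(0, 'r1'), (1, 'r3'), (4, 'r2')]
```

Note that the independent r3 is issued ahead of r2, which must wait until r1's result is valid: that reordering decision is made at runtime by the hardware, which is what makes this scheme "dynamic".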

The reason for switching from a dynamic scheduler back to a static one is described in the GK110 Architecture Whitepaper, as follows:

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

  • Register scoreboarding for long latency operations (texture and load)

  • Inter‐warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)

  • Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi‐port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.

For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power‐expensive blocks with a simple hardware block that extracts the pre‐determined latency information and uses it to mask out warps from eligibility at the inter‐warp scheduler stage.

So basically, they are trading chip complexity (i.e. a simpler scheduler) for efficiency. But any potentially lost efficiency is picked up again by the compiler, which can predict the best instruction order, at least for the math pipeline.
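The Kepler-style approach quoted above can be sketched the same way, again as a toy model with made-up instructions and latencies: because math pipeline latencies are deterministic, a "compiler" pass can compute up front how many cycles each instruction must stall and encode that in the instruction itself, so the "hardware" needs no scoreboard or dependency checker and simply counts down the pre-determined stalls in program order:

```python
def compile_stalls(instrs):
    """Compiler pass: instrs is a list of (dst, srcs, latency) in
    program order. Returns (dst, latency, stall) where `stall` is the
    pre-computed number of cycles to wait before issuing."""
    ready = {}          # register -> cycle its value becomes valid
    cycle = 0
    out = []
    for dst, srcs, lat in instrs:
        earliest = max([ready.get(s, 0) for s in srcs] + [cycle])
        out.append((dst, lat, earliest - cycle))
        cycle = earliest + 1           # issuing takes one cycle
        ready[dst] = earliest + lat
    return out

def static_issue(compiled):
    """Hardware side: issue strictly in order, honouring only the
    encoded stall counts -- no dependency checking at runtime."""
    cycle = 0
    order = []
    for dst, lat, stall in compiled:
        cycle += stall
        order.append((cycle, dst))
        cycle += 1
    return order

# r1 is long-latency; r2 depends on r1; r3 is independent.
program = [("r1", ["r0"], 4), ("r2", ["r1"], 1), ("r3", ["r0"], 1)]
print(static_issue(compile_stalls(program)))
# → [(0, 'r1'), (4, 'r2'), (5, 'r3')]  -- in-order: r3 waits behind r2

# The compiler can recover the lost efficiency by reordering the
# independent r3 ahead of the stalled r2 before encoding the stalls:
reordered = [program[0], program[2], program[1]]
print(static_issue(compile_stalls(reordered)))
# → [(0, 'r1'), (1, 'r3'), (4, 'r2')]
```

The second run shows the point made above: with deterministic latencies, a good compile-time ordering achieves the same issue schedule that the dynamic hardware would have found at runtime, without paying for the scoreboard and dependency-checker logic in silicon.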

As for your last question, i.e. what can be done in code to optimize for a static or dynamic scheduling algorithm: my personal suggestion is to forget about inline assembler and simply let the compiler/scheduler do its thing.

A similar question about "cuda - What are static and dynamic scheduling on a GPU?" can be found on Stack Overflow: https://stackoverflow.com/questions/14665432/
