c++ - Should I pool CUDA streams?

Tags: c++ parallel-processing cuda stream pool

How lightweight are creating and destroying a CUDA stream? For CPU threads, for example, these operations are expensive, which is why CPU threads are usually pooled. Should I pool CUDA streams as well, or is it fast enough to create a stream every time I need one and destroy it afterwards? A sketch of the create-on-demand pattern follows.
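For reference, the create-on-demand pattern being asked about looks roughly like this; the kernel name `myKernel` and the buffer `d_data` are hypothetical placeholders, not anything from the original question.

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float* data, int n);   // hypothetical kernel, defined elsewhere

void launchWork(float* d_data, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);                  // create a fresh stream for this call
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaStreamSynchronize(stream);              // wait for the enqueued work
    cudaStreamDestroy(stream);                  // then destroy the stream right away
}
```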

Best answer

NVIDIA's guidance is that you should pool CUDA streams. Here is the comment straight from the horse's mouth, https://github.com/pytorch/pytorch/issues/9646 :

There is a cost to creating, retaining, and destroying CUDA streams in PyTorch master. In particular:

  • Tracking CUDA streams requires atomic refcounting
  • Destroying a CUDA stream can (rarely) cause implicit device synchronization

The refcounting issue has been raised as a concern for expanding stream tracing to allow streaming backwards, for example, and it's clearly best to avoid implicit device synchronization as it causes an often unexpected performance degradation.

For static frameworks the recommended best practice is to create all the needed streams upfront and destroy them after the work is done. This pattern is not immediately applicable to PyTorch, but a per device stream pool would achieve a similar effect.
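Translated into code, the recommendation amounts to creating a fixed set of streams up front and reusing them for the lifetime of the program. Below is a rough per-device stream pool sketch along those lines; the class name, default pool size, and round-robin hand-out policy are illustrative choices of mine, not part of PyTorch or the CUDA API.

```cpp
#include <cuda_runtime.h>
#include <vector>

class StreamPool {
public:
    explicit StreamPool(int num_streams = 8) : streams_(num_streams), next_(0) {
        for (auto& s : streams_)
            cudaStreamCreate(&s);      // create all streams upfront
    }
    ~StreamPool() {
        for (auto& s : streams_)
            cudaStreamDestroy(s);      // destroy them only after the work is done
    }
    // Hand out streams round-robin; callers never create or destroy them.
    cudaStream_t get() { return streams_[next_++ % streams_.size()]; }

private:
    std::vector<cudaStream_t> streams_;
    std::size_t next_;
};
```

In this sketch there is no "return" step: callers simply treat the streams as long-lived and enqueue work onto whichever one `get()` hands back, so the creation and destruction costs are paid once rather than per launch.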

Regarding "c++ - Should I pool CUDA streams?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50895752/
