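How lightweight are the operations of creating and destroying CUDA streams? For CPU threads, for example, these operations are heavy, so CPU threads are usually pooled. Should I pool CUDA streams too? Or is it fast enough to create a stream each time I need one and destroy it afterwards?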
Best answer
NVIDIA's guidance is that you should pool CUDA streams. Here is a comment straight from the horse's mouth, https://github.com/pytorch/pytorch/issues/9646 :
There is a cost to creating, retaining, and destroying CUDA streams in PyTorch master. In particular:
- Tracking CUDA streams requires atomic refcounting
- Destroying a CUDA stream can (rarely) cause implicit device synchronization
The refcounting issue has been raised as a concern for expanding stream tracing to allow streaming backwards, for example, and it's clearly best to avoid implicit device synchronization as it causes an often unexpected performance degradation.
For static frameworks the recommended best practice is to create all the needed streams upfront and destroy them after the work is done. This pattern is not immediately applicable to PyTorch, but a per device stream pool would achieve a similar effect.
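As a rough illustration of the "create all the needed streams upfront and destroy them after the work is done" pattern, here is a minimal sketch of a fixed-size pool. The names `StreamPool`, `createCudaStream` and `destroyCudaStream` are hypothetical helpers, not part of CUDA or PyTorch. The round-robin hand-out policy, the `cudaStreamNonBlocking` flag and the lack of thread safety are assumptions made for brevity. The pool is templated on the handle type so the sketch also compiles on a machine without the CUDA toolkit; the CUDA-specific part is only compiled when `cuda_runtime.h` is found.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper: creates all handles up front, hands them out
// round-robin, and destroys them only when the pool itself is destroyed.
template <typename Stream>
class StreamPool {
public:
    using Create = std::function<Stream()>;
    using Destroy = std::function<void(Stream)>;

    // Note: if create() throws part-way through, handles created so far are
    // not released; a production version would clean up on partial failure.
    StreamPool(std::size_t count, Create create, Destroy destroy)
        : destroy_(std::move(destroy)) {
        assert(count > 0 && "pool needs at least one stream");
        streams_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            streams_.push_back(create());
        }
    }

    ~StreamPool() {
        // All destruction happens here, once, after the work is done.
        for (Stream s : streams_) {
            destroy_(s);
        }
    }

    StreamPool(const StreamPool&) = delete;
    StreamPool& operator=(const StreamPool&) = delete;

    // Returns the next stream in round-robin order. Not thread-safe.
    Stream next() {
        Stream s = streams_[cursor_];
        cursor_ = (cursor_ + 1) % streams_.size();
        return s;
    }

    std::size_t size() const { return streams_.size(); }

private:
    std::vector<Stream> streams_;
    Destroy destroy_;
    std::size_t cursor_ = 0;
};

#if defined(__has_include)
#if __has_include(<cuda_runtime.h>)
#include <cuda_runtime.h>

// CUDA binding (assumed policy: non-blocking streams on the current device).
inline cudaStream_t createCudaStream() {
    cudaStream_t s = nullptr;
    cudaError_t err = cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    if (err != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(err));
    }
    return s;
}

inline void destroyCudaStream(cudaStream_t s) {
    cudaStreamDestroy(s);  // error code ignored: called from a destructor
}
#endif
#endif
```

With the toolkit available, usage might look like `StreamPool<cudaStream_t> pool(4, createCudaStream, destroyCudaStream);` followed by `kernel<<<grid, block, 0, pool.next()>>>(...)` for each piece of work. Since streams belong to the device that was current when they were created, a per-device pool, as the quoted comment suggests, would mean one such object per device.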
Regarding "c++ - Should I pool CUDA streams?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50895752/