nvidia - Are CUDA kernel calls synchronous or asynchronous?

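I have read that kernel launches can be used to synchronize different blocks: if I want all blocks to finish operation 1 before any of them starts operation 2, I should put operation 1 in one kernel and operation 2 in another. That way I get global synchronization across blocks. However, the CUDA C Programming Guide says that kernel calls are asynchronous, i.e. the CPU does not wait for the first kernel call to finish, so it can issue the second kernel call before the first kernel has completed. If that is true, then kernel launches cannot be used to synchronize blocks. Please let me know where I am going wrong.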
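The two-kernel pattern described above can be sketched roughly as follows. This is only an illustration: the kernel names `op1` and `op2` and the helper `run_two_phase` are hypothetical, and the sketch assumes both kernels are issued to the same (default) stream, where the device runs work in issue order even though each launch returns to the host right away.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Operation 1 (hypothetical): every thread writes one element.
__global__ void op1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Operation 2 (hypothetical): every thread reads an element that was
// written by a different thread, usually in a different block, during op1.
__global__ void op2(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[n - 1 - i];
}

// Returns true if op2 observed all of op1's writes.
bool run_two_phase(int n, int threads) {
    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) { cudaFree(a); return false; }

    int blocks = (n + threads - 1) / threads;
    op1<<<blocks, threads>>>(a, n);     // host does not wait here
    op2<<<blocks, threads>>>(a, b, n);  // queued behind op1 in the same stream

    std::vector<int> host(n);
    // A blocking cudaMemcpy on the default stream waits for the earlier
    // kernels and returns only after the copy has finished.
    cudaError_t err = cudaMemcpy(host.data(), b, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != n - 1 - i) return false;
    return true;
}

int main() {
    bool ok = run_two_phase(1 << 20, 256);
    std::printf("%s\n", ok ? "op2 saw all of op1's writes" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The host-side asynchrony and the device-side ordering are separate things: the CPU may enqueue `op2` before `op1` has finished, but within one stream the GPU does not start `op2` until `op1` is complete.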

Best answer

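The accepted answer is not always correct.
In most cases a kernel launch is asynchronous. In the following cases, however, it is synchronous, and these cases are easy to overlook: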

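The environment variable CUDA_LAUNCH_BLOCKING is set to 1.

A profiler (nvprof) is in use and concurrent kernel profiling is not enabled.

A memcpy involves host memory that is not page-locked.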
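One way to see which mode a launch is running in is to start a kernel that spins for a while and immediately ask the stream whether it has finished. The following is a minimal sketch, not a definitive test: `spin` and `probe_launch` are hypothetical names and the spin length is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: spins for roughly `cycles` GPU clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns 1 if the launch returned while the kernel was still running
// (asynchronous), 0 if the kernel had already finished when the launch
// returned (synchronous), and -1 on a CUDA error.
int probe_launch(long long cycles) {
    spin<<<1, 1>>>(cycles);
    if (cudaGetLastError() != cudaSuccess) return -1;

    // cudaStreamQuery reports cudaErrorNotReady while work is still pending
    // in the stream and cudaSuccess once everything has completed.
    cudaError_t state = cudaStreamQuery(0);

    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    if (state == cudaErrorNotReady) return 1;
    if (state == cudaSuccess) return 0;
    return -1;
}

int main() {
    int r = probe_launch(200000000LL);  // arbitrary spin length
    if (r < 0) { std::printf("CUDA error\n"); return 1; }
    std::printf("kernel launch looked %s\n", r ? "asynchronous" : "synchronous");
    return 0;
}
```

Running the same binary normally and then with CUDA_LAUNCH_BLOCKING=1 in the environment would be expected to give different answers, since in the second case the launch itself should not return until the kernel is done.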

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.

From the NVIDIA CUDA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).
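The last point in the quoted passage, about page-locked memory, can be illustrated with a short sketch. It assumes the host buffer is allocated with cudaMallocHost so that cudaMemcpyAsync can actually overlap with host work; the `scale` kernel and the `pinned_round_trip` helper are hypothetical names used only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Copies a pinned host buffer to the device, scales it, and copies it back,
// all queued on one stream. Returns true if the result is as expected.
bool pinned_round_trip(int n) {
    size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaStream_t s;

    // Page-locked (pinned) host memory. A buffer from malloc/new would be
    // pageable, and the async copies below could then block the host.
    if (cudaMallocHost(&h, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { cudaFreeHost(h); return false; }
    if (cudaStreamCreate(&s) != cudaSuccess) { cudaFree(d); cudaFreeHost(h); return false; }

    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    scale<<<blocks, threads, 0, s>>>(d, n, 2.0f);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);

    // The host must not read h until the stream has drained.
    bool ok = (cudaStreamSynchronize(s) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        if (h[i] != 2.0f * static_cast<float>(i)) ok = false;

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    bool ok = pinned_round_trip(1 << 16);
    std::printf("%s\n", ok ? "round trip ok" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The explicit cudaStreamSynchronize is what makes the host wait here; nothing in the copies or the launch is meant to block on its own when the buffer is pinned.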

A similar question about whether CUDA kernel calls are synchronous or asynchronous can be found on Stack Overflow: https://stackoverflow.com/questions/8473617/
