CUDA context creation and resource association in runtime API applications

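I want to understand how a CUDA context is created and associated with a kernel in CUDA runtime API applications.

I know it is done under the hood by the driver API, but I would like to understand the timeline of its creation.

To start with, I know that cudaRegisterFatBinary is the first CUDA API call made, and that it registers a fatbin file with the runtime. It is followed by a handful of CUDA function registration APIs, which call cuModuleLoad in the driver layer. But if my CUDA runtime API application then invokes cudaMalloc, how is the pointer provided to this function associated with the context, which I believe should have been created beforehand? How does one get a handle to this already created context and associate future runtime API calls with it? Please demystify the internal workings.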

To quote NVIDIA's documentation:

CUDA Runtime API calls operate on the CUDA Driver API CUcontext which is bound to the current host thread.

If there exists no CUDA Driver API CUcontext bound to the current thread at the time of a CUDA Runtime API call which requires a CUcontext then the CUDA Runtime will implicitly create a new CUcontext before executing the call.

If the CUDA Runtime creates a CUcontext then the CUcontext will be created using the parameters specified by the CUDA Runtime API functions cudaSetDevice, cudaSetValidDevices, cudaSetDeviceFlags, cudaGLSetGLDevice, cudaD3D9SetDirect3DDevice, cudaD3D10SetDirect3DDevice, and cudaD3D11SetDirect3DDevice. Note that these functions will fail with cudaErrorSetOnActiveProcess if they are called when a CUcontext is bound to the current host thread.

The lifetime of a CUcontext is managed by a reference counting mechanism. The reference count of a CUcontext is initially set to 0, and is incremented by cuCtxAttach and decremented by cuCtxDetach.

If a CUcontext is created by the CUDA Runtime, then the CUDA runtime will decrement the reference count of that CUcontext in the function cudaThreadExit. If a CUcontext is created by the CUDA Driver API (or is created by a separate instance of the CUDA Runtime API library), then the CUDA Runtime will not increment or decrement the reference count of that CUcontext.

All CUDA Runtime API state (e.g, global variables' addresses and values) travels with its underlying CUcontext. In particular, if a CUcontext is moved from one thread to another (using cuCtxPopCurrent and cuCtxPushCurrent) then all CUDA Runtime API state will move to that thread as well.

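What I don't understand is how the CUDA runtime creates the context. Which API calls are used for this? Does the nvcc compiler insert API calls to do this at compile time, or is it done entirely at runtime? If the former is true, which runtime APIs are used for this context management? If the latter is true, how exactly is it done?

If a context is associated with a host thread, how do we get access to that context? Is it automatically associated with all the variables and pointer references the thread deals with?

And ultimately, how is module loading done in the context?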

Best answer

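The CUDA runtime maintains a global list of modules to load, and adds to that list every time a DLL or .so that uses the CUDA runtime is loaded into the process. But the modules are not actually loaded until a device is created.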

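Context creation and initialization is done "lazily" by the CUDA runtime: every time you call a function like cudaMemcpy(), it checks whether CUDA has been initialized, and if it hasn't, it creates a context (on the device previously specified by cudaSetDevice(), or on the default device if cudaSetDevice() was never called) and loads all the modules. The context is associated with that CPU thread from then on, until it is changed by cudaSetDevice().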
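The lazy-initialization pattern described above can be sketched as a toy model in plain C++. This is not the CUDA runtime's code and it calls no CUDA functions: ToyRuntime, ToyContext, register_module, set_device, api_call and the default device index 0 are all hypothetical names and assumptions, chosen only to mirror the steps in the paragraph. What happens to modules registered after initialization is not modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: none of these names exist in the CUDA runtime or driver.
struct ToyContext {
    int device = -1;                          // device the context lives on
    std::vector<std::string> loaded_modules;  // modules loaded into it
};

class ToyRuntime {
public:
    // Stands in for what happens when a DLL/.so using the runtime is loaded:
    // the module is only recorded in a list, not loaded yet.
    void register_module(const std::string& name) { pending_.push_back(name); }

    // Stands in for cudaSetDevice() called before initialization.
    void set_device(int device) { device_ = device; }

    // Stands in for any call that needs a context, such as cudaMemcpy().
    void api_call() { ensure_initialized(); }

    bool initialized() const { return initialized_; }
    const ToyContext& context() const { return ctx_; }

private:
    // The lazy step: its body runs only on the first call that needs a context.
    void ensure_initialized() {
        if (initialized_) return;
        ctx_.device = device_;           // chosen device, or the assumed default
        ctx_.loaded_modules = pending_;  // "load" everything registered so far
        initialized_ = true;
    }

    std::vector<std::string> pending_;   // global list of modules to load
    int device_ = 0;                     // assumed default device index
    ToyContext ctx_;
    bool initialized_ = false;
};

// Usage: registration alone creates nothing; the first api_call() does.
inline void demo_lazy_init() {
    ToyRuntime rt;
    rt.register_module("kernels_a");     // hypothetical module names
    rt.register_module("kernels_b");
    assert(!rt.initialized());           // still no context
    rt.set_device(1);
    rt.api_call();                       // first call triggers initialization
    assert(rt.initialized());
    assert(rt.context().device == 1);
    assert(rt.context().loaded_modules.size() == 2);
}
```

The point of the model is the ordering: registering modules and choosing a device only record intent, and the first call that needs a context does the actual work.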

You can use the context/thread management functions from the driver API, such as cuCtxPopCurrent()/cuCtxPushCurrent(), to use the context from a different thread.
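The idea of moving a context between threads can be illustrated with another toy model in plain C++. It assumes each thread keeps its own stack of current contexts, which is how the driver API documentation describes it, but it does not call the driver API: ToyCtx, toy_push_current, toy_pop_current, toy_get_current and use_context_in_worker are hypothetical stand-ins, not real CUDA functions.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Toy model only: these are not driver API types or functions.
struct ToyCtx { int id; };

// Each thread gets its own stack of "current" contexts.
thread_local std::vector<ToyCtx*> tls_ctx_stack;

// Stand-in for pushing a context: it becomes current on the calling thread.
void toy_push_current(ToyCtx* ctx) { tls_ctx_stack.push_back(ctx); }

// Stand-in for popping: detaches the current context from the calling thread.
ToyCtx* toy_pop_current() {
    if (tls_ctx_stack.empty()) return nullptr;
    ToyCtx* top = tls_ctx_stack.back();
    tls_ctx_stack.pop_back();
    return top;
}

// Returns the context current on the calling thread, or nullptr if none.
ToyCtx* toy_get_current() {
    return tls_ctx_stack.empty() ? nullptr : tls_ctx_stack.back();
}

// Makes ctx current in a worker thread, reads its id there, then detaches it.
// Returns the id the worker saw, or -1 if it saw no current context.
int use_context_in_worker(ToyCtx* ctx) {
    int seen = -1;
    std::thread worker([&seen, ctx] {
        toy_push_current(ctx);
        ToyCtx* cur = toy_get_current();
        seen = cur ? cur->id : -1;
        toy_pop_current();
    });
    worker.join();
    return seen;
}
```

In this model the owning thread pops the context first, hands the pointer to the worker, and the worker pushes it onto its own stack. Depending on the toolchain, building code that uses std::thread may require a flag such as -pthread.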

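You can call cudaFree(0); to force this lazy initialization to occur.

I'd strongly advise doing so at application initialization time, to avoid race conditions and undefined behavior. Go ahead and enumerate and initialize the devices as early as possible in your app; once that is done, in CUDA 4.0 you can call cudaSetDevice() from any CPU thread and it will select the corresponding context that was created by your initialization code.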

This question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/7534892/
