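c++ - Measuring device and host execution time in CUDA programming

I need to time the code that runs on the GPU, as well as the total running time (host and device). My code runs two GPU kernels, with a host for loop between them that copies data. The outline below shows what the code looks like: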

cuda event start

//FIRST kernel code call <<...>>

// cuda memory copy result back from device to host

cudaDeviceSynchronize()

// copy host data to host array (CPU function loop)

// cuda memory copy from host to device

// SECOND Kernel call <<...>>

cuda event stop

//memory copy back from device to host

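As far as I know, events are used to time kernels, and they measure precisely the time a kernel actually spends on the GPU. So my questions and goals are:

1- With the event calls placed as shown above, will they record only the kernels and ignore the host functions?

2- Will the host loop affect the CUDA event timing?

3- My goal is to measure the GPU time alone, and also the GPU+CPU time together. Will the above achieve that, or should I use clock_gettime(CLOCK_REALTIME, timer) to time the host?

Best answer

A sequence like this: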

float et;
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
kernel1<<<...>>>(...);
cudaDeviceSynchronize();
host_code_routine(...);
kernel2<<<...>>>(...);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&et, start, stop);

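will return in et the elapsed time as a float (in milliseconds), which is (approximately) the sum of:

1. the kernel1 execution time
2. the (host) execution time associated with host_code_routine
3. the kernel2 execution time

If you want only the sum of items 1 and 3 above, you need to bracket each kernel (and only the kernel) with cudaEvent timing, then add the two values together manually in host code.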
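The per-kernel bracketing described above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the kernel bodies, the launch configuration, the KernelTimes struct, and the function name time_kernels_separately are all hypothetical stand-ins. Each kernel gets its own start/stop event pair, so host work done between the two launches falls outside both measurements.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two kernels in the question.
__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}

__global__ void kernel2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] + 1.0f;
}

// Hypothetical result type: per-kernel times and their sum, in milliseconds.
struct KernelTimes {
    float kernel1_ms;
    float kernel2_ms;
    float gpu_total_ms;
};

KernelTimes time_kernels_separately(int n) {
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start1, stop1, start2, stop2;
    cudaEventCreate(&start1); cudaEventCreate(&stop1);
    cudaEventCreate(&start2); cudaEventCreate(&stop2);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // First event pair brackets kernel1 only.
    cudaEventRecord(start1);
    kernel1<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop1);
    cudaEventSynchronize(stop1);  // wait until kernel1 has finished

    // Any host-side work placed here lies outside both event pairs,
    // so it does not contribute to either measurement.

    // Second event pair brackets kernel2 only.
    cudaEventRecord(start2);
    kernel2<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop2);
    cudaEventSynchronize(stop2);  // wait until kernel2 has finished

    KernelTimes t{};
    cudaEventElapsedTime(&t.kernel1_ms, start1, stop1);
    cudaEventElapsedTime(&t.kernel2_ms, start2, stop2);
    t.gpu_total_ms = t.kernel1_ms + t.kernel2_ms;  // summed manually on the host

    cudaEventDestroy(start1); cudaEventDestroy(stop1);
    cudaEventDestroy(start2); cudaEventDestroy(stop2);
    cudaFree(d);
    return t;
}
```

Calling time_kernels_separately(1 << 20) from main and printing the three fields gives the two kernel times and their sum. Error checking of the CUDA calls is omitted here for brevity; real code should check each return value.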

To answer your questions, then:

1- With the event calls placed as shown above, will they record only the kernels and ignore the host functions?

No. The recording you describe captures both the host and the device elapsed time in the sequence.

2- Will the host loop affect the CUDA event timing?

Yes.

3- My goal is to measure the GPU time alone, and also the GPU+CPU time together. Will the above achieve that, or should I use clock_gettime(CLOCK_REALTIME, timer) to time the host?

If you want separate times and various sums, I suggest timing the kernels independently, timing the host code with some host-based method, and then combining the components in whatever way you wish.
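One possible host-based method is sketched below. It uses std::chrono::steady_clock in place of the clock_gettime call from the question, which is a substitution on my part; the kernel, the body of host_code_routine, the Timings struct, and the function name are hypothetical. Because kernel launches return to the host before the kernel finishes, the sketch calls cudaDeviceSynchronize() before reading the final timestamp so the total covers all GPU work.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for either kernel in the question.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}

// Hypothetical host loop that runs between the two kernel launches.
static void host_code_routine(std::vector<float> &v) {
    for (float &x : v) x += 1.0f;
}

// Hypothetical result type, both values in milliseconds.
struct Timings {
    double host_ms;   // host_code_routine only
    double total_ms;  // copies + both kernels + host routine
};

Timings time_host_and_total(int n) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    std::vector<float> h(n, 1.0f);
    const size_t bytes = n * sizeof(float);
    float *d = nullptr;
    cudaMalloc(&d, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    auto total_begin = clock::now();

    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(d, n);
    // This blocking copy runs after the kernel in the default stream,
    // so the host does not continue until the kernel has finished.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);

    auto host_begin = clock::now();
    host_code_routine(h);
    auto host_end = clock::now();

    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(d, n);
    // The launch above is asynchronous; wait for it before stopping the clock.
    cudaDeviceSynchronize();

    auto total_end = clock::now();

    cudaFree(d);
    return Timings{ms(host_end - host_begin).count(),
                   ms(total_end - total_begin).count()};
}
```

steady_clock is monotonic, so the intervals are not disturbed by adjustments to the system clock, which can happen with CLOCK_REALTIME. Note that total_ms here also includes the memory copies, so it will generally exceed the sum of the host time and the event-measured kernel times.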

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/30439157/
