c++ - CUDA memory transfer overhead

It is common knowledge that copying data to the GPU is slow. I want to know what specifically "matters" when passing data to the GPU.

__global__
void add_kernel(float* a, float* b, float* c, int size) {
    // As posted: every thread that runs this kernel loops over the whole array.
    for (int i = 0; i < size; ++i) {
        a[i] = b[i] + c[i];
    }
}

int main() {
    int size = 100000; // Or any arbitrarily large number
    int reps = 1000;   // Or any arbitrarily large number

    extern float* a; // float* of [size] allocated on the GPU
    extern float* b; // float* of [size] allocated on the GPU
    extern float* c; // float* of [size] allocated on the GPU

    // blocks and threads (the launch configuration) are assumed to be defined elsewhere.
    for (int i = 0; i < reps; ++i)
        add_kernel<<<blocks, threads>>>(a, b, c, size);
}

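Does something like passing size to the kernel incur (significant) overhead? Or does "data transfer" refer more specifically to copying large arrays from the heap to the GPU?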

I.e., would this variant be (much) faster?

__global__
void add_kernel(float* a, float* b, float* c, int size, int reps) {
    // The repetition loop now runs inside the kernel instead of on the host.
    for (int j = 0; j < reps; ++j) {
        for (int i = 0; i < size; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}

int main() {
    int size = 100000; // Or any arbitrarily large number
    int reps = 1000;   // Or any arbitrarily large number

    extern float* a; // float* of [size] allocated on the GPU
    extern float* b; // float* of [size] allocated on the GPU
    extern float* c; // float* of [size] allocated on the GPU

    // blocks and threads (the launch configuration) are assumed to be defined elsewhere.
    add_kernel<<<blocks, threads>>>(a, b, c, size, reps);
}

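I.e. (again), in an "ideal" CUDA program, should the programmer try to write the bulk of the computation in pure CUDA kernels, or write CUDA kernels that are then called from the CPU (in the case where passing values from the stack does not incur significant overhead)?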

Edited for clarity.

Best answer

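Everything matters. To run a kernel, the CPU has to communicate somehow which kernel to call and with which arguments. At the "micro level", if your kernel performs only a few operations, this is considerable overhead. In real life, if your kernels do a lot of work, it is negligible.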

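If these small operations are not pipelined, their service cost can be relatively high. You can see this in NVIDIA's Visual Profiler. I don't know or remember the exact numbers, but the order of magnitude is as follows. Bandwidth between the CPU and the GPU can be 1 GB/s, which is 1 byte per nanosecond. But actually sending a 4-byte packet and getting an acknowledgement takes about 1 microsecond. So sending 10000 bytes takes about 11 microseconds: roughly 1 microsecond of latency plus 10 microseconds of transfer.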
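The estimate above amounts to a simple "latency plus size over bandwidth" model. A minimal sketch of that arithmetic follows, assuming the answer's rough figures (about 1 microsecond per transfer, 1 byte per nanosecond); the names `kLatencyNs`, `kBytesPerNs`, `transfer_time_ns`, and `chunked_transfer_time_ns` are made up for illustration, and none of the numbers are measurements.

```cpp
#include <cstdint>

// Assumed figures from the rough estimate in the answer, not measured values.
constexpr double kLatencyNs  = 1000.0; // ~1 microsecond fixed cost per transfer
constexpr double kBytesPerNs = 1.0;    // 1 GB/s == 1 byte per nanosecond

// Estimated time, in nanoseconds, to move `bytes` bytes in one transfer.
constexpr double transfer_time_ns(std::uint64_t bytes) {
    return kLatencyNs + static_cast<double>(bytes) / kBytesPerNs;
}

// Same payload split into `chunks` separate transfers that do not overlap:
// the fixed latency is paid once per chunk.
constexpr double chunked_transfer_time_ns(std::uint64_t bytes,
                                          std::uint64_t chunks) {
    return static_cast<double>(chunks) * kLatencyNs
         + static_cast<double>(bytes) / kBytesPerNs;
}
```

Under these assumptions `transfer_time_ns(10000)` gives 11000 ns, matching the 11 microseconds quoted above, while sending the same 10000 bytes as 2500 separate 4-byte transfers would come to about 2.5 milliseconds. That is why one large copy is generally preferable to many tiny ones.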

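Execution is also optimized for massively parallel work on the GPU, so executing 10 consecutive operations with a single 32-thread warp might take about 200 GPU clock cycles (roughly 0.2 microseconds). Add, say, 0.5 microseconds for sending the launch command before the kernel starts executing.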

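In real life, the problem is usually something like summing 100 million numbers. Because of the bandwidth limit you will spend 0.4 seconds on the transfer (400 MB of 4-byte values at 1 GB/s), while the computation itself takes only about 0.1 milliseconds, because a top GPU can perform about 1000 operations per cycle, with each cycle close to 1 nanosecond long.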

Regarding "c++ - CUDA memory transfer overhead", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48201889/
