c++ - CUDA dynamic indexing

I noticed that using dynamic indexing slows down my CUDA code by a factor of 12. See the following example:

__global__ void static3Ops(int start, int end, const float* p, const int* prog_dont_use, float* c)
{
    int i = threadIdx.x;
    float buf[5];
    buf[0] = 1.0e7;
    buf[1] = c[i];
    const int prog[] = { 0,1,2,3,4,5 };  // indices known at compile time; note that index 5 is past the end of buf[5]

    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}

is 12 times faster than

__global__ void static3Ops(int start, int end, const float* p, const int* prog, float* c)
{
    int i = threadIdx.x;
    float buf[5];
    buf[0] = 1.0e7;
    buf[1] = c[i];

    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}

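Are there any tips for minimizing this overhead? The dynamic nature is a core feature of my code, so I can hardly do without it.

Note that on the CPU the overhead is only about 20%.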

Best answer

Two possibilities I can think of:

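If prog is a small array: use your own solution! That is, if prog really is an array with only a few elements (as in your example), define it inside the kernel as you did in the first example above. Your comment that "the dynamic nature is a core feature of my code" makes it sound like this is not an option for you, though. When I changed const int prog[] = { 0,1,2,3,4,5 } to int prog_0 = 0, prog_1 = 1, ... and used prog_0, prog_1, ... instead of prog[], I got the same performance. This suggests that the values of prog[] are held directly in registers, with no global memory involved. If prog is not a small array, or is not known at compile time, this approach may lead to heavy use of local memory and significantly degrade performance.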
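If the set of possible index combinations is small and known at build time, one way to keep the indices compile-time constants while still choosing between them at run time is to turn them into template parameters and dispatch on the host. The following is only a sketch under that assumption. The names ops3Fixed, launchOps3 and progId, and the two index sets, are hypothetical and do not come from the question.

```cuda
#include <cuda_runtime.h>

// Compile-time range check for an index into buf[5].
__host__ __device__ constexpr bool inRange(int v) { return v >= 0 && v < 5; }

// Hypothetical variant of the kernel: the six indices are template
// parameters, so every access to buf[] uses a constant index.
template <int P0, int P1, int P2, int P3, int P4, int P5>
__global__ void ops3Fixed(int start, int end, const float* p, float* c)
{
    static_assert(inRange(P0) && inRange(P1) && inRange(P2) &&
                  inRange(P3) && inRange(P4) && inRange(P5),
                  "every index must address one of the 5 buffer slots");

    int i = threadIdx.x;
    float buf[5];
    buf[0] = 1.0e7f;
    buf[1] = c[i];
    buf[2] = buf[3] = buf[4] = 0.0f;  // avoid reading uninitialized slots

    for (int j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[P0] + buf[P1];
        buf[4] = buf[P2] - buf[P3];
        buf[1] = buf[P4] * buf[P5];
    }
    c[i] = buf[1];
}

// Hypothetical host-side dispatch: the run-time choice is made once per
// launch instead of inside the loop. Returns false for an unknown progId
// or if the launch itself reports an error.
bool launchOps3(int progId, int threads, int start, int end,
                const float* d_p, float* d_c)
{
    switch (progId) {
    case 0: ops3Fixed<0, 1, 2, 3, 4, 2><<<1, threads>>>(start, end, d_p, d_c); break;
    case 1: ops3Fixed<1, 2, 0, 3, 4, 3><<<1, threads>>>(start, end, d_p, d_c); break;
    default: return false;
    }
    return cudaGetLastError() == cudaSuccess;
}
```

Each instantiation is compiled separately, so this only scales to a modest number of index combinations. It does not help if prog is arbitrary data that is only available at run time.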

If prog is a large array: have the threads load prog into shared memory in parallel, and then access shared memory for the rest of the kernel (block-level tiling).

__shared__  int prog_sh[6]; // or dynamically allocate if size is not known
int i = threadIdx.x;
if (i < 6)
    prog_sh[i] = prog[i];
__syncthreads();

// and then use prog_sh instead of prog....

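Note that this really makes no sense for a small array with known values (as in your example), but you would be surprised how much you can gain from tiling when working with large arrays. Still, when working with shared memory you should make sure that accesses can proceed concurrently so that you get high memory bandwidth (see this link).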
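For the case where the size of prog is not known at compile time, the shared-memory staging can be sketched with dynamically sized shared memory, roughly as follows. This is an illustration, not the answerer's code. The names ops3SharedProg and launchOps3SharedProg and the extra progLen parameter are assumptions, and the loop body is copied from the question.

```cuda
#include <cuda_runtime.h>

// Sketch: same loop as in the question, but prog is first staged in
// dynamically sized shared memory. progLen is a hypothetical extra
// parameter and must be at least 6 for the loop body below; the values
// in prog must be valid indices into buf[5].
__global__ void ops3SharedProg(int start, int end, const float* p,
                               const int* prog, int progLen, float* c)
{
    extern __shared__ int prog_sh[];  // size is set by the launch configuration

    // Block-stride copy, so it works for any progLen and any block size.
    for (int k = threadIdx.x; k < progLen; k += blockDim.x)
        prog_sh[k] = prog[k];
    __syncthreads();  // all of prog_sh must be written before anyone reads it

    int i = threadIdx.x;
    float buf[5];
    buf[0] = 1.0e7f;
    buf[1] = c[i];
    buf[2] = buf[3] = buf[4] = 0.0f;  // avoid reading uninitialized slots

    for (int j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog_sh[0]] + buf[prog_sh[1]];
        buf[4] = buf[prog_sh[2]] - buf[prog_sh[3]];
        buf[1] = buf[prog_sh[4]] * buf[prog_sh[5]];
    }
    c[i] = buf[1];
}

// Hypothetical launch helper: the third launch parameter is the number of
// bytes of dynamic shared memory per block.
cudaError_t launchOps3SharedProg(int threads, int start, int end,
                                 const float* d_p, const int* d_prog,
                                 int progLen, float* d_c)
{
    size_t shBytes = static_cast<size_t>(progLen) * sizeof(int);
    ops3SharedProg<<<1, threads, shBytes>>>(start, end, d_p, d_prog, progLen, d_c);
    return cudaGetLastError();
}
```

This only changes where the values of prog are read from. buf is still indexed with values that are unknown at compile time, so the compiler may still place buf in local memory rather than registers.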

Regarding c++ - CUDA dynamic indexing, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60988072/
