cuda - CUDA constant memory best practices

Here is some code:

#include <cstring>        // memset
#include <cuda_runtime.h>

__constant__ int array[1024];

__global__ void kernel1(int *d_dst) {
   int tId = threadIdx.x + blockIdx.x * blockDim.x;
   d_dst[tId] = array[tId];
}

__global__ void kernel2(int *d_dst, int *d_src) {
   int tId = threadIdx.x + blockIdx.x * blockDim.x;
   d_dst[tId] = d_src[tId];
}

int main(int argc, char **argv) {
   int *d_array;
   int *d_src;
   cudaMalloc((void**)&d_array, sizeof(int) * 1024);
   cudaMalloc((void**)&d_src, sizeof(int) * 1024);

   int *test = new int[1024];
   memset(test, 0, sizeof(int) * 1024);

   for (int i = 0; i < 1024; i++) {
     test[i] = 100;
   }

   cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
   kernel1<<< 1, 1024 >>>(d_array);

   cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
   kernel2<<<1, 32 >>>(d_array, d_src);

   delete[] test;  // allocated with new[], so release with delete[]
   cudaFree(d_array);
   cudaFree(d_src);

   return 0;
}

It only demonstrates constant memory and global memory usage. When I run it, "kernel2" executes about 4 times faster (in terms of time) than "kernel1".

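I understand from the CUDA C Programming Guide that this is because accesses to constant memory are being serialized. This leads me to think that constant memory is best utilized when a warp accesses a single constant value (an integer, float, double, etc.), and that accessing an array is not beneficial at all. In other words, I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?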

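I also want to know what happens if I keep a structure instead of a simple type in constant memory. Is any access to the structure by a thread within a warp also considered a single memory access, or more than one? I mean, a structure might contain multiple simple types and arrays. When accessing those simple types, are the accesses also serialized?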

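My last question: if I do have an array of constant values that needs to be accessed by different threads within a warp, should it be kept in global memory instead of constant memory for faster access? Is that correct?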

Can anyone point me to some example code that shows efficient constant memory usage?

Regards,

Best answer

I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?

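Yes, this is generally correct, and it is the principal intent of constant memory and the constant cache. The constant cache can serve up one quantity per SM "at a time". The precise wording is as follows: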

The constant memory space resides in device memory and is cached in the constant cache.

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.

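An important takeaway from the text above is that uniform access across a warp is needed for best performance. If a warp makes a request to __constant__ memory in which different threads of the warp access different locations, those requests are serialized. Therefore, if each thread in a warp accesses the same value: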

int i = array[20];

then you have the opportunity to benefit substantially from the constant cache/memory. If each thread in a warp accesses a unique quantity:

int i = array[threadIdx.x];

then the accesses are serialized, and the use of constant data will be disappointing in terms of performance.
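To make the difference concrete, here is a minimal sketch that times the two access patterns against the same __constant__ table. The names c_table, uniformRead, divergentRead, timeKernel and runDemo are made up for this illustration, and the iteration count is arbitrary. Actual timings depend on the GPU and compiler, so none are claimed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024

__constant__ int c_table[N];  // hypothetical table, filled with 1s below

// Uniform access: the index is the same for every thread in a warp,
// so each read can be served as a single broadcast.
__global__ void uniformRead(int *dst, int iters) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int sum = 0;
    for (int i = 0; i < iters; ++i)
        sum += c_table[i % N];
    dst[tid] = sum;
}

// Divergent access: the index depends on the thread, so one warp request
// touches up to 32 different addresses and is split into separate requests.
__global__ void divergentRead(int *dst, int iters) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int sum = 0;
    for (int i = 0; i < iters; ++i)
        sum += c_table[(tid + i) % N];
    dst[tid] = sum;
}

typedef void (*ReadKernel)(int *, int);

// Times one launch with CUDA events, after one warm-up launch.
float timeKernel(ReadKernel k, int *d_dst, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<N / 256, 256>>>(d_dst, iters);  // warm-up
    cudaEventRecord(start);
    k<<<N / 256, 256>>>(d_dst, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Returns the number of wrong output elements (0 means success).
int runDemo() {
    static int h_buf[N];
    for (int i = 0; i < N; ++i) h_buf[i] = 1;
    cudaMemcpyToSymbol(c_table, h_buf, sizeof(h_buf));

    int *d_dst = nullptr;
    cudaMalloc(&d_dst, N * sizeof(int));

    const int iters = 10000;  // arbitrary choice for this sketch
    float tUniform = timeKernel(uniformRead, d_dst, iters);
    float tDivergent = timeKernel(divergentRead, d_dst, iters);
    printf("uniform: %.3f ms, divergent: %.3f ms\n", tUniform, tDivergent);

    // d_dst now holds divergentRead's output. Every table entry is 1,
    // so each element should equal iters.
    cudaMemcpy(h_buf, d_dst, sizeof(h_buf), cudaMemcpyDeviceToHost);
    int errors = 0;
    for (int i = 0; i < N; ++i) errors += (h_buf[i] != iters);
    cudaFree(d_dst);
    return errors;
}

int main() {
    return runDemo() == 0 ? 0 : 1;
}
```

Both kernels do the same arithmetic and read the same table. Only the index expression differs, which is the variable the answer is talking about.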

I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more?

You can certainly put structures in constant memory. The same rules apply:

int i = constant_struct_ptr->array[20];

has the opportunity to benefit, but

int i = constant_struct_ptr->array[threadIdx.x];

does not. If all threads access the same simple-type structure element, that is ideal for constant cache usage.
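As an illustration of the favorable case, the sketch below keeps a small parameter struct in constant memory and has every thread read the same fields. The Params struct, c_params, scaleAndShift and runDemo are hypothetical names chosen for this example, not anything from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024

// Hypothetical parameter block: only scalar fields, read uniformly.
struct Params {
    float scale;
    float offset;
    int   n;
};

__constant__ Params c_params;

// Every thread reads the same three fields, so each field access is the
// same address across the warp. The per-thread data stays in global memory.
__global__ void scaleAndShift(const float *src, float *dst) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < c_params.n)
        dst[i] = c_params.scale * src[i] + c_params.offset;
}

// Returns the number of wrong output elements (0 means success).
int runDemo() {
    Params h_params = {2.0f, 1.0f, N};
    cudaMemcpyToSymbol(c_params, &h_params, sizeof(Params));

    static float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, N * sizeof(float));
    cudaMalloc(&d_dst, N * sizeof(float));
    cudaMemcpy(d_src, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    scaleAndShift<<<(N + 255) / 256, 256>>>(d_src, d_dst);
    cudaMemcpy(h_data, d_dst, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Expected value is 2 * i + 1, which is exact in float for i < N.
    int errors = 0;
    for (int i = 0; i < N; ++i)
        errors += (h_data[i] != 2.0f * (float)i + 1.0f);

    cudaFree(d_src);
    cudaFree(d_dst);
    return errors;
}

int main() {
    int errors = runDemo();
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```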

Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?

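Yes. If you know that, in general, your accesses will break the constant memory rule of one 32-bit quantity per cycle, then you are probably better off leaving the data in ordinary global memory.
There are a variety of CUDA sample codes that demonstrate the use of __constant__ data. Here are a few: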

graphics volumeRender

imaging bilateralFilter

imaging convolutionTexture

finance MonteCarloGPU

There are others.
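For a self-contained illustration in the same spirit, here is a minimal sketch of a 1D filter whose taps live in constant memory. It is not taken from any of the samples listed above. RADIUS, c_taps, filter1D and runDemo are names invented for this example, and the all-ones taps are chosen only so the result is easy to check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024
#define RADIUS 4
#define TAPS (2 * RADIUS + 1)

// Filter taps in constant memory (hypothetical name).
__constant__ float c_taps[TAPS];

__global__ void filter1D(const float *src, float *dst, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int j = min(max(i + k, 0), n - 1);  // clamp at the borders
        // c_taps[k + RADIUS] is the same address for every thread in the
        // warp on each iteration; src[j] differs per thread, so it is
        // kept in global memory.
        acc += c_taps[k + RADIUS] * src[j];
    }
    dst[i] = acc;
}

// Returns the number of wrong output elements (0 means success).
int runDemo() {
    float h_taps[TAPS];
    for (int k = 0; k < TAPS; ++k) h_taps[k] = 1.0f;
    cudaMemcpyToSymbol(c_taps, h_taps, sizeof(h_taps));

    static float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, N * sizeof(float));
    cudaMalloc(&d_dst, N * sizeof(float));
    cudaMemcpy(d_src, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    filter1D<<<(N + 255) / 256, 256>>>(d_src, d_dst, N);
    cudaMemcpy(h_data, d_dst, N * sizeof(float), cudaMemcpyDeviceToHost);

    // All inputs and taps are 1, so every output is the tap count.
    int errors = 0;
    for (int i = 0; i < N; ++i) errors += (h_data[i] != (float)TAPS);

    cudaFree(d_src);
    cudaFree(d_dst);
    return errors;
}

int main() {
    int errors = runDemo();
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

The loop index k is the same for all threads at each step, so the tap read is uniform across the warp even though c_taps is an array.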
EDIT: Responding to a question in the comments, if we have a structure like this in constant memory:

struct Simple { int a; int b; int c; } s;

and we access it like this:

int p = s.a + s.b + s.c;
          ^     ^     ^
          |     |     |
cycle:    1     2     3

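then we make good use of the constant memory/cache. When the C code is compiled, under the hood it generates machine-code accesses corresponding to 1, 2 and 3 in the diagram above. Suppose access 1 occurs first. Since access 1 targets the same memory location regardless of which thread in the warp issues it, during cycle 1 all threads receive the value in s.a, and the access gets the best possible benefit from the cache. The same holds for accesses 2 and 3. If, on the other hand, we had: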

struct Simple { int a[32]; int b[32]; int c[32]; } s;
...
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int p = s.a[idx] + s.b[idx] + s.c[idx];

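This would not make good use of the constant memory/cache. Instead, if this were typical of our accesses to s, we would probably get better performance by placing s in ordinary global memory.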
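A minimal sketch of that alternative, assuming a single warp of 32 threads and made-up names (Tables, sumTables, runDemo): the struct of arrays is placed in global memory, where adjacent threads reading adjacent elements can be coalesced. The const and __restrict__ qualifiers are a hint that the data is read-only and are an addition of this sketch, not part of the original answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define LANES 32

// Same layout as the per-thread-indexed struct above, but allocated in
// global memory with cudaMalloc instead of being declared __constant__.
struct Tables {
    int a[LANES];
    int b[LANES];
    int c[LANES];
};

// Thread i reads a[i], b[i] and c[i]. Within a warp these are consecutive
// addresses, which suits global memory.
__global__ void sumTables(const Tables * __restrict__ t, int *dst) {
    int lane = threadIdx.x;
    dst[lane] = t->a[lane] + t->b[lane] + t->c[lane];
}

// Returns the number of wrong output elements (0 means success).
int runDemo() {
    Tables h_tables;
    for (int i = 0; i < LANES; ++i) {
        h_tables.a[i] = i;
        h_tables.b[i] = 2 * i;
        h_tables.c[i] = 3 * i;
    }

    Tables *d_tables = nullptr;
    int *d_dst = nullptr;
    cudaMalloc(&d_tables, sizeof(Tables));
    cudaMalloc(&d_dst, LANES * sizeof(int));
    cudaMemcpy(d_tables, &h_tables, sizeof(Tables), cudaMemcpyHostToDevice);

    sumTables<<<1, LANES>>>(d_tables, d_dst);  // one warp

    int h_dst[LANES];
    cudaMemcpy(h_dst, d_dst, sizeof(h_dst), cudaMemcpyDeviceToHost);

    // a[i] + b[i] + c[i] = i + 2i + 3i = 6i
    int errors = 0;
    for (int i = 0; i < LANES; ++i) errors += (h_dst[i] != 6 * i);

    cudaFree(d_tables);
    cudaFree(d_dst);
    return errors;
}

int main() {
    int errors = runDemo();
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```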

Regarding cuda - CUDA constant memory best practices, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18020647/
