Order of CUDA memory operations within a single thread

From the CUDA Programming Guide (v. 5.5):

The CUDA programming model assumes a device with a weakly-ordered memory model, that is:

The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread;

The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the read instructions appear in the program for instructions that are independent of each other

But are we guaranteed that (dependent) memory operations, as seen from a single thread, are actually consistent? Say I do:

arr[x] = 1;
int z = arr[y];

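where x happens to equal y and no other thread is touching the memory. Am I guaranteed that z is 1? Or do I still need to put some volatile or a barrier between these two operations?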

In response to Orpedo's answer:

If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...

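My question is which optimizations (done by the compiler or by the hardware) are allowed. It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is somehow handled by the memory controller faster than the store that is already queued.

I don't know CUDA hardware. Am I guaranteed that the above will never happen?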

Best answer

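The CUDA Programming Guide simply states that you cannot predict the order in which threads are executed, but every single thread will still run as a sequential thread. In the example you give, where x and y are the same and no other thread is touching the memory, you are guaranteed that z = 1. The point is that if you have multiple threads performing operations on the same data (e.g. an array), you are not guaranteed that thread #9 executes before thread #10.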
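The same-thread case can be checked with a small test program. This is only a sketch, not something from the original answer: the kernel name store_then_load, the index array ys (used so that x == y is known only at run time) and the host helper run_store_then_load_check are all made up for illustration, and a passing run is consistent with the guarantee rather than a proof of it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores to arr[x] and then loads arr[y].
// y is read from an index array, so x == y is only known at run time.
// No other thread touches arr[x].
__global__ void store_then_load(int *arr, const int *ys, int *out, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= n) return;
    int y = ys[x];      // the host fills ys[x] = x, so y == x
    arr[x] = 1;         // store
    out[x] = arr[y];    // dependent load issued by the same thread
}

// Hypothetical host helper: returns true if every thread read back the
// value it had just written.
bool run_store_then_load_check()
{
    const int n = 256;
    std::vector<int> h_ys(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_ys[i] = i;

    int *arr = nullptr, *ys = nullptr, *out = nullptr;
    bool ok = cudaMalloc(&arr, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&ys,  n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&out, n * sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemset(arr, 0, n * sizeof(int));
        cudaMemcpy(ys, h_ys.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        store_then_load<<<(n + 127) / 128, 128>>>(arr, ys, out, n);
        // cudaMemcpy waits for the kernel to finish before copying.
        ok = cudaMemcpy(h_out.data(), out, n * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 1);
    }
    cudaFree(arr);
    cudaFree(ys);
    cudaFree(out);
    return ok;
}

int main()
{
    bool ok = run_store_then_load_check();
    printf("%s\n", ok ? "every thread read back its own store"
                      : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

No volatile or fence is used between the store and the load here, since only one thread ever touches each element.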

Take an example:

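__device__ void sum_all(float *x, float *result, int N){
  // Each thread writes its own index into the shared array x.
  x[threadIdx.x] = threadIdx.x;
  result[threadIdx.x] = 0;
  // Sum over all of x, including elements written by OTHER threads;
  // without synchronization those elements may not have been written yet.
  for(int i = 0; i < N; i++)
    result[threadIdx.x] += x[i];
}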

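Here we have a silly function that should populate a shared array (x) with the numbers from m ... n (read: from one number to another), then sum up the numbers already put into the array and store the result in another array. Given that your lowest-indexed thread is enumerated thread #0, you would expect that after the first thread runs this code, the arrays contain

x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}

next, for thread #1

x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}

next, for thread #2

x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}

and so on. But this is not guaranteed. You cannot know whether, for example, thread #3 runs first and thereby changes the array x[] before thread #0 runs. You do not even know whether the arrays are changed by some other thread while you are executing the code.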
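One common way to make such a cross-thread read well-defined within a single block is to place a __syncthreads() barrier between the write phase and the read phase. The sketch below is an illustration and not part of the original answer: the kernel sum_all_synced and the helper run_sum_all_synced_check are hypothetical names, and it assumes a launch with exactly one block so that the block-wide barrier covers every writer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of sum_all, assuming a single-block launch.
__global__ void sum_all_synced(float *x, float *result, int n)
{
    int tid = threadIdx.x;
    if (tid < n) x[tid] = (float)tid;   // write phase: own element only

    // Block-wide barrier: every thread in the block reaches this point and
    // the writes made before it are visible to the block's threads after it.
    __syncthreads();

    if (tid < n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)     // read phase: elements of all threads
            sum += x[i];
        result[tid] = sum;
    }
}

// Hypothetical host helper: returns true if every thread computed
// 0 + 1 + ... + (n-1).
bool run_sum_all_synced_check()
{
    const int n = 64;                          // one block of 64 threads
    const float expected = n * (n - 1) / 2.0f; // 2016, exact in float
    std::vector<float> h_result(n, -1.0f);

    float *x = nullptr, *result = nullptr;
    bool ok = cudaMalloc(&x, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&result, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        sum_all_synced<<<1, n>>>(x, result, n);
        // cudaMemcpy waits for the kernel to finish before copying.
        ok = cudaMemcpy(h_result.data(), result, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h_result[i] == expected);
    }
    cudaFree(x);
    cudaFree(result);
    return ok;
}

int main()
{
    bool ok = run_sum_all_synced_check();
    printf("%s\n", ok ? "all sums match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The barrier only orders threads of the same block; threads in different blocks would need a different mechanism, such as separate kernel launches.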

On the order of CUDA memory operations within a single thread, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21205155/
