c++ - Residual calculation with CUDA

I have two vectors (oldvector and newvector). I need to calculate the value of the residual, which is defined by the following pseudocode:

residual = 0;
forall i : residual += (oldvector[i] - newvector[i])^2
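
For checking GPU results later, the pseudocode above can be written as a plain host-side loop. This is only a reference sketch: the function name residual_reference is hypothetical, and it assumes both vectors have the same length.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference: sum of squared element-wise differences.
// Assumes oldvector and newvector have the same number of elements.
float residual_reference(const std::vector<float>& oldvector,
                         const std::vector<float>& newvector)
{
  float residual = 0.0f;
  for (std::size_t i = 0; i < oldvector.size(); ++i) {
    const float d = oldvector[i] - newvector[i];
    residual += d * d;
  }
  return residual;
}
```

For oldvector = {1, 2} and newvector = {2, 5} this gives (1-2)^2 + (2-5)^2 = 10.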

Currently, I am calculating this with two CUDA Thrust operations, which essentially do the following:

forall i : oldvector[i] = oldvector[i] - newvector[i]

followed by a thrust::transform_reduce with a square as the unary operator, which does:

residual = 0;
forall i : residual += oldvector[i]^2;

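The obvious problem with this is the intermediate store to global memory before the transform_reduce. Is there a more efficient approach that fuses these two steps? Apart from writing my own CUDA kernel, is there any other option?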

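One approach I thought of was to write a thrust::reduce with zip iterators. The problem with this is that the return type of the operator has to be the same as its input type. As far as I can tell, this means the reduction operator would return a tuple, which would mean an extra addition.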

If I do write a reduction CUDA kernel, have there been any improvements over the CUDA 1.1 example of the reduction kernel?

Best answer

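thrust::inner_product will do it in a single function call. Your original idea can also be made to work (zipping the two vectors together and using thrust::transform_reduce). This code demonstrates both methods: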

#include <iostream>

#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/transform_reduce.h>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

#define N 2

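// Unary functor for the zip-iterator version: takes a tuple (old, new)
// and returns the squared difference of its two elements.
struct zdiffsq{
template <typename Tuple>
  __host__ __device__ float operator()(Tuple a) const
  {
    float result = thrust::get<0>(a) - thrust::get<1>(a);
    return result*result;
  }
};

// Binary functor for thrust::inner_product: replaces the usual
// element-wise multiply with a squared difference.
struct diffsq{
  __host__ __device__ float operator()(float a, float b) const
  {
    return (b-a)*(b-a);
  }
};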

int main(){

  thrust::device_vector<float> oldvector(N);
  thrust::device_vector<float> newvector(N);
  oldvector[0] = 1.0f;  oldvector[1] = 2.0f;
  newvector[0] = 2.0f;  newvector[1] = 5.0f;

  float result = thrust::inner_product(oldvector.begin(), oldvector.end(),
                                       newvector.begin(), 0.0f,
                                       thrust::plus<float>(), diffsq());
  std::cout << "Result: " << result << std::endl;

  float result2 = thrust::transform_reduce(
      thrust::make_zip_iterator(thrust::make_tuple(oldvector.begin(), newvector.begin())),
      thrust::make_zip_iterator(thrust::make_tuple(oldvector.end(), newvector.end())),
      zdiffsq(), 0.0f, thrust::plus<float>());
  std::cout << "Result2: " << result2 << std::endl;
}

You can also investigate eliminating the functor definition used in the inner product example by using Thrust placeholders.

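Even if you want to write your own CUDA code, the standard recommendation nowadays for commonly used algorithms such as parallel reduction and sorting is to use cub.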

Yes, the CUDA parallel reduction sample and its accompanying presentation are still a good basic introduction to fast parallel reductions.

This question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/23597823/
