c++ - Is this method of handling the array tail with SSE overkill?

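I'm playing around with SSE and trying to write a function that adds up all the values of a single-precision float array. I want it to work for arrays of any length, not just multiples of 4, as virtually all examples on the web assume. I came up with something like this: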

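#include <cstddef>     // size_t
#include <pmmintrin.h> // SSE3 (_mm_hadd_ps); also pulls in the SSE/SSE2 headers

float sse_sum(const float *x, const size_t n)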
{
    const size_t
        steps = n / 4,
        rem = n % 4,
        limit = steps * 4;

    __m128 
        v, // vector of current values of x
        sum = _mm_setzero_ps(); // sum accumulator

    // perform the main part of the addition
    size_t i;
    for (i = 0; i < limit; i+=4)
    {
        v = _mm_loadu_ps(&x[i]); // unaligned load: x need not be 16-byte aligned
        sum = _mm_add_ps(sum, v);
    }

    // add the last 1 - 3 odd items if necessary, based on the remainder value
    switch(rem)
    {
        case 0: 
            // nothing to do if no elements left
            break;
        case 1: 
            // put 1 remaining value into v, initialize remaining 3 to 0.0
            v = _mm_load_ss(&x[i]);
            sum = _mm_add_ps(sum, v);
            break;
        case 2: 
            // set all 4 to zero
            v = _mm_setzero_ps();
            // load remaining 2 values into lower part of v
            v = _mm_loadl_pi(v, (const __m64 *)(&x[i]));
            sum = _mm_add_ps(sum, v);
            break;
        case 3: 
            // put the last one of the remaining 3 values into v, initialize rest to 0.0
            v = _mm_load_ss(&x[i+2]);
            // copy lower part containing one 0.0 and the value into the higher part
            v = _mm_movelh_ps(v,v);
            // load remaining 2 of the 3 values into lower part, overwriting 
            // old contents                         
            v = _mm_loadl_pi(v, (const __m64*)(&x[i]));     
            sum = _mm_add_ps(sum, v);
            break;
    }

    // add up partial results
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    // extract the lowest lane, which now holds the total
    return _mm_cvtss_f32(sum);
}

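Then I started to wonder whether this is overkill. I mean, I got stuck in SIMD mode and just had to have the tail processed with SSE as well. It was fun, but isn't it just as good (and simpler) to add up the tail and compute the result with regular float operations? Am I gaining anything by doing it in SSE?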

Best answer

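I would look into Agner Fog's vector class library. See the section "When the data size is not a multiple of the vector size" in VectorClass.pdf. He lists five different ways of handling this and discusses the pros and cons of each. http://www.agner.org/optimize/#vectorclass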

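In general, the way I do this is the one I got from the following link: the main loop processes four floats at a time up to N rounded down to a multiple of 4, and a plain scalar loop handles the remaining 0 to 3 elements. http://fastcpp.blogspot.no/2011/04/how-to-unroll-loop-in-c.html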

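#include <xmmintrin.h> // SSE intrinsics

// Round x down to a multiple of s (s must be a power of two)
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

void add(float* result, const float* a, const float* b, int N) {
    int i = 0;
    // vectorized part: four floats per iteration
    for (; i < ROUND_DOWN(N, 4); i += 4) {
        __m128 a4 = _mm_loadu_ps(a + i);
        __m128 b4 = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(a4, b4);
        _mm_storeu_ps(result + i, sum);
    }
    // scalar tail: the remaining 0 to 3 elements
    for (; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}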
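
The same pattern can be applied to the summation from the question. What follows is only a minimal sketch, not code from the answer or the linked post: the function name `sse_sum_scalar_tail` is made up for illustration, unaligned loads are assumed so the input needs no particular alignment, and the horizontal sum uses SSE1 shuffles instead of `_mm_hadd_ps` so that no SSE3 support is required.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: sums n floats with an SSE main loop and a scalar tail.
float sse_sum_scalar_tail(const float* x, std::size_t n)
{
    const std::size_t limit = n - (n % 4); // largest multiple of 4 that is <= n

    __m128 acc = _mm_setzero_ps();         // four partial sums
    std::size_t i = 0;
    for (; i < limit; i += 4) {
        // unaligned load, so x does not have to be 16-byte aligned
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    }

    // horizontal sum of the four lanes (a0, a1, a2, a3)
    __m128 hi   = _mm_movehl_ps(acc, acc);          // (a2, a3, a2, a3)
    __m128 pair = _mm_add_ps(acc, hi);              // lane 0: a0+a2, lane 1: a1+a3
    __m128 odd  = _mm_shuffle_ps(pair, pair, 0x55); // broadcast lane 1
    float total = _mm_cvtss_f32(_mm_add_ss(pair, odd));

    // scalar tail: the remaining 0 to 3 elements
    for (; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

With at most three leftover elements, the scalar tail costs only a few additions, so this version is much shorter than the `switch`-based one. Note that the order of additions differs from a plain left-to-right loop, so results may differ slightly in the last bits for general floating-point input.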

Regarding "c++ - Is this method of handling the array tail with SSE overkill?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15512370/
