c++ - SSE load/store memory transactions

When using SSE intrinsics, there are two ways to handle the interaction between memory and registers:

Intermediate pointers:

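#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_set1_ps, _mm_add_ps

void f_sse(float *input, float *output, unsigned int n)
{
   // Assumes both pointers are 16-byte aligned and n is a multiple of 4.
   __m128 *input_sse = reinterpret_cast<__m128*>(input);   // input intermediate pointer
   __m128 *output_sse = reinterpret_cast<__m128*>(output); // output intermediate pointer
   __m128 s = _mm_set1_ps(0.1f);                           // broadcast 0.1f to all four lanes
   auto loop_size = n/4;                                   // number of 4-float vectors
   for(auto i=0; i<loop_size; ++i)
      output_sse[i] = _mm_add_ps(input_sse[i], s);         // load, add and store via the pointers
}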

Explicit load/store:

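#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

void f_sse(float *input, float *output, unsigned int n)
{
   // Assumes both pointers are 16-byte aligned and n is a multiple of 4.
   __m128 input_sse, result;
   __m128 s = _mm_set1_ps(0.1f);             // broadcast 0.1f to all four lanes
   for(auto i=0; i<n; i+=4)
   {
      input_sse  = _mm_load_ps(input+i);     // aligned load of 4 floats
      result     = _mm_add_ps(input_sse, s);
      _mm_store_ps(output+i, result);        // aligned store of 4 floats
   }
}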

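What is the difference between the two approaches above, and which one performs better? The input and output pointers are allocated with _mm_malloc(), so they are aligned.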

Best answer

Compiled with g++ at the O3 optimization level, the assembly for the inner loops (obtained with objdump -d) is

20:   0f 28 04 07             movaps (%rdi,%rax,1),%xmm0
24:   0f 58 c1                addps  %xmm1,%xmm0
27:   0f 29 04 06             movaps %xmm0,(%rsi,%rax,1)
2b:   48 83 c0 10             add    $0x10,%rax
2f:   48 39 d0                cmp    %rdx,%rax
32:   75 ec                   jne    20 <_Z5f_ssePfS_j+0x20>

and

10:   0f 28 04 07             movaps (%rdi,%rax,1),%xmm0
14:   83 c1 04                add    $0x4,%ecx
17:   0f 58 c1                addps  %xmm1,%xmm0
1a:   0f 29 04 06             movaps %xmm0,(%rsi,%rax,1)
1e:   48 83 c0 10             add    $0x10,%rax
22:   39 ca                   cmp    %ecx,%edx
24:   77 ea                   ja     10 <_Z5f_ssePfS_j+0x10>

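The two loops are very similar: both use the same aligned movaps load and store around a single addps. In the first version g++ managed to use only one counter (a single add instruction per iteration), whereas the second keeps a separate element counter in %ecx. So I would guess the first is slightly better.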
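Since both versions compile to the same movaps/addps/movaps sequence, they should also produce identical results. A minimal sketch to check this, assuming GCC or Clang (where <xmmintrin.h> also declares _mm_malloc and _mm_free) and a length that is a positive multiple of 4. The names add_via_pointer, add_via_load_store and variants_agree are hypothetical and not from the question:

```cpp
#include <xmmintrin.h>  // SSE intrinsics; also declares _mm_malloc/_mm_free on GCC and Clang

// Variant 1: access memory through intermediate __m128 pointers.
void add_via_pointer(const float *input, float *output, unsigned int n)
{
   const __m128 *in = reinterpret_cast<const __m128*>(input);
   __m128 *out = reinterpret_cast<__m128*>(output);
   const __m128 s = _mm_set1_ps(0.1f);
   for (unsigned int i = 0; i < n / 4; ++i)
      out[i] = _mm_add_ps(in[i], s);
}

// Variant 2: explicit aligned loads and stores.
void add_via_load_store(const float *input, float *output, unsigned int n)
{
   const __m128 s = _mm_set1_ps(0.1f);
   for (unsigned int i = 0; i < n; i += 4)
      _mm_store_ps(output + i, _mm_add_ps(_mm_load_ps(input + i), s));
}

// Hypothetical helper: true when both variants write identical values.
// Assumption: n must be a positive multiple of 4, otherwise false is returned.
bool variants_agree(unsigned int n)
{
   if (n == 0 || n % 4 != 0) return false;

   // 16-byte alignment, as required by movaps / _mm_load_ps / _mm_store_ps.
   float *in = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
   float *a  = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
   float *b  = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));

   bool ok = in && a && b;
   if (ok) {
      for (unsigned int i = 0; i < n; ++i)
         in[i] = 0.5f * static_cast<float>(i);   // arbitrary test data
      add_via_pointer(in, a, n);
      add_via_load_store(in, b, n);
      for (unsigned int i = 0; i < n; ++i)
         if (a[i] != b[i]) ok = false;
   }

   _mm_free(in);
   _mm_free(a);
   _mm_free(b);
   return ok;
}
```

Calling variants_agree(1024) from a test or from main is one way to confirm the two styles are interchangeable before comparing their generated code. Any real performance difference would still need to be measured on the target compiler and CPU.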

A similar question about c++ - SSE load/store memory transactions can be found on Stack Overflow: https://stackoverflow.com/questions/17625180/
