c++ - Simple SSE loop is slower than the non-SSE version

Tags: c++ gcc sse

I am trying to compare SSE float[4] addition with standard float[4] addition. As a demo, I compute the sum of the summed components, with and without SSE:

#include <iostream>
#include <vector>

struct Point4
{
  Point4()
  {
    data[0] = 0;
    data[1] = 0;
    data[2] = 0;
    data[3] = 0;
  }

  float data[4];
};

void Standard()
{
  Point4 a;
  a.data[0] = 1.0f;
  a.data[1] = 2.0f;
  a.data[2] = 3.0f;
  a.data[3] = 4.0f;

  Point4 b;
  b.data[0] = 1.0f;
  b.data[1] = 6.0f;
  b.data[2] = 3.0f;
  b.data[3] = 5.0f;

  float total = 0.0f;
  for(unsigned int i = 0; i < 1e9; ++i)
  {
    for(unsigned int component = 0; component < 4; ++component)
    {
      total += a.data[component] + b.data[component];
    }
  }

  std::cout << "total: " << total << std::endl;
}

void Vectorized()
{
  typedef float v4sf __attribute__ (( vector_size(4*sizeof(float)) ));

  v4sf a;
  float* aPointer = (float*)&a;
  aPointer[0] = 1.0f; aPointer[1] = 2.0f; aPointer[2] = 3.0f; aPointer[3] = 4.0f;

  v4sf b;
  float* bPointer = (float*)&b;
  bPointer[0] = 1.0f; bPointer[1] = 6.0f; bPointer[2] = 3.0f; bPointer[3] = 5.0f;

  v4sf result;
  float* resultPointer = (float*)&result;
  resultPointer[0] = 0.0f;
  resultPointer[1] = 0.0f;
  resultPointer[2] = 0.0f;
  resultPointer[3] = 0.0f;

  for(unsigned int i = 0; i < 1e9; ++i)
  {
    result += a + b; // Vectorized operation
  }

  // Sum the components of the result (this is done with the "total += " in the Standard() loop)
  float total = 0.0f;
  for(unsigned int component = 0; component < 4; ++component)
  {
    total += resultPointer[component];
  }
  std::cout << "total: " << total << std::endl;
}

int main()
{

//  Standard();

  Vectorized();

  return 0;
}

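However, the code using the standard method appears to be faster (~0.2 s) than the code using the vectorized method (~0.4 s). Is that because of the for loop that sums the v4sf values? Is there a better operation I could use to time the difference between the two techniques while still comparing the output to make sure the two agree?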

Best answer

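The reason your version is slower with SSE is that you have to unpack from the SSE register into scalar registers 4 times per iteration, which costs more than you gain from the vectorized add. Have a look at the disassembly and you should get a clearer picture.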

I think what you want to do is the following (which is faster with SSE):

for(unsigned int i = 0; i < 1e6; ++i)
{
    result += a + b; // Vectorized operation
}

// Sum the components of the result (this is done with the "total += " in the Standard() loop)
for(unsigned int component = 0; component < 4; ++component)
{
    total += resultPointer[component];
}

The following may be even faster:

for(unsigned int i = 0; i < 1e6/4; ++i)
{
    result0 += a + b; // Vectorized operation
    result1 += a + b; // Vectorized operation
    result2 += a + b; // Vectorized operation
    result3 += a + b; // Vectorized operation
}

// Sum the components of the result (this is done with the "total += " in the Standard() loop)
for(unsigned int component = 0; component < 4; ++component)
{
    total += resultPointer0[component];
    total += resultPointer1[component];
    total += resultPointer2[component];
    total += resultPointer3[component];
}

Regarding "c++ - Simple SSE loop is slower than the non-SSE version", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12186193/
