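c++ - Why are C arrays so much faster than std::array?

We are currently writing some performance-critical code in C++ that operates on many large matrices and vectors. According to our research, there should be no major performance difference between std::array and standard C arrays (see This question or this). However, while testing we saw a huge performance improvement from using C arrays instead of std::array. This is our demo code: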

#include <iostream>
#include <array>
#include <sys/time.h>

#define ROWS 784
#define COLS 100
#define RUNS 50

using std::array;

void DotPComplex(array<double, ROWS> &result, array<double, ROWS> &vec1, array<double, ROWS> &vec2){
  for(int i = 0; i < ROWS; i++){
    result[i] = vec1[i] * vec2[i];
  }
}

void DotPSimple(double result[ROWS], double vec1[ROWS], double vec2[ROWS]){
  for(int i = 0; i < ROWS; i++){
    result[i] = vec1[i] * vec2[i];
  }
}

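// Note: mat is declared with ROWS x COLS elements, but both MatMult functions
// index it as mat[i][j] with i < COLS and j < ROWS, so j runs past the inner
// dimension (COLS). The code is kept as posted because the timings below were
// measured with it.
void MatMultComplex(array<double, ROWS> &result, array<array<double, COLS>, ROWS> &mat, array<double, ROWS> &vec){
  for (int i = 0; i < COLS; ++i) {
      for (int j = 0; j < ROWS; ++j) {
        result[i] += mat[i][j] * vec[j];
      }
  }
}

// Same computation with plain C arrays; the array parameters decay to pointers.
void MatMultSimple(double result[ROWS], double mat[ROWS][COLS], double vec[ROWS]){
  for (int i = 0; i < COLS; ++i) {
      for (int j = 0; j < ROWS; ++j) {
        result[i] += mat[i][j] * vec[j];
      }
  }
}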

double getTime(){
    struct timeval currentTime;
    gettimeofday(&currentTime, NULL);
    double tmp = (double)currentTime.tv_sec * 1000.0 + (double)currentTime.tv_usec/1000.0;
    return tmp;
}

array<double, ROWS> inputVectorComplex = {{ 0 }};
array<double, ROWS> resultVectorComplex = {{ 0 }};
double inputVectorSimple[ROWS] = { 0 };
double resultVectorSimple[ROWS] = { 0 };

array<array<double, COLS>, ROWS> inputMatrixComplex = {{0}};
double inputMatrixSimple[ROWS][COLS] = { 0 };

int main(){
  double start;
  std::cout << "DotP test with C array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    DotPSimple(resultVectorSimple, inputVectorSimple, inputVectorSimple);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "DotP test with C++ array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    DotPComplex(resultVectorComplex, inputVectorComplex, inputVectorComplex);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "MatMult test with C array : " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    MatMultSimple(resultVectorSimple, inputMatrixSimple, inputVectorSimple);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "MatMult test with C++ array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    MatMultComplex(resultVectorComplex, inputMatrixComplex, inputVectorComplex);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;
}

Compiled with icpc demo.cpp -std=c++11 -O0, the results are:

DotP test with C array: 
Duration: 0.289795 ms
DotP test with C++ array: 
Duration: 1.98413 ms
MatMult test with C array : 
Duration: 28.3459 ms
MatMult test with C++ array: 
Duration: 175.15 ms

With the -O3 flag:

DotP test with C array: 
Duration: 0.0280762 ms
DotP test with C++ array: 
Duration: 0.0288086 ms
MatMult test with C array : 
Duration: 1.78296 ms
MatMult test with C++ array: 
Duration: 4.90991 ms

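Without compiler optimization, the C array implementation is much faster. Why? With compiler optimization, the dot products are equally fast. But for the matrix multiplication there is still a significant speedup when using C arrays. Is there a way to achieve equal performance when using std::array?

Update:

Compiler used: icpc 17.0.0

With gcc 4.8.5 our code runs much slower than with the Intel compiler at any optimization level. Therefore we are mainly interested in the behavior of the Intel compiler.

As suggested by Jonas, we adjusted RUNS to 50,000, with the following results (Intel compiler):

With the -O0 flag: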

DotP test with C array: 
Duration: 201.764 ms
DotP test with C++ array: 
Duration: 1020.67 ms
MatMult test with C array : 
Duration: 15069.2 ms
MatMult test with C++ array: 
Duration: 123826 ms

With the -O3 flag:

DotP test with C array: 
Duration: 16.583 ms
DotP test with C++ array: 
Duration: 15.635 ms
MatMult test with C array : 
Duration: 980.582 ms
MatMult test with C++ array: 
Duration: 2344.46 ms

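Best answer

First, you are using far too few runs. Personally, I did not realize (before running the code) that your "Duration" measurements are in milliseconds.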
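To make the unit of a measurement hard to miss, the timing helper can carry it in its name and in the type used for the conversion. The following is only a sketch: `elapsedMilliseconds` is a hypothetical helper that does not appear in the question, and it uses `std::chrono::steady_clock` instead of the question's `gettimeofday`, on the assumption that a monotonic clock is preferable for measuring intervals.

```cpp
#include <chrono>

// Hypothetical helper (not part of the original post): runs `work` once and
// returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsedMilliseconds(F&& work) {
    using Clock = std::chrono::steady_clock;  // monotonic clock
    const auto start = Clock::now();
    work();
    const auto stop = Clock::now();
    // duration<double, std::milli> makes the unit part of the type.
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `elapsedMilliseconds([&]{ DotPSimple(resultVectorSimple, inputVectorSimple, inputVectorSimple); })` would then replace the manual `start` / `getTime()` pair from the question.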

With RUNS increased to 5,000,000 for DotPSimple and DotPComplex, the timings look something like this:

DotP test with C array:
Duration: 1074.89
DotP test with C++ array:
Duration: 1085.34

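That is, they are very close to equally fast. In fact, which one comes out fastest varies from run to run, because of the stochastic nature of benchmarking. The same holds for MatMultSimple and MatMultComplex, though they only needed 50,000 runs.

If you really want to measure and learn more, you should embrace the stochastic nature of this benchmark and approximate the distribution of the "Duration" measurements. Also randomize the order of the functions, to remove any ordering bias.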
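One way to act on this advice is sketched below. It collects one timing sample per function per round, shuffles the order of the functions in every round, and summarizes the samples with a median. The names `Candidate`, `collectSamples` and `medianMs` are made up for this illustration, and the median is just one possible summary of the distribution; none of this comes from the original answer.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <vector>

// Hypothetical types and names for illustration; not from the original post.
struct Candidate {
    std::string name;
    std::function<void()> run;      // one timed unit of work
    std::vector<double> samplesMs;  // one entry per round
};

// Times every candidate once per round, in a fresh random order each round,
// so that no candidate always runs first or last.
inline void collectSamples(std::vector<Candidate>& candidates, int rounds, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::size_t> order(candidates.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;

    for (int r = 0; r < rounds; ++r) {
        std::shuffle(order.begin(), order.end(), rng);
        for (std::size_t idx : order) {
            const auto start = std::chrono::steady_clock::now();
            candidates[idx].run();
            const auto stop = std::chrono::steady_clock::now();
            candidates[idx].samplesMs.push_back(
                std::chrono::duration<double, std::milli>(stop - start).count());
        }
    }
}

// Median of the samples (taken by value, so the caller's data is not reordered).
// Returns 0 for an empty sample set.
inline double medianMs(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid] : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

Each `Candidate::run` would wrap an inner loop of calls to one of the functions from the question. Keeping the raw samples, rather than a single total, also makes it possible to look at the spread or plot a histogram afterwards.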

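Edit: The assembly code (from user2079303's answer) fully proves that there is no difference with optimization enabled. So zero-cost abstractions are indeed zero-cost with optimization enabled, which is a reasonable requirement.

Update:

The compiler I used:

g++ (Debian 6.3.0-6) 6.3.0 20170205

With the following command:

g++ -Wall -Wextra -pedantic -O3 test.cpp

On this processor: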
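As a small illustration of why the abstraction can be free, the sketch below runs the same element-wise product once through `std::array::operator[]` and once through raw pointers obtained from `std::array::data()`. `operator[]` does no bounds checking and `data()` points at the array's contiguous storage, so both loops touch the same kind of memory in the same way. The names `multiplyRaw`, `multiplyArray` and `versionsAgree` and the size `N` are invented for this example. A plausible reading of the -O0 numbers, offered here as an assumption rather than something the answer states, is that without optimization each `operator[]` remains a real function call.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 8;  // small illustrative size, not the post's ROWS

// Same element-wise product as DotPSimple in the question, on raw pointers.
inline void multiplyRaw(double* result, const double* a, const double* b) {
    for (std::size_t i = 0; i < N; ++i) result[i] = a[i] * b[i];
}

// Same loop as DotPComplex in the question, using unchecked operator[].
inline void multiplyArray(std::array<double, N>& result,
                          const std::array<double, N>& a,
                          const std::array<double, N>& b) {
    for (std::size_t i = 0; i < N; ++i) result[i] = a[i] * b[i];
}

// Returns true when both versions produce identical output for the same input.
inline bool versionsAgree() {
    std::array<double, N> a{}, b{}, viaArray{}, viaRaw{};
    for (std::size_t i = 0; i < N; ++i) {
        a[i] = 1.0 + i;
        b[i] = 0.5 * i;
    }
    multiplyArray(viaArray, a, b);
    multiplyRaw(viaRaw.data(), a.data(), b.data());  // data() exposes the contiguous storage
    return viaArray == viaRaw;
}
```

Passing `data()` to a pointer-based function is also a simple way to reuse existing C-style routines while keeping `std::array` as the owning type.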

Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz

The original question, "Why are C arrays so much faster than std::array?", can be found on Stack Overflow: https://stackoverflow.com/questions/43183012/
