c++ - Performance regression in Eigen 3.3.0 vs. 3.2.10?

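We are in the process of porting our codebase over to Eigen 3.3 (quite an undertaking, given all the 32-byte alignment issues). However, there are a few places where performance seems to have been badly affected, contrary to expectations (I was looking forward to some speedup, given the extra support for FMA and AVX...). These include eigenvalue decomposition and matrix*matrix.transpose()*vector products. I've written two minimal working examples to demonstrate.

All tests were run on an up-to-date Arch Linux system, using an Intel Core i7-4930K CPU (3.40GHz), and compiled with g++ version 6.2.1.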

1. Eigenvalue decomposition:

A straightforward self-adjoint eigenvalue decomposition takes twice as long with Eigen 3.3.0 as it does with 3.2.10.

File test_eigen_EVD.cpp:

#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <Eigen/Eigenvalues>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  SelfAdjointEigenSolver<MatrixXf> eig;

  for (int n = 0; n < 1000; ++n)
    eig.compute (mat);

  return 0;
}

Test results:

eigen-3.2.10:

g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD

real    0m5.136s
user    0m5.133s
sys     0m0.000s

Eigen 3.3.0:

g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD

real    0m11.008s
user    0m11.007s
sys     0m0.000s

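I'm not sure what is causing this, but if anyone can see a way of maintaining performance with Eigen 3.3, I'd like to know about it!

2. matrix*matrix.transpose()*vector product:

This particular example takes 200 times longer with Eigen 3.3.0...

File test_eigen_products.cpp: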

#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  VectorXf vec = VectorXf::Random(SIZE);

  for (int n = 0; n < 50; ++n)
    vec = mat * mat.transpose() * VectorXf::Random(SIZE);

  return vec[0] == 0.0;
}

Test results:

eigen-3.2.10:

g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products

real    0m0.040s
user    0m0.037s
sys     0m0.000s

Eigen 3.3.0:

g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products

real    0m8.112s
user    0m7.700s
sys     0m0.410s

Adding brackets to the line in the loop like this:

    vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) );

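makes a huge difference: both Eigen versions then perform equally well (3.3.0 is actually slightly better), and both are faster than the unbracketed 3.2.10 case. So there is a fix. Still, it's odd that 3.3.0 would struggle so much with this.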
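To see why the brackets matter regardless of what Eigen does internally, the two evaluation orders can be compared by counting scalar multiplications. The following is a minimal sketch using plain std::vector rather than Eigen. The names Vec, Mat, Counted, left_to_right and parenthesized are hypothetical helpers introduced only for illustration, and the count ignores vectorisation, caching and any expression-template logic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative helpers only; none of these names are part of Eigen.
using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // square matrix stored as a list of rows

struct Counted {
  Vec value;              // result of the product chain
  std::size_t mults = 0;  // scalar multiplications performed
};

// y = A * x, costing n*n multiplications.
Vec mat_vec(const Mat& A, const Vec& x, std::size_t& mults) {
  const std::size_t n = A.size();
  assert(x.size() == n);
  Vec y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      y[i] += A[i][j] * x[j];
      ++mults;
    }
  return y;
}

// y = A^T * x, also costing n*n multiplications.
Vec mat_transposed_vec(const Mat& A, const Vec& x, std::size_t& mults) {
  const std::size_t n = A.size();
  assert(x.size() == n);
  Vec y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      y[j] += A[i][j] * x[i];
      ++mults;
    }
  return y;
}

// G = A * A^T, costing n*n*n multiplications.
Mat mat_mat_transposed(const Mat& A, std::size_t& mults) {
  const std::size_t n = A.size();
  Mat G(n, Vec(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k) {
        G[i][j] += A[i][k] * A[j][k];
        ++mults;
      }
  return G;
}

// (A * A^T) * v: one matrix*matrix product, then one matrix*vector product.
Counted left_to_right(const Mat& A, const Vec& v) {
  Counted r;
  const Mat G = mat_mat_transposed(A, r.mults);
  r.value = mat_vec(G, v, r.mults);
  return r;
}

// A * (A^T * v): two matrix*vector products, no matrix*matrix product.
Counted parenthesized(const Mat& A, const Vec& v) {
  Counted r;
  const Vec t = mat_transposed_vec(A, v, r.mults);
  r.value = mat_vec(A, t, r.mults);
  return r;
}
```

Calling left_to_right(A, v) and parenthesized(A, v) on the same inputs gives the same vector, but the first costs n^3 + n^2 multiplications and the second 2n^2. For SIZE = 200 that is 8,040,000 against 80,000 per iteration, roughly a factor of 100 in raw arithmetic. This is only an operation count and does not by itself account for the timings measured above.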

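I don't know whether this is a bug, but I reckon it's worth reporting in case it is something that needs to be fixed. Or maybe I'm just doing it wrong...

Any thoughts appreciated. Cheers, Donald.

EDIT

As pointed out by ggael, the EVD in Eigen 3.3 is faster if compiled with clang++, or with -O3 using g++. So problem 1 is solved.

Problem 2 isn't really a problem, since I can use brackets to force the most efficient order of operations. But just for completeness: there does seem to be a flaw somewhere in the evaluation of these operations. Eigen is an incredible piece of software, and I think this probably deserves to be fixed. Here's a modified version of the MWE, just to show that the difference is unlikely to be related to the first temporary product being taken out of the loop (at least as far as I can tell):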

#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <iostream>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  // vecsum is accumulated into below, so it must start at zero
  VectorXf vec (SIZE), vecsum = VectorXf::Zero(SIZE);
  MatrixXf mat (SIZE,SIZE);

  for (int n = 0; n < 50; ++n) {
    mat = MatrixXf::Random(SIZE,SIZE);
    vec = VectorXf::Random(SIZE);
    vecsum += mat * mat.transpose() * VectorXf::Random(SIZE);
  }

  std::cout << vecsum.norm() << std::endl;
  return 0;
}

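In this example, the operands are all initialised within the loop and the results are accumulated in vecsum, so there is no way the compiler can precompute anything or optimise away unnecessary computations. This shows the exact same behaviour (this time tested with clang++ -O3, version 3.9.0):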

$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82

real    0m0.060s
user    0m0.057s
sys     0m0.000s

$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82

real    0m4.225s
user    0m3.873s
sys     0m0.350s

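Same result, but vastly different execution times. Thankfully, this is easily resolved by placing brackets in the right places, but there does seem to be a regression somewhere in how Eigen 3.3 evaluates these operations. With brackets around the mat.transpose() * VectorXf::Random(SIZE) part, the execution times for both Eigen versions drop to around 0.020s (so Eigen 3.2.10 clearly also benefits in this case). At least this means we can keep getting excellent performance out of Eigen!

In the meantime, I'll accept ggael's answer, as it's all I needed to know to move forward.

Best answer

For the EVD, I cannot reproduce the slowdown with clang. With gcc, you need -O3 to avoid an inlining issue. Then, with both compilers, Eigen 3.3 delivers a 33% speedup.

EDIT: My previous answer regarding the matrix*matrix*vector product was wrong. This is a shortcoming in Eigen 3.3.0 that will be fixed in Eigen 3.3.1. For the record, I leave my previous analysis here, as it is still partly valid: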

As you noticed, you should really add the parentheses to perform two matrix*vector products instead of one big matrix*matrix product. The speed difference is then easily explained by the fact that in 3.2 the nested matrix*matrix product is evaluated immediately (at nesting time), whereas in 3.3 it is evaluated at evaluation time, that is, in operator=. This means that in 3.2, the loop is equivalent to:
for (int n = 0; n < 50; ++n) {
  MatrixXf tmp = mat * mat.transpose();
  vec = tmp * VectorXf::Random(SIZE);
}
and thus the compiler can move tmp out of the loop. Production code should not rely on the compiler for this kind of task, and should rather explicitly move constant expressions outside loops.

This is true, except that in practice the compiler is not smart enough to move the temporary out of the loop.
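The trade-off discussed above can be put into numbers with a rough cost model. The sketch below counts scalar multiplications only, for a loop of iters iterations over n x n matrices. The function names are hypothetical and chosen for illustration, and the model says nothing about how Eigen 3.2 or 3.3 actually schedules the work.

```cpp
#include <cassert>
#include <cstdint>

// Rough cost model, counting scalar multiplications only.
// All names here are illustrative; nothing below is Eigen API.

// One n x n by n x n matrix product.
inline std::uint64_t cost_matmat(std::uint64_t n) { return n * n * n; }

// One n x n matrix by n-vector product.
inline std::uint64_t cost_matvec(std::uint64_t n) { return n * n; }

// tmp = mat * mat.transpose() is recomputed on every iteration.
inline std::uint64_t recomputed_each_iteration(std::uint64_t n,
                                               std::uint64_t iters) {
  return iters * (cost_matmat(n) + cost_matvec(n));
}

// tmp is computed once before the loop and reused
// (only valid when mat does not change between iterations).
inline std::uint64_t hoisted_out_of_loop(std::uint64_t n,
                                         std::uint64_t iters) {
  return cost_matmat(n) + iters * cost_matvec(n);
}

// mat * (mat.transpose() * vec): two matrix*vector products per iteration.
inline std::uint64_t two_matvecs(std::uint64_t n, std::uint64_t iters) {
  return iters * 2 * cost_matvec(n);
}

// How many times more work `a` represents than `b`.
inline double work_ratio(std::uint64_t a, std::uint64_t b) {
  assert(b != 0);
  return static_cast<double>(a) / static_cast<double>(b);
}
```

With n = 200 and iters = 50, as in the question, the model gives 402,000,000 multiplications when the product is recomputed every iteration, 10,000,000 when it is hoisted, and 4,000,000 for the parenthesized form. Under these assumptions the parenthesized form does less arithmetic even than an explicitly hoisted temporary, and it stays valid when mat changes inside the loop.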

Regarding "c++ - Performance regression in Eigen 3.3.0 vs. 3.2.10?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40805386/
