c++ - 为什么 C++ 优化器对这些临时变量有问题，或者更确切地说，为什么在紧密循环中应该避免 `v[]`？

在此code snippet ，我正在比较两个功能相同的循环的性能:

for (int i = 1; i < v.size()-1; ++i) {
  int a = v[i-1];
  int b = v[i];
  int c = v[i+1];

  if (a < b  &&  b < c)
    ++n;
}

和

for (int i = 1; i < v.size()-1; ++i) 
  if (v[i-1] < v[i]  &&  v[i] < v[i+1])
    ++n;

在优化标志设置为 O2 的许多不同 C++ 编译器中，第一个编译器的运行速度明显慢于第二个编译器:

现在使用 Clang 3.7.0，第二个循环大约 330% 慢
使用 gcc 4.9.3，第二个循环大约慢 2%
在 Visual C++ 2015 中，第二个循环大约慢 2%

我很困惑现代 C++ 优化器在处理这种情况时会遇到问题。任何线索为什么？我是否必须在不使用临时变量的情况下编写丑陋的代码才能获得最佳性能？

现在，使用临时变量可以使代码更快，有时甚至是显着的。怎么回事？

下面提供了我正在使用的完整代码:

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) {
    int a = v[i-1];
    int b = v[i];
    int c = v[i+1];

    if (a < b  &&  b < c)
      ++n;
  }

  return n;
}


int f1()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) 
    if (v[i-1] < v[i]  &&  v[i] < v[i+1])
      ++n;

  return n;
}


int main()
{
  auto benchmark = [](int (*f)()) {
    const int N = 100;

    volatile long long result = 0;
    vector<long long>  timings(N);

    for (int i = 0; i < N; ++i) {
      auto t0 = high_resolution_clock::now(); 
      result += f();
      auto t1 = high_resolution_clock::now(); 

      timings[i] = duration_cast<nanoseconds>(t1-t0).count();
    }

    sort(timings.begin(), timings.end());
    cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
    cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
  };

  mt19937                    generator   (31415);   // deterministic seed
  uniform_int_distribution<> distribution(0, 1023);

  for (auto& e: v) 
    e = distribution(generator);

  benchmark(f0);
  benchmark(f1);

  cout << "\ndone\n";

  return 0;
}

最佳答案

编译器似乎缺乏关于 std::vector<>::size() 之间关系的知识。和内部 vector 缓冲区大小。考虑std::vector成为我们的定制bugged_vector带有轻微错误的类 vector 对象 - 它的 ::size()有时可能比内部缓冲区大小大一 n , 但只有这样 v[n-2] >= v[n-1] .

然后两个片段再次具有不同的语义:第一个具有未定义的行为，因为我们访问元素 v[v.size() - 1] .然而，第二个没有:由于 && 的短路性质，我们从来没有读过v[v.size() - 1]在最后一次迭代中。

所以，如果编译器不能证明我们的 v不是 bugged_vector ，它必须短路，这会在机器代码中引入额外的跳转。

通过查看来自 clang 的程序集输出，我们可以看到它确实发生了。

来自 the Godbolt Compiler Explorer ，使用 clang 3.7.0 -O2，f0 中的循环是:

### f0: just the loop
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
    mov     edi, ecx
    cmp     edx, edi
    setl    r10b
    mov     ecx, dword ptr [r8 + 4*rsi + 4]
    lea     rsi, [rsi + 1]
    cmp     edi, ecx
    setl    dl
    and     dl, r10b
    movzx   edx, dl
    add     eax, edx
    cmp     rsi, r9
    mov     edx, edi
    jb      .LBB1_2

对于 f1 :

### f1: just the loop
.LBB2_2:                                # =>This Inner Loop Header: Depth=1
    mov     esi, r10d
    mov     r10d, dword ptr [r9 + 4*rdi]
    lea     rcx, [rdi + 1]
    cmp     esi, r10d
    jge     .LBB2_4                     # <== This is Extra Jump
    cmp     r10d, dword ptr [r9 + 4*rdi + 4]
    setl    dl
    movzx   edx, dl
    add     eax, edx
.LBB2_4:                                # %._crit_edge.3
    cmp     rcx, r8
    mov     rdi, rcx
    jb      .LBB2_2

我已经指出了 f1 中的额外跳转.正如我们(希望)知道的那样，紧密循环中的条件跳转对性能不利。 (有关详细信息，请参阅 x86 标签 wiki 中的性能指南。)

~~GCC 和 Visual Studio 都知道 std::vector表现良好，并且为两个片段生成几乎相同的程序集。~~ 编辑。原来clang在优化代码方面做得更好。所有三个编译器都不能证明阅读 v[i + 1] 是安全的在第二个示例中比较之前(或选择不比较)，但仅限 clang设法使用读取 v[i + 1] 的附加信息来优化第一个示例是有效的还是 UB。

2% 的性能差异可以忽略不计，这可以通过某些指令的不同顺序或选择来解释。

关于c++ - 为什么 C++ 优化器对这些临时变量有问题，或者更确切地说，为什么在紧密循环中应该避免 `v[]`？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/36414959/

c++ - 为什么 C++ 优化器对这些临时变量有问题，或者更确切地说，为什么在紧密循环中应该避免 `v[]`？

上一篇：c++ - 如何在 C++ 中迭代一个充满字符串的 std::map

下一篇：c++ - bool 运算符可以与预处理器一起使用吗？