gcc - Why is `mov %eax, %eax; nop` faster than `nop`?

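Apparently, modern processors can tell when you do something silly, such as moving a register to itself (mov %eax, %eax), and optimize it away. To verify that claim, I ran the following program: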

#include <stdio.h>
#include <time.h>

static inline void f1() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "mov %eax, %eax;"
            "nop;"
            );
}

static inline void f2() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "nop;"
            );
}

static inline void f3() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "mov %ebx, %eax;"
            "nop;"
            );
}

int main() {
   int NRUNS = 10;
   clock_t t, t1, t2, t3;

   t1 = t2 = t3 = 0;
   for (int run = 0; run < NRUNS; run++) {
      t = clock(); f1(); t1 += clock()-t;
      t = clock(); f2(); t2 += clock()-t;
      t = clock(); f3(); t3 += clock()-t;
   }

   printf("f1() took %f cycles on avg\n", (float) t1/ (float) NRUNS);
   printf("f2() took %f cycles on avg\n", (float) t2/ (float) NRUNS);
   printf("f3() took %f cycles on avg\n", (float) t3/ (float) NRUNS);

   return 0;
}

This gave me:

f1() took 175587.093750 cycles on avg
f2() took 188313.906250 cycles on avg
f3() took 194654.296875 cycles on avg

As one would expect, f3() comes out slowest. But surprisingly (to me at least), f1() is faster than f2(). Why is that?

Update: compiling with -falign-loops gives qualitatively the same results:

f1() took 164271.000000 cycles on avg
f2() took 173783.296875 cycles on avg
f3() took 177765.203125 cycles on avg

Best answer

The part of the linked article that made me think that this can be optimized away is: "the move function takes care of checking for equivalent locations"

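That is talking about the (move r x) function in SBCL, not the x86 mov instruction. It describes an optimization during code generation from a low-level intermediate language, not an optimization done by the hardware at run time.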

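Neither mov %eax, %eax nor nop is completely free. They both cost front-end throughput, and mov %eax,%eax isn't even a NOP in 64-bit mode (it zero-extends EAX into RAX, and because the source and destination are the same register, mov-elimination fails on Intel CPUs).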
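The zero-extension point can be checked directly. The following is a minimal sketch, assuming GCC or Clang on x86-64; the helper name mov_self_32 is hypothetical and not part of the question's code. It runs a 32-bit register-to-self mov on a 64-bit value and returns the whole register, so the upper 32 bits should come back cleared.

```c
#include <stdint.h>

/* Hypothetical helper (not from the question): performs a 32-bit
   register-to-self mov on x and returns the full 64-bit register. */
uint64_t mov_self_32(uint64_t x) {
#if defined(__x86_64__)
    /* %k0 prints the 32-bit name of operand 0 (e.g. eax for rax).
       Writing a 32-bit register zero-extends into the full 64-bit register. */
    __asm__("mov %k0, %k0" : "+r"(x));
#else
    /* Assumption: portable stand-in so the sketch still builds elsewhere. */
    x = (uint32_t)x;
#endif
    return x;
}
```

Calling mov_self_32(0x1122334455667788ULL) should return 0x55667788, which shows that the instruction changes architectural state and is therefore not a true NOP in 64-bit mode.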

See Can x86's MOV really be "free"? Why can't I reproduce this at all? for more about front-end/back-end throughput bottlenecks vs. latency.

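You're probably seeing a side effect of code alignment, or perhaps a quirky Sandybridge-family store-forwarding latency effect like the one in Adding a redundant assignment speeds up code when compiled without optimization, because you're also compiling with optimization disabled. That makes the compiler generate anti-optimized code for consistent debugging, which keeps the loop counter in memory. (This creates a ~6-cycle loop-carried dependency chain through the store/reload, instead of 1 iteration per clock for a normal tiny loop.)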

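If your results are reproducible with a larger iteration count, there's probably some microarchitectural explanation for what you're seeing, but it's probably unrelated to anything you were trying to measure.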
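To make reruns with a larger iteration count easy, the count can be passed in as a parameter. This is only a sketch: nop_loop and seconds_for are hypothetical names, not from the question. It also converts the clock() result to seconds, since clock() returns ticks in units of 1/CLOCKS_PER_SEC rather than CPU cycles.

```c
#include <time.h>

/* Hypothetical loop body: one nop per iteration, with iters chosen by the caller. */
void nop_loop(long iters) {
    for (long i = 0; i < iters; i++)
        __asm__ volatile("nop");   /* volatile keeps the asm from being optimized out */
}

/* Hypothetical harness: returns the CPU time used by fn(iters), in seconds. */
double seconds_for(void (*fn)(long), long iters) {
    clock_t start = clock();
    fn(iters);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

For example, comparing seconds_for(nop_loop, 100000000L) against seconds_for(nop_loop, 1000000000L) over several runs would show whether a difference persists at a larger scale or is just noise.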

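Of course, you'd also need to fix the mov %ebx, %eax; bug in f3 for the code to work correctly when compiled with optimization enabled. Clobbering EAX without telling the compiler will step on compiler-generated code. You didn't explain what you were trying to test with that instruction, so I don't know whether it was a typo.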
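One way to declare the clobber is GNU extended asm. The sketch below assumes GCC or Clang on x86; the function name f3_fixed and its iteration-count parameter are hypothetical. Once an asm statement has operand or clobber sections, register names in the template need a doubled %.

```c
/* Hypothetical variant of f3 from the question, with EAX declared as clobbered. */
long f3_fixed(long iters) {
    long done = 0;
    for (long i = 0; i < iters; i++) {
#if defined(__i386__) || defined(__x86_64__)
        __asm__ volatile(
            "mov %%ebx, %%eax\n\t"   /* %% because this is extended asm */
            "nop"
            :                        /* no outputs */
            :                        /* no inputs */
            : "eax"                  /* tell the compiler EAX is overwritten */
        );
#else
        __asm__ volatile("");        /* assumption: empty stand-in on non-x86 targets */
#endif
        done++;
    }
    return done;                     /* number of iterations executed */
}
```

With the clobber listed, the compiler knows not to keep a live value in EAX across the asm statement, so the loop stays correct at -O2 and higher.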

Regarding gcc - Why is `mov %eax, %eax; nop` faster than `nop`?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52992723/
