c++ - Why does icc generate weird assembly for a simple main?

I have a simple program:

int main()
{
    return 2*7;
}

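With optimization enabled, GCC and clang both happily compile this to just two instructions (mov eax, 14 followed by ret), but icc gives strange output: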

     push      rbp                                           #2.1
     mov       rbp, rsp                                      #2.1
     and       rsp, -128                                     #2.1
     sub       rsp, 128                                      #2.1
     xor       esi, esi                                      #2.1
     mov       edi, 3                                        #2.1
     call      __intel_new_feature_proc_init                 #2.1
     stmxcsr   DWORD PTR [rsp]                               #2.1
     mov       eax, 14                                       #3.12
     or        DWORD PTR [rsp], 32832                        #2.1
     ldmxcsr   DWORD PTR [rsp]                               #2.1
     mov       rsp, rbp                                      #3.12
     pop       rbp                                           #3.12
     ret

Best answer

I don't know why ICC chooses to align the stack by 2 cache lines (128 bytes):

and       rsp, -128                                     #2.1
sub       rsp, 128                                      #2.1

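That's interesting. The L2 cache has an adjacent-line prefetcher that likes to pull pairs of lines (in 128-byte-aligned groups) into L2. But main's stack frame usually isn't heavily used; maybe in some programs important variables are allocated there. (This also explains setting up RBP to save the old RSP, so the function can restore it and return after the AND. gcc likewise uses RBP as a frame pointer in functions where it realigns the stack.)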

The rest is because main() is special, and ICC enables -ffast-math by default. (This is one of Intel's "dirty" little secrets: it lets ICC auto-vectorize more FP code out of the box.)

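That includes adding code at the top of main that sets the DAZ/FTZ bits in MXCSR (the SSE status/control register). See Intel's x86 manuals for more about these bits, but they're really not complicated: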

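DAZ: Denormals Are Zero: denormals used as inputs to SSE/AVX instructions are treated as zero.
FTZ: Flush To Zero: when rounding the result of an SSE/AVX instruction, subnormal results are flushed to zero.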

Related: SSE "denormals are zeros" option

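(ISO C++ forbids the program from calling main() itself, so the compiler is allowed to put run-once code in main instead of in CRT startup files. gcc/clang with -ffast-math specified at link time link in a CRT startup file (crtfastmath.o) that sets MXCSR. But when compiling with gcc/clang, -ffast-math only affects code generation, i.e. which optimizations are allowed: for example, treating FP add/mul as associative even though different rounding of the temporaries means they really aren't. That is entirely unrelated to setting DAZ/FTZ.)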

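Denormal is used here as a synonym for subnormal: a nonzero FP value with the minimum exponent, where the implicit leading bit of the mantissa is 0 instead of 1. That is, a value with magnitude smaller than FLT_MIN or DBL_MIN, the smallest representable normalized float / double.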

https://en.wikipedia.org/wiki/Denormal_number .

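Instructions that produce subnormal results can be much slower: to optimize latency, some hardware has a fast path that assumes the result is normalized, and takes a microcode assist if the result can't be normalized. Use perf stat -e fp_assist.any to count such events.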

From Bruce Dawson's excellent series of FP articles: That’s Not Normal–the Performance of Odd Floats. Also:

Why does changing 0.1f to 0 slow down performance by 10x?

Avoiding denormal values in C++

Agner Fog did some testing (see his microarchitecture pdf) and reports for Haswell/Broadwell:

Underflow and subnormals

Subnormal numbers occur when floating point operations are close to underflow. The handling of subnormal numbers is very costly in some cases because the subnormal results are handled by microcode exceptions.

The Haswell and Broadwell have a penalty of approximately 124 clock cycles in all cases where an operation on normal numbers gives a subnormal result. There is a similar penalty for a multiplication between a normal and a subnormal number, regardless of whether the result is normal or subnormal. There is no penalty for adding a normal and a subnormal number, regardless of the result. There is no penalty for overflow, underflow, infinity or not-a-number results.

The penalties for subnormal numbers are avoided if the "flush-to-zero" mode and the "denormals-are-zero" mode are both set in the MXCSR register.

So in some cases modern Intel CPUs avoid the penalty even with subnormals, but

A similar question on Stack Overflow: https://stackoverflow.com/questions/52141947/
