c++ - 为什么 asm 中的这种差异对性能很重要(在未优化的 ptr++ 与++ptr 循环中)？

TL;博士 :第一个循环在 Haswell CPU 上运行速度提高了约 18%。为什么？循环来自 gcc -O0 (未优化)循环使用 ptr++对比 ++ptr ，但问题是为什么生成的 asm 表现不同，而不是关于如何编写更好的 C。

假设我们有这两个循环:

    movl    $0, -48(%ebp)     //Loop counter set to 0
    movl    $_data, -12(%ebp) //Pointer to the data array
    movl    %eax, -96(%ebp)
    movl    %edx, -92(%ebp)
    jmp L21
L22:
    // ptr++
    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx   //Get the loop counter to edx
    movl    %edx, (%eax)      //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
    addl    $1, -48(%ebp)     //Increase the counter
L21:
    cmpl    $999999, -48(%ebp)
    jle     L22

第二个:

    movl    %eax, -104(%ebp)
    movl    %edx, -100(%ebp)
    movl    $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
    movl    $0, -48(%ebp)       //Set the loop counter to 0
    jmp L23
L24:
    // ++ptr
    addl    $4, -12(%ebp)       //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
    movl    -12(%ebp), %eax     //Store in eax the address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx     //Store in edx the current loop counter
    movl    %edx, (%eax)        //Move the loop counter value to the current stored address location
    addl    $1, -48(%ebp)       //Increase the loop counter
L23:
    cmpl    $999999, -48(%ebp)
    jle L24

这些循环在做完全相同的事情，但方式略有不同，请参阅评论以了解详细信息。

该汇编代码由以下两个 C++ 循环生成:

    //FIRST LOOP:
    for(;index<size;index++){
        *(ptr++) = index;
    }
    //SECOND LOOP:
    ptr = data - 1;
    for(index = 0;index<size;index++){
        *(++ptr) = index;
    }

现在，第一个循环比第二个循环快约 18%，无论循环执行的顺序如何，ptr++比 ++ptr 快.

为了运行我的基准测试，我只是收集了不同大小的循环的运行时间，并执行它们嵌套在其他循环中以频繁重复操作。

ASM分析

查看 ASM 代码，第二个循环包含较少的指令，我们有 3 个 movl 和 2 个 addl，而在第一个循环中我们有 4 个 movl，一个 addl 和一个 leal，所以我们多了一个 movl 和 一个 leal 而不是 addl
LEA 是否正确？计算正确地址的操作比 ADD 快得多(+4) 方法？这是性能差异的原因吗？

据我所知，一旦在可以引用内存之前计算出新地址，必须经过一些时钟周期，因此 addl $4,-12(%ebp) 之后的第二个循环需要稍等片刻才能继续，而在第一个循环我们可以立即引用内存，同时 LEAL 将计算下一个地址(这里有某种更好的管道性能)。

这里有一些重新排序吗？我不确定我对这些循环的性能差异的解释，我能有你的意见吗？

最佳答案

首先对-O0进行性能分析编译器输出通常不是很有趣或有用。

Is that correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?

不，add可以在任何 x86 CPU 上的每个 ALU 执行端口上运行。 lea通常与简单寻址模式下的延迟一样低，但吞吐量不那么好。在 Atom 上，它运行在与普通 ALU 指令不同的流水线阶段，因为它实际上名副其实，并在该有序微体系结构上使用了 AGU。

见 x86标记 wiki 以了解是什么让代码在不同的微架构上变慢或变快，尤其是。 Agner Fog's microarchitecture pdf and instruction tables .

add更糟糕的是因为它让 gcc -O0通过将其与内存目标一起使用然后从中加载来制作更糟糕的代码。

编译 -O0甚至不尝试使用最好的工作说明。例如你会得到 mov $0, %eax而不是 xor %eax,%eax你总是得到优化的代码。您不应该通过查看未优化的编译器输出来推断什么是好的。
-O0代码总是充满瓶颈，通常在加载/存储或存储转发上。不幸的是 IACA 没有考虑到存储转发延迟，所以它没有意识到这些循环实际上在

As far as i know, once a new address is calculated before the memory can be referenced some clock cycles must elapse, so the second loop after the addl $4,-12(%ebp) need to wait a little before proceeding,

是的，mov -12(%ebp) 的负载在作为 add 一部分的负载之后的大约 6 个周期内不会准备好的读-修改-写。

whereas in the first loop we can immediately refer the memory

是的

and in the meanwhile LEAL will compute the next address

不。

您的分析很接近，但您错过了下一次迭代仍然必须加载我们存储到 -12(%ebp) 中的值这一事实。 .所以循环携带的依赖链长度相同，下一次迭代的lea实际上不能比使用 add 在循环中更早开始

延迟问题可能不是循环吞吐量瓶颈:

需要考虑 uop/执行端口吞吐量。在这种情况下，OP 的测试表明它实际上是相关的。 (或来自资源冲突的延迟。)

当 gcc -O0工具ptr++ ，就像您说的那样，它将旧值保留在寄存器中。因此，可以提前知道存储地址，并且需要 AGU 的负载单元减少了一个。

假设使用 Intel SnB 系列 CPU:

## ptr++: 1st loop
movl    -12(%ebp), %eax   //1 uop (load)
leal    4(%eax), %edx     //1 uop (ALU only)
movl    %edx, -12(%ebp)   //1 store-address, 1 store-data
//   no load from -12(%ebp) into %eax
... rest the same.


 ## ++ptr:  2nd loop
addl    $4, -12(%ebp)       // read-modify-write: 2 fused-domain uops.  4 unfused: 1 load + 1 store-address + 1 store-data
movl    -12(%ebp), %eax     // load: 1 uop.   ~6 cycle latency for %eax to be ready
... rest the same

所以第二个循环的指针增量部分还有一个加载单元。可能是 AGU 吞吐量(地址生成单元)的代码瓶颈。 IACA 表示 arch=SNB 就是这种情况，但 HSW 对存储数据吞吐量(而不是 AGU)存在瓶颈。

然而，在不考虑存储转发延迟的情况下，IACA 表示第一个循环可以每 3.5 个周期运行一次迭代，而第二个循环每 4 个周期运行一次。这比 addl $1, -48(%ebp) 的 6 个循环循环携带依赖项更快循环计数器，这表明循环因延迟低于最大 AGU 吞吐量而受到瓶颈。 (资源冲突可能意味着它实际上运行速度比每 6c 一次迭代慢，见下文)。

我们可以测试这个理论:

向 lea 添加额外的负载 uop偏离关键路径的版本将占用更多吞吐量，但不会成为循环延迟链的一部分。例如

movl    -12(%ebp), %eax   //Get the current address
leal    4(%eax), %edx     //Calculate the next address
movl    %edx, -12(%ebp)   //Store the new (next) address

mov     -12(%ebp), %edx

%edx即将被 mov 覆盖，因此不依赖于此加载的结果。 (mov 的目的地是只写的，因此由于寄存器重命名，它打破了依赖链。)。

所以这个额外的负载会带来 lea循环到与 add 相同数量和风格的 uops循环，但延迟不同 .如果额外的负载对速度没有影响，我们就知道第一个循环在加载/存储吞吐量方面没有瓶颈。

更新:OP 的测试确认额外未使用的负载会减慢 lea循环到与 add 大致相同的速度环形。

当我们没有遇到执行端口吞吐量瓶颈时，为什么额外的 uops 很重要

uops 以最旧的优先顺序安排 (在已准备好操作数的 uops 之外)，而不是按关键路径优先顺序。稍后可以在备用周期中完成的额外 uops 实际上会延迟关键路径上的 uops(例如，循环携带依赖项的一部分)。这称为 资源冲突 ，并且可以增加关键路径的延迟。

即，不是等待关键路径延迟离开加载端口无事可做的周期，未使用的加载将在它是加载地址准备就绪的最旧加载时运行。这将延迟其他负载。

同样，在 add额外负载是关键路径的一部分的循环，额外负载会导致更多资源冲突，延迟关键路径上的操作。

其他猜测:

因此，也许更快地准备好存储地址就是这样做的，因此内存操作可以更好地流水线化。 (例如，当接近页面边界时，TLB-miss 页面遍历可以更快开始。即使是正常的硬件预取也不会跨越页面边界，即使它们在 TLB 中很热。循环接触 4MiB 的内存，这对于这种类型的内存来说已经足够了重要的事情。L3 延迟足够高，可能会造成管道泡沫。或者如果您的 L3 很小，那么主内存肯定是。

或者，额外的延迟可能只会让乱序执行更难做好。

关于c++ - 为什么 asm 中的这种差异对性能很重要(在未优化的 ptr++ 与++ptr 循环中)？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/37250289/

c++ - 为什么 asm 中的这种差异对性能很重要(在未优化的 ptr++ 与++ptr 循环中)？

上一篇：c++ - 使用 decltype(auto) 的基于范围的 for 循环

下一篇：c++ - gcc 错误？它莫名其妙地将数组衰减为指针，而clang没有