c++ - 确定代码块需要多少个时钟周期

标签 c++ performance optimization cpu performance-testing

是否有工具或方法可以告诉我代码块使用了多少时钟周期? 手动调试和计数对于较大的代码块来说是一种痛苦。

最佳答案

在 x86 上,Intel's IACA (Intel Architecture Code Analyzer是我所知道的唯一静态分析器。它假定零缓存未命中和各种其他简化,但有点用处。

我认为它还假定除最后一个分支外的所有分支都未被采用,因此它可能对具有已采用分支的循环体没有用。

IACA 的数据也有一些错误,例如它认为 shld 在 Sandybridge 上运行缓慢。它确实知道一些不明显的事情,比如 SnB-family CPUs can't micro-fuse 2-register addressing modes .

自 Haswell 更新以来,它基本上已被放弃。 Skylake 可以在比 Haswell 更多的执行端口上运行一些指令(参见 Agner Fog's instruction tables ),但管道非常相似,结果应该相当有用。另请参阅 上的其他链接标记 wiki,包括英特尔的优化手册,以帮助您理解输出。


我喜欢使用这个 iaca.sh 包装器脚本来使 -64 成为默认值(我可以用 -32 覆盖它)。我忘记了我写了多少(可能只是最后的 if (($# >= 1)) 位)以及 LD_LIBRARY_PATH 部分来自哪里。

iaca.sh:

#!/bin/bash
myname=$(realpath "$0")
mypath=$(dirname "$myname")
ld_lib="$LD_LIBRARY_PATH"
app_loc="../lib"

if [ "$LD_LIBRARY_PATH" = "" ]
then
export LD_LIBRARY_PATH="$mypath/$app_loc"
else
export LD_LIBRARY_PATH="$mypath/$app_loc:$LD_LIBRARY_PATH"
fi

if (($# >= 1));then
    exec "$mypath/iaca" -64 "$@"
else
    exec "$mypath/iaca"  # there is no -help, just run with no args for help output
fi

示例:就地前缀和,来自 SIMD prefix sum on Intel cpu :

#include <immintrin.h>

#ifdef IACA_MARKS_OFF
  #define IACA_START
  #define IACA_END
#else
  #include <iacaMarks.h>
#endif

// In-place rewrite an array of values into an array of prefix sums.
// This makes the code simpler, and minimizes cache effects.
int prefix_sum_sse(int data[], int n)
{

//    const int elemsz = sizeof(data[0]);
#define elemsz sizeof(data[0])   // clang-3.5 doesn't allow const int foo = ... as an imm8 arg to intrinsics

    __m128i *datavec = (__m128i*)data;
    const int vec_elems = sizeof(*datavec)/elemsz;
    // to use this for int8/16_t, you still need to change the add_epi32, and the shuffle

    const __m128i *endp = (__m128i*) (data + n - 2*vec_elems);  // pointer to last full vector we can load
    __m128i carry = _mm_setzero_si128();
    for(; datavec <= endp ; datavec += 2) {
        IACA_START
        __m128i x0 = _mm_load_si128(datavec + 0);
        __m128i x1 = _mm_load_si128(datavec + 1); // unroll / pipeline by 1
//      __m128i x2 = _mm_load_si128(datavec + 2);
//      __m128i x3;

        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, elemsz));

        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, 2*elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, 2*elemsz));

        // more shifting if vec_elems is larger

        x0 = _mm_add_epi32(x0, carry);  // this has to go after the byte-shifts, to avoid double-counting the carry.
        _mm_store_si128(datavec +0, x0); // store first to allow destructive shuffle (e.g. non-avx shufps for FP or pshufb for narrow integers)

        x1 = _mm_add_epi32(_mm_shuffle_epi32(x0, _MM_SHUFFLE(3,3,3,3)), x1);
        _mm_store_si128(datavec +1, x1);

        carry = _mm_shuffle_epi32(x1, _MM_SHUFFLE(3,3,3,3)); // broadcast the high element for next vector
    }
    // FIXME: scalar loop to handle the last few elements
    IACA_END
    return data[n-1];
    #undef elemsz
}

$ gcc -I/opt/iaca-2.1/include -Wall -O3 -c prefix-sum.c -march=nehalem -mtune=haswell
$ iaca.sh prefix-sum.o
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - prefix-sum.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 6.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 5.7  | 1.4    1.0  | 1.4    1.0  | 2.0  | 6.3  | 1.0  | 1.3  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | movdqa xmm3, xmmword ptr [rax]
|   1    | 1.0       |     |           |           |     |     |     |     |    | add rax, 0x20
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | movdqa xmm0, xmmword ptr [rax-0x10]
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm1, xmm3
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm1, 0x4
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm1, xmm3
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm3, xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm3, 0x4
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm4, xmm1
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm3, xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm4, 0x8
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm0, xmm3
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm1, xmm4
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pslldq xmm0, 0x8
|   1    |           | 1.0 |           |           |     |     |     |     |    | paddd xmm1, xmm2
|   1    |           | 0.8 |           |           |     | 0.2 |     |     | CP | paddd xmm0, xmm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | movaps xmmword ptr [rax-0x20], xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm1, 0xff
|   1    |           | 0.9 |           |           |     | 0.1 |     |     | CP | paddd xmm0, xmm1
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 |    | movaps xmmword ptr [rax-0x10], xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0xff
|   0*   |           |     |           |           |     |     |     |     |    | movdqa xmm2, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp rdx, rax
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0xffffffffffffff94
Total Num Of Uops: 20

请注意,uop 总数不是对前端、ROB 和 4-wide issue/retire 宽度很重要的融合域 uops。它计算未融合域微指令,这对执行单元(和调度程序)很重要。但这有点愚蠢,因为在未融合的领域中,最重要的是 uop 需要哪个端口,而不是有多少。

这不是最好的例子,因为它在 Haswell 的 shuffle 端口上很容易成为瓶颈。不过,它确实展示了 IACA 如何显示移动消除、微融合存储和宏融合比较和分支。

当有选择时,微指令在端口之间的分配是相当随意的。不要指望它能与真正的硬件相匹配。我认为 IACA 根本没有为 ROB/调度程序建模,真的。此限制和其他限制已在之前的 SO 问题中讨论过。尝试在 IACA 上搜索,因为它是一个相当独特的字符串。

关于c++ - 确定代码块需要多少个时钟周期,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36263258/

相关文章:

java - 使用键复合键进行高效的 HashMap 检索(由 2 个枚举构建)

python - Keras:检索每个模型输出的准确性

database-design - 如何实现twitter的 'friends' timeline'功能

c - 哪种方法可以更快地确定奇偶性?

c++ - 将递归解决方案转换为迭代解决方案

c++ - 如何在 Excel::Range 中包装在 VARIANT 中传递的 Excel Range 对象

c++ - 使用 OpenMP 在 C++11 中并行构造距离矩阵

php - topbar 菜单的性能 - mysql 或 json

c++ - 类中字符串指针的编译错误

php - 我刚刚写了一个糟糕的 PHP 函数,我需要一些帮助(elseif 链 - 开关?)