vectorization - How do I interpret the speedup in an icc compiler optimization report?

Tags: vectorization intel compiler-optimization simd icc

Environment:
icc version 19.0.0.117 (gcc version 5.4.0 compatibility)
Intel Parallel Studio XE Cluster Edition 2019
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Ubuntu 16.04

Compiler flags:
-std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd -qopt-report=5 -qopt-report-phase=all

I vectorize loops with OpenMP simd or Intel pragmas to get a speedup. In the optimization report generated by icc, I typically see results like this:

LOOP BEGIN at get_forces.c(3668,3)
   remark #15389: vectorization support: reference mon->fricforce[n1][d] has unaligned access   [ get_forces.c(3669,4) ]
   remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access   [ get_forces.c(3669,36) ]
   remark #15389: vectorization support: reference vel[n1][d] has unaligned access   [ get_forces.c(3669,51) ]
   remark #15389: vectorization support: reference mon->drag[n1][d] has unaligned access   [ get_forces.c(3671,4) ]
   remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access   [ get_forces.c(3671,40) ]
   remark #15389: vectorization support: reference vel[n1][d] has unaligned access   [ get_forces.c(3671,57) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 0.773
   remark #15300: LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 3 
   remark #15451: unmasked unaligned unit stride stores: 2 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 21 
   remark #15477: vector cost: 11.000 
   remark #15478: estimated potential speedup: 1.050 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=1
LOOP END

My question is: I don't understand how the speedup is calculated from

normalized vectorization overhead 0.773
scalar cost: 21 
vector cost: 11.000 

Another, more extreme and more confusing case is:

LOOP BEGIN at get_forces.c(2690,8)
<Distributed chunk3>
   remark #15388: vectorization support: reference q12[j] has aligned access   [ get_forces.c(2694,19) ]
   remark #15388: vectorization support: reference q12[j] has aligned access   [ get_forces.c(2694,26) ]
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 1.857
   remark #15448: unmasked aligned unit stride loads: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 7 
   remark #15477: vector cost: 3.500 
   remark #15478: estimated potential speedup: 0.770 
   remark #15488: --- end vector cost summary ---
   remark #25436: completely unrolled by 3  
LOOP END

Here, 3.5 + 1.857 = 5.357 < 7.
So can I still force SIMD on this loop and get a speedup, or should I trust the reported speedup of 0.770 and leave the loop unvectorized?


Best answer

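"scalar cost" means the cost of one iteration of the scalar loop.

"vector cost" means the cost of one iteration of the vectorized loop divided by vector_length*unroll_factor, so it is roughly equivalent to the cost of one scalar iteration.

"vectorization overhead" is the cost of vector initialization and finalization before and after the loop, normalized by the cost of one vector iteration.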

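"estimated potential speedup" is calculated for the whole loop execution. It shows the potential gain of the vectorized loop execution, normalized by scalar iteration cost, and it accounts for the peel loop, the remainder loop, and the main loop for an estimated loop trip count. It cannot be derived explicitly from the scalar and vector costs shown above.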

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53203989/
