linux - Enabling AVX512 support at compile time significantly decreases performance

Tags: linux performance gcc x86-64 avx512

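I have a C/C++ project that uses a static library. The library is built for the "skylake" architecture. The project is a data-processing module, i.e. it performs many arithmetic operations, memory copies, searches, comparisons, etc.
The CPU is a Xeon Gold 6130T, which supports AVX512. I tried compiling my project with -march=skylake and with -march=skylake-avx512, then linking it with the library in each case.
With -march=skylake-avx512, project performance is significantly decreased (by 30% on average) compared with the project built with -march=skylake.
How can this be explained? What could be the reason?
Info: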

  • Linux 3.10
  • gcc 9.2
  • Intel Xeon Gold 6130T

Best answer

    project performance is significantly decreased (by 30% on average)


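    In code that cannot be easily vectorized, sporadic AVX instructions here and there downclock the CPU but provide no benefit. In such cases you may want to turn off AVX instructions completely.
    Advanced Vector Extensions, Downclocking: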

    Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:

    • L0 (100%): The normal turbo boost limit.
    • L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
    • L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.

    The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.

    Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL is an example of only using 256-bit operands in AVX-512, making it a sensible default for mixed loads.
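
To make the trade-off in the quoted passage concrete, here is a rough back-of-the-envelope sketch in Python. It is not a measurement of the asker's project. The function names (`license_level`, `net_speedup`) are hypothetical, the frequency factors are the approximate percentages from the quote, and the model assumes the baseline runs at L0 and that the thread stays at the reduced frequency for the whole run, which is a pessimistic simplification.

```python
# Approximate turbo-frequency factors per level, copied from the quoted
# text (L0 = 100%, L1 ~ 85%, L2 ~ 60%). Actual values depend on the CPU model.
FREQ_FACTOR = {"L0": 1.00, "L1": 0.85, "L2": 0.60}


def license_level(width_bits: int, heavy: bool) -> str:
    """Return the level an instruction class can trigger, per the quoted list.

    "heavy" means FP math or integer multiplication; everything else is "light".
    Soft vs. hard triggering is ignored here (simplifying assumption).
    """
    if width_bits >= 512:
        return "L2" if heavy else "L1"
    if width_bits == 256 and heavy:
        return "L1"
    return "L0"


def net_speedup(vector_fraction: float, vector_speedup: float, level: str) -> float:
    """Amdahl-style estimate of overall speedup relative to an L0 baseline.

    vector_fraction: share of baseline run time that benefits from vectorization (0..1)
    vector_speedup:  how much faster that share runs at an equal clock speed
    level:           level the whole thread is assumed to run at (assumption)
    A result below 1.0 means the "optimized" build is slower overall.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    # Cycles needed relative to the baseline, before accounting for the clock.
    relative_cycles = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    # Fewer cycles help, but every cycle now takes longer at the lower clock.
    return FREQ_FACTOR[level] / relative_cycles


if __name__ == "__main__":
    level = license_level(512, heavy=True)
    # Mixed workload: only 10% of the time gets a 2x vector speedup.
    print(f"mixed: {net_speedup(0.10, 2.0, level):.2f}x")
    # "Pure" workload: everything gets the 2x vector speedup.
    print(f"pure:  {net_speedup(1.00, 2.0, level):.2f}x")
```

Under these assumptions the mixed case comes out at roughly 0.63x (a slowdown) while the pure case is about 1.2x, which is the pattern the quote describes: a small amount of wide, heavy code can cost more in clock speed than it gains in throughput.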


    See also
  • On the dangers of Intel's frequency scaling.
  • Gathering Intel on Intel AVX-512 Transitions.
  • How to Fix Intel?

Regarding "linux - Enabling AVX512 support at compile time significantly decreases performance", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63484266/
