c++ - 更快地近似数组的倒数平方根

如何在具有 popcnt 和 SSE4.2 的 cpu 上更快地计算数组的近似倒数平方根？

输入是存储在 float 组中的正整数(范围从 0 到大约 200,000)。

输出是一个 float 数组。

两个数组都针对 sse 进行了正确的内存对齐。

下面的代码只用了1个xmm寄存器，运行在linux上，可以通过gcc -O3 code.cpp -lrt -msse4.2编译

谢谢。

#include <iostream>
#include <emmintrin.h>
#include <time.h>

using namespace std;
void print_xmm(__m128 xmm){
    float out[4];
    _mm_storeu_ps(out,xmm);
    int i;
    for (i = 0; i < 4; ++i) std::cout << out[i] << " ";
    std::cout << std::endl;
}

void print_arr(float* ptr, size_t size){
    size_t i;
    for(i = 0; i < size; ++i){
        cout << ptr[i] << " ";
    }
    cout << endl;
}

int main(void){
    size_t size = 25000 * 4;
        // this has to be multiple of 4
    size_t repeat = 10000;
        // test 10000 cycles of the code
    float* ar_in = (float*)aligned_alloc(16, size*sizeof(float));
    float* ar_out = (float*)aligned_alloc(16, size*sizeof(float));
    //fill test data into the input array
    //the data is an array of positive numbers.
    size_t i;
    for (i = 0; i < size; ++i){
        ar_in[i] = (i+1) * (i+1);
    }
    //prepare for recipical square root.
    __m128 xmm0;
    size_t size_fix = size*sizeof(float)/sizeof(__m128);
    float* ar_in_end  = ar_in + size_fix;
    float* ar_out_now;
    float* ar_in_now;
    //timing
    struct timespec tp_start, tp_end;
    i = repeat;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    //start timing
    while(--i){
        ar_out_now = ar_out;
        for(ar_in_now = ar_in;
            ar_in_now != ar_in_end;
            ar_in_now += 4, ar_out_now+=4){
            //4 = sizeof(__m128)/sizeof(float);
            xmm0 = _mm_load_ps(ar_in_now);
            //cout << "load xmm: ";
            //print_xmm(xmm0);
            xmm0 = _mm_rsqrt_ps(xmm0);
            //cout << "rsqrt xmm: ";
            //print_xmm(xmm0);
            _mm_store_ps(ar_out_now,xmm0);
        }
    }
    //end timing
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;

    timing = ((double)(tp_end.tv_sec  - tp_start.tv_sec )
           + (tp_end.tv_nsec - tp_start.tv_nsec) * nano)/repeat;

    cout << "    timing per cycle: " << timing << endl;
    /* 
    cout << "input array: ";
    print_arr(ar_in, size);
    cout << "output array: ";
    print_arr(ar_out,size);
    */
    //free mem
    free(ar_in);
    free(ar_out);
    return 0;
}

最佳答案

你的 float 组有多大？如果它在 L1(或可能是 L2)中已经很热，则此代码的 gcc5.3 输出会成为现代 Intel CPU 上 uop 吞吐量的瓶颈，因为它使用 6 个融合域 uops 进行循环，每次迭代执行一个 vector 。 (因此它将以每 2 个周期一个 vector 的速度运行)。

要在现代 Intel CPU 上实现每时钟 1 个 vector 的吞吐量，您需要展开循环(请参阅下文了解为什么未展开的 asm 无法工作)。让编译器为您完成它可能很好(而不是在 C++ 源代码中手动完成)。例如使用配置文件引导优化 ( gcc -fprofile-use )，或者只是盲目地使用 -funroll-loops .

理论上，每个时钟 16 个字节足以使一个内核的主内存带宽饱和。然而，IIRC Z Boson 观察到使用多核的带宽更好，这可能是因为多核保持更多未完成的请求，并且一个核上的停顿不会使内存空闲。但是，如果输入在内核的 L2 中很热，则最好使用该内核处理数据。

在 Haswell 或更高版本上，每个时钟加载和存储 16 个字节只是 L1 缓存带宽的一半，因此您需要一个 AVX 版本来最大化每个内核的带宽。

如果内存瓶颈，you might want to do a Newton-Raphson iteration to get a nearly-full accuracy 1/sqrt(x) ，尤其是当您对一个大数组使用多个线程时。(因为如果单个线程无法维持每个时钟的一个加载+存储也没关系。)

或者也许只使用 rsqrt稍后加载此数据时即时。它非常便宜，具有高吞吐量，但延迟仍然类似于 FP 添加。同样，如果它是一个不适合缓存的大数组，通过减少对数据的单独传递来增加计算强度是一件大事。 (Cache Blocking aka Loop Tiling 也是一个好主意，如果可以的话:在缓存大小的数据 block 上运行算法的多个步骤。)

只有在找不到有效利用缓存的方法时，才使用绕过缓存的 NT 存储作为最后的手段。如果您可以转换一些您将要使用的数据，那就更好了，这样下次使用时它就在缓存中。

对于 Intel SnB 系列 CPU，主循环(来自 .L31 to jne .L31 on the Godbolt compiler explorer)是 6 微指令，因为 indexed addressing modes don't micro-fuse . (不幸的是，这还没有记录在 Agner Fog's microarch pdf 中。)

Nehalem 上有 4 个融合域微指令，只有三个 ALU 微指令，因此 Nehalem 应该以每个时钟 1 的速度运行它。

.L31:    # the main loop: 6 uops on SnB-family, 4 uops on Nehalem
    rsqrtps xmm0, XMMWORD PTR [rbx+rax]       # tmp127, MEM[base: ar_in_now_10, index: ivtmp.51_61, offset: 0B]
    movaps  XMMWORD PTR [rbp+0+rax], xmm0       # MEM[base: ar_out_now_12, index: ivtmp.51_61, offset: 0B], tmp127
    add     rax, 16   # ivtmp.51,
    cmp     rax, 100000       # ivtmp.51,
    jne     .L31      #,

由于您想编写一个单独的目的地，因此无法将循环减少到 4 个融合域微指令，因此它可以在每个时钟一个 vector 上运行而无需展开。 (加载和存储都需要是单寄存器寻址模式，因此使用由 src - dst 索引的 current_dst 而不是递增 src 的技巧不起作用)。

修改您的 C++，使 gcc 使用指针递增只会节省一个 uop，因为您必须递增 src 和 dst。即 float *endp = start + length;和 for (p = start ; p < endp ; p+=4) {}会像这样循环

.loop:
    rsqrtps  xmm0, [rsi]
    add      rsi, 16
    movaps   [rdi], xmm0
    add      rdi, 16
    cmp      rdi, rbx
    jne      .loop

希望 gcc 在展开时做这样的事情，否则 rsqrtps + movaps如果它们仍然使用索引寻址模式，它们将是 4 个融合域微指令，并且再多的展开也不会使您的循环以每个时钟一个 vector 的速度运行。

关于c++ - 更快地近似数组的倒数平方根，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/38622534/

c++ - 更快地近似数组的倒数平方根

上一篇：c++ - 如何检查按下按钮的时间？

下一篇：c++ - 为 STL 算法 lambda 适配 std::pair