c++ - What optimizations does __builtin_unreachable facilitate?

Judging from gcc's documentation:

If control flow reaches the point of the __builtin_unreachable, the program is undefined.

I thought that __builtin_unreachable could be used as a hint to the optimizer in all sorts of creative ways, so I ran a small experiment:

#include <utility>  // std::swap

void stdswap(int& x, int& y)
{
    std::swap(x, y);
}

void brswap(int& x, int& y)
{
    if(&x == &y)
        __builtin_unreachable();
    x ^= y;
    y ^= x;
    x ^= y;
}

void rswap(int& __restrict x, int& __restrict y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

This gets compiled to (g++ -O2):

stdswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        mov     edx, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], edx
        mov     DWORD PTR [rsi], eax
        ret
brswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], eax
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rsi], eax
        xor     DWORD PTR [rdi], eax
        ret
rswap(int&, int&):
        mov     eax, DWORD PTR [rsi]
        mov     edx, DWORD PTR [rdi]
        mov     DWORD PTR [rdi], eax
        mov     DWORD PTR [rsi], edx
        ret

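I assume that stdswap and rswap are optimal from the optimizer's point of view. Why does brswap not get compiled to the same thing? Can I get it to compile to the same thing using __builtin_unreachable?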

Best answer

The purpose of __builtin_unreachable is to help the compiler:

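Remove dead code (code that the programmer knows will never be executed).
Linearize the code by letting the compiler know that a path is "cold" (a similar effect is achieved by calling a noreturn function).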
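
As a rough sketch of the dead-code case (not taken from the question; the Channel enum and the shift_for function are hypothetical names used only for illustration), marking the end of a fully covered switch as unreachable tells the compiler that no fallback path is needed there. Whether a particular GCC version also drops the range check on the switch value is something to confirm in the generated assembly.

```cpp
#include <cassert>

// Hypothetical enum and helper, used only for illustration.
enum class Channel { Red, Green, Blue };

int shift_for(Channel c)
{
    switch (c) {
    case Channel::Red:   return 16;
    case Channel::Green: return 8;
    case Channel::Blue:  return 0;
    }
    // The caller promises that c is one of the three enumerators.
    // Debug builds trap here; with NDEBUG defined the assert compiles
    // away and only the optimizer hint remains.
    assert(false && "invalid Channel");
    __builtin_unreachable();
}
```

Passing any other value to shift_for in a release build is undefined behavior, which is exactly the contract that __builtin_unreachable expresses.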

Consider the following:

#include <cstdio>  // std::puts

void exit_if_true(bool x);  // defined elsewhere

int foo1(bool x)
{
    if (x) {
        exit_if_true(true);
        //__builtin_unreachable(); // we do not enable it here
    } else {
        std::puts("reachable");
    }

    return 0;
}
int foo2(bool x)
{
    if (x) {
        exit_if_true(true);
        __builtin_unreachable();  // now compiler knows exit_if_true
                                  // will not return as we are passing true to it
    } else {
        std::puts("reachable");
    }

    return 0;
}

Generated code:

foo1(bool):
        sub     rsp, 8
        test    dil, dil
        je      .L2              ; that jump is going to change
        mov     edi, 1
        call    exit_if_true(bool)
        xor     eax, eax         ; that tail is going to be removed
        add     rsp, 8
        ret
.L2:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
foo2(bool):
        sub     rsp, 8
        test    dil, dil
        jne     .L9              ; changed jump
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
.L9:
        mov     edi, 1
        call    exit_if_true(bool)

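Note the differences:

The tail after the call (xor eax, eax, add rsp, 8 and ret) was removed, because the compiler now knows that it is dead code.
The compiler swapped the order of the branches: the branch with the puts call now comes first, so the conditional jump can be faster (a forward branch that is not taken is faster both when it is predicted and when there is no prediction information).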

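The assumption here is that a branch ending in a noreturn function call or in __builtin_unreachable will either be executed only once or will lead to a longjmp call or an exception throw. Both are rare and do not need to be prioritized during optimization.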
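
For comparison, here is a minimal sketch of the noreturn route, using the standard [[noreturn]] attribute. The names fail and checked_div are hypothetical and do not come from the question; the point is only that a branch ending in such a call gives the compiler the same "control does not continue here" information without any builtin.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical error helper. [[noreturn]] promises that control never
// comes back to the caller: this function always throws.
[[noreturn]] void fail(const char* msg)
{
    throw std::runtime_error(msg);
}

// Hypothetical caller: the b == 0 branch ends in a noreturn call,
// so the compiler is free to treat it as the cold path.
int checked_div(int a, int b)
{
    if (b == 0)
        fail("division by zero");
    return a / b;
}

// Usage check: the normal path returns a value, the cold path throws.
void check_checked_div()
{
    assert(checked_div(6, 3) == 2);
    bool threw = false;
    try {
        checked_div(1, 0);
    } catch (const std::runtime_error&) {
        threw = true;
    }
    assert(threw);
}
```

Unlike __builtin_unreachable, this version keeps defined behavior on the cold path, at the cost of keeping the call in the generated code.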

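You are trying to use it for a different purpose: giving the compiler information about aliasing (and you could try to do the same for alignment). Unfortunately, GCC does not understand this kind of address check.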

As you have noticed, adding __restrict__ helps. So __restrict__ works for aliasing, and __builtin_unreachable does not.
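
To see why the aliasing information matters at all, here is a small sketch (xor_swap and check_xor_swap are hypothetical names) showing that the XOR sequence and a plain load/store swap are not equivalent when both references name the same object. Without proof that x and y are distinct, the compiler has to preserve this difference and cannot rewrite one form into the other.

```cpp
#include <cassert>

// Same body as brswap and rswap from the question, without any aliasing hint.
void xor_swap(int& x, int& y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

void check_xor_swap()
{
    int a = 3, b = 5;
    xor_swap(a, b);
    assert(a == 5 && b == 3);  // distinct objects: the values are exchanged

    int c = 7;
    xor_swap(c, c);
    assert(c == 0);            // same object: the first x ^= y clears it
}
```

A std::swap of an object with itself leaves it unchanged, so the two implementations only agree when the arguments do not alias.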

Look at the following example, which uses __builtin_assume_aligned:

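#include <cstdint>  // uintptr_t

void copy1(int *__restrict__ dst, const int *__restrict__ src)
{
    // Declare the misaligned case unreachable, i.e. promise 16-byte alignment.
    if (reinterpret_cast<uintptr_t>(dst) % 16 != 0) __builtin_unreachable();
    if (reinterpret_cast<uintptr_t>(src) % 16 != 0) __builtin_unreachable();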
        
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

void copy2(int *__restrict__ dst, const int *__restrict__ src)
{
    dst = static_cast<int *>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int *>(__builtin_assume_aligned(src, 16));

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

Generated code:

copy1(int*, int const*):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movups  XMMWORD PTR [rdi], xmm0
        ret
copy2(int*, int const*):
        movdqa  xmm0, XMMWORD PTR [rsi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret

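You might expect the compiler to understand that dst % 16 == 0 means the pointer is 16-byte aligned, but it does not. So unaligned loads and stores are used, whereas the second version generates faster instructions that require the address to be aligned.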
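
The alignment promise is a contract with the caller: movdqa and movaps fault on a misaligned address, so passing an unaligned pointer to such a function is undefined behavior. A minimal caller-side sketch, assuming buffers declared with alignas(16) (copy4_aligned and check_copy4_aligned are hypothetical names; the copy itself mirrors copy2):

```cpp
#include <cassert>

// Same idea as copy2 above, written as a loop over four ints.
void copy4_aligned(int* __restrict__ dst, const int* __restrict__ src)
{
    dst = static_cast<int*>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int*>(__builtin_assume_aligned(src, 16));
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
}

void check_copy4_aligned()
{
    // alignas(16) makes the promise inside copy4_aligned true.
    alignas(16) int src[4] = {1, 2, 3, 4};
    alignas(16) int dst[4] = {0, 0, 0, 0};
    copy4_aligned(dst, src);
    for (int i = 0; i < 4; ++i)
        assert(dst[i] == src[i]);
}
```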

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/54764535/
