c++ - 是否有一些有意义的统计数据来证明保持有符号整数算术溢出未定义?

标签 c++ c language-lawyer signed integer-overflow

C 标准明确指定有符号整数溢出具有未定义行为。然而,大多数 CPU 使用已定义的溢出语义实现有符号算术(除法溢出可能除外:x/0INT_MIN/-1)。

编译器编写者一直在利用这种溢出的未定义性来添加更积极的优化,这些优化往往会以非常微妙的方式破坏遗留代码。例如,此代码可能在较旧的编译器上工作,但在当前版本的 gccclang 上不再适用:

/* Increment a by a value in 0..255, clamp a to positive integers.
   The code relies on 32-bit wrap-around, but the C Standard makes
   signed integer overflow undefined behavior, so sum_max can now 
   return values less than a. There are Standard compliant ways to
   implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
    int res = a + b;
    return (res >= a) ? res : INT_MAX;
}

是否有确凿的证据表明这些优化是值得的?是否有比较研究记录了实际示例甚至经典基准的实际改进?

我在看这个的时候想出了这个问题:C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”

我正在标记 cc++,因为这两种语言的问题相似,但答案可能不同。

最佳答案

我不了解研究和统计数据,但是是的,考虑到编译器实际所做的这一点,肯定有优化。是的,它们非常重要(例如 tldr 循环矢量化)。

除了编译器优化之外,还有另一个方面需要考虑。使用 UB,您可以获得 C/C++ 有符号整数的算术行为,就像您在数学上所期望的那样。例如 x + 10 > x现在成立(当然对于有效的代码),但不会在环绕行为。

我发现了一篇很棒的文章 How undefined signed overflow enables optimizations in GCC来自 Krister Walfridsson 的博客列出了一些将签名溢出 UB 考虑在内的优化。下面的例子来自它。我正在向它们添加 c++ 和汇编示例。

如果优化看起来过于简单、无趣或没有影响,请记住,这些优化只是更大的优化链中的步骤。蝴蝶效应确实会发生,因为在较早的步骤中看似不重要的优化可能会在以后的步骤中触发更具影响力的优化。

如果这些示例看起来很荒谬(谁会写 x * 10 > 0),请记住,您可以通过常量、宏、模板非常轻松地在 C 和 C++ 中找到此类示例。此外,编译器在其 IR 中应用转换和优化时可以得到此类示例。

有符号整数表达式简化

  • 与0比较消除乘法

    (x * c) cmp 0   ->   x cmp 0 
    
    bool foo(int x) { return x * 10 > 0 }
    
    foo(int):
            test    edi, edi
            setg    al
            ret
    
  • 乘法后除法

    (x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2

    int foo(int x) { return (x * 20) / 10; }
    
    foo(int):
            lea     eax, [rdi+rdi]
            ret
    
  • 消除否定

    (-x) / (-y) -> x / y

    int foo(int x, int y) { return (-x) / (-y); }
    
    foo(int, int):
            mov     eax, edi
            cdq
            idiv    esi
            ret
    
  • 简化总是对或错的比较

    x + c < x       ->   false
    x + c <= x      ->   false
    x + c > x       ->   true
    x + c >= x      ->   true
    
    bool foo(int x) { return x + 10 >= x; }
    
    foo(int):
            mov     eax, 1
            ret
    
  • 消除比较中的否定

    (-x) cmp (-y)   ->   y cmp x
    
    bool foo(int x, int y) { return -x < -y; }
    
    foo(int, int):
            cmp     edi, esi
            setg    al
            ret
    
  • 减少常量的大小

    x + c > y       ->   x + (c - 1) >= y
    x + c <= y      ->   x + (c - 1) < y
    
    bool foo(int x, int y) { return x + 10 <= y; }
    
    foo(int, int):
            add     edi, 9
            cmp     edi, esi
            setl    al
            ret
    
  • 消除比较中的常量

    (x + c1) cmp c2         ->   x cmp (c2 - c1)
    (x + c1) cmp (y + c2)   ->   x cmp (y + (c2 - c1)) if c1 <= c2
    

    The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.

    bool foo(int x) { return x + 42 <= 11; }
    
    foo(int):
            cmp     edi, -30
            setl    al
            ret
    

指针算术和类型提升

If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.

One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.

这是一个非常重要的优化,因为循环向量化是最有效的优化算法之一。

这是一个将索引从无符号索引更改为有符号索引改进生成程序集的示例:

未签名版本

#include <cstddef>

auto foo(int* v, std::size_t start)
{
    int sum = 0;

    for (std::size_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

在未签名的情况下start + 4必须考虑回绕,并生成一个分支来处理这种情况(分支不利于性能):

; gcc on x64 with -march=skylake

foo1(int*, unsigned long):
        cmp     rsi, -5
        ja      .L3
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
.L3:
        xor     eax, eax
        ret
; clang on x64 with -march=skylake

foo1(int*, unsigned long):                             # @foo1(int*, unsigned long)
        xor     eax, eax
        cmp     rsi, -4
        jae     .LBB0_2
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
.LBB0_2:
        ret

附带说明,使用更窄的类型会导致更糟糕的汇编,从而禁止使用 SSE 矢量化指令:

#include <cstddef>

auto foo(int* v, unsigned start)
{
    int sum = 0;

    for (unsigned i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, unsigned int):
        cmp     esi, -5
        ja      .L3
        mov     eax, esi
        mov     eax, DWORD PTR [rdi+rax*4]
        lea     edx, [rsi+1]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+2]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+3]
        add     eax, DWORD PTR [rdi+rdx*4]
        ret
.L3:
        xor     eax, eax
        ret
; clang on x64 with -march=skylake

foo(int*, unsigned int):                              # @foo(int*, unsigned int)
        xor     eax, eax
        cmp     esi, -5
        ja      .LBB0_3
        mov     ecx, esi
        add     esi, 4
        mov     eax, dword ptr [rdi + 4*rcx]
        lea     rdx, [rcx + 1]
        cmp     rdx, rsi
        jae     .LBB0_3
        add     eax, dword ptr [rdi + 4*rcx + 4]
        add     eax, dword ptr [rdi + 4*rcx + 8]
        add     eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
        ret

签名版

然而,使用有符号索引会产生很好的矢量化无分支代码:

#include <cstddef>

auto foo(int* v, std::ptrdiff_t start)
{
    int sum = 0;

    for (std::ptrdiff_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, long):
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
; clang on x64 with -march=skylake

foo(int*, long):                              # @foo(int*, long)
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

在使用较窄的有符号类型时仍然使用向量化指令:

#include <cstddef>

auto foo(int* v, int start)
{
    int sum = 0;

    for (int i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, int):
        movsx   rsi, esi
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
; clang on x64 with -march=skylake

foo(int*, int):                              # @foo(int*, int)
        movsxd  rax, esi
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rax + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

数值范围计算

The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as

int x = foo();
if (x > 0) {
  int y = x + 5;
  int z = y / 4;

it determines that x has the range [1, INT_MAX] after the if-statement, and can thus determine that y has the range [6, INT_MAX] as overflow is not allowed. And the next line can be optimized to int z = y >> 2; as the compiler knows that y is non-negative.

auto foo(int x)
{
    if (x <= 0)
        __builtin_unreachable();
    
    return (x + 5) / 4;
}
foo(int):
        lea     eax, [rdi+5]
        sar     eax, 2
        ret

The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX] that prevents all useful comparisons with < or >), such as

  • Changing comparisons x<y to true or false if the ranges for x and y does not overlap
  • Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap
  • Changing abs(x) to x or -x if the range does not cross 0
  • Changing x/c to x>>log2(c) if x>0 and the constant c is a power of 2
  • Changing x%c to x&(c-1) if x>0 and the constant c is a power of 2

循环分析与优化

The canonical example of why undefined signed overflow helps loop optimizations is that loops like

for (int i = 0; i <= m; i++)

are guaranteed to terminate for undefined overflow. This helps architectures that have specific loop instructions, as they do in general not handle infinite loops.

But undefined signed overflow helps many more loop optimizations. All analysis such as determining number of iteration, transforming induction variables, and keeping track of memory accesses are using everything in the previous sections in order to do its work. In particular, the set of loops that can be vectorized are severely reduced when signed overflow is allowed.

关于c++ - 是否有一些有意义的统计数据来证明保持有符号整数算术溢出未定义?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56047702/

相关文章:

c - 解析十六进制文件并将其拆分为 6 位整数

c - 想要将我的数组内容归零。仅数组中的第一个数字更改为零

c++ - std::chrono::system_clock::now() 可以抛出异常吗?

c - 在 C 接口(interface)中从 Prolog 获取列表元素

C++17 标准 - 抛弃 const of static

c - C是否检查指针是否越界而不取消引用指针?

c++ - 专门的模板是否需要专门的声明?

c++ - 如何在一个窗口中有多个布局?

c++ - malloc 的不常见使用

c++ - Qt4 QTableWidget 使 Column resize to contents, interactive and aligned to table border