c++ - Storing the largest float16 value in a float32

Tags: c++ floating-point

How can I store the largest float16 ( https://en.wikipedia.org/wiki/Half-precision_floating-point_format ) number in float32 ( https://en.wikipedia.org/wiki/Single-precision_floating-point_format ) format?

I want a function that converts 0x7bff to 65504. 0x7bff is the largest value that can be represented in half-precision floating point:

0 11110 1111111111 -> decimal value: 65504 

I want to use 0x7bff to represent the actual bits in my program.

float fp16_max = bit_cast(0x7bff);
// want "std::cout << fp16_max" to print 65504

I tried to implement such a function, but it does not seem to work:

float bit_cast (uint32_t fp16_bits) {
    float i;
    memcpy(&i, &fp16_bits, 4);
    return i; 
}    
float test = bit_cast(0x7bff);
// printing test gives 4.44814e-41
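
The printed value is consistent with the 32 bits 0x00007bff being read as a binary32 number rather than as a binary16 one: the binary32 exponent field is then all zeros, so the result is a subnormal equal to 0x7bff * 2^-149. The sketch below checks that reading. The helper names ReinterpretAsBinary32 and SubnormalBinary32Value are hypothetical, and the code assumes float is IEEE-754 binary32 and that subnormals are not flushed to zero.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy 32 bits into a float unchanged, as the attempt
// above does. Assumes float is IEEE-754 binary32.
static float ReinterpretAsBinary32(std::uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Hypothetical helper: value of a positive binary32 subnormal whose exponent
// field is zero and whose fraction field holds `fraction`.
static float SubnormalBinary32Value(std::uint32_t fraction)
{
    return std::ldexp(static_cast<float>(fraction), -149);
}
```

Under those assumptions, ReinterpretAsBinary32(0x7bff) and SubnormalBinary32Value(0x7bff) compare equal, which suggests the copy itself works and that what is missing is a decoding of the binary16 fields.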

Best answer

#include <cmath>
#include <cstdio>


/*  Decode the IEEE-754 binary16 encoding into a floating-point value.
    Details of NaNs are not handled.
*/
static float InterpretAsBinary16(unsigned Bits)
{
    //  Extract the fields from the binary16 encoding.
    unsigned SignCode        = Bits >> 15;
    unsigned ExponentCode    = Bits >> 10 & 0x1f;
    unsigned SignificandCode = Bits       & 0x3ff;

    //  Interpret the sign bit.
    float Sign = SignCode ? -1 : +1;

    //  Partition into cases based on exponent code.

    float Significand, Exponent;

    //  An exponent code of all ones denotes infinity or a NaN.
    if (ExponentCode == 0x1f)
        return Sign * (SignificandCode == 0 ? INFINITY : NAN);

    //  An exponent code of all zeros denotes zero or a subnormal.
    else if (ExponentCode == 0)
    {
        /*  Subnormal significands have a leading zero, and the exponent is the
            same as if the exponent code were 1.
        */
        Significand = 0 + SignificandCode * 0x1p-10;
        Exponent    = 1 - 0xf;
    }

    //  Other exponent codes denote normal numbers.
    else
    {
        /*  Normal significands have a leading one, and the exponent is biased
            by 0xf.
        */
        Significand = 1 + SignificandCode * 0x1p-10;
        Exponent    = ExponentCode - 0xf;
    }

    //  Combine the sign, significand, and exponent, and return the result.
    return Sign * std::ldexp(Significand, Exponent);
}


int main(void)
{
    unsigned Bits = 0x7bff;
    std::printf(
        "Interpreting the bits 0x%x as an IEEE-754 binary16 yields %.99g.\n",
        Bits,
        InterpretAsBinary16(Bits));
}
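
As a possible alternative to the arithmetic decoding above, the binary16 fields can be widened directly into a binary32 bit pattern and then copied into a float. This is only a sketch, not part of the original answer: the name HalfBitsToFloat is hypothetical, and the code assumes float is IEEE-754 binary32 with the same byte order as std::uint32_t.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen IEEE-754 binary16 bits to binary32 bits.
// Assumes float is IEEE-754 binary32.
static float HalfBitsToFloat(std::uint16_t h)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

    std::uint32_t sign     = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exponent = (h >> 10) & 0x1fu;
    std::uint32_t fraction = h & 0x3ffu;
    std::uint32_t bits;

    if (exponent == 0x1f) {
        // Infinity or NaN: all-ones exponent, payload moved to the top bits.
        bits = sign | 0x7f800000u | (fraction << 13);
    } else if (exponent != 0) {
        // Normal number: change the bias from 15 to 127 (add 112).
        bits = sign | ((exponent + 112u) << 23) | (fraction << 13);
    } else if (fraction == 0) {
        // Signed zero.
        bits = sign;
    } else {
        // binary16 subnormal: shift until the leading one reaches bit 10,
        // then drop it, since binary32 can hold the value as a normal number.
        int shift = 0;
        while ((fraction & 0x400u) == 0) {
            fraction <<= 1;
            ++shift;
        }
        fraction &= 0x3ffu;
        bits = sign | (static_cast<std::uint32_t>(113 - shift) << 23) | (fraction << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

With this sketch, HalfBitsToFloat(0x7bff) evaluates to 65504. The conversion is exact for every finite input because binary32 has more exponent range and more significand bits than binary16.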

This question, c++ - Storing the largest float16 value in a float32, corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/56994878/
