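c++ - Range reduction poor precision for single precision floating point

I am trying to implement range reduction as the first step of implementing the sine function.

I am following the method described in the paper "ARGUMENT REDUCTION FOR HUGE ARGUMENTS" by K.C. NG.

With inputs x ranging from 0 to 20000, I get errors as large as 0.002339146. The error clearly should not be that large, and I am not sure how to reduce it. I noticed that the magnitude of the error correlates with the magnitude of the theta passed to cosine/sine.

I was able to obtain the Nearpi.c code that the paper mentions, but I am not sure how to use it for single-precision floating point. If anyone is interested, the Nearpi.c file can be found at this link: nearpi.c

Here is my MATLAB code: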

x = 0:0.1:20000;

% Perform range reduction
% Store constant 2/pi
twooverpi = single(2/pi);

% Compute y
y = (x.*twooverpi);

% Compute k (round to nearest integer)
k = round(y);

% Solve for f
f = single(y-k);

% Solve for r
r = single(f*single(pi/2));

% Find last two bits of k
n = bitand(fi(k,1,32,0),fi(3,1,32,0));
n = single(n);

% Preallocate for speed
z(length(x)) = 0;
for i = 1:length(x)

    switch(n(i))
        case 0
            z(i)=sin(r(i));
        case 1
            z(i) = single(cos(r(i)));
        case 2
            z(i) = -sin(r(i));
        case 3
            z(i) = single(-cos(r(i)));
        otherwise
    end

end

maxerror = max(abs(single(z - single(sin(single(x))))))
minerror = min(abs(single(z - single(sin(single(x))))))

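I have edited the program Nearpi.c so that it compiles. However, I am not sure how to interpret the output. The program also expects input, which I have to type in manually, and I am not sure what that input means.

Here is the working Nearpi.c: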

/*
 ============================================================================
 Name        : nearpi.c
 Author      : 
 Version     :
 Copyright   : Your copyright notice
 Description : Hello World in C, Ansi-style
 ============================================================================
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>


/*
 * Global macro definitions.
 */

# define hex( double )  *(1 + ((long *) &double)), *((long *) &double)
# define sgn(a)         (a >= 0 ? 1 : -1)
# define MAX_k          2500
# define D              56
# define MAX_EXP        127
# define THRESHOLD      2.22e-16

/*
 *  Global Variables
 */

int     CFlength,               /* length of CF including terminator */
        binade;
double  e,
        f;                      /* [e,f] range of D-bit unsigned int of f;
                                   form 1X...X */

// Function Prototypes
int dbleCF (double i[], double j[]);
void input (double i[]);
void nearPiOver2 (double i[]);


/*
 *  This is the start of the main program.
 */

int main (void)
{
    int     k;                  /* subscript variable */
    double  i[MAX_k],
            j[MAX_k];           /* i and j are continued fractions
                                   (coeffs) */


   // fp = fopen("/src/cfpi.txt", "r");


/*
 *  Compute global variables e and f, where
 *
 *      e = 2 ^ (D-1), i.e. the D bit number 10...0
 *  and
 *      f = 2 ^ D - 1, i.e. the D bit number 11...1  .
 */

    e = 1;
    for (k = 2; k <= D; k = k + 1)
        e = 2 * e;
    f = 2 * e - 1;

 /*
  *  Compute the continued fraction for  (2/e)/(pi/2)  , i.e.
  *  q's starting value for the first binade, given the continued
  *  fraction for  pi  as input; set the global variable CFlength
  *  to the length of the resulting continued fraction (including
  *  its negative valued terminator).  One should use as many
  *  partial coefficients of  pi  as necessary to resolve numbers
  *  of the width of the underflow plus the overflow threshold.
  *  A rule of thumb is 0.97 partial coefficients are generated
  *  for every decimal digit of  pi .
  *
  *  Note: for radix B machines, subroutine  input  should compute
  *  the continued fraction for  (B/e)/(pi/2)  where  e = B ^ (D - 1).
  */

    input (i);

/*
 *  Begin main loop over all binades:
 *  For each binade, find the nearest multiples of pi/2 in that binade.
 *
 *  [ Note: for hexadecimal machines ( B = 16 ), the rest of the main
 *  program simplifies(!) to
 *
 *                      B_ade = 1;
 *                      while (B_ade < MAX_EXP)
 *                      {
 *                          dbleCF (i, j);
 *                          dbleCF (j, i);
 *                          dbleCF (i, j);
 *                          CFlength = dbleCF (j, i);
 *                          B_ade = B_ade + 1;
 *                      }
 *                  }
 *
 *  because the alternation of source & destination are no longer necessary. ]
 */

    binade = 1;
    while (binade < MAX_EXP)
    {

/*
 *  For the current (odd) binade, find the nearest multiples of pi/2.
 */

        nearPiOver2 (i);

/*
 *  Double the continued fraction to get to the next (even) binade.
 *  To save copying arrays, i and j will alternate as the source
 *  and destination for the continued fractions.
 */

        CFlength = dbleCF (i, j);
        binade = binade + 1;

/*
 *  Check for main loop termination again because of the
 *  alternation.
 */

        if (binade >= MAX_EXP)
            break;

/*
 *  For the current (even) binade, find the nearest multiples of pi/2.
 */

        nearPiOver2 (j);

/*
 *  Double the continued fraction to get to the next (odd) binade.
 */

        CFlength = dbleCF (j, i);
        binade = binade + 1;
    }

    return 0;
}                               /* end of Main Program */

/*
 *  Subroutine  DbleCF  doubles a continued fraction whose partial
 *  coefficients are i[] into a continued fraction j[], where both
 *  arrays are of a type sufficient to do D-bit integer arithmetic.
 *
 *  In my case ( D = 56 ) , I am forced to treat integers as double
 *  precision reals because my machine does not have integers of
 *  sufficient width to handle D-bit integer arithmetic.
 *
 *  Adapted from a Basic program written by W. Kahan.
 *
 *  Algorithm based on Hurwitz's method of doubling continued
 *  fractions (see Knuth Vol. 3, p.360).
 *
 *  A negative value terminates the last partial quotient.
 *
 *  Note:  for the non-C programmers, the statement  break
 *  exits a loop and the statement  continue  skips to the next
 *  case in the same loop.
 *
 *  The call  modf ( l / 2, &l0 )  assigns the integer portion of
 *  half of L to L0.
 */

int dbleCF (double i[], double j[])
{
      double k,
                    l,
                    l0,
                    j0;
      int    n,
                    m;
    n = 1;
    m = 0;
    j0 = i[0] + i[0];
    l = i[n];
    while (1)
    {
        if (l < 0)
        {
            j[m] = j0;
            break;
        };
        modf (l / 2, &l0);
        l = l - l0 - l0;
        k = i[n + 1];
        if (l0 > 0)
        {
            j[m] = j0;
            j[m + 1] = l0;
            j0 = 0;
            m = m + 2;
        };
        if (l == 0) {
/*
 *  Even case.
 */
            if (k < 0)
            {
                m = m - 1;
                break;
            }
            else
            {
                j0 = j0 + k + k;
                n = n + 2;
                l = i[n];
                continue;
            };
        }
/*
 *  Odd case.
 */
        if (k < 0)
        {
            j[m] = j0 + 2;
            break;
        };
        if (k == 0)
        {
            n = n + 2;
            l = l + i[n];
            continue;
        };
        j[m] = j0 + 1;
        m = m + 1;
        j0 = 1;
        l = k - 1;
        n = n + 1;
        continue;
    };
    m = m + 1;
    j[m] = -99999;
    return (m);
}

/*
 *  Subroutine  input  computes the continued fraction for
 *  (2/e) / (pi/2) , where  e = 2 ^ (D-1) , given  pi 's
 *  continued fraction as input.  That is, double the continued
 *  fraction of  pi   D-3  times and place a zero at the front.
 *
 *  One should use as many partial coefficients of  pi  as
 *  necessary to resolve numbers of the width of the underflow
 *  plus the overflow threshold.  A rule of thumb is  0.97
 *  partial coefficients are generated for every decimal digit
 *  of  pi .  The last coefficient of  pi  is terminated by a
 *  negative number.
 *
 *  I'll be happy to supply anyone with the partial coefficients
 *  of  pi .  My ARPA address is  mcdonald@ucbdali.BERKELEY.ARPA .
 *
 *  I computed the partial coefficients of  pi  using a method of
 *  Bill Gosper's.  I need only compute with integers, albeit
 *  large ones.  After writing the program in  bc  and  Vaxima  ,
 *  Prof. Fateman suggested  FranzLisp .  To my surprise, FranzLisp
 *  ran the fastest!  the reason?   FranzLisp's  Bignum  package is
 *  hand coded in assembler.  Also,  FranzLisp  can be compiled.
 *
 *
 *  Note: for radix B machines, subroutine  input  should compute
 *  the continued fraction for  (B/e)/(pi/2)  where  e = B ^ (D - 1).
 *  In the case of hexadecimal ( B = 16 ), this is done by repeated
 *  doubling the appropriate number of times.
 */

void input (double i[])
{
    int     k;
    double  j[MAX_k];

/*
 *  Read in the partial coefficients of  pi  from a precalculated file
 *  until a negative value is encountered.
 */

    k = -1;
    do
    {
        k = k + 1;
        scanf ("%lE", &i[k]);
        printf("hello\n");
        printf("%d", k);
    } while (i[k] >= 0);

/*
 *  Double the continued fraction for  pi  D-3  times using
 *  i  and  j  alternately as source and destination.  On my
 *  machine  D = 56  so  D-3  is odd; hence the following code:
 *
 *  Double twice  (D-3)/2  times,
 */
    for (k = 1; k <= (D - 3) / 2; k = k + 1)
    {
        dbleCF (i, j);
        dbleCF (j, i);
    };
/*
 *  then double once more.
 */
    dbleCF (i, j);

/*
 *  Now append a zero on the front (reciprocate the continued
 *  fraction) and then return the coefficients in  i .
 */

    i[0] = 0;
    k = -1;
    do
    {
        k = k + 1;
        i[k + 1] = j[k];
    } while (j[k] >= 0);

/*
 *  Return the length of the continued fraction, including its
 *  terminator and initial zero, in the global variable CFlength.
 */

    CFlength = k;
}

/*
 *  Given a continued fraction's coefficients in an array  i ,
 *  subroutine  nearPiOver2  finds all machine representable
 *  values near an integer multiple of  pi/2  in the current binade.
 */

void nearPiOver2 (double i[])
{
    int     k,                  /* subscript for recurrences    (see
                                   handout) */
            K;                  /* like  k , but used during cancel. elim.
                                   */
    double  p[MAX_k],           /* product of the q's           (see
                                   handout) */
            q[MAX_k],           /* successive tail evals of CF  (see
                                   handout) */
            j[MAX_k],           /* like convergent numerators   (see
                                   handout) */
            tmp,                /* temporary used during cancellation
                                   elim. */
            mk0,                /* m[k - 1]                     (see
                                   handout) */
            mk,                 /* m[k] is one of the few ints  (see
                                   handout) */
            mkAbs,              /* absolute value of m sub k
                                */
            mK0,                /* like  mk0 , but used during cancel.
                                   elim. */
            mK,                 /* like  mk  , but used during cancel.
                                   elim. */
            z,                  /* the object of our quest (the argument)
                                */
            m0,                 /* the mantissa of z as a D-bit integer
                                */
            x,                  /* the reduced argument         (see
                                   handout) */
            ldexp (),           /* sys routine to multiply by a power of
                                   two  */
            fabs (),            /* sys routine to compute FP absolute
                                   value   */
            floor (),           /* sys routine to compute greatest int <=
                                   value   */
            ceil ();            /* sys routine to compute least int >=
                                   value   */

 /*
  *  Compute the q's by evaluating the continued fraction from
  *  bottom up.
  *
  *  Start evaluation with a big number in the terminator position.
  */

    q[CFlength] = 1.0e+30;

    for (k = CFlength - 1; k >= 0; k = k - 1)
        q[k] = i[k] + 1 / q[k + 1];

/*
 *  Let  THRESHOLD  be the biggest  | x |  that we are interested in
 *  seeing.
 *
 *  Compute the p's and j's by the recurrences from the top down.
 *
 *  Stop when
 *
 *        1                          1
 *      -----   >=  THRESHOLD  >   ------    .
 *      2 |j |                     2 |j  |
 *          k                          k+1
 */

    p[0] = 1;
    j[0] = 0;
    j[1] = 1;
    k = 0;
    do
    {
        p[k + 1] = -q[k + 1] * p[k];
        if (k > 0)
            j[1 + k] = j[k - 1] - i[k] * j[k];
        k = k + 1;
    } while (1 / (2 * fabs (j[k])) >= THRESHOLD);

/*
 *  Then  mk  runs through the integers between
 *
 *                  k        +                   k        +
 *              (-1)  e / p  -  1/2     &    (-1)  f / p  -  1/2  .
 *                         k                            k
 */

    for (mkAbs = floor (e / fabs (p[k]));
            mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1)
    {

        mk = mkAbs * sgn (p[k]);

/*
 *  For each  mk ,  mk0  runs through integers between
 *
 *                    +
 *              m  q  -  p  THRESHOLD  .
 *               k  k     k
 */

        for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD);
                mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD);
                mk0 = mk0 + 1)
        {

/*
 *  For each pair  { mk ,  mk0 } , check that
 *
 *                             k
 *              m       =  (-1)  ( j   m  - j  m   )
 *               0                  k-1 k    k  k-1
 */
            m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0);

/*
 *  lies between  e  and  f .
 */
            if (e <= fabs (m0) && fabs (m0) <= f)
            {

/*
 *  If so, then we have found an
 *
 *                              k
 *              x       =  ((-1)  m  / p  - m ) / j
 *                                 0    k    k     k
 *
 *                      =  ( m  q  - m   ) / p  .
 *                            k  k    k-1     k
 *
 *  But this later formula can suffer cancellation.  Therefore,
 *  run the recurrence for the  mk 's  to get  mK  with minimal
 *   | mK | + | mK0 |  in the hope  mK  is  0  .
 */
                K = k;
                mK = mk;
                mK0 = mk0;
                while (fabs (mK) > 0)
                {
                    p[K + 1] = -q[K + 1] * p[K];
                    tmp = mK0 - i[K] * mK;
                    if (fabs (tmp) > fabs (mK0))
                        break;
                    mK0 = mK;
                    mK = tmp;
                    K = K + 1;
                };

/*
 *  Then
 *              x       =  ( m  q  - m   ) / p
 *                            K  K    K-1     K
 *
 *  as accurately as one could hope.
 */
                x = (mK * q[K] - mK0) / p[K];

/*
 *  To return  z  and  m0  as positive numbers,
 *   x  must take the sign of  m0  .
 */
                x = x * sgn (m0);
                m0 = fabs (m0);

/*
 *  Set  z = m0 * 2 ^ (binade+1-D) .
 */
                z = ldexp (m0, binade + 1 - D);

/*
 *  Print  z (hex),  z (dec),  m0 (dec),  binade+1-D,  x (hex), x (dec).
 */

                printf ("%08lx %08lx    Z=%22.16E    M=%17.17G    L+1-%d=%3d    %08lx %08lx    x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x);

            }
        }
    }
}

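Best answer

Theory

First, let us note what changes when single-precision arithmetic is used.

[Equation 8] The minimal value of f can be larger. Since the double-precision numbers are a superset of the single-precision numbers, the single closest to a multiple of pi/2 can only be farther away than ~2.98e-19, so the fixed-point representation of f has at most 61 leading zeros (and probably fewer). Denote this quantity fdigits.

[Equation before 9] Consequently, instead of 121 bits, y must be accurate to fdigits + 24 (significant bits in single precision) + 7 (extra guard bits) = fdigits + 31 bits, which is at most 92.

[Equation 9] Therefore, together with the width of x's exponent, 2/pi must contain 127 (the maximal exponent of single) + 31 + fdigits bits, that is, 158 + fdigits, or at most 219 bits.

[Subsection 2.5] The size of A is determined by the number of zeros in x before the binary point (and is unaffected by the move to single), while the size of C is determined by the equation before 9.

For large x (x >= 2^24), x looks like this: [24 bits, M zeros]. Multiplying it by A, which consists of the first M bits of 2/pi, yields an integer (the zeros of x just shift everything into the integer part).

Choosing C to start at bit M+d of 2/pi makes the product x*C smaller than 2^-(d-24). In double precision, d is chosen to be 174 (and 53 takes the place of 24), so the product only affects bits beyond position 121. In single, it is enough to choose d such that d-24 = fdigits+31, which is at most 92. That is, d can be chosen as fdigits+55, or at most 116.

As a result, B is at most 116 bits long.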
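
The bit counts above can be collected in a few lines of C. This is only a bookkeeping sketch: the struct and function names are made up for illustration, and fdigits = 61 is the worst-case bound from the text, not a value computed with nearpi.c.

```c
#include <stdio.h>

/* Hypothetical helper (not from the paper): bit budgets for single
 * precision as a function of fdigits, following the counts above. */
struct bit_budget {
    int y_bits;       /* accuracy needed in y: fdigits + 24 + 7     */
    int two_over_pi;  /* bits of 2/pi needed:  127 + 31 + fdigits   */
    int d;            /* C starts at bit M+d:  fdigits + 55         */
};

struct bit_budget single_budget(int fdigits)
{
    struct bit_budget b;
    b.y_bits      = fdigits + 24 + 7;
    b.two_over_pi = 127 + b.y_bits;
    b.d           = b.y_bits + 24;
    return b;
}

/* Print the three budgets for a given fdigits. */
void print_budget(int fdigits)
{
    struct bit_budget b = single_budget(fdigits);
    printf("fdigits=%d: y needs %d bits, 2/pi needs %d bits, d=%d\n",
           fdigits, b.y_bits, b.two_over_pi, b.d);
}
```

Calling print_budget(61) gives the worst-case figures; a smaller fdigits, once it is actually known, shrinks all three numbers by the same amount.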

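We are therefore left with three problems:

Computing fdigits. This involves reading reference 6 of the linked paper and understanding it. Might not be that easy. :) As far as I can see, this is the only place where nearpi.c is used.

Computing B, the relevant bits of 2/pi. Since M is at most 127, we can compute the first 127+116 bits of 2/pi offline and store them in an array. See Wikipedia.

Computing y=x*B. This involves multiplying x by a 116-bit number. This is where Section 3 is used. The block size there is chosen to be 24 because 2*24 + 2 (multiplying two 24-bit numbers and adding 3 such products) is smaller than the precision of double, 53 (and because 24 divides 96). We can use blocks of 11 bits for single arithmetic for similar reasons.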
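
The second item, storing the bits of 2/pi offline and picking out the ones that matter for a given M, might look like the sketch below. The names TWO_OVER_PI and two_over_pi_bits are made up here, the 32-bit word layout is an arbitrary choice, and the hex words are assumed to be the leading 256 fraction bits of 2/pi (enough for 127+116 bits); they should be verified against a trusted expansion before use.

```c
#include <stdint.h>

#define N_WORDS 8

/* Leading 256 fraction bits of 2/pi, most significant word first
 * (assumed values; verify before relying on them). */
static const uint32_t TWO_OVER_PI[N_WORDS] = {
    0xA2F9836Eu, 0x4E441529u, 0xFC2757D1u, 0xF534DDC0u,
    0xDB629599u, 0x3C439041u, 0xFE5163ABu, 0xDEBBC561u
};

/* Return the 32 bits of 2/pi that start `offset` bits after the binary
 * point (offset 0 is the first fraction bit). Bits past the end of the
 * table read as zero. */
uint32_t two_over_pi_bits(unsigned offset)
{
    unsigned word  = offset / 32;
    unsigned shift = offset % 32;
    uint32_t hi = (word     < N_WORDS) ? TWO_OVER_PI[word]     : 0;
    uint32_t lo = (word + 1 < N_WORDS) ? TWO_OVER_PI[word + 1] : 0;
    if (shift == 0)
        return hi;          /* avoid an undefined shift by 32 below */
    return (hi << shift) | (lo >> (32 - shift));
}
```

For an input with M trailing zeros, the words of B would then be two_over_pi_bits(M), two_over_pi_bits(M + 32), and so on, until 116 bits are covered.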

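Note - the trick with B only applies to numbers whose exponents are positive (x >= 2^24).

To summarize - first, you have to solve the problem in double precision. Your Matlab code does not work in double precision either (try removing single and computing sin(2^53)), because your twooverpi has only 53 significant bits, not 175 (and in any case, you cannot directly multiply numbers of such precision in Matlab). Second, the scheme should be adapted to work with single; again, the key problems are representing 2/pi precisely enough and supporting multiplication of high-precision numbers. Last, when everything works, you can try to find a better fdigits to reduce the number of bits you have to store and multiply.

Hopefully I'm not completely off - comments and contradictions are welcome.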

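Example

As an example, let us compute sin(x) where x = single(2^24-1), which has no zeros after its significant bits (M = 0). This simplifies finding B, as B then consists of the first 116 bits of 2/pi. Since x has 24 bits of precision and B has 116 bits, the product

y = x * B

will have 92 bits of precision, as required.

Section 3 of the linked paper describes how to perform this product with enough precision; the same algorithm can be used with blocks of size 11 to compute y in our case. This being drudgery, I hope I am excused for not doing it explicitly and instead relying on Matlab's Symbolic Math Toolbox. This toolbox provides the vpa function, which lets us specify the precision of a number in decimal digits. So,

vpa('2/pi', ceil(116*log10(2)))

will produce an approximation of 2/pi with at least 116 bits of precision. Because vpa accepts only integers for its precision argument, we usually cannot specify the binary precision of a number exactly, so we use the next best thing.

The following code computes sin(x) according to the paper, in single precision: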

x = single(2^24-1);
y = x *  vpa('2/pi', ceil(116*log10(2)));    % Precision = 103.075
k = round(y);
f = single(y - k);
r = f * single(pi) / 2;
switch mod(k, 4)
    case 0 
        s = sin(r);
    case 1
        s = cos(r);
    case 2
        s = -sin(r);
    case 3
        s = -cos(r);
end
sin(x) - s                                   % Expected value: exactly zero.

(The precision of y was obtained using Mathematica, which turned out to be a much better numerical tool than Matlab :) )

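In libm

The other answer to this question (which has since been deleted) led me to an implementation in libm which, although it works on double-precision numbers, follows the linked paper very thoroughly.

See the file s_sin.c for the wrapper (Table 2 of the linked paper appears as a switch statement at the end of the file) and e_rem_pio2.c for the argument reduction code (of particular interest is an array containing the first 396 hex digits of 2/pi, starting at line 69).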

Regarding c++ - Range reduction poor precision for single precision floating point, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9423516/
