algorithm - 何时使用 dsgesv 与 dgesv 来求解线性方程组

在 LAPACK 文档中，它指出 DSGESV(或复数的 ZCGESV)是:

The dsgesv and zcgesv are mixed precision iterative refinement subroutines for exploiting fast single precision hardware. They first attempt to factorize the matrix in single precision (dsgesv) or single complex precision (zcgesv) and use this factorization within an iterative refinement procedure to produce a solution with double precision (dsgesv) / double complex precision (zcgesv) normwise backward error quality (see below). If the approach fails, the method switches to a double precision or double complex precision factorization respectively and computes the solution.

The iterative refinement is not going to be a winning strategy if the ratio single precision performance over double precision performance is too small. A reasonable strategy should take the number of right-hand sides and the size of the matrix into account. This might be done with a call to ilaenv in the future. At present, iterative refinement is implemented.

但是我如何知道单精度性能与 double 性能的比率是多少？有人建议考虑矩阵的大小，但我不知道矩阵的大小到底如何导致此性能比的估计。

有人能澄清这些事情吗？

最佳答案

我的猜测是最好的方法是测试 dgesv() 和 dsgesv()...

查看函数源码dsgesv() Lapack 的 dsgesv() 尝试执行以下操作:

将矩阵 A 转换为 float As
调用sgetrf():LU 分解，单精度
通过调用 sgetrs() 使用 LU 分解求解系统 As.x=b
计算 double 余数r=b-Ax并再次使用sgetrs()求解As.x'=r，添加x=x+x'.

重复最后一步，直到达到 double (最多 30 次迭代)。定义成功的标准是:

$||r||_\infty\leq||x||_\infty.||A||_\infty.\varepsilon.n$

哪里 $\varepsilon$ 是 double float 的精度(大约 1e-13)和 $n$ 是矩阵的大小。如果失败，dsgesv() 将恢复到 dgesv()，因为它先调用 dgetrf()(因式分解)，然后调用 dgetrs() 。因此，dsgesv() 是一种混合精度算法。请参阅this article例如。

最后，对于少量右侧和大型矩阵，dsgesv() 预计将优于 dgesv() ，即分解 sgetrf()/dgetrf() $O(n^3)$ 的成本远高于替换 sgetrs()/dgetrs() $O(n^2.n_{rhs})$ 之一。由于 dsgesv() 中设置的最大迭代次数为 30，因此近似限制为

$n\geq30.n_{rhs}$

此外，sgetrf() 必须证明比 dgetrf() 快得多。由于可用内存带宽或向量处理有限，sgetrf() 可能会更快(查找 SIMD，来自 SSE 的示例:指令 ADDPS)。

可以测试dsgesv()的参数iter来检查迭代细化是否有用。如果为负数，则迭代细化失败，并且使用 dsgesv() 只是浪费时间!

这里是用于比较和计时 dgesv()、sgesv()、dsgesv() 的 C 代码。它可以通过 gcc main.c -o main -llapacke -llapack -lblas 编译，随意测试您自己的矩阵!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>

#include <lapacke.h>

int main(void){

    srand (time(NULL));

    //size of the matrix
    int n=2000;
    // number of right-hand size
    int nb=3;

    int nbrun=1000*100*100/n/n;

    //memory initialization
    double *aaa=malloc(n*n*sizeof(double));
    if(aaa==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    double *aa=malloc(n*n*sizeof(double));
    if(aa==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    double *bbb=malloc(n*nb*sizeof(double));
    if(bbb==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    double *x=malloc(n*nb*sizeof(double));
    if(x==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    double *bb=malloc(n*nb*sizeof(double));
    if(bb==NULL){fprintf(stderr,"malloc failed\n");exit(1);}

    float *aaas=malloc(n*n*sizeof(float));
    if(aaas==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    float *aas=malloc(n*n*sizeof(float));
    if(aas==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    float *bbbs=malloc(n*n*sizeof(float));
    if(bbbs==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
    float *bbs=malloc(n*nb*sizeof(float));
    if(bbs==NULL){fprintf(stderr,"malloc failed\n");exit(1);}

    int *ipiv=malloc(n*nb*sizeof(int));
    if(ipiv==NULL){fprintf(stderr,"malloc failed\n");exit(1);}

    int i,j;

    //matrix initialization
    double cond=1e3;
    for(i=0;i<n;i++){
        for(j=0;j<n;j++){
            if(j==i){
                aaa[i*n+j]=pow(cond,(i+1)/(double)n);
            }else{
                aaa[i*n+j]=1.9*(rand()/(double)RAND_MAX-0.5)*pow(cond,(i+1)/(double)n)/(double)n;
                //aaa[i*n+j]=(rand()/(double)RAND_MAX-0.5)/(double)n;
                //aaa[i*n+j]=0;
            }
        }
        bbb[i]=i;
    }

    for(i=0;i<n;i++){
        for(j=0;j<n;j++){
            aaas[i*n+j]=aaa[i*n+j];
        }
        bbbs[i]=bbb[i];
    }

    int k=0;
    int ierr;


    //estimating the condition number of the matrix
    memcpy(aa,aaa,n*n*sizeof(double));
    double anorm;
    double rcond;
    //anorm=LAPACKE_dlange( LAPACK_ROW_MAJOR, 'i', n,n, aa, n);
    double work[n];
    anorm=LAPACKE_dlange_work(LAPACK_ROW_MAJOR, 'i', n, n, aa, n, work );
    ierr=LAPACKE_dgetrf( LAPACK_ROW_MAJOR, n, n,aa, n,ipiv );
    if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgetrf", ierr );}
    ierr=LAPACKE_dgecon(LAPACK_ROW_MAJOR, 'i', n,aa, n,anorm,&rcond );
    if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgecon", ierr );}
    printf("condition number is %g\n",anorm,1./rcond);

    //testing dgesv()
    clock_t t;
    t = clock();
    for(k=0;k<nbrun;k++){

        memcpy(bb,bbb,n*nb*sizeof(double));
        memcpy(aa,aaa,n*n*sizeof(double));



        ierr=LAPACKE_dgesv(LAPACK_ROW_MAJOR,n,nb,aa,n,ipiv,bb,nb);
        if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgesv", ierr );}

    }

    //testing sgesv()
    t = clock() - t;
    printf ("dgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);

    t = clock();
    for(k=0;k<nbrun;k++){

        memcpy(bbs,bbbs,n*nb*sizeof(float));
        memcpy(aas,aaas,n*n*sizeof(float));



        ierr=LAPACKE_sgesv(LAPACK_ROW_MAJOR,n,nb,aas,n,ipiv,bbs,nb);
        if(ierr<0){LAPACKE_xerbla( "LAPACKE_sgesv", ierr );}

    }

    //testing dsgesv()
    t = clock() - t;
    printf ("sgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);

    int iter;
    t = clock();
    for(k=0;k<nbrun;k++){

        memcpy(bb,bbb,n*nb*sizeof(double));
        memcpy(aa,aaa,n*n*sizeof(double));


        ierr=LAPACKE_dsgesv(LAPACK_ROW_MAJOR,n,nb,aa,n,ipiv,bb,nb,x,nb,&iter);
        if(ierr<0){LAPACKE_xerbla( "LAPACKE_dsgesv", ierr );}

    }
    t = clock() - t;
    printf ("dsgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);

    if(iter>0){
        printf("iterative refinement has succeded, %d iterations\n");
    }else{
        printf("iterative refinement has failed due to");
        if(iter==-1){
            printf(" implementation- or machine-specific reasons\n");
        }
        if(iter==-2){
            printf(" overflow in iterations\n");
        }
        if(iter==-3){
            printf(" failure of single precision factorization sgetrf() (ill-conditionned?)\n");
        }
        if(iter==-31){
            printf(" max number of iterations\n");
        }
    }
    free(aaa);
    free(aa);
    free(bbb);
    free(bb);
    free(x);


    free(aaas);
    free(aas);
    free(bbbs);
    free(bbs);

    free(ipiv);

    return 0;
}

n=2000 的输出:

condition number is 1475.26

dgesv()x2 took me 5260000 clicks (5.260000 seconds).

sgesv()x2 took me 3560000 clicks (3.560000 seconds).

dsgesv()x2 took me 3790000 clicks (3.790000 seconds).

iterative refinement has succeded, 11 iterations

关于algorithm - 何时使用 dsgesv 与 dgesv 来求解线性方程组，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/36241631/

algorithm - 何时使用 dsgesv 与 dgesv 来求解线性方程组

上一篇：algorithm - 在包含数百万个节点的链表中查找循环

下一篇：algorithm - 当 n > 1、M(1) = 1 时，Karatsuba 算法 M(n) = 3M(n/2) 的递推关系如何？