c++ - Floating point vs integer calculations on modern hardware

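I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a lot of annoying problems and adds a lot of annoying code.

Now, I remember reading about how slow floating-point calculations were around the days of the 386, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating-point or integer calculations? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?

I know the correct answer is to benchmark on the target hardware, so what would be a good way to test this? I wrote two tiny C++ programs and compared their run times with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs, etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?

The programs I used are as follows; they are not identical by any means: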

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    // unsigned: the sum exceeds INT_MAX, and signed overflow is undefined behavior
    unsigned int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}

Program 2:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{

    float accum = 0;
    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}

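Thanks in advance!

Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.

Edit 2 (pasted from a comment below): We currently have an extensive code base. Really, I have come up against the generalization that we "must not use float since integer calculation is faster", and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.

Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).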

Best answer

For example (lower numbers are faster),

64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3

short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]

32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3

short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]

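As Dan pointed out, even once you normalize for clock frequency (which can itself be misleading in pipelined designs), results will vary wildly with CPU architecture: the performance of the individual ALU/FPU matters, and so does the actual number of ALUs/FPUs available per core in superscalar designs, which determines how many independent operations can execute in parallel. The latter factor is not exercised by the poor man's benchmark below, because all of its operations are sequentially dependent.

Poor man's FPU/ALU operation benchmark: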
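To illustrate the point about independent operations, here is a minimal sketch that is not part of the original answer. It sums an array once through a single dependency chain and once through four independent accumulators. The function names `sum_one_chain` and `sum_four_chains` are hypothetical, and whether the second version is actually faster depends on the CPU, the compiler, and its flags.

```cpp
#include <cassert>
#include <cstddef>

// One dependency chain: each addition has to wait for the previous result.
float sum_one_chain(const float* x, std::size_t n) {
  assert(x != nullptr || n == 0);
  float a = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    a += x[i];
  }
  return a;
}

// Four independent chains: a superscalar core may overlap these additions,
// because a0..a3 do not depend on one another inside the loop.
float sum_four_chains(const float* x, std::size_t n) {
  assert(x != nullptr || n == 0);
  float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    a0 += x[i];
    a1 += x[i + 1];
    a2 += x[i + 2];
    a3 += x[i + 3];
  }
  // Leftover elements when n is not a multiple of four.
  for (; i < n; ++i) {
    a0 += x[i];
  }
  return (a0 + a1) + (a2 + a3);
}
```

Timing both functions over the same large array is one way to see how much parallelism the target CPU offers for a given type. For general floating-point data the two results can differ slightly in rounding, because the additions are grouped differently.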

#include <stdio.h>
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>
#include <cstdlib>

double
mygettime(void) {
# ifdef _WIN32
  struct _timeb tb;
  _ftime(&tb);
  return (double)tb.time + (0.001 * (double)tb.millitm);
# else
  struct timeval tv;
  if(gettimeofday(&tv, 0) < 0) {
    perror("oops");
  }
  return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
# endif
}

template< typename Type >
void my_test(const char* name) {
  Type v  = 0;
  // Do not use constants or repeating values
  //  to avoid loop unroll optimizations.
  // All values >0 to avoid division by 0
  // Perform ten ops/iteration to reduce
  //  impact of ++i below on measurements
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type v4 = (Type)(rand() % 256)/16 + 1;
  Type v5 = (Type)(rand() % 256)/16 + 1;
  Type v6 = (Type)(rand() % 256)/16 + 1;
  Type v7 = (Type)(rand() % 256)/16 + 1;
  Type v8 = (Type)(rand() % 256)/16 + 1;
  Type v9 = (Type)(rand() % 256)/16 + 1;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v += v0;
    v -= v1;
    v += v2;
    v -= v3;
    v += v4;
    v -= v5;
    v += v6;
    v -= v7;
    v += v8;
    v -= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v&1);
  t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v /= v0;
    v *= v1;
    v /= v2;
    v *= v3;
    v /= v4;
    v *= v5;
    v /= v6;
    v *= v7;
    v /= v8;
    v *= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v&1);
}

int main() {
  my_test< short >("short");
  my_test< long >("long");
  my_test< long long >("long long");
  my_test< float >("float");
  my_test< double >("double");

  return 0;
}
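Since the question mentions noisy timings on a virtual server, here is a rough sketch of one common way to reduce that noise: run the same workload several times and keep the fastest run. This is an assumption on my part rather than something from the answer above; `best_of` is a hypothetical helper, and it uses `std::chrono::steady_clock` (C++11) instead of the `gettimeofday`/`_ftime` calls in the benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: calls `work` `runs` times and returns the
// fastest wall-clock time in seconds. The minimum is the run that was
// least disturbed by other activity on the machine.
template <typename Work>
double best_of(int runs, Work work) {
  assert(runs > 0);
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < runs; ++r) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double elapsed =
        std::chrono::duration<double>(stop - start).count();
    best = std::min(best, elapsed);
  }
  return best;
}
```

Each loop from `my_test` could be wrapped in a callable and passed to `best_of` with, say, five or ten runs. The workload still has to produce a value that is used afterwards, otherwise the compiler may remove it entirely.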

Regarding "c++ - Floating point vs integer calculations on modern hardware", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2550281/
