Java vs C++ (g++) vs C++ (Visual Studio) performance

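EDIT: In light of the first answer, I removed the "myexp()" function, since it was a bug and not the point of the discussion.

I have a simple piece of code that I compiled for different platforms, and I get different performance results (execution time):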

Java 8/Linux: 3.5 seconds

Run command: java -server Test

C++/gcc 4.8.3: 6.22 seconds

Compiler options: -O3

C++/Visual Studio 2015: 1.7 seconds

Compiler options: /Og /Ob2 /Oi

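It seems VS has these extra options that are not available for the g++ compiler.

My question is: why is Visual Studio (with those compiler options) so much faster than both Java and C++ built with g++ (with -O3 optimization, which I believe is the most aggressive level)?

You can find both the Java and the C++ code below.

C++ code: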

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <cmath>


static unsigned int g_seed;

//Used to seed the generator.
inline void fast_srand( int seed )
{
    g_seed = seed;
}

//fastrand routine returns one integer, similar output value range as C lib.
inline int fastrand()
{
    g_seed = ( 214013 * g_seed + 2531011 );
    return ( g_seed >> 16 ) & 0x7FFF;
}

int main()
{
    static const int NUM_RESULTS = 10000;
    static const int NUM_INPUTS  = 10000;

    double dInput[NUM_INPUTS];
    double dRes[NUM_RESULTS];

    fast_srand(10);

    clock_t begin = clock();

    for ( int i = 0; i < NUM_RESULTS; i++ )
    {
        dRes[i] = 0;

        for ( int j = 0; j < NUM_INPUTS; j++ )
        {
           dInput[j] = fastrand() * 1000;
           dInput[j] = log10( dInput[j] );
           dRes[i] += dInput[j];
        }
     }


    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;

    printf( "Total execution time: %f sec - %f\n", elapsed_secs, dRes[0]);

    return 0;
}

Java code:

import java.util.concurrent.TimeUnit;


public class Test
{

    static int g_seed;

    static void fast_srand( int seed )
    {
        g_seed = seed;
    }

    //fastrand routine returns one integer, similar output value range as C lib.
    static int fastrand()
    {
        g_seed = ( 214013 * g_seed + 2531011 );
        return ( g_seed >> 16 ) & 0x7FFF;
    }


    public static void main(String[] args)
    {
        final int NUM_RESULTS = 10000;
        final int NUM_INPUTS  = 10000;


        double[] dRes = new double[NUM_RESULTS];
        double[] dInput = new double[NUM_INPUTS];


        fast_srand(10);

        long nStartTime = System.nanoTime();

        for ( int i = 0; i < NUM_RESULTS; i++ )
        {
            dRes[i] = 0;

            for ( int j = 0; j < NUM_INPUTS; j++ )
            {
               dInput[j] = fastrand() * 1000;
               dInput[j] = Math.log( dInput[j] );
               dRes[i] += dInput[j];
            }
        }

        long nDifference = System.nanoTime() - nStartTime;

        System.out.printf( "Total execution time: %f sec - %f\n", TimeUnit.NANOSECONDS.toMillis(nDifference) / 1000.0, dRes[0]);
    }
}

Best answer

The function

static inline double myexp( double val )
{
    const long tmp = (long)( 1512775 * val + 1072632447 );
    return double( tmp << 32 );
}

gives this warning in MSVC:

warning C4293: '<<' : shift count negative or too big, undefined behavior

After changing it to:

static inline double myexp(double val)
{
    const long long tmp = (long long)(1512775 * val + 1072632447);
    return double(tmp << 32);
}

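the code takes about 4 seconds in MSVC as well.

So, apparently MSVC optimized away a lot of code, possibly the whole myexp() function (and maybe even other code that depends on its result), because it is allowed to (remember: undefined behavior).

Lesson learned: check (and fix) the warnings as well.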
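The root cause can be made visible with a small check. The sketch below is not part of the original answer: `shift_in_range` and `myexp_fixed` are hypothetical helpers, and it assumes the usual situation where `long` is 32 bits wide with MSVC on Windows but 64 bits wide with g++ on 64-bit Linux, which would explain why only MSVC warned about the shift.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical helper: true if shifting a value of type T left by `count`
// bits stays inside the width of T. A shift count that is negative or
// >= the width of the type is undefined behavior.
template <typename T>
constexpr bool shift_in_range(int count)
{
    return count >= 0 && count < static_cast<int>(sizeof(T) * CHAR_BIT);
}

// The fix above relies on a type that is at least 64 bits wide everywhere.
static_assert(shift_in_range<long long>(32), "long long must allow a 32-bit shift");

// Hypothetical variant of myexp() with a fixed-width type, so the result
// does not depend on how wide `long` happens to be on the platform.
// Assumption: the input is small and non-negative, so the shifted value
// still fits into a signed 64-bit integer.
static inline double myexp_fixed(double val)
{
    const std::int64_t tmp = static_cast<std::int64_t>(1512775 * val + 1072632447);
    return double(tmp << 32);
}
```

On a platform where `long` is 32 bits, `shift_in_range<long>(32)` evaluates to false, which is exactly the condition warning C4293 reports.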

Note that if I try to print the result inside the function, the MSVC-optimized version gives me (for every call):

tmp: -2147483648
result: 0.000000

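That is, MSVC optimized the undefined behavior so that the function always returns 0. It is also interesting to look at the assembly output to see what else was optimized away because of this.

So, after checking the assembly, the fixed version has the following code: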

; 52   :             dInput[j] = myexp(dInput[j]);
; 53   :             dInput[j] = log10(dInput[j]);

    mov eax, esi
    shr eax, 16                 ; 00000010H
    and eax, 32767              ; 00007fffH
    imul    eax, eax, 1000
    movd    xmm0, eax
    cvtdq2pd xmm0, xmm0
    mulsd   xmm0, QWORD PTR __real@4137154700000000
    addsd   xmm0, QWORD PTR __real@41cff7893f800000
    call    __dtol3
    mov edx, eax
    xor ecx, ecx
    call    __ltod3
    call    __libm_sse2_log10_precise

; 54   :             dRes[i] += dInput[j];

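In the original version this entire block is missing, i.e. the call to log10() was apparently optimized away as well and replaced by a constant at the end (apparently -INF, which is the result of log10(0.0); in fact, the result might also be undefined or implementation-defined). In addition, the whole myexp() function was replaced by an fldz instruction (basically, "load zero"). So that explains the extra speed :)

EDIT

Regarding the performance difference when using the real exp(): the assembly output may give some clues.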
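To illustrate why the compiler could replace the whole inner loop with a constant, here is a minimal sketch that is not taken from the original answer. `sum_log10_of_zero` is a hypothetical helper, and the -infinity result assumes an IEEE 754 floating-point implementation (the common case on x86).

```cpp
#include <cmath>

// Hypothetical helper: adds log10(0.0) to an accumulator `n` times, which is
// what the inner loop effectively computes once myexp() is folded to 0.
static inline double sum_log10_of_zero(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += std::log10(0.0); // -infinity on IEEE 754 implementations
    return sum;
}
```

Once a single -infinity has been added, the accumulator stays at -infinity no matter how many iterations follow, so the result of the loop is known at compile time.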

In particular, for MSVC you can use these additional parameters:

/FAs /Qvec-report:2

/FAs generates an assembly listing (interleaved with the source code)

/Qvec-report:2 provides useful information about the vectorization status:

test.cpp(49) : info C5002: loop not vectorized due to reason '1304'
test.cpp(45) : info C5002: loop not vectorized due to reason '1106'

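The reason codes are documented here: https://msdn.microsoft.com/en-us/library/jj658585.aspx - in particular, MSVC does not seem to be able to vectorize the loops properly. According to the assembly listing, however, it still uses SSE2 functions (which is still a kind of "vectorization" and improves speed significantly).

Similar parameters for GCC are: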

-funroll-loops -ftree-vectorizer-verbose=1

This gives me the following result:

Analyzing loop at test.cpp:42
Analyzing loop at test.cpp:46
test.cpp:30: note: vectorized 0 loops in function.
test.cpp:46: note: Unroll loop 3 times

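Apparently g++ is not able to vectorize the loops either, but it does unroll them (in the assembly I can see the loop body duplicated 3 times), which could also explain the better performance.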
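For readers unfamiliar with the transformation, the following sketch shows by hand roughly what unrolling by a factor of 3 means. It is an illustration only, not the code g++ actually generates; `sum_log10` and `sum_log10_unrolled3` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>

// Straightforward loop: one log10() call and one loop test per element.
static inline double sum_log10(const double* in, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += std::log10(in[i]);
    return sum;
}

// Hand-unrolled by 3: the body is repeated three times per iteration, so the
// loop counter is updated and tested only once for every three elements.
static inline double sum_log10_unrolled3(const double* in, std::size_t n)
{
    double sum = 0.0;
    std::size_t i = 0;
    for (; i + 3 <= n; i += 3)
    {
        sum += std::log10(in[i]);
        sum += std::log10(in[i + 1]);
        sum += std::log10(in[i + 2]);
    }
    // Remainder loop for the last n % 3 elements.
    for (; i < n; ++i)
        sum += std::log10(in[i]);
    return sum;
}
```

Both functions add the elements in the same order, so they are expected to return the same sum; the unrolled version only reduces loop overhead, and any gain is likely small next to the cost of the log10() calls themselves.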

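Unfortunately, this is where Java falls short AFAIK, since Java does not do any vectorization, SSE2 or loop unrolling, and is therefore much slower than the optimized C++ version. See for instance: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions? There, JNI is suggested for better performance (i.e. doing the computation in a C/C++ DLL called through the JNI interface of the Java application).

Source: a similar question on Stack Overflow about Java vs C++ (g++) vs C++ (Visual Studio) performance: https://stackoverflow.com/questions/41140217/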
