java - Decreased performance on newer JVMs

Tags: java performance jvm jmh

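On newer JVMs, summing all elements of an array in a for loop has lower performance (throughput) than on the JVM from JDK 1.8.0. I ran a JMH benchmark (charts below). Before each test, the source code was compiled with javac.exe and run with java.exe, both binaries taken from the selected JDK. The tests were executed on Windows 10 and launched by a PowerShell script, with no other programs (and no other JVMs) running in the background. The computer has 32GB of RAM, so virtual memory on the HDD was not used.
10M elements in the array:
[image: JMH benchmark results, 10M elements]
100M elements in the array:
[image: JMH benchmark results, 100M elements]
My benchmark source code: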

@Param({"10000000", "100000000"})
public static int ELEMENTS;

public static void main(String[] args) throws RunnerException, IOException {
    File outputFile = new File(args[0]);

    int javaMajorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]);

    ChainedOptionsBuilder builder = new OptionsBuilder()
            .include(IteratingBenchmark.class.getSimpleName())
            .mode(Mode.Throughput)
            .forks(2)
            .measurementTime(TimeValue.seconds(10))
            .measurementIterations(50)
            .warmupTime(TimeValue.seconds(2))
            .warmupIterations(10)
            .resultFormat(ResultFormatType.SCSV)
            .result(outputFile.getAbsolutePath());

    if (javaMajorVersion > 8) {
        builder = builder.jvmArgs("-Xms20g", "-Xmx20g", "--enable-preview");
    } else {
        builder = builder.jvmArgs("-Xms20g", "-Xmx20g");
    }

    new Runner(builder.build()).run();
}

@Benchmark
public static void cStyleForLoop(Blackhole bh, MockData data) {
    long sum = 0;
    for (int i = 0; i < data.randomInts.length; i++) {
        sum += data.randomInts[i];
    }

    bh.consume(sum);
}

@State(Scope.Thread)
public static class MockData {
    private int[] randomInts = new int[ELEMENTS];

    @Setup(Level.Iteration)
    public void setup() {
        Random r = new Random();
        this.randomInts = Stream.iterate(r.nextInt(), i -> i + r.nextInt(1022) + 1).mapToInt(Integer::intValue).limit(ELEMENTS).toArray();
    }
}
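One detail worth noting in `main`: `System.getProperty("java.version")` returns strings such as `1.8.0_241` on Java 8 and `14.0.1` or `15` on later releases, so taking the first dot-separated component yields 1 for Java 8. That is enough for the `> 8` check above, but it would throw on a version string such as `15-ea`. A minimal sketch of a slightly more defensive parser follows; `JavaVersion` and `majorVersion` are hypothetical names and are not part of the original benchmark.

```java
public class JavaVersion {

    // Returns the major Java version for strings like "1.8.0_241" (-> 8),
    // "11.0.2" (-> 11), "15" (-> 15) or "15-ea" (-> 15).
    static int majorVersion(String version) {
        // Legacy scheme used up to Java 8: "1.<major>.<minor>_<update>".
        if (version.startsWith("1.")) {
            version = version.substring(2);
        }
        // Keep only the leading run of digits.
        int end = 0;
        while (end < version.length() && Character.isDigit(version.charAt(end))) {
            end++;
        }
        return Integer.parseInt(version.substring(0, end));
    }

    public static void main(String[] args) {
        // Prints the major version of the JVM running this code.
        System.out.println(majorVersion(System.getProperty("java.version")));
    }
}
```

With such a helper, the version check in the benchmark would read `majorVersion(...) > 8` regardless of the version string format.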
Raw data:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;331,446104;5,563589;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;33,757268;0,431403;"ops/s";100000000

JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,728461;4,823611;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,075948;0,062830;"ops/s";100000000

JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,914782;4,450969;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,095232;0,075051;"ops/s";100000000

JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;325,103055;4,933257;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,228403;0,067954;"ops/s";100000000

JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,861148;0,443404;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,863602;0,035781;"ops/s";100000000

OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,781930;0,481579;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,873509;0,033055;"ops/s";100000000

OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;343,530895;0,445551;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;34,287083;0,035028;"ops/s";100000000
Is there any valid explanation for why newer versions of Java are slower than 1.8 (except for OpenJDK 15)?
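For reference, the size of the gap can be computed directly from the raw scores above. The sketch below is only an illustration: the class name `RelativeThroughput` and its helper methods are hypothetical, and it hard-codes a few of the 10M-element scores listed in the raw data. The scores use a decimal comma, so they are converted before parsing. Relative to JDK 1.8.0_241, this gives roughly 2.6% lower throughput for JDK 11.0.2, 9.2% lower for JDK 14.0.1, and 3.6% higher for OpenJDK 15.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeThroughput {

    // Parses a score written with a decimal comma, e.g. "331,446104".
    static double parseScore(String raw) {
        return Double.parseDouble(raw.replace(',', '.'));
    }

    // Relative change of score versus baseline, in percent (negative means slower).
    static double percentChange(double baseline, double score) {
        return (score - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double baseline = parseScore("331,446104"); // JDK 1.8.0_241, 10M elements

        // A few 10M-element scores copied from the raw data.
        Map<String, String> scores = new LinkedHashMap<>();
        scores.put("JDK 11.0.2", "322,728461");
        scores.put("JDK 14.0.1", "300,861148");
        scores.put("OpenJDK 15", "343,530895");

        for (Map.Entry<String, String> e : scores.entrySet()) {
            double change = percentChange(baseline, parseScore(e.getValue()));
            System.out.printf("%s: %+.1f%% vs JDK 8%n", e.getKey(), change);
        }
    }
}
```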
Update 1:
I ran the same test with different Xmx/Xms values (with Xmx == Xms for each test); the results are below:
[image: benchmark results for different Xmx/Xms values]

Update 2:
  • First, I changed Level.Iteration to Level.Trial.
  • Second, I forced the G1 garbage collector.
  • Third, Xmx/Xms were set to 8GB.

  • Results:
    [image: benchmark results after the changes in Update 2]
    Raw data:
    JDK 1.8.0_241:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;33,760346;0,089646;"ops/s";100000000
    
    JDK 11.0.2:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,075120;0,086171;"ops/s";100000000
    
    JDK 12.0.1:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,173939;0,044176;"ops/s";100000000
    
    JDK 13.0.1:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,219283;0,062329;"ops/s";100000000
    
    JDK 14.0.1:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,808609;0,072664;"ops/s";100000000
    
    OpenJDK 14.0.2:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,845817;0,074315;"ops/s";100000000
    
    OpenJDK 15:
    "Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
    "benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;34,310620;0,087412;"ops/s";100000000
    
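    Update 3:
    I created a GitHub repository containing the benchmark source code, along with a script that runs the benchmark with the JMH parameters I used and automatically generates the charts in PNG format.
    In addition, I ran the benchmark on another machine (Linux). The results on the Linux machine look more optimistic:
    [image: benchmark results on the Linux machine]
    Unfortunately, on my Windows machine the results still show a performance drop (except for JDK 15).
    Update 4:
    Results with -XX:-UseCountedLoopSafepoints:
    [image: benchmark results with -XX:-UseCountedLoopSafepoints]
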
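    Best answer

    Even after copying your benchmark verbatim from GitHub and running it with the same parameters, I could not reproduce the results. In my environment, JDK 14 runs as fast as JDK 8 (even a bit faster). So in this answer I will analyze the difference between the two versions based on the disassembly of the compiled code.
    First, let's take the latest OpenJDK builds from the same vendor.
    Here I compare Liberica JDK 8u265+1 and Liberica JDK 14.0.2+13 for Windows 64-bit.
    The JMH scores are as follows: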

    Benchmark                         (ELEMENTS)   Mode  Cnt    Score   Error  Units
    IteratingBenchmark.cStyleForLoop    10000000  thrpt   30  263,137 ± 0,484  ops/s  # JDK 8
    IteratingBenchmark.cStyleForLoop    10000000  thrpt   30  264,406 ± 0,788  ops/s  # JDK 14
    
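    Now let's run the built-in JMH -prof xperfasm profiler to see the disassembly of the hottest part of the benchmark. As expected, about 99.5% of the CPU time is spent in the C2-compiled cStyleForLoop method.
    Hottest region on JDK 8: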
    ....[Hottest Region 1]..............................................................................
    C2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 574 (71 bytes) 
    
                 0x0000028c5607fc5f: add     r10d,0fffffff9h
                 0x0000028c5607fc63: lea     rax,[r12+rcx*8]
                 0x0000028c5607fc67: mov     ebx,80000000h
                 0x0000028c5607fc6c: cmp     r9d,r10d
                 0x0000028c5607fc6f: cmovl   r10d,ebx
                 0x0000028c5607fc73: mov     r9d,1h
                 0x0000028c5607fc79: cmp     r10d,1h
             ╭   0x0000028c5607fc7d: jle     28c5607fccch
             │   0x0000028c5607fc7f: nop                       ;*lload_2
             │                                                 ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
      0,07%  │↗  0x0000028c5607fc80: movsxd  rbx,dword ptr [rax+r9*4+10h]
      0,06%  ││  0x0000028c5607fc85: add     rbx,r8
      8,93%  ││  0x0000028c5607fc88: movsxd  rcx,r9d
      0,41%  ││  0x0000028c5607fc8b: movsxd  r8,dword ptr [rax+rcx*4+2ch]
     25,02%  ││  0x0000028c5607fc90: movsxd  rdi,dword ptr [rax+rcx*4+14h]
      0,10%  ││  0x0000028c5607fc95: movsxd  rsi,dword ptr [rax+rcx*4+18h]
      8,56%  ││  0x0000028c5607fc9a: movsxd  rbp,dword ptr [rax+rcx*4+28h]
      0,58%  ││  0x0000028c5607fc9f: movsxd  r13,dword ptr [rax+rcx*4+1ch]
      0,41%  ││  0x0000028c5607fca4: movsxd  r14,dword ptr [rax+rcx*4+20h]
      0,20%  ││  0x0000028c5607fca9: movsxd  rcx,dword ptr [rax+rcx*4+24h]
      8,85%  ││  0x0000028c5607fcae: add     rdi,rbx
      0,38%  ││  0x0000028c5607fcb1: add     rsi,rdi
      0,15%  ││  0x0000028c5607fcb4: add     r13,rsi
      8,57%  ││  0x0000028c5607fcb7: add     r14,r13
     13,76%  ││  0x0000028c5607fcba: add     rcx,r14
      5,51%  ││  0x0000028c5607fcbd: add     rbp,rcx
      8,50%  ││  0x0000028c5607fcc0: add     r8,rbp            ;*ladd
             ││                                                ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
      8,95%  ││  0x0000028c5607fcc3: add     r9d,8h            ;*iinc
             ││                                                ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
      0,40%  ││  0x0000028c5607fcc7: cmp     r9d,r10d
             │╰  0x0000028c5607fcca: jl      28c5607fc80h      ;*if_icmpge
             │                                                 ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
             ↘   0x0000028c5607fccc: cmp     r9d,edx
                 0x0000028c5607fccf: jnl     28c5607fce4h
                 0x0000028c5607fcd1: nop                       ;*lload_2
                                                               ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
                 0x0000028c5607fcd4: movsxd  r10,dword ptr [rax+r9*4+10h]
                 0x0000028c5607fcd9: add     r8,r10            ;*ladd
                                                               ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
    ....................................................................................................
    
    Hottest region on JDK 14:
    ....[Hottest Region 1]..............................................................................
    c2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 622 (147 bytes) 
    
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@23 (line 25)
                   0x000001e844438f72:   mov     r11d,r10d
                   0x000001e844438f75:   add     r11d,0fffffff9h
                   0x000001e844438f79:   lea     rax,[r12+r9*8]
                   0x000001e844438f7d:   mov     ebx,1h
                   0x000001e844438f82:   cmp     r11d,1h
                   0x000001e844438f86:   jle     1e8444390c0h                ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
             ╭     0x000001e844438f8c:   jmp     1e844438ffah
             │     0x000001e844438f8e:   nop
      0,04%  │↗    0x000001e844438f90:   mov     rsi,r8                      ;*lload_2 {reexecute=0 rethrow=0 return_oop=0}
             ││                                                              ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
      0,04%  ││ ↗  0x000001e844438f93:   movsxd  rdx,dword ptr [rax+rbx*4+10h]
      8,41%  ││ │  0x000001e844438f98:   movsxd  rbp,dword ptr [rax+rbx*4+14h]
      1,23%  ││ │  0x000001e844438f9d:   movsxd  r13,dword ptr [rax+rbx*4+18h]
      0,03%  ││ │  0x000001e844438fa2:   movsxd  r8,dword ptr [rax+rbx*4+2ch]
     23,87%  ││ │  0x000001e844438fa7:   movsxd  r11,dword ptr [rax+rbx*4+28h]
      8,22%  ││ │  0x000001e844438fac:   movsxd  r9,dword ptr [rax+rbx*4+24h]
      1,25%  ││ │  0x000001e844438fb1:   movsxd  rcx,dword ptr [rax+rbx*4+20h]
      0,14%  ││ │  0x000001e844438fb6:   movsxd  r14,dword ptr [rax+rbx*4+1ch]
      0,28%  ││ │  0x000001e844438fbb:   add     rdx,rsi
      7,82%  ││ │  0x000001e844438fbe:   add     rbp,rdx
      1,14%  ││ │  0x000001e844438fc1:   add     r13,rbp
      0,17%  ││ │  0x000001e844438fc4:   add     r14,r13
     14,57%  ││ │  0x000001e844438fc7:   add     rcx,r14
     11,05%  ││ │  0x000001e844438fca:   add     r9,rcx
      5,26%  ││ │  0x000001e844438fcd:   add     r11,r9
      6,32%  ││ │  0x000001e844438fd0:   add     r8,r11                      ;*ladd {reexecute=0 rethrow=0 return_oop=0}
             ││ │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
      8,45%  ││ │  0x000001e844438fd3:   add     ebx,8h                      ;*iinc {reexecute=0 rethrow=0 return_oop=0}
             ││ │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
      1,15%  ││ │  0x000001e844438fd6:   cmp     ebx,edi
             │╰ │  0x000001e844438fd8:   jl      1e844438f90h                ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
             │  │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
             │  │  0x000001e844438fda:   mov     r11,qword ptr [r15+110h]    ; ImmutableOopMap {rax=Oop xmm0=Oop xmm1=Oop }
             │  │                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
             │  │                                                            ; - (reexecute) codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
      0,00%  │  │  0x000001e844438fe1:   test    dword ptr [r11],eax         ;*goto {reexecute=0 rethrow=0 return_oop=0}
             │  │                                                            ; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
             │  │                                                            ;   {poll}
      0,02%  │  │  0x000001e844438fe4:   cmp     ebx,dword ptr [rsp]
             │ ╭│  0x000001e844438fe7:   jnl     1e844439028h
      0,00%  │ ││  0x000001e844438fe9:   mov     rsi,r8
             │ ││  0x000001e844438fec:   vmovq   r8,xmm0
             │ ││  0x000001e844438ff1:   vmovq   rdx,xmm1
      0,01%  │ ││  0x000001e844438ff6:   mov     r11d,dword ptr [rsp]
             ↘ ││  0x000001e844438ffa:   mov     ecx,r10d
               ││  0x000001e844438ffd:   sub     ecx,ebx
               ││  0x000001e844438fff:   add     ecx,0fffffff9h
      0,00%    ││  0x000001e844439002:   mov     r9d,1f40h
               ││  0x000001e844439008:   cmp     r9d,ecx
               ││  0x000001e84443900b:   mov     edi,1f40h
               ││  0x000001e844439010:   cmovnle edi,ecx
      0,02%    ││  0x000001e844439013:   add     edi,ebx
               ││  0x000001e844439015:   vmovq   xmm0,r8
               ││  0x000001e84443901a:   vmovq   xmm1,rdx
               ││  0x000001e84443901f:   mov     dword ptr [rsp],r11d
      0,01%    │╰  0x000001e844439023:   jmp     1e844438f93h
               ↘   0x000001e844439028:   vmovq   rdx,xmm1
                   0x000001e84443902d:   cmp     ebx,r10d
                   0x000001e844439030:   jnl     1e844439043h
                   0x000001e844439032:   nop                                 ;*lload_2 {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
                   0x000001e844439034:   movsxd  r11,dword ptr [rax+rbx*4+10h]
                   0x000001e844439039:   add     r8,r11                      ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
                   0x000001e84443903c:   inc     ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
    ....................................................................................................
    
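    As we can see, the loop body is compiled similarly on both JDKs:
  • the loop is unrolled by 8 iterations;
  • there are 8 loads from the array without bounds checks, followed by 8 add instructions;
  • the order of the loads differs slightly, but all addresses share the same or adjacent cache lines anyway.

    The key difference is that on JDK 14 the loop iteration is split into two nested blocks. This is the result of the Loop strip mining optimization, which appeared in JDK 10. The idea of this optimization is to split a counted loop into a hot inner part without a safepoint poll and an outer part with a safepoint poll instruction.
    The C2 JIT transforms the loop into something like: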
        for (int i = 0; i < array.length; i += 8000) {
            for (int j = 0; j < 8000; j += 8) {
                int ix = i + j;
                int v0 = array[ix];
                int v1 = array[ix + 1];
                ...
                int v7 = array[ix + 7];
                sum += v0 + v1 + ... + v7;
            }
            safepoint_poll();
        }
    
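    Note that the JDK 8 version has no safepoint poll inside the counted loop at all. On one hand, this can make the loop run faster. On the other hand, this is actually bad for low-latency applications, because pause times can grow by the duration of the entire loop.
    JDK 14 inserts a safepoint poll inside the loop. This may be the reason for the slowdown you observe, although I am not really convinced, because thanks to the loop strip mining optimization the safepoint poll is executed only once per 8000 iterations.
    To verify this, you can disable safepoint polls with the -XX:-UseCountedLoopSafepoints JVM option. In this case, the code compiled by JDK 14 looks almost identical to that of JDK 8, and so do the benchmark scores.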
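To make the transformation concrete, here is a hand-written Java sketch of a strip-mined sum. It is only an analogy for what C2 does internally: the class and method names are hypothetical, the strip length of 8000 elements mirrors the `1f40h` constant visible in the JDK 14 disassembly, and a comment marks where the JIT-generated code would poll for a safepoint (plain Java code cannot emit such a poll itself). Both methods compute the same sum.

```java
import java.util.Random;

public class StripMiningSketch {

    // Elements processed per inner loop; 0x1f40 in the JDK 14 disassembly.
    static final int STRIP = 8000;

    // Straightforward counted loop, as written in the benchmark.
    static long plainSum(int[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i];
        }
        return sum;
    }

    // Same computation split into an outer loop over strips and a hot inner loop.
    static long stripMinedSum(int[] array) {
        long sum = 0;
        int start = 0;
        while (start < array.length) {
            // Compute the strip end in long arithmetic to avoid int overflow.
            int end = (int) Math.min((long) start + STRIP, array.length);
            for (int i = start; i < end; i++) { // hot part, no safepoint poll
                sum += array[i];
            }
            // The JIT-generated code polls for a safepoint here, once per strip.
            start = end;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_003, -1000, 1000).toArray();
        System.out.println(plainSum(data) == stripMinedSum(data));
    }
}
```

The point of the split is that the inner loop stays as tight as the JDK 8 version, while the time between two safepoint polls is bounded by one strip rather than by the whole array.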

    A similar question about "java - Decreased performance on newer JVMs" can be found on Stack Overflow: https://stackoverflow.com/questions/63524800/
