java - JDK 17.0.1 ZGC 分配因快速分配速率而停滞

标签 java memory-leaks garbage-collection zgc

我们在 JDK 17.0.1 上使用 ZGC。应用程序成功运行一段时间,没有出现意外的 GC 停顿,之后它会进入在 GC 完成时 ZGC 不会释放太多内存的状态。从那时起,这会导致应用程序内出现多个周期性的分配停顿。

应用程序本身以大约 35 毫秒的周期间隔进行大量分配。峰值负载时的分配量约为 2GB/s。通常在早上开始时,大约有 30 分钟的负载较低,直到外部数据开始到达的时间点,此后负载较高。因此,当应用程序运行时,负载配置文件确实会发生变化。由于加载期间的可预测分配,我们使用 ZCollectionInterval 按时间间隔设置集合。

即使在负载非常相似的不同机器之间,我们也无法一致地重现分配停顿。奇怪的是,如果在负载已经很高时重新启动应用程序,那么它通常不会遇到问题。

我们之前也尝试过使用 UseDynamicNumberOfGCThreads,但似乎得出的线程数为 2。所附图表使用 SoftMaxHeapSize=1G。这有点激进,但我们确实在没有设置 SofMaxHeapSize 的情况下看到了同样的问题。 我们还注意到,GC 的并发标记阶段似乎增加了导致此问题的原因。在 17.0.5 之前的 JDK 版本的发行说明中,我们也没有看到任何有关 ZGC 的值得注意的内容。

有人遇到过类似的问题吗?我们不确定为什么会发生这种情况,或者这是否是设置标志时的预期行为。由于尚未设置的内容以及它发生的罕见性,我们希望有人能够提供一些指示,以确定这实际上是现有修复程序的已知问题,还是我们只是设置错误。

这些是相关标志。

-XX:-OmitStackTraceInFastThrow
-XX:CICompilerCount=2
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=4
-XX:ConcGCThreads=4
-Xmx8192m
-Xms8192m
-Xss8m
-Xlog:gc*=info:file=${LT_APP_LOG_PATH_FORMAT}.gc:time,level,tags:filecount=32,filesize=64M
-XX:+UseZGC
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:+UseTransparentHugePages
-XX:ZCollectionInterval=4
-Djdk.nio.maxCachedBufferSize=262144
-XX:+HeapDumpOnOutOfMemoryError
-XX:SoftMaxHeapSize=1G
[2022-11-24T09:21:12.378+0530][info][gc,phases   ] GC(669) Concurrent Select Relocation Set 8.713ms
[2022-11-24T09:21:12.378+0530][info][gc,phases   ] GC(669) Pause Relocate Start 0.008ms
[2022-11-24T09:21:12.420+0530][info][gc,phases   ] GC(669) Concurrent Relocate 41.564ms
[2022-11-24T09:21:12.420+0530][info][gc,load     ] GC(669) Load: 36.38/31.86/25.10
[2022-11-24T09:21:12.420+0530][info][gc,mmu      ] GC(669) MMU: 2ms/99.1%, 5ms/99.6%, 10ms/99.8%, 20ms/99.9%, 50ms/99.9%, 100ms/100.0%
[2022-11-24T09:21:12.420+0530][info][gc,marking  ] GC(669) Mark: 4 stripe(s), 2 proactive flush(es), 1 terminate flush(es), 0 completion(s), 0 continuation(s) 
[2022-11-24T09:21:12.420+0530][info][gc,marking  ] GC(669) Mark Stack Usage: 32M
[2022-11-24T09:21:12.420+0530][info][gc,nmethod  ] GC(669) NMethods: 10353 registered, 328 unregistered
[2022-11-24T09:21:12.420+0530][info][gc,metaspace] GC(669) Metaspace: 44M used, 46M committed, 1072M reserved
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Soft: 2311 encountered, 98 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Weak: 2559 encountered, 652 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Final: 56 encountered, 7 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Phantom: 27425 encountered, 27313 discovered, 27030 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Small Pages: 5764 / 11528M, Empty: 4752M, Relocated: 58M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Medium Pages: 3 / 96M, Empty: 64M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Large Pages: 0 / 0M, Empty: 0M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Forwarding Usage: 29M
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Min Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Max Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Soft Max Capacity: 1024M(6%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)                Mark Start          Mark End        Relocate Start      Relocate End           High               Low         
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)  Capacity:    16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)   
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Free:     4760M (29%)           0M (0%)         4788M (29%)        5392M (33%)        5394M (33%)           0M (0%)     
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Used:    11624M (71%)       16384M (100%)      11596M (71%)       10992M (67%)       16384M (100%)      10990M (67%)    
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Live:         -              5860M (36%)        5860M (36%)        5860M (36%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Allocated:         -              4760M (29%)        4788M (29%)        4858M (30%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)   Garbage:         -              5763M (35%)         947M (6%)          273M (2%)             -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Reclaimed:         -                  -              4816M (29%)        5490M (34%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc          ] GC(669) Garbage Collection (Allocation Stall) 11624M(71%)->10992M(67%)
[2022-11-24T09:21:12.479+0530][info][gc,start    ] GC(670) Garbage Collection (Allocation Stall)
[2022-11-24T09:21:12.479+0530][info][gc,ref      ] GC(670) Clearing All SoftReferences
[2022-11-24T09:21:12.479+0530][info][gc,task     ] GC(670) Using 4 workers
[2022-11-24T09:21:12.479+0530][info][gc,phases   ] GC(670) Pause Mark Start 0.009ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ] === Garbage Collection Statistics =======================================================================================================================
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]                                                              Last 10s              Last 10m              Last 10h                Total
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]                                                              Avg / Max             Avg / Max             Avg / Max             Avg / Max
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]   Collector: Garbage Collection Cycle                   3410.686 / 3486.475    844.670 / 3486.475    464.952 / 3486.475    464.952 / 3486.475    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]  Contention: Mark Segment Reset Contention                     7 / 15                3 / 30                1 / 30                1 / 30          ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]  Contention: Mark SeqNum Reset Contention                      0 / 1                 0 / 3                 0 / 3                 0 / 3           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Allocation Stall                                  1 / 5                 0 / 7                 0 / 7                 0 / 7           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Allocation Stall                            677.602 / 900.858     303.367 / 900.858     303.367 / 900.858     303.367 / 900.858     ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: GC Locker Stall                                   0 / 0                 0 / 0                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: GC Locker Stall                               0.000 / 0.000         0.000 / 0.000         0.033 / 0.058         0.033 / 0.058       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Relocation Stall                                  0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Relocation Stall                              0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Allocation Rate                                1479 / 1860           1003 / 2264            329 / 2264            329 / 2264        MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Out Of Memory                                     0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Flush                                  0 / 0                 0 / 32                0 / 32                0 / 32          MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L1                               198 / 827             177 / 901              59 / 901              59 / 901         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L2                               541 / 930             316 / 1052            103 / 1052            103 / 1052        ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L3                                 1 / 15               12 / 657               4 / 657               4 / 657         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Miss                                   0 / 0                 0 / 1                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Uncommit                                          0 / 0                 0 / 0                 0 / 0                 0 / 0           MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Object Allocation Failed                     0 / 0                 0 / 1                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Object Allocation Succeeded                  0 / 3                 4 / 855               1 / 855               1 / 855         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Page Allocation                              0 / 3                 0 / 4                 0 / 4                 0 / 4           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark                            3356.120 / 3437.684    812.574 / 3437.684    445.012 / 3437.684    445.012 / 3437.684    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark Continue                      0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark Free                          0.001 / 0.001         0.001 / 0.002         0.001 / 0.002         0.001 / 0.002       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Process Non-Strong References      8.650 / 8.848        12.746 / 47.124        8.067 / 47.124        8.067 / 47.124      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Relocate                          36.550 / 41.564       14.436 / 55.995        8.357 / 55.995        8.357 / 55.995      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Reset Relocation Set               0.067 / 0.080         0.023 / 0.080         0.013 / 0.080         0.013 / 0.080       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Select Relocation Set              7.214 / 8.713         3.382 / 29.365        2.579 / 29.365        2.579 / 29.365      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Mark End                                0.011 / 0.011         0.011 / 0.018         0.009 / 0.018         0.009 / 0.018       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Mark Start                              0.008 / 0.009         0.008 / 0.016         0.008 / 0.016         0.008 / 0.016       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Relocate Start                          0.008 / 0.008         0.007 / 0.013         0.006 / 0.013         0.006 / 0.013       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Classes Purge                      0.266 / 0.497         0.411 / 5.255         0.251 / 5.255         0.251 / 5.255       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Classes Unlink                     5.018 / 5.358         9.216 / 28.725        5.991 / 28.725        5.991 / 28.725      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark                            3352.212 / 3437.123    855.661 / 3437.123    454.341 / 3437.123    454.341 / 3437.123    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark Try Flush                     0.030 / 0.055         0.760 / 42.368        0.442 / 42.368        0.442 / 42.368      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark Try Terminate                 0.703 / 1.059         1.393 / 27.549        1.126 / 27.549        1.126 / 27.549      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent References Enqueue                 0.003 / 0.004         0.003 / 0.013         0.002 / 0.013         0.002 / 0.013       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent References Process                 3.055 / 3.524         1.183 / 9.454         0.655 / 9.454         0.655 / 9.454       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots ClassLoaderDataGraph         0.090 / 0.223         0.135 / 12.400        0.105 / 12.400        0.105 / 12.400      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots CodeCache                    0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots JavaThreads                  0.411 / 1.162         0.520 / 12.391        0.407 / 12.391        0.407 / 12.391      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots OopStorageSet                0.018 / 0.038         0.051 / 12.043        0.030 / 12.043        0.030 / 12.043      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Weak Roots OopStorageSet           0.259 / 0.263         0.449 / 4.890         0.331 / 4.890         0.331 / 4.890       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Pause Mark Try Complete                       0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      System: Java Threads                                     28 / 28               26 / 29               21 / 29               21 / 29          threads
[2022-11-24T09:21:13.180+0530][info][gc,stats    ] =========================================================================================================================================================
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Concurrent Mark 3465.071ms
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Pause Mark End 0.011ms
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Concurrent Mark Free 0.001ms
[2022-11-24T09:21:15.953+0530][info][gc,phases   ] GC(670) Concurrent Process Non-Strong References 8.537ms
[2022-11-24T09:21:15.953+0530][info][gc,phases   ] GC(670) Concurrent Reset Relocation Set 0.062ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-primary-pool-2) 704.524ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-primary-pool-0) 704.168ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-secondary-pool-0) 711.668ms
[2022-11-24T09:21:15.955+0530][info][gc          ] Allocation Stall (lt-primary-pool-1) 707.109ms
[2022-11-24T09:21:15.955+0530][info][gc          ] Allocation Stall (lt-secondary-pool-1) 713.196ms
[2022-11-24T09:21:15.957+0530][info][gc,phases   ] GC(670) Concurrent Select Relocation Set 4.082ms

java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)

使用 GCViewer 在 GC 日志上创建的图表。

ZGC 清理问题 ZGC clean up issues

一切正常时分配没有问题 No issues allocation when everything is fine

最佳答案

我们没有找到问题的原因,但是,我们在更仔细地检查问题方面确实取得了一些进展。

在 github 上检查 OpenJDK 的 ZGC 代码时。

https://github.com/openjdk/zgc/blob/1f8813495e8184b6c38319df1c2cb70de7811a76/src/hotspot/share/gc/z/zDirector.cpp#L184

它有计算分配率的代码。我们还启用了 ZGC 的调试日志记录。

-Xlog:gc*=debug:file=gc.log:time,level,tags

类似于下面最后几行的日志记录预测应用程序启动的较长时间内分配率为 0。我们怀疑这可能是问题的原因之一。然而,在查看 GCGraphs 时,它并没有完全解释为什么在正常功能期间有足够的内存时它会进入某种 panic 状态。

我们设置-XX:ZAllocationSpikeTolerance=10确保估计的内存使用量接近我们正常的分配率 2000MB/s。从那时起,ZGC 的预测内存使用情况似乎是正确的。我们已经在我们的服务上运行此设置一段时间了,到目前为止没有出现任何问题。我们还删除了 -XX:SoftMaxHeapSize=因为所需的 GC 频率似乎仅通过 XX:ZAllocationSpikeTolerance 即可正确估计。 .

预测分配率的调试日志记录示例。

[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: Allocation Rate (Dynamic GC Workers), MaxAllocRate: 9624.6MB/s (+/-0.6%), Free: 3412MB, GCCPUTime: 0.136, GCDuration: 0.136s, TimeUntilOOM: 0.352s, TimeUntilGC: 0.116s, GCWorkers: 1 -> 1
[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: High Usage, Free: 3412MB(83.3%)
[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: Proactive, AcceptableGCInterval: 3.394s, TimeSinceLastGC: 0.473s, TimeUntilGC: 2.921s
[2022-12-02T12:45:29.925+0530][debug][gc,alloc       ] Allocation Rate: 958.0MB/s, Predicted: 960.9MB/s, Avg: 954.4(+/-5.7)MB/s
[2022-12-02T12:45:29.925+0530][debug][gc,director    ] Rule: Timer, Interval: 4.000s, TimeUntilGC: 3.427s
[2022-12-02T12:45:29.925+0530][debug][gc,director    ] Select GC Workers (Normal), AvoidLongGCWorkers: 0.013, AvoidOOMGCWorkers: 0.393, LastGCWorkers: 1.000, GCWorkers: 0.393

关于java - JDK 17.0.1 ZGC 分配因快速分配速率而停滞,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/74563986/

相关文章:

java - 从命令提示符编译 J2ME

C++ odbc 关闭记录集泄漏内存

javascript - three.js 和许多形状的内存泄漏

c - Valgrind 链表内存泄漏

java - 如何正则表达式匹配排除某些字符?

java - 在避免大量工厂的同时避免“泄漏”

java - CP1252转UTF-8编码后的特殊字符

C# Action 、闭包和垃圾收集

java - Elasticsearch 搜索 thread_pool 的拒绝

java - 如何从G1 gc日志中提取关键信息