我有一个有 6 个节点的 Elasticsearch 集群。堆大小设置为 50GB。(我知道建议小于 32,但由于某种我不知道的原因,它已经设置为 50GB)。现在我看到很多来自搜索 thread_pool 的拒绝。
[2020-10-26T12:50:32,730][DEBUG][o.e.a.s.TransportSearchAction] [IDC.node1.com] [indice_1][3], node[M4pYcHhzSTIj5NXA], [P], s[STARTED], a[id=IAQaCGTCTA-CkKGPw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[indice_1], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[_doc], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [IDC.node1.com][node1:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@21060fcc on QueueResizingEsThreadPoolExecutor[name = IDC.node1.com/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.2s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@3c46a309[Running, pool size = 49, active threads = 49, queued tasks = 1063, completed tasks = 1693218669]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-7.4.2.jar:7.4.2]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:84) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$8(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.shard.IndexShard.awaitShardSearchActive(IndexShard.java:3247) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$9(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:114) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:87) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.rewriteShardRequest(SearchService.java:1048) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$6(SearchTransportService.java:335) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:185) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
我还可以在 gc 日志中不断发生“Pause Young(分配失败)”。
[2020-10-26T07:31:06.514+0000][61585][gc,start ] GC(7139651) Pause Young (Allocation Failure)
[2020-10-26T07:31:06.514+0000][61585][gc,task ] GC(7139651) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 1: 40802096 bytes, 40802096 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 2: 16714352 bytes, 57516448 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 3: 8509976 bytes, 66026424 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 4: 6906208 bytes, 72932632 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 5: 4186344 bytes, 77118976 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 6: 1960288 bytes, 79079264 total
[2020-10-26T07:31:06.574+0000][61585][gc,heap ] GC(7139651) ParNew: 1631431K->102498K(1763584K)
[2020-10-26T07:31:06.574+0000][61585][gc,heap ] GC(7139651) CMS: 35964431K->35966764K(50469312K)
[2020-10-26T07:31:06.574+0000][61585][gc,metaspace ] GC(7139651) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:06.574+0000][61585][gc ] GC(7139651) Pause Young (Allocation Failure) 36714M->35223M(51008M) 59.928ms
[2020-10-26T07:31:06.574+0000][61585][gc,cpu ] GC(7139651) User=1.02s Sys=0.00s Real=0.06s
[2020-10-26T07:31:06.574+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 675463951 ns, Reaching safepoint: 1382982 ns, At safepoint: 60208788 ns, Total: 61591770 ns
[2020-10-26T07:31:07.575+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000129984 ns, Reaching safepoint: 901989 ns, At safepoint: 39274 ns, Total: 941263 ns
[2020-10-26T07:31:08.529+0000][61585][gc,start ] GC(7139652) Pause Young (Allocation Failure)
[2020-10-26T07:31:08.529+0000][61585][gc,task ] GC(7139652) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) - age 1: 17722072 bytes, 17722072 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 2: 15204416 bytes, 32926488 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 3: 16712976 bytes, 49639464 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 4: 8509664 bytes, 58149128 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 5: 6501424 bytes, 64650552 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 6: 3047280 bytes, 67697832 total
[2020-10-26T07:31:08.585+0000][61585][gc,heap ] GC(7139652) ParNew: 1670178K->89020K(1763584K)
[2020-10-26T07:31:08.585+0000][61585][gc,heap ] GC(7139652) CMS: 35966764K->35968680K(50469312K)
[2020-10-26T07:31:08.585+0000][61585][gc,metaspace ] GC(7139652) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:08.585+0000][61585][gc ] GC(7139652) Pause Young (Allocation Failure) 36754M->35212M(51008M) 56.177ms
[2020-10-26T07:31:08.585+0000][61585][gc,cpu ] GC(7139652) User=0.87s Sys=0.00s Real=0.05s
[2020-10-26T07:31:08.585+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 948848143 ns, Reaching safepoint: 4487353 ns, At safepoint: 56505105 ns, Total: 60992458 ns
[2020-10-26T07:31:09.586+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000188782 ns, Reaching safepoint: 807865 ns, At safepoint: 31141 ns, Total: 839006 ns
[2020-10-26T07:31:10.587+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000218955 ns, Reaching safepoint: 947598 ns, At safepoint: 40468 ns, Total: 988066 ns
[2020-10-26T07:31:11.588+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000252002 ns, Reaching safepoint: 688820 ns, At safepoint: 24744 ns, Total: 713564 ns
[2020-10-26T07:31:12.590+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000168389 ns, Reaching safepoint: 978766 ns, At safepoint: 30805 ns, Total: 1009571 ns
[2020-10-26T07:31:13.497+0000][61585][gc,start ] GC(7139653) Pause Young (Allocation Failure)
[2020-10-26T07:31:13.498+0000][61585][gc,task ] GC(7139653) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 1: 11416792 bytes, 11416792 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 2: 4119704 bytes, 15536496 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 3: 15256408 bytes, 30792904 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 4: 17073416 bytes, 47866320 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 5: 8509616 bytes, 56375936 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 6: 5802544 bytes, 62178480 total
[2020-10-26T07:31:13.560+0000][61585][gc,heap ] GC(7139653) ParNew: 1656700K->91797K(1763584K)
[2020-10-26T07:31:13.560+0000][61585][gc,heap ] GC(7139653) CMS: 35968680K->35971272K(50469312K)
[2020-10-26T07:31:13.560+0000][61585][gc,metaspace ] GC(7139653) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:13.561+0000][61585][gc ] GC(7139653) Pause Young (Allocation Failure) 36743M->35217M(51008M) 63.001ms
[2020-10-26T07:31:13.561+0000][61585][gc,cpu ] GC(7139653) User=0.99s Sys=0.00s Real=0.07s
[2020-10-26T07:31:13.561+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 906329704 ns, Reaching safepoint: 1247593 ns, At safepoint: 63668517 ns, Total: 64916110 ns
这种情况应该经常发生吗?以下是堆池设置:
"pools" : {
"young" : {
"used_in_bytes" : 731499608,
"max_in_bytes" : 1605304320,
"peak_used_in_bytes" : 1605304320,
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used_in_bytes" : 187117000,
"max_in_bytes" : 200605696,
"peak_used_in_bytes" : 200605696,
"peak_max_in_bytes" : 200605696
},
"old" : {
"used_in_bytes" : 16461553480,
"max_in_bytes" : 51680575488,
"peak_used_in_bytes" : 50144794688,
"peak_max_in_bytes" : 51680575488
}
拒绝是因为堆大小还是导致 thread_pool 较低。为什么“Pause Young”发生得这么快?
更新 此更新是作为另一个问题提出的:Search thread_pool for particular nodes always at maximum
最佳答案
搜索线程池有一个关联的队列,搜索工作线程在该队列上工作,其默认容量为 1000(注意每个节点),在您的情况下会消耗该容量,请参阅 thread_pools在 ES 中了解更多信息。
现在,由于以下原因之一,队列容量可能会耗尽:
- ES 节点上的搜索请求率很高,您的节点可能无法跟上。
- 一些缓慢的查询需要花费更多时间来处理,并导致其他请求在搜索队列中等待,最终导致拒绝。
- 您的集群中的节点和分片未正确平衡,这可能会导致少数节点接收大量搜索查询。
您应该查看搜索慢速日志,以过滤问题期间成本高昂的查询。
如果搜索流量合理增加,您应该扩展集群以处理更多搜索请求。
关于年轻堆日志,如果它没有花费更多时间并且低于 50-70 毫秒,那么它并不是那么大的问题,您应该专注于修复搜索拒绝问题,这可能会解决或降低堆问题,并且在任何情况下为了获得更好的性能,最大堆大小应低于 32 GB。
关于java - Elasticsearch 搜索 thread_pool 的拒绝,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/64533043/