elasticsearch - Restarting a 2-node Elasticsearch cluster with zero downtime

We have been running a 2-node ES cluster (currently on 1.4.1) with all default settings plus the following overrides:

config.cluster.name = "...";
config.discovery.zen.ping_timeout = "5s";
config.discovery.zen.ping.multicast.enabled = false;
config.discovery.zen.ping.unicast.hosts = ["IP1", "IP2"];

Recently we started noticing that when we shut down each node with a request to http://127.0.0.1:9200/_cluster/nodes/_local/_shutdown, the cluster becomes unresponsive for 30 seconds.

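When the master node is explicitly shut down, the other node does not appear to take over the master role immediately. Instead, it keeps retrying until a 30-second timeout (the default discovery.zen.fd.ping_timeout) expires.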

During this time the cluster has a 'no master' block, and requests to the node's root endpoint return a 503:

{
"status" : 503,
"name" : "...",
"cluster_name" : "...",
"version" : {
"number" : "1.4.1",
"build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
"build_timestamp" : "2014-11-26T15:49:29Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

The block levels are ["write", "metadata"].
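To put a number on the outage during a restart, one option is to poll the root endpoint until it stops returning 503. The following is a minimal sketch, not part of the original setup: the helper names `http_status` and `measure_unavailability` are hypothetical, and it assumes the node answers plain HTTP on the address used in the question.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers, e.g. 503 while the 'no master' block is active.
        return err.code
    except (urllib.error.URLError, OSError):
        # Node is down or not accepting connections.
        return None


def measure_unavailability(get_status, max_wait=120.0, interval=0.5,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll until get_status() returns 200 and return the seconds spent waiting.

    Returns None if no 200 was seen within max_wait seconds.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while clock() - start <= max_wait:
        if get_status() == 200:
            return clock() - start
        sleep(interval)
    return None
```

A possible use is to call `measure_unavailability(lambda: http_status("http://127.0.0.1:9200/"))` on the surviving node right after sending the shutdown request to the master.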

You can see this in the logs:

[2014-12-04 17:46:16,000][INFO ][discovery.zen            ] [NODE_1] master_left [[NODE_0][VXtqWIw2Q2C9b5UHvWlZyQ][RD000D3A1024B8][inet[/100.72.14.37:9300]]], reason [shut_down]
[2014-12-04 17:46:16,012][WARN ][discovery.zen            ] [NODE_1] master left (reason = shut_down), current nodes: {[NODE_1][WoVynRBhQvSwvxNp1nj8kw][RD000D3A109006][inet[/100.78.140.38:9300]],}
[2014-12-04 17:46:16,012][INFO ][cluster.service          ] [NODE_1] removed {[NODE_0][VXtqWIw2Q2C9b5UHvWlZyQ][RD000D3A1024B8][inet[/100.72.14.37:9300]],}, reason: zen-disco-master_failed ([NODE_0][VXtqWIw2Q2C9b5UHvWlZyQ][RD000D3A1024B8][inet[/100.72.14.37:9300]])
[2014-12-04 17:46:16,497][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:18,358][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:19,508][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:20,384][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:21,150][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:21,915][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:22,540][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:23,384][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:23,900][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:24,572][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:26,794][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:27,783][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:28,441][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:29,330][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:30,393][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:31,264][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:31,905][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:32,608][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:35,572][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:36,529][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:37,295][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:37,911][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:38,661][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:39,411][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:40,032][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:40,643][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:41,505][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:41,927][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:42,630][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:43,380][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:44,193][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:44,963][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:45,824][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:46,511][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:46,574][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:47,278][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:48,028][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:48,373][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:48,811][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:49,530][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:49,530][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:50,155][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:50,405][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:51,030][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:51,170][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:51,889][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:51,938][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:52,530][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:52,561][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:53,406][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:53,596][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:53,908][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:54,353][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:54,587][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:55,056][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:55,712][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:56,322][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:56,806][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:56,962][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:57,791][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:57,806][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:58,228][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:58,447][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:59,088][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:59,355][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:59,775][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:47:00,400][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:47:00,400][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:47:00,619][INFO ][cluster.service          ] [NODE_1] new_master [NODE_1][WoVynRBhQvSwvxNp1nj8kw][RD000D3A109006][inet[/100.78.140.38:9300]], reason: zen-disco-join (elected_as_master)
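The length of the masterless window can be read directly from log lines like these. Below is a rough sketch that assumes the timestamp format shown in the excerpt and simply looks for the `master_left` and `new_master` markers; the function name `masterless_seconds` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the leading "[2014-12-04 17:46:16,000]" timestamp of a log line.
_TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def _timestamp(line):
    """Parse the leading timestamp of a log line, or return None."""
    match = _TS.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def masterless_seconds(lines):
    """Seconds between the first 'master_left' line and the next 'new_master' line.

    Returns None if either marker is missing.
    """
    left = None
    for line in lines:
        ts = _timestamp(line)
        if ts is None:
            continue
        if left is None and "master_left" in line:
            left = ts
        elif left is not None and "new_master" in line:
            return (ts - left).total_seconds()
    return None
```

In the excerpt above, `master_left` is logged at 17:46:16,000 and `new_master` at 17:47:00,619, so this sketch would report a gap of 44.619 seconds for that run.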

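How can we force the current node to give up its master role during the shutdown command, so that the other node can take over immediately and avoid the 30-second 'no master' block outage? We have tried various "transient" cluster settings update calls to force an immediate election, but to no avail.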

Best answer

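You need to set discovery.zen.rejoin_on_master_gone to false in your settings file.
This value defaults to true, which means that when the master leaves, a node restarts the process of joining a cluster instead of electing a master. If you set it to false, the node elects itself as master as soon as it notices that the old master is gone (assuming your discovery.zen.minimum_master_nodes setting is not 2).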
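The condition on discovery.zen.minimum_master_nodes is easy to check with a little arithmetic. The sketch below is a simplified model for illustration only, not Elasticsearch's election code, and both function names are hypothetical. It uses the usual majority formula, floor(n / 2) + 1, and the fact that a node can only elect a master when it sees at least minimum_master_nodes master-eligible nodes, itself included.

```python
def quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_master_eligible, minimum_master_nodes):
    """True if enough master-eligible nodes (including this one) are visible."""
    return visible_master_eligible >= minimum_master_nodes


# In a 2-node cluster, the survivor sees only itself after the master shuts down.
print(can_elect_master(1, minimum_master_nodes=1))  # default setting of 1
print(can_elect_master(1, minimum_master_nodes=2))  # majority setting for 2 nodes
print(quorum(2), quorum(3))
```

With two master-eligible nodes the majority is 2, so a 2-node cluster has to choose between split-brain protection and the ability of one node to carry on alone. With three nodes the majority is still 2, which allows one node to be restarted at a time.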

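Edited to add a caveat: this setting was added in Elasticsearch 1.4 as part of the fixes intended to help prevent split brain. If your cluster is prone to split brain (or has experienced it before), you will want to leave this setting at its default.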

Source for this question on restarting a 2-node Elasticsearch cluster with zero downtime: Stack Overflow, https://stackoverflow.com/questions/27345508/
