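I have a managed cluster hosted by elastic.co. Here is the configuration:
|Platform => Amazon Web Services
| |Memory => 4 GB
|
|Storage => 96 GB
| |SSD => Yes
| |High availability => Yes, 2 data centers
|
Each index in this cluster contains exactly one day of log data. The average index size is 15 MB and the average document count is 15,000. The cluster is not under any pressure (JVM, indexing and search times, and disk space are all well within comfortable ranges).
When I open a previously closed index, the cluster turns red. Here are some metrics I found when querying Elasticsearch.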
GET /_cluster/allocation/explain
{
"index": "some_index_name", # 1 Primary shard , 1 replica shard
"shard": 0,
"primary": true
}
Response:
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"failed_allocation_attempts": 3,
"details": "failed recovery, failure RecoveryFailedException[[some_index_name][0]: Recovery failed on {instance-*****}{Hash}{HASH}{IP}{IP}{logical_availability_zone=zone-1, availability_zone=***, region=***}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(mmapfs(/app/data/nodes/0/indices/MFIFAQO2R_ywstzqrfbY4w/0/index)): files: []]; ",
"last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",
"node_allocation_decisions": [
{
"node_name": "instance-***",
"node_decision": "no",
"store": {
"in_sync": false,
"allocation_id": "RANDOM_HASH",
"store_exception": {
"type": "index_not_found_exception",
"reason": "no segments* file found in SimpleFSDirectory@/app/data/nodes/0/indices/RANDOM_HASH/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@346e1b99: files: []"
}
}
},
{
"node_name": "instance-***",
"node_attributes": {
"logical_availability_zone": "zone-0"
},
"node_decision": "no",
"store": {
"found": false
}
}
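To make the explain output above easier to read, here is a minimal Python sketch that condenses an allocation explain response into one line per node. It only relies on the fields visible in the excerpt (`can_allocate`, `node_allocation_decisions`, `store`); the function name `summarize_allocation_explain` and the node names `instance-a` / `instance-b` are hypothetical stand-ins for the masked values.

```python
def summarize_allocation_explain(explain):
    """Return one readable line per node decision in an allocation explain response."""
    lines = ["can_allocate=" + str(explain.get("can_allocate"))]
    for decision in explain.get("node_allocation_decisions", []):
        store = decision.get("store", {})
        if store.get("found") is False:
            # The node holds no copy of the shard at all.
            status = "no shard copy on disk"
        elif "store_exception" in store:
            # A copy exists but could not be opened (e.g. no segments* file).
            status = "copy unreadable: " + store["store_exception"].get("type", "unknown")
        elif store.get("in_sync") is False:
            status = "stale copy"
        else:
            status = "usable copy"
        lines.append("{}: {} ({})".format(
            decision.get("node_name"), decision.get("node_decision"), status))
    return lines


# Hypothetical sample shaped like the excerpt above (masked values replaced).
sample_explain = {
    "can_allocate": "no_valid_shard_copy",
    "node_allocation_decisions": [
        {
            "node_name": "instance-a",
            "node_decision": "no",
            "store": {
                "in_sync": False,
                "allocation_id": "RANDOM_HASH",
                "store_exception": {
                    "type": "index_not_found_exception",
                    "reason": "no segments* file found",
                },
            },
        },
        {"node_name": "instance-b", "node_decision": "no", "store": {"found": False}},
    ],
}

for line in summarize_allocation_explain(sample_explain):
    print(line)
```

Read this way, the response says that one node has an empty shard directory and the other has no copy at all, which is why no valid primary can be found.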
I have already tried rerouting the shard to a node, even with the accept-data-loss flag set to true.
POST _cluster/reroute
{
"commands" : [
{"allocate_stale_primary" : {
"index" : "some_index_name", "shard" : 0,
"node" : "instance-***",
"accept_data_loss" : true
}
}
]
}
Response:
"acknowledged": true,
"state": {
"version": 338190,
"state_uuid": "RANDOM_HASH",
"master_node": "RANDOM_HASH",
"blocks": {
"indices": {
"restored_**": {
"4": {
"description": "index closed",
"retryable": false,
"levels": [
"read",
"write"
]
}
},
"restored_**": {
"4": {
"description": "index closed",
"retryable": false,
"levels": [
"read",
"write"
]
}
}
}
},
"routing_table": {
"indices": {
"SOME_INDEX_NAME": {
"shards": {
"0": [
{
"state": "INITIALIZING",
"primary": true,
"relocating_node": null,
"shard": 0,
"index": "SOME_INDEX_NAME",
"recovery_source": {
"type": "EXISTING_STORE"
},
"allocation_id": {
"id": "HASH"
},
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"failed_attempts": 4,
"delayed": false,
"details": "same as explanation above ^ ",
"allocation_status": "no_valid_shard_copy"
}
},
{
"state": "UNASSIGNED",
"primary": false,
"node": null,
"relocating_node": null,
"shard": 0,
"index": "some_index_name",
"recovery_source": {
"type": "PEER"
},
"unassigned_info": {
"reason": "INDEX_REOPENED",
"delayed": false,
"allocation_status": "no_attempt"
}
}
]
}
},
Any kind of suggestion is welcome. Thanks and regards.
Best answer
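This happens when the master node is shut down abruptly.
Here are the steps I took to resolve the same problem when I ran into it:
Step 1: Check the allocation
- curl -XGET 'http://localhost:9200/_cat/allocation?v'
Step 2: Check the shard stores
- curl -XGET 'http://localhost:9200/_shard_stores?pretty'
Note the "index", "shard", and "node" that report the error you posted. The error should be --> "no segments* file found in SimpleFSDirectory@/...."
Step 3: Now reroute that index as shown below
- curl -XPOST 'http://localhost:9200/_cluster/reroute?master_timeout=5m' -H 'Content-Type: application/json' -d '{"commands":[{"allocate_empty_primary":{"index":"IndexFromStep2","shard":ShardFromStep2,"node":"NodeFromStep2","accept_data_loss":true}}]}'
Step 4: Repeat steps 2 and 3 until you see this output.
- curl -XGET 'http://localhost:9200/_shard_stores?pretty'
{ "indices" : { } }
Your cluster should turn green soon.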
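Steps 2 and 3 can be sketched in Python as two small pure functions: one that scans a `_shard_stores` response for copies reporting a `store_exception`, and one that builds the matching reroute body. This is a sketch under stated assumptions, not the answer author's code: the helper names and the sample node `instance-a` are hypothetical, and the response layout (`indices` → `shards` → `stores`, with the node entry nested under its node ID) is assumed from the documented shard stores format. Keep in mind that `allocate_empty_primary` creates a new, empty primary, so any data that was in that shard is gone.

```python
import json


def find_broken_copies(shard_stores):
    """Return (index, shard, node_name) for each store copy with a store_exception."""
    broken = []
    for index_name, index_info in shard_stores.get("indices", {}).items():
        for shard_id, shard_info in index_info.get("shards", {}).items():
            for store in shard_info.get("stores", []):
                if "store_exception" not in store:
                    continue
                # Assumption: the node entry is the nested dict that carries "name".
                node_name = next(
                    (v["name"] for v in store.values()
                     if isinstance(v, dict) and "name" in v),
                    None,
                )
                broken.append((index_name, int(shard_id), node_name))
    return broken


def build_empty_primary_reroute(index, shard, node):
    """Build the JSON body for POST _cluster/reroute (step 3)."""
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,  # the shard's previous data is discarded
                }
            }
        ]
    }


# Hypothetical sample shaped like a _shard_stores response with one broken copy.
sample_stores = {
    "indices": {
        "some_index_name": {
            "shards": {
                "0": {
                    "stores": [
                        {
                            "node-id-1": {"name": "instance-a"},
                            "allocation_id": "RANDOM_HASH",
                            "allocation": "unused",
                            "store_exception": {
                                "type": "index_not_found_exception",
                                "reason": "no segments* file found",
                            },
                        }
                    ]
                }
            }
        }
    }
}

for index, shard, node in find_broken_copies(sample_stores):
    print(json.dumps(build_empty_primary_reroute(index, shard, node)))
```

An empty result from `find_broken_copies` corresponds to the `{ "indices" : { } }` output in step 4, which is the signal to stop repeating.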
Regarding elasticsearch - a healthy Elasticsearch cluster turns red after opening a closed index, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49005638/