
Elasticsearch: handling the "failed shard on node [xxxxxx]: failed recovery" error


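Today one node in our ES cluster went down and the cluster status immediately turned red. After restarting the node and waiting a long time, a few shards still would not recover. I ran the following command to find out why: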


curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
{
"index" : "twitter",
"shard" : 0,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2018-11-06T06:11:15.562Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp"))]; ",
"last_allocation_status" : "no"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
"node_allocation_decisions" : [
{
"node_id" : "CxXWE8BiQbS4ThB9AvvGQA",
"node_name" : "node-1",
"transport_address" : "10.142.0.2:9300",
"node_decision" : "no",
"store" : {
"in_sync" : true,
"allocation_id" : "gxegPAMyQa21MH5NxQEACw"
},
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-11-06T06:11:15.562Z], failed_attempts[5], delayed=false, details[failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp"))]; ], allocation_status[deciders_no]]]"
}
]
}
]
}

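Cause: recovery of the shard on that node failed 5 times, which is the retry limit reported by the max_retry decider in the output. After that, Elasticsearch stops trying to allocate the shard, so it stays unassigned.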
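To check for this condition without reading the whole response by eye, the decider list can be scanned for a max_retry entry that answered NO. The following is a minimal Python sketch, not something from the original troubleshooting session: the function name `nodes_blocked_by_max_retry` is hypothetical, and `sample` is a trimmed copy of the response shown above.

```python
def nodes_blocked_by_max_retry(explain):
    """Return the names of nodes where the max_retry decider refused allocation.

    `explain` is the parsed JSON body of GET /_cluster/allocation/explain.
    """
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decider") == "max_retry" and decider.get("decision") == "NO":
                blocked.append(node.get("node_name"))
                break  # one hit per node is enough
    return blocked


# Trimmed copy of the response above (long strings omitted).
sample = {
    "index": "twitter",
    "shard": 0,
    "primary": True,
    "current_state": "unassigned",
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_id": "CxXWE8BiQbS4ThB9AvvGQA",
            "node_name": "node-1",
            "node_decision": "no",
            "deciders": [
                {"decider": "max_retry", "decision": "NO"},
            ],
        }
    ],
}

print(nodes_blocked_by_max_retry(sample))
```

If the list is non-empty, the shard is being held back by the retry limit and has not been given up on for any other reason.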


Solution:


POST /_cluster/reroute?retry_failed=true

This retries allocation of the failed shards; a short while later the cluster was back to green.
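The retry and the wait for green can also be scripted. The sketch below is one possible way to do it with Python's standard library, assuming the cluster is reachable without authentication at the given base URL. The names `http_json` and `retry_and_wait` are hypothetical; the two endpoints are the reroute call above and the cluster health API with its `wait_for_status` and `timeout` parameters.

```python
import json
import urllib.request


def http_json(method, url):
    """Send a request with an empty body and return the parsed JSON response."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def retry_and_wait(base_url="http://localhost:9200", timeout="60s", send=http_json):
    """Retry failed shard allocations, then wait for the cluster to turn green.

    Returns the status string from the health response
    ("green", "yellow" or "red").
    """
    # Same call as the POST above: retry shards blocked by too many failures.
    send("POST", f"{base_url}/_cluster/reroute?retry_failed=true")
    # The health API waits server-side until the status is reached or the
    # timeout expires. Assumption: depending on the Elasticsearch version,
    # a timeout may surface as an HTTP error raised by urlopen.
    health = send(
        "GET",
        f"{base_url}/_cluster/health?wait_for_status=green&timeout={timeout}",
    )
    return health["status"]
```

Calling `retry_and_wait()` against a local node would return "green" once recovery succeeds. The `send` parameter exists only so the HTTP layer can be swapped out, for example in a test.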


Reference: https://discuss.elastic.co/t/failed-shard-after-ooming-corrupt-index/155612


