Maximum Shards in ES 7

Introduction

So everything is working fine when suddenly all your pretty graphs just hit a brick wall. Browsing the same index patterns in Discover shows the same thing:

[Screenshot: the graphs flatline, with no new data after September 18]

A quick check of filebeat shows logs being sent, and logstash is still happily parsing away. So what gives?

Diagnosis

Digging into the Logstash logs, however, there is a curious warning:

[2019-09-21T03:11:05,419][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"syslog-2019.09.21", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x7b904a4f>], :response=>{"index"=>{"_index"=>"syslog-2019.09.21", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [999]/[1000] maximum shards open;"}}}}

The key here is that I’m hitting the limit of 1000 shards. A bit of Google-fu tells me this limit is new in ES7, which might be why I never encountered it in ES6. That older system also indexed monthly rather than daily, as I’m doing in my home lab.

(For reference, this behavior is defined in the output section of your logstash pipeline)

index => "%{type}-%{+YYYY.MM.dd}"
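
For context, a stripped-down elasticsearch output block using that daily index pattern would look something like the sketch below (the host is a placeholder, not my actual setup):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # one index per type per day -- this is what quietly piles up shards
    index => "%{type}-%{+YYYY.MM.dd}"
  }
}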

A quick check of the index stats confirms I’m at the limit:

GET /_stats

{
  "_shards" : {
    "total" : 999,
    "successful" : 499,
    "failed" : 0
  },

An interesting note on the above: the successful count is roughly half of the total. More on why that is later…
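
Another quick way to eyeball the open shard count is to count the rows returned by the _cat/shards API, which lists every shard, assigned or not:

curl -s localhost:9200/_cat/shards | wc -l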

Workaround

There are two things you could do in this situation.

  1. Delete your old indices if you no longer need them. The drawback is that you will hit the issue again at some point if you don’t figure out why you reached the limit in the first place. Should you index monthly instead of daily? Reduce the number of shards and replicas per index? It depends on your needs. (See the example after this list.)
  2. Increase the “max_shards_per_node” setting above 1000. The drawback is a potential performance hit during replication between nodes in a cluster, as you are increasing the amount of work required to keep the cluster healthy. You could still end up hitting the newly increased upper limit, or end up with a train wreck of a cluster suffering from poor performance and replication failures.
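
If you go the deletion route, a wildcard delete against old daily indices does the trick. The pattern below is just an example matching my naming scheme; always double-check what a wildcard matches before deleting:

curl -X DELETE "localhost:9200/syslog-2019.06.*"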

Since this is a home lab and I wasn’t concerned with keeping data from three months ago, I dropped the old indices. As soon as I did that, things started working again.

As this is a single node and not a cluster, the above concerns about increasing “max_shards_per_node” don’t apply, so I bumped the value up to 2000. Apparently this can be defined in elasticsearch.yml, but in ES7 there is a bug where the “cluster.max_shards_per_node” setting in elasticsearch.yml is not read. I suspect this is because the same value exists as a default in the Logstash and Kibana settings.
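
For reference, the elasticsearch.yml form of that setting would be the single line below, though as noted it doesn’t appear to be read in ES7:

cluster.max_shards_per_node: 2000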

To work around that, setting it dynamically via the cluster settings API works. Note that I’m applying it as a transient setting here, which will not survive a full cluster restart:

[root@elastic elasticsearch]# curl -X PUT localhost:9200/_cluster/settings -H 'Content-type: application/json' --data-binary $'{"transient":{"cluster.max_shards_per_node":2000}}'
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"max_shards_per_node":"2000"}}}
[root@elastic elasticsearch]# curl -X GET "localhost:9200/_cluster/settings?pretty"
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "max_shards_per_node" : "2000"
    }
  }
}
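
If you want the new limit to survive a full cluster restart, the same call with “persistent” in place of “transient” should do it; I have only used the transient form here:

curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' --data-binary $'{"persistent":{"cluster.max_shards_per_node":2000}}'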

Looking at the index stats again, things are better. However, I still see the “successful” count sitting at about half the total…

[root@elastic elasticsearch]# curl -X GET "localhost:9200/_stats?pretty"

{
  "_shards" : {
    "total" : 289,
    "successful" : 146,
    "failed" : 0
  },

Resolution

Looking at the overall cluster health, it becomes clearer why not all the shards are successful. Roughly half of them are unassigned:

[root@elastic elasticsearch]# curl -X GET "localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "elastic1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 146,
  "active_shards" : 146,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 143,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.51903114186851
}

Why would a shard not be assigned? A replica won’t be assigned to the node that already holds its primary shard, because that would just be duplicated data taking up space, and since this is a single-node setup there is no other node for it to go to.

[root@elastic elasticsearch]# curl -X GET "localhost:9200/*/_settings?pretty" |egrep '("number_of_shards"|"number_of_replicas")' |sort |uniq -c

3 "number_of_replicas" : "0",
105 "number_of_replicas" : "1",
89 "number_of_shards" : "1",
19 "number_of_shards" : "3",
[root@elastic elasticsearch]#
[root@elastic elasticsearch]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED

elastiflow-3.5.0-2019.09.16 1 r UNASSIGNED INDEX_CREATED
elastiflow-3.5.0-2019.09.16 2 r UNASSIGNED INDEX_CREATED
elastiflow-3.5.0-2019.09.16 0 r UNASSIGNED INDEX_CREATED
dnsmasq-2019.09.07 0 r UNASSIGNED INDEX_CREATED
syslog-2019.09.15 0 r UNASSIGNED INDEX_CREATED
polling-2019.09.12 0 r UNASSIGNED INDEX_CREATED
snmp-polling-2019.09.08 0 r UNASSIGNED INDEX_CREATED
syslog-2019.09.07 0 r UNASSIGNED INDEX_CREATED

 

Here is the verification we need:

[root@elastic elasticsearch]# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "dnsmasq-2019.09.06",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2019-09-06T11:44:58.531Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "5qG9f4jLTYim3x0V1US7SQ",
      "node_name" : "elastic.berry.net",
      "transport_address" : "192.168.88.21:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8201482240",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[dnsmasq-2019.09.06][0], node[5qG9f4jLTYim3x0V1US7SQ], [P], s[STARTED], a[id=Pa92lOgGSjiGDVpNQetebw]]"
        }
      ]
    }
  ]
}

Elasticsearch defaults to one replica per shard, and some indices also use auto_expand_replicas to add replicas back automatically. Since I do not need any replicas on a single node, I need to set this to zero. The below will do just that (run it in the console under Dev Tools in Kibana):

PUT /*/_settings
{
  "index" : {
    "number_of_replicas" : 0,
    "auto_expand_replicas" : false
  }
}
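
If you would rather stay on the command line, the equivalent curl call should look something like this:

curl -X PUT "localhost:9200/*/_settings" -H 'Content-Type: application/json' --data-binary $'{"index":{"number_of_replicas":0,"auto_expand_replicas":false}}'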

Anyway, health-wise my node is now in better shape. No more unassigned shards:

[root@elastic elasticsearch]# curl -X GET "localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "elastic1",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 146,
  "active_shards" : 146,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
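
One caveat: the settings change above only touches existing indices. New daily indices will still come up with one replica unless an index template says otherwise. A minimal sketch of such a template for ES7 (the template name and index patterns are placeholders for my setup):

PUT _template/zero_replicas
{
  "index_patterns" : ["syslog-*", "dnsmasq-*"],
  "settings" : {
    "number_of_replicas" : 0
  }
}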

 

Conclusion

Obviously, this was a bit of a “break/fix” scenario. To keep it from happening again as I approach the new 2000-shard limit, I’ll need to monitor the number of shards regularly and take one of the actions below:

  • Manually drop old data
  • Index monthly instead of daily
  • Implement Curator to trim old data automatically
  • Look into compression and archival of older data if I really wish to keep it
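
Whichever of those I land on, the first step is knowing the shard count. A very rough check against the limit could be as small as the sketch below (the host, limit, and threshold are assumptions for my single-node setup):

#!/bin/bash
# Rough sketch: warn when the number of open shards approaches the limit
LIMIT=2000
SHARDS=$(curl -s "localhost:9200/_cat/shards" | wc -l)
if [ "$SHARDS" -ge $((LIMIT * 80 / 100)) ]; then
  echo "Warning: ${SHARDS} of ${LIMIT} shards in use"
fi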

I “could” install Metricbeat on my Elasticsearch node and track performance using the elasticsearch module, but I wonder whether that would just add to the shard count. (I could index that data monthly, I suppose, though it would still be a large index, as Metricbeat tends to pull in a lot of data.)

Another option would be a simple monitoring script… I might go that route for now. More on that in the next post.