Elasticsearch - Curator & update#2

When we first started looking at ES, I had grand visions of months and months of accumulated data.  Unfortunately, you need a big hardware hammer to deal with that much data.  After living with 60 days of log data and watching 5 nodes (4GB heap each, 8GB RAM) constantly pile up to 80-90% heap usage, we decided to drop back to 30 days, which is still adequate for what we're doing today.

I would caveat that statement with 'well, that's what I know today'.  For all I know, it's supposed to be running at 80-90% heap usage.  If that is the case, though, it seems strange that tools like 'kopf' show red at 80% heap usage, and that the ES docs indicate anything past 90% is BadNewsBears.
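For keeping an eye on heap without a plugin, here is a minimal sketch that reads the node stats API (`/_nodes/stats/jvm`) and flags any node at or above a threshold.  Assumptions on my part: the cluster answers on the default HTTP port 9200, and 80% is used as the warning line only because that's where kopf turns red.  The function names are made up for this example.

```python
import json
from urllib.request import urlopen


def heap_report(stats, warn_pct=80):
    """Return (node_name, heap_used_percent, over_threshold) tuples,
    sorted with the busiest node first."""
    rows = []
    for node in stats.get("nodes", {}).values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        rows.append((node.get("name", "?"), pct, pct >= warn_pct))
    return sorted(rows, key=lambda r: r[1], reverse=True)


def fetch_jvm_stats(host="camasves01", port=9200):
    """Fetch JVM stats for every node. Port 9200 is an assumption
    (the Elasticsearch default); adjust for your cluster."""
    url = f"http://{host}:{port}/_nodes/stats/jvm"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for name, pct, hot in heap_report(fetch_jvm_stats()):
        print(f"{name:20s} {pct:3d}% {'<-- over threshold' if hot else ''}")
```

Run from cron and piped to mail, something like this would at least show whether the drop to 30 days actually brought heap usage down.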


I also re-read the Curator docs after seeing that optimizing helps get a little more performance out of the cluster, and sure enough there is an 'optimize' switch.

Our cron job list now looks like this:
20 0 * * * /usr/bin/curator --host camasves01 close --older-than 30
0 1 * * * /usr/bin/curator --host camasves01 delete --older-than 30
0 2 * * * /usr/bin/curator --host camasves01 optimize --older-than 2
(The first curator entry has something else running before it, hence the 20-minute offset.)

Note that the optimize switch defaults to 2 segments per shard: https://github.com/elasticsearch/curator/wiki/Optimize
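To sanity-check what an `--older-than N` cutoff would catch before letting cron loose on it, here is a rough sketch of date-based selection.  This is not Curator's actual code.  It assumes daily indices named `logstash-YYYY.MM.DD` (the Logstash default), and it assumes an index exactly N days old is kept, which may not match Curator's boundary handling.  The function name is hypothetical.

```python
from datetime import date, datetime, timedelta


def indices_older_than(names, days, today, prefix="logstash-"):
    """Return the index names whose date suffix is more than `days`
    days before `today`. Names that don't match the assumed
    prefix + YYYY.MM.DD pattern are ignored."""
    cutoff = today - timedelta(days=days)
    old = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index, leave it alone
        if day < cutoff:
            old.append(name)
    return sorted(old)


if __name__ == "__main__":
    sample = ["logstash-2014.05.30", "logstash-2014.05.31", "kibana-int"]
    print(indices_older_than(sample, 30, today=date(2014, 6, 30)))
```

Feeding it the real index list with 30 and 2 shows which indices the close/delete jobs and the optimize job would each pick up under these assumptions.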

Future plans

There is talk about using systems like ELK to provide our customers with dashboards, so I'm guessing as that comes up there will be more to learn.

Also on the books is ditching syslog as a middleman so we can filter inbound traffic better.  For example, if a Cisco or Netscaler device sends to Logstash, then instead of having to parse out a dozen different types of syslog messages, its traffic would go straight to a 'Netscaler' LS input.  Ideally, anyway.  Or maybe that's the wrong approach!  If nothing else, there's always something new to learn...

