
Usage experience with Elasticsearch (ELK) so far

We started looking for monitoring options after someone brought me the problem: 'WUG (WhatsUp Gold) can't run scripts this long...what are our options?'  So naturally I said, 'well, I'm sure Nagios can handle something like this' (because of zero budget and prior experience).

Getting it running wasn't an issue, but eventually I felt we needed more on the dashboard side and started reading (though not quite agreeing with) the '#monitoringsucks' posts.  I finally came across OMD (Open Monitoring Distribution) and tried it out - a user-friendly, functional implementation of everything I was looking for!

Except eventually it wasn't.  And I started seeing amazing graphs (because ooh shiny) from the folks doing Logstash, then read about how centralized storage of logs is good/handy.  So hey, why not?

It was really easy to get running, but learning the Logstash config stuff took some effort.
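To give a feel for what that learning curve looks like, here's a minimal sketch of a Logstash config in the style we use - the ports, tags, and field names are illustrative, not our production setup. NXlog ships Windows event logs as JSON over TCP, and syslog arrives on its usual port:

```conf
# Minimal Logstash config sketch - ports and tags are illustrative.
input {
  tcp {
    port  => 3515
    codec => "json"       # NXlog ships Windows event logs as JSON
    type  => "eventlog"
  }
  syslog {
    port => 514
    type => "syslog"
  }
}
filter {
  if [type] == "eventlog" {
    mutate { add_tag => ["windows"] }
  }
}
output {
  elasticsearch { host => "localhost" }
}
```

Each input gets a `type` so the filters and Kibana dashboards can tell the log sources apart later.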


Before ELK:

  • Windows event logs: log on to the server, open Event Viewer (or use RSAT).
  • IIS logs: if you're brave enough to even look at them...log on to the server.
  • Syslogs: painful, slow grepping, but at least already centrally collected.

After ELK:

  • Windows event logs: a Kibana dashboard to get you started, then modify filters/queries as you want.
  • IIS logs: a Kibana dashboard with the fields pre-parsed, so it's super easy to see what's what and where.
  • Syslogs: a Kibana dashboard - searching is now painless.  No more grepping!
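To make 'painless searching' concrete, here are the kinds of Kibana (Lucene syntax) queries this enables - the field names depend entirely on your Logstash filters, so treat these as illustrative:

```conf
# Example Kibana/Lucene queries - field names are assumptions:
EventID:4625 AND host:"WEB01"        # failed Windows logons on one host
type:iis AND response:500            # IIS server errors
type:syslog AND message:"link down"  # free-text syslog search
```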

The networking guy has already praised the marked difference now that he can just fire up Kibana and poke around - before he would avoid even touching the logs unless there was a problem that required investigation.

It's also given some big power to our support staff.  Before, they went into an issue blind: just a pager alert that was maybe relevant and not very descriptive, and they wouldn't really even look at the IIS or event logs.  Now we not only give them full (and rapid) access to those logs, we can teach them how to use them to do their jobs better, plus the data is piped to dashboards.  We're working on presenting useful info in a clear manner to the rest of the company as well.

So the ELK stack is powerful, but we quickly (and expectedly) ran into performance issues running it all on the same box.  Looking at the data (thanks to tools like the kopf Elasticsearch plugin), we could see that whenever someone ran a broad query, the Java heap would max out and everything would take a dump.  Best practices say never to exceed 90% heap usage, and we were averaging 97%, so the choice was made to move up to a dedicated cluster - three nodes to start.
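For flavor, a three-node setup of that era boils down to a few lines of `elasticsearch.yml` per node - the cluster and node names below are made up for illustration (ES 1.x-style settings):

```yaml
# elasticsearch.yml sketch (ES 1.x era) - names/hosts are illustrative.
cluster.name: logging-cluster
node.name: es-node-1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]
# With three master-eligible nodes, quorum is 2 - avoids split-brain:
discovery.zen.minimum_master_nodes: 2
```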

That solved our performance issue (and migrating from the embedded ES to a remote cluster wasn't terribly hard), but our disk was rapidly filling up (we'd planned on keeping 60-90 days of data).  Resolving that issue was simple once we figured out how to do it.  I'll do another post on the tech details of this.
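The short version is time-based index retention with curator, run from cron.  Curator's flags have changed between versions, so the invocation below (3.x-style) is a sketch rather than a copy-paste recipe:

```conf
# Crontab sketch - check `curator --help` for your version's flags.
# Delete logstash-* indices older than 60 days, daily at 01:00:
0 1 * * * curator --host localhost delete indices \
    --older-than 60 --time-unit days --timestring '%Y.%m.%d'
```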

Once that was done things went fine, but heap usage continued to rise (I believe that's part of holding a lot of data), so another node was called for.  Adding node #4 was no real issue...except I broke the cardinal (magical?) rule that all nodes in a cluster must run the same version!  Again, not a big deal to resolve, but it took some time for the cluster to rebuild after each node was upgraded (yes, I'd disabled shard allocation - it still took ~30-40 minutes per node to recover its shards; maybe I'm doing it wrong).
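For reference, the usual rolling-restart dance is: disable shard allocation, restart/upgrade the node, re-enable allocation, wait for green, repeat.  On the 1.x-era API that's a PUT to `/_cluster/settings` with a body like this (the exact setting name varies across ES versions, so verify against your docs):

```json
{ "transient": { "cluster.routing.allocation.enable": "none" } }
```

The same call with `"all"` instead of `"none"` re-enables allocation once the node is back.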

Since then we've hit the 60-day curator retention limit, and heap levels sit around 70%.  I've also added some Nagios monitoring for the cluster details so I'm not nervously checking kopf every morning (I do anyway).
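The heart of a Nagios check like that is just mapping Elasticsearch's cluster status to a Nagios exit code.  Here's a sketch in Python - in production the JSON would come from `GET /_cluster/health`; the mapping logic is factored out here so it stands alone, and the sample payload is made up:

```python
import json

# Nagios exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def nagios_state(health):
    """Map an Elasticsearch _cluster/health document to a Nagios state."""
    status = health.get("status")
    if status == "green":
        return OK, "cluster is green"
    if status == "yellow":
        return WARNING, "cluster is yellow (replicas unassigned?)"
    if status == "red":
        return CRITICAL, "cluster is red (primary shards unassigned!)"
    return UNKNOWN, "unrecognized status: %r" % status

# Sample payload, shaped like a real /_cluster/health response:
sample = json.loads('{"cluster_name": "logging", "status": "yellow", "number_of_nodes": 4}')
state, message = nagios_state(sample)
print(state, message)  # → 1 cluster is yellow (replicas unassigned?)
```

Wire that up as a small script that curls the health endpoint, prints the message, and exits with the state, and Nagios (or Check_MK) does the rest.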

One of the more forward-thinking fellows at the office has been strongly hinting at the whole big data thing for a while (we're a prime candidate for taking advantage of it), so I sense Cassandra and Spark (alongside ES) are in our future.  Some cool stuff coming, I think...

