
Usage experience with Elasticsearch (ELK) so far

We started looking for monitoring options after someone brought me the problem: 'WUG (WhatsUp Gold) can't run scripts this long...what are our options?'  So naturally I said, 'well, I'm sure Nagios can handle something like this' (because of zero budget and prior experience).

Getting it running wasn't an issue, but eventually I felt we needed more on the dashboard side and started reading (though not quite agreeing with) the '#monitoringsucks' posts.  I finally came across OMD (Open Monitoring Distribution) and tried it out - a user-friendly, functional implementation of everything I was looking for!

Except eventually it wasn't.  And I started seeing amazing graphs (because ooh shiny) from the folks doing Logstash, then read about how centralized storage of logs is good/handy.  So hey, why not?

It was really easy to get running, but learning the Logstash config stuff took some effort.
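To give a feel for what that learning curve looks like, here's a minimal sketch of a Logstash config in the style we use - the ports, tags, and field names are illustrative, not our production setup. NXlog ships Windows event logs as JSON over TCP, and syslog arrives on its usual port:

```conf
# Minimal Logstash config sketch - ports and tags are illustrative.
input {
  tcp {
    port  => 3515
    codec => "json"       # NXlog ships Windows event logs as JSON
    type  => "eventlog"
  }
  syslog {
    port => 514
    type => "syslog"
  }
}
filter {
  if [type] == "eventlog" {
    mutate { add_tag => ["windows"] }
  }
}
output {
  elasticsearch { host => "localhost" }
}
```

Each input gets a `type` so the filters and Kibana dashboards can tell the log sources apart later.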


Before ELK:

  • Windows event logs: log on to the server, open Event Viewer (or use RSAT).
  • IIS logs: if you're brave enough to even look at them...log on to the server.
  • Syslogs: painful, slow grepping, but at least already centrally collected.

After ELK:

  • Windows event logs: a Kibana dashboard to get you started, then modify filters/queries as you want.
  • IIS logs: a Kibana dashboard with the fields pre-parsed, so it's super easy to see what's what and where.
  • Syslogs: a Kibana dashboard - searching is now painless.  No more grepping!
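To make 'painless searching' concrete, here are the kinds of Kibana (Lucene syntax) queries this enables - the field names depend entirely on your Logstash filters, so treat these as illustrative:

```conf
# Example Kibana/Lucene queries - field names are assumptions:
EventID:4625 AND host:"WEB01"        # failed Windows logons on one host
type:iis AND response:500            # IIS server errors
type:syslog AND message:"link down"  # free-text syslog search
```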

The networking guy has already praised the marked difference now that he can just fire up Kibana and poke around - before he would avoid even touching the logs unless there was a problem that required investigation.

It's also given some big power to our support staff.  Before, they went into an issue blind: just a pager alert that was maybe relevant and not very descriptive, and they wouldn't really even look at the IIS or event logs.  Now we not only give them full (and rapid) access to those logs, we can teach them how to use them to do their jobs better, plus the data is piped to dashboards.  We're working on presenting useful info in a clear manner to the rest of the company as well.

So the ELK stack is powerful, but we quickly (and expectedly) ran into performance issues running it all on the same box.  Looking at the data (thanks to tools like the kopf Elasticsearch plugin), we could see that whenever someone ran a broad query, the Java heap would max out and everything would take a dump.  Best practices say never to exceed 90% heap usage, and we were averaging 97%, so the choice was made to move up to a dedicated cluster - three nodes to start.
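For flavor, a three-node setup of that era boils down to a few lines of `elasticsearch.yml` per node - the cluster and node names below are made up for illustration (ES 1.x-style settings):

```yaml
# elasticsearch.yml sketch (ES 1.x era) - names/hosts are illustrative.
cluster.name: logging-cluster
node.name: es-node-1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]
# With three master-eligible nodes, quorum is 2 - avoids split-brain:
discovery.zen.minimum_master_nodes: 2
```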

That solved our performance issue (and migrating from the embedded ES to a remote cluster wasn't terribly hard), but our disk was rapidly filling up (we'd planned on keeping 60-90 days of data).  Resolving that issue was simple once we figured out how to do it.  I'll do another post on the tech details of this.
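The short version is time-based index retention with curator, run from cron.  Curator's flags have changed between versions, so the invocation below (3.x-style) is a sketch rather than a copy-paste recipe:

```conf
# Crontab sketch - check `curator --help` for your version's flags.
# Delete logstash-* indices older than 60 days, daily at 01:00:
0 1 * * * curator --host localhost delete indices \
    --older-than 60 --time-unit days --timestring '%Y.%m.%d'
```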

Once that was done things went fine, but heap usage continued to rise (I believe that's part of holding a lot of data), so another node was called for.  Adding node #4 was no real issue...except I broke the cardinal (magical?) rule that all nodes in a cluster must run the same version!  Again, not a big deal to resolve, but it took some time for the cluster to rebuild after each node was upgraded (yes, I'd disabled shard allocation - it still took ~30-40 minutes per node to recover its shards; maybe I'm doing it wrong).
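For reference, the usual rolling-restart dance is: disable shard allocation, restart/upgrade the node, re-enable allocation, wait for green, repeat.  On the 1.x-era API that's a PUT to `/_cluster/settings` with a body like this (the exact setting name varies across ES versions, so verify against your docs):

```json
{ "transient": { "cluster.routing.allocation.enable": "none" } }
```

The same call with `"all"` instead of `"none"` re-enables allocation once the node is back.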

Since then we've hit the 60-day curator retention limit, and heap levels sit around 70%.  I've also added some Nagios monitoring for the cluster details so I'm not nervously checking kopf every morning (I do anyway).
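The heart of a Nagios check like that is just mapping Elasticsearch's cluster status to a Nagios exit code.  Here's a sketch in Python - in production the JSON would come from `GET /_cluster/health`; the mapping logic is factored out here so it stands alone, and the sample payload is made up:

```python
import json

# Nagios exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def nagios_state(health):
    """Map an Elasticsearch _cluster/health document to a Nagios state."""
    status = health.get("status")
    if status == "green":
        return OK, "cluster is green"
    if status == "yellow":
        return WARNING, "cluster is yellow (replicas unassigned?)"
    if status == "red":
        return CRITICAL, "cluster is red (primary shards unassigned!)"
    return UNKNOWN, "unrecognized status: %r" % status

# Sample payload, shaped like a real /_cluster/health response:
sample = json.loads('{"cluster_name": "logging", "status": "yellow", "number_of_nodes": 4}')
state, message = nagios_state(sample)
print(state, message)  # → 1 cluster is yellow (replicas unassigned?)
```

Wire that up as a small script that curls the health endpoint, prints the message, and exits with the state, and Nagios (or Check_MK) does the rest.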

One of the more forward-thinking fellows at the office has been strongly hinting at the whole big data thing for a while (we're a prime candidate for taking advantage of it), so I sense Cassandra and Spark (alongside ES) are in our future.  Some cool stuff coming, I think...

