Logstash to Nagios - alerting based on Windows Event ID

This took way longer than it should have to get going...so here's a config and brain dump...

Why?
You want to have a central place to analyze Windows Event/IIS/local application logs, alert off specific events, alert off specific situations.  You don't have the budget for a boxed solution.  You want pretty graphs.  You don't particularly care about individual server states.  (see rationale below - although you certainly have all the tools here to care, I haven't provided that configuration)

How?
ELK stack, OMD, NXlog agent, and Rsyslog.  The premise here is as follows:

  1. Event generated on server into EventLog
  2. NXlog ships to Logstash input
  3. Logstash filter adds fields and tags to specified events
  4. Logstash output sends to a passive Nagios service via the Nagios NSCA output
  5. The passive service on Nagios (Check_MK c/o OMD) does its thing w. alerting

OMD
Open Monitoring Distribution, but the real point here is Check_MK (IIRC Icinga uses this...).  It makes Nagios easy to use and maintain.  Doesn't fix all the legacy concerns w. Nagios, but it works darn well nonetheless.  We have email distribution lists get 'pretty' emails (w. graphs) and the pagers get simple text emails w. abbreviated info.

The passive services we tie to 'localhost' right now because we don't actually need to monitor the servers themselves (3rd party hosted environment)...we just need to monitor the application level.  You define a 'classical active/passive nagios service' for localhost, then add a rule to disable active checks.  Enable NSCA for the site via 'omd config'.
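
For what it's worth, here's roughly what a passive result looks like by the time it lands in Nagios. This is a minimal Python sketch, assuming the standard PROCESS_SERVICE_CHECK_RESULT external command format; the function name and the sample values are made up for illustration. In this setup the NSCA daemon writes this line for you, so you'd only ever build it by hand for testing.

```python
import time


def passive_result_command(host, service, return_code, output, now=None):
    """Build a Nagios external command line for a passive service check.

    'now' is only here so the example is reproducible; it defaults to the
    current time.
    """
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical sample: a CRITICAL (2) result for a passive service on 'localhost'.
command = passive_result_command(
    "localhost", "PSV-BZ_ERROR", 2, "BizTalk error seen in EventLog", now=1425462895
)
print(command, end="")
```

The return code follows the usual Nagios convention: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The service name has to match the passive service you defined for the host, otherwise Nagios just drops the result.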

ELK stack
Elasticsearch, Logstash, Kibana.  Simple to get running, good examples around highly available config, and gives you pretty graphs.  We are looking at an ELK stack with 7 days of retention inside our production environment that ties to Nagios, and one with 6-12 months of retention at our local colo doing analysis/trending.  If you have a load balancer it's quite easy to horizontally scale.  The Kibana graphs will form part of our dashboard system.  I think at this point we'll be using RabbitMQ for the production-facing ELK stack, and perhaps Redis for the big ol' ELK stack at our office.

NXlog agent
Install this on your Windows servers.  Lightweight, simple, handles EventLogs and IIS logs (for us), but can also deal with other file inputs and much more.

Rsyslog
All of our network devices send logs to a syslog box, which in turn simultaneously archives and then forwards to Logstash.

Configuration
NXlog

# Create the parse rule for IIS logs. You can copy these from the header of the IIS log file.
<Extension w3c>
    Module xm_csv
    Fields $date, $time, $s-ip, $cs-method, $cs-uri-stem, $cs-uri-query, $s-port, $cs-username, $c-ip, $csUser-Agent, $sc-status, $sc-substatus, $sc-win32-status, $time-taken
    FieldTypes string, string, string, string, string, string, integer, string, string, string, integer, integer, integer, integer
    Delimiter ' '
    QuoteChar '"'
    EscapeControl FALSE
    UndefValue -
</Extension>
# Convert the IIS logs to JSON and use the original event time
# MySiteName site logs
<Input MySiteNameLogs>
    Module    im_file
    File    "C:\\inetpub\\logs\\LogFiles\\W3SVC1\\u_ex*"
    SavePos  TRUE
    Exec if $raw_event =~ /^#/ drop();          \
       else                                     \
       {                                        \
            w3c->parse_csv();                   \
            $EventTime = parsedate($date + " " + $time); \
            $SourceName = "IIS";                \
            $Message = to_json();               \
       }
</Input>

<Route IIS>
    Path MySiteNameLogs => out
</Route>

Note that you also need the basic Windows event log stuff set up with matching route/output names.  IIRC I pulled most of my NXlog config from the Loggly pages.
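
If it's not obvious what the w3c extension and the Exec block are doing to each IIS line, here's a rough Python equivalent. It's only a sketch under a few assumptions: the field list is copied from the config above, the sample log line is invented, and I'm ignoring quoting since IIS writes spaces inside values as '+'. It's not a replacement for NXlog, just a way to see the transformation.

```python
from datetime import datetime

# Same field order as the 'Fields' line in the NXlog w3c extension.
FIELDS = [
    "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
    "s-port", "cs-username", "c-ip", "csUser-Agent", "sc-status",
    "sc-substatus", "sc-win32-status", "time-taken",
]
# Fields declared as 'integer' in FieldTypes.
INT_FIELDS = {"s-port", "sc-status", "sc-substatus", "sc-win32-status", "time-taken"}


def parse_iis_line(line):
    """Parse one W3C log line into a dict; return None for '#' header lines."""
    if line.startswith("#"):
        return None  # same idea as drop() in the Exec block
    values = line.rstrip("\r\n").split(" ")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    event = {}
    for name, value in zip(FIELDS, values):
        if value == "-":
            event[name] = None  # mirrors 'UndefValue -'
        elif name in INT_FIELDS:
            event[name] = int(value)
        else:
            event[name] = value
    # Use the original event time rather than the time the line was read.
    parsed = datetime.strptime(f"{event['date']} {event['time']}", "%Y-%m-%d %H:%M:%S")
    event["EventTime"] = parsed.isoformat(sep=" ")
    event["SourceName"] = "IIS"
    return event


# Invented sample line, in the field order above.
sample = "2015-03-04 09:54:55 10.0.0.5 GET /default.aspx - 80 - 10.0.0.9 Mozilla/5.0 200 0 0 15"
event = parse_iis_line(sample)
print(event["SourceName"], event["sc-status"], event["EventTime"])
```

If your IIS site logs a different set of fields, the '#Fields:' header in the log file is the thing to copy from, and the field list here (and in the NXlog config) has to match it exactly.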

Logstash
I read up on the best way to handle Logstash conf files, and my interpretation was:

  • conf.d/00-inputs.conf (all inputs)
  • conf.d/01-filter_start.conf (really just 'filter {' )
  • conf.d/10-filter_parsing.conf, 20-filter_tagging.conf, etc
  • conf.d/98-filter_end.conf (really just '} #end filter' )
  • conf.d/99-outputs.conf (all outputs)
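
The numeric prefixes matter because, as far as I know, Logstash reads the files in a config directory in alphabetical order and treats them as one big config. A quick Python sketch of that ordering (the file names are the ones from the list above; the unpadded names at the end are hypothetical):

```python
conf_files = [
    "99-outputs.conf",
    "10-filter_parsing.conf",
    "00-inputs.conf",
    "98-filter_end.conf",
    "20-filter_tagging.conf",
    "01-filter_start.conf",
]

# A plain lexicographic sort is what puts 'filter {' before the filter
# bodies and the closing brace after them.
load_order = sorted(conf_files)
for name in load_order:
    print(name)

# Prefixes sort as text, not numbers, so keep them zero-padded:
# an unpadded "9-" file ends up AFTER a "10-" file.
unpadded = sorted(["9-a.conf", "10-b.conf"])
print(unpadded)
```

That's the whole reason for '00', '01', '98' and '99': the opening and closing brace files have to bracket everything in between.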


input {
    tcp {
        type => eventlog
        port => 1935
        codec => json_lines
    }

    syslog {
        type => syslog
        port => 5514
    }
}
filter {
# All grok message parsing filters
# VMware syslog input
if [message] =~ /camasvvm1/ {
    grok {
      break_on_match => false
      match => [
        "message", "<%{POSINT:syslog_pri}>%{TIMESTAMP_ISO8601:@timestamp} %{SYSLOGHOST:hostname} %{SYSLOGPROG:message_program}: (?(?(?:\[%{DATA:message_thread_id} %{DATA:syslog_level} \'%{DATA:message_service}\'\ ?%{DATA:message_opID}])) \[%{DATA:message_service_info}]\ (?(%{GREEDYDATA})))",
        "message", "<%{POSINT:syslog_pri}>%{TIMESTAMP_ISO8601:@timestamp} %{SYSLOGHOST:hostname} %{SYSLOGPROG:message_program}: (?(?(?:\[%{DATA:message_thread_id} %{DATA:syslog_level} \'%{DATA:message_service}\'\ ?%{DATA:message_opID}])) (?(%{GREEDYDATA})))",
        "message", "<%{POSINT:syslog_pri}>%{TIMESTAMP_ISO8601:@timestamp} %{SYSLOGHOST:hostname} %{SYSLOGPROG:message_program}: %{GREEDYDATA:message-syslog}"
      ]
    }
}
# All tagging filters go in here
# Don't forget to set the 'tag_on_failure' or else _grokparsefailures will abound!

### This is a test/template section... ###
if [type] == "eventlog" and [SourceName] == "Microsoft-Windows-Security-Auditing" {
        if "AUDIT_SUCCESS" in [EventType] {
                grok {
                        match => [
                        "EventID", "(4624|4634|4769|4672|4768)"
                        ]
                        add_tag => ["nagios_check_eventlog", "AuditSuccess"]
                        add_field => ["nagios_service", "PassiveTest"]
                        tag_on_failure => ["other-auditsuccess"]
                }
        }
        if "AUDIT_FAILURE" in [EventType] {
                grok {
                        match => [
                        "EventID", "(4624|4634|4769|4672|4768)"
                        ]
                        add_tag => ["nagios_check_eventlog", "AuditFailure"]
                        add_field => ["nagios_service", "PassiveTest"]
                        tag_on_failure => ["other-auditfailure"]
                }
        }
}
### End test/template section... ###

#
# BizTalk Error Tagging & Alerting #
#
if [type] == "eventlog" and [SeverityValue] == "4" {
        if [SourceName] == "BizTalk Server" {
                grok {
                        match => [
                        "EventID", "(5749|5719|5743|5813|6913|5754|5815|5778|5634|5750)"
                        ]
                        add_tag => ["nagios_check_eventlog"]
                        add_field => ["nagios_service", "PSV-BZ_ERROR"]
                        tag_on_failure => ["other_bzsrv-err"]
                }

        }
        if [SourceName] == "XLANG/s" {
                grok {
                        match => [
                        "EventID", "(10008|10034)"
                        ]
                        add_tag => ["nagios_check_eventlog"]
                        add_field => ["nagios_service", "PSV-BZ_SuspInst"]
                        tag_on_failure => []
                }
        }
        if "System.ServiceModel" in [SourceName] {
                grok {
                        match => [
                        "EventID", "3\n"
                        ]
                        add_tag => ["nagios_check_eventlog"]
                        add_field => ["nagios_service", "PSV-BZ_SystemErr"]
                        tag_on_failure => []
                }

        }
}

if "other_bzsrv-err" in [tags] {
        grok {
                match => [ "SourceName", "BizTalk Server" ]
                add_tag => ["nagios_check_eventlog"]
                add_field => ["nagios_service", "PSV-BZ_Other-Error"]
                tag_on_failure => []
        }
}
#
# End BizTalk Error Tagging & Alerting #
#

#
# IIS Tagging & Alerting  #
#
if [SourceName] == "IIS" {
        grok {
                match => ["sc-status", "[23]\d\d"]
                add_tag => ["nagios_check_iislog"]
                add_field => ["nagios_service", "PSV-IIS_Traffic"]
                tag_on_failure => []
        }
        grok {
                match => ["sc-status", "4\d\d"]
                add_tag => ["nagios_check_iislog"]
                add_field => ["nagios_service", "PSV-IIS_400"]
                tag_on_failure => []
        }
        grok {
                match => ["sc-status", "5\d\d"]
                add_tag => ["nagios_check_iislog"]
                add_field => ["nagios_service", "PSV-IIS_500"]
                tag_on_failure => []
        }
}
} #filter
output {
elasticsearch {
        host => "127.0.0.1"
        #cluster => "elasticsearch"
        }

if "nagios_check_eventlog" in [tags] or "nagios_check_iislog" in [tags] {
        nagios_nsca {
                host => "checkmk.domain.com"
                port => 5667
                send_nsca_config => "/etc/nagios/send_nsca.cfg"
                nagios_host => "localhost"
                nagios_service => "%{nagios_service}"
                nagios_status => 2
                message_format => "%{EventTime} %{Hostname} %{EventID} %{SourceName}"
                }
        }
} # output
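
Two things in that output section are easy to trip over, so here's a small Python sketch of both. First, 'in' against an array field like [tags] is an exact-element test, not a substring match (Python lists behave the same way). Second, what ends up being piped to send_nsca is a single tab-separated line: host, service, return code, message. The helper names and the sample event are mine, and the %{field} expansion is a simplified stand-in for what Logstash does, so treat this as an illustration rather than the plugin's actual code.

```python
import re

# The two alert tags added by the filters in the config above.
ALERT_TAGS = {"nagios_check_eventlog", "nagios_check_iislog"}


def should_alert(event):
    """True when the event carries one of the alert tags (whole-tag match)."""
    return any(tag in ALERT_TAGS for tag in event.get("tags", []))


def sprintf(template, event):
    """Expand %{field} references; unknown fields are left untouched."""
    def lookup(match):
        return str(event.get(match.group(1), match.group(0)))
    return re.sub(r"%\{([^}]+)\}", lookup, template)


def nsca_line(event, nagios_host="localhost", status=2,
              message_format="%{EventTime} %{Hostname} %{EventID} %{SourceName}"):
    """Build the tab-separated line that send_nsca reads on stdin."""
    service = sprintf("%{nagios_service}", event)
    message = sprintf(message_format, event)
    return f"{nagios_host}\t{service}\t{status}\t{message}\n"


# Invented sample event, shaped like a tagged BizTalk error.
event = {
    "EventTime": "2015-03-04 09:54:55",
    "Hostname": "APP01",
    "EventID": 5749,
    "SourceName": "BizTalk Server",
    "nagios_service": "PSV-BZ_ERROR",
    "tags": ["nagios_check_eventlog"],
}

prefix_matches = "nagios_check" in event["tags"]  # a prefix is not an element
print(prefix_matches)
print(should_alert(event))
line = nsca_line(event)
print(repr(line))
```

Note that nagios_status is hard-coded to 2 (CRITICAL) in the output above, so every tagged event arrives as a critical result. If you want different severities per service you'd need separate nagios_nsca outputs, or some other way of setting the status per event.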

I would once again note the rationale behind our passive service checks.

  1. Our production environment is 'hosted' in the sense that it is a 'traditional' VM/cluster environment, but we are hands off.  The hosting provider takes care of server/network/infrastructure monitoring, but anything beyond that is up to us (or costs us more).
  2. Our mandate is to have 'application health insight' - i.e. know something's up before it breaks, or at least before the customer knows.

That being said, it doesn't make sense for us to duplicate individual server monitoring - so we just pile all the logs into one place and alert if specific Event IDs from specific servers come into the ELK stack.

Conclusion
So there you go...that's about 2 solid weeks of effort to work all the kinks out.  The nagios_nsca portion alone took many days of head bashing and fruitless googling.  There's another post on using curl / regex to create Nagios-usable values out of Elasticsearch queries - that work is included in the 2 weeks.

The NXlog stuff is out there, but the IIS portion I had to get from different sources.

You'll note a lot of BizTalk stuff - we use that in our environment and it tends to be a black box, so these kinds of alerts will be helpful.  

Eventually we'll have Logstash output to a scriptbox or something that will attempt to remediate an issue before notifying Nagios.  I also want to get Graphite in place so we can start utilizing app-generated metrics, plus have another source of pretty graphs.

Comments

  1. I've updated the filter/tagging section - better logic (fwiw), less _grokparsefailures!

  2. I am looking for a way to ship selected log entries to Logstash using the NXlog shipper and send alerts to Nagios.

    There are many log patterns in that file; I want to monitor entries like the ones below:

    2015-03-04 09:54:55.298 [178] Statistic SmscWorker_10 INCOMING SM from MSISDN 651111111111111

    2015-03-04 09:54:55.328 [220] Statistic SmscWorker_16 SUBMIT SM to MSISDN 651111111111111 ReqId 1008688024 TN 232


    How do I filter this log with the NXlog shipper and send an alert from Logstash to Nagios? "INCOMING SM" and "SUBMIT SM" should appear in the log every 5 minutes. If there is no "INCOMING SM" or "SUBMIT SM" within 5 minutes, Logstash should send an alert to Nagios, maybe using "/var/lib/nagios3/rw/nagios.cmd", or preferably an nrpd command if there is one.

    Thanks

  3. Noel, what I have posted here should do the trick for you. For creating your filter, you want to go here: https://grokdebug.herokuapp.com/

    There's lots out there on setting up grok filters to parse out 'non-standard' logs. That might even be a standard format - try a few lines of that in the 'discover' section of the grokdebug app linked above.

    Your monitoring could be done by piping in "INCOMING SM" & "SUBMIT SM" entries via the Nagios output into a passive Nagios monitor. Set 'stale' alerts to 5 minutes...should do the trick! I'm sure there are other ways of doing this, too.

    I believe Elasticsearch 'percolator' can be used in a similar fashion, i.e. monitor what HASN'T happened in the last x minutes, such as this post: http://garthwaite.org/noticing-what-didnt-happen.html

    Good luck!
