
App review: InfraDog (new take on SMB system monitoring)

A friend of mine has been very busy lately - the last few years, in fact - and I've found out just what he's accomplished.  I would note right off the bat that I am not being paid for this, just helping get the word out on an app that could be a game-changer for many of you SMB folk.

Since we live in a 'TL;DR' (too long; didn't read) kind of world I will cut to the chase, then paste in my 7-day InfraDog journal.  If you are in SMB, do not have server monitoring in place, and want cheap insurance against the common IT blunders that cost you business credibility and money, just do the 30-day free trial.  Do it.  I don't have actual figures on post-trial pricing, but I understand it is extremely (read: can't afford not to) affordable to keep going after the 30 days are up.

Setup is almost comically easy if you're familiar with traditional monitoring solutions (SCOM, Nagios).  Step 1, download and install the management client onto a single server/PC of your choice.  Step 2, add desired servers/PCs to your list of monitored machines.  Step 3, download the app on your iOS device (iPhone preferred), connect, and start poking around.

Out of the box you get CPU/RAM/disk monitoring templates.  There are other event log templates available, but the key indicators in a 'there could be a problem here' kind of monitoring system are those three.  I would also hazard a guess that new templates will continue to become available as the app develops (it was only released a week ago as of writing).  These templates can then be applied to your group (or groups) of servers (only Windows as of release - VMware and Linux on the way).

This is just my gut feeling, but in the SMB arena (which they tell me they are exclusively targeting) this will a) completely change how companies view monitoring (the vast majority do nothing), and b) make IT professionals' and consultants' lives easier and less stressful.  Once you get familiar with the app layout, it starts to dawn on you just how handy this app could be.

Now, I'll be 100% clear here...this will not replace Nagios or SCOM...it's not meant to.  This gives you a front-line defence against common IT problems, and the ability to know about them before they turn serious.  Further, the potential this app has really makes you think - if you have seen their YouTube commercial, it conveys the basic premise they're going for.  Consider this:  Reading event logs is now easy...and fast.

The app is not perfect as of release, but it's pretty darn polished and smooth for only having been on the market for a week.  I'm impressed.  Good job, guys!!

Edit:  Forgot to put a linky in: http://www.infradog.com

Journal as follows.....be warned it rambles.

Device: iPad1, wifi-only

Day1

  • Got registered, downloaded, installed and had ~50 servers (3 subnets) added within 15 minutes.
  • Applied the 3 default CPU/RAM/Disk monitors.
  • Poked around, really enjoying the ping/portquery option and how it looks - doesn’t just give ‘ok’ but lets you see results of the command.
  • I started using the app on my iPad (gen1).  Unfortunately there is only the ‘2x’ button (it’s designed as an iPhone app) - I think you could get a lot of value out of the extra screen real estate for reports and for viewing all info at once.  A tag team of iPad in the office, iPhone out of the office might be interesting.


Day2

  • Got a LOT of email alerts (60) overnight, pretty much all of them memory-related (set to alarm by default at 85% usage).  Up/downs.  So it’s clear that this is not an ‘out of the box you’re good to go’ kind of thing.  In reality, though, CPU/RAM is usually not the problem; it’s almost always disk that causes serious issues.
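
To illustrate why a single fixed threshold can generate that many up/down emails, here is a minimal sketch.  It is not how InfraDog works internally (I don't know that); the function name, the sample values and the alarm/clear labels are all made up for illustration.  The only figure taken from the journal is the 85% default.

```python
def threshold_alerts(samples, threshold=85.0):
    """Return (sample_index, state) every time usage crosses the threshold.

    Hypothetical model: one email when usage rises to or above the
    threshold ("ALARM") and another when it drops back below ("CLEAR").
    """
    alerts = []
    alarmed = False
    for i, pct in enumerate(samples):
        now_alarmed = pct >= threshold
        if now_alarmed != alarmed:
            alerts.append((i, "ALARM" if now_alarmed else "CLEAR"))
            alarmed = now_alarmed
    return alerts


# Made-up RAM usage (%) for a server that idles right around 85%.
ram_usage = [82, 86, 84, 87, 83, 88, 84]

alerts = threshold_alerts(ram_usage)
print(len(alerts), "alerts from", len(ram_usage), "samples")
```

A server that normally sits near the threshold crosses it over and over, so almost every sample turns into an email.  Moving the threshold away from the server's normal usage (for example `threshold_alerts(ram_usage, threshold=90)`) makes the same data produce no alerts at all.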


Day3

  • Got a disk notification for a server that’s been troublesome in the past - discovered the job meant to clean the backup files had failed to run.  Probably time to just fix the ‘sins of the fathers’ and move the backups off the DB drive.
  • Found some of the event log filtering a bit buggy - removing informational under a particular server’s system event log removed all items, even though there were a few recent warnings.  Hit refresh, and it displayed all events.  Checked the filter settings - informational still set to ‘off’.  Odd.  Oh well, some bugs to be expected.  Update - not a bug...filter has a time period default of 24 hours.  Whups.
  • Not a huge number of monitor templates, but that will probably change going forward.
  • My impression of the app thus far is that it’s a completely different way of viewing your server inventory...I see this going places.
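
The event log 'bug' from the filtering bullet above is easier to see in a small sketch.  This is purely illustrative: the event dictionaries, level names, dates and the `filter_events` helper are hypothetical, and the only detail taken from the app is that the filter defaults to a 24-hour time period.

```python
from datetime import datetime, timedelta


def filter_events(events, now, hidden_levels=frozenset(),
                  window=timedelta(hours=24)):
    """Keep events whose level is not hidden and that fall inside the window."""
    cutoff = now - window
    return [
        e for e in events
        if e["level"] not in hidden_levels and e["time"] >= cutoff
    ]


# Made-up sample data: recent informational events, slightly older warnings.
now = datetime(2013, 1, 10, 9, 0)
events = [
    {"level": "Information", "time": now - timedelta(hours=1)},
    {"level": "Information", "time": now - timedelta(hours=5)},
    {"level": "Warning", "time": now - timedelta(hours=30)},
    {"level": "Warning", "time": now - timedelta(days=2)},
]

# Hiding informational events with the default 24-hour window leaves nothing.
last_day = filter_events(events, now, hidden_levels={"Information"})

# Widening the window brings the warnings back.
last_week = filter_events(events, now, hidden_levels={"Information"},
                          window=timedelta(days=7))

print(len(last_day), len(last_week))
```

With these sample events the warnings are all older than 24 hours, so switching informational off empties the list even though nothing is actually broken.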


Day4

  • Spent some time looking around at what’s available.  (going from memory here)  The summary report can be emailed straight from the app, but individual server items must be emailed from the on-device email profile...unfortunate as my test device did not have any email configured, but in the real world not a big deal.
  • The re-scan would not pick up the newly sized disk right away - after about 15 minutes post-scan, the disk started showing up.
  • Found the app would crash (I’ve forgotten the specific steps), but not a big deal as it started up again very quickly.
  • Creating a user-defined server group wasn’t too difficult, very easy to do once the process is figured out.
  • Scrolling through event logs is very nicely implemented - the key point is ‘fast’.  Security log is not available, but that’s not a deal-breaker, as you’d most likely be at a PC by this point anyways.  It’s too easy to start thinking this app can do it all.  Maybe in future...


Day5

  • Adding in some method of monitoring application response time would be handy, but hardly falls under ‘simple stuff’.
  • The ‘infradog’ email folder I created was getting quite full by this point (300 emails over the weekend), so I adjusted the monitor templates from the defaults.  This was another instance of ‘would be nice if...’ as the options for the template monitors are fairly limited.
  • The general report is handy if you need to view all system specs in one place without compiling a list yourself (and assuming you don’t have any other sort of reporting/monitoring infrastructure).


Day6

  • Checked out my alert emails, and I’m still getting them based on the old RAM template values...investigated - I had not ‘applied’ the template to the host group.  Did so, but since I changed the parameters of the disk check, I now have two disk checks - one with each setting (%/GB).  Poked around (literally, har) for a way to remove the old ‘%’ disk template...it seems that since I’m now using the ‘new’ template, there is no way to remove the old one without removing both.  The confusing bit is that I only edited the one disk template.  After removing the GB disk monitor, I recreated the % disk monitor, re-applied it, removed it, then re-created the GB disk monitor and applied it.  Seems ok now, and it was only the disk monitor with this bug.
  • The GB disk monitor is confusing to read: Free space < 10GB, Value=24GB, Total=29GB  (should just read ‘Only 5GB free!’)
  • The pull down to refresh is cool.
  • This seems like a great place for some graphing, but that would mean huge databases for InfraDog to support.
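
As a sketch of the friendlier wording suggested for the GB disk monitor, the arithmetic is just total minus used.  This assumes that ‘Value’ in the alert is used space (my reading of it, since 29GB - 24GB gives the 5GB free figure); the `describe_disk` function and its parameter names are made up for illustration and are not part of the app.

```python
def describe_disk(used_gb, total_gb, threshold_gb=10):
    """Turn used/total figures into a one-line free-space message.

    Assumes the alert's 'Value' is used space; threshold_gb mirrors the
    'Free space < 10GB' condition from the alert text.
    """
    free_gb = total_gb - used_gb
    if free_gb < threshold_gb:
        return f"Only {free_gb}GB free!"
    return f"{free_gb}GB free"


# Figures from the alert: Value=24GB, Total=29GB, threshold 10GB.
print(describe_disk(used_gb=24, total_gb=29))
```

Leading with the free space, which is the number the threshold is actually compared against, saves the reader from doing the subtraction on a phone screen.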


Day7

  • Alert frequency is now very manageable, the template tweaks did the job.

