Our (contextual) approach to monitoring...
Let me preface this with a disclaimer: I am not the creator or inventor of any of these ideas. This is just a combination of Googling and the context of our environment. For all I know people have been doing this for ages... it's new to me!

Our context

Our 'prod' environment lives in a managed hosting context for various reasons (compliance, past decisions, etc.). This means that from a 'traditional' monitoring point of view we are covered. Anything beyond that, we're on our own. Also, we are 95% Windows.

What is monitored today:

Host/svc, ports alive, HTTP GET = XYZ, is CPU crazy, is RAM maxed, etc...

Picking up what's left:

Logs! Centrally aggregate all logs and query that pile of data to get alerts. That is, the services/applications we run (written in-house) dump stuff to the event log, among other places.

What are our goals?

Get visibility into what our svcs/apps are doing - know (be alerted) that svcX is dumping errors right now, not a wee...