Application-generated metrics - Part 1: What is the difference between Metrics, Events, and Logs?

We're working on making 'application-generated metrics' an accessible thing through some sort of metrics framework.  The subtle goal is that we'd like to provide the ability for metrics-driven decisions to become a viable option, and to gently push back on "sales-driven development".

Given that I spent an entire day trying to figure out the answer to this question of 'how are metrics/events/logs different', it's probably worth taking the time to write it down.

DISCLAIMER: This is not a scientific paper, so do your own research to refute or support the following...

Backstory

What are 'application-generated metrics'?  An old concept that I'm probably wording poorly.  Essentially, as your code does stuff, it should tell you about it.  Important flows like registration and payments should be backed by dashboards/monitoring that track 'registration failure' or 'payment failure'.  This was my original exposure to this concept, and it stuck with me:  Measure anything, measure everything - Etsy

The pattern for introducing a metrics framework (not sure if that's the right term) was established with our efforts to bring in centralized logging (which was a rousing success):

  • Helped figure out and test how to actually implement the in-app logging component (we are using NLog for the C# apps)
  • Helped establish a sane naming convention/hierarchy/index patterns/etc
  • Enforced said naming convention
  • Helped train/teach developers on the use of Kibana
  • Helped build dashboards for teams (we did this alongside our push for Runbooks)
  • Helped onboard stubborn applications that needed extra Logstash love
  • Built a fully-Terraformed ELK infrastructure and made sure it stayed up (it wasn't very stable to start, and there were some spectacular app-splosions, but we got past it)

So given that experience, it made sense to start down a similar path - collaboration!  But this time - bonus! - there was already a metrics push underway by the developers (courtesy of the data team).  They had written a NuGet package to be consumed internally that provided metric-dropping functionality (sending to our data analysis stack).  Since I have the good fortune to sit across from the developer responsible for that package (and for the last few weeks had been dropping hints about putting together our own metrics datastore), we got together to discuss how this could work.

My proposed solution

InfluxDB & Grafana, with added providers/endpoints to their metrics package.  Simple!  I believe that I actually said: "You just drop a metric for 'registration success', easy!" ...

So easy I arrogantly wrote up a cat-centric code snippet!





Why this was silly

...except for the fact that they built their metrics package based on the Google Analytics pattern of Category + Action + Label (which is fine! not sure why I set it up to be the baddie?).  But just try mapping the GA pattern to what InfluxDB expects...not so simple!  Or....even try thinking before you speak!  Also not simple!

InfluxDB wants a measurement name and at least one field key/value pair, plus optional tag key/value pairs.  Fields are required, i.e. you should be passing some sort of highly variable data inside the field.  Fields are not indexed! (you can still use a field in a WHERE clause, but the query has to scan every point, and you can't GROUP BY a field)  Tags are optional, and the values can only be strings.  Tags are indexed!  And the best we could figure was that Actions map to Measurements.
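To make the measurement/tag/field split a bit more concrete, here is a minimal Python sketch that renders one point in InfluxDB's line protocol (measurement, then tags, then fields, then an optional timestamp).  The helper name `to_line` and the example keys (`duration_ms`, `result`) are my own inventions for illustration, not part of any Influx client library, and the escaping only covers the common cases.

```python
from typing import Mapping, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, special: str) -> str:
    """Backslash-escape each character in `special` that appears in `text`."""
    for ch in special:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    """Render a field value the way line protocol expects it."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is sent as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(
    measurement: str,
    fields: Mapping[str, FieldValue],
    tags: Optional[Mapping[str, str]] = None,
    timestamp_ns: Optional[int] = None,
) -> str:
    """Build one line-protocol line (hypothetical helper, not a client API)."""
    if not fields:
        # Fields are required: a point with no field is rejected by InfluxDB.
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key, value in sorted((tags or {}).items()):
        # Tag values are always strings, so everything is stringified here.
        head += "," + _escape(key, ",= ") + "=" + _escape(str(value), ",= ")
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(to_line("registration", {"duration_ms": 123}, {"result": "success"}))
```

The numeric thing you want to do math on goes in a field; the low-cardinality thing you want to filter or group by goes in a tag.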

Example Metric

Category: Registration
Action: Success
Label: LeagueId
(timestamp)

Translated Metric

Measurement: success
Fields: Label=LeagueId
Tags: Category=Registration
(timestamp)
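The translation above can be sketched in a few lines of Python.  This is only an illustration of the mapping we tried (Action becomes the measurement, Category becomes a tag, Label becomes a field); the class and function names are hypothetical and have nothing to do with the real metrics package.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GaStyleMetric:
    """A metric in the Google Analytics shape: Category + Action + Label."""
    category: str
    action: str
    label: str


@dataclass
class InfluxPoint:
    """The pieces InfluxDB expects for a single point (timestamp omitted)."""
    measurement: str
    tags: Dict[str, str]
    fields: Dict[str, str]


def translate(metric: GaStyleMetric) -> InfluxPoint:
    """Apply the mapping described above (an assumption, not a standard)."""
    return InfluxPoint(
        measurement=metric.action.lower(),   # Action   -> measurement
        tags={"Category": metric.category},  # Category -> indexed tag
        fields={"Label": metric.label},      # Label    -> unindexed field
    )


point = translate(GaStyleMetric("Registration", "Success", "12345"))
print(point)
```

Notice that the only field is a string identifier, not a number you could average or sum.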

However, inevitably someone would want to do a query that involved 'WHERE LeagueId=12345' and this would all fall apart: LeagueId lives in an unindexed field, so that filter means a scan, and promoting it to a tag gives us a new series for every league.  So the problem here is simply that we are trying to jam something into Influx that does not fit their definition of an event (still uncertain as to how they define it).

Further, how do we map the following to that standard?

  • Results: Success or Failure of a thing, like payment failed
  • Occurrences (Events): Start or Finish of a thing, like a call to a 3rd-party API
  • Actions: Button or link clicks (you could argue this is what GA does :) )

Here's a code snippet of me trying to fit an example metric into the Influx format.   You get an idea for what kinds of metadata and such we are looking to attach to an 'event' metric.  (are we doing it wrong?  tell me!)

The Google Analytics pattern makes sense to me (I liken it to stamping a metric out), so I tried to reconcile 'registration success' with some sort of value...and came up empty.  Get it!?  Aha!  This is really a case of 'incorrect assumptions' - I assumed that this would be simple, and that I understood the requirements.  Fortunately, it's good to be proved wrong this early in the game - proof that socializing stuff works.

Conclusions

So more reading, and talking to a lot of folks at the office, and here is where we ended up:

  • A metric is a count of events, or the status of a thing (registration took 123ms)
    • These have business and diagnostic significance
    • They have a highly variable value attached!
  • An event is a thing that happens and is tracked (a registration by user 444 was successful)
    • These have business significance
    • I call them 'loglets'
  • A log is just a verbose event, kinda (the maximum number of retries has been hit and the circuit breaker is opening)
    • These have diagnostic significance
    • Only look at logs if you have to
    • Your classic log entry goes here

Our implementation will change a bit, knowing this.  There is now cause to consider sending three different types of information, so we need specific channels.
  • Metrics with math-able values are sent to Influx
    • Registration took 1500ms
    • I am having trouble coming up with others...
  • Events (and logs) are sent to Elasticsearch
    • User:1234 registered successfully
    • LeagueId roster upload failed
    • Not sure yet if we can just re-use the NLog connector, or if we need something more direct to Elasticsearch (probably the latter)
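The split above can be sketched as a tiny router: anything carrying a math-able value goes one way, everything else goes the other.  This is a rough Python sketch under my own assumptions; `Record`, `choose_sink`, and the sink names are hypothetical, and the real thing would hand the record to an actual InfluxDB or Elasticsearch client.

```python
from dataclasses import dataclass, field
from numbers import Real
from typing import Dict, Optional


@dataclass
class Record:
    """Something the application wants to tell us about."""
    name: str
    value: Optional[float] = None  # present only for math-able metrics
    metadata: Dict[str, str] = field(default_factory=dict)


def choose_sink(record: Record) -> str:
    """Pick a destination based on whether the record has a numeric value."""
    is_number = isinstance(record.value, Real) and not isinstance(record.value, bool)
    if is_number:
        return "influxdb"       # metrics: "registration took 1500ms"
    return "elasticsearch"      # events and logs: "User:1234 registered"


timing = Record("registration_duration_ms", value=1500)
event = Record("registration_succeeded", metadata={"user": "1234"})
print(choose_sink(timing), choose_sink(event))
```

The interesting design question is less the routing itself and more who decides what counts as a metric; here the presence of a numeric value is the whole test.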

So there you have it.  Events in the sense of 'this thing happened at this time and this user was involved with this other metadata' are not something you put in InfluxDB.

More to come as we figure out implementation-ey stuff, plus later on I'm sure there will be a 'hey this worked!' or 'hey this was dumb!' post.

Some links from my research

https://labs.spotify.com/2015/11/16/monitoring-at-spotify-the-story-so-far/
https://support.google.com/analytics/answer/1033068?hl=en
https://docs.influxdata.com/influxdb/v1.3/concepts/key_concepts/
https://docs.influxdata.com/influxdb/v1.3/concepts/schema_and_data_layout/
https://groups.google.com/forum/#!topic/influxdb/jge1kctn35g
https://dba.stackexchange.com/questions/163292/understanding-how-to-choose-between-fields-and-tags-in-influxdb
https://groups.google.com/forum/#!topic/influxdb/JsirU1Q2aqE
https://groups.google.com/forum/#!topic/influxdb/7tHKkEMJWwQ
https://github.com/influxdata/influxdb-csharp
https://www.influxdata.com/blog/why-build-a-time-series-data-platform/
