Update on canary - how has our release impact changed?

We had two primary targets for our canary process: the main web app and the main API in our monolith. Deploying to either of them with the normal deploy process causes 'release impact'. Namely, our systems go wonky, and the full scope of what gets disrupted is sometimes unknown (or rather, has never been tabulated).
For the backstory, check out the other two posts on this topic.

TL;DR

  1. Customer impact with canary is now 0 (barring a broken deploy)
  2. We can now release at will instead of inside low-throughput release windows
  3. We get fast feedback from metrics, and a failed deployment is reverted automatically
  4. The deploy itself takes longer, but that's OK because we no longer have a time restriction
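
To make point 3 a bit more concrete, here is a minimal sketch of what a metrics-based revert decision could look like. It is not our actual tooling: the `WindowStats` type, the `should_revert` function and every threshold in it are hypothetical, picked only to illustrate comparing canary traffic against the existing servers.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for one observation window (hypothetical shape)."""
    requests: int   # total requests served in the window
    errors: int     # responses counted as failures, e.g. HTTP 5xx
    p95_ms: float   # 95th percentile latency in milliseconds


def should_revert(baseline: WindowStats, canary: WindowStats,
                  max_error_rate: float = 0.01,
                  max_p95_ratio: float = 2.0,
                  min_requests: int = 100) -> bool:
    """Return True when the canary looks unhealthy next to the baseline.

    All thresholds are assumptions for illustration, not tuned values.
    """
    if canary.requests < min_requests:
        return False  # not enough canary traffic yet to judge
    if canary.errors / canary.requests > max_error_rate:
        return True   # canary is failing too many requests
    if baseline.p95_ms > 0 and canary.p95_ms > baseline.p95_ms * max_p95_ratio:
        return True   # canary is much slower than the baseline
    return False


baseline = WindowStats(requests=1000, errors=2, p95_ms=200.0)
healthy = WindowStats(requests=500, errors=1, p95_ms=220.0)
slow = WindowStats(requests=500, errors=0, p95_ms=900.0)
print(should_revert(baseline, healthy))  # False
print(should_revert(baseline, slow))     # True
```

The minimum-request guard is there so that a handful of slow requests right after the canary starts taking traffic doesn't trigger a revert on its own.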

Some definitions...

  • Impact window (length of release impact) - how long users and systems are negatively impacted specifically because our new code is going out (releasing, or technically, deploying)
  • Outage - the application is completely unavailable to serve requests
  • Average latency - the mean of the 'time_taken' field from IIS responses
    • It's really just for illustration
  • Percentiles (75th, 95th and 99th) are also measured from 'time_taken'
    • I chose those three because:
      • 75th being high means things are pretty bad
      • 95th is the one I've seen recommended as 'target to hit'
      • 99th highlights weird horrors
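
For anyone who wants to reproduce these measurements, here is a rough sketch of computing the average and the three percentiles from raw IIS logs. It assumes W3C extended format with a `#Fields:` header and the standard `time-taken` column (milliseconds), and it uses the nearest-rank percentile method; the numbers in this post came from our metrics tooling, which may calculate percentiles slightly differently. The function names and the sample log lines are made up for illustration.

```python
import math
from statistics import mean


def parse_time_taken(lines, field="time-taken"):
    """Pull the time-taken column (ms) out of W3C extended log lines."""
    values, index = [], None
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns, so look up the position by name
            index = line[len("#Fields:"):].split().index(field)
        elif line and not line.startswith("#") and index is not None:
            values.append(int(line.split()[index]))
    return values


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    """Average plus the three percentiles used in this post."""
    return {
        "avg": mean(values),
        "p75": percentile(values, 75),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }


# Made-up sample lines, only to show the expected input shape
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2020-01-01 00:00:01 GET /api/a 200 40",
    "2020-01-01 00:00:02 GET /api/b 200 80",
    "2020-01-01 00:00:03 GET /api/c 200 120",
    "2020-01-01 00:00:04 GET /api/d 500 2000",
]
print(summarize(parse_time_taken(sample)))
```

Notice how one slow request in the sample drags the average way up while the 75th percentile stays low, which is why the average is listed here for illustration only.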

API Release Impact

  • Impact window reduction: 150s -> 30s
  • Outage reduction: 30s -> 0s
  • Average latency during deploy: 8141ms -> 575ms
  • 75th percentile reduction: 13752ms -> 575ms
  • 95th percentile reduction: 19600ms -> 4350ms
  • 99th percentile reduction: 20175ms -> 5900ms

Normal Deploy - API










Canary Deploy - API





Web Release Impact

  • Impact window reduction: 60s -> 0s
  • Outage reduction: 20s -> 0s
  • Average latency during deploy: 4200ms -> 80ms
  • 75th percentile reduction: 7150ms -> 42ms
  • 95th percentile reduction: 14600ms -> 216ms
  • 99th percentile reduction: 20600ms -> 1020ms
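
If you'd rather see those before/after pairs as percentages, a few lines of Python will do it. The figures are copied straight from the API and Web lists in this post; the `reduction_pct` helper and the dictionary keys are just names I picked for the example.

```python
def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return round((before - after) / before * 100, 1)


# (before, after) pairs copied from the lists in this post
api = {
    "impact window (s)": (150, 30),
    "outage (s)": (30, 0),
    "average (ms)": (8141, 575),
    "75th (ms)": (13752, 575),
    "95th (ms)": (19600, 4350),
    "99th (ms)": (20175, 5900),
}
web = {
    "impact window (s)": (60, 0),
    "outage (s)": (20, 0),
    "average (ms)": (4200, 80),
    "75th (ms)": (7150, 42),
    "95th (ms)": (14600, 216),
    "99th (ms)": (20600, 1020),
}

for name, results in (("API", api), ("Web", web)):
    for metric, (before, after) in results.items():
        pct = reduction_pct(before, after)
        print(f"{name} {metric}: {before} -> {after} ({pct}% lower)")
```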

Normal Deploy - Web App










Canary Deploy - Web App












