Canary deployments of IIS using Octopus & AWS

I'm writing this more to clear my head than anything else.  If it helps someone, great.

We have measured significant 'release impact' when deploying one of our core applications.   The main problem is initialization/warmup of the appPool.  We've tried the built-in methods, but for whatever reason we are always stuck with ~25s of dead time while the first request warms things up (we assume it's warming things up, not really sure what is happening).  After that 25s wait things are very snappy and fast, so how do we prevent all of our web servers from going into 25s of dead time with production traffic inbound?

Starting point - why do this?

We care about our customers, and we want to help drive our business forward with as much quality/safety/speed as possible.

Because we want to drive our business forward, we are pushing to do more and more deploys (currently we do a daily deploy, but want to see 5x that) (if you have to ask why we want 5x, read this).  Because we care about our customers, we evaluated what a customer sees/experiences during these deploy windows.  If you've never done this, might be worthwhile!  We discovered there is a 30s blackout followed by a minute or two of slow response times.

So we need to deploy more often, but we can't deal with customer outages (even a few minutes will increase support call volume).  At present, our target of 5x deploys/day would cause 10-15m of downtime per day - not acceptable!  Further, even with a single deploy per day, if a change goes out, core metrics might be impacted and nobody would know until customers complained or someone tested the system.

'Cluster Immune System' concept to the rescue!  It's not a new concept (introduced over 10 years ago), but I've only just read about it recently, and only had opportunity to implement it recently.  Here's the gist of it:

  • Canary deploy concept (one server at a time)
  • Interesting metrics are watched/polled after your canary period (this part is really the most important)
  • Revert kicks in if metrics are affected, team is notified

Infrastructure


  • EC2 boxes running IIS & Octopus Tentacle
  • Application Load Balancer with a target group (not classic ELB, but ELBv2)
  • DNS routes to the ALB
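
For the ALB piece, taking a box out of rotation and putting it back could look roughly like the sketch below. It assumes the AWS Tools for PowerShell ELBv2 cmdlets (Register-ELB2Target / Unregister-ELB2Target) are installed on the box; the function names and the $TargetGroupArn / $InstanceId parameters are hypothetical, not what our scripts actually use.

```powershell
# Sketch only: requires the AWS Tools for PowerShell ELBv2 module at call time.
# Function and parameter names are made up for illustration.

function New-CanaryTarget {
    param([Parameter(Mandatory)][string]$InstanceId)
    # ELBv2 target groups identify instance targets by a TargetDescription
    $target = New-Object Amazon.ElasticLoadBalancingV2.Model.TargetDescription
    $target.Id = $InstanceId
    return $target
}

function Remove-CanaryFromTargetGroup {
    param(
        [Parameter(Mandatory)][string]$TargetGroupArn,
        [Parameter(Mandatory)][string]$InstanceId
    )
    # Stops new traffic; in-flight requests drain per the target group's settings
    Unregister-ELB2Target -TargetGroupArn $TargetGroupArn -Target (New-CanaryTarget -InstanceId $InstanceId)
}

function Add-CanaryToTargetGroup {
    param(
        [Parameter(Mandatory)][string]$TargetGroupArn,
        [Parameter(Mandatory)][string]$InstanceId
    )
    # The instance only receives traffic once the ALB health check passes
    Register-ELB2Target -TargetGroupArn $TargetGroupArn -Target (New-CanaryTarget -InstanceId $InstanceId)
}
```

Wrapping the two calls keeps the canary start/finish scripts readable and gives one place to add draining or health-check waits later.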

Some process hurdles we faced


  • We all agreed that 'everything must be backwards compatible' - this is a prerequisite for all of the modern deploy methodologies (canary, blue/green, CD, etc)
  • We all agreed that 'everything must have a feature flag' - we separate deploy from release
  • We all agreed that 'major DB changes go outside this system' - and these must also be well-socialized with the team
    • Because setting this up to deal with major DB changes is super hard, and frankly it happens so rarely that it's not worth building something for very minor edge cases

Some technical hurdles we faced

  • Our app cannot run on any port other than 443, and must be using SSL (yes, even on the back end).  This is just a fact of life we have to live with for now - everyone acknowledges that it's bad.  We are not superheroes.  :)
    • Because of this, we cannot have two applications running at the same time.  This means our failback process has to re-warm the old appPool.  Not world-ending, but takes another 1-2m per server.
      • This means we're applying this concept 'multiple IIS sites, only one is running'
      • I initially wrote this process to use ports instead of sites :)
  • Octopus doesn't account for this scenario:
    • Server1 deploy passes
    • Server2 deploy fails
    • Server2 is rolled back
      • What about Server1!?
  • Querying Elasticsearch is really hard, I don't know why.  I was able to figure out sending data to ES no problem, but getting data back (search results) never seemed to work how I expected.  I am probably doing it wrong.
    • This guy's stuff works: https://github.com/gigi81/elastico
    • But even it has some quirks: Lucene is particular about single versus double quotes, so you have to do weird stuff to keep double quotes intact.
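# Single-quoted format string, so the double quotes around the app name reach Lucene intact
$errorQueryErrors = 'application_name="{0}" AND response:500' -f $newAppName
Write-Output "Querying elk for errors: $errorQueryErrors"
# Search only the window since logging started for this canary
$errorSearchOutput = Search-ESLastMinutes -indexShortName $indexShortName -querySearchTerms $errorQueryErrors -startDate $loggingStartTimestamp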
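
If more than one query gets built this way, the quoting trick is worth wrapping up once. A minimal sketch (New-CanaryQuery is a hypothetical helper name, not part of Elastico):

```powershell
function New-CanaryQuery {
    param(
        [Parameter(Mandatory)][string]$AppName,
        [Parameter(Mandatory)][string]$Clause
    )
    # Single-quoted format string: PowerShell leaves the embedded double quotes alone,
    # so Lucene sees application_name as an exact phrase match.
    return ('application_name="{0}" AND {1}' -f $AppName, $Clause)
}

# Same query as above, built through the helper
$errorQuery = New-CanaryQuery -AppName 'app.domain.com-3457' -Clause 'response:500'
Write-Output "Querying elk for errors: $errorQuery"
```

The alternative is a double-quoted string with backtick-escaped quotes, which works but is much easier to get wrong.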

Process outline


  • Octopus rolling deploy, one server at a time
    • Deploy our 'deployscripts' package - contains powershell scripts, modules, etc
    • Generate variables - we need a known-good 'old' fallback point, and the safest time to capture it is before we deploy anything
      • Creates oldAppName and newAppName (e.g. app.domain.com-3456 and app.domain.com-3457)
    • Canary start - deregister the instance from the ALB target group, stop the old site, clean up old sites (aside from oldAppName) - at this point no traffic is flowing to this host and all IIS sites are stopped on this host
    • Deploy the new code via Octopus IIS site thingy
      • New site and appPool get called newAppName
    • Canary finish
      • Test new site exists and is started
      • Warmup new appPool - we've discovered that a simple Invoke-WebRequest to the right URI is enough
        • Run this 5-6 more times and verify the response times are below 100ms
      • Register the instance with the ALB target group
      • Wait for X amount of time
        • Still undecided, probably 2m
      • Query Elasticsearch for response times and error rates (arbitrary thresholds based on pre-existing normal levels)
        • Using the Elastico module: https://github.com/gigi81/elastico
      • If there is failure, revert.
        • Deregister, stop new site, start old site, warmup old site, register, drop a failure metric
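
The warmup step in the outline above can be sketched like this. It is a rough version, not our production script: Invoke-Warmup and its parameters are hypothetical names, and the request is passed in as a scriptblock purely so the loop can be tried without a live site.

```powershell
function Invoke-Warmup {
    param(
        [Parameter(Mandatory)][string]$Uri,
        [int]$Attempts = 6,              # "5-6 more times" from the outline
        [int]$MaxMilliseconds = 100,     # response time ceiling from the outline
        # Injectable so the loop can be exercised without a real IIS site
        [scriptblock]$Request = { param($u) Invoke-WebRequest -Uri $u -UseBasicParsing | Out-Null }
    )

    # First hit pays the appPool start-up cost, so its timing is ignored
    & $Request $Uri | Out-Null

    # Time the follow-up requests
    $timings = foreach ($i in 1..$Attempts) {
        (Measure-Command { & $Request $Uri }).TotalMilliseconds
    }

    # Pass only if every follow-up request came in under the ceiling
    $slow = @($timings | Where-Object { $_ -gt $MaxMilliseconds })
    [pscustomobject]@{
        Timings = @($timings)
        Passed  = ($slow.Count -eq 0)
    }
}
```

In the canary finish script this would be called with the real site URI, and the instance would only be registered again if Passed comes back true.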

Important next steps

Some initial demos have shown me a few basic areas that need improvement:
  1. Maintainability - switching over to script files instead of script code in Octopus, so no more copy-pasting (I'd been wanting to learn how to do this for a while anyway - turns out it's easy, and it's done)
  2. The 'interesting metrics' piece needs to be easy to expand on (seeing as how this is the linchpin of the whole Cluster Immune System)
  3. Figure out the revert scenario

Stuff to figure out

I'm still stuck on how we'll keep track of server status.  Toying with Redis that gets flushed before each run, then each server sets status?  Or some way to export an Octopus variable, and then we check for that variable?  Or just re-poll all servers, and if they have newAppName running, revert? (this seems most sane - do the revert on the first server to fail, then re-query all of them for newAppName -eq running)
We want the revert to happen ASAP, because it means customers are impacted somehow.
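
The re-poll option boils down to a filter over whatever status we can collect from each box. A minimal sketch, assuming each server reports which site it currently has started (Get-ServersToRevert and the Server/RunningSite property names are hypothetical; how the status gets collected is left out):

```powershell
function Get-ServersToRevert {
    param(
        [Parameter(Mandatory)][object[]]$ServerStatus,
        [Parameter(Mandatory)][string]$NewAppName
    )
    # Any server still serving the new site needs the revert,
    # including ones whose own canary check passed earlier.
    $ServerStatus | Where-Object { $_.RunningSite -eq $NewAppName }
}

# The Server1/Server2 scenario from the hurdles list, plus a box not yet reached
$status = @(
    [pscustomobject]@{ Server = 'Server1'; RunningSite = 'app.domain.com-3457' }  # deploy passed
    [pscustomobject]@{ Server = 'Server2'; RunningSite = 'app.domain.com-3456' }  # failed, already rolled back
    [pscustomobject]@{ Server = 'Server3'; RunningSite = 'app.domain.com-3456' }  # never deployed
)
$toRevert = @(Get-ServersToRevert -ServerStatus $status -NewAppName 'app.domain.com-3457')
$toRevert | ForEach-Object { Write-Output "Revert needed on $($_.Server)" }
```

The nice part of this approach is that there is no extra state store to keep in sync: IIS itself is the source of truth for what each server is running.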

Maintainability is an ongoing lesson for us, and really means we must get ourselves involved in the development teams' daily activities.

Interesting metrics will be a learning process.  Two parts:
  1. The idea of knowing what to expect with a change - admittedly it's a question of visibility into and knowledge of the rest of the system
  2. The idea of knowing what are the core/critical things to track and actually tracking them - and doing something if things go off the rails
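
One way to keep the metrics piece easy to expand is to treat each check as data and loop over the list. This is a sketch under assumptions: Test-CanaryMetrics is a hypothetical name, the response_time_ms field and both thresholds are made up for illustration, and the actual Elasticsearch lookup is passed in as a scriptblock that returns a hit count.

```powershell
# Each check is just data: adding a metric means adding a row.
# Field names and thresholds below are illustrative, not our real ones.
$checks = @(
    @{ Name = 'HTTP 500s';      Clause = 'response:500';          MaxHits = 0 }
    @{ Name = 'Slow responses'; Clause = 'response_time_ms:>100'; MaxHits = 5 }
)

function Test-CanaryMetrics {
    param(
        [Parameter(Mandatory)][hashtable[]]$Checks,
        # Takes a query clause, returns the number of matching log entries
        [Parameter(Mandatory)][scriptblock]$CountHits
    )
    foreach ($check in $Checks) {
        $hits = & $CountHits $check.Clause
        [pscustomobject]@{
            Name   = $check.Name
            Hits   = $hits
            Passed = ($hits -le $check.MaxHits)
        }
    }
}

# Stubbed lookup for demonstration; the real one would query Elasticsearch
$stub = { param($clause) if ($clause -eq 'response:500') { 3 } else { 0 } }
$results = @(Test-CanaryMetrics -Checks $checks -CountHits $stub)
$failed  = @($results | Where-Object { -not $_.Passed })
```

If $failed has anything in it, that is the trigger for the revert and the team notification.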

Future state

The goal is to apply this concept to all of our app pipelines in some way/shape/form.  We come out of this with:
  • Safer deploys
  • More frequent deploys
  • Better understanding of our systems
  • Better understanding of what change means
  • Happier customers
  • Happier product & business teams
  • Happier developer, QA, and ops folk
