Detailed review of our canary + IIS + AWS + Octopus process

Another 'clear my head' post.  Given the view count, that's all these ever are anyways.  :)

This is an expansion of the previous post, just a bit more detail, pictures, and a gist.

I would preface this with a note that we are in the Probably More Common Than The Internet Would Like To Admit camp of helping Legacy systems grow and blossom into less-Legacy systems.  So if I am not using the latest hotness, you may excuse my backwardness.  Also excuse blogger's formatting, which is also backwardness.

Basic Traffic Flow

  • Nginx is there as a legacy thing, and we ran into some interesting issues trying to put an ALB upstream target in Nginx (fancy DNS issues due to ELB/ALB autoscaling).
  • The instances are Windows boxes running IIS - always on port 80 - with the previous version left in place, even after a successful deploy, just in case (the amusing thing is that I had a 'port jumping' solution all written up, only to discover that our app can't deal with not-port-80 - yet)
  • AWS config is really pretty straightforward.  We didn't use Terraform here because we have a rule 'Terraform on net-new, just say no on pre-existing'.


The core app deploy is configured as a rolling deployment with a window of 1.  I am guessing we'll run into 'just canary one server and be done with it' arguments...but I see this as a benefit:

  • We have a small number of servers, so rolling out one-by-one doesn't take hours
  • Our metrics query looks at how THE NEW APP is behaving, not how THE SERVER is behaving.  This means that each subsequent run of Interesting Metrics is looking at a wider and wider subset of data, thus increasing the chance of badness being caught.
    • You could argue this limits the effectiveness of looking at a single server's state - but we don't care about a single server, we care about the new code we just rolled out!
      • Ok, we do sort of care about a single server

  • CanaryStart: Pull the instance from the Active ELB, clean up any old stopped sites, make the call on 'what are we reverting to' (Octo variable), and gather the new build number (another Octo variable) - then stop the active site. (we can only run the app on port 80 - legacy thing that we'll hopefully fix soon)
  • Deploy Core Web: Do the IIS package deploy of our main webApp, set ports, and set the IIS site naming to be coreapp.siplay.(TLD)-(buildNum)
  • CanaryWarmup: Fire some basic HTTP requests at the new app, keep doing this until there are several responses under 100ms.  Add the instance to the Canary ELB (actually an ALB w. target group)
  • Release and Run CritPath:  We're undecided on whether to keep this step, but essentially it would run through a small subset of our UI test suite and generate 'legitimate' traffic on the instance (and thus 'legitimate' traffic to query in the next step).  This one was actually kinda tricky, because during a rolling deploy in Octopus you can ONLY run stuff on the target server.  So, to trick Octopus, this step runs octo.exe and creates a release (that auto-triggers) on a dedicated Octo project that ONLY runs our CritPath category against our canary domain, and waits for it to complete (--progress)
  • CanaryInterestingMetrics-CanaryELB: Queries Elasticsearch for 'interesting metrics' (right now just response time and error rate) and does a pass/fail - failing this step (or any in here really) triggers a revert
  • CanaryChangeELBs: Removes the instance from our Canary ELB and adds it to the Active ELB. (this is a separate step so the 'interesting metrics' steps can be essentially the same code)
  • CanaryInterestingMetrics-ActiveELB: Same idea as before, but now we'll have a flood of real traffic to base our metrics checks on (the working theory here is that we'll have different 'interesting metrics' based on canary or active contexts).
  • CoreApp-CanaryRevertAll: This reverts ELB location and IIS running site to 'as before', based on the Octopus variables set in the first step - another step in our deploy process bombs out to Slack with @channel
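
For the curious, the CanaryWarmup idea above can be sketched roughly like this.  It's a minimal sketch, not our actual step: the function name, the parameter names, the meaning of 'several' (3 here), the attempt cap, and the localhost URL are all assumptions made up for illustration.  The probe is passed in as a script block so the loop can be exercised without a real site.

```powershell
# Sketch only: Wait-CanaryWarm and its parameters are hypothetical names, not our real step.
function Wait-CanaryWarm {
    param(
        # Script block that fires one request and returns its duration in milliseconds.
        [Parameter(Mandatory)] [scriptblock] $Probe,
        [int] $ThresholdMs  = 100,  # "fast enough" cutoff from the step description
        [int] $RequiredFast = 3,    # assumed meaning of "several"
        [int] $MaxAttempts  = 50,   # assumed safety cap so a cold app can't loop forever
        [int] $DelayMs      = 0     # optional pause between requests
    )

    $fast = 0
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $elapsedMs = & $Probe
        if ($elapsedMs -lt $ThresholdMs) { $fast++ }
        if ($fast -ge $RequiredFast) { return $true }
        if ($DelayMs -gt 0) { Start-Sleep -Milliseconds $DelayMs }
    }
    return $false
}

# Example probe against a placeholder URL (http://localhost/ is an assumption).
# Defined here but not invoked, so nothing hits the network until you call it.
$httpProbe = {
    (Measure-Command {
        Invoke-WebRequest -Uri 'http://localhost/' -UseBasicParsing | Out-Null
    }).TotalMilliseconds
}
```

Calling Wait-CanaryWarm -Probe $httpProbe returns $true once enough fast responses have come back and $false if the cap is hit first; in a deploy step you would presumably throw on $false so the revert kicks in.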

Interesting Metrics

Interesting metrics was the curious part.  It seemed so simple to me, reading the IMVU account of how their cluster immune system worked, that I made the crucial mistake of thinking we'd have a huge pile of metrics ideas lined up.  Since that didn't happen, I took the basics - what do we get paged on?  Error rate & response time, for starters!!
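
A pass/fail check on those two metrics can be sketched along these lines.  Again, this is a sketch under assumptions, not what we actually run: the function name, the 'status' and 'time_taken' field names, and both thresholds are invented for the example, and your index will almost certainly call things something else.

```powershell
# Sketch only: the function name, the 'status' and 'time_taken' fields, and both
# thresholds are assumptions for illustration, not our real index mapping or limits.
function Test-InterestingMetrics {
    param(
        [Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $Hits,
        [double] $MaxErrorRatePct  = 1.0,
        [double] $MaxAvgResponseMs = 500
    )

    # No traffic means no evidence either way; treat it as a failure.
    if ($Hits.Count -eq 0) {
        return [pscustomobject]@{
            HitCount = 0; ErrorRatePct = $null; AvgResponseMs = $null; Passed = $false
        }
    }

    # Count server errors (5xx) and average the response time across all hits.
    $errorCount    = @($Hits | Where-Object { $_.status -ge 500 }).Count
    $errorRatePct  = 100.0 * $errorCount / $Hits.Count
    $avgResponseMs = ($Hits | Measure-Object -Property time_taken -Average).Average

    [pscustomobject]@{
        HitCount      = $Hits.Count
        ErrorRatePct  = $errorRatePct
        AvgResponseMs = $avgResponseMs
        Passed        = ($errorRatePct -le $MaxErrorRatePct) -and ($avgResponseMs -le $MaxAvgResponseMs)
    }
}
```

The result is an object rather than a bare $true/$false so the numbers can be logged before deciding; throwing when Passed is $false is one way to fail the step.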

I then followed what was essentially the 'six stages of debugging' during my attempt to programmatically query Elasticsearch via PowerShell.  

Protip:  When, early in your search, you run across a very comprehensive thing that does exactly what you want, but in your foolish pride you think, 'How hard can this be?  Another 30m, that's all I need...'.  Just stop.  And use that thing.

Long story short, I wrote a terrible PS wrapper module on top of the existing PS module 'Elastico' (which is excellent), and quickly discovered that Lucene is even more bizarrely particular than I'd realized.  I am sure it's something specific to our implementation, but it certainly had me puzzling.  Also very puzzling is how easy it is to write to Elasticsearch (even given the arcane timestamp requirements!), but how difficult it is to read from it.

The Elastico wrapper really just has this in it (along with params):
    $queryTimeRange = "@timestamp:[$startDate TO $endDate]"
    if ($newApp) {
        # the assumption is that the application_name you are passing in is newly created, so a timeframe isn't necessary
        # iirc you get EVERYTHING - but since it's minutes of data, this is ok.
        $searchResults = Search-ElasticV5 -Node $elkUrl -Index $indexName -Query "$querySearchTerms" -Size 10000
    }
    else {
        $searchResults = Search-ElasticV5 -Node $elkUrl -Index $indexName -Query "$querySearchTerms AND $queryTimeRange" -Size 10000
    }
    return $searchResults
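
Since the time range string is where Lucene's particularity tends to bite, here is a hedged sketch of building it from real DateTime values.  New-TimeRangeQuery is a hypothetical helper, and the format (UTC, ISO 8601 with milliseconds) is an assumption; whether your index wants exactly this, or wants the values quoted, depends on your mapping.

```powershell
# Sketch only: New-TimeRangeQuery is a hypothetical helper, and the timestamp format is an
# assumption (UTC, ISO 8601 with milliseconds); check it against your own index mapping.
function New-TimeRangeQuery {
    param(
        [Parameter(Mandatory)] [datetime] $StartDate,
        [Parameter(Mandatory)] [datetime] $EndDate,
        [string] $Field = '@timestamp'
    )

    # Format with the invariant culture so regional settings can't change the separators.
    $format  = "yyyy-MM-dd'T'HH:mm:ss.fff'Z'"
    $culture = [System.Globalization.CultureInfo]::InvariantCulture
    $from    = $StartDate.ToUniversalTime().ToString($format, $culture)
    $to      = $EndDate.ToUniversalTime().ToString($format, $culture)

    # -f avoids PowerShell misreading "$Field:" as a scoped variable inside a string.
    '{0}:[{1} TO {2}]' -f $Field, $from, $to
}
```

The output can then be glued onto the search terms the same way as $queryTimeRange, e.g. "$querySearchTerms AND $(New-TimeRangeQuery -StartDate $startDate -EndDate $endDate)".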

Why -Size 10000?  Because our prod environment has a huge amount of traffic flowing, and a single page of 100 hits tends to show error rates of 30% or 50% (the actual rate is 0.03% once you count all 30-40k ES hits).  And Elastico/whatever it uses maxes out at 10k results (probably Elasticsearch itself, since its default index.max_result_window is 10,000).

Well that was fun!

The next post will be in a few weeks once we've got past the teething stage - will post up some actual real-world improvement detail.

