Update on canary - how has our release impact changed?

We had two primary targets for our canary process - our monolith has a main web app and a main API - and deployments to those two via the normal deploy process cause 'release impact'.  That is, all our systems go wonky, and the scope of 'what gets disrupted' is somewhat unknown (or rather, has never been tabulated).
For the backstory, check out the other two posts on this topic:
TL;DR

- Customer impact with canary is now 0 (barring a broken deploy)
- We can now release at will instead of inside low-throughput release windows
- We have fast feedback via metrics that will automatically revert a failed deployment
- The actual deploy process is longer, but that's ok - because we don't have a time restriction

Some definitions...

- Impact window (length of release impact) - how long are users and systems impacted negat…
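The 'automatically revert' piece can be sketched as a simple metric gate. This is a sketch, not our actual pipeline step, and `check_metrics` is a hypothetical command (not from the post) that exits non-zero when the canary looks unhealthy:

```shell
#!/bin/sh
# Sketch of the metric gate behind auto-revert. check_metrics is a
# hypothetical command that exits non-zero when the canary is unhealthy.
canary_gate() {
  checks="$1"     # how many metric samples to take
  interval="$2"   # seconds between samples
  check_cmd="$3"  # command that inspects the canary's metrics
  i=0
  while [ "$i" -lt "$checks" ]; do
    if ! $check_cmd; then
      echo "revert"    # hand control back to the deploy tool to roll back
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "promote"       # canary held up for the whole window
}

# Usage: canary_gate 30 10 check_metrics   # samples for ~5 minutes
```

The point is that the gate, not a human watching dashboards, decides promote-vs-revert - which is what buys the 'release at will' behaviour.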

Detailed review of our canary + IIS + AWS + Octopus process

Another 'clear my head' post.  Given the view count, that's all these ever are anyway.  :)

This is an expansion of the previous post, just a bit more detail, pictures, and a gist.

I would preface this with a note that we are in the Probably More Common Than The Internet Would Like To Admit camp of helping Legacy systems grow and blossom into less-Legacy systems.  So if I am not using the latest hotness, you may excuse my backwardness.  Also excuse Blogger's formatting, which is its own kind of backwardness.
Basic Traffic Flow
- Nginx is there as a legacy thing, and we ran into some interesting issues trying to put an ALB upstream target in Nginx (fancy DNS issues due to ELB/ALB autoscaling).
- The instances are Windows boxes running IIS - always on port 80 - with the previous version left in place, even after a successful deploy, just in case (the amusing thing is that I had a 'port jumping' solution all written up, only to discover that our app can't deal with not-port-…
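For what it's worth, the usual workaround for the ALB-upstream DNS problem is to give Nginx a `resolver` and defer resolution to request time via a variable. This is a sketch, not our exact config, and the hostname is a placeholder:

```nginx
# Force re-resolution of the ALB's DNS name instead of caching IPs at startup.
resolver 169.254.169.253 valid=10s;  # the VPC's Route 53 Resolver address

server {
    listen 80;
    location / {
        # Using a variable in proxy_pass makes nginx resolve at request time,
        # so autoscaling IP changes are picked up within valid=10s.
        set $alb_host "my-alb-123456.us-east-1.elb.amazonaws.com";  # placeholder
        proxy_pass http://$alb_host;
    }
}
```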

Canary deployments of IIS using Octopus & AWS

I'm writing this more to clear my head than anything else.  If it helps someone, great.

We have measured significant 'release impact' when deploying one of our core applications.   The main problem is initialization/warmup of the appPool.  We've tried the built-in methods, but for whatever reason we are always stuck with ~25s of dead time while the first request warms things up (we assume it's warming things up, not really sure what is happening).  After that 25s wait things are very snappy and fast, so how do we prevent all of our web servers from going into 25s of dead time with production traffic inbound?
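One way to sidestep that dead time is to eat the ~25s ourselves with synthetic requests before the box takes live traffic. A minimal sketch - the URL, retry count, and overridable fetch command are illustrative, not our actual values:

```shell
#!/bin/sh
# Sketch: warm an IIS appPool before the box takes live traffic, so the
# ~25s first-request cost is paid by us, not a customer. The URL and retry
# count are illustrative; the fetch command is overridable for testing.
warm_up() {
  url="$1"
  tries="${2:-30}"
  fetch="${3:-curl -fsS -o /dev/null --max-time 60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # The first successful request absorbs the warmup; later ones are snappy.
    if $fetch "$url"; then
      echo "warm"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "failed"
  return 1
}

# Usage: warm_up "http://localhost:80/" && <put the instance into rotation>
```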

Starting point - why do this? We care about our customers, and we want to help drive our business forward with as much quality/safety/speed as possible.

Because we want to drive our business forward, we are pushing to do more and more deploys (currently we do a daily deploy, but want to see 5x that) (if you have to ask why we want 5x, read this).  Because we ca…

Lessons over Chinese BBQ

Met with a long-time friend and mentor last night for delicious Chinese BBQ and lessons.  Part of the conversation moved to idea/app development, and clarifying the stages and clear scopes.
Stages of product development

Proof of concept
Scope: Validating an idea, or a concept i.e. "something that's never been done before"
Ends when the idea/concept is validated or disqualified
Mitigates risk/cost at the expense of time (estimating this is naturally hard because it's a new thing)
Is thrown away entirely (aside from lessons learned) at the end.
Should be coded as such! (i.e. hard code connection strings, because things like setting up config files is waste for this stage)

Scope: Laying the framework for your production thing.
Minimum feature set because the goal here is 'establish the product, build the foundation'.
Once the groundwork is laid, architecture set up, additional features can be added.
Should be coded as 'this will end up being production, …

Jmeter-Terraform - Dealing with AWS ELB IP changes during load testing

We ran into an issue when trying to load test our 'new' production environment - the ELB IP addresses change as the load balancer silently auto-scales.  And since you're throwing load at it, of course those IPs will change!

When we started the tests, we had updated the hosts file on the master/slave Jmeter nodes, but of course at some point during the day the IPs changed and we got a pile of timeouts.  After much gnashing of teeth, we found the DNS Cache Manager in Jmeter, and figured there were two solutions:

1. Use the DNS Cache Manager to point to a custom DNS server (that we set up), and have that DNS server deal with keeping the IPs up to date.
2. Write a cron job to run on each node that updates the hosts file, then use the DNS Cache Manager to 'clear cache on each iteration'.

We elected to do #2 (since our Jmeter infra is built/destroyed a lot, it didn't make sense to have another server in the mix), and here's what we ended up with.
Terraform applies userdata to each EC2 i…
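A minimal sketch of the cron-driven hosts-file refresh described above - the ELB hostname is a placeholder, and the `dig` line in the usage comment assumes a standard Linux Jmeter node:

```shell
#!/bin/sh
# Sketch: keep the ELB's current IPs in /etc/hosts on each Jmeter node.
# TARGET_HOST is a placeholder for the ELB DNS name under test.
TARGET_HOST="${TARGET_HOST:-my-elb.example.com}"
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
MARKER="# jmeter-elb"

# Turn a hostname plus a list of IPs into marked hosts-file lines.
render_entries() {
  host="$1"; shift
  for ip in "$@"; do
    printf '%s %s %s\n' "$ip" "$host" "$MARKER"
  done
}

refresh_hosts() {
  host="$1"; hosts_file="$2"; shift 2
  # Drop our old marked entries, keep everything else, append fresh ones.
  grep -v "$MARKER" "$hosts_file" > "$hosts_file.tmp" || true
  render_entries "$host" "$@" >> "$hosts_file.tmp"
  mv "$hosts_file.tmp" "$hosts_file"
}

# Usage (e.g. from cron every minute):
#   ips=$(dig +short "$TARGET_HOST" | grep -E '^[0-9]')
#   refresh_hosts "$TARGET_HOST" "$HOSTS_FILE" $ips
```

With the DNS Cache Manager clearing Jmeter's cache each iteration, every sample re-reads the hosts file and picks up whatever IPs the last cron run wrote.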

Another round of team health checks is upon us

And I've learned a few things...

- Take detailed notes of the discussion - you can use these to refresh memories, provide contrast, use for discussion points, etc.
- Take care to note the action items, the expected outcomes, the action item owners, and update notes on this as time goes on (if you're involved with the teams)
- If you are doing it remotely, remember that GoToMeeting has a 6 camera limit - just because someone's face isn't on the screen doesn't mean they aren't there!  Keep everyone involved!
- Don't let newbies off the hook - they tend to have good insight as a relative outsider
- If a topic is voted all one colour, get people to talk about why they voted that way
- Try to ensure that everyone is there, especially if the team has a track record of appreciating the time spent doing this
- Note outliers/contextual changes on the results doc
  - i.e. A team recently took over a function.  Their lack of visibility into that function had previously led them to vote 'yel…

Terraform/Jmeter performance testing - practical experience

Over the last while we've had the opportunity to put our new Jmeter learnings to work.

- Bug came up that was only evident under load - we were able to reproduce it in our dev environments!  The dev ran Jmeter off his laptop, and it was enough load to generate the bug.
- I think I mentioned last time how the simple act of mapping out a Jmeter script revealed excess calls to our middleware - tickets were created to address this.
- QA has used it to help draw out issues with a new production environment, but...

...the other day they ran out of steam on their laptops.  So we got to come back to the Terraform/Jmeter setup we built a few months back.  Thankfully everything still worked, and we were quickly (15m) on our feet with 1 master and 6 slaves (c4.large) raising heck.
This is where I will talk about the lessons we learned today...

- Terraform is amazing and was totally worth the time investment
- If you are testing a cold production environment - ASK ABOUT HOSTS FILE CHANGES!!  We hammer…
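For reference, the distributed run from the master boils down to a single non-GUI Jmeter invocation - the slave hostnames and file names below are placeholders, not our actual infra:

```shell
# Non-GUI distributed Jmeter run, kicked off from the master node.
# slave1..slave6 and the file names are placeholders.
SLAVES="slave1,slave2,slave3,slave4,slave5,slave6"
JMETER_CMD="jmeter -n -t loadtest.jmx -R $SLAVES -l results.jtl"

# Run it once the slaves' jmeter-server processes are up:
#   $JMETER_CMD
echo "$JMETER_CMD"
```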