Posts

Team health checks - now with more teams!

I had the chance to help out with team (squad) health checks again, this time with eight dev teams (the last occasion was just two) - seven of which happened in a single week!  One of the teams had done a few health checks in the years prior, and it was really cool to see their responses and approaches mature over time.

As before, a ton of learning and a great experience all around, with some really cool insight from the folks participating:
On the topic of 'delivering value', one team member raised his hand after the team had said their piece and noted: "You [the team] have been describing delivering quality.  Delivering value is something else."  (which of course spurred other discussions and thoughts)
Success metrics/OKRs have the power of crystallizing the organizational mission as results
Legacy code/platforms constrain present/future options (i.e. tech debt doesn't just require recoup work in future, it prevents you from doing some things in future - as if to say, …

Packer, WinRM, AMIs - making it work

This has been a pretty frustrating task for me, so it felt like a good thing to write about.  Pretty much every source had one thing or another wrong, and I'd ended up just mashing it all into one big huge WinRM script glob.

I would preface this with 'it's a packer build, not concerned with security and encryption', which is not terribly security-conscious of me, but here we are.  One of the articles I found whilst googling had a basic SSL setup, but I opted to start with the Packer documentation after many many failures.

If you start with said documentation here (https://www.packer.io/intro/getting-started/build-image.html#a-windows-example) you will (possibly) discover that you are still blocked by 'waiting for winrm'.  So add this to your user-data WinRM setup portion:
set-item WSMan:\localhost\Client\AllowUnencrypted -Value True -Force
set-item WSMan:\localhost\Client\Auth\Basic -Value True -Force
set-item WSMan:\localhost\Client\TrustedHosts -Value * -Force
E…

My roadmap learning experience so far

This is now my second year having to come up with a roadmap for what we call 'release', so while I'm not old hat, I am definitely learning and getting a little better!  One of the fun aspects of this is looking at the 'how you go about roadmapping', as in, if you are starting from scratch or not, what are the exercises you can go through to help you develop a plan for the coming year?

We're fortunate enough to have a CTO who pushes things like value proposition design (pain-gain diagrams), and encourages us to really think through what it is we are trying to do - plus takes the time to guide us himself.  Last year's roadmap started with a fresh canvas and the question 'if you could only work on one thing, what would it be?'.  This was to help focus my scattered brain, and I found it a great starting point.  As I am also a visual person, draw.io was used extensively to sketch out my thoughts (also used my whiteboard, but draw.io is portable).

What fo…

Application-generated metrics - Part1: What is the difference between Metrics, Events, and Logs?

We're working on making 'application-generated metrics' an accessible thing through some sort of metrics framework.  The subtle goal is that we'd like to provide the ability for metrics-driven decisions to become a viable option, and to gently push back on "sales-driven development".

Given that I spent an entire day trying to figure out the answer to this question of 'how are metrics/events/logs different', it's probably worth taking the time to write it down.

DISCLAIMER: This is not a scientific paper, so do your own research to refute or support the following...
Backstory
What is 'application-generated metrics'?  An old concept that I'm probably wording poorly.  Essentially, as your code does stuff, it should tell you about it.  Keeping track of important flows like registration and payments should be boosted by having dashboards/monitoring that tracks 'registration failure' or 'payment failure'.  This was my original …
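As a rough illustration of 'your code tells you about what it does', here is a minimal Python sketch. Everything in it is an assumption made for the example: the in-memory `metrics` counter stands in for a real metrics backend, and `record`, `register_user`, and the metric names are hypothetical, not part of any framework mentioned in the post.

```python
from collections import Counter

# Hypothetical in-memory metric store (assumption for this sketch);
# a real setup would ship these counts to a metrics backend.
metrics = Counter()


def record(metric_name: str) -> None:
    """Increment a named counter each time the application does something notable."""
    metrics[metric_name] += 1


def register_user(email: str) -> bool:
    """Toy registration flow that reports its own outcome as a metric."""
    if "@" not in email:
        # The failure path announces itself, so a dashboard can track it.
        record("registration.failure")
        return False
    record("registration.success")
    return True


register_user("alice@example.com")
register_user("not-an-email")
```

A dashboard or alert for 'registration failure' would then read the `registration.failure` count rather than digging through logs after the fact.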

Update on canary - how has our release impact changed?

We had two primary targets for our canary process - our monolith has a main web app and a main api - deployments to those two using normal deploy processes cause 'release impact'.  Namely, all our systems go wonky and the scope of 'what gets disrupted' is sometimes even kinda unknown (or rather, has never been tabulated).
For the backstory, check out the other two posts on this topic:
http://blog.practicaltech.ca/2017/08/canary-deployments-of-iis-using-octopus.html
http://blog.practicaltech.ca/2017/09/detailed-review-of-our-canary-iis-aws.html
TL;DR
Customer impact with canary is now 0 (barring a broken deploy)
We can now release at will instead of inside low-throughput release windows
We have fast feedback via metrics that will automatically revert a failed deployment
The actual deploy process is longer, but that's ok - because we don't have a time restriction
Some definitions...
Impact window (length of release impact) - how long are users and systems impacted negat…
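The 'metrics that automatically revert a failed deployment' idea can be pictured with a small Python sketch. This is only an illustration under assumptions: the function name `should_revert`, the error-rate metric, and the 5% threshold are all hypothetical, and the post does not say which metrics or thresholds the real process uses.

```python
def should_revert(canary_errors: int, canary_requests: int,
                  max_error_rate: float = 0.05) -> bool:
    """Return True when the canary's error rate exceeds the allowed threshold.

    The 5% default is an assumed value for illustration only.
    """
    if canary_requests == 0:
        # No traffic yet means no evidence either way; keep waiting.
        return False
    return canary_errors / canary_requests > max_error_rate
```

A deploy step could poll a check like this while the canary takes a slice of traffic, and roll back as soon as it returns True.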

Detailed review of our canary + IIS + AWS + Octopus process

Another 'clear my head' post.  Given the view count, that's all these ever are anyways.  :)

This is an expansion of the previous post, just a bit more detail, pictures, and a gist.

I would preface this with a note that we are in the Probably More Common Than The Internet Would Like To Admit camp of helping Legacy systems grow and blossom into less-Legacy systems.  So if I am not using the latest hotness, you may excuse my backwardness.  Also excuse blogger's formatting, which is also backwardness.
Basic Traffic Flow
Nginx is there as a legacy thing, and we ran into some interesting issues trying to put an ALB upstream target in Nginx (fancy DNS issues due to ELB/ALB autoscaling).
The instances are Windows boxes running IIS - always on port 80 - with the previous version left in place, even after a successful deploy, just in case (the amusing thing is that I had a 'port jumping' solution all written up, only to discover that our app can't deal with not-port-…

Canary deployments of IIS using Octopus & AWS

I'm writing this more to clear my head than anything else.  If it helps someone, great.

We have measured significant 'release impact' when deploying one of our core applications.   The main problem is initialization/warmup of the appPool.  We've tried the built-in methods, but for whatever reason we are always stuck with ~25s of dead time while the first request warms things up (we assume it's warming things up, not really sure what is happening).  After that 25s wait things are very snappy and fast, so how do we prevent all of our web servers from going into 25s of dead time with production traffic inbound?
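One way to picture the goal, as a rough Python sketch and not a description of our actual process: keep sending requests to a freshly deployed instance until one comes back quickly, and only then let it take production traffic. The `warm_up` function, its parameters, and the injected `send_request` callable are all hypothetical names for this example; in practice `send_request` would be an HTTP call to the instance.

```python
import time
from typing import Callable


def warm_up(send_request: Callable[[], None],
            fast_enough_s: float = 1.0,
            max_attempts: int = 10) -> bool:
    """Call send_request until one call completes within fast_enough_s seconds.

    Returns True once a fast response is seen, False if every attempt was slow.
    """
    for _ in range(max_attempts):
        started = time.monotonic()
        send_request()  # the first call absorbs the slow appPool initialization
        if time.monotonic() - started <= fast_enough_s:
            return True
    return False
```

The point of the sketch is only that the slow first request is paid by a synthetic call instead of by a customer.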

Starting point - why do this?
We care about our customers, and we want to help drive our business forward with as much quality/safety/speed as possible.

Because we want to drive our business forward, we are pushing to do more and more deploys (currently we do a daily deploy, but want to see 5x that) (if you have to ask why we want 5x, read this).  Because we ca…