My roadmap learning experience so far

This is now my second year coming up with a roadmap for what we call 'release', so while I'm no old hand, I am definitely learning and getting a little better!  One of the fun aspects of this is looking at how you go about roadmapping: whether you are starting from scratch or not, what exercises can you go through to help you develop a plan for the coming year?

We're fortunate enough to have a CTO who pushes things like value proposition design (pain-gain diagrams) and encourages us to really think through what it is we are trying to do - plus takes the time to guide us himself.  Last year's roadmap started with a fresh canvas and the question 'if you could only work on one thing, what would it be?'.  This was to help focus my scattered brain, and I found it a great starting point.  As I am also a visual person, I sketched out my thoughts extensively (I also used my whiteboard, but it isn't portable).

What fo…

Application-generated metrics - Part 1: What is the difference between Metrics, Events, and Logs?

We're working on making 'application-generated metrics' an accessible thing through some sort of metrics framework.  The subtle goal is that we'd like to provide the ability for metrics-driven decisions to become a viable option, and to gently push back on "sales-driven development".

Given that I spent an entire day trying to figure out the answer to this question of 'how are metrics/events/logs different', it's probably worth taking the time to write it down.

DISCLAIMER: This is not a scientific paper, so do your own research to refute or support the following...
Backstory
What is 'application-generated metrics'?  An old concept that I'm probably wording poorly.  Essentially, as your code does stuff, it should tell you about it.  Important flows like registration and payments should be backed by dashboards/monitoring that track 'registration failure' or 'payment failure'.  This was my original …
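To make the three shapes concrete, here is a toy sketch (hypothetical names, not our actual framework): a metric is a pre-aggregated number, an event is one structured record per occurrence, and a log line is free-form text for a human.

```python
import json
import logging
import time
from collections import Counter

# Toy illustration only - these names are made up for this example.
metrics = Counter()             # metrics: counts/gauges, cheap, no per-occurrence detail
log = logging.getLogger("app")  # logs: free-form text, for a human debugging later

def record_metric(name, value=1):
    metrics[name] += value      # you keep a number, not the individual occurrences

def record_event(kind, **fields):
    # events: one structured, queryable record per occurrence
    return json.dumps({"event": kind, "ts": time.time(), **fields})

def register(email):
    if "@" not in email:
        record_metric("registration.failure")
        log.warning("registration failed for %s: bad email", email)
        return record_event("registration_failed", email=email, reason="bad_email")
    record_metric("registration.success")
    return record_event("registered", email=email)

register("not-an-email")
```

Same failure, told three ways: the metric can drive a 'registration failure' dashboard cheaply, the event lets you ask "which emails failed?", and the log is what you read at 2am.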

Update on canary - how has our release impact changed?

We had two primary targets for our canary process - our monolith has a main web app and a main api - deployments to those two using normal deploy processes cause 'release impact'.  Namely, all our systems go wonky and the scope of 'what gets disrupted' is sometimes even kinda unknown (or rather, has never been tabulated).
For the backstory, check out the other two posts on this topic:
TL;DR
- Customer impact with canary is now 0 (barring a broken deploy)
- We can now release at will instead of inside low-throughput release windows
- We have fast feedback via metrics that will automatically revert a failed deployment
- The actual deploy process is longer, but that's ok - because we don't have a time restriction

Some definitions...
Impact window (length of release impact) - how long are users and systems impacted negat…
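The "metrics automatically revert a failed deployment" part boils down to a comparison gate. A minimal sketch of one possible gate (the function, thresholds, and names here are hypothetical, not our actual implementation): revert when the canary's error rate is meaningfully above the baseline's.

```python
def should_revert(canary_error_rate, baseline_error_rate,
                  max_ratio=2.0, min_floor=0.01):
    """Hypothetical revert gate: fail the canary if its error rate is both
    above a small absolute floor (so noise like 0.0001 -> 0.0003 doesn't
    trigger a revert) and more than `max_ratio` times the baseline's."""
    return (canary_error_rate > min_floor and
            canary_error_rate > max_ratio * baseline_error_rate)
```

In practice you'd feed this from the same dashboards that measure the impact window, and wire a True result into an automatic rollback step.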

Detailed review of our canary + IIS + AWS + Octopus process

Another 'clear my head' post.  Given the view counts, that's all these ever are anyway.  :)

This is an expansion of the previous post, just a bit more detail, pictures, and a gist.

I would preface this with a note that we are in the Probably More Common Than The Internet Would Like To Admit camp of helping Legacy systems grow and blossom into less-Legacy systems.  So if I am not using the latest hotness, you may excuse my backwardness.  Also excuse blogger's formatting, which is also backwardness.
Basic Traffic Flow
- Nginx is there as a legacy thing, and we ran into some interesting issues trying to put an ALB upstream target in Nginx (fancy DNS issues due to ELB/ALB autoscaling).
- The instances are Windows boxes running IIS - always on port 80 - with the previous version left in place, even after a successful deploy, just in case (the amusing thing is that I had a 'port jumping' solution all written up, only to discover that our app can't deal with not-port-…
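For anyone hitting the same "fancy DNS issues": the usual cause is that Nginx resolves an upstream hostname once at startup and caches those IPs forever, while the ALB's IPs rotate as it scales. A common mitigation (a sketch with placeholder names, not necessarily what we did) is to give Nginx a `resolver` and put the ALB name in a variable, which forces re-resolution at runtime:

```nginx
server {
    listen 80;
    resolver 10.0.0.2 valid=10s;  # your VPC DNS resolver; re-check every 10s
    set $alb_host "internal-myapp.us-east-1.elb.amazonaws.com";  # placeholder name

    location / {
        # proxy_pass with a variable makes Nginx do a runtime DNS lookup,
        # so it picks up new ALB IPs instead of pinning the startup ones.
        proxy_pass http://$alb_host;
    }
}
```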

Canary deployments of IIS using Octopus & AWS

I'm writing this more to clear my head than anything else.  If it helps someone, great.

We have measured significant 'release impact' when deploying one of our core applications.   The main problem is initialization/warmup of the appPool.  We've tried the built-in methods, but for whatever reason we are always stuck with ~25s of dead time while the first request warms things up (we assume it's warming things up, not really sure what is happening).  After that 25s wait things are very snappy and fast, so how do we prevent all of our web servers from going into 25s of dead time with production traffic inbound?
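One way to avoid serving that 25s of dead time to customers is to warm each box yourself before it takes production traffic. A minimal sketch (hypothetical helper, assuming some URL on the instance exercises the appPool): poll until a request comes back fast, then - and only then - put the node behind the load balancer.

```python
import time
import urllib.request

def warm_up(url, timeout=60.0, fast_enough=1.0, probe=None):
    """Poll `url` until a request completes in under `fast_enough` seconds
    (i.e. the appPool looks warm), giving up after `timeout` seconds total.
    `probe` is injectable for testing; by default it times a real request."""
    if probe is None:
        def probe():
            start = time.monotonic()
            urllib.request.urlopen(url, timeout=30).read()
            return time.monotonic() - start
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe() < fast_enough:
                return True   # warm: safe to register with the load balancer
        except OSError:
            pass              # app may still be initializing; keep trying
        time.sleep(0.5)
    return False
```

The first probe eats the ~25s initialization on your behalf; subsequent ones confirm things are snappy before real users arrive.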

Starting point - why do this? We care about our customers, and we want to help drive our business forward with as much quality/safety/speed as possible.

Because we want to drive our business forward, we are pushing to do more and more deploys (currently we do a daily deploy, but want to see 5x that) (if you have to ask why we want 5x, read this).  Because we ca…

Lessons over Chinese BBQ

Met with a long-time friend and mentor last night for delicious Chinese BBQ and lessons.  Part of the conversation moved to idea/app development, and clarifying its stages and their scopes.
Stages of product development
Proof of concept
Scope: Validating an idea, or a concept i.e. "something that's never been done before"
Ends when the idea/concept is validated or disqualified
Mitigates risk/cost at the expense of time (estimating this is naturally hard because it's a new thing)
Is thrown away entirely (aside from lessons learned) at the end.
Should be coded as such! (i.e. hard code connection strings, because things like setting up config files is waste for this stage)

Scope: Laying the framework for your production thing.
Minimum feature set because the goal here is 'establish the product, build the foundation'.
Once the groundwork is laid, architecture set up, additional features can be added.
Should be coded as 'this will end up being production, …

Jmeter-Terraform - Dealing with AWS ELB IP changes during load testing

We ran into an issue when trying to load test our 'new' production environment - the ELB IP addresses change as it silently auto-scales.  And since you're throwing load at it, of course those IPs will change!

When we started the tests, we had updated the hosts file on the master/slave Jmeter nodes, but of course at some point during the day the IPs changed and we got a pile of timeouts.  After much gnashing of teeth, we found the DNS Cache Manager in Jmeter, and figured there were two solutions:

1. Use the DNS Cache Manager to point to a custom DNS server (that we set up), and have that DNS server deal with keeping the IPs up to date.
2. Write a cron job to run on each node that updates the hosts file, then use the DNS Cache Manager to 'clear cache on each iteration'.

We elected to do #2 (since our Jmeter infra is built/destroyed a lot, it didn't make sense to have another server in the mix), and here's what we ended up with.
Terraform applies userdata to each EC2 i…
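The hosts-updating part of option #2 can be sketched like this (hypothetical helper names; the real cron job could just as well be a shell one-liner): re-resolve the ELB name, then rewrite its lines in /etc/hosts so Jmeter's 'clear cache on each iteration' picks up the fresh IPs.

```python
import socket

def resolve_ips(hostname):
    """Current A-record IPs for the ELB/ALB name (these rotate as it scales)."""
    _, _, ips = socket.gethostbyname_ex(hostname)
    return sorted(ips)

def updated_hosts(hosts_text, hostname, ips):
    """Return a new /etc/hosts body with stale entries for `hostname`
    removed and fresh ones appended."""
    kept = [line for line in hosts_text.splitlines()
            if hostname not in line.split()]
    fresh = [f"{ip} {hostname}" for ip in ips]
    return "\n".join(kept + fresh) + "\n"
```

Run from cron every minute or so, writing `updated_hosts(...)` back over /etc/hosts; the DNS Cache Manager then stops Jmeter's JVM from caching the old answer between iterations.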