
So what would you say you do here? Release management?

I was recently asked to give a presentation on 'release' to the entire company.  Putting it together gave me plenty of time to ruminate on the question 'what value are you delivering?'.

Note:  I'm writing this as a mental exercise; working on clarifying my purpose.  Much of this is naive (and rambling) philosophy.

While part of the operations department, I have been unofficially tasked with 'release management' for our dev/qa teams.  What 'release management' constitutes, I would posit, depends on your perspective on 'people, process, tools'.  Personally I fall into the 'people, process, tools - in that order' camp, so a good chunk of my focus has been on helping teams 'level up'.  Delivering value to customers (rather than just shipping products/features) and product ownership are also high on my priority list.

Fundamentals

  • Help teams grow by addressing/highlighting risk gaps (e.g. missing non-functional requirements)
  • Help teams understand by facilitating health checks & process mapping (empathy, flow, feedback loops)
  • Help customers through release impact analysis
  • Enable teams through automation, build/deploy/test processes, and adding/creating tooling

As usual, the specifics of what goes on your list, and how you go about it, are contextual - for example, the level of senior management buy-in, team morale/ability, or business focus will each dictate a different approach.  (Read up on Cynefin, interesting stuff!)  However, I suspect that my context is not far off from what most people experience.

Addressing risk gaps

Value: Prevents loss of customer satisfaction (for both internal and external customers!) caused by easily addressed issues
Cost: Depends on the items being addressed - effort should be weighed as part of the risk estimation (high effort is only worth it for high risk), so use common sense

This piece is the 'hard and fast, must comply' portion.  Strategy is to target low-hanging/high-risk fruit first - things like basic monitoring, some form of Agile process, good development practices, appropriate test automation, release impact, etc.  You are either red (non-compliant, no plan in place), yellow (non-compliant, but plan in progress), or green (good enough for now).

I have put together a 'team scorecard' as a visual tool, but the reality is that nobody can get everything done - choosing battles is important here, as is making a good case for your choice.  Usually just pointing out that the basics aren't there is enough to generate discussion and action - especially if senior management is fully on board.
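
To make the red/yellow/green rule concrete, here is a minimal Python sketch.  It assumes the colours are assigned per scorecard item; the `Item` structure and the example item names are made up for illustration and are not the actual scorecard.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One row on a (hypothetical) team scorecard."""
    name: str                       # e.g. "basic monitoring"
    compliant: bool                 # good enough for now?
    plan_in_progress: bool = False  # only matters when non-compliant


def status(item: Item) -> str:
    """Map an item to red / yellow / green."""
    if item.compliant:
        return "green"   # good enough for now
    if item.plan_in_progress:
        return "yellow"  # non-compliant, but plan in progress
    return "red"         # non-compliant, no plan in place


def scorecard(items):
    """Return {item name: colour} for a team's items."""
    return {item.name: status(item) for item in items}


# Made-up example team
team = [
    Item("basic monitoring", compliant=True),
    Item("test automation", compliant=False, plan_in_progress=True),
    Item("release impact", compliant=False),
]
print(scorecard(team))
```

Sorting the red items by risk would be the natural next step for picking which battles to fight first.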

Health checks

Value: Improves team communication, makes subjective issues visible
Cost: 3-4 meetings per year @ 2hrs each

Google 'Spotify health check' and you'll quickly figure out what this is.  Facilitate a safe discussion where the 'touchy feely' side of the team can be evaluated.  The only goal is to get the team talking to each other about important (and sometimes controversial) topics.

This generates a simple colour map of how the organization's teams are doing, and you can easily spot trends or specific teams that need some lovin'.  The data generated is "I feel that..." information, and thus 'safe' from being used to evaluate compensation and such.  If nothing else, everyone walks away from these meetings having learned something and having spent time communicating with team members.

Process mapping

Value: Shares/documents tribal knowledge, makes poor process visible
Cost: 2-3 meetings per year @ 1hr each

For the team leads, it can be a good exercise to review the process by which a ticket turns into consumed code.  For me, it's invaluable as a learning aid!  You also get to learn which areas they already know are pain points, and which areas they had not yet considered looking at.

A more macro perspective here would be to look at the people/high level processes in place and apply some systems thinking.  Haven't done this yet, though.

Release impact analysis

Value: Makes cost of 'non-ownership' visible, attaches value to customer experience
Cost: Depends on tooling in place, but can be very easy to assess

This came about because of many conversations around 'I don't understand why we cannot deploy at will' or 'The site is only down for a few minutes, that's not a big deal'.  Thankfully we have New Relic, because that data makes a pretty clear case.  It comes down to simply stating 'our users see X, Y, and Z during our release window' and asking 'are we okay with that?'.  Once you present the data to folks outside of engineering, the foot comes down pretty hard.
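
As a rough sketch of the arithmetic behind that kind of statement, the snippet below totals requests and errors inside a release window.  The per-minute samples are invented for the example and the `release_impact` helper is hypothetical; in practice the numbers would come from whatever your monitoring tool exports.

```python
from datetime import datetime


def release_impact(samples, start, end):
    """Total requests/errors for samples with start <= timestamp < end.

    samples: iterable of (timestamp, requests, errors) tuples.
    """
    in_window = [s for s in samples if start <= s[0] < end]
    requests = sum(s[1] for s in in_window)
    errors = sum(s[2] for s in in_window)
    # Avoid dividing by zero when the window saw no traffic
    error_rate = errors / requests if requests else 0.0
    return {"requests": requests, "errors": errors, "error_rate": error_rate}


# Made-up per-minute samples: (timestamp, requests, errors)
samples = [
    (datetime(2024, 1, 1, 12, 0), 100, 1),
    (datetime(2024, 1, 1, 12, 1), 100, 40),  # release starts
    (datetime(2024, 1, 1, 12, 2), 100, 60),
    (datetime(2024, 1, 1, 12, 3), 100, 1),   # release finished
]

impact = release_impact(
    samples,
    start=datetime(2024, 1, 1, 12, 1),
    end=datetime(2024, 1, 1, 12, 3),
)
print(impact)
```

A summary like "half of all requests failed during the release window" is the kind of number that makes the 'are we okay with that?' conversation easy.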

The conversations you have because of the impact analysis will quickly lead to deploy exclusions, e.g. 'please don't deploy between this and that time, unless it's an emergency'.
When this window becomes untenable for any of the teams (because they want it larger or smaller), action will have to be taken to reduce the release impact.
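
A deploy exclusion like that boils down to a simple time check.  Here's a minimal sketch, assuming exclusions are same-day (start, end) pairs and that an emergency flag overrides them; the `deploy_allowed` name and the 09:00-17:00 window are hypothetical.

```python
from datetime import time


def deploy_allowed(now, exclusions, emergency=False):
    """Return True if a deploy may go ahead at time-of-day `now`.

    exclusions: list of (start, end) time pairs; a deploy is blocked
    when start <= now < end. Windows crossing midnight are not handled
    in this sketch.
    """
    if emergency:
        return True  # emergencies bypass the exclusion windows
    return not any(start <= now < end for start, end in exclusions)


# Hypothetical rule: no deploys during business hours
exclusions = [(time(9, 0), time(17, 0))]

print(deploy_allowed(time(10, 30), exclusions))                  # inside the window
print(deploy_allowed(time(18, 0), exclusions))                   # outside the window
print(deploy_allowed(time(10, 30), exclusions, emergency=True))  # emergency override
```

Keeping the windows as plain data makes it easy to widen or shrink them as the release impact changes.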

Enabling teams

Value: Possible difference between long-term success and failure?
Cost: ?

This piece requires a bit of a paradigm shift, both in how you view the operations team, and in how you perceive product ownership.  I won't go into the depths of this, as there is lots of material out there - suffice it to say this is 'whole business transformation'.

Product ownership is wholly on the product/dev/qa team, who are supported by operations, and enabled by the business team.  It is not a free ride - ownership implies freedom, but also responsibility.  The catch is that this requires everyone to be on board.

The operations team, I suspect, should be an enabler and adviser - a partner to the dev/qa teams that brings non-functional requirements expertise alongside tooling and automation.

And thus...

I feel a bit better having written this down, but a key next step will be laying down (on paper) my own personal vision for where I want to go with all this, i.e. making sure I believe that the reasons behind my actions are right.

Got this advice the other day:  "Keep pushing.  This is something we believe in that has all these implications - so we need to focus on making each of the implications/steps demonstrate its tie-in to the big picture."

