What's Really Going On? · Steven Murawski

Outages

One constant in IT Operations is that things will fail, sometimes causing an outage. If you work in a traditional IT organization, this can be a trying time. People are scrambling to see what broke and how to restore service. Managers are getting conference bridges going to pull in the engineers and ask them whether they have found the problem and fixed it yet. Executives are calling for the responsible person to be delivered as a sacrifice to appease the gods of complexity, following the doctrine of “if we fire all those who make mistakes, we won’t have any more outages.”

I Didn’t Do It (and If I Did, “He” Told Me To)

All this environment does is make people efficient at covering their backsides and their tracks. In moves that covert action agencies would find impressive, IT Operations folks learn to sneak in risky changes alongside seemingly routine updates or bundle them with “required security updates”. Or the changes are timed with a new software deployment - “since we are going to new software, let’s do that on the latest OS!” - despite the fact that they can’t test the changes they make now in a realistic way, much less predict how a new OS and software deployment will behave.

How Do We Solve This Problem?

What we really need in dealing with failures and outages is an unprecedented level of truth and statement of fact.

When I worked for a local police department, we would occasionally have internal investigations regarding alleged misconduct. We lost more employees for failing to be truthful (there was a truthfulness policy, and it was a fireable offense to not fully and accurately disclose what happened) and for attempting to cover up what might have been a minor infraction (and some not so minor) than for the infractions themselves. The cover-up changed the result of the investigation from a reprimand or suspension (or similar) to termination of employment.

While making a failure to tell the truth a fireable offense is a bit on the extreme side, it was important in that context because of the responsibility held by the police department. We dealt with issues where people might be exposed to fines, have their liberty constrained (arrest), or possibly even lose their lives. The stakes don’t get much higher, and being able to look at what really happened is critical for an agency and the public it serves to have a level of confidence and trust.

In IT Operations, we need to put a premium on everyone being able to share, without fear of reprisal, what actions, choices, or changes they participated in that might have contributed to an outage. This is where we can really learn how to make our systems resilient. It can help us determine what to watch for, what to write tests or alert rules for, or what policy to change. This only works if we have real visibility into what happened and an accurate accounting of people’s actions.

Learn More - Beyond Blame

If this strikes a chord, go check out the recently released (to Kindle, print is coming soon) Beyond Blame: Learning from Failure and Success. It’s a short read (I read it on the flight from Milwaukee to Minneapolis), but it stages the problem well and provides actionable steps for moving from blame to learning.