Human error is never a root cause

P1s, major incidents (MIs), outages — whatever you call them, you need to learn from them. How did they happen? How did we recover? And how can we improve?

Blameless postmortems are the industry-standard tool to apply here. The more forward-looking among us now advocate for premortems, and for learning from success too, on the assumption that some level of failure is always present. Unless you are "born in the cloud" in terms of practices, your organization may be doing a basic form of root cause analysis (RCA) with the 5 Whys technique. Performing an RCA is a good first step, but it tends to lead us into the following vicious cycle:

I recently started attending executive RCA debriefs, where teams across the org present their findings and next steps. These debriefs are a great way to gradually change behaviours in an org whose RCAs could be more detailed or more timely. Like I said earlier, doing an RCA is better than doing nothing at all, but when teams follow the letter of 5 Whys and not the spirit, you easily end up with the following findings:

Side note: weekly reviews with feedback are a great way to change behaviour in a people system, much like keeping a monitoring and feedback loop in place while making small changes in legacy code.

Don't blame the human

Whereas most RCA findings may simply fail to extract meaningful learnings, calling "human error" the root cause is actively harmful to your organization. Pinning an incident on human error leads to two types of harm:

Focusing on human error assumes that human error can be solved, whereas in reality humans will always make mistakes. This misplaced focus leads to the myth of the "sufficiently smart" or "sufficiently careful" person, a theoretical person who would never make a mistake. This individual doesn't exist.

Blame the system

Let's take another look at typical RCA findings:

All of the example findings above fall into the same trap: they address the superficial failure mode while avoiding deeper analysis of the underlying system that enabled the failure. You may have heard the quote "every system is perfectly designed to get the results it gets." You won't reduce incidents without analyzing the system that produced them. It's like trying to change a product by only looking at the last unit off the assembly line, instead of studying the assembly line itself.

<give an example here>

How to think about failure modes

What can you do as a leader?