Human error is never a root cause
P1s, major incidents (MIs), outages — whatever you call them, you need to learn from them. How did they happen? How did we recover? And how can we improve?
Blameless postmortems are the industry-standard tool for this. The more forward-looking among us now advocate for premortems, and for learning from success as well, on the assumption that some level of failure is always present. Unless your organization was "born in the cloud" in terms of practices, it may be doing a basic form of root cause analysis (RCA) with the 5 Whys technique. Performing an RCA is a good first step, but it tends to lead us into the following vicious cycle:
- A team puts a lot of effort into their RCA and comes up with a single root cause, ignoring the latent contributing factors (such as business decisions, systems in place) that made the incident worse than it had to be.
- They address the root cause (such as a bug, or an incorrect procedure), and successfully prevent the exact same incident from occurring again, but the latent contributing factors are still there, making all incidents more painful.
- The team feels that they didn't get much value out of the RCA and will put even less effort into their next RCA. "Incidents and downtime are a fact of life, amirite?"
I recently started attending executive RCA debriefs, where teams across the org present their findings and next steps. These debriefs are a great way to gradually change behaviours in an org that struggles to do RCAs in enough detail or in a timely manner. As I said earlier, doing an RCA is better than doing nothing at all, but when teams follow the letter of 5 Whys and not the spirit, you easily end up with findings like these:
- the root cause was a bug in our system
- the root cause was a bug in vendor software
- the root cause was human error during a procedure
Side note: weekly reviews with feedback are a great way to change behaviour in a people system, much like a monitoring and feedback loop while making small changes in legacy code.
Don't blame the human
Whereas most RCA findings may simply fail to extract meaningful learnings, calling "human error" the root cause is actively harmful to your organization. Pinning an incident on human error leads to two types of harm:
- it builds organizational scar tissue: more training, more detailed procedures, more approvals required, more segregation of duties and increased hand-offs, all of which increase the cost of daily operations and reduce the agility of the organization
- it reduces psychological safety. People will be focused on avoiding mistakes rather than improving, and they will avoid speaking up in an RCA. Teams will argue about who to blame for the incident, breaking down the organization's culture.
Focusing on human error assumes that human error can be solved, whereas in reality humans will always make mistakes. This misplaced focus leads to the myth of the "sufficiently smart" or "sufficiently careful" person, a theoretical person who would avoid all mistakes. This individual doesn't exist.
Blame the system
Let's take another look at typical RCA findings:
- the root cause was a bug in our system
- the root cause was a bug in vendor software
- the root cause was human error during a procedure
All of the example findings above fall into the same trap: they address the superficial failure mode while avoiding deeper analysis of the underlying system that enabled the failure. You may have heard the quote "every system is perfectly designed to get the results it gets." You won't reduce incidents without analyzing the system that produced them. It's like trying to change a product by only looking at the last unit off the assembly line, instead of studying the assembly line itself.
<give an example here>
- But what about the systems that let the human error take place? That is where the latent causes are.
- Checks and balances were missing. These can also be automated.
- Reduce the blast radius, so that a single mistake cannot affect everything at once.
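To make the last two points concrete, here is a minimal sketch of an automated check that limits blast radius before a change is applied. Everything in it is hypothetical and only for illustration: the `ChangeRequest` fields, the `check_change` function, and the 10% default limit are assumptions, not a description of any particular tool.

```python
from dataclasses import dataclass


class GuardrailError(Exception):
    """Raised when a change is rejected by an automated check."""


@dataclass(frozen=True)
class ChangeRequest:
    # All fields are hypothetical; adapt them to your own tooling.
    operator: str
    target_hosts: tuple[str, ...]
    fleet_size: int
    dry_run_passed: bool


def check_change(change: ChangeRequest, max_fraction: float = 0.10) -> None:
    """Reject changes that skip the dry run or touch too much of the fleet."""
    if not change.dry_run_passed:
        raise GuardrailError("dry run has not passed; refusing to apply")
    if change.fleet_size <= 0:
        raise GuardrailError("fleet size must be positive")
    # Count distinct hosts so duplicates do not inflate the fraction.
    fraction = len(set(change.target_hosts)) / change.fleet_size
    if fraction > max_fraction:
        raise GuardrailError(
            f"change targets {fraction:.0%} of the fleet; limit is {max_fraction:.0%}"
        )
```

The point of the sketch is that the check does not care who the operator is or how careful they were. A tired engineer and a fresh one get the same protection, because the protection lives in the system.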
How to think about failure modes
- What about eliminating the failure mode altogether? Brief mention of the hierarchy of controls. Learn from non-software industries: safety and reliability engineering, process safety engineering, human factors engineering, and safety in aviation and health care.
What can you do as a leader?
- Leaders must always call out any RCA that tries to pin an incident on human error.
- This includes non-managers: anyone who wishes to practice technical leadership.