More, more, more! Why the most resilient companies want more incidents

Article Monday, January 10 2022

If you were to poll most organizations, you’d find that the majority of people within them aren’t particularly fond of incidents. Incidents are disruptive, sometimes damaging, and almost always carry a negative connotation. The natural reaction is to simply want them to stop as quickly as possible.

But as an industry, we need to evolve to a more positive incident culture. We’ve become overly focused on what I like to call the “I hate incidents” metrics: mean time to resolution, mean time to recovery, mean time to failure… you get the idea. For one, when dealing with major incidents, it’s futile to try to derive meaning from an average across truly unique, black swan events.
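To make the point about averages concrete, here is a small sketch. The durations are invented for illustration only (five routine incidents and one hypothetical two-day outage), not data from any real company.

```python
from statistics import mean, median

# Hypothetical resolution times in minutes: five routine incidents
# plus one two-day major outage (2880 minutes). Illustrative only.
durations = [20, 25, 30, 35, 40, 2880]

mttr = mean(durations)                       # average over all six incidents
typical = median(durations)                  # middle of the distribution
mttr_without_outage = mean(durations[:-1])   # average of the routine five

print(f"MTTR: {mttr} min, median: {typical} min, "
      f"MTTR without the outage: {mttr_without_outage} min")
```

With these made-up numbers, a single outlier moves the mean from 30 minutes to 505 minutes while the median barely changes, so the average says little about either the routine incidents or the rare major one.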

And as a manager or tech lead, if all you really care about is the charts going down… well, then you need to at least understand that all of these metrics are easily gameable. For example, an employee can wait until the absolute last minute to declare an incident that they should have declared hours or days before, or they can close out an incident well before it is actually resolved, all in the name of keeping the mean times looking better than they really are. In other words, these metrics can create perverse incentives to make decisions that harm the resilience of the company.
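As a rough illustration of how late declaration flatters the numbers, the sketch below measures time to resolution from the moment an incident is declared. The timestamps and the `minutes_to_resolution` helper are hypothetical, chosen only to show the effect.

```python
from datetime import datetime

def minutes_to_resolution(declared_at: datetime, resolved_at: datetime) -> float:
    """Time to resolution as commonly tracked: from declaration to resolution."""
    return (resolved_at - declared_at).total_seconds() / 60

# Hypothetical timeline for one incident (all times invented).
problem_noticed = datetime(2022, 1, 10, 9, 0)
resolved = datetime(2022, 1, 10, 13, 0)

# Declared promptly, ten minutes after the problem was noticed.
honest = minutes_to_resolution(datetime(2022, 1, 10, 9, 10), resolved)

# Declared at the last minute, half an hour before the fix landed.
gamed = minutes_to_resolution(datetime(2022, 1, 10, 12, 30), resolved)

# What the business actually experienced, regardless of when it was declared.
actual_impact = minutes_to_resolution(problem_noticed, resolved)

print(honest, gamed, actual_impact)
```

In this example the same four-hour disruption is recorded as either 230 minutes or 30 minutes depending solely on when someone chose to declare it.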

The same is true for incident count more generally. If all that management cares about is incident count going down, well then it’s reasonable to expect people to become more hesitant to declare them in the first place.

We need to flip this thinking around and view incidents as everyday occurrences to be embraced and learned from, ones that ultimately contribute to the long-term resilience of the company.

In a report from the analyst firm Intellyx titled Modern Incident Management: Evolving Towards a More Positive Incident Culture, principal analyst Jason English writes,

For the longest time, it seemed like the primary metric of incident management was measuring the MTTR (the mean time to resolution) it took to find and fix something. Now the market is evolving, and perhaps it’s time the ‘mean-time’ metrics are replaced by a kinder, more positive incident culture. We must lean more than ever on digital collaboration with remote co-workers and flexible team structures to support fast-changing software and cloud infrastructures. Success in this new environment will be measured by improvement in our overall organizational resilience — the ability to learn from mistakes, and to bounce back faster and better over time.

So who is actually good at this? In tech, we like to think we are always ahead of the curve, when in reality we are playing catch-up to other industries. The truth is that the aviation industry is the crucible from which modern incident management grew; after a series of crashes in the late 80s and early 90s, deliberate decisions were made to document as many incidents as possible, and then to ensure the learnings from those incidents were distributed as widely as possible.

If you look at the recent history of aviation, the industry has an incredible track record of safety that is directly related to its incident culture. Pilots are incentivized to declare incidents early and often, and in some cases are even given bonuses for their actions. The industry knows that if you catch an incident early, it’s less likely to snowball into a life-threatening situation.

This transformation didn’t happen overnight; it took deliberate and decisive moves from leadership and pilots alike. Here are some of the steps they took that any organization can implement to encourage a more positive incident culture and ultimately become more resilient:

Lower the barrier to reporting: The easier it is to file an incident, especially a lower severity incident, the more likely employees are to file them in the first place. The goal is to catch emerging situations before they escalate to the dreaded SEV1, so it’s important to make sure that filing a SEV3 or INFORMATIONAL incident isn’t only easy, but actively encouraged. It’s even OK to go so far as to accept a percentage of false positives in the name of making sure to catch as much as possible.

Simplify the incident process: Fear of “what’s going to happen” is a major blocker to participation in the incident process. Your incident management processes should be open and easy to use. Further, your incidents themselves should be as publicly available within the company as possible, so that when an employee files an incident for the first time, they are already familiar with the process from having observed it. Walls of text in an internal wiki won’t cut it here!

Lead from the top: It’s critical that company leaders embrace positive incident culture. Leaders should not only avoid punishing those involved in incident responses but should go further and actually celebrate incidents and their responders loudly and publicly, especially those filing the near misses. At Google, they have an “incident of the week” where incident responders are brought to the attention of the whole company to be celebrated. As with all cultural changes, it is ultimately in leadership’s hands to champion and demonstrate the values they want to see the company embrace, and incident culture is no different.

John Egan
