5 Myths About Major Incident Management
“Incident management” is still an ambiguous term, which is both a gift and a curse for Kintaba. As an incident management startup, we cringe when the term is used as a catch-all for any alert, regardless of priority, or for ad hoc processes for responding to crises.
But it also means there’s opportunity for us to shape best practices and lead conversations.
To set the scene: when we talk about major incident management, we are talking about critical, unexpected events. APM and observability tooling are getting better at detecting anomalies and applying auto-remediation, but humans will need to be involved in responding to the truly unique and unexpected, at least for the foreseeable future.
Hopefully, in the not-so-distant future, more teams and companies will adopt best practices instead of gluing together text messages, emails, task managers, Google Docs, and Slack conversations. Effective disaster response predates the tech boom in places like the airline industry, and there’s a wealth of literature out there we can all still learn from.
With that said, here are 5 myths about incident management worth addressing:
1) Resolving incidents is solely the responsibility of SREs
If you look at the incident management tools available today, you may notice that they are designed, built, and marketed for SREs and IT engineers. But why is that? We know effective incident management involves other roles across the organization.
Notifying an account rep that a Tier 1 customer was impacted can be just as important as pinging the on-call engineer. And as the incident evolves, you may realize that someone from security or legal needs to be brought in. It’s unreasonable and ineffective to expect the engineer to quarterback all of that on top of solving the technical issues at hand.
A best-in-class incident management tool should automate that coordination, so that engineers don’t have to think about who needs to be brought in at what moment.
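To make that idea concrete, here is a minimal sketch of rule-based stakeholder routing. Everything in it is an assumption for illustration: the `Incident` fields, the tag names, and the `ROUTING_RULES` table are hypothetical and do not describe how Kintaba or any other product works.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A hypothetical incident record; field names are illustrative only."""
    title: str
    severity: int                  # assumed convention: 1 is the most severe
    tags: frozenset = frozenset()  # free-form labels attached as the incident evolves


# Hypothetical routing table: each rule pairs a predicate with the role to notify.
ROUTING_RULES = [
    (lambda incident: True, "on-call engineer"),
    (lambda incident: "tier1-customer" in incident.tags, "account rep"),
    (lambda incident: "security" in incident.tags, "security"),
    (lambda incident: "data-exposure" in incident.tags, "legal"),
]


def roles_to_notify(incident):
    """Return every role whose rule matches, in the order the rules are listed."""
    return [role for matches, role in ROUTING_RULES if matches(incident)]


outage = Incident("Checkout outage", severity=1, tags=frozenset({"tier1-customer"}))
print(roles_to_notify(outage))
```

Because the rules are re-evaluated against the incident's current tags, adding a tag such as "security" mid-incident would pull in the security team without the engineer having to remember to do it.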
2) Incident response can be completely automated
There’s a saying that if an issue can be predicted in advance, then the solution should be automated. But major incidents, by their very nature, are often black swan events. In these moments of crisis, automation comes up short because the outage falls outside the realm of predicted scenarios. It’s critical that the humans involved be empowered to react quickly and efficiently. What we can do is automate more of the surrounding process, so that responders are free to actually deal with the problem.
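The split between predictable and unpredictable failures can be sketched as a lookup with a human fallback. This is only an illustration under stated assumptions: the failure signature, the `restart_worker_pool` function, and the `KNOWN_REMEDIATIONS` table are all hypothetical names, not part of any real tool.

```python
from typing import Callable, Dict


def restart_worker_pool() -> str:
    """Stand-in for a real automated fix (hypothetical)."""
    return "restarted worker pool"


# Only failures that were predicted in advance can have an entry here.
KNOWN_REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "worker-pool-stalled": restart_worker_pool,
}


def handle_failure(signature: str) -> str:
    """Auto-remediate a known failure; hand anything else to human responders."""
    fix = KNOWN_REMEDIATIONS.get(signature)
    if fix is not None:
        return f"auto-remediated: {fix()}"
    # Nobody predicted this failure, so no automated fix can exist for it.
    return f"escalated to responders: {signature}"


print(handle_failure("worker-pool-stalled"))
print(handle_failure("unexpected-cascading-failure"))
```

The table can grow as teams learn from past incidents, but the fallback branch never goes away, which is why the response process itself still needs to be fast and well practiced.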
3) You don’t need a dedicated incident management tool
I’ll cut right to the chase: it’s ineffective to try to manage incidents across a patchwork of tools like Slack, PagerDuty, email, Google Docs, and SMS. All of these tools can certainly play a role in responding to major incidents, but for every tool you add to your response process, you slow your responders down and you silo information. Teams need a single source of truth, accessible to the entire organization, for declaring emergencies, mitigating and responding to them, and ultimately learning from them so they don’t recur.
4) Postmortems need to be long technical documents
One of the anti-patterns we see in companies struggling with incident response is the extensive, overly detailed postmortem template with dozens of data-entry fields. These documents are not only intimidating; they are often downright burdensome for responders. The fact is that if it feels like a burden, people simply aren’t going to do it! At Kintaba, we recognize that making the postmortem editing experience flexible and customizable by the writer is crucial. Even a single sentence from the person who was present when the incident occurred, and who understands intimately what could be improved for next time, is better than writing down nothing (and in turn learning nothing).
5) Failure is preventable
When we talk about the need for humans to have good processes in place to respond to major incidents, it often gets interpreted as meaning that we don’t believe in the future of automation (sometimes phrased as AIOps). On the contrary, we very much acknowledge that machines will continue to get better at auto-remediation. But until artificial intelligence becomes much more sophisticated, the fact of the matter is that you can only automate the fix for things that you know in advance will go wrong. And the number of things that can go wrong, as any engineer working on internet-scale systems knows, is essentially infinite. So for critical events that either couldn’t have been predicted or simply weren’t predicted, you need effective processes in place to respond quickly as an organization.
It’s not a matter of if something will go wrong, but when. Any solution that claims to magically fix all of your problems without human involvement is selling you snake oil. Of course, as engineers, we should do what we can to mitigate recurring failures and automate the fixes. But you need to be ready for when something inevitably goes wrong that you didn’t expect.