[0:00:00] LA: Ken, welcome to Software Engineering Daily.

[0:00:02] KG: Good to be here, Lee. Good to see you.

[0:00:05] LA: Full disclosure, everybody. As many of you might already know, Ken and I are very good friends, and we recently published a book together, Business Breakthrough 3.0. Ken and I have talked a lot about the sort of thing we're going to be talking about today. Ken, anything you want to add before we get started?

[0:00:23] KG: No. It's good to see you, and I'd love to dig into it. This is a topic that you and I have talked about in different ways over the years.

[0:00:29] LA: Yeah, it's a great topic, actually. Let's get started. Let's start with some basics. Incident management really has multiple dimensions associated with it, right? There's pre-incident monitoring and alerting. There's incident response, the process itself. There's cleanup afterwards, and there's retrospectives. Where does Blameless fit in the stack, and is there a sweet spot?

[0:00:55] KG: Well, it's a good question, Lee. First of all, a lot of the customers I talk to, when they think about incident management, actually define it even more narrowly. They think of the detecting and fixing as incident management, and yes, maybe the call. To me, that is certainly a function of it. But enterprise reliability is a little bit bigger than that, and that's where I think you bring in the other pieces. During the incident, communications are very critical. If your organization is having an outage, does your sales team know what to tell your customers? Does your customer success team know? Are your executives aligned on what's going on, so that everybody's well informed? You think about, during the incident, are you capturing all of the metadata so there are really great learnings and continuous improvement afterwards? After that, of course, you do a retro. There are the technical parts, and making sure, again, that you take all that metadata from the incident, the learnings. But then there's the process of, are we actually creating action items so that we can fix this, so that hopefully that same thing won't break again? Then beyond that, I think about enterprise reliability, where incident management is usually just that first part for many of the people I've talked to. After you've captured the things that you don't want to do again, how do you make sure that the team actually goes in and prioritizes that work, so you don't have another incident with the same root cause? I like to think about it as enterprise reliability, where incident management is a piece of it. Some people break incident management into multiple pieces, but most of the people I've talked to really just think about it as the observing and fixing portion, at least in the conversations I've had.

[0:02:35] LA: So you're focused on the post-incident side of the world, the post-problem side of the world. Not the detection of incidents so much as, how do we examine what happened, and how do we make sure it doesn't happen again in the future?

[0:02:55] KG: Yes. When I think about it, we pick up where observability ends in many cases. We can import from New Relic, we can import from Datadog, all of those types of tools that you might have that tell you you have an issue. So now you've got those alerts, they go off, the team knows that there's an issue. Well, now you have to do a lot of different things.
You have to create a custom Slack or Teams channel, you've got to look up your service directory, and start paging the right team members to come to that Slack channel, which is a much better course than a phone call. Then you have to start working the problem and communicating with people. What we've done is really automate all of that: the incident fixing, the incident communication, the capture of the incident metadata about why it's actually happening, for afterwards. Then we have the other parts of it: that retrospective piece, the RCA, the action items, then the dashboards, putting it in ServiceNow, putting it in Jira, or whatever tool you're using, to make sure that that work gets done within the times that you set. I look at that as that continuous improvement circle, from when it actually breaks, to putting the right people together, all the way to identifying what happened, fixing it, and making sure it doesn't happen again. That's really what we focus on at Blameless.

[0:04:15] LA: Cool. Let's go one step deeper into this now. Let's start here: you have an incident on your site, something happens, your observability frameworks have notified you, PagerDuty has told you you've got a problem. And now, your team is starting to get engaged. Starting from that point, let's talk through in a little more detail the steps involved in incident management that every company should be doing, whether they're using Blameless or not. What are the steps that every company should be doing all the way through to the end of the process, and where does Blameless help with those? Can you help me walk through that process?

[0:05:00] KG: So, all of the steps from what you mentioned. It's an interesting thing, because in this modern world, many people have deconstructed their monoliths. Now they're in microservices, and it seems pretty straightforward. But a lot of things have changed, because first of all, what's down? It used to be that the whole thing's down, or the whole thing's up. Now you might have one subservice that's intermittently having issues, causing some cascading impact on something else. So it gets a lot more complicated, I think. So we can dig into it. There are the operational pieces and there are the technical pieces. Which one do you want to hit on first?

[0:05:36] LA: Well, because I'm a big DevOps fan, I'm a big [inaudible 0:05:40] model fan. We've talked about that. In my mind, the owner and operator are the same person. I prefer that model. I like that model. I think there are a lot of advantages to that model. So let's stick with that sort of model, and realize that we are talking about large-scale applications with multiple components. So a particular service or system is problematic. I want to get back to the question about what level within an application you do incident response at, but let's save that for a little bit.

[0:06:13] KG: I think at all levels we do incident response, but let's go through the business process, or the technical part. You have a service, and the service starts having some issues. It's a you-build-it-you-own-it organization, so the software team gets paged to look at their particular service. They're now researching, okay, what's going on with this service? Is there an impact? And they have to start communicating, so there's the logistics.
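To make those kickoff logistics concrete, here is a minimal sketch of that kind of automation: an alert fires, a dedicated Slack channel is created, the initial context is posted, and the owning team is paged through PagerDuty's Events API v2. This is only an illustration, not Blameless's implementation; the channel naming scheme, the environment variables, and the single routing key (standing in for a real service directory lookup) are assumptions.

```python
import os
import time

import requests
from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # assumed bot token


def open_incident(service: str, summary: str, severity: str = "critical") -> str:
    """Create a responder channel and page the owning team for one incident."""
    # 1. Dedicated Slack channel for the responders (channel names must be
    #    lowercase, with no spaces, under 80 characters).
    channel_name = f"inc-{service.lower()}-{int(time.time())}"
    channel_id = slack.conversations_create(name=channel_name)["channel"]["id"]

    # 2. Post the initial context so everyone who joins lands on the same page.
    slack.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident on {service}: {summary}",
    )

    # 3. Page the owning team via the PagerDuty Events API v2. In practice the
    #    routing key would come from a service directory lookup; here it is a
    #    single assumed environment variable.
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PD_ROUTING_KEY"],
            "event_action": "trigger",
            "payload": {"summary": summary, "source": service, "severity": severity},
        },
        timeout=10,
    ).raise_for_status()

    return channel_id
```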
We've got to create a Slack channel where we're going to start working, so everybody knows where they're communicating. There's a process of who's the incident commander, who's on point, so that you're doing the updates. There's the process of, how are we going to do the updates? Who's on point? Are those updates going out on a frequent basis? That's where it starts.

I'll give you an example. I was talking to a large streaming company that we'd all be familiar with, that had a big outage not too long ago, and that is well known for being very microservices focused. So thousands and thousands of microservices. In this case, there was a core team, and I think almost all companies these days have a core team that's the catch-all, and that was the one that had the services that did the actual streaming. There was another team that did some things related to live events. They were in their own Slack channel, doing their own triaging, disconnected communication-wise from this bigger outage. So the big team that was doing the streaming wasn't being communicated with. So to your question, I would say the first thing is, the right teams get paged, but then there needs to be the right process for how we're going to communicate. We're going to set up a Slack room, and there needs to be communication that this team, maybe multiple teams, are actively investigating, in case you have cascading impact. That communication becomes really important. You need to be clear on roles and responsibilities: who's the incident commander, who's driving the communication pieces? Who else do you need? How do you page them and get them easily onto the system? Then, as you start to uncover what's going on, I think you capture things in real time easily with commands, right there in Slack. Hey, let's capture this piece of metadata. Let's capture this graph in New Relic, or maybe a graph in Datadog. That way, you've got some great data post-incident, so you're not spending days or weeks after the incident's resolved actually looking for all that metadata about what happened in the first place.

[0:08:39] LA: Okay. Finding the right combination of people to be aware of an incident can itself be challenging in a large organization, and especially a large, microservice-driven organization. I like the example you gave from the streaming company, where two separate incidents can be going on that are very much related, yet the two teams know nothing about each other, and don't know that the other team is having problems, because of poor communication. Lack of communication during an incident can be a huge issue. But so can overcommunication, because of signal-to-noise ratio and all those sorts of issues. How do you find the right balance between too much and too little communication, so you know that the right people are connected, but no more than the right people?

[0:09:31] KG: I think that's fair. I'll just tell you how we think about it at Blameless. At Blameless, it really depends on the severity of the incident, what that service is, how it ranks, and how often you want to make those updates. In Blameless, we think about it as defining: these are the services, these are the severity levels. Then, based on the severity levels, we want to communicate to these groups. So if it's a low severity, you might want to have a much smaller group. If it's a high severity, you might want to have a much broader group.
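That mapping of services and severities to audiences can be sketched as a small policy function: the service's rank in a catalog plus the declared severity determine who gets updates and how often. The tier numbers, the SEV convention (lower number means more severe), and the channel names below are illustrative assumptions, not Blameless configuration.

```python
from dataclasses import dataclass

# Hypothetical service catalog: lower tier number = more critical service.
SERVICE_TIERS = {"checkout": 1, "recommendations": 3, "internal-reporting": 5}


@dataclass
class CommsPlan:
    channels: list        # who gets updates
    update_minutes: int   # how often updates should go out


def comms_plan(service: str, severity: int) -> CommsPlan:
    """Pick an audience and cadence from service tier plus severity (SEV1 is worst)."""
    tier = SERVICE_TIERS.get(service, 5)

    # High severity on a critical service: widest audience, fastest cadence.
    if tier <= 2 and severity <= 2:
        return CommsPlan(["#inc-responders", "#incident-status", "#exec-updates"], 15)

    # Critical service with lower severity, or a serious issue on a lesser
    # service: keep stakeholders informed without pulling in the whole company.
    if tier <= 2 or severity <= 2:
        return CommsPlan(["#inc-responders", "#incident-status"], 30)

    # Everything else stays with the owning team.
    return CommsPlan(["#inc-responders"], 60)


# Example: a SEV4 on a high-profile service still gets stakeholder updates,
# but not the executive channel.
print(comms_plan("checkout", 4))
```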
I think, whether you use a tool like Blameless or you do it in some other way, having it mapped out clearly is what matters. I think you have that in your previous book, Architecting for Scale. Think about the severities, think about who needs to be notified, and how frequently you need to have those notifications, so that the right level of communication happens with the right people. I also think it's important to have a tool, or some sort of tool, that gives you that global view, so you know when you've got multiple incidents going on. Because if a team is, for example, triaging stuff off the grid, we'll call it, then other teams that might also be triaging things don't know that they're looking at it. That's where I think process becomes really important, that you say, with microservices, when you have an incident, once you declare it an incident, maybe it's a SEV4 or whatever it might be, because you're not quite sure of the impact. You still, at some point, declare an incident once you have one, and then it's communicated in the proper way to keep your whole enterprise aligned. Because you might have thousands of services, or dozens, and some people even have hundreds of teams.

[0:11:10] LA: I guess I'm still not quite clear on what process you would use. You use severity levels to determine the blast radius, if you will, or the communication coverage, but –

[0:11:26] KG: Or the service, because you might have a service catalog, so it might be the service. This is a service where, if it goes down, it's automatically going to be a higher severity. Or, if this particular service goes down, we might want to have that communication. So I think the service catalog will probably come first, and then probably followed by the severity. Because, you know, it may be a SEV4 on a high-profile service, but the blast radius isn't that impactful.

[0:11:53] LA: Got it. Got it. I guess that's the piece I was missing there, the service tiers, which is like what we used to do at Amazon. We actually talked about those in the book, a tier-one service versus a tier-five service. Yes, that makes perfect sense. Now, given that, the goal of that whole process is to figure out who should be engaged directly in the incident response process, and who should just be aware of what's going on. Now, there's a different scale of involvement, depending on where you are in relationship to the service under stress, right? If you're a consumer of that service, you may want to be aware that a problem is going on, and receive a certain amount of information as the incident is going on. If you're a little further away, you may only care to be notified when it's over. If you're closer, you may want to be engaged in the active Slack channel, trying to find the problem, diagnose the problem, and fix the problem as quickly as possible. It all depends on your level of involvement with where this problem occurs. Are there some best practices on deciding?

Before I finish asking that question, I'm going to take a little side note about what happened a lot at Amazon, when I was working at Amazon. We would have calls for incident responses, right, and people would join the calls. A lot of people would join the call simply because they cared about the outcome, not necessarily because they had anything to contribute to the outcome. But at the time, this was early in the Amazon days, we didn't have a good way of separating the listeners from the engaged people.
You tend to have a whole bunch of people on a call, and you'd have random ideas coming in from random people that had nothing to do with what was going on. The signal-to-noise ratio suffered; the communication channel itself just got too unwieldy. We had to start taking people out of the call, because we couldn't handle an incident with 300 people on the call effectively. But all of those 300 people cared about the response. And since we didn't have a way to separate out caring about the response from engaging in the response, we couldn't treat that correctly. What are the industry best practices for that sort of issue, of deciding who should be actively involved in working an incident, and who should just be informed when an incident is going on, when it's over, and what the status is? What are the best practices involved in that?

[0:14:48] KG: I think, to me, the best practice is, you want to have the minimum number of people, basically the fix agents, the people that can directly help resolve the incident, in the channel. You and I – I don't know about back at AWS, but I know at [inaudible 0:15:02], a lot of times, people used to do phone calls. I'm not a big fan of phone calls, because that's really where things can sometimes get even crazier as far as communication. You've got Slack, you've got Teams, which are great ways to communicate and let people multitask more easily. I think, generally, what you want to do is have a channel that's created for the people that are directly involved in the incident. Then, you want to have regular communication. At the same time, you want to create talk channels. You might want to make one specific to this incident, or maybe you have a broader channel, let's just say it's company incidents. So you might have, hey, these are the people that are working on it directly here, they're doing updates, and then we've got a global channel that everybody can see, where they can get those updates. To answer your question explicitly, I think the people that are doing the fixing work should be on the Slack channel directly. They should be informing other people, which might be customer success, sales, leadership, whoever it might be, in a different Slack channel. That's where those people are getting their updates, that's where they're communicating. Then, if you want to have two-way communication, somebody can go to that other channel. You want to leave the people that are solving the problem alone; they're already under a ton of stress. Maybe it's in the middle of the night, and a bunch of questions thrown at them simultaneously doesn't help. That's my opinion. What are your thoughts?

[0:16:23] LA: Yes. The onion approach to Slack channels, essentially: you have the core team in the core channel, and then you go out from there. I completely agree with that. I think another best practice, too, that is often missed, or often not managed effectively, is the level of involvement. Often, incidents attract higher-level managers, and sometimes at a higher level than should be involved in an incident. I've had more than one incident where everything was going fine, and then a VP joined the Slack channel or was heard on the phone call, and everyone started talking differently. Because the boss's boss's boss was in the room now, and we had to be very careful about what we said. That changes the scope and the type of conversations that go on, and it changes it in a negative way.
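That layered, onion-style communication can be sketched as a simple fan-out: responders work in a core channel, and the incident commander's periodic updates are pushed outward to stakeholder and executive channels, so nobody has to sit in the working channel just to stay informed. The channel names and the token variable are assumptions for the example.

```python
import os

from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # assumed bot token

# Innermost ring first: the people fixing the problem, then stakeholders,
# then executives who only need a read-only status feed.
CHANNEL_RINGS = ["#inc-responders", "#incident-status", "#exec-updates"]


def broadcast_update(message: str, rings: int = len(CHANNEL_RINGS)) -> None:
    """Post the same status update to the first `rings` layers of the onion."""
    for channel in CHANNEL_RINGS[:rings]:
        slack.chat_postMessage(channel=channel, text=message)


# Example: everyone gets the same summary; questions from the outer rings stay
# in their own channels instead of landing on the responders.
broadcast_update(
    "SEV2 update: error rate recovering after rollback; next update in 30 minutes."
)
```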
I think the other thing you need to do is make sure that you not only have layered involvement, but you also limit the scope of management, or of responsible parties, that can be involved. Even if the service is in your organization, if you're two levels removed, management-wise, from the owner of the service, that doesn't mean you need to be involved in the day-to-day or the moment-to-moment issues associated with an incident. This is something I think Amazon did poorly, too. You'd regularly have directors and VPs in the smallest incidents, just because it was such a customer-focused company. Everybody cared anytime there was an incident, and it does affect things. I think you want to layer communication outward, not only to other teams, but also upward as well.

[0:18:20] KG: Yes. No, I agree. When I said to create these channels, it's not just for other teams; you could have an executive channel. I'll give you an example. We used to have one particular product where, if you went to a car dealership and applied for credit, it was like 70%, 80% of the market. If there was an outage, it was a really big deal, because there was broad impact. In that case, there was a totally separate executive channel where information would go to them so they could stay updated, while the people that were fixing it could stay focused on that. And to your point, sometimes there's also the business leadership, and then the technical leadership, and those can be two different things. Sometimes technical leadership may have some things that they can add, or they can chill the conversation. I think that's where you probably align around expectations. If you've got a critical tier-one service, and it's down for 10 minutes, 15 minutes, and the teams can't figure it out, and maybe some executive is also very technically astute, maybe even better at troubleshooting than that team, that's the delicate balance. At some point, it might make sense for them to get in there, but certainly not in the first few minutes. I think it's probably about setting reasonable expectations. I've got to be honest with you, I'm a little biased on this one, because sometimes I've been that VP or SVP that joined a call. But I would always try to do it that way: I would get the updates, and if it looked like the team was getting there within a reasonable amount of time, you would leave it alone. But if something was dragging, or it looked like they weren't getting the right resources brought into the channel in a timely manner, or maybe there were some difficult decisions, like, "Hey, if we do this one thing, maybe we could lose data," then let's get somebody to make a decision. That's where I think it might make sense to have somebody, a technology executive, involved. I don't think it usually makes sense to have business executives in a technical conversation, because usually that just kind of has more of a chilling effect.

[0:20:22] LA: But I think even with technology executives, you want to cap the level of involvement based on the incident type, and the level, and the problem. Obviously, the greater the severity and the greater the importance of the service, the higher the level of involvement. That makes sense. I get all that. But I think you still want to cap it, because I've almost always seen, from an organizational standpoint, that higher levels in the organization want to be involved more than they should be. Whether it's technical or business, I think that's often the case.
I 100% agree with you, but I think it does apply to technical leadership as well. One of the things I did when I was at AWS, at Amazon, which was a little later in my Amazon career: I always had my team, I was [inaudible 0:21:16] team, be the only ones that were allowed on the call for the incident. Anytime my boss, my boss's boss, or my boss's boss's boss wanted to get involved in the call, I politely asked them to leave and to come into another phone call that I would run separately with them. Then I would go back and forth between the two calls, and give them an update on what was going on without having them involved in the call itself. It was more like a read-only call.

[0:21:51] KG: Yes, I totally agree. With Blameless, we basically do exactly what you're talking about in an automated way. There's an automated executive Slack channel, and the engineers can do updates. Those are sent automatically, so no one needs to go to the incident Slack channel or reach out; they can get the updates there. If somebody has questions in that channel, they can ask, like back in AWS times, "@Lee, I've got this question," so that you don't have them jumping in. I think I'm 100% aligned with you. There are obviously some exceptions, but for the most part, you've got to trust the team. And as long as you've got the updates, that's where I go back to communication being so critical. I would do the same thing; I would proactively give a heads-up to higher levels if I knew there was an incident going on that they might care about. Because usually, as long as people are getting good information, they tend not to jump in on those calls. They have comfort that the situation is under control.

[0:22:45] LA: One thing we haven't talked about, Ken, and we probably should, is the retrospective aspect after the incident is over, and the process involved in that. Do you want to give a perspective on what Blameless does in that area?

[0:23:02] KG: At Blameless, we automate that whole process, whether it be creating the template, porting all the metadata, a lot of the work that's done to prepare for an RCA, capturing the action items, or submitting them automatically to Jira and ServiceNow. We automate all of those workflows. That's certainly important, and it saves time. I was on a QBR with a customer, and they said they saved 55% of engineers' time. That's just a low-value exercise for engineers. What they really want to be focused on is discussing the details of the problem and getting into, how can we prevent this? I like to think about it sometimes as, is there a chance we can build a do-not-repeat plan, so that we've got some action items so this thing doesn't happen again? I think if you automate it, then you can get people focused on the most important piece. I like to talk about that, because I think that's where a lot of companies also fall down: setting the expectation that it is a blameless exercise. That it's not about figuring out who did something wrong. It's really just continuous improvement. How can we all uncover what happened, so that, if it's preventable, we can figure out how to keep it from happening again? Creating that culture where it's not blaming, it's not this team versus another team. That gets challenging, especially in some of the companies I think we've probably both seen, where they have a DevOps team, and then they have the software engineering team.
The DevOps team runs the software, and the software engineering team builds the software. Well, that's almost still just the old model, right? And it can cause a lot of friction.

[0:24:39] LA: Right. It's just the old model relabeled. Yes. The whole concept of the DevOps team drives me crazy every time I hear it. Ken, I'd love to talk more about this. I think we're going to have to do another episode, because we're going a little bit long here now, and because we really haven't talked too much about the retrospective side of it. There's a whole lot more conversation I think we can have here. Maybe we should try and schedule some time to do another follow-on episode. But for now, I think we're going to have to bring this one to an end. Ken Gavranovic is the COO of Blameless, an incident response and management company. Ken, thank you very much for joining me today on Software Engineering Daily.

[0:25:27] KG: I enjoyed it, Lee. Thanks.

[END]