[00:00:00] J: Ashley Sawatsky, Niall Murphy, welcome to Software Engineering Daily.

[00:00:04] NM: Thank you.

[00:00:04] AS: Thanks, Jeff.

[00:00:06] J: So, today's topic is incident management. Ashley, you work for a company called Rootly that works in the incident management field. What exactly is it that you do? Then, we'll talk about Niall in a second.

[00:00:23] AS: Yes. So, I am Rootly's Developer Relations Advocate/Incident Response Advocate, Reliability Advocate, all things incident response and reliability. I work a little bit on the external side in the community, creating content and information centered around how organizations, especially tech companies, can level up their incident response. Prior to joining Rootly, I spent about seven years at Shopify, building out a lot of the incident management and communications processes there. So, I'm taking all that knowledge into a really cool product that actually automates and streamlines a lot of the processes that take place during an incident.

[00:01:05] J: Cool. Niall, do you want to quickly introduce yourself?

[00:01:09] NM: Sure thing. My name is Niall Murphy. I'm the CEO of a startup in the SRE space called Stanza Systems, which has just recently gone into a kind of private beta. I am extremely uncomfortable with being called technical, as well as with being a CEO. But nonetheless, this is apparently what people see me as, so here we are. If your listeners know my name, it's probably because of the SRE book, the Site Reliability Engineering book, which was quite popular and is still sold to this day.

[00:01:46] J: Fantastic. What is your relation to Rootly, if any, for our listeners' understanding?

[00:01:55] NM: Yes, I'm just a hanger-on, really. I'm interested in the incident management space. I should say that Stanza does not compete with Rootly, or Jeli, or Shoreline, or anything else in the incident management space. We're doing something very different, but I do pay a lot of attention to the incident management space generally, because I think it's one of the most intellectually interesting areas in computer science and production as it's practiced today, because it's full of exceptional behavior and things you wouldn't expect. It's a very interesting space to be in.

[00:02:35] J: I like that you use the word intellectual rather than technical. I mean, you've made it clear that you're not a fan of that word as applied to yourself. But I'm reading into this that incident management is more than just a technical thing, and also includes wider aspects, such as informing customers and so on. Is that fair? Did you mean to imply that?

[00:03:01] NM: Well, I say intellectually interesting precisely because of the fact that it touches on so many verticals, of which technical resolution, understanding of distributed systems, et cetera, is obviously a hugely important component. But there are incredibly important other components as well, and actually, Ashley is more qualified to talk about those. So, we should talk to her about that.

[00:03:25] AS: Sure. Something that I often say, when people ask what one of the most important skills you can have as an incident responder is, and I think this goes for both, whether you're a technical responder, or on the communications side, or any other side, is just the ability to read the room. Incidents vary a lot.
There are a lot of emotions and all sorts of things that are kind of going on in the background, in addition to the actual technical response and the action items that are taking place. I think those are some of the most undervalued skills, and we don't call them out as much as maybe we should. Just the ability to read the room, respond accordingly, and anticipate what's needed from you from that perspective, in addition to the actual remediation actions you're taking when something breaks.

[00:04:18] J: Speaking of reading the room, I think that the traditional approach to incident management is the classic war room, where you physically get everyone into a room. Now, of course, in the last couple of years, that approach has drastically shifted, with COVID, obviously, and then everyone still working from home. How important do you think the physical presence, the physical colocation, of incident responders still is or would be?

[00:04:47] AS: Maybe Niall has a different perspective. To me, I would say physical colocation is not important at all during an incident. I've responded to probably hundreds of incidents, and aside from really major events, like Black Friday at Shopify, for example, where, sure, we're all in the room together, generally speaking, a lot of companies run on Slack these days, or just remotely in general, and I don't think it's necessary to be physically in the same room at all.

[00:05:20] NM: Yes, I mean, let's start off with the fact that I live on a tiny, rain-soaked island on the other side of the world, and have partaken in incidents in such a context for many decades. So, I don't view physical colocation as being necessary. Although, there is some kind of optimization effect for small groups of people in physical colocation, I think. But whether or not the size of that effect is determinative with respect to success in the incident management piece, as opposed to being able to access expertise all over the world, I don't know that the size of that effect would dominate every other concern. In fact, I think it probably doesn't.

[00:06:04] AS: I was just going to say that I think there is a distinction between asynchronous and synchronous communication. Sometimes synchronous communication definitely does need to take place during an incident. But there are so many tools like Zoom, and Google Meet, and all these different things that allow for that – even Slack huddles are really great for ad hoc synchronous communication. So, that has just become a lot easier to do without sharing a physical space.

[00:06:31] NM: For those of us in the reliability space as well, there's one of the other considerations that I used to hear a lot, but don't quite hear so much anymore, and I think there's a very valid reason for it. One of the concerns I used to hear is, "Okay, you're the team that's on call for Zoom. Are you going to use Zoom, the product, to debug Zoom, the product?" Well, the answer is probably not. Because if you do, when Zoom is down, you reach for your Zoom channel in order to coordinate, and actually, it's down. The point that was typically made is that your coordination mechanism for incident resolution should be distinct from the product that you are supporting, which I think most people would get behind. But even in Google, there was still a very strong instinct to crack open a Google Doc for coordination, by default, we should say, even for some of the folks working on Google Docs, because it's quite rare for the whole thing to go down.
Often you can get some bit, or some shard, or whatever, to be functioning. Yes, anyway.

[00:07:41] J: No, that's a really good point, and I have to admit, I was wondering about that myself. If we had time towards the end of this, I was going to ask it, but let's spoil it now. Ashley, at Rootly, do you use Rootly to manage incidents about Rootly?

[00:07:58] AS: Of course we do, when possible. Of course, if something is going wrong with Rootly, the platform, we luckily have a very small, very agile engineering team that is really responsive, and we are at that stage where, as a startup, we're all kind of on call all the time in some way. If something breaks, we're on top of it. We, as a company, run on Slack, and Rootly also integrates with Slack. So, of course, if Slack is having an outage, then we have fallbacks that we can use to make sure we can still communicate if that takes place. But luckily, Slack is a wonderfully reliable platform that very rarely does that to us.

[00:08:38] J: Yes, Slack is hands down my favorite, what do we call it, work instant messaging tool. You only really see the value of Slack once it's taken away from you. I think the worst one I had to work with at some point was – I don't know, I don't want to slag off any competitors in case they'll be future guests on the podcast. But Slack is definitely up there. Rootly specifically works through Slack pretty much exclusively. Is that right?

[00:09:11] AS: Yes. We do have a web platform as well, where you can see things like metrics and the history of your incidents. But when you're actively managing an incident, most people do so using our integration within Slack.

[00:09:24] J: I think Slack is – well, for one, if a company has Slack, it will already be used for coordination during incidents and outside of incidents anyway. So, to me, that's a really natural fit. I'm thinking of a time in my first job, when I was working for a really hot startup, and it went to my head, I think, is very fair to say in hindsight. So, I thought I was the shit. We were managing some sort of customer-facing incident and I was fixing it the old way: literally SSH-ing into production servers and restarting things, changing configuration files manually, and what have you. And one of our more recent employees, she was working in marketing, basically, and PR and that, came by physically to my desk in the office, and asked me what was going on and what she could tell the customers, and I was hugely dismissive. I said, "I'm busy fixing this. This is way more important than what you do. We'll talk about it afterwards." So, I think that is really kind of a key story as it relates to the title of this podcast, bridging the gap between the technical and the non-technical responders, and whatever Niall classifies himself as, during incidents. Do you have any, what do we call them, war stories or anecdotes about similar things happening? And how have you generally gone about improving that kind of aspect of a company's culture?

[00:11:22] AS: Yes, I can speak a little bit to that to start. I came upon incident response in a pretty unusual way, I think. I have a fully communications background. Before I was at Shopify, I was at Disney doing communications. Not a technical person. I had no idea. I probably couldn't have even named, like, three programming languages when I started working at Shopify. I had no idea how any of the stuff under the hood was working.
As we were building out our incident response program, I started more in the customer-facing space, and eventually grew to work more closely with engineering. So, I had a huge learning curve when it came to what was going on in incidents that were, like, a platform outage, or something of a technical nature. I think, at the beginning, it was a lot of Googling, to be honest. I didn't want to bother our engineers. You were in that position where you had some comms person in your ear going, "What does this mean? What am I supposed to say?" That's super annoying, and I didn't want to be that person. But I also had to admit to myself that I had no idea what was going on. So, fortunately, Google is your friend. I probably learned a whole lot about engineering and infrastructure just from Googling while following along in war rooms. What I would do is just do my best to piece that together, and I would write something based on my understanding, and give it to our engineers. I luckily had some close ones that were really helpful in helping me understand what was going on. Shout out to people like Ryan McIlmoyl, who's at GlossGenius now, who was one of those people at Shopify who was very patient with me, and would help me adjust comms. But I think part of it is just to stop being so afraid of that technical jargon. I think a lot of people, especially comms people, see what's going on and we're like, "This is so over my head. I don't know." But if you just start where you are, start at the company you're in, learn the key pieces of the tech stack that tend to be involved in incidents, learn what Kubernetes is at a very high level. Whatever it is, just start small, term by term, Google by Google. You don't need a computer science degree, just kind of piece it together, I think, is a good place to start. It's not as scary as it looks.

[00:13:48] NM: Yes, just responding to a few things that Ashley said there, one of the lines they used to say at Google is, how do you solve a problem inside Google if you can't Google it? Which I thought was a good line, right? Like, "Oh, my God, I can't use the tool that everyone else uses for problem resolution."

[00:14:06] J: It's a bit reminiscent of the scene in The IT Crowd where the IT manager convinces the boardroom that if you type Google into Google, the internet will crash.

[00:14:16] NM: Exactly. So, from that point of view, I suppose it's surprising the extent to which, if I don't know what X means, I will just look it up and see if somebody has written some kind of reasonable explanation for it. That actually takes you an incredibly long way when it comes to the mechanics of restoring service. That's obviously a very different act from creating a service from nothing, or, we should say, the fundamentally creative act of writing software. But there is a kind of mechanical quality to restoring service and problem resolution and so on, when you are already confronted by a structure which has some flaw or some problem in it. Or maybe the problem is with your model of it, which we'll come back to in a moment. But there's something in front of you, which has some detail, and it's like, find the problem. It's a little bit like, I don't know if you play chess, or if anyone listening to this plays chess, but you occasionally get problems like, white to play and mate in two, or something.
And it turns out, there's a big intellectual difference between being told there is a thing here for you to find, where the task is finding it through a big tree of possibilities, or cloud of uncertainty, or however you would put that, versus playing on when you have no idea what the situation on the board actually represents. So, there's a kind of cognitive difference between those two activities. The other piece, I'd say, just attempting, however inaccurately, to channel the technical persona for a moment: that technical persona, in the context of incident response, often has to wrestle with this question of, what's the model of the situation versus what is actually happening? In some sense, an incident is, definitionally, a situation where the model has diverged from what reality is doing, and that's always very interesting, right? Because you have a model that says the producer, consumer, or system you are looking after works in the following way. And then it turns out it doesn't. Or it does, but there's some difference between what you thought would happen and what is actually happening. Anyway, the technical persona is often, if not primarily, wrestling with questions of cognitive models, and knowledge about the world, and questions of rapidity of change as well, because your model might have been accurate on Tuesday at 16:15, and then inaccurate on Tuesday at 16:17, and those are always going to be interesting moments.

[00:17:05] J: I suppose the challenge in communication comes partly from the fact that these models are different for different personas. So, if you're a systems engineer, you'll have one model, which is relatively close to an actual, let's say, UML diagram or architecture model. But then, if you are an end customer, you have a completely different model of what reality should look like. Then, if you are a developer, there's yet another perspective, and if you work in communications, there's yet another perspective. So, how do we navigate that complexity? Which I see as particularly difficult, because there are other complexities, for example, the architecture itself is a complexity, but there are literally people hired to understand and manage that complexity. Here, the complexity stems from the fact that there are different roles and different perspectives, and there isn't one person, unless I'm mistaken, whose job it literally is to understand the complexity that is different models. So, how do we go about that?

[00:18:24] AS: I think one thing that stands out is that your job as a communicator is really to understand your audience. So, having a bit of a lack of understanding, sometimes, of the technical complexity going on can actually be a bit of a superpower. Because if we had our engineers write all of our customer-facing comms, they would be in-depth and deeply accurate and completely confusing. Unless you are communicating with a deeply technical audience, it doesn't really matter how insightful and intelligent and accurate those comms are, because I have no idea what you're talking about anyway. So, I think as a communicator, you can use that filter to say, "Do I understand this well enough to explain it in a way that is meaningful and relevant to our audience?" You can use that to navigate where you might need to dig a little deeper and get more of a technical understanding to present things accurately, but also where maybe you need to pare back and find ways to simplify.
I like to pay attention to how big companies communicate when they're having incidents. And you can always tell when an engineer is writing the status page messaging, versus when they've had somebody from more of a communications or customer side filter through it and explain the difference between what's happening and the impact of what's happening on customers. I think if there's one thing communicators need to zoom in on, it's that. It's the impact of the problem, not worrying about having this in-depth understanding of the problem itself and what the resolution is going to be.

[00:19:59] NM: It's definitely part of it. I think also, just coming back to your remark earlier, Jeff, about whose job it is to model the thing, or what happens when there's a conflict of models, or when the model resolution, the area of attention, and so on is variable. It's a pretty interesting question, because arguably, and I wouldn't say this is achieved, right? But arguably, one of the potential benefits of using SRE approaches to things is you're trying to understand the system holistically. You are trying to integrate, in some sense, multiple models. I remember, and I think this is somebody else's line, I'd be very grateful if it was mine, but I don't think it is. I went to a conference, and on one slide, or in one talk, there's a picture of lots of machines, they all have lines to each other, and we hand-wave away the fact that they're connected by a network. Then, in the next talk, there's a picture of a Cisco and a Juniper, and then there are lines to machines which don't have distinct names. There's a cloud thing, a server rack, or whatever. That's another kind of model. And then there's a third talk, and there'd be an architecture diagram, or an organizational diagram of the people and their reporting lines, and so on. Yes, different models exist. But one of the most interesting questions around software, I think, is what can you successfully ignore, and still do your job, or do the task, or whatever? So, what we find when we deal with highly complex models, or highly complex systems, is that we're building and discarding models. As it turns out, we can no longer successfully ignore the fact that when we post this thing to this other thing, it doesn't work, like, 1.01% of the time. We have to find the reason for that. Or in other words, we have a model which, when we stand back and look at it, is kind of squishy, and cloudy, and vague, and so on. It has increasing hardness and specificity as we dive into the area in question and look at the actuality of how it works. That particular bit of the system is crystallized when we're investigating the thing that we're looking at. Then, perhaps it turns out that's not a contributing factor to what's going on. So, we zoom back out of that, and then go elsewhere in the problem space in order to crystallize that model. It's not really a question of competing models. It's a question of the resolution of your understanding at any particular moment.

[00:22:53] J: That's really interesting. It strikes me that there is a parallel between what you've just talked about, i.e., the complexity of navigating communications during incidents, and the actual incidents. Because in both cases, we have, let's call them, levels of abstraction. At some point, they don't work as expected, and we need to figure out where the issue is. I find that interesting, what can we successfully ignore? And also, I think you talked about it before, I forget your exact phrasing.
But basically, it was about how, when we build systems, there is something that we just rely on that we don't have to understand completely. So, for example, just a really simple example, if I put some sort of app on a Linux server, I don't need to understand all of the Linux kernel. But if something breaks, or more likely, doesn't behave as expected, doesn't behave the way we modeled it, that gives rise to an issue. Do you guys have any thoughts about when and if to involve vendors? Currently, I'm working for a big corporate, and basically, any system needs to have enterprise vendor support, so that in case that model is broken, or doesn't behave the way we thought it would, we are able to call in the vendor. Is that a con – I suppose, especially as a startup, you can't always afford to do that. Is that a consideration that you make after incidents or before incidents? How do you navigate that whole part of complex systems design?

[00:24:53] NM: You must understand that, as the CEO of a vendor, I am here saying you must use vendors, hugely, for everything. But actually, of course, there's the classic tradeoff between build and buy, and so forth. And many organizations, at different stages of their lifecycle, can actually go back and forth between different ways of doing this. So, I wouldn't really say there's a one-size-fits-all answer. But I will say that in incident management as an area generally, I think even with the proliferation of folks who are writing tools in the space, or selling tools in the space, I should say, it's still a huge wild west area. Loads of stuff is done with internal belt-buckles-and-chewing-gum scripts, and ad hoc docs, and so on and so forth. I don't really look at what the industry is doing on average and go, "Yes, this is totally fine, and all of the products are just gilding the lily." No, the average level, or median level, or whatever, of how we do this is just not very sophisticated.

[00:26:09] AS: Yes, we also exist in the vendor space, so a little bit of bias there, and we see plenty of build versus buy. I've decided between build versus buy myself, before I worked at an incident response automation platform, for this exact use case. I would say, of course, we do get many customers approaching us, or folks we've approached, who have some sort of homegrown version of an incident response tool and Slack automation. And I think, A, we often see that people just underestimate how much can be done in this space. They think, "Oh, it's just a little ChatOps. We can build that. No problem." But with what we offer, we go much, much deeper than that. I often find that people who do lean towards build, what they're really looking for is flexibility and customization, and they want the thing exactly how they want it to be. So, really, we're in a good place with that, because we are such a configurable platform that we actually tend to play very well with people who have already built some homegrown incident response tool, because we're able to configure Rootly to do whatever they want to do. Whereas some other vendors are going to be very opinionated. I think you have to decide if you want something to come in that's very opinionated, that's going to tell you exactly how to do the thing you want to do, or if you want more flexibility, and you know how you want to do it, and you maybe need help with implementation. And then there's all sorts of other tradeoffs in terms of, who's going to maintain this thing that you built?
Or is it going to be run off the side of the desk forever? Will there be a long-term plan to maintain it? What happens if it breaks? Then, how much of your time is going to need to be devoted to that? Like Niall said, I'm sure we could do a whole episode on build versus buy and the tradeoffs that come with that. But yes.

[00:28:03] J: One more question on the build versus buy. It's a good example of what he talked about before, that if you dig deeper, there is more complexity. Even if you buy, there's still the question of, is it self-hosted? Is it 100% hosted for you? So, what is the case with Rootly? Is it all hosted for you? Is there a self-host option for customers who are a bit more, I don't want to use the word paranoid, that's a bit cynical, but who have tighter regulations, I'd say?

[00:28:33] AS: That's a great question. I don't believe we have a self-hosted option. We run on the cloud. I don't really believe that – I'm sure there are exceptions, I won't make a sweeping generalization, but in this day and age, I don't think any SaaS product really exists completely independently. A lot of the time, if it does, I don't think you're actually putting yourself in a much better position from a reliability standpoint. Depending on your needs, it's probably a little naive to think that you can do some of these things, like hosting, better than these giant, extremely reliable cloud hosting services that exist out there. I get the temptation for people to say, "We don't want to ever rely on third parties," but I just don't know how realistic that actually is.

[00:29:23] J: Even if you do all your own hosting, again, go down a layer, and you realize that you're still relying on the electricity company that provides your data center with electricity, or, I don't know, the person who supplies your water, because otherwise your engineers can't be in the data center, or whatever it is. There is always another level. Another question, to get a bit back on track on the topic of bridging that gap. How do you decide, you've alluded to it before, how much to tell your customers, and in what technical detail? I remember, as a, again, young but arrogant engineer, being a bit frustrated at times because we were having an incident, but we were telling customers that we were doing scheduled maintenance. And I think it was to keep up the illusion of having more uptime or being more reliable. Is that probably why, in your opinion?

[00:30:33] AS: That's really interesting. My opinion on how much to share with your customers, again, there's probably no one-size-fits-all answer. Different companies have different appetites for how transparent they choose to be about their systems publicly. Sometimes, you're in agreements with some of those vendors, and maybe you can't even mention them by name, based on your agreement. But I would say, my overall advice would be definitely don't do that. Don't lie. I wouldn't ever misrepresent or just outright lie about the circumstances. I think sometimes there are things that will be omitted. Typically, the things I would tend to advise omitting from communications aren't omitted from the standpoint of trying to hide things, but more from just, is this useful information to our customers? Is this actionable for them? Does it leave them with more questions than answers? I think sometimes there is a temptation to tell them everything, and it's in the spirit of, we want to be transparent.
We want them to know what's going on. But when you're a customer, and you're impacted by an incident, and this, whatever it is, piece of software you rely on isn't working, probably the last thing you want is to be inundated with a massive amount of technical detail, when you're just like, "Okay, but what do I do? What does this mean for me?" So, I would say to put that at the forefront of your thinking: what do your customers actually need to know? If your customers are deeply technical, they probably need more technical information. If your customers are not technical at all, they probably just want you to tell them what to do or what not to do, or what you can say. Also, just be honest about what you know and what you don't. I think there's a lot of danger in speculating, like, "It'll be fixed soon," when you maybe don't even know what is actually causing the problem yet. So, just be careful around speculating and making those assumptions. But centering it all on what is actually valuable and actionable information for the people you're talking to is probably the best way to go.

[00:32:50] NM: Yes, I have some opinions here, all right. I mean, it's interesting, because fundamentally, this kind of communication is a communication between humans and between different audiences, with every kind of attribute that implies. Ashley, you were saying earlier, you communicate technically if your audience is primarily technical. But actually, of course, for a sufficiently large product, or popular product, or however you would put that, there are different audiences, right? So, part of the problem is, well, I could stick all of this data in an appendix, but some people would be annoyed by it, or some people would skip straight to the index, or whatever. Is there some median, some useful median, we can head for? I do not think that there is a finite algorithm which leads us to the correct decision for this in every circumstance. I will say that my experience in the industry suggests that, primarily, the pressure to remove the detail from stuff comes a lot from legal departments, actually, who believe that the more detail you supply, the more opportunity you give an antagonistically minded customer to assert, in court, that you behaved poorly under law X, Y, Z, or however you would put that. So, I think the way I would describe it is, in the industry overall, there is more of a tendency to remove stuff from communications than there is to put stuff in, more of a tendency from the business layer. From the technical layer, of course, it's the other way around. Because the technical folks supplying that detail are like, of course, everyone wants to know host 1234 is down, customers A, B, C through X, Y, Z are affected by this in some notional sense. So, they would want to know that. Of course, not all of the time do they want to know all of that. And as Ashley says, centering around impact is often a better way to structure this. But also, again, just being realistic about it, impact is not always easy to quantify. You also don't always know that impact at the moment when you are compelled to communicate. For example, in the recent Datadog outage, there was essentially a long gap in communication, because they were spending a lot of time trying to figure out what the root cause/contributing factor was.
They don't really have that much more to say, other than, "Our engineers considered 73 of the 146 possible options and have discarded 73 of them. We'll get back to you soon." That's very satisfying for an engineer, because at least you can go, okay, the progress bar is going across the screen. But it's much less satisfying for somebody outside of that culture, because they're going, "You considered 73 things? Oh, my God, you wasted so much time. You should have considered one thing, and the one thing should have been right." But you can't explain to them that, actually, there's no way that you can pick the right thing, or if you do, it's hugely statistically unlikely that you would do that, et cetera, et cetera, et cetera. This is partially why I say it's an interesting area, because there aren't many algorithms which are going to result in determinate success given all inputs. But I will come back to this point about the statement you made about your company in your early career there, Jeff. Ultimately, it's a question of trust. Trust management, right? So, by going, "Oh, yes, this is totally a scheduled outage. It just happens to have been scheduled for 2:30 in the afternoon, our peak time," or whatever it is, you can, of course, get away with this, right? I'm sure loads of people do get away with it a few times. Then, it'll come out somehow. Somebody will brag to somebody else in some chat or other. It'll get copied around. Then, what you have is, "Oh, I'm your customer and you lied to me, because you thought it would be convenient? Well, it is exceptionally convenient for me to retain my dollars. Thank you very much." So, the company is putting at risk a sustained relationship with their customers, on foot of the convenience of a short-term optimization, which most people would say is, shall we say, short-term optimizing at best, and customer-disrespectful at worst.

[00:37:50] J: Yes. I think, to put it into context a little, in my view, it was mostly startup license. As a startup, you can build technical debt in all sorts of ways, and I suppose if you don't have a dedicated communications team yet, then taking shortcuts there, while you're working on making the whole process more robust, is justifiable. I think – I don't want to slag them off at this point. They really are a fantastic company, and actually one of the shining beacons of customer respect, if I can claim that in my unbiased opinion. But I really get your point about the bit you said about figuring out which solution to go for when solving a problem. In hindsight, it's always really easy to know, which begs the question, why don't more engineers just use hindsight from the get-go?

[00:38:56] NM: Well, we haven't fixed the time machine thing yet. So, when we do ship 1.0 of that thing, we'll be going back in time all the time.

[00:39:06] AS: Customers are going to love that. Once we figure that out, incident communications are about to get a lot easier.

[00:39:13] NM: Very much so.

[00:39:14] AS: But something on Niall's point, actually, that just stood out to me. You brought up a few things around how often there's fear that is going to come from other stakeholders, like your legal team, or your PR team saying, "Oh, no, we have a product launch. We don't want to look bad in the press right now.
How can we bury this?" Or executives feeling that fear and saying, "Can we just lie?" Whatever it might be, I think a takeaway from that is, if you can, avoid having these conversations about your level of risk appetite and transparency for the first time while an incident is actually happening, when all the fear and reality of that situation is right in front of you, and try to have those conversations before you're in that position. So, your legal team can go into it, because a lot of their fear probably comes from the fact that if something does lead to litigation, well, guess who's on the line for that? It's the legal team who looked at and approved the comms. A lot of it comes down to fear and uncertainty around what is okay. And of course, people are going to lean towards the safest possible thing they can cling on to in an incident, which is often really sterile, generic comms that sound like a lawyer wrote them. So, having safe spaces to have those conversations and experiment with, what would a risk look like here? What could go wrong? Before you're ever in that situation, can alleviate a lot of that.

[00:40:46] J: I like how Ashley's effectively given us an example of anticipated hindsight here. So, think about the problem before it occurs. It literally is the solution.

[00:40:55] AS: Yes. There are ways to do that. And incident response and SRE are really difficult because they are spaces where there's not a lot of safety to fail. The stakes tend to be high. So, we have to manufacture those safe spaces with things like game days, and tabletop exercises, and simulations. Those are ways that we can experiment a little in that space, before you're on call, and your palms are sweating, and you're terrified that if something happens, you won't know what you're going to do.

[00:41:25] NM: Awesome. Just to say, again, channeling the technical persona: one of the difficulties with the technical persona is that although you're obviously relying on comms folks, and you're in a team, kind of embedded with other people, there is something fundamentally alone about being on call and responsible for an incident, which is not very comfortable, and which interacts poorly, kind of cognitively, shall we say. The emotions of being under threat, particularly in, should I say, at-will employment environments, and the, shall we say, creative sparks that you need to have when you are doing fundamentally difficult incident response. These two things are really in tension. So, when you're on your 14th hour of an on-call shift, which got extended because of a gigantic outage, and no one really knows what's going on, and it's late, you're tired, you haven't eaten. Nonetheless, you're expected to do that complicated model building and validation, et cetera, while also having the threat around, in some cultures, continued employment, et cetera. That's the fundamental sympathy at the root of the technical persona and their job in these kinds of contexts.

[00:42:58] AS: And I cannot resist squeezing one last plug in with that, because you just teed me up so perfectly for it, Niall. That cognitive overhead, or whatever you want to call it, is a lot of what we aim to solve for with Rootly. We have a customer that's actually done a really good job of this, where they've set up an incident test, which is actually like a little mock incident that will prompt you through the process when any of their people start an on-call shift or take on incident command.
And you can just kind of shake the rust off and run this little test, and it prompts you through. It says, "Hey, you're the incident commander. That means you're responsible for this, this, and this. Here are the rules you should be aware of during an incident. Here's a reminder that your status page hasn't been updated in 30 minutes, you should do that." We just kind of take that away, so that it doesn't have to all live in your head. That's one of the really great things about using automation: you don't have to use it for just your technical response. You can actually use it to create some psychological safety, knowing this tool is kind of helping you along. So yes, that's something that I particularly love about some of the ways that we've seen our customers use automation.

[00:44:12] J: That sounds really cool. I love the idea of not just having customer experience or developer experience, but here having on-call support person experience built into your workflow. That's really cool. Remind us, if people are interested in Rootly, where do they go to check it out?

[00:44:30] AS: Rootly.com, and we do do demos that are completely free, and we can personalize them. If you're interested in that, hit "Book a Demo" on our page and we'll set you up with someone who can walk you through the tool.

[00:44:42] J: Fantastic. As promised, we'll end on another one of my shenanigans. It was my first week on call, which was also just week three of me having a job; before that, I was at uni. I still didn't know any of the words. I barely knew what NGINX was, and the terms production, staging, and development were all new to me. So, I mixed them up. Someone flagged a very minor thing to me, saying, "Oh, in our dev environment, this port is publicly exposed." Someone sent me that at, like, 9pm. No customer data, the company was barely known at all, so no one was going to specifically go and attack it. I figured, better safe than sorry, no one's going to need dev anyway. So, I decided that we should just shut the server down and deal with it the next morning. However, since I mixed up the words, I just posted on the general Slack channel, we're shutting down production.

[00:45:48] AS: Did you put an @here on that? I'm sure that went over really well.

[00:45:52] J: No @here, @everyone, of course. Not just the people who are there. Ashley, Niall, thank you very much for coming on the show. I really enjoyed talking to you, and I'm looking forward to this episode being published.

[00:46:05] AS: Thanks so much, Jeff. This was really fun.

[00:46:07] NM: No worries.

[END]