EPISODE 1818 [INTRODUCTION] [0:00:01] ANNOUNCER: A distributed system is a network of independent services that work together to achieve a common goal. Unlike a monolithic system, a distributed system has no central point of control, meaning it must handle challenges like data consistency, network latency, and system failures. Debugging distributed systems is conventionally considered challenging, because modern architectures consist of numerous microservices communicating across networks, making failures difficult to isolate. The challenges and maintenance burdens can magnify as systems grow in size and complexity. Julia Blase is a Product Manager at Chronosphere, where she works on features to help developers troubleshoot distributed systems more efficiently, including differential diagnosis, or DDX. DDX provides tooling to troubleshoot distributed systems and emphasizes automation and developer experience. In this episode, Julia joins Sean Falconer to talk about the challenges and emerging strategies to troubleshoot distributed systems. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:18] SF: Julia, welcome to the show. [0:01:20] JB: Thanks, Sean. So nice to be here today. [0:01:21] SF: Yeah, absolutely. I wanted to start off digging into your background a little bit. Can you talk a little bit about your journey into the world of microservices, observability, what led you to Chronosphere, and why you're interested in these issues around troubleshooting? [0:01:39] JB: Yeah, absolutely. Well, I started out as a librarian, actually. Maybe not the most traditional career path into tech. I worked at the Library of Congress. I got a fellowship there. I went from Library of Congress to actually working at The Smithsonian. I will say, it was less like, what you think of as traditional librarianship, written word librarianship; a little more digitally focused librarianship. I was working with scientists and researchers, and I was helping them store and organize their data, so that they could ask questions of it and get the answers they needed, going from information to insight, as I used to say. I was working in DC at the time, and I did eventually move over to work at a company called Palantir. It was maybe less well-known in 2014 than it is today. The reason I moved over is, at least at the time, they really talked about their software as a fundamental tool to help the government do something very similar to what I had been doing as a librarian. That is, understand, organize, analyze their data in a central location with a central toolkit. I think government agencies faced really similar challenges to those faced by the scientists I had been working with, which is that the data was stored in silos, and each silo was organized differently, and you had different tools to work with each silo. Very few people really had been putting in the manual effort to understand how to work with all those different data silos and get that data together to provide insight. You can probably see where the through line is, from my information-to-insight role as an individual contributor to going to work at a company where that seemed to be their whole purpose. It was really exciting, and I really enjoyed that path from librarianship into tech. While I was at Palantir, I started out in, again, that government-facing side of the business in a customer-facing role.
The first time I actually engaged with observability, I was actually, what we would say is, high side. I was in a customer secure computing facility. I had been on call. It was late at night. Our developers for that software, they weren't always able to get on those government sites, right? They weren't always able to come out there and actually get hands on keyboard to see what was happening when something went wrong. They would rely on people like me to sit at the computer, be on the one phone line that could connect to the outside world and be their hands and follow their instructions. I think my first engagement was, "Hey, I need you to grep for something that looks like this." I was like, "Cool. What is grep? I don't know." They really walked me through what it means to SSH in somewhere, what it means to grep, what a log is, what a metric is, how to describe what's on a metric dashboard so that they can guide me through what else to look for to help them diagnose the problem. It was really interesting. I really enjoyed engaging with that side of software, and it really demystified software a lot for me, which I appreciated. As I spent time at Palantir and as I grew and the company grew, I actually moved into their product org. That's where I started learning about the difference between monoliths and microservices and between on-prem infrastructure and containerized infrastructure. Because I was working with teams that were doing both, some that started natively building their services in Kubernetes, others that had built services in a monolith and were then trying to migrate them over to work in a more containerized environment and split that up into microservices. It was really challenging, and there were challenges on both sides. I enjoyed helping people with those and working on those challenges, which brought me to Palantir's central observability team, which at the time was called their Signals team. That was the whole purpose of that team. On that team, our challenge was to take all that telemetry data from all of the software, whether it was on-prem, or commercial cloud, or GovCloud, or secure cloud, monolith, microservice, whatever it was, and develop tools and methods to bring that data into a central place where they could use it to troubleshoot issues. Of course, that did bring me to Chronosphere, I think, pretty naturally. We actually interviewed Chronosphere as a vendor at one point when I was at Palantir in that role. They were just, honestly, some of the most transparent, expert, nicest vendors I had ever interviewed. I was just like, "This company gets me. They really understand my problems. They understand my engineers' problems." Palantir, where I had been for six years at that point, had gone public. I felt like I had reached the end of what I had wanted to do with that company. I was looking for a new role and Chronosphere seemed like a natural fit. [0:05:45] SF: Awesome. Yeah. I mean, I actually think that the path from librarian to product management, where you're working in data, working in the world of microservices and stuff like that, is not necessarily that crazy a path, because if you think about some of the work that has come from organizing books and structuring taxonomies, that's probably the original inspiration for ontologies and things like that, which also then leads to Palantir, where they're big proponents of ontologies and have done a lot of work in that space as well.
A lot of the foundations of what we think about with databases and the like probably came from being inspired by the way that we thought about organizing books in libraries. [0:06:26] JB: Absolutely. I like to say, librarians have been organizing data for 2,000 years, right? These are not new human problems. These are problems that we've always had and we've always tried to develop tools to try and fix, and data is just another form of information, right? It's different in size. It's maybe different in complexity. It's different in the pace of change, but yeah, you're using some of the same foundational principles, like how to store it? What's the best way to store this for access? What is the best ontology and way to organize it? What metadata do we need to index, right? The Dewey Decimal System is a way of indexing metadata so that you can find things quickly. It's all related. Yeah. [0:07:00] SF: For our younger listeners, you'll have to go and do a Wikipedia search on what the Dewey Decimal System is. In terms of these problems around data silos, it's interesting. We've been working on that problem for probably as long as humans have been writing down and storing information. It doesn't seem to be going away. If anything, it actually seems to be getting worse, because we have more and more data to manage. Taking a step back and just looking at that challenge, what are your thoughts on how we make progress toward breaking down some of these silos? [0:07:32] JB: Yeah. I think it really is about understanding the needs that we have from that data. What questions do we want to ask? Who's asking those questions? How fast do they want to get answers? Because if we start there from a need-based approach to how we want to organize this data and reference across this data, I think we're going to be able to organize and break down the silos faster for the right people. That's probably also the product side of me talking. I'm always talking about, what's the problem we're trying to solve. If the problem you're trying to solve is organize the world's information and put it into a single place so that everyone can use it, that's massive. You'll be working on that forever. You will never be done. There will always be something. [0:08:07] SF: Yeah. You will be working on that for 30 years. [0:08:09] JB: Right. That was their original charter. Organize the world's information. Do you go to Google and always find exactly what you need right away? Maybe, maybe not. It's a hard, hard problem. I think you really have to start with what's the outcome we want and how do we build tailored solutions to work across the relevant data for that outcome? Less of the full-scale approach where we're always going to have all the data in the one place we want, but we can find better ways to let the right people access it in the right way. [0:08:35] SF: What are your thoughts on some of the challenges that things like microservices introduced? We had the monolith, three-tier architecture. There's a certain simplicity with that, but of course, we run into some scalability issues, even just from an engineering perspective. If everybody's working on the same code base, it's all part of this large piece of software that gets deployed somewhere. It can really slow things down. We break it apart, which makes us more agile, more nimble, but then, as we break it apart more and more, we potentially introduce a lot of challenges from a distributed systems perspective, just brittle infrastructure, requests and responses.
One part of this complex mixture of dependencies goes down, the whole thing goes down. What are your thoughts about having worked in that space for a while? [0:09:21] JB: Yeah, a couple of things. The first one you mentioned is with microservice, I think, one of the biggest problems they introduce is where is the problem coming from? I'm going to reveal my age again here, but I think in the past, if you heard a phone ringing, it was like, well, it's one of three phones in the house, because they're all connected by wires to the wall. That was your model of focus troubleshooting. You knew where the problems were likely to occur. You had a pretty good understanding of the system. Now, if you have a problem in a distributed system, it's like, when I lose my cell phone, and I ask the Google to ring my phone to help me find it, and now I have to figure out where that ring is coming from. It could be anywhere. It could be the kitchen. It could be in the refrigerator, true story. I've done that. It could be in the car. It could be in a friend's house where I was two weeks ago. Microservices introduced the problem of, is it my microservice? Is it your microservice? Is it my dependency on some other internal team? Is it my dependency on my cloud provider for some auto-scaling infrastructure that I don't even understand how I interact with it? I just send the request across the API and hope it works. You just have so many more places where that could be coming from. I think that is a huge problem with microservices. It introduces a first-order problem of where did it actually happen? Where did it actually start? I think to add on to that, not only is that the problem, but you're having to dig through so much more data to find the answer, because you're running containerized infrastructure. The whole point of that is so that you can auto-scale. It can scale up when I have more load. I can scale down. It should save me money. It should make things easier to deploy, to run, to change over time. That also means that all those deploys and scales and runs and changes are introducing more data, more data volume, more data complexity. You're going to have higher cardinality, more interesting facets of data that you're having to rule in, or rule out, where you're trying to navigate to find out where things are coming from. You have that compounding problem of could be coming from a thousand more places than it used to. I have a lot more data about each of those places to dig through than I used to have. I would almost say, this is my own conclusion, but looking at those two things, the other problem I think microservices introduce is, we rely on fewer and fewer people at our organizations to really understand and be able to solve that problem of where the trouble is coming from. I think you get this hero. I saw this at Palantir. I still see it at my current employer, although we're really doing our best to get out of this with the different tools we're developing, and I think we're making progress. Yeah, you have your organizational heroes. For someone listening, if you're working at one of these companies, thinking your head like, "Who do I call when the incident's really bad?" Probably four names on that list, and that's a problem. I think that's really the ultimate problem that microservices have introduced is you're over reliant on having the right people in the right incident room at the right time to fix a problem, and that's extremely brittle. What if those three people win the lottery, right? 
Now you're out of luck, and you don't have anyone who can solve that problem, and I think that's a huge risk in companies that are running these microservices today. [0:12:16] SF: Yeah, this idea of a hero, for instance, has come up before in other interviews I've had with people who work in this space, so I'm not surprised to hear you talk about that. That is a huge, huge challenge. Essentially, with an incident, you have a bus factor of one, or three. How do you fix that? What can organizations do to help solve that problem? [0:12:36] JB: Yeah. I think, the first thing you have to do is you have to know what data you need and what data you don't, right? Because we talked about that data explosion just now. Not all that data is necessary, and especially working at the company I work for now, I see customers consistently reduce the data they actually store, the data they have on hand to solve an incident, by 60%. That's a huge number, and that vastly simplifies the troubleshooting process, if you can just get rid of some of that noise, so that every time you're facing an incident, you're working with a much smaller, more relevant data set. The other thing organizations can do, I think, is to make that data accessible without requiring expertise in the tool. I was actually talking to a prospect, and he said he used to work at Google, and he was like, "The tools that our SREs built to dig through our data were terrifying." You build super complex things that only a few people understand, or maybe, if you don't build things in-house, you purchase something from a vendor, and it requires you to learn a new query language, right? It doesn't have to be in-house to be complicated to learn. If you're having to learn new query languages and you have tools for all of it, some in-house, some things you purchased from a vendor, now you have to learn each one of those tools, and you're just going to keep running into this problem. I think understanding what data you need and don't need and trimming it down to just what you need, so you have a strong starting point. Reducing your tool sprawl, or focusing on tools, maybe you do still have two or three, but tools that are really easy to walk up and use. You don't have to spend a lot of time learning how to work with the tool, right? The tool is walk-up friendly, or built for your NOC, or your novice user. I think, then, the other thing is you just have to have those tools that are also built to handle change, because your data is going to be changing all the time, and that's where the expert often is relied on, because no one remembers what happened two weeks ago. No one knows how that dependency arose, or why it's calling this thing anymore. You have to have tools that can help you understand the history and the context of where you are right now, so that anyone can walk in and help fix it. I don't think heroes are bad. I just think, I'd rather invest in heroes as people to really dig into the root cause after the fire is put out, if that makes sense, right? Let's make it really easy to put out the fire, get back to normal functioning, then bring in those experts when they have a little more time to go deep. Let them dig in, let them ask complex questions. But use them in that way where it's less brittle and less something that you're depending on for your entire business revenue to keep working, right? [0:14:56] SF: Yeah. I mean, I think the challenge with the hero approach is that it's just difficult to scale, right? [0:15:01] JB: Yeah.
[0:15:02] SF: If someone's unavailable or whatever it is, you don't want to, essentially, be in a situation where you can't solve an incident because someone's on vacation. It feels like, in a lot of ways, because of these challenges that microservices and distributed systems bring in, there's been a lot of brain power that's been put into, how do we improve on grep and logs? How do we basically put nicer software on top of being able to do distributed grep? [0:15:29] JB: I think you're totally right. The hero problem, not only is it hard to scale, it burns people out. That's the other thing. In general, it's just human. I don't want to burn out my friends, regardless of whether or not they scale. I don't want to put that pressure on them. Yeah, nicer tools. Nicer tools are great. Tools that don't require you to learn a query language are great. I think there's also something here about learning to use different types of insights together to give you a holistic view, where you can lean in and say, what can I learn from my metrics? What can I learn from my traces? What can I learn from my logs? How do I use those three things together, instead of always going straight to grep for logs? Can I actually use my metrics, or traces, to find a signal earlier, locate the place it is? Then when I grep for logs, I'm grepping through a much smaller portion, right? I know exactly what I'm looking for at that point, so it's easier to find. We think a lot about, what are the right purposes of each of these tools? How can we combine those purposes to give you a faster-to-learn, easier-to-use experience? [0:16:27] SF: You talked a little bit about this idea of, essentially, cutting down the amount of data that you're storing, so that if you can cut things down by 60%, then it's just easier to essentially deal with that volume of data, because you're probably going to have a better signal-to-noise ratio. How do you do that? How do you determine how to cut out 60% of this data? Because I feel like, when it comes to things like, telemetry, logging, monitoring, people are just like, "More data is better. I don't know what I'm going to need, so I'll just keep everything, essentially." [0:16:54] JB: Yeah, the pack rat problem is something that we hear customers talk about. Everyone wants to keep everything. A couple of ways we think about this. First of all, we think about what you're already using. What is in a monitor? What is in a dashboard? What are people searching for today with the query tools that they have? What are your service accounts calling for repeatedly to do whatever financial analysis they might be doing on the back end? That's a great indicator of what data you need. Honestly, that's a small fraction of your data. It's shocking how small a fraction of your data that is. You can also look at that and look at best practices. Dashboard templates that exist out in the market for monitoring Kubernetes, those are pretty well-understood problems, right? You know you're going to need your container metrics and you're going to need them to be at this interval and in this summarized way. That's pretty easy to look at. Look at what your people are using and combine that to get a really good understanding of, hey, what do we need that we know we need, because people are looking for it? What do we think we should have? Because this is what the market is saying.
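A minimal sketch of that usage audit, purely for illustration and not Chronosphere's tooling: cross-reference the metrics you ingest against the metrics your dashboards, monitors, and recent queries actually touch, and treat the leftovers as candidates to stop storing. The data shapes and the find_unused_metrics helper are assumptions.

```python
# Hypothetical usage audit: which ingested metrics does nothing actually reference?
# The input shapes here are assumptions for illustration, not a real vendor API.

def find_unused_metrics(ingested_metrics, dashboards, monitors, recent_queries):
    """Return ingested metrics that no dashboard, monitor, or query references."""
    referenced = set()
    for dashboard in dashboards:
        referenced.update(dashboard["metrics"])   # metrics the dashboard panels query
    for monitor in monitors:
        referenced.update(monitor["metrics"])     # metrics the alert rules evaluate
    for query in recent_queries:
        referenced.update(query["metrics"])       # ad hoc and service-account queries
    return sorted(set(ingested_metrics) - referenced)


unused = find_unused_metrics(
    ingested_metrics=["http_requests_total", "container_cpu_seconds", "legacy_cache_hits"],
    dashboards=[{"metrics": ["http_requests_total", "container_cpu_seconds"]}],
    monitors=[{"metrics": ["http_requests_total"]}],
    recent_queries=[{"metrics": ["container_cpu_seconds"]}],
)
print(unused)  # ['legacy_cache_hits'] -> candidate to drop until someone actually needs it
```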
Let's put that together, let's see how much of my data fits one of those definitions, and everything that's not in that definition, let's throw it away, because there's no utility for it today. Now, I think the other piece of this is, I mentioned that data changes; what you need might change over time. You also need tools that let you change what you're collecting, or what you're storing, very dynamically. It shouldn't take re-instrumentation and a redeploy to change what you're collecting. It should be something that you can do from a central location when you start to see something new. You have a customer, they have a metric, no one's ever used the metric, they drop the metric. Next week, they see a ton of queries against this metric, and someone setting up a dashboard, and those users are reaching out saying, "Hey, I can't find this metric." You're like, "Oh, I didn't know you needed it." Now, if I have a tool that can just say, switch it on, you have it tomorrow, people are going to be a lot less worried about dropping data today when they know they can get it back as soon as they need it again. I think that's maybe the two-sided way of handling this problem: first, you look at what people need, trim down to just that. Then as those needs change, have dynamic tooling that lets you adjust that collection to meet what people need right now. [0:18:51] SF: Can you talk a little bit about Chronosphere's differential diagnosis and how that potentially relates to this problem that we're talking about? [0:18:58] JB: Absolutely. Differential diagnosis is inspired, actually, by what we saw these heroes doing when we went to talk to our customers and said, what's your process when you call in the hero? What do they do? It's based on their diagnostic process. DDX basically does what they do with one click. It takes all of the data about the thing that's having the problem, right? I've narrowed it down to this particular endpoint and this service. It says, cool, what data do we have about that endpoint and that service? What are all the facets of that data? Let's take those and let's split them up into piles. Let's look at things that are bad and things that are good. Let's compare those two things. Bad, good, you've got errors, you've got successes, you've got really high P99s, versus really low P50s. DDX does all that split-up and dimensionality analysis for you and just presents you with the results. You can start to find outliers, because that's what these heroes are doing: seeing what's unusual about the things that look bad, right? How does that compare to the things that look good? What can I change? How do I make that bad data look like the good data? DDX does all that for you in one click. I think really, the power of this and how this works with all we've just talked about is it's doing that analysis on a really good set of data, right? We know this data is relevant. We know these facets are valuable. We're not doing it on a big pile of who knows what that's probably full of noise. We're doing it on data that's hopefully already high signal, so that you can trust the results. I'll also say one thing here. We haven't talked about this yet, but everything we do with DDX is transparent. We present you with the results. You can choose to act on them very quickly, right? The machine does the pattern analysis on the high scale data. You bring your human knowledge and context to that and say, "Yup, that build version. I did forget to deploy that to Japan," real-life example.
"I need to go do that deploy and everything will go away." You also have to trust the system. Anytime you're doing a human-machine partnership like that, you really want everything the machine is doing to be incredibly transparent and verifiable by the human. Because I think we as humans, the first time a system says, "Something looks funny," we say, "Are you sure? Let me check. I don't know. I don't trust this." The other way I think DDX works with all that we've talked about is it makes all of its results verifiable by the humans that are actually working with it, so that you can trust it over time and learn to maybe let yourself lean on it a little bit more. Let your novice users bring you conclusions from that, because you know as the expert, you can always go in and verify their results if you're suspicious. And it lets you leave the room. It lets you leave the room and leave that diagnostic work to maybe people who are less familiar with the system. [0:21:28] SF: You talked about how some of the inspiration for this comes from, looking at what the hero is doing. Is there also an inspiration from the world of medicine? If you look at differential diagnosis there, that's about distinguishing diseases from others that have similar symptoms. Is there some inspiration that came from that world as well? [0:21:47] JB: Oh, absolutely. I should have said that earlier. I'm glad you brought that up. Yeah, we looked at this and we were like, what are people doing? They're dividing symptoms into piles. They're saying, what do these symptoms tell me? What do those symptoms tell me? How do I look at the difference between those to identify a most likely root cause? I happen to have family in the medical industry. I think a lot of us do. We said, hey, that really sounds like differential diagnosis. I have to give credit to the TV show, Dr. House, another old TV show, right? That's what he was famous for is, "Let me look at these things. Let me compare them to each other. I'm going to use that to get to a diagnosis much more quickly than I could by just looking at one symptom and seeing what that can tell me." Being able to do that's a maybe more cluster comparative analysis is what differential diagnosis is about. Absolutely a true inspiration from the medical world. Those people are some of the best diagnosticians that we have in the human population. I love drawing inspiration from other fields, where it makes sense. [0:22:39] SF: Yeah. They've been dealing with big data problems, before there was even the term big data. [0:22:43] JB: Oh, my gosh. Yes. [0:22:46] SF: Yeah. I think the human genome was mapped back in the 90s or something like that. [0:22:50] JB: We still don't understand half of what it told us, right? [0:22:55] SF: In terms of DDX, how does this work behind the scenes? How is it figuring out how to point you in the right direction? [0:23:00] JB: Yeah. Like I said, it takes all those facets of dimensions. When I say facet or dimension on your data, we're saying, let's look at all the labels, let's look at all the tags, let's look at all the values for each of those labels or tags, and let's look at the recurrence of them on the things that look bad. Very simply, let's say, you have a pile of data, you've got 100 requests, right? You've got 100 requests, and 30% of them had an error, and 70% of them did not. 30, I don't even need to say percentages, right? This is 100. 30 and 70. 30 had errors, 70 did not. We take those, we divide them up. 
Then we say, okay, let's look at all the facets, so all the tag-value pairs on each of those piles, and let's see which ones recur the most frequently in each pile. Every time there's an error, on 100% of the errors, do we see this build version? On 100% of the successes, do we see this cloud region, right? We go across those and we order those from most prevalent to least prevalent in terms of tag-value pair in each pile. That's how we start to present you with those results, as errors seem to have this build version, this cloud region, this user token in common. Those three things are in greater than 95% of your error requests. Then we go look at your successful requests, and we do the same thing. Maybe on your successful requests, we see that it's a completely different build version. Maybe it's the same cloud region, so maybe that one's a red herring, right? Because it's the same prevalence on errors and successes. Maybe successes are then equally spread out across all your user tokens. Maybe it's something about this user request with this build version. Did the build version introduce some new validation in that user request? Maybe they happened to be used to sending in something that's no longer valid. You can start correlating those two things together when you see that those are common in errors. You can also take out the noise of what's common in both by doing that side-by-side analysis. In DDX, you can also rank by different things. You can rank all your results based on what you see in errors, so that you can see, are the things that are highly prevalent in errors prevalent, or not, in successes? Because if it's the same prevalence, if the same tag is in 90% of your errors and 90% of your successes, you can rule it out. I think that's the other part of this differential diagnosis is being able to build a hypothesis. I think it's this. Then prove or disprove it by iterating with the tool and being able to say, "Let me rule this out. Let me rule this in. Let me continue to iterate and find the things that are most outlier in my things that look bad so that I can fix them." [0:25:26] SF: In order to do some of this pattern recognition behind the scenes, are you using some form of clustering algorithms to group these based on the different features that essentially have the error, like the region the deployment took place in? [0:25:39] JB: Yeah. I think it's even simpler than that. We're looking at counts and then we're ranking them. It's something that we are able to do, I think, because of the fact that we can do these things at scale. Sean, honestly, that's as much as I know about this answer. At that point, that's where my knowledge ends. [0:25:55] SF: No problem. A lot of incidents, I think, tend to be related to change management. Someone makes a change, then, of course, that results in some outage, or spike in latency, or whatever it is. Why is it that these user impacting incidents tend to be related to change management, versus some other type of issue? [0:26:17] JB: Yeah. I think we see, absolutely, the first step you take when you see an incident is, when was the last deploy? What was in the last deploy that could have caused this? Anytime you're changing code and the code meets the road, you're going to open yourself up to pushing code paths that could not be covered in testing. When you're developing software, and I say this as a product manager, not a software developer, so apologies for any inaccuracies to my software developer friends who are listening.
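To make the counting-and-ranking idea Julia describes above concrete, here is a minimal sketch: split requests into error and success piles, compute how often each tag-value pair appears in each pile, and rank pairs by how much more prevalent they are in errors than in successes. The data shape and the differential_diagnosis function are assumptions for illustration, not Chronosphere's DDX implementation.

```python
from collections import Counter

def differential_diagnosis(requests):
    """Rank tag=value pairs by how much more common they are in errors than successes."""
    bad = [r for r in requests if r["error"]]
    good = [r for r in requests if not r["error"]]

    def prevalence(pile):
        # Fraction of requests in this pile carrying each tag=value pair.
        counts = Counter()
        for r in pile:
            for key, value in r["tags"].items():
                counts[(key, value)] += 1
        return {pair: n / len(pile) for pair, n in counts.items()} if pile else {}

    bad_prev, good_prev = prevalence(bad), prevalence(good)
    # Pairs equally prevalent in both piles (red herrings) sink toward the bottom.
    ranked = sorted(
        bad_prev,
        key=lambda pair: bad_prev[pair] - good_prev.get(pair, 0.0),
        reverse=True,
    )
    return [(pair, bad_prev[pair], good_prev.get(pair, 0.0)) for pair in ranked]


requests = (
    [{"error": True,  "tags": {"build": "v2", "region": "us-east"}}] * 30
    + [{"error": False, "tags": {"build": "v1", "region": "us-east"}}] * 70
)
for (key, value), in_errors, in_successes in differential_diagnosis(requests):
    print(f"{key}={value}: {in_errors:.0%} of errors vs {in_successes:.0%} of successes")
# build=v2 rises to the top; region=us-east scores the same in both piles, so it sinks.
```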
What you do is you feel responsible for evolving the code of your own service. You say, "I want to make this change. I think this change is going to make things faster. It's going to handle new data types. It's going to fuel this new feature that my product manager really wants to get out to customers. Great." I've written the code to make the change. What's my next step? Well, hopefully, your next step is writing tests. You're writing unit tests, you're writing integration tests, you're writing end-to-end tests. You can't write tests for every possible facet and permutation of how this code will interact with all of those dependencies up and down the chain. We talked about microservices. But it's not even just about microservices, it's about what it's going to encounter in the wild. Muhammad Ali: everyone has a plan, until they're punched in the face. All code looks good, until it runs in production. You write your tests to cover as many reasonable happy paths as you think will be exercised in production. Then you release, then you hit production data, then you hit some weird configuration in one tenant that you didn't know would exist. There are always unknowns that happen when you reach that production release point that couldn't be covered in tests. That's why the first thing people ask when they're trying to troubleshoot an issue is what changed? Because probably the fastest way I can get out of the issue is to correlate it with a deploy and roll that deploy back. If every incident could be fixed by rolling back a deploy, or turning off a feature flag, people would be so much happier, because then you're out of the fire, then you can bring in that hero to actually work on root-causing why that deploy had that problem. What kind of workflow did it run into that it did not expect, and how do we fix that? I think DDX also brings that into play. We can show you your deploys in your system and let you do that DDX analysis for things before and things after to help you understand whether or not the deploy was the root cause, or what in the deploy changed that can point you to what you need to go and fix before you roll out again. That's, I think, maybe a long-winded version of your answer of why change events matter, why deploys matter: they're the first point where the rubber meets the road, and the road always has turns that you don't expect. [0:28:42] SF: We started off by talking about, how do you essentially get away from having to depend on this heroism that tends to happen within organizations to deal with these types of incidents? DDX is attempting to do that. But what knowledge does a developer who's using DDX actually need in order to navigate and troubleshoot the services? Is there a heavy investment that they have to make in terms of like, okay, well, now there's this new tool that I need to understand and use in order to just debug these types of issues? [0:29:11] JB: Yeah. No. Actually, that was really important for us when launching DDX: you could go into your observability system, because a monitor fired, right? Say it said something like, high error rate at this endpoint in my service. You could go directly from that to your service page. If you looked at that service page and said, "Yup, that looks like my problem," you could just click a button that said, differential diagnosis, and get to those results. You didn't have to learn anything else about the tool. That was really, really important to us. You didn't have to learn a query language.
You didn't have to learn how to navigate a bunch of things that didn't feel familiar. Monitors feel familiar. Clicking a button that says, "Diagnose this problem," is a really easy single step to learn. It doesn't require you to learn anything about the underlying data. It doesn't require you to learn a new query language. It just gets you some results. Now, if you do want to go and understand what's behind those results and look at all the data, we present it to you. We try to give you a UI to make it really understandable. At that point, maybe when you're first learning the tool, you want to go talk to someone and say, "Oh, what does this log mean?" Maybe I'm less familiar with that, but hopefully, that's more about understanding your system and using the expert to help you build the context on your system and not having to use your expert to help you use the tool, or learn the tool. Making it something that was a single button that started at something that was already really familiar, like a monitor was really important to us in building DDX. [0:30:31] SF: Can you talk a little bit about this concept around hypothesis-driven troubleshooting and why that's important? [0:30:37] JB: Yeah. It gets back to what those doctors are doing when they're doing differential diagnosis, right? They're looking at the symptoms and they say, "Okay, based on the symptoms that I see right now, I think it is Cushing's disease, right? Or in this case, I think it is a cloud region. The next thing they do is go and try to prove, or disprove that, because the easiest way to get to a fix is to say, "Hey, based on the data, I have X. Now that I think that, what other data can I collect to prove myself right, or prove myself wrong?" Usually, people walk into an incident and these heroes have four theses off the top of their head. They're like, it's probably that we just spun up a new region. It's probably that we just did a deploy. It's probably that we just onboarded a new tenant and we don't know what their traffic looks like, and it's doing something funny with our APIs. We see people coming in with those hypotheses. They needed a tool to be able to filter down to the relevant data for that hypothesis and put it to the test to see if it correlated or not. This is about more than DDX though, and I think this hypothesis-driven testing and troubleshooting is something that we want to continue to bring into the Chronosphere experience as a whole. Because I think it's just so much easier to learn. It feels natural. It feels like what people are already doing. So much easier to learn that than to learn to say, well, if you write this Prometheus query and then do this summary and then do a rate over this window, you'll be able to find out the answer, right? That sounds a lot harder than if you see the data and you think it's X, push this button to get the information about X to tell you whether or not that's true. It's about doing troubleshooting based on probabilities across high scale data, instead of trying to do troubleshooting by writing pinpoint accurate Prometheus queries to try to give you a specific answer. I think, also, speaking of giving you a specific answer, one more thing I'll say here is hypothesis-driven troubleshooting is about being honest with yourself about what the data is showing you. 
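As a toy illustration of putting one of those hypotheses to the test, the sketch below filters to the slice of data the hypothesis points at (say, a suspect cloud region) and compares error rates inside and outside that slice, so the numbers can rule it in or out. The field names and the test_hypothesis helper are assumptions, not a Chronosphere feature.

```python
def test_hypothesis(requests, tag, value):
    """Compare error rates inside and outside the slice the hypothesis points at."""
    inside = [r for r in requests if r["tags"].get(tag) == value]
    outside = [r for r in requests if r["tags"].get(tag) != value]

    def error_rate(pile):
        return sum(r["error"] for r in pile) / len(pile) if pile else 0.0

    return {
        f"{tag}={value}": error_rate(inside),
        f"{tag}!={value}": error_rate(outside),
    }


requests = (
    [{"error": True,  "tags": {"region": "ap-northeast"}}] * 27
    + [{"error": False, "tags": {"region": "ap-northeast"}}] * 3
    + [{"error": False, "tags": {"region": "us-east"}}] * 70
)
print(test_hypothesis(requests, "region", "ap-northeast"))
# {'region=ap-northeast': 0.9, 'region!=ap-northeast': 0.0} -> hypothesis worth pursuing;
# a roughly equal error rate on both sides would have ruled it out instead.
```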
I think, often, if we're trying to find an answer to the problem, it's really easy to go down the garden path and give in to confirmation bias and find information that supports the thing you're already thinking about. This whole idea of hypothesis testing is we'll give you data that hopefully lets you see really up front whether you're likely to be right, or whether you're likely to be wrong, because we don't want you to go down the garden path, tell the incident room it's definitely this, shut off all traffic to that cloud region, and then realize it was never that in the first place, and you missed something else. Now, you're still in that incident fire room, plus you look bad and have egg on your face. The philosophy here is help people do what feels natural, help people stay away from accidental confirmation bias, and hopefully, by building those kinds of interactions into the product, help them fix problems faster. [0:33:12] SF: Those patterns for those different types of hypotheses, are they common enough across organizations that you could encode them automatically into the product, so I can essentially say, okay, well, it's most likely one of these five things, and essentially, I can click a button, some magic happens and it tells me whether that is the case or not? [0:33:32] JB: Yeah. It depends. I wish I could give you a better answer than that. I would say, there are some typical classes that often come up over and over. Imbalances in traffic between regions. That's why I keep saying cloud region, right? Imbalances between tenants, so a specific tenant configuration. If you're someone who's B2B and you're working with businesses, that can be a common cause. Problems across different environments, or Kubernetes namespaces. Those typically come up over and over as easy ways to start identifying what the problem is and why it's happening. That said, we give our customers the ability to decide what they want out of the box, because each customer is a little bit different. Not every customer is a B2B SaaS company that wants to track things by tenant out of the box. Some people serve the general public, and doing this analysis out of the box based on customer ID is just going to be extremely noisy, because they've got 90,000 customers for their software. That's just not going to be useful. We do give them the ability to tailor the experience for their organization. We also talk to them about adding instrumentation. If they do want to add more custom tags, or custom labels that this tool can then work with to help give them that out-of-the-box analysis, we'll talk to them about that and have them add that and then use that in the software. Maybe some things are common. A lot of things are unique. We try to give people the ability to tailor the tool results based on what their organization sees in their incidents over and over again. [0:34:51] SF: I think there's a lot of interest from a lot of companies now that are looking at how do you use AI, especially newer techniques around generative AI, to automate a lot of things with troubleshooting, standard SRE tasks. First of all, what are your thoughts on the likelihood that we'd be able to automate a lot of this stuff away, and how far away do you think any of that actually is from ever happening? [0:35:18] JB: I'd love to say, tomorrow, right? That would be great. At the end of the day, my friends are engineers, they want to write code. They don't want to troubleshoot problems. [0:35:26] SF: No one loves on call. [0:35:28] JB: Yeah.
No one says, "Yes, I'm on the platform on call rotation this weekend." I'd love to say, they're around the corner. I think a couple of thoughts I have at the more general level, AI, LLM, machine learning, call it what you want, all of these things rely on good data to work from. It's really easy for these things to hallucinate, or to start presenting you results that, yeah, they look funny, but when you look at it as a human with your contextual knowledge about the problem, you're like, that's nothing. That's just noise. I think a big problem that we need to solve in order to make those kinds of tools effective in observability is solving the problem of the data they're working with and making sure that data is really high quality, so that we can start to trust their insights a little bit more. I think that trust is another piece of it, I think for these tools to work. They have to be able to tell you what they're doing, so that you can verify it. Like I said, every time I work with a customer and they say, "Oh, I have an anomaly," I say, "Cool, what do you do when you see that anomaly?" They're like, "Well, I go see if it's right. I don't really trust the system. I need to be able to understand what's in that black box." I don't want any black boxes, actually. Let me just put it that way. No black boxes when it comes to AI. You need to be able to tell me how you came to that conclusion. You need to be able to replicate it, essentially, so that I can watch what happened and trust it. We need to build up that trust over time, so that these don't become systems that are just training your developers to not look at them, if that makes sense, right? I think we need to solve those two problems for them to be really transformationally effective. I think there's potential. I think we can certainly progress down that path. It's a place where we as a company would like to invest. I think in some ways, we have an advantage, because we're starting with a really good data set for all of our customers because of that trimming that we talked about earlier. We're really concerned about being able to be transparent, being able to take out that noise, building a system that people will trust out of the gate, and building a system with that caution in mind. I certainly don't want to build something that goes all the way into automated rollbacks, and then no one can get a deploy out. I'm in a SEV, because no one can do a deploy, because the system keeps rolling it back. But I can't see in the box why it's deciding to roll things back. Now I'm just like, "Ah, turn off the AI." I don't want to get in that scenario. I think that's also a potential if we're not careful with how we develop these tools for observability. [0:37:42] SF: Yeah. I mean, I think that when it comes to leveraging AI for things like writing code, there's some advantages there that maybe don't exist as much when it comes to observability type of tasks. Because one, there's a massive amount of code that you can use to train this stuff on. Additionally, even though most people using your GitHub Copilots of the world, inherently don't necessarily trust the output. There's a lot of checks and balances, essentially, between copying a piece of code from whatever, to actually hitting production, because it's probably going to go through integrated tests, like CI/CD. There's a couple of compilation processes. Ideally, if there is some major obvious mistake, it would be caught in there. 
I think that's a little bit more challenging when you start to get into these largely human-driven processes of debugging and trying to figure out what's going on. It might actually be a fairly complex set of things that you need to adjust in order to solve some multi-tenant error, or new region deployment issue. [0:38:40] JB: Right. Human-driven processes with human-driven problems behind them, right? Humans are the hardest thing for an AI to figure out, because we're always changing. We are the most confusing creatures on the planet. Therefore, the problems we can introduce to a system are tremendous. Yeah, I think you're right. That's really hard to tackle with AI. Whereas something like writing code, you've got so much testing, like you said, built in to validate that. You've also probably got some knowledge about what looks good and what looks bad. I think there are other use cases too, where it makes sense in this situation, like explaining what you see, right? If you are looking at a dashboard and a PromQL query, and you're like, "I don't understand what this query is trying to do," that's a really good place to put an AI to help you with that human language translation, for something where, again, you have a ton of data available on what PromQL queries mean out on the Internet, right? Look at Stack Overflow. You probably have great data there. There are places for it. I think solving the troubleshooting problem is a really hard problem that'll take us some time to get to. [0:39:35] SF: Yeah. I guess, I think the take here would be there will probably be some assistive technologies built in to maybe help make people more efficient, but you're not going to be able to just have some AI magic black box that solves all your problems. [0:39:51] JB: Yeah. Maybe someday, but not now. [0:39:53] SF: Based on that, where do you see things in observability and troubleshooting tools evolving in the next couple of years? [0:39:59] JB: Yeah. I mean, certainly in observability, everyone says OpenTelemetry, open standards. I'm going to say it. I think we do. I hear it more and more from customers. No one wants vendor lock-in, and by vendor lock-in, I mean, no one wants proprietary formats that they don't control. It gets to that no black boxes thing. It gets to the, I can choose where I put it, what tools I use, how I combine and recombine this data. I do think an observability theme will continue to be OpenTelemetry. That standard has also matured tremendously. I think we're close to a tipping point where it becomes easier to adopt OpenTelemetry than to not adopt OpenTelemetry. I think the other thing is, I mentioned tool sprawl, right, and needing to learn a lot of tools. I hope that a trend in the coming years is fewer and fewer special purpose tools. This is my log tool. This is my trace tool. This is my event tool. And more platform-based tools where we do bring relevant insights together to give you a full picture. I hope for my engineer friends' sake that that is true. I just think we're going to get better insights when we bring the data together and can combine analysis from all sides. AI is, of course, not going away. We're going to see it evolve, right? I would be remiss if I didn't say that that's going to continue to be a trend in observability. I hope we see it progress. I'm really excited to see what it can do. I hope that we can do AI in observability with accuracy and with transparency, so that developers can really start to lean on that tool and trust that tool.
I guess, the other thing is I just see the acceleration of data accumulation, like that data growth. That's just going to keep getting faster. We're going to keep doing microservices. People are going to migrate over to containerized infrastructure. The data volume problem is not going away, and if anything, it's going to grow. [0:41:38] SF: Yeah. I think that first point that you made about moving to these open standards, like OpenTelemetry, I think that's a trend that we're seeing across the industry, even outside of observability. You can look at the decomposition of the warehouse, investment in open table formats, like Iceberg, and then even infrastructure as code, Terraform and then OpenTofu and things like that. People don't want to be vendor locked in, essentially. I also think your point about moving away from some of these point-solution type approaches to more of a platform where you're bringing a lot of this data together makes a ton of sense, because beyond the fact that you might only have a snapshot of what's really going on if you're using these more narrow point solutions, no one wants to have to go to seven different tools to try to figure out what's going on. [0:42:25] JB: You're right. You're so right. I don't want to have to do that. [0:42:29] SF: Absolutely. Julia, this has been great. Anything else you'd like to share? [0:42:33] JB: Sean, thank you so much. It's just been a pleasure and I hope we can talk again when AI really starts to transform the observability industry, and talk about what that's doing and how that's going to work in the future. [0:42:43] SF: Fantastic. Well, thanks so much for being here and cheers. [0:42:46] JB: Thank you. [END]