EPISODE 1586

[INTRODUCTION]

[0:00:00] ANNOUNCER: Observability software helps teams to actively monitor and debug their systems. And these tools are increasingly vital in DevOps. However, it's not uncommon for the volume of observability data to exceed the amount of actual business data. This creates two challenges: how to analyze the large stream of observability data, and how to keep down the compute and storage costs for that data. Chronosphere is a popular observability platform that works by identifying the data that's actually being used to power dashboards and metrics. It then shows the cost for each segment of data and allows users to decide if a metric is worth that cost. This way, technical teams can manage costs by dynamically adjusting which data is analyzed and stored. Martin Mao is the co-founder and CEO of Chronosphere, and he joins the podcast today to talk about the growing challenge of managing observability data and the design of Chronosphere.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[INTERVIEW]

[0:01:43] LA: Martin, welcome to Software Engineering Daily.

[0:01:44] MM: Thanks for having me, Lee. I'm looking forward to our conversation today.

[0:01:47] LA: Great, great. So, there are a lot of observability applications out there, and it's a very full field of companies and products. You focus on an aspect that is increasingly important to more and more people, and that is reducing the amount of money that they spend on observability and on monitoring. How do you do that?

[0:02:08] MM: That's a great question. It is one component that we do focus on, and we focus on this component not just because, as you mentioned, these tools are getting more expensive. There's more data being produced. But that trend actually happens at the same time as more companies adopt cloud-native architecture, as more companies are moving to containers on a platform like Kubernetes, and moving to microservices. It's really that shift that is causing the huge generation of data there. Because you can imagine now, we need to look at and monitor every single container and every single microservice, and there are just more of those than there ever were before in the previous architectures. Our product is actually tailored to these environments and these use cases, and these are the newer environments and architectures that the old tools just weren't really great at. Because we targeted these architectures and environments, that's what led us to realizing that the cost exploded, and that was one of the first problems that we had to go and tackle. When we looked at this problem, the first thing we decided to do was try to make the unit economics cheaper. We quickly realized that you can do that to a certain extent, but there are diminishing returns. You can only compress the data so much. And very quickly, we realized that just going down that path and trying to make the backend more efficient and unit-economically cheaper was actually not going to work.
So, we applied a different approach to the problem, and the approach we have is, it's perhaps often a gut feel of everybody that we produce a lot of data, but how much of this is really useful? If you ask most people in the world if they think all of their observability data is useful, most people will say, "No, I don't think it is useful." But the trick is, how do you know which sections of data are useful, and which sections of data are not useful? I think that's a really hard problem to go and solve. Because if you think about, as engineers, how you go about observing something, you're going to instrument first. So, you go and think about all the possible signals that you'll want to have, eventually, when you want to debug, and you have to go and instrument that in the application ahead of time, before you create your dashboards, before you create your alerts, because you can't create those if the data does not exist, right? So, it's very hard for engineers to really understand exactly how the data will be useful until they face an incident, and they learn at that point in time. However, you can imagine, once you do that, you never go back and re-instrument or instrument less. The pattern is you always add more and more data there. So, the way we tackle this problem at Chronosphere is that we give engineers and end users the ability to understand how the data is used and assign value to it. The way we do that is to analyze whether the data powers a certain number of alerts or dashboards, versus if it doesn't. So, you can imagine, if we can detect that there's a huge section of data that is never used in any dashboard, in any alert, by any user, perhaps the value of that data is diminished. Or even how that data is being used. If you are only ever looking at the sum, or the average of the data, and never all the underlying raw data, you can imagine there are optimizations there as well. So, the trick for us is to both cost-account every piece of data so we can measure what the cost of all of it is, and then show our end users the value of that data and allow them to make some decisions themselves on how much they want to pay for certain sections of data, and that's how we go solve the cost problem here.

[0:05:30] LA: So, it's reduce the cost of data as much as possible, and then show usage and therefore importance.

[0:05:37] MM: Correct. In showing the usage and the importance of the data, then allowing the end users to make some decisions on, there's no magic answer. It's about the end users making some decisions on: yes, I don't believe I need this section of data that's costing me, for example, $10,000 a year. This single metric may be costing me $10,000 a year because of all the high cardinality that's on it, and it's not being used anywhere. Is that a decision I want to make, where I no longer want that piece of data, right? So, it's about providing all the tooling and visibility around the usage and the value of the data, and then also providing the tools to go and remedy that without having somebody go back and re-instrument or uninstrument and redeploy their applications. So, the way we do it is that you can, in the platform, go and optimize the data dynamically, as opposed to having to re-instrument it at all.

[0:06:31] LA: So, basically, it's a remote configuration of the data collection agent that allows you to decide what type of data you're collecting, how much of it, granularity, that sort of thing?

[0:06:41] MM: Almost.
Well, we actually don't do it on the agent side. We actually do it on the server side. The reason for that is an agent can only have a localized view of the data, whereas on the server side, we have a global view of the data, right? So, in some instances, where you can imagine we need to go and calculate the sum for a section of users, it's much easier to do that in a centralized location. But yes, we are dynamically configuring and manipulating the stream of data on the server side, and that's how we're able to achieve a lot of these optimizations.

[0:07:11] LA: So, the philosophy is the agent generates as much data as it can, sends it to the service, and the service makes a decision about which is filtered in, which is filtered out, which is stored, which is aggregated, all that sort of stuff. That's done at the service.

[0:07:24] MM: And the trick here is, our pricing model is based on what data comes out of that and is stored. So, we're incentivizing end users to instrument as much as they want, because that's what they want to do, right? But the pricing model doesn't disincentivize them. It doesn't cost them anything to generate more data. It only costs them money if they use the valuable data and choose to store the data that they believe is valuable. So, the incentives are generally aligned in that type of pricing model.

[0:07:50] LA: Got it. Got it. Yes, that makes a lot of sense. Certainly, I've spent a lot of time talking to customers about observability. One of the things I used to hear all the time from people is that the fear of missing out is what drives a lot of their spend in observability. Do I need this extra data point? Well, probably not. But I don't know. And there's the fear of that being the one piece of data that would have solved an availability outage six months from now that cost $10 million. The fear of that alone makes people say, "I'll just add that one more piece of data, that one more piece of data, that one more piece of data." And it's normally a very easy risk calculation of the cost of collecting a piece of data, versus the risk of that being the one piece of data that's going to catch something. Even if it's a very rare likelihood, the cost differential between a down application versus collecting a single point of data is huge. Any risk whatsoever typically goes to the side of, we'll just collect the data, and we'll go from there. But then, what I hear companies do from that point on, and let me just finish this, I've got a question for you from that. What companies do from that point on is they start looking at the aggregate cost of all the data and realize how expensive it is and say, "We can't afford that. We'll start randomly, and I do mean randomly, filtering out the data. We'll collect data from 5 of our 30 hosts, instead of all 30 hosts. Which five? It doesn't matter. Just pick five." And they're using statistical analysis, so to speak, to make the decisions, which doesn't always work. I mean, if statistical analysis worked all the time with observability data, we would only have average data; we would never have detailed metrics, because it wouldn't matter. But you know that, I know that. Hopefully our listeners know that. It doesn't work that way. So, when you lose the statistics battle, it costs you money. So, it's important that you make the right smart decisions about what data to collect and what data not to collect.
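To make the server-side shaping that Martin describes a bit more concrete, here is a minimal sketch of the kind of streaming aggregation rule he alludes to: raw, per-container series arrive from the agents, and a centralized rule rolls them up (for example, summing away the container label) before anything is written to storage. The rule format, function names, and data shapes are hypothetical illustrations, not Chronosphere's actual API.

```python
from collections import defaultdict

# Hypothetical aggregation rule: keep only these labels, sum away the rest
# (e.g., drop the per-container dimension and store a per-service rollup).
RULE = {
    "metric": "http_requests_total",
    "keep_labels": ("service", "region"),
    "aggregation": "sum",
}

def apply_rule(samples, rule):
    """Roll up raw samples server-side before storage.

    `samples` is an iterable of (metric_name, labels_dict, value) tuples,
    as they might arrive from many agents. Only the labels listed in the
    rule are kept; everything else (container id, pod, host) is summed away.
    """
    rollup = defaultdict(float)
    for name, labels, value in samples:
        if name != rule["metric"]:
            continue  # other metrics would pass through untouched in a real system
        key = tuple(labels.get(label, "") for label in rule["keep_labels"])
        rollup[key] += value
    # What actually gets stored (and billed) is the much smaller rollup.
    return {(rule["metric"],) + key: total for key, total in rollup.items()}

# Example: three containers' worth of raw data collapses into one stored series.
raw = [
    ("http_requests_total", {"service": "checkout", "region": "us-east", "container": "c1"}, 10),
    ("http_requests_total", {"service": "checkout", "region": "us-east", "container": "c2"}, 7),
    ("http_requests_total", {"service": "checkout", "region": "us-east", "container": "c3"}, 5),
]
print(apply_rule(raw, RULE))
# {('http_requests_total', 'checkout', 'us-east'): 22.0}
```

Because a rule like this lives on the server rather than in the agent, it can be relaxed back to raw, per-container data during an incident without re-instrumenting or redeploying the application, which is the dynamic switching behavior discussed next.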
And it sounds like what you're doing is you're leaving it up to the customer, which is, of course, the right answer, to make those decisions. What I wonder is, who is the right customer? Because the motivation to reduce costs tends to be an upper management issue, while whether this data point is going to be important to us, yes or no, and how much to store, is usually an engineering, or QA, or an ops person's decision. How do you make that correlation between those two?

[0:10:30] MM: Yeah, that's a fantastic question there. The first thing I might address is that fight, because that fight is there, that fear is there of, maybe I need this piece of data. What we actually find in practicality in production, especially in the way that we're doing it where the raw data continues to be emitted, is that because it continues to be emitted, and you can dynamically change what gets through or not through to the system, what ends up happening is, if in a major incident you need this piece of data, it's still being emitted, and you can start storing it from that point in time in a dynamic fashion without having to redeploy any application, right? So, in all practicality, where you control it on the server side, and you don't need to redeploy the application, if there is a big incident and you need this data again, you can dynamically go in and start storing that data as you want. Now, the edge case there is, there was a blip for one second, and we missed it, and now I won't be able to look back at it. That is very true. But if it blipped for one second, generally the system is not down at that point in time; it's automatically recovered. You don't need perhaps that piece of data anymore, because it's not happening at that point in time. And perhaps the application being down for a split second or for a minute didn't actually cause that much impact. So, we talk to companies about the fear all the time. However, what we find in practicality is, again, if you can dynamically switch on and off what data you store, generally, in all practical nature, that problem is not there. And if the incident is really bad, and it goes on for multiple minutes, you can imagine you can start viewing that data again. It becomes unfiltered again. So, that's practically how we approach that problem. To the dynamic of, and to your point, incentives are maybe misaligned sometimes, right? The cost incentive is from upper management, and perhaps from the finance department. But the goal and the KPIs for the engineering team are to keep the application up and running. They're often at odds with each other. So, for that, what we found is, you really have to make this problem self-service, and I find it the same as your typical cloud infrastructure management costs. I think every engineer these days understands that I can't just spin up a bunch of instances and VMs and leave them there, because it's costing the company money, right? Even though it may not be directly impacting my KPI in a negative way, we do find that, especially in today's economic climate, engineers do have a financial responsibility to the company. The main problem is that without the visibility, it's really hard to do themselves.
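As a rough illustration of the visibility Martin is describing, the sketch below cross-references metric names against the queries used in dashboards and alert rules, and flags series that nothing references, along with a back-of-the-envelope cost. The inventory, the queries, and the flat price per time series are all invented for the example; the point is only the shape of the analysis.

```python
# Hypothetical inventory: metric name -> number of stored time series.
metrics = {
    "http_requests_total": 1_200,
    "checkout_latency_seconds": 800,
    "debug_cache_probe_total": 45_000,   # high cardinality, rarely looked at
}

# Queries pulled from dashboard panels and alert rules (hypothetical examples).
dashboard_queries = ["sum(rate(http_requests_total[5m]))"]
alert_queries = ["histogram_quantile(0.99, checkout_latency_seconds)"]

COST_PER_SERIES_PER_YEAR = 0.05  # assumed flat rate, purely for illustration

def usage_report(metrics, queries, cost_per_series):
    """Flag metrics that no dashboard or alert references, with estimated cost."""
    report = []
    for name, series_count in metrics.items():
        referenced = any(name in q for q in queries)
        report.append({
            "metric": name,
            "series": series_count,
            "referenced": referenced,
            "est_cost_per_year": round(series_count * cost_per_series, 2),
        })
    # Surface the unused, expensive metrics first: these are the candidates
    # to aggregate away or drop via a server-side rule.
    return sorted(report, key=lambda r: (r["referenced"], -r["est_cost_per_year"]))

for row in usage_report(metrics, dashboard_queries + alert_queries, COST_PER_SERIES_PER_YEAR):
    print(row)
# debug_cache_probe_total shows up first: 45,000 series, referenced nowhere.
```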
The way we go solve that problem is we allow a centralized management team to define, essentially, quotas or budgets for individual teams, or individual engineers: "Hey, this is how many or how much observability resources you can have." And then we allow the end users to view and manage that themselves, so they know when they're going over the allocated amounts, and they also have the tooling to go fix that, right? Because you don't just want to be told you're spending too much. You want to know, "Okay, well, what am I spending it on? Where is the wastage? How do I go and fix this?" That goes back to some of the features that I mentioned earlier. What we found is that as you sort of democratize that, and self-serve that to the end users themselves, engineers generally want to do the right thing. They have financial responsibility to their companies, and they generally manage these things themselves.

[0:13:53] LA: Right. It's an example of how having the data helps, right? So, engineers want to save money when they know how important it is. They don't when they don't understand. So, like you say, the idea of the budget says, here's how much we can spend on observability in your area, and you spend it however you want to, and that gives them the opportunity to make smarter decisions.

[0:14:13] MM: It's that. You can imagine the same thing of, if you get told every day, "Hey, by the way, there are these 10 VMs that sit here idle, do you still need these machines?" In the same way, it's not just the budget. It's, "Hey, you have this metric that doesn't appear in a dashboard, doesn't appear in an alert. No one's ever viewed this piece of data. It's costing, you know, $10,000 a day." Is that something that you want to go and remedy? Generally, if you provide sufficient information, a decision is pretty easy to make there.

[0:14:44] LA: That makes sense. I talk a lot about MTTD and MTTR, mean time to detect and mean time to repair. But those are really two metrics that go to the duality of observability that I'd like to think about. Observability does two things. One is, when a problem is going on, it gives you information about what is going on within the system to make some decisions about what to change. But the other thing is, when you're not looking at your system, it tells you when you should be looking at the system, because there's a problem that's about to start or is just starting right now. Ultimately, those two areas feed into both mean time to detect and mean time to repair. And both of those are important. I really love the thought process of, if we don't need this information now, we can easily turn it on when we need it and deal with it. And that works great for the, there's a problem, I want to fix the problem, give me the data to help me fix that problem. But there's still the other half of the equation, which is, I need something that knows how the system works so it can alert me when there is a problem, the whole alerting and notification part of that. That's where, I think, a lot of the mindset comes from of, "I just need this one more piece of data, I need more data, I need more data, and I better not turn anything off, because this one metric might be the one that sends me the alert when that sort of issue crops up." I think one of the fears I have is, just because the metric isn't in a dashboard, or isn't in a notification stream or whatever, it doesn't mean it's not important.
It just means nobody's noticed that it might be important. So, I just put out a lot of information there. I'd love to hear your thoughts on that. No specific question other than, what do you do to resolve both ends of that observability problem? And how do you deal with the issue that data relevancy is not necessarily the same thing as data visibility?

[0:16:43] MM: Yes. I mean, it's a great question. I'd say probably two components of that are pretty important. The first part is that our systems are getting more complex these days. So, when we run on a lot of containers, and they're all ephemeral, and there are a lot of dependencies and microservices, one trap I see is that often, everyone is trying to monitor everything, every small thing, that might go wrong. The whole point of this distributed architecture we have these days is, generally, the system is resilient to minor changes, right? If one instance of an application fails, or one container fails, the system is generally resilient to it. So actually, what we find is, if you're tracking your top-level SLAs, and you're tracking, let's say, that a container failed, the question often comes to, so what? If I could detect that ahead of time, is it better? Potentially. But if I design my system properly, if a single instance fails, technically, the whole system should still be there. I still need to go remediate it later. But I don't know if I need to know every small signal. Generally, there's almost too much noise, and you really want to know when something goes wrong: is it impacting the customer? Is it really having an impact on the business or not? I think that's one way to think about it. Okay, maybe I do need the alerts and the dashboards on the things that matter, which are the things that are impacting my business. On everything else, it's not that I don't necessarily care, but I should be designing for ways in which single pieces of things could potentially fail. The other thing I'd say is, it is true that perhaps you don't have the perfectly crafted dashboard yet, and I would say that generally, through the experience of debugging an incident, new dashboards and new alerts get created, new playbooks and runbooks get created, right? But that's part of the process here, even if you weren't able to think about it ahead of time, because it's impossible for an engineer to think about every failure scenario ahead of time. You're going to run into some of them for the very first time. The important thing is, when you do run into them, can you then learn from that experience, and in the future, not have the same thing? So, in the future, know that this new incident, and this new type of failure domain, can be caught with this one particular metric here. And I now know that that is important, and perhaps I now want to start paying for that. Because on the opposite side of that, and I do agree with you, Lee, I'd say, if things were free, or very, very cheap, I'd say yes, store it. But when it comes to practicality, it's actually impossible to store all the data. So, I think this may be the most pragmatic way to try to sit between the lines of too much data and having it be untenable from a cost perspective. Because as you mentioned earlier, once that happens, then the other techniques, like only observing one out of every five hosts, are, I think, a much worse way to go than this way of evolving the way you value and use the data over time, and changing the shape of the data, for sure.
But not storing and looking at everything, or trying to look at everything, because again, pragmatically, I think that is pretty healthy to do.

[0:19:44] LA: I kind of think of it this way: apps mature. And when apps are immature, when they're brand new, they have problems, and as they mature, they get, hopefully, either more and more stable or more and more feature-rich, and hopefully both, as time goes on. But that also means that part of the maturation process of the application is the maturation process of the dashboards and the observability data that's needed, and the systemic knowledge about what happens and what types of things go wrong. So, that's runbooks and playbooks, things like that, that help you know better how to handle all these things. You're right. As the system matures, your overall system matures, you'll have more data and can start making smarter decisions about what data to turn on and off. But at any given point in time, with the knowledge and information you have, you can make point-in-time decisions about what to monitor and what not to. And the worst-case scenario is you make a wrong decision, and you correct it later on, and that's a level of maturity that goes along with that. But then part of it becomes, how do you store that collective knowledge? And is this a problem that you and your company help solve, too? Or do you see that as a separate issue, the knowledge base issue, when it comes to observability, specifically?

[0:21:03] MM: It is definitely something that we are trying to help companies solve, both on the – to the points we've covered already – the usage of particular data. So, you can imagine the usage of data is not just for cost purposes. We start to gain an understanding of which signals are the most critical, because you can imagine, on the opposite end, the ones that appear in the most alerts and the most dashboards correlate to the signals that get leveraged the most, and this is perhaps the more valuable data there. So, on one end, there is help from that perspective. On the other side, what we're finding, and this is a different trend outside of the cost that we see in the industry, is that with all of this raw data, it becomes almost overwhelming for end users to sift through it, right? Because there's so much of it. There are so many dimensions on the particular data. What we often find is that engineers want to ask particular questions of a system, like, "Hey, what are my downstream dependencies? And what are their latencies when my latency spiked?" That's a very human-understandable question. However, you can imagine, when you try to go to an observability system and answer that, you may have to start to dig in quite a lot manually and start to do things like, "Okay, well, for my dependencies, do I need to go to my APM tooling? Or do I need to go to distributed tracing? And do I need to understand how to go get that view of the data that I need?" Then, once I know what my dependencies are, do I then need to go and understand in my human mind, okay, I talk to one service over HTTP, and therefore that request is being proxied through NGINX, and therefore I need to go look at my NGINX metrics in order to go and find out what the latency spikes are like? All of that seems like extra tax to get around the fact that it's all raw data, and you need to know a lot about the raw data to navigate it.
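To make the "what are my downstream dependencies, and what were their latencies when mine spiked?" question concrete, here is a small sketch of how such an answer could be assembled from trace data: derive the dependency set from parent-child span relationships, then summarize each dependency's latency over the window of interest. The span tuples and service names are invented; a real system would query a tracing backend rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical spans: (trace_id, span_id, parent_id, service, duration_ms).
spans = [
    ("t1", "a", None, "checkout", 950),
    ("t1", "b", "a", "payments", 120),
    ("t1", "c", "a", "inventory", 780),   # the slow downstream call
    ("t2", "d", None, "checkout", 310),
    ("t2", "e", "d", "payments", 95),
    ("t2", "f", "d", "inventory", 180),
]

def downstream_latencies(spans, service):
    """For one service, find its downstream dependencies from parent/child
    span links and summarize each dependency's latency distribution."""
    by_id = {(t, s): (svc, dur) for t, s, _, svc, dur in spans}
    durations = defaultdict(list)
    for trace_id, _, parent_id, svc, dur in spans:
        if parent_id is None:
            continue  # root spans have no upstream caller
        parent_svc, _ = by_id[(trace_id, parent_id)]
        if parent_svc == service and svc != service:
            durations[svc].append(dur)
    return {
        dep: {"count": len(ds), "max_ms": max(ds), "p50_ms": sorted(ds)[len(ds) // 2]}
        for dep, ds in durations.items()
    }

print(downstream_latencies(spans, "checkout"))
# inventory stands out: {'count': 2, 'max_ms': 780, ...}
```

The point is that the engineer never has to know the request was proxied through NGINX; the dependency information is already encoded in the trace data.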
So, we're also trying to help end users navigate that in a much cleaner way, because all of the data, and the meaning and the knowledge that you speak of, is there in the data. It's just generally presented in its raw form, and we don't think that that's effective at all. We think that there's a much better way to present that data, to answer the questions that I mentioned earlier, in a much more straightforward way, because that's really what the engineers care about. Most folks who have to operate these days are not historical operators. It's the developer that's writing the application. It's not an SRE. It's not a person who's been on an operations team for a long time. So, they're just not used to using these tools all the time. And you can also imagine, they have to understand and learn a lot of things these days, about security, about observability, about CI/CD. There's just too much overload, I think, on the developer to become, let's say, an observability expert, or to know how to navigate the raw observability data. So, we're also trying to help from that perspective as well.

[0:23:45] LA: So, what you're really talking about here is observability centralization, right? It's combining all observability data into a central location and having it all available from that one location, rather than from the individual sources. So, you don't need to go to NGINX to get NGINX data. You go to your primary dashboard, which might have information on NGINX, as well as on this, and this, and this, and this.

[0:24:11] MM: Part of it is definitely putting it in one location. And the second part is not even having to think about the fact that it was NGINX, right? Or how I compose that dashboard. It's about perhaps automatically extracting the meaning from the underlying raw data, which is possible because all the raw data is there. It's just generally presented in its raw form. You can imagine, generally, you have to go create the dashboard yourself. Then, at that point, you have to know that it's NGINX, and therefore you have to go, okay, this is how I go query my NGINX metrics and whatnot. But yes.

[0:24:39] LA: It's an extension of the tracing problem, right? Tracing metrics, where you collect all of this data from all these disjoint services, is meaningless until you create the path that a request takes, which is selecting the right metrics in the right places and putting them together in a different organizational structure. That's really what tracing is, but you're talking about more, tracing plus, plus, where you include everything, including NGINX log files, along with APM data, along with infrastructure information that's relevant to that request, et cetera. It's tying all of the data that's relevant to a particular type of thing that's going on, generalizing it that way, and making it all available as independent data, versus tied to a specific thing.

[0:25:23] MM: Exactly. I think, with tracing, it's easier because the data comes from one source. The whole world has consolidated on OpenTelemetry. So, you know there's only one pattern and one source of data. Whereas you can imagine, on metrics, you have all types of different sources, right? Then again, you have to put all the data into one physical location, for sure. But the extraction of the meaning is, I would say, much harder in the metrics world than it is in the tracing world.
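Martin's "one pattern and one source of data" point about tracing is easy to see in practice. The snippet below is a minimal, generic example (not from the episode) using the OpenTelemetry Python SDK: whatever backend ultimately receives the spans, the instrumentation API, span model, and export pipeline stay the same. The service and attribute names are made up, and the console exporter stands in for a real backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One SDK, one span model, one export pipeline, regardless of vendor backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("checkout") as parent:
    parent.set_attribute("order.id", "12345")
    with tracer.start_as_current_span("charge-card"):
        # The parent/child relationship is captured automatically, which is
        # exactly what makes dependency graphs derivable from trace data.
        pass
```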
The tracing world sort of had the benefit of much better foresight and design: one protocol, one transport layer, one client for a lot of this data.

[0:26:00] LA: You're right. It's a much simpler problem. With metrics, you're trying to figure out, how does CPU on this core compare to network packet rates for something, some transaction somewhere? And it's like, correlating those two, what is the meaning of correlating those two? And should they be correlated? That's really a much harder problem there than it is with tracing, where it's like everything is tagged, and has a request, and everything is magic, and it all works together perfectly all the time without any problems, ever.

[0:26:31] MM: Yes. I don't know about that. But it's definitely easier and a more tractable problem, for sure.

[0:26:35] LA: Absolutely. Yes, I was obviously being facetious there. But absolutely. So now, we've touched on two types of data. Obviously, we focused heavily on metrics. We talked a little bit about tracing. But there's actually a third piece of observability data that is very, very important, and that is logs. I love logs. I love talking about logs, not because I like looking at logs. But I personally find logs to be one of the most ignored pieces of observability data, and hence, the most wasted. The vast majority of generated logs end up being completely unviewed, unexamined, and unutilized in any way, really. Yet, when you think about it, a log entry can be a real savior, right? A single log entry can solve a problem for you. What are your experiences with log data, and how does log data fit into this overall view that you're talking about here?

[0:27:32] MM: I would agree with you. It's definitely often forgotten, but an important source of data, for sure. I think that is the case partially because, I imagine, and I don't know this for sure, but I imagine logs came first, and then metrics, and then traces. And because it's the oldest, it probably is the most complex. There is no industry-standard protocol, no industry-standard client. It is a messy ecosystem, and I think because of that, it is a much harder problem to go solve. Hence, a lot of projects, like OpenTelemetry, for example, probably put that last for those reasons. It's the hardest piece to go solve there. Then, to your point, I think it is valuable because it can provide details that metrics and distributed traces cannot provide practically. So, I do think it is an important piece. I think the key for logs is, and you mentioned this earlier, there's so much of it, it's often, how do you find the relevant ones, right? For us, the key to logging is, generally, people don't start with logs first. You get an alert; that alert is not off of a log, normally, it's off of a metric. And you start looking into it, you start looking at your dependencies, then perhaps you start looking at your distributed traces. Log data is often the last place to look, because you want the fine-grained details there. The other data sources are really helping you narrow down the scope into which area and which specific section of logs. And I think that's the key there: can you use the other types to point you to the exact log messages that you're looking for? And if you can, that gives you the detail in time. If you can't, you can imagine, in one case, you perhaps only know which server it ran on.
Then you've got to pull up the logs for the whole server and all the applications running on it, and try to sift through and figure out, what are the relevant logs for my debug use case? Whereas perhaps, in a different scenario, you know exactly the timestamp, you know exactly the container that it came from, and you know exactly the error message you're looking for. Then you can pinpoint the exact log messages you're looking for there. I would say that that, in my mind, is key to bringing logging into the flow and making it a useful data type.

[0:29:37] LA: That makes a lot of sense. So, is it correct to say that logs are on your roadmap? They're not an integral part of the Chronosphere product right now. Is that correct?

[0:29:48] MM: That is correct. However, I would say that what we have found is, Chronosphere is a SaaS platform. So, with that, companies do have to egress the data to us. Generally, with log data, a lot of folks want to keep it within their networks, within their environments, because egress is expensive. So, for us, the way we approach this now is we integrate with whichever logging product you have today, whether it's Splunk, or your cloud logging product, and whatnot. And we have the ability to go from a metric or a distributed trace to the exact log messages you're looking for in that particular system, right?

[0:30:21] LA: So, you don't have to ingest the logs.

[0:30:23] MM: Exactly.

[0:30:24] LA: You can connect it to the log. Okay, that makes sense.

[0:30:26] MM: Exactly. So, our trick is to connect it there, so you can view the actual log messages in Splunk, or in whatever logging tool that you do have. But the point is, it takes you to the log messages you care about, in context, from the rest of the flow, without having to egress everything. For the companies that we work with, generally, the egress costs alone to get the logs out make it impractical for a lot of companies to use a SaaS solution for logs, unfortunately.

[0:30:51] LA: You're right about that. That is one of the big issues with observability data in general and SaaS observability platforms, which we're both very familiar with, specifically. But it is a bigger problem with logs than it is with other types of data.

[0:31:07] MM: Yes. What we found on metrics and traces is that because the data is highly structured, and a lot of the fields and the text are repeated again and again, because you can imagine the label values are repeated again and again, you can actually get really high compression of that data on the way out. That's actually not an issue. With logs, a huge part of the data is unstructured, or there's this field where there's just, like, a stack trace or something like that. The compression of that over the wire cannot be at the same rate. So, it's actually much harder and much more expensive to egress logs outside of an environment.

[0:31:36] LA: Makes a lot of sense. So, let's talk about AI. I don't see you talk a lot about it on your website. But I'm interested to know how AI fits into this. Because AI is becoming an important part of observability. One of the things that in general is true with AI, and I'll be very, very, very high level, 150,000-foot view here, is that AI does better with a larger learning set. The more data you have, the more relevant data you have, the better AI can be at giving you analysis of that data. That works very well with the observability pattern of collecting more and more and more and more and more data.
You get more data, your AI gets much better, everyone is happier. How does it work into your model, where you're actually trying to reduce the amount of irrelevant data? Irrelevant data can be very helpful to an AI. It can also be ignored by the AI, but it can be helpful to the AI. How do you deal with those issues?

[0:32:33] MM: Yeah, when we look at AI, there are probably two parts of the domain that we look at. The area that's more recent, that everyone is going crazy over right now, is the large language models and generative AI. I think that does play a role. But I believe that role is helping the interface to the data be more conversational and English-based, as opposed to PromQL-based, for example, there. What we find when you go down that path, and why perhaps you don't see much about it on our website, is, it seems a little bit of a novelty right now. Because when you apply a publicly trained model to the problem, it can be helpful in, let's say, describing the functions of an open-source query language like PromQL. What it can't do, though, is actually understand your unique observability data. It doesn't understand what a service is in your data, what an endpoint is in your data. Because of that, it can't answer questions like, "Hey, what is the latency of this endpoint?" It doesn't know how to go do that. So, what we have found is, it's a bit of a novelty right now. However, we do think that eventually there'll be a good use case there. The thing is, you need to extrapolate the knowledge from the underlying raw data, and that goes back to what we talked about earlier: that process of automatically extrapolating as much meaning as you can from the raw data, and using that metadata and that knowledge graph as the training set that's unique per company. Because each company, you can imagine, has – every company's applications are different. Every company's architecture is different. Having that be the training set per company model then enables something to be unique there. That's our belief, and that's the path that we are on, on that side of things, on some of the more modern hype around AI. The other side, which we've been talking about, I think, as an industry for a very long time, is root cause analysis. Can you just throw all of your raw data into machine learning and have it tell you what the root cause is? And what we found there, not just from our time here at Chronosphere, but at Uber, where we worked on this problem for many years, is that correlation is pretty easy to detect in time series data. You can imagine you can look at the shape and correlate a bunch of them. Causation is almost impossible, because the raw data doesn't give you enough information to really say that that's the causation of things. And because of that, what we found was the –

[0:34:55] LA: Well, that's the problem with humans looking at data too. But it's absolutely –

[0:34:57] MM: That's the problem with humans looking at data. That's why, if you think about it, when we, as humans, look at two graphs and correlate them, we can't tell what the causation is by looking at the two graphs. There's some other knowledge in our minds that knows, "Okay, well, this has a dependency on this." Or, "The CPU has this weird dependency on the network or saturation." There's something else. There are other pieces of information that don't exist in the observability data to really tell you causation. You can imagine, there are so many sources of that data.
It's almost impossible to feed all those sources into a particular model here. But when you don't do that, and you just try to train on the raw underlying observability data, what we found was, I actually thought the results weren't bad. The signal-to-noise ratio of, let's say, automatically detecting and alerting you when something is wrong was in the high 60s, about 66%, 67%. I actually thought that was pretty good for just a machine learning model looking at raw data, with a two-thirds hit rate on detecting issues.

[0:35:56] LA: It's a lot higher than dumb filters and triggers, which have been the historical norm, up until recently.

[0:36:04] MM: Right. And I actually thought that that was pretty good, given that this is the state of the art with these systems. But the behavior we found when we tried to roll that out was very interesting. What we found was, engineers would not accept this. Because you can imagine, when they get woken up at 2am in the morning, if they configured the alerting thresholds themselves, they have no one to blame but themselves or their own team, and hence, there is a lot more leniency there, and there's a lot more patience of, "Okay, I'll just tweak this and make it better next time." When the machine wakes you up at two in the morning and it only has a 67% signal-to-noise ratio and hit rate, what ends up happening is the human engineers reject it, and they say the machines are bad, and I don't want to turn it on. So, the problem is, actually, I think until you get to a really high signal-to-noise ratio, and I believe it's probably going to have to be above 90%, 95%, I don't think these systems will be accepted. The problem there is that the gap between 60 and 90 is a really long tail. I just haven't seen much progress towards that. So, our approach here at Chronosphere, from that learning, is not necessarily to replace the engineer or do it automatically, but really, okay, what answers and information can we bring to the engineer to make that decision, and bring to them very quickly? So, things like correlation, for sure, but not telling them this is the causation, because we just don't know if that's the causation or not. But yes.

[0:37:25] LA: Your focus is putting decisions into the hands of the engineers, versus telling them the answer.

[0:37:30] MM: Bringing very interesting information to the engineers and allowing them to make the decision there, with probably additional knowledge in their minds, as opposed to just the raw underlying data. But yes.

[0:37:42] LA: Well, this kind of leads into the next question. So, this is at least a partial answer to the question. But I think there's more to it than that. Where do you see the biggest challenges in observability in the years to come? What are the biggest struggles we're going to run into in the observability space?

[0:37:59] MM: Yes, it's a great question. I think that probably two to three current challenges we're facing now are going to be around for a while. The first one, we talked about at the beginning of the conversation today. The trend is the cost of these things getting to a point where it's not tenable anymore. Often, observability bills are becoming higher than our infrastructure bills. You can imagine that's just not a trend that can continue. That's a trend that has to reverse for the industry. I think that's a trend that's going to take a very long time to go and fix. So, that's probably the first one, I think. It's going to be a challenge.
It's going to plague the industry for many years to come, and it already is now. Especially in the current macroeconomic climate, I think, it's extra painful. That's probably the first one. The second one, which I think we also talked about a little bit, is, how good is the experience and effectiveness of the tool for the developers? Because architecture is becoming more complex, and we see that the effectiveness of these tools for engineers and developers is getting worse. So, can we reverse that trend and really reduce your MTTR and MTTD there? We talked about some of the cost impact on those two things with the data. But just the effectiveness of the tools is getting worse. We're seeing stats like, year on year, there are 19% more critical incidents every year, as opposed to less, and that's a really worrying trend as well. And I think that's one that the industry has to fix, just as the architecture changes. I think, if we were all still deploying monoliths on VMs, the tools and the APM tools that we had, and that we developed for that world, worked just fine. However, as we transition and the problem changes, and now you have to solve the problem on containers, with microservices and a much more distributed architecture, that problem needs to be solved again. Then, the last one, which I think the industry is doing a pretty good job of, is adopting open-source standards. So now, there are a lot more open-source tools out there, for sure. But I think the best thing for companies out there is the creation of OpenTelemetry, because it does standardize the protocols, at least, of data production in this particular space and industry. I think that's great. It's been a good start. The main challenge, I think, for that path is logs. Because, as we just talked about, it's a much harder problem to solve than for traces, where they created the OpenTelemetry protocol, or for metrics, where the industry was already consolidating around one standard, which is Prometheus. For logs, that just isn't there, and there are, for a very long time, going to be many different protocols, many different client versions, and I think that's going to be a really hard problem for the open-source community to go and solve. But I think it's one that the world is trending towards solving.

[0:40:41] LA: Makes sense. So, let's talk about how you got here. Now, you have a long and interesting journey. I mean, you mentioned Uber. We talked a little bit about working at Amazon. But why don't you – you're an Aussie by birth, and by training, right? I mean, you went to school in Australia, and then you started at Microsoft. You were working on one of their first SaaS projects, weren't you?

[0:41:05] MM: You're correct. I started working at Microsoft on Office 365, which was the online version of Office. So, taking Microsoft Office 2010, as it was back in the day, and trying to move that to the cloud. While it was positioned as a cloud product, I would say the software development mindset was very much of a box product still. So, we did it. It happened, but not, I think, in the right way back then. I think they've fixed those practices now.

[0:41:31] LA: Makes sense. And then you left Microsoft, and now you started having some of the same thoughts and challenges that I remember running into when I worked at HP, and that is, how do you get release cycles down under the two-year mark, under the 18-month mark? I know, this is the time where we're in QA now. This is not the time for such and such. And if you want to make that release, it doesn't occur.
And this release is going to be in 2023, or whatever it ends up being. That's the big-company mentality, and certainly not the DevOps methodology. You struggled with that. But what you did is you left Microsoft and you moved to Amazon. Why don't you talk about what happened at Amazon? What did you do there?

[0:42:14] MM: Yes, correct. It was exactly for that reason. The speed of development, and they were really treating it like a box product. I think the appeal of Amazon was different. They were doing SaaS properly. In fact, it wasn't even SaaS back then. I think it was, like, just the beginnings of PaaS, more than anything else, right? They seemed to be doing it in a much better way. For me, a mentor of mine at Microsoft actually moved over and took over a particular team. There were, I believe, some reorganizations that happened in that particular team, and they were missing some technical leaders on the Windows part of the EC2 team. And he asked me whether I wanted to join that and start that journey. So, it seemed like a great opportunity to work on EC2, one of the core services at AWS, especially on the Windows side of things, having just come from Microsoft. So, I started there, doing everything from writing driver code to make Windows work on the hypervisors that Amazon was running, all the way to writing observability tooling around, can we successfully launch Windows instances on EC2, in effect, instances of EC2 with the images? There was a point in time where that wasn't guaranteed. There'd be a new image, and we didn't know whether it could actually be successfully launched as an instance in our regions around the world. So, I worked on that, and then eventually moved into more of the management systems around EC2, things like, I think they ended up calling it EC2 Systems Manager, but remote commands and fleet management of EC2 instances is where that ended up.

[0:43:43] LA: Then, that's where you really started with your observability background, and then you took that on to Uber, which, I'm sure, is a whole other set of stories. But that's where you essentially got the idea for Chronosphere. Is that correct?

[0:43:56] MM: Correct. Correct. So yes, I ended up moving to Uber, which, when I joined, had two co-located data centers and didn't use the cloud at all. So, that was an interesting shift coming from Amazon and AWS, for sure. Yes, we were trying to fix infrastructure issues for Uber, and one of the big ones was observability, in particular, trying to help Uber move to the public cloud and use cloud-native architecture, like running on containers and microservices. That's when we ran into these problems ourselves. That's what ultimately led us to creating open-source tools to go solve a lot of these problems. And then ultimately, that led to the creation of Chronosphere and trying to solve these problems for other companies around the world.

[0:44:37] LA: Great. Then you started Chronosphere. So, what's the history behind the name Chronosphere itself?

[0:44:44] MM: Yes. We had to have a name to incorporate, so there was time pressure to pick a name, at least. And it actually came out of just trying random prefixes and suffixes. A lot of observability data is time series based. So, words that had something to do with time, like chrono, were possibilities for prefixes. And then, for suffixes, I think we were just trying, like, stack, or platform, or things like that.
And Chronosphere, I would say, sounded the best out of the worst options that we had there, and we thought we could live with it for now. We could always change it later if we wanted to. And let me just tell you now, you're never going to change it later, for sure. So, it stuck. But the other thing was, the SEO was good as well. When we googled Chronosphere, I think there was, like, one old reference to something from Command and Conquer. And then it was also, I believe, a Greek death metal band. We were like, "Okay, we could probably out-SEO these results." That's really what went into building the name, and we always thought we'd come back and change it later. It never happened, and it is just what it is now. But we're very happy with the name.

[0:45:49] LA: Yes. I know how those things go. Great. Thank you. I hate to say it, but we're out of time. Martin Mao is the co-founder and CEO of Chronosphere, and he's been my guest today. Martin, thank you so much for joining me on Software Engineering Daily.

[0:46:03] MM: Thank you so much for having me, Lee. Always great to chat with you, and looking forward to the next one of these.

[0:46:06] LA: Definitely.

[END]