EPISODE 1833 [INTRO] [0:00:00] ANNOUNCER: Modern cloud-native systems are highly dynamic and distributed, which makes it difficult to monitor cloud infrastructure using traditional tools designed for static environments. This has motivated the development and widespread adoption of dedicated observability platforms. Prometheus is an open-source observability tool designed for cloud-native environments. Its strong integration with Kubernetes and pull-based data collection model have driven its popularity in DevOps. However, a common challenge with Prometheus is that it struggles with large data volumes and has limited cost optimization capabilities. This raises the question of how best to handle Prometheus deployments at large scale. Eric Schabell works in DevRel at Chronosphere, where he's the Director of Evangelism. He is also a CNCF ambassador. Eric joins the show with Kevin Ball to talk about metrics collection, time series data, managing Prometheus at scale, tradeoffs between self-hosted versus managed observability, and more. Kevin Ball, or KBall, is the vice president of engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [EPISODE] [0:01:35] KB: Eric, welcome to the show. [0:01:37] ES: Hey, thank you very much. Nice to be here. [0:01:39] KB: I'm excited to get to talk with you. So, let's maybe start out with a little bit of introduction about you and what brings you here, and maybe start touching on Prometheus and Chronosphere. [0:01:49] ES: Sure. So, my name is Eric Schabell. I work at Chronosphere, an observability company. I have a position with a pretty long, weird title, so I generally just introduce myself as the director of evangelism. As you'd recognize, in a startup stuff gets pretty fluid. Every year we want to reset the target, reset focus, so things are constantly growing and evolving in different directions, and you end up getting more and less and other things under your umbrella. So basically, I do a lot of stuff around DevRel in the observability space, is a good way to put it. [0:02:20] KB: Awesome. Well, let's maybe then talk about what is Chronosphere and what is Prometheus, which we came on to talk about. [0:02:25] ES: Yes. As Prometheus is the topic we decided to discuss here, I'll kind of twist it in that direction. So, Chronosphere is an observability platform. It's a SaaS offering that ingests open standards. The Cloud Native Computing Foundation is something that my team and I are a big part of. Two of us, including me, are ambassadors for the CNCF. We like to help out, contribute, talk about, and do everything we can to promote open source, because we believe that's the best way to get started with basically anything. I devoted most of my life to it in the AppDev space first, before I came over to the observability side, so it's a continuation of what I normally talk about.
In relation to Prometheus and OpenTelemetry, and Perses, a new project, and Fluent Bit, which is also something that we're very integrated with (the founder and a couple of others work at Chronosphere), all these things are the standard way of delivering and communicating over telemetry protocols, and the collecting, storing, querying, that kind of stuff, is all integrated into our offering. So, anybody that starts out initially at a smaller scale in the open source world finds themselves growing and scaling and having more and more difficulty: teams that are growing, that are spending all their time managing the infrastructure and trying to make it all scale. I find it quite nice to be able to unplug something like that and find a vendor that supports open standards and is easily able to assimilate and recognize the query protocols, or the query languages, and the things that we use from Prometheus. [0:03:51] KB: Cool. Let's maybe dive into what Prometheus is. I saw it described as a cloud-native observability platform. What does that mean? [0:04:00] ES: Yes. So, Prometheus is a metrics collection project, and it's the best way to do it. That's one of the signals that you would want to collect in an observability platform. There are some very unique things about what they do. This was originally written back in the day, I believe, by some people at SoundCloud. The idea being that it has to be highly performant, it has to be highly scalable, and they want it to be as unobtrusive as possible. So, one of the things that it does is it scrapes, which means it goes out and gets the data from an endpoint, and doesn't require you to set up things to push data to it. It doesn't involve collectors, it doesn't involve agents, it doesn't involve anything like that. They allow you to fine-tune that as you wish as a developer, so you can use auto instrumentation. You just flip a switch on, say, the Java library, and it starts spitting out all kinds of Java metrics. Usually, that's not a great experience, because Java metrics are just a laundry list of stuff that's crazy. So, you want to get more specific, so then you can use that library to trim that stuff down and deploy applications again with proper instrumentation. They usually expose some kind of endpoint where they publish these metrics, and you set up Prometheus to go out there and scrape these endpoints every so many seconds, 10, 5, 15, whatever you choose to do. You're also dealing with, well, you need a backend for this. By default, if you're just playing around with Prometheus, it just does it in memory, but you really don't want to do that in real life. So, you have to put some kind of backend storage in there, and what you're storing is a time series. Normally a database query is over a finite set of whatever you're collecting: select star on whatever table and get a bunch of data. Here, you're basically trying to represent constantly collecting data on a machine. So, for example, CPU usage: capturing what's going on constantly, every second, millions of samples, is a little bit hard to store and would probably overload almost any organization's network in no time, especially when you're doing it across thousands of machines. So, they have a way of sampling the data and then re-representing that when you do queries. That allows you to do time series queries and figure out what's going on and create the dashboards and all that kind of stuff.
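To make that scrape model concrete, here is a minimal sketch of what a Prometheus scrape configuration can look like. The job names, targets, and intervals are illustrative assumptions, not taken from the episode:

    # prometheus.yml - a minimal, hypothetical scrape configuration.
    global:
      scrape_interval: 15s            # default: scrape every target every 15 seconds

    scrape_configs:
      - job_name: "prometheus"        # Prometheus can scrape its own /metrics endpoint
        static_configs:
          - targets: ["localhost:9090"]

      - job_name: "my-app"            # a hypothetical service exposing /metrics
        scrape_interval: 5s           # per-job override of the global interval
        static_configs:
          - targets: ["my-app:8080"]

Prometheus pulls from each listed endpoint on its interval; nothing in the application has to push data anywhere.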
It also provides alerting mechanisms, so it has the ability to set up alerts and rules around that kind of stuff that then trigger and send off, and it can integrate with things like PagerDuty, or send you a Slack message, or whatever it is that you're using in your org. The query language is also pretty much a standard in the industry. It's called the Prometheus query language, PromQL. With that, you can basically run queries against stored metrics data and create visualizations. There are embedded dashboards inside of Prometheus, but that's not really generally how people do it. They tend to query with an external dashboarding tool, which has traditionally been centralized an awful lot in Grafana, but there's an up-and-coming project, Perses. It's the very first visualization and dashboarding project, and it just reached sandbox status this last fall. And it's going to be at KubeCon EU here with one of the little Project Pavilion stands. It's a Prometheus project, and they're basically focusing first on metrics querying through Prometheus instances. They're starting to expand into traces and some of the other stuff. It's a very young project. But that's pretty much the infrastructure you're going to end up with around Prometheus. [0:07:21] KB: I'd love to dig in on a couple of different pieces there. So first, going piece by piece through what you talked about, one of the things that's distinct about Prometheus is that it is going out to find the data and things like that. Can we talk a little bit about that? I saw, particularly in the cloud-native environment, there was some stuff around service discovery and how you find things within Kubernetes to go and look things up. How does that all work and what does it enable? [0:07:48] ES: Initially, some of the links I sent you, I assume you'll put them in the show notes, are around the workshop that we have. You initially start by doing that statically, just defining where to go, what endpoint, what IP address, where are the targets I'm trying to scrape. That's fun and games when you have one or two things, but you don't want to be managing a list of thousands of those things. Let's be honest, Kubernetes is set up to be an environment that you write a description for and say, go out there and run these services, and when certain things happen, react this way. But that's about all the control you have. Because you really don't know what their IP addresses are going to be, you don't know where they're going to be located, and you don't know how many there are going to be. So, Prometheus has something called service discovery, and you can integrate various tools that do that. They support several different ones. ZooKeeper is a pretty famous one that people know. What that does is just monitor the space you're running your clusters in, and as new nodes, new pods, new containers, and stuff come up, it just automatically adds those to the list and starts looking for endpoints and scrapes those also. Generally speaking, you've ensured that those are providing those endpoints. Otherwise, you're going to have a lot of down targets. Yes, that's kind of how the service discovery works. That makes it all dynamic. And that's much more realistic for a cloud-native environment. [0:08:59] KB: Yes. So, another thing that you talked about here, and you sort of alluded to, you said these are collecting time series, as contrasted to other projects that are doing logging or tracing or things like that.
Can you maybe dive in a little bit on what the distinction is between time series collection versus what you might get in a logging or tracing type of observability? [0:09:22] ES: Okay, yes. So, a little bit of the difference between metrics, logs, and tracing. Starting with logs, that's very standard, it's like text lines, right? Going right down the chain, right? Just a bunch of data that you're storing there in its log lines, that's pretty normal. Maybe you turned it into JSON with something like Fluent Bit that parses it into machine-readable stuff, but it's still text, pretty normal storage. Traces are also something where you're setting points in your service calls that you're catching and tracking as they go by, which is also a pretty straightforward data set that you're putting together. It's not a constant stream of stuff. Metrics are where they have to have a lot of really smart ways to deal with this, because of things like cardinality explosions. If you look at object-oriented programming and you make an object for a person, and it has a name, an age, a social security number, an address, whatever, these are all like metrics with labels. Your person might be the metric, and the labels might be all these things underneath it. If you happen to put something in there like the IP address of the machine that you're getting this information from, that's a unique thing. And every time this gets sent out, you're getting a unique thing you have to save in the database. That's a cardinality explosion. The more unique series these labels can create, the harder it gets to store and query this kind of stuff. In a time series database, you want a way to store this kind of streaming data, these massive amounts of data, in a way that you can measure it periodically and just connect the lines between those short little periods. That's the best analogy I can give you around that. [0:10:58] KB: I guess then a question becomes, how do you decide when you want to use time series collection as compared to full traces or something along those lines? Is it when you start? [0:11:09] ES: I mean, don't get too wrapped up in the way it's stored. Time series databases are needed for metrics, they just are. So, all of them require this. It's not an option where you can choose one or the other. If you're looking at the signal type that you want to collect, whether I should use metrics, logs, or traces, that has to do with what you're trying to observe and what you're trying to watch. So, if you want dashboards with visualizations of what my infrastructure looks like, you have things like thresholds you're setting, and you're watching for things that get beyond the threshold. When you're looking at resource consumption, when you're looking at network lag and things like that, these kinds of measurements only come through metrics. When you're looking to find out where my calls went through my service network, that's traces. Logs are just dumping information that every developer put into their app. Every application that's running on that machine is logging something, right? It's putting messages in the log saying I started up, I'm having trouble, I'm dying now, whatever the deal is. It's three distinct signal types that you're using. [0:12:11] KB: Yes, that makes sense. [0:12:11] ES: And each one is probably stored a little differently, yeah.
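As a concrete picture of that cardinality point, here is a sketch of what a scraped /metrics endpoint exposes in Prometheus's text exposition format; the metric and label names are hypothetical. Every distinct combination of label values becomes its own time series, which is why an unbounded label such as a client IP blows up the series count:

    # HELP checkout_requests_total Requests handled by the checkout service.
    # TYPE checkout_requests_total counter
    checkout_requests_total{method="POST",status="200"} 1027
    checkout_requests_total{method="POST",status="500"} 3
    # A label like client_ip="10.1.2.3" would mint a brand-new series for
    # every client seen - the cardinality explosion described above.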
[0:12:14] KB: Well, part of why I'm asking this is I think metrics are much more common in sort of operational monitoring and things like this, and a lot of software developers are thinking, "Oh, I want my logs so I can debug or trace." So, I'm trying to translate that mindset into what happens here. Let's maybe look then a little bit further. You have your metrics, they're stored in Prometheus, and you're starting to want to query them or think about them. What types of questions are best answered by this data, and what does the query language look like to explore them? [0:12:44] ES: I think you've kind of alluded to it there at the end. I think a lot of modern DevOps organizations are no longer in that boat where developers are just like, "I want to see my logs." That was when I was a developer. That was like 15, 20 years ago, right? I used to make jokes about this with a couple of guys here, because I do quite a bit of stuff with Fluent Bit now, and it's a lot of logs and stuff we're simulating and dealing with. I used to spend time at the coffee machine, because the machines were slow enough compiling the Java code we were writing that you had time to go get coffee and talk about problems you were having. One of the things we used to discuss quite a bit was trying to figure out how to make a good Java exception message for the logs, because some human is reading that downstairs on the third floor, right? That's your ops guy. Now, it doesn't work that way anymore. That's not really what you're focusing on. It's that complete picture of your cloud-native environment. You're deploying apps and services, your team owns the service, and you want to have a visualization of that service and everything about it. Luckily, we have a lot of organizations dealing with things like platform engineering, SRE teams, and all this different stuff, that have a really good idea of how they want to do this: provide a developer view, provide an SRE view for whoever's carrying the beeper, so they have this stuff and know who to call if something breaks. But when you own a service like that, you're not so much digging around looking at individual metrics. You have a predefined look at your running object, element, whatever you call it, the service that you own, and that's your landing page, that's your heart monitor for what you own. Depending upon what the organization has gone through, they pretty much have a good idea of what they want to know. If I'm on the order service, I want to make sure nothing's blocking orders. That's where the money comes in for the retailer. So, they'll be watching really closely the connections to whatever credit card companies they're using, or maybe the incoming stuff from the website. And if something starts slowing down orders and they see that little line go all the way down, they're going to say, "Hey, it's time to look at our dashboard," and they should be able to really quickly find out where the problem is. Are the orders not coming in? Are they not being processed out? Is it timing out going to the credit card company? That kind of stuff. So, I think what you end up with is teams having their own view of their world, right? And that's a combination of all three of those things, not necessarily just metrics or just logs or just traces.
Not saying that that doesn't exist, but generally, you don't want to set up your dashboards with here's my metrics, here's my logs, and here's my traces, and then when the SRE's beeper goes off, he or she is like, "Let's look at a log, let's look at the trace." No, what you want is a high-level view that you can drill down into to get to the problem. [0:15:17] KB: So, with Prometheus then, Prometheus is still just providing one of these pieces, which is the metrics tracking. And I think, as you highlighted, that's probably why people are setting up queries against it from an external tool rather than using the built-in visualizations, because they want to integrate this with their logs, their tracing, things like that. What does that end up looking like? For someone who hasn't, for example, written something in PromQL, what is it as a query language? What does it look like and how does it work? [0:15:48] ES: I'm going to say right up front, it's not easy. This is something that a lot of people struggle with, including me, when I'm writing labs and digging around and trying to find the right thing to give an example of. But it's within the workshop that I alluded to. There are about four chapters involved with learning PromQL that kind of walk you through all the various aspects of it. There are so many different ways to represent your data. You can do a simple counter, you can do a simple gauge, like a tachometer kind of thing. You can do a graph. You can do histograms, which go in the direction of looking at a graph over time. What they're looking at is so complex that they're basically bucketing chunks of it. So, the first minute is bucketed, the second bucket represents the first bucket plus the second bucket, and it gives you a different look at your data. You have heat maps. Have you ever watched baseball when they show the pitches being thrown and which pitches get hit the most? It gets red in the areas where they hit the most. You can represent your data, or your view of what's going on, that way too. You have topologies for your service calls, where you're basically showing the network of the services and how the calls are going, and the lines get thicker the more calls go over them. PromQL gives you the ability to do all this, and some of the tooling embedded right now in the newest Prometheus is really nice, where they explain the query. So, as you're putting the query together, it has autocompletion in there, luckily, helping you find the metrics that are available at the point you're trying to do that. You can see the result sets, and you can dissect them, because you're also doing sort of queries within queries to build up a bigger query. You'll apply things like rate across a query to kind of generalize a specific query out over time. Maybe you only want to look at a small window, only 5 minutes, or only 1 minute, or 10 minutes. There's so much flexibility in that. And yes, it depends on where you want to land. So, generally speaking, say you're a service owner. When you come in there, you have a couple of things that are important to you, probably that it's up. So, you'd like to have a green bar when it's up and a red one when it's not, right? Or a yellow one when it's degrading, and things like that. There'll be a threshold where it starts degrading, and when you get past a certain point, then it's a problem.
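For a flavor of what that looks like in practice, here are a few PromQL sketches along the lines Eric describes. The checkout_requests_total metric is hypothetical; up is a real built-in series Prometheus records for every scrape target:

    # Instant query: current status of every scrape target (1 = up, 0 = down).
    up

    # Range-based: per-second request rate, computed over a 5-minute window.
    rate(checkout_requests_total[5m])

    # A query built from queries: aggregate the rate per instance, then
    # compare against a threshold (returns only series above 100/s).
    sum by (instance) (rate(checkout_requests_total[5m])) > 100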
Things like that are the very high-level look, and then when things start going wrong, you have to be able to dig down into it. So, if it's a specific service, maybe I want to go look at the traces, see one of those topology kinds of graphs, and see that, "Hey, this one is not getting anything." I'll click on that, and I can go down and look at the trace, and then maybe I can look at the log for that specific thing. We like to talk an awful lot about observability being the pillars. In the old days, they talked about logs, metrics, and traces being the three pillars. For quite a few years now at Chronosphere, we've been talking about it as phases. I think we should speak a business language. It's much more complex than an individual tool. I like to tell the story that it's just like you're driving around in an old car you just restored. You're really, really proud of this old car, and you're driving along, and all of a sudden the temperature of the engine starts going up, and it starts getting kind of weird how it's driving, and you're like, "Oh, I'm close to the mechanic I use. Let's whip in here quick and see what he says." You pull up and it's getting worse, and it sounds bad, and you get out, and he comes running out and goes, "Oh, come over here and look at this." He starts opening up a toolbox showing you all the tools. Meanwhile, you look out the window and your car is overheating and catching on fire, and he's in here talking about these great tools he's got. So, we like to talk about the phases of observability, where you want to know as fast as you can what's wrong, then you want to be able to triage it as quickly as possible, preferably fix it with remediation, and then you want to be able to go back, look at root cause analysis, and find what the long-term solution is for this thing. To do that, we don't care whether it's two metrics, one label, three traces, and half a log line. It doesn't really matter, as long as I can get to the answer as quickly as possible and fix the problem. I think that's pretty much modern observability in a nutshell, where Prometheus has a very important role in helping collect, manage, and display these kinds of stories. [0:19:49] KB: Let's maybe, if we can, make it a little more concrete by going through an example. So, let's say, I don't know if we want to use a car example or something like that, you have a service you're monitoring. Maybe let's take it from the ground up. First, what do you need to configure to get Prometheus starting to track things in there? Is there a thought process around what needs to be tracked, or is everything there by default? Then, how do you design those queries that are going to give you that dashboard that tells you, "Oh, shoot, my car is on fire. What do I need to do here?" And then take it through that route. [0:20:19] ES: Right. So, for Prometheus itself, there are exporters, for example. Let's say that your service is a node service, Node.js. It's very easy to turn on a node exporter. It creates an endpoint and generates a bunch of node-type metrics that are pretty standard. You're able to get quite far with just doing that. What's nice about that is you don't have to redeploy anything. There are no code changes involved. You target the service, you turn the exporter on, and you start watching what's going on. Once you start collecting data, you now have all the generic metrics around things like CPU usage and what the memory usage is.
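As an illustration, the standard Linux node exporter publishes host-level series with names like the ones below, which you can query as soon as they're scraped; the exact metric set varies by platform, so treat these as representative rather than exhaustive:

    # Fraction of memory still available on each host.
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

    # CPU busy rate: per-second time spent in non-idle modes over 5 minutes.
    rate(node_cpu_seconds_total{mode!="idle"}[5m])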
I don't know what specific things are in this exporter, but if it was a Java one, it's heap sizes and all that kind of stuff. Everything that you normally would be able to monitor around something like that, without having to design it yourself. So, the people that write these exporters, manage these exporters, and maintain them are trying to make life as easy as possible for people that just want to kickstart it and see what they like. Either way, that's how you usually start, just so you can see what's available, and then you can start trimming stuff down if it's too much for you. Because one of the things that gets out of control really quickly is that it's very easy to collect a lot of data, and if you're doing this in the actual cloud, data in and out is like water through the pipe: it's money. So, you need the ability to see it, which is one of the big things that Chronosphere is good at: using your control plane to see your data coming in and out, and telling you when you're not using your metrics. It turns out, across our customer base anyway, that on average 60% of the metrics that are collected are not used. That's not a bad thing at this point in time. It's just a big chunk of your bill you don't really want to be paying, right? So, the first thing you start looking to do is, how can I trim this down, or at least stop it from being stored? You do want the ability to adjust everything as you might need it in the future, but it's not a problem to not pay for storage if you're not using it. If it doesn't show up in a dashboard, it doesn't show up in a query in an alert or anything like that, no ad hoc queries from the users, nobody's touching it, what are you doing? So, that's kind of what you run into really quickly when you use a standard exporter or a standard library that just spits out everything. But for us developers, that's a nice way to start, right? And then you say, "Okay, I only want to know CPU. I only want to know how much memory it's using. I only want to know whatever." So then, you go back and you start instrumenting it using code. Going actually in there and saying, "I want this, this, and this, and the rest I don't need." Redeploy the thing. There you go. Now, you have it trimmed down to just what you want. And that's the stuff that you're querying to create your dashboards. So, I'm saying, "I want to see a chart that shows me how much memory consumption is going on in the last 5 minutes, or the last 10, or the last day, or whatever it is." By default, it might be the last hour, but you can expand that and drill down into it. Cut and slice and dice any way you want inside that stuff. That's kind of the evolution you go through to get to something that makes you happy. And trust me, when you start getting somebody carrying a beeper, you're going to start seeing things in there that also include documentation in your dashboard. So, there might even be playbooks or runbooks that they have. When certain things happen, go down this path and do this, call this person, alert that person. Look for a feature flag that got changed. Maybe there was a new deployment. Oh, definitely reverse that stuff, that kind of stuff. [0:23:31] KB: Yes, that makes sense. Okay. So, just to make sure I'm understanding the life cycle, first you build this exporter, which you can probably start with just a package off the shelf. That's exposing a bunch of data, and Prometheus is going to then start tracking that in time series.
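A minimal sketch of that code-level instrumentation step, using the official Python client library. The service and metric names are hypothetical; the point is simply that you declare only the handful of metrics you care about and expose them on an endpoint for Prometheus to scrape:

    # Hand-picked metrics for a hypothetical order service, using the
    # official prometheus_client library (pip install prometheus-client).
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    ORDERS_TOTAL = Counter("orders_processed_total", "Orders processed so far")
    QUEUE_DEPTH = Gauge("order_queue_depth", "Orders currently waiting")

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics for Prometheus to scrape
        while True:
            ORDERS_TOTAL.inc()                      # count each processed order
            QUEUE_DEPTH.set(random.randint(0, 50))  # simulated queue depth
            time.sleep(1)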
You look at that and you can start building your dashboard by setting up these queries against it. And I think one of the things you highlighted there that is kind of interesting and might be worth exploring is what time series enables: querying over ranges. I did see there looked like there's a core distinction between instantaneous, moment-in-time queries and kind of range queries. Maybe we want to look at what differences there are there. [0:24:06] ES: You studied hard. [0:24:08] KB: I do what I can. So, all right, you build your dashboard on that, maybe you set up some alerts on that. A question I actually have is, is there any difference between the queries you're doing for dashboards versus alerts? Does Prometheus have native alerting support? Are you querying that on - [0:24:23] ES: It's a separate binary that you can install, but it definitely has an alert manager. An interesting part about this, and I ran into this quite a bit when I first built the workshop, is you tend to forget that it's in memory, but it's also writing to a little file system in your directory there. So, every time I would adjust something and restart and things like that, I'd go look at it, and it would start a new graph of the standard graphs, and I'd be querying just to see uptime. You can just do "up" and it'll show you all the instances that you're tracking, whether they're on or off. And you'll see this little thing going. If you query something else that happens to have a counter that's running, or whatever it is, it'll show you whatever the graph is. It's really hard to get interesting stuff when you've just started. So, you've got to remember that your student is out there doing this. He just started this up on his machine, and his graph starts with an hour with this little blip in the corner, and you basically have to get him to go over here and turn this thing into like a minute, and then it starts looking like something. It's very different from what I have, because I know this: I let it run for half a day, and then I start working on whatever I'm going to give you an example of. Otherwise, we have no nice examples. Also, what's really kind of freaky is you'll go away, come back, and redo stuff, and then there'll be stuff with a big blank spot, and then stuff way back there, and stuff way over here. And if your query doesn't span all of that, you won't see it all until you do. So, you're slicing and dicing stuff that is long dead and long gone, but is still in your database, because it collected that time series when it was alive. So, some container that was running, or some instance of your service, may not be there anymore, but you're getting what feels like false positives. You're like, "Hey, wait a minute. This one isn't running well," when really you're looking at legacy data there. [0:26:02] KB: Yes, that is kind of an interesting question, especially as we talked about, if you're changing the metrics that you're collecting or trimming them down, how do you deal with versioning in this type of metrics database system? [0:26:14] ES: Versioning of what exactly? [0:26:17] KB: I mean, the example you used, right? Okay, I had it running, and then there's this dead spot and something changed. Maybe that dead spot is because I took things down and I'm actually changing my collector a little bit. Now, I have a new set of data from this point forward.
[0:26:32] ES: Yes. That's a little bit more of a development environment we're talking about. So, you should never see this in production. A blank spot means it all went south. That means you are offline, your retail store is selling nothing, people can't get to your website, that kind of stuff. It does happen, but that's not really a good sign. You do everything you can not to do that, right? So, when you do updates or new versions of whatever, it's a rolling thunder kind of thing. That's why we're in a cloud-native environment, so you can bring up a new instance and then take down the other one once the traffic's been taken over. A really good example is how people try to take care of this. We haven't really gotten that far yet, but when this starts scaling, one of Prometheus's weaknesses, I think, is that it was not built for high availability. That's not really part of the design. So, what that means is, if I have one instance collecting all my stuff, and I have dashboards and alerting attached to that, and it gets too much traffic, it'll overload and die. Everything dies. There's no dashboard, you can't query it anymore, everything kind of goes bad. And if you try to load balance that with another instance, you can't put one alerting setup and one dashboard behind them, because an alert will go off on one of these and it won't see the other one. So, then you have to set up alerting twice. You see this scaling start to fan out, and we're not even talking about putting databases behind it yet. You see this spread out into these weird, incredibly complex topologies to try and load balance all this stuff. One of the things you saw was that you could set it up as: I have an instance running, I have another instance on the service, and then I have a third instance on the service. If traffic starts really flooding this service, it'll spin off its own instance just to cover that, to deal with that heavy load and keep the other ones running. That's really nice, but the new one you spun up has no history from before the point that it came alive. The other ones lose all history from the point that the new one took over, unless you know that in your dashboards and can account for it, and query that together into the one dashboard. You know what I'm saying? So, there's a really complex problem you start juggling when you're doing this by hand. [0:28:32] KB: Let's maybe actually dive into that a little bit, because one of the desirable perks of going cloud native is, if you have that big viral hit, it's relatively straightforward to scale, right? You're not dealing with, "Oh, I have to figure out how to set up a new server." You're like, "Okay, take this service and this set of pods and scale it up. Go." What happens to Prometheus as you do that? It sounds like there's some amount of automatic failover or trying to scale in there, but how does that end up playing out? [0:29:02] ES: Like I said, you can spin up a new instance with it, but the high availability is not built in. There's not a feature, there's not a function you can turn on that accounts for that. So, you're getting another non-highly-available instance of Prometheus that starts collecting data from that moment onwards for whatever it's monitoring. And you're applying your own tricks of the trade to spin this stuff up and to automate that kind of thing. This is where vendors start becoming interesting, right?
Because you can already smell and feel, if you're any kind of software developer or a person that's had to manage stuff like this, that now I'm starting to do a lot of proprietary work myself, and it's going to take more hands and more people to maintain this. I always describe it like this: imagine your DevOps people, your whole team, on the right side of the room, doing everything they're supposed to be doing around developing whatever your business is doing, say a retail site or whatever, managing all the services and having a good time. And then slowly but surely, about half the team ends up on the left side of the room, managing your very successful environment, which is now scaled way up and has a whole bunch of metrics monitoring going on and all that kind of stuff. Wouldn't it be nice to get half those resources back on the right side of the room, doing what you want to do, and not messing around managing the infrastructure anymore? That's where vendors start coming into play, and where you're happy you went down the open source road, because you can unplug and reroute your stuff directly into something that looks an awful lot like what you've been doing. You recognize the query language. You recognize the protocols being used. Your dashboards are not lost. Those efforts are not lost. They're easy to replicate wherever you land with something like that, and you take the management out of it. Most of these vendor platforms, like some of the stuff I just described from the Chronosphere control plane, are a big, big help. That basically quantifies all that for you and lets you just concentrate on what you really want to concentrate on. [0:30:52] KB: So, let's maybe talk about that then. What do you get if you're ready to move from Prometheus to Chronosphere or another vendor package? Actually, maybe even just, how do you know, right? Is it when you start seeing that complexity? Is it when you first get a spike that overwhelms your Prometheus? What are the signs that maybe you're outgrowing the manage-it-yourself infrastructure? [0:31:14] ES: There are several things. They're pretty classic in almost any open-source environment, right? There's a really funny marketing kind of story that goes around called "killing your heroes." Who hasn't worked some place where there are one, maybe two people that are the big rock star guys that know it all? Might even be girls, doesn't matter. But I mean, they know everything, right? They were there since whenever, they know where all the bodies are buried, and what happens when they leave? You know what I'm saying? Those are usually the ones that are pretty core to running a complex, heavily scaled-up open source environment. Some places are really happy to do it. Both our founders, our CTO and our CEO, were at Uber in the beginning and spent, I think it was three years there. They built the M3 database from scratch, that's a time series database, set up the whole observability infrastructure for Uber, and ran it for three years. He wrote an article not too long ago about how they have, I think, 400 or 450 engineers or something like that there, just doing the infrastructure for observability. I mean, who does that? [0:32:17] KB: I know they do. [0:32:19] ES: I know why they do it, because he said how much they would have to pay if they came over and did it with us. It would have been like $65 million.
So, you're like, we've all seen some of the leaked information about what some of the customers were spending on somebody's yearly earnings calls, and you're like, "With those kinds of numbers, you can pretty much run a pretty nice department." And I've worked in universities where money they didn't have, but time they had. So, they didn't care how long it took you, or how many hands were involved, or whatever was going on to manage the open source stuff. It just couldn't cost anything. I think you get to the point in time where you're a CIO, or somebody that's responsible for these organizations, head of the central observability team, or the head of the SREs, and you just want your guys focused on what they've got to focus on. You're seeing the burnout, you're seeing the stress, you're seeing too many incidents, things like that, and it's often not hard to figure out where you can start cutting costs and where you can take some of the load off, right? I think people are getting a lot, lot better. You see it at the conferences we go to and you hear it in the talks. The examples are from organizations setting up pretty good observability teams and environments, and they're doing it at big scale. I mean, good Lord, look at the DoorDashes of the world and things like that. They're doing it at a mega scale, and they're not running a team of 450 observability guys. That's not for everybody. I don't think there's any one specific thing, but we've all been in the environments where you're just firefighting, and too much of your team is doing stuff they don't want to do. I used to make jokes about this stuff when DevOps first started coming out, because I was like, "I signed up for dev. I didn't sign up for ops." You'd hear that around. And if you're not careful, and even now if you look on the Internet and Google around, all these developer reports that come out about what developers are doing and what languages they're using, I think they say it's like 35%, 36% of your time that's spent on actually coding. Think about that. That's barely a day in a week. We did our own research, and 10 hours a week were spent on these kinds of observability problems. That's crazy, out of 40? Come on. Is that what you signed up for? And that's why people leave. If I want to be a developer, I want to be a developer. I don't want to be a troubleshooter the whole time, and I don't want to carry a beeper that's going off all the time when I'm trying to have Christmas dinner, things like that. So, let's be honest, it's a complicated thing. It's not easy to run all these very complex things that we're doing. [0:34:47] KB: Yes. Well, I feel like what you're describing here is there's sort of a curve where you start out and you have more time than money. Maybe you're in a university, or you're in a cash-strapped startup environment or something like that. You're like, "Okay, great, open source, get it going. We're just a small environment. It's not a big deal." You get to a point where you are hitting the limits of what open source gets you easily. For Prometheus, it sounds like that might be when you go from one instance to having two, and now you've got to navigate all of these questions: am I sharding, versus am I just replicating, versus all the different federation questions. And maybe we can go into a little bit of detail on some of those, though you've covered some already.
And you say, "Okay, hopefully, by the time you hit that, you actually have a little more money in the bank and you can pay for a service like Chronosphere to take care of it for you." And I do want to put a pin on this and come back to what does migration look like if I do that. And then at some point, though, once again, you get to the point where you're so big that the costs of paying for the service are high, but you also have so much money, you can pay for a whole department to manage it. And then maybe you migrate back out. Let's kind of maybe look at that migration path. [0:35:52] ES: Well, I think you touched on a really good one. I don't want to slip away. I think the big part of open source that we're always chasing is the open standards, the ability to be an architect, whether you're designing apps or designing infrastructure. You want to have the ability to stand the test of time as much as possible, and you also want to have components that can be replaced by another component, but are still speaking the same standardized language. When somebody in all these organizations take enough time and effort to create a standard something, whether it's TCP/IP or whether it's an observability protocol like the OpenTelemetry protocol. If you've chosen that road, that means, and I think containers is a great one. I mean, Docker in the early days owned that, right? They had their thing. They were the big cat on the block. When they started getting approached about we need to standardize this kind of stuff, they didn't really want to hear it. So, what happens is open source world gets together and it's a bunch of companies too, but I mean it's all these people contributing. They sit down there write the OCI, the open container initiative, and now you can write any engine you want against the containers. I have Podman, I have Docker, and I have whatever you want that's coming down in the future. It's all standardized, right? Kubernetes standardized. YAML standardized. I want to have these kind of tools and I think that is what you're doing and what you're positioning yourself for by watching the CNCF, those kind of projects and the observability space, the Prometheus, the OpenTelemetry, the Jaegers, the whatever you're using that's generally speaking doing their best to try and not you know tie your stuff into a knot and what that means is, is when you're ready to actually migrate to something that should be relatively famous. Right? You're going to find out really fast it's a vendor, is that hope or not. One of the things we do as a pilot, so we show it to you, you get a trial run, and you get to take your environment, put it in ours and plug it in and see what happens. Just seeing some really neat stuff happen when they do that. People are finding things they didn't find before, they're figuring out that they have so much metrics coming in, they didn't even touch, things like that. It's because we're all so busy trying to do the day and the day to day, you don't have the time to step back and who hasn't been in those positions where you'd love to step back and like do some real strategic stuff, and then you just don't get the time. [0:38:10] KB: We've talked a couple of times about, okay, you need all these different pieces to have your observability solution. You need metrics so you're keeping track of what high level is going on with your team. You need logs and telemetry so you can dive - tracings, so you can dive deeper. 
In the open source world, you might have spun up some of each of these on your own. Maybe you're spinning your logs out to Kibana, you've got your tracing going with Jaeger so you can do that, and you've got Prometheus covering your metrics. If you're moving to something like Chronosphere, can you pull all of that under one umbrella? [0:38:42] ES: Pretty much, yes, if that's what you're trying to do. It's not necessary either. It's kind of a funny thing. So, say that your organization has been experimenting with this stuff as you go, right? And you might be big on the OpenTelemetry ecosystem. It's just the thing you bought into. It's what you've seen. You've got the collectors out there already. Fine. But you also have some legacy stuff over here in the corner that is kind of a problem, right? It's expensive, it's coming due, and maybe you want to get off of it. But there are tools like Fluent Bit, a very lightweight collector, a telemetry pipeline basically. Anything in, anything out is their catchphrase. So, they have inputs and outputs on both sides. It doesn't matter where you get the data, and it's very lightweight. It's written for the cloud-native stuff. It's a sub-project of Fluentd, which was the more monolithic stuff. The Fluent Bit thing can go out there, and you can even go to edge use cases. It's really lightweight, it's able to handle high volumes and high-throughput streams, it's really quick at processing all this stuff, and it lets you, already at the edge, take down the volume of stuff that's coming in. They can expose it as a metrics endpoint, so Prometheus can go scrape it. They can pass it on to a logging backend. They can pass it on to OpenTelemetry, either using the forward protocol they have from Fluent Bit, or turning it into an OpenTelemetry envelope and passing it off in a form that collector understands. So, the infrastructure you already have in place can be maximized, and for what's out there that isn't able to yet, you can put something like Fluent Bit in front of it and easily kind of obfuscate that it's not speaking the right protocol yet. Then work on that on your own time, and hopefully get rid of it before you've got to renew. You just unplug all that stuff and you're onwards with your OpenTelemetry. The OpenTelemetry collectors can just be redirected to another destination, right? Which could be Chronosphere in whatever cloud you need, or continue to be some backend you have, or whatever the deal is. It's quite flexible. I think that kind of covers the whole roundabout question. [0:40:39] KB: Yes, that makes a ton of sense. So, the one thing that I did want to come back to briefly is, we touched a little bit on what happens as you start to scale up. These decisions around, okay, in Prometheus, maybe you're deciding, am I doing failover? Am I sharding things? What do I need to deal with to scale? Is there a federation type of thing? Maybe we can talk a little bit about what that looks like in the open-source world. And then, bringing that into Chronosphere, does that whole problem just dissolve? Or are there still things to think about even if you're using a managed solution? [0:41:11] ES: So, when you have to start making decisions around how you're sharding your databases, or storage, I guess you should call it, because time series is not exactly the same thing.
That's you managing your infrastructure, and I don't know about you, but when I start hearing sharding and high availability and load balancers and stuff like that, it sounds like we're getting complicated. I mean, I'm not running that infrastructure, but it sounds like it's starting to get hard. The whole idea of a managed service is that you don't really care. You know what I'm saying? It's not that you don't care, but one of the things that we spend a lot of time on is working with the customer and trying to give them the most optimized whatever we can give them. There was a lot of effort a couple of years ago, when everybody was cutting back their spending in the marketplace, to optimize storage and transmission and collection and all this different kind of stuff. I think those are the kinds of things you're happy to look at, but you don't want to spend your time on them, right? That's the reason you try to get off of the infrastructure you're managing yourself. And the bill is important: monitoring the bill, providing insights into what my consumption looks like, what teams are using it, being able to balance that kind of stuff. You're almost starting to get into the FinOps kind of view of what's going on in your organization, right? The financial operations. Those things are all baked into a more mature observability platform, which can include pipelines to limit what flows through, data, traces, logs, events, whatever. All kinds of stuff gets integrated into that. The look and feel shouldn't be so dramatically different, which is what's nice about coming into something like Chronosphere. I think there's an awful lot you're going to recognize from the open source world. It's the same query language. It's the same idea of what you're looking at. Dashboards are relatively the same. It's an experience you're looking for. How do I get my stuff set up as quickly as possible? How am I able to dissect stuff? There are definitely things under the hood that you will not find in the younger open-source environments, stuff that you might otherwise want to write yourself. But I mean, we're talking about serious organizations before you get to that kind of level. We have various features like that that take you right down into the problem in a couple of clicks. We got some great quotes from customers when they tried it. I'm not really trying to sell you anything here, but that's what the maturity looks like, and that's the difference between doing it on your own and seeing it scale up and starting to have problems. To be really honest, everybody's environment is way different, right? Everybody has their own specific problems, their own specific legacy stuff, and their own specifications. We have some customers that have discovered that they use less than 10% of their ingested telemetry data. That's extreme, but that's a use case where they're very much focused on something specific, and that's the only thing they monitor. Fair enough, if that's the case, that's the case, but they do it at massive scale. So, that leads to a lot of automatic ingestion of garbage, basically, for them. But even if you can just get half the chaff out of the way, that's got to be an ROI you're interested in. We show that. You get to actually run it for a couple of weeks and see what it looks like in a pilot. It's not to alleviate fears and stuff. I mean, it is, but it's to show you what it looks like in your environment, what it can do, with real data and real environments.
It's not meant to be just a test bed. So, that's kind of the experience that you're looking for. Can I get off of what I'm doing and stop thinking about sharding, and stop thinking about, "Oh my God, there's another instance, and another one, and another one, and I'm going to get beeped again." [0:44:40] KB: Yes. Absolutely, it's how do I focus on getting the value out of this thing, not just all the work to keep it running? [0:44:46] ES: I was going to say, one of the things we often talk about is nobody would care what it costs if you had a better customer experience, if I had a better on-call experience, if I had happy engineers, if I had more money in the bank, and less downtime. Very often that's not the case. It's just a bucketload of money going somewhere and everything is all a mess. [0:45:07] KB: I heard somebody say at some point, data is growing exponentially, but I have yet to find a company whose data budget is also growing exponentially. [0:45:16] ES: Yes. That's a really good example. And we wouldn't care if we were using it, but you're just not using it, that's the problem. KB: So, every time I have one of these conversations, I come away being like, "Holy smokes, all the different things I learned." And hopefully, folks listening along also have that sense of coming away smarter than they were an hour ago. ES: I would say, go take a look at all the workshops we have. You can get hands-on, from zero to installing it, to learning about Fluent Bit, the Perses project, Prometheus, or OpenTelemetry. We have all of that online for free. [0:45:48] KB: Awesome. Well, thank you, Eric. [0:45:50] ES: You're very welcome. [END]