EPISODE 1873 [INTRODUCTION] [0:00:00] ANNOUNCER: Modern software platforms are increasingly composed of diverse microservices, third-party APIs, and cloud resources. The distributed nature of these systems makes it difficult for engineers to gain a clear view of how their systems behave, which can slow down troubleshooting and increase operational risk. Groundcover is an observability platform that uses eBPF sensors to capture logs, metrics, and traces directly from the kernel. Critically, Groundcover runs on a bring-your-own-cloud model, so all data remains within the user's own environment, which improves privacy, security, and cost efficiency. The company is also focused on adapting to how AI-generated code is changing observability. Code can now be produced at superhuman speed, which increases the challenge of reviewing code before it enters production. This means that observability is likely to play a growing role in code validation and providing guardrails. Yechezkel Rabinovich, or Chez, is the CTO and Co-Founder of Groundcover. He joins the podcast with Kevin Ball to discuss his journey from kernel engineering to building an eBPF-powered observability company. The conversation explores the power of eBPF, the realities of observability in modern systems, the impact of AI on software development and security, and where the future of root cause analysis is headed. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [INTERVIEW] [0:02:00] KB: Hey, Chez. Welcome to the show. [0:02:02] YR: Hey, thank you. Nice to be here. [0:02:03] KB: Yeah, I'm excited to have this conversation. Let's maybe start a little bit about you. Can you give our listeners just a little bit of your background and then how you got to Groundcover, where we are today? [0:02:14] YR: Yeah, sure. More than a decade in software engineering, specializing in Linux and distributed systems. I mainly worked on kernel modules, until I got sick of it, and then fell in love with eBPF. Then we founded Groundcover, where I'm the CTO and co-founder, and that's what I've been doing for the last four years. [0:02:35] KB: eBPF is really cool. We did another episode about that. I hadn't been exposed before. My previous Linux background had been 15, 20 years ago. Then coming in and being like, wait, you mean to integrate with the kernel, I don't have to go through this arduous patch process and kernel release process? It was mind-blowing. Actually, let's maybe even start a little bit there. How are you utilizing eBPF for Groundcover? [0:02:58] YR: Yeah. As you said, after a few years with kernel modules, you fall in love with the power of extending the kernel, which is very cool, but it's also very, very hard. The development cycle is so slow, because any mistake you make can basically crash the system. After a few years as the R&D manager and leader, Shahar, my co-founder, and I realized there is a big problem with instrumentation. We all know SDK instrumentation. You have all the classic vendors. You have all the open-source standards. But still, you have to instrument your application, right? It sounds very easy. You have a lot of tutorials. But in real life, it's very, very hard.
Most companies have a lot of different runtimes, different versions, and you need to keep track of all those SDKs. Why not do it with eBPF, which basically lets us instrument the application from the kernel side? Without any risk to the application itself, you're running in a sandbox. You have the kernel verifier, which will not load your program if you're doing something wrong or something that can potentially harm the application, but you still get 95% of the value. We can inspect any syscall; we can intercept HTTP, SQL, Redis calls. Whatever you're doing, we can probably see it. That leads to Groundcover, which is an observability company. I would say, more than an observability company. We utilize eBPF alongside classic instrumentation, but our main sensor is based on eBPF. So, you deploy the sensor, and in less than a minute you get traces, logs, metrics, everything in one place, without changing your code. The other cool thing that we do is our backend is based on bring your own cloud, which means your data stays inside your cloud account. Very nice in terms of privacy and security, and it also allows us to reduce costs, because we're not charging by volume, and customers are usually happy with that. [0:05:08] KB: Yeah, for sure. Well, one of the things that you alluded to there is an area I think we can dive into a lot. You said, a modern observability company, and I think a lot is changing right now in terms of how we build applications, how we deploy applications, how we need to be observing them. What is needed for modern software development from the observability side? [0:05:30] YR: I think nowadays, the average platform is so complex, with integrations, third parties, cloud resources, different SaaS vendors that you're using for feature flags, or hosting, and the average engineer struggles to even know what components the platform is relying on. That's where classic instrumentation fails, right? Because it's the unknown-unknowns, right? You don't know what you don't know. This is a nice experience we see with customers that deploy the sensor for the first time. All of a sudden, they see those links between their application and third parties. It could even be something very mild, or very small, like fetching an avatar from a third-party website that they didn't even know about, or things like that. I think nowadays, the baseline of modern observability is to have all the information in one place. You'd be surprised how many of our customers, before they used Groundcover, used five, six, seven different tools. Their signals were spread across different platforms, and just correlating between them is so hard. I think having all the data in one place is the very bare minimum for an observability platform, in these days where most companies have 100, 200, 300 microservices and the number is just increasing. [0:06:58] KB: Yeah. Well, and the fact that you're able to gather all that data without the engineers having to add the instrumentation to their code means you're able to capture those unknown unknowns, because if an engineer didn't think of it, if they had to instrument it, it wouldn't be there. [0:07:13] YR: It's 100% visibility into what your application is doing. It's very nice and very mind-blowing the first time. [0:07:20] KB: Yeah.
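To make the kernel-side instrumentation idea concrete, here is a minimal, illustrative sketch using the BCC toolkit: it attaches a kprobe to tcp_connect and reports every outbound IPv4 connection any process makes, with no changes to the applications being observed. It is only a toy in the spirit of bcc's tcpconnect tool (assumes bcc is installed, runs as root, and is kernel-version dependent); it is not Groundcover's actual sensor.

```python
# Illustrative BCC sketch (not Groundcover's sensor): report every outbound
# IPv4 TCP connection on the host, observed from the kernel side.
import socket
import struct

from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <net/sock.h>

struct event_t {
    u32 pid;
    u32 daddr;
    u16 dport;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);

// tcp_connect() runs after the destination address has been set on the socket.
int trace_tcp_connect(struct pt_regs *ctx, struct sock *sk) {
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.daddr = sk->__sk_common.skc_daddr;
    ev.dport = sk->__sk_common.skc_dport;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_connect", fn_name="trace_tcp_connect")

def handle(cpu, data, size):
    ev = b["events"].event(data)
    ip = socket.inet_ntoa(struct.pack("I", ev.daddr))  # stored in network byte order
    print(f"{ev.comm.decode()} (pid {ev.pid}) -> {ip}:{socket.ntohs(ev.dport)}")

b["events"].open_perf_buffer(handle)
print("Tracing outbound TCP connections... Ctrl-C to stop")
while True:
    b.perf_buffer_poll()
```

Even this toy surfaces the "unknown unknown" third-party calls discussed above, because it sees connections regardless of whether the application was instrumented.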
Well, I think there's another big trend, which makes this very interesting, which is it just feels like, in particular, with the way that software is changing in the AI development world, the volume of change going on is so high that being able to even keep track of what's going on has just gotten harder. [0:07:41] YR: You can see it even from how the development life cycle changed. A few years ago, not that many, it would take you a lot of time to write the code. Something that product could explain in a few words just took time. Engineers optimized for typing speed, right? There are competitions around how fast you can type. Currently, that barrier basically does not exist. Any software engineer can basically write code at superhuman speed and velocity, and this is no longer a barrier. You basically can print unlimited lines of code in minutes. Is it doing what you think it's doing? Is it the architecture that you planned it to be? Maybe. How do you code review it? [0:08:30] KB: I am feeling that pain tremendously right now. What's your answer? How do you code review it? [0:08:37] YR: With AI, of course. No, I'm half joking. But to be honest, obviously, at Groundcover, we use AI to write code. We use AI for code reviews. One of our engineers just wrote a utility that uses AI to create the pull request with a Groundcover flavor. It collects the information from the ticket, maybe the Figma that correlates to that ticket, and basically generates a PR with AI, so that the engineer just has to handle the last mile. Obviously, the short answer is, of course, we use AI for code reviewing. But at the end of the day, engineers still need to be accountable for the software that we ship. I personally think that testing should be very, very mindful, because I've seen tests that actually make sure the bugs stay, because the tests are also written by AI. Code review in the AI era, for me, starts with, first of all, let's look at the tests. That's been true forever. But with AI, it's just sometimes impossible to read all the code, and maybe sometimes it's unneeded. There are different pieces of software and different requirements. For instance, me personally, I just wrote a very simple library that can parse MetricsQL into a logical representation. We integrate that in the platform. Apparently, this did not exist. I was shocked that this does not exist in TypeScript, and I wrote it. I don't know TypeScript. I've never coded in TypeScript. I manually crafted all the scenarios. Half manually, right? I described all the scenarios, and then let it run. But I manually crafted the scenarios that I think would logically challenge the lib, then followed along with the implementation, and it seems reasonable. I've never read the code. But for this kind of library, it doesn't really matter, right? It's a simple string input. The output is very simple. There are no side effects from using that library. It's not something in the hot path of the platform. That makes sense. On the other hand, if we implement a new parser inside the sensor, which runs at high throughput with zero allocations, the code itself matters. You have to know how much you allocate. Do you have any memory leaks? I think the question is very dependent on the context of what you are building. That's something that we at Groundcover started to differentiate. Let's think, what are we checking now? Does the code matter? Maybe it doesn't. Maybe we can replace it in a week and it doesn't matter. [0:11:46] KB: Yeah.
I mean, I think to some extent, what you're describing there, it reminds me of a metaphor I've used before, which is increasingly, the code gen is essentially like a compiler. When was the last time you read the binary that a compiler generated for your code? You didn't, probably. But you did check, did it do what I expected? Did it pass the tests? Does it behave as I anticipate? For a lot of code, now that's essentially what it is. Does it pass my tests? You described a few other things that might matter, like performance, functional pieces. How do you validate those in a world in which agents are building all of your code? [0:12:28] YR: Yeah. I think the difference between transpiling, or maybe compiling C code to assembly, is that you don't miss out on the architecture when you translate C to assembly. When you only test for input-output, you do make sure that that piece of software does what it needs to do, but you don't really - you can't make sure it does it in the way you want it to be done. It's also not deterministic like compilers, which tend to be very deterministic in most cases. I think it is different in that sense: what about the architecture? How does it fit with future features that you want to integrate? I think, when you talk about architecture, the comparison to classic compilers doesn't represent it well enough. [0:13:26] KB: Yeah. Well, that's definitely true. Or at least, you need a new "source code", right? You need technical documentation, or tests, or some sort of validation. Yeah. No, you're right. Not all the metaphors align. I think this does, though, get you into this interesting thing with what you all are doing, which is tests are one form of validation. Reading the code is another form of validation. Another form of validation is like, what do the logs say? Is it allocating memory? Is it performing? Is it doing all of these different things? Having some sort of feedback loop between the actual running code and whatever is writing it, whether it's an engineer, or an agent, is extremely important. [0:14:07] YR: Yeah. The problem with an LLM is that it can lie, right? It's statistics. I don't trust logs being generated by an LLM, because it can just make up logs. Maybe it refactored that code and forgot to rename that variable. Maybe it thought it would write it and it eventually did not. I personally rely less on logs and also on markdowns, or Cursor rules. I tend to use them less, because I think, if you want to embrace AI with the possibility that it will fail, you have to put in some guardrails that are very deterministic. You mentioned tests. Tests are brilliant for that. Tests usually don't lie. I've seen cases where AI injects special cases into the code base just so it all slips past the tests, but it's very rare. It's very rare. I anticipate we're going to see it less. I think linters should be very fashionable now. I think we should have more complex rules to make sure the complexity of functions is something that we can live with, or even that conventions are something that we want to enforce, because eventually, you are going to read that code. There is a chance you're going to read that code. We have to make that assumption, because at the end of the day, when you wake up at 3am and something is wrong, you can tell everyone that you're trying to craft a prompt to fix it, but the responsibility is yours, and you're not sharing it with that AI bot, right? We're still relying on humans at the end of the day. For the near future, at least, it still looks like it.
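As one example of the kind of deterministic guardrail being described, a CI step could enforce hard budgets on function size and branching using nothing but the standard library. The thresholds, and the idea of counting branch nodes as a rough complexity proxy, are illustrative assumptions, not a recommendation of specific numbers:

```python
# A deterministic guardrail sketch for CI: fail the build when any function
# exceeds a size or branching budget.
import ast
import sys

MAX_STATEMENTS = 40   # assumed per-function statement budget
MAX_BRANCHES = 8      # assumed per-function branching budget

def check_file(path: str) -> list[str]:
    tree = ast.parse(open(path).read(), filename=path)
    failures = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            statements = sum(isinstance(n, ast.stmt) for n in ast.walk(node))
            branches = sum(isinstance(n, (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp))
                           for n in ast.walk(node))
            if statements > MAX_STATEMENTS or branches > MAX_BRANCHES:
                failures.append(f"{path}:{node.lineno} {node.name}() "
                                f"has {statements} statements, {branches} branches")
    return failures

if __name__ == "__main__":
    problems = [msg for p in sys.argv[1:] for msg in check_file(p)]
    print("\n".join(problems) or "all functions within budget")
    sys.exit(1 if problems else 0)
```

Unlike an LLM-written summary of the change, a check like this cannot be talked out of its answer, which is the point being made about deterministic guardrails.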
I personally think eBPF fits very nicely with this, because eBPF will tell you the truth. This is the HTTP request. [0:16:15] KB: Yeah. It's not dependent on your code. This is what happened. [0:16:19] YR: Exactly. We've even started to instrument testing environments, getting the traces back to the AI and saying, "Look, this is what happened. What do you have to say about that?" You have to create guardrails that rely on a solid ground truth. [0:16:37] KB: It's a good observation for why your observability should be separate from your code base itself, because you don't want the LLM to write lies into it. That's really cool. Now, a challenge I've seen before with observability is it can be very verbose, right? There's a lot of stuff that happens in a modern system. You get thousands and thousands of lines of logs, or what have you, from relatively simple interactions. That's one challenge on the pure data storage and transfer side. If we're feeding these back into LLMs, there's a context management challenge, too. How do you all think about that? [0:17:13] YR: Yeah. I've seen a lot of people saying, "Hey, let's just send all those logs to OpenAI. They will tell us what happened." Then you realize that in those five minutes, there were 20 million logs. You're right. When we introduce AI-based code, it actually increases the data that we're sending. That just makes things worse. What we're heavily trying to do at Groundcover is to be able to summarize data in a stream-aggregation fashion to basically represent trends, or patterns. For instance, logs are the simple use case: you have 15 log lines, and if we can nail the patterns, we can actually convert them to time series. Time series is very compact. Any AI agent will happily look at a graph and say, oh, this is interesting. Then you can narrow it down to a service, or time frame, that will allow you to dive deeper. This is one way. This is maybe the simplest signal to summarize. We're also doing it for APIs. We create baselines on interactions. This is a bit more tricky, because deciding on which dimensions something counts as the same communication pattern is a very complex issue, especially when you look at it from a network perspective. Just to give you an example, think about query parameters that generate 1 million different routes. You need to understand that those are the same API and that this is a variable. That's a very simple example. We are trying to compact all those signals into baselines. Then, this is something that you can feed the AI for trends, and then narrow it down until you can find the raw data, but in a volume that's still manageable to send to an agent. Another interesting idea is to look for other signals. For instance, change management. Think about an image change. Something happened. This is probably a good place to start looking, around that time. You basically narrow the time window, and with that, you narrow the context needed. [0:19:45] KB: If I understand a little bit, and just to replay back to make sure I'm on the same page, you're looking for a set of patterns, the ability to build a baseline, the ability to pull in any external information, like when an image changed, to show, here's a relevant area and a high-level description. Then you expose some explorability, so that the agent that you've passed that off to can say, "Okay, great. This looks like it's a problem. Give me more data for this spot." [0:20:15] YR: Yeah. Yeah, exactly. You can also go wrong with this process and then go back.
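A toy version of the log "patterning" idea described above: mask the variable tokens so many raw lines collapse into one template, then count templates per time bucket, which is the compact, time-series shape an agent can reason about. The masking rules and sample lines are made up for illustration, not Groundcover's logic:

```python
# Collapse raw log lines into templates, then count templates per minute.
import re
from collections import Counter

MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<ip>"),   # IPv4 addresses
    (re.compile(r"\b[0-9a-f]{16,}\b"), "<id>"),        # long hex identifiers
    (re.compile(r"\d+"), "<num>"),                     # any remaining numbers
]

def template(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

raw_logs = [
    ("12:01", "GET /avatar/9431 from 10.0.3.7 took 182ms"),
    ("12:01", "GET /avatar/1022 from 10.0.9.2 took 95ms"),
    ("12:02", "payment failed for order 778, retrying"),
    ("12:02", "GET /avatar/5512 from 10.0.1.4 took 210ms"),
]

# (minute, template) -> count is effectively one time series per log pattern.
series = Counter((minute, template(line)) for minute, line in raw_logs)
for (minute, tmpl), count in sorted(series.items()):
    print(minute, count, tmpl)
```

The same masking idea applies to the API-baseline problem mentioned above: a million query-parameter variations of one route collapse into a single templated endpoint.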
Sometimes, and every engineer knows this, you see some suspicious log and you think you found it. Two hours later, someone comes into the office and says, "Oh, no. This is not the issue. I know this." The same thought process will happen to AI. It's an intuitive process. But to make it efficient, you have to start with some kind of patterns, or time-series-represented signals. [0:20:52] KB: Then, do you expose those to your LLM via an MCP server? Or how do you approach that? [0:20:59] YR: Yeah. Groundcover was one of the first observability vendors that exposed MCP. Actually, we started before MCP had OAuth authentication. We were almost ready to release, then the new spec was announced, and we recreated the entire authentication mechanism. That was fun. Yeah. While we worked on the MCP, we learned these kinds of things. At the beginning, we were very, very naive. We thought, hey, we have the Swagger, the OpenAPI spec, very simple. We're just going to convert it to MCP, which is basically another HTTP server. Eventually, we realized agents are not that good at reading OpenAPI docs and using them. [0:21:40] KB: Especially if you have a very large API, with a lot of different things. Yeah. [0:21:44] YR: Yeah. You can imagine that our APIs are very sophisticated, in terms of, you can use a bunch of operators and conditions with recursive groups, and some kind of operators between those groups, things that the UI does constantly, even for humans. But when you write in a query language, you pour in a lot of intuition about how you're using those conditions. We found out that the agent didn't really like it. We had to limit the number of options in order for the agent to make a reasonable call. We encouraged the agent to use simpler APIs and let it narrow the search, and only then exposed more complex APIs. You have to keep the API somewhat more restrictive, relative to what you would do with an SDK, where you want the developer to have all the options in the world to find what's relevant for their scenario. [0:22:50] KB: Yeah. No, I've definitely seen something similar. If you have too many options available, it just gets confused, and starts throwing random stuff at you. [0:22:58] YR: Yeah. Also, I must admit that at the beginning, I thought we could also feed an LLM with the OpenAPI spec and tell it to create an MCP that works well for it. But that didn't work well either. It didn't understand how an agent would effectively consume those APIs. We were very disappointed with that process, where we had to go back and manually craft the endpoints and think about the use cases. We were worried: are we leading the agents to do the things that we think are right, when maybe they would prefer to do something else? At the end of the day, we saw 100% better results when we closed off so many APIs and limited the number of results. We forced the agent to get at most 20 results, for instance. Otherwise, it just got 2,000 log lines and got stuck on random nonsense. [0:24:01] KB: Yeah, that's interesting. If I'm hearing correctly, some of what you did is you, one, applied your expert judgment. Here's a set of things that probably will be helpful. We're just going to expose those. Two, gave it this progressive disclosure, where it's like, here's where you start. Okay, now we're going to expose a little bit more. Now we're going to expose a little bit more along the way. [0:24:24] YR: Yeah. We also make some parameters required, although they're not required in the visual SDK.
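A sketch of what those guardrails can look like behind an agent-facing tool: a required cluster parameter and a hard cap on results, regardless of what the caller asks for. The parameter names, the cap, and the stubbed backend are assumptions for illustration, not Groundcover's actual MCP schema:

```python
# Agent-facing tool sketch: required parameters plus a hard result cap.
from dataclasses import dataclass

MAX_RESULTS = 20  # never hand the agent 2,000 log lines

@dataclass
class TraceQuery:
    cluster: str                 # required for the agent, even if the UI can infer it
    workload: str | None = None
    since_minutes: int = 15
    limit: int = MAX_RESULTS

def fetch_traces(query: TraceQuery) -> list[dict]:
    if not query.cluster:
        # Push the agent toward a flow: list clusters first, then query traces.
        raise ValueError("cluster is required: call list_clusters first")
    limit = min(query.limit, MAX_RESULTS)
    # A real implementation would call the backend; a stub keeps this self-contained.
    return [{"cluster": query.cluster, "trace_id": f"t-{i}"} for i in range(min(limit, 3))]

print(fetch_traces(TraceQuery(cluster="prod-eu-1", workload="checkout", limit=500)))
```

The error message doubles as guidance, steering the agent to the cheaper, higher-level call before it is allowed near the heavy ones.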
For instance, it's meant to tell the agent: if you are looking for traces and you don't know what cluster you're looking at, something is off. You need to do something else. But if you're that clueless, you're using the wrong API. It has to go through a certain way of thinking, because it has to know first what cluster it's looking at. Now it can think, "How do I know which cluster I want to check those traces in?" That leads to a flow where, eventually, it gets the right cluster. For some APIs, where the output was less verbose and more high-level, we allowed a more primitive set of variables. You could ask questions like, what changes happened across my entire production? Or, what incidents do I have, and then get the labels of those incidents and then think about where you want to check. Some leading indicators for where I should look with the heavy stuff, where I should look at the actual raw logs, the actual raw traces, which could be very overwhelming. That API can easily return 10, 20 megabytes of response. [0:26:05] KB: That's fascinating. Essentially, you're making the API much more restrictive than you would for a human, because you're saying, "Hey, there's a right way and a wrong way to do this. If you're calling this without these variables, you probably have not thought about, or figured out, enough to get a useful response here." [0:26:24] YR: Yeah. When you think about it, this is very human nature. Imagine you're standing behind a junior software developer looking at some kind of incident in production. All of a sudden, you realize that person is just reading all those log lines randomly. You're like, "Hey, stop for a moment. What are you doing? Let's figure it out." High level, where is the issue? Let's filter all those logs. Let's focus on what we know will probably lead us to some realization about what happened. We're trying to put that notion into the flow of the API. It's not 100% successful. It can still get lost, but it helps keep the wild investigations pointed in some direction, narrowing things down. [0:27:27] KB: That makes a ton of sense. Now, we've talked a little bit about identifying patterns, doing that very deterministically. I was curious if you had explored any statistical pattern recognition, or LLM-based pattern recognition internally, that you can then expose up to an end-user agent, or anything like that. [0:27:48] YR: We played with it and we're still playing with it, because I think this is obviously the future. We have a feature that is copy-to-agent, which you can see in modern tools today, this kind of copy to agent, where you have a prompt that can guide the AI agent to what you're doing. We started doing that for more than just a section. Imagine that I can represent a flow you did in the platform visually and explain it to an agent, saying, "Well, you went from traces to that workload page, and then you clicked on logs, you added those keys, and you landed here." Then take a screenshot, or something like that, and give it to an AI. This is an alternative universe to MCP. Instead of letting it use APIs to your platform, you let it experience the UI in a way, in a markdown way, or in a screenshot way. The results were actually surprisingly good. You would think, we put so much effort into UX to make the app human-friendly, to help you slice and dice a lot of data, and then you throw all that away and say, okay, the AI agent will just use MCP, or APIs.
If you just take a screenshot of your application and send it to an AI agent, you'll be amazed how much it understands from that scenario. We're definitely looking at this stuff; nothing is production-ready yet, but we are playing with it. [0:29:33] KB: It is interesting, right? These things are trained on human thought processes and they can only incorporate so much information. All of this thinking we've put into how we help a person reach the right things they're looking for can be applicable. Switching threads a little bit and looking at another hot topic in this AI expansion space is privacy, security, those sorts of things. As you're building an observability solution for this modern era, what types of privacy and security needs are there that are maybe more relevant today than they were a few years ago? [0:30:13] YR: Yeah. Logs and traces have the potential to contain PII, at a minimum. We all try not to do those things, but it happens, right? Someone just puts an object inside a logger and, all of a sudden, you've printed PII into your logs and now you need to understand how to delete it, or how to contain it. Groundcover is built on bring your own cloud, which I think is becoming very, very popular in the LLM era. You have non-humans sniffing through your code, or your data, or your most private data, so all the data stays contained inside your environment. Imagine if you have a very sophisticated agent running in some SaaS, you don't know where it is, and it learns from your data, analyzes it, summarizes it. Now you're more exposed to prompt injections and even just bugs, right? Even just something someone didn't think would happen. Groundcover is built to use your LLM inside your account. The risk is a lot smaller. You basically have all the data, with the agent, inside your cloud provider. At least, you know the physical parameters of where data is being served and saved. I think this is the future for agents in general. You don't want a very sophisticated agent to have your data externally. It's as we do with humans, right? When we onboard new employees, when we onboard contractors, we want them to use our laptops. We want them to stay in our offices, or in our network environment. With AI agents, the potential risk is just a lot bigger. [0:32:23] KB: It does feel like that's the next generation of SaaS, is everything can be done within your cloud, wherever it is, at least if you're on the big cloud providers, but increasingly, whatever cloud you're in. [0:32:35] YR: Yeah. While we're already at it, I think Kubernetes changed the world in standardizing where you deploy code. Now Kubernetes is commoditized, like RDS. It's easy to assume you're going to have a managed PostgreSQL instance and compute, and maybe even object storage. To be honest, that's 90% of what you need to build a very complex platform. This is what Groundcover is actually doing. We use very simple services to run that production, and we make sure we can run it everywhere. We obviously have the big cloud providers. We also have on-prem and air-gapped. Once we relied on Kubernetes, a lot of things got way simpler. [0:33:21] KB: Well, now that we're talking about the way things are evolving, what do you see as what needs to happen in the observability space over the next five or 10 years? Where are we not yet solving the problems that are really there? [0:33:36] YR: The holy grail is root cause analysis. Everyone wants agents to tell us what went wrong and how to fix it, right?
You see it everywhere. Everyone is targeting root cause analysis. I think this is the expected future. We're still not there. It's still very complex. You can see root cause analysis in very basic scenarios, right? You have that error log and you can explain it. You can correlate it with infrastructure changes. That's very basic, and this is already happening. What about a two-hour investigation? You want that co-pilot. You want someone to be there with you and do the research with you, while you are investigating an incident in production. We've all been through incidents where you are six hours, eight hours in, trying to figure out not only what happened; you also want a way to remediate. How do I recover from it? For that, we need a deep understanding of the architecture, a deep understanding of the changes, and also a lot of world knowledge on engineering. That's still not there. I think this is definitely what we are targeting, but it's going to take some time to get to it. [0:34:59] KB: What would you say are the underlying components that would go into making that possible? [0:35:05] YR: Good question. We need to have a model of what production looks like. We need to have a notion of how we learn production, right? How we learn the architecture. This is still something that's usually represented by a simple graph. Think about how long it takes an engineer to understand the entire architecture. It takes years in a modern company. For some companies, there's never been one person who can understand everything. We need a way to create that knowledge and also keep it up to date, because these things change so fast. It has to get the live data from production to understand how it's built and also the consequences of the architecture. I think, for instance, eBPF is a very good way to understand the behavior of applications, the interactions between applications, the dependencies between applications. You can definitely build that graph, represented with some kind of degrees of what the side effect is if this component fails. We can also inspect disks and network calls. That's a very good start. You also want to understand the interactions of people with those components. You want to understand the teams and the responsibility of the teams for those components. A lot of things are happening on Slack, right? We need to understand the interaction there. A lot of planning of your features, okay, we need Jira, or Linear, to also understand the planning for the future. Maybe someone already wrote, this is going to break; we need that plan. We need to understand that as well. What's the future? A lot of organizational knowledge on R&D and product should be there as well. That's a lot of knowledge. [0:37:14] KB: That's interesting. I was thinking about, you brought up Kubernetes and how that has really shifted deployment. I think one of the things that it does is it takes at least one part of your stack and makes it declarative, right? You can see some piece of your architecture and pull that knowledge out. Terraform is similar. Any of these declarative, infrastructure-as-code types of solutions give you the ability to analyze that part of the architecture. Then, as you highlight, eBPF gives you live behavior data. That's another piece. The organizational side is an interesting one, right? How are these systems used? What's going on with them?
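A minimal illustration of the dependency-graph idea from a few turns back: from observed connections (the kind of runtime edges an eBPF sensor can report), build a reverse-dependency map and ask for the blast radius of a failing component. The services and edges here are invented:

```python
# Toy service dependency graph built from observed runtime connections,
# with a blast-radius query for a failing component.
from collections import defaultdict

observed_calls = [
    ("frontend", "checkout"),
    ("frontend", "search"),
    ("checkout", "payments"),
    ("checkout", "postgres"),
    ("payments", "stripe-api"),      # third-party dependency discovered at runtime
    ("search", "elasticsearch"),
]

callers = defaultdict(set)           # dependency -> services observed calling it
for src, dst in observed_calls:
    callers[dst].add(src)

def blast_radius(failed: str) -> set[str]:
    """Every service that transitively depends on the failed component."""
    impacted, stack = set(), [failed]
    while stack:
        for caller in callers[stack.pop()]:
            if caller not in impacted:
                impacted.add(caller)
                stack.append(caller)
    return impacted

print("payments down ->", sorted(blast_radius("payments")))   # ['checkout', 'frontend']
```

The hard parts discussed above (keeping the graph current, layering on ownership and planning context) sit on top of this basic structure.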
I wonder if there's other places where, like Kubernetes, there's some declarative structure we can put in place that would let us short circuit that a little bit and just get a sense of, oh, here's how info flows in this system. [0:38:08] YR: I think you can learn a lot from code reviews. If you read code reviews, you probably get a lot of insight into where the weak spots of the components are. You basically want to understand why things are the way they are. You can always jump to conclusions, right? We maxed out the database connections and that's why it broke. There is something more than that, right? There is a reason why we limited that number of connections. Maybe a developer created a new query that consumed a lot of those connections. We actually set that limit to protect the database from over-consuming CPU, or something like that, in order to serve other applications. There's a lot going on between the APIs, the applications, and Kubernetes. I think we need to find a way to understand why things are built the way they're built. What were the architecture decisions? Did we make them intentionally? Maybe we didn't. Maybe we just didn't know this is the default behavior. Is it the default behavior? There are a lot of things to know that sit between the code, the application, and the documentation of those services, and we still need to figure out how to get that information. I think what comforts me is that, at the end of the day, engineers solve those problems. They go to Slack, they search everywhere, they read GitHub issues, they open AWS documentation and understand that it's the default behavior. They open Groundcover, see how many connections they have currently. What's the storage throughput at that moment? Is it different than it was yesterday? The knowledge is there. We just need a way to represent it and also to make it efficient. Because humans usually have intuition about issues. We need to understand how that works. [0:40:15] KB: Yeah. No, I think that the modeling piece is really interesting, right? In some ways, even as you mentioned with log data, you're finding that time series is a very effective way to model things for an LLM to utilize and to use in different ways. If LLMs become the glue that is running through these different inferences, it matters how we represent this data to them. They're often quite linear thinkers, from what I can see using them. If there are nonlinearities, that needs to be represented somewhere in how you're presenting that info to them. Yeah, it's fascinating. [0:40:49] YR: I think we also need to decide how much we are going to help it. Because most use cases, most incidents, can be represented as 20-30 questions that you have to answer, right? Most incidents are missing resources, noisy neighbors, a bad version, exceeded quotas, some infrastructure failure. We can help by crafting, as you said, 50, 60, 100 scenarios that the AI can go through linearly, and only then go on to think on its own. Basically, that's what we would do as engineers, right? We're going to have those 10, 20 playbooks. If it doesn't fit any of those, you know you're in trouble, but you still have some learning from what you did, because it's not that, it's not that, it's not that. Now you're looking for something weird. Something weird means you're going to call X, Y, or Z and ask for help.
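A sketch of that "20-30 questions" playbook idea: deterministic checks that run linearly before anything open-ended, where each check either returns a finding or rules itself out. The check names, metric fields, and thresholds are invented for illustration:

```python
# Playbook sketch: deterministic checks run in order; a miss still narrows the search.
def missing_resources(m):
    if m["pending_pods"] > 0:
        return f"{m['pending_pods']} pods pending: likely insufficient CPU/memory"

def noisy_neighbor(m):
    if m["node_cpu_pct"] > 90 and m["service_cpu_pct"] < 30:
        return "node is saturated but the service is not: suspect a noisy neighbor"

def bad_version(m):
    if m["error_rate"] > 0.05 and m["minutes_since_deploy"] < 30:
        return "error rate jumped right after a deploy: suspect the new version"

PLAYBOOK = [missing_resources, noisy_neighbor, bad_version]

def triage(metrics):
    findings = [f for check in PLAYBOOK if (f := check(metrics))]
    # Even a miss is useful: report what was ruled out before escalating.
    return findings or ["no playbook matched: escalate, attach what was ruled out"]

print(triage({"pending_pods": 0, "node_cpu_pct": 95, "service_cpu_pct": 12,
              "error_rate": 0.01, "minutes_since_deploy": 400}))
```
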
I think even if we just do that, we can do it before an incident, maybe, because we can have more questions answered, or even just present a report saying, "Here is a custom dashboard I created for those scenarios. This is what I checked. Blah, blah, blah, blah. That's what I know. I'm stopping now. Now you help me help you." That's a big milestone. That's a big milestone, if you can have a co-worker that is actually a bot that helps you investigate and take branches. I'm thinking of mentioning the Groundcover agent: go look for storage issues in the cluster. It comes back after a minute saying, "You know what? I found this. I don't think it's relevant," with a graph. That's a big thing. [0:42:38] KB: Now that would be super nice. To your point, I think a lot of the folks who are most effectively using AI for whatever purpose right now, they're building out playbooks. They're building out scenarios, where someone has thought through it, maybe using an LLM, maybe on their own. They've described what they think should happen. That can then be reused and built upon. Yeah, you could have these 60, 100, 200 scenarios. Some of them might be purely deterministic. You could just go and look and see like, hey, did this happen or not? Others might require an agent to do some decision-making. Yeah, that would be a really nice roll-up. [0:43:15] YR: Well, we're working on it. [0:43:16] KB: You are. Okay. That actually feeds into this question, right? What is coming next from Groundcover, or coming soon? What are the dimensions on which you guys are pushing this forward? [0:43:28] YR: Yeah. There are two main verticals. One is observing LLMs. The other one is LLMs for observability. We just released the LLM observability piece. We use eBPF. We talked about it. It's very easy for the sensors to pick up LLM calls. We can now see every API call to an LLM inside your production, or dev, or staging. We can now track token consumption, compare between models, correlate that to workloads, and even show you what caused that API call. This is the LLM observability track. We're just getting started. This is a huge milestone for us, because until now, we focused on classic APIs. Now we're trying to get the Bedrock API calls, or Azure OpenAI, and all that. It's very interesting and very fascinating. We're hearing customer requests for rendering the images sent to the LLM, and a lot of wild stuff. This is one vertical that we're going to invest in. The other one is using LLMs for observability. We are taking a very different approach. We started by making sure of the data. Most observability data, you would think, is JSON logs. No, that's not true. Most data is just random printing of logs with weird formats and multi-line messages and a lot of, sorry to say, garbage data. No matter how smart your agent is, if the data is garbage: garbage in, garbage out. We are actually now focusing on using LLMs to create observability pipelines to clean the data. We offer to parse those log lines into a much more meaningful format. We suggest parsing specific fields that we think are meaningful, to create time series from them and represent them in a much more sophisticated way that will allow you, the user, to create more advanced queries. That's the first step. The second step, after the data is organized and cleaned, will obviously be to use LLMs to analyze the data. We already started doing that with patterns and, as we talked about before, the baselines of the signals.
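Tying back to the LLM observability piece, a toy roll-up of captured LLM calls into per-workload, per-model token usage might look like this; the event shape and field names are assumed, not the platform's actual schema:

```python
# Aggregate captured LLM API calls into per-workload, per-model token usage.
from collections import defaultdict

captured_calls = [
    {"workload": "checkout", "model": "gpt-4o", "prompt_tokens": 812, "completion_tokens": 143},
    {"workload": "checkout", "model": "gpt-4o", "prompt_tokens": 640, "completion_tokens": 98},
    {"workload": "search", "model": "claude-sonnet", "prompt_tokens": 1200, "completion_tokens": 310},
]

usage = defaultdict(lambda: {"calls": 0, "tokens": 0})
for call in captured_calls:
    key = (call["workload"], call["model"])
    usage[key]["calls"] += 1
    usage[key]["tokens"] += call["prompt_tokens"] + call["completion_tokens"]

for (workload, model), agg in sorted(usage.items()):
    print(f"{workload:10s} {model:15s} calls={agg['calls']} tokens={agg['tokens']}")
```
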
The target is to have that co-pilot running with you when you investigate, or even when you're just doing research. It doesn't have to be just in a crisis. I found myself using Claude Code for understanding repositories. Not even for writing code, just asking questions about that repository. Sometimes it's just making things up. Sometimes I can actually get very good answers from it. That saves me a lot of time. Imagine you can have it on your observability data. You can ask questions like, "I'm looking for a cross-AZ consumer that is inflicting $10,000 a month of cross-AZ network costs." That can easily be a two-hour research task for an engineer, and maybe two minutes for an AI bot that you can just mention on Slack and get a dashboard answering your question. Yeah, that's the goal. [0:47:02] KB: No, I love it. I think the dashboards, and being able to surface up the right dashboards at the right time, is a really nice thing. I read an article about looking for more AI-enabled heads-up displays, rather than co-pilots, where it's just going to show you what you need to see at that time, so it's easy to understand what's going on in your system. [0:47:22] YR: Yeah, I love it. [0:47:23] KB: Awesome. Well, we're getting close to the end of our time. Is there anything we haven't talked about today that you think would be important to leave folks with? [0:47:30] YR: I'm going back to the beginning. We want to leverage AI to write a lot of code. We need to find smarter ways to check, in a very deterministic way, what the code actually does, how it behaves, and what architecture it represents. No matter where we end up, it has to be deterministic. Otherwise, we're just lying to ourselves with another layer that is not deterministic judging another layer. I don't know who's listening, but we need better linters. I feel like this is the answer. We need new tools. I feel like this is the era for having another stage in the CI to make sure the things we care about are being enforced on the AI agents that write the code. Tests are very good. Linters are a good start, but I feel like something smarter needs to be created. [0:48:32] KB: Yeah. There's something interesting there. I almost wonder. I observe a wide difference in how well LLMs write particular software languages. I wonder if there's a designed-for-LLM-generation software language that needs to happen, where you can enforce those architectural pieces at the lint level. [0:48:51] YR: Interesting. Probably, there are a lot of developers that will code in just English, right? Because if you think about it, programming languages are built to be compact, but for specific tasks, right? Very efficient for specific tasks. But now with AI, I feel English is probably a good way to represent an idea. [0:49:14] KB: They're not only compact, they're also, to your point, deterministic, right? This thing means exactly one thing that can be reproducibly created. English is terrible for that. [0:49:26] YR: Yeah, I agree. [0:49:28] KB: I love English as the first specification level, but you need something that you can, as you highlight, deterministically verify, evaluate, enforce restrictions on, something where you could draw very clean lines. [0:49:44] YR: Yeah, sounds interesting. [0:49:45] KB: Well, this has been an absolute pleasure. Thank you, Chez, for the time today. We'll call that a wrap. [0:49:51] YR: Thank you. [END]