EPISODE 1899

[INTRODUCTION]

[0:00:00] ANNOUNCER: Enterprise IT systems have grown into sprawling, highly distributed environments spanning cloud infrastructure, applications, data platforms, and increasingly AI-driven workloads. Observability tools have made it easier to collect metrics, logs, and traces. But understanding why systems fail and responding quickly remains a persistent challenge. As complexity continues to rise, the industry is looking beyond dashboards and alerts towards agentic AI systems that can reason about operational data, reduce toil, and take action when things go wrong.

SolarWinds offers solutions to monitor, understand, and remediate issues across complex distributed systems. The company began as a leader in network and infrastructure monitoring and has evolved to support modern applications, cloud environments, containers, and AI workloads with a growing focus on reducing operational toil.

Krishna Sai is the Chief Technology Officer at SolarWinds. He joins the show with Sean Falconer to discuss how SolarWinds is rethinking observability in the age of AI, what it means to design agentic systems for mission-critical environments, how AI-assisted programming is reshaping engineering workflows, and why the future of operations depends on building platforms where humans and autonomous agents work together.

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:44] SF: Sai, welcome to the show.

[0:01:46] KS: Thanks, Sean. It's great to see you and meet you. Big fan of the show. So thanks for having me here.

[0:01:51] SF: Oh, well, thank you so much. That's nice to hear. Yeah, I'm looking forward to this as well. So, I wanted to start off talking a little bit about SolarWinds and kind of set the stage there, because I think a lot of people know SolarWinds maybe as a single tool that they used years ago, but you guys do a lot of different things.
So, given where you are today, how would you describe what SolarWinds actually is to someone who hasn't looked at the space in a while?

[0:02:17] KS: No, absolutely. If we take a step back and look at how IT and ops teams, who typically use SolarWinds products, have been using SolarWinds for the past 25 years or so, our product portfolio broadly spans three domains: observability, incident response, and service management. To put it simply, IT and ops teams use us to help detect and remediate issues across a variety of workloads in their environments. Network and infrastructure, which is where we started and have been a leader for a very long time, but also applications, databases, containers, ML workloads, etc. And so our solutions cover this from a horizontal perspective, meaning they give you the ability to look at the general health of the typical workloads - compute, storage, network, etc. - but also vertical crosscutting concerns like performance, reliability, cost, security, and so on, right? And typically, IT and ops teams are accountable for SLAs and SLOs, right? And that kind of drives your day-to-day behavior. More mature teams, of course, manage error budgets at scale, and they have nuances along that same dimension. But all of this is much simpler said than done. I was talking to a CIO as part of a customer call recently. He's the CIO of a major system integrator responsible for running big managed global services for an organization. And he said it well: "I'm responsible for SLAs. But honestly, I can't tell you everything that contributes to an SLA." Which is a statement of complexity in these environments. But also, increasingly, these teams have to deal with large microservices, distributed systems, etc. And so complexity is very real. And so, especially in the context of AI and so on, our goal is to reduce toil.
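The error budgets mentioned here are simple arithmetic over an SLO: a 99.9% availability target over 30 days leaves 0.1% of that window as the budget you are allowed to burn. A minimal sketch (all numbers illustrative, not SolarWinds-specific):

```python
# Hypothetical sketch of error-budget arithmetic behind an SLO.

def error_budget(slo: float, total_units: float) -> float:
    """Units (requests, minutes, ...) allowed to fail under the SLO."""
    return (1.0 - slo) * total_units

def budget_burned(slo: float, total_units: float, failed_units: float) -> float:
    """Fraction of the error budget already consumed (can exceed 1.0)."""
    budget = error_budget(slo, total_units)
    return failed_units / budget if budget else float("inf")

# A 99.9% SLO over 30 days of minutes (43,200 minutes total)
# leaves roughly 43.2 minutes of allowable downtime.
budget_minutes = error_budget(0.999, 30 * 24 * 60)
burn = budget_burned(0.999, 30 * 24 * 60, failed_units=21.6)  # half the budget spent
```

Mature teams alert on the burn rate of this fraction rather than on raw failures, which is what makes the budget actionable day to day.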
We've all been there: waking up at 3am, alert storms, and getting into a war room. And the problem is that even today, a lot of the tools just ingest a whole lot of data and show you a lot of dashboards with red lights and so on, but finding out why something is red is still a big challenge. So, we've been thinking about this challenge. When we think about AI assisting with this, traditionally we've gone from statistical approaches - things like anomaly detection, machine learning, basic stuff - to now a very clear shift to agentic AI, not just in our industry but across the board. And so that's something that we want to focus on and increasingly index on. And the way we talk about that is we just call it SolarWinds AI more broadly. But in particular, the agentic portion of it we call SolarWinds AI agent. It often gets confused with AI observability, which is something that comes up a lot. And the way we think about AI observability is as more of a vertical use case, rather than a horizontal thing. But that's also something that we're starting to do here.

[0:05:33] SF: Mm-hmm. Yeah. I think there's quite a bit to unpack there. I definitely want to talk a lot about how the use of AI agents is starting to impact the types of use cases that your customers are generally interested in. But one place I wanted to start off with, because I think a lot of people listening to the show and certainly a lot of businesses I talk to are interested in understanding what some of the leading technology companies are doing when it comes to leveraging AI, is AI-assisted programming. How heavily has SolarWinds invested in AI-powered programming tools? Is that something where every engineer has now got a junior engineer in their pocket, where they're leveraging something like Claude Code? Or where are you with that?

[0:06:20] KS: No, we are.
We are investing very heavily. All our engineers use AI-assisted coding. Actually, it's super interesting, because just yesterday we were reviewing the progress that we've been making through the year, and all engineers are enabled with AI assistance, both in terms of copilots as well as agents, and so on. And what we're broadly seeing is, of course, an increasing amount of code being generated with AI. The percentages vary from organization to organization, and by how you measure it, and so on. But we're seeing increased commit velocity. It has gone up like 25%, 30%. North of that sometimes. Part of what we're seeing is that deployment frequency has gone up, as an example. We're seeing some improvements in lead times. But what we're also seeing is that the tools are maturing, which means that the acceptance rate of generated code, as an example, has significantly improved over the last year as both the models and the agents have improved. But the shift is, of course, now more on the code review side, right? The bottleneck has moved to the code review portion. That's something that we're starting to address and look at how to make better. Broadly, I would say the signals are positive. Of course, the concerns are the usual concerns around code quality, and flaky test generation, and security guidelines, etc. Those things are still very much front and center. But maybe this is where you were going: coding is actually a very good baseline, because we've all been working on it as an industry for a long time. And the nuance of, "Okay, how does the use of coding agents now extend to broader enterprise software use cases?" is something that's front and center.
And that's where we've been playing both sides of it: using coding agents and building our muscle around how agents work, while at the same time offering them through our products to our customers. There's that context as well. One of the things I think about is how these agents apply in enterprise software use cases. If you take a coding example, right? If you ask a coding agent to go add OAuth support to my function, or refactor this module to be async, as an example, then instead of operating on a single code snippet, the coding agent breaks the task down, goes and reads the repo, inspects dependencies. There's a lot going on underneath: editing multiple lines, running tests, so on and so forth. This notion of setting the intent, and then the system deciding what actions it needs to take to drive towards that goal, turns out to be a very good mental model to baseline on in terms of how we think about agents in the context of enterprise software.

[0:09:39] SF: Yeah, I think that's a very astute point. I've been thinking about this a lot recently, and I actually wrote an article about it. I think one of the reasons why programming has been such a tip of the spear is, like you said, there's all this history associated with it. But it's also a hard-truth environment, I would say, where even though the output might be non-deterministic, there are deterministic ways of checking the correctness, and that becomes almost like a reinforcement learning cycle, because I can compile the code, I can run it against unit tests. And most environments don't have that.
I think for other, non-coding environments to be successful at the level of coding, you need to be able to create similar deterministic guardrails where you can actually evaluate the outputs in some reasonable way, so you have confidence that it's actually generating something correct.

[0:10:27] KS: No, absolutely. And I think that's why that analogy makes a lot of sense to me when we try to internalize it: what is the intent, and how do you measure it? Which is why, in operational practices, things like SLOs and SLAs come in super handy. Because at the end of the day, what you're really trying to drive towards is a certain healthy operational state, which is actually well defined in the non-agentic, very human-driven world as well. Because a lot of your practices, incident response as an example, follow a traditional setup: a threshold fires, a page goes out, an engineer wakes up, the engineer does a series of things, checks dashboards, pulls logs, inspects traces. It's a very disciplined type of approach that the industry itself has matured along. But for an agentic system to, let's say, mimic that, and to make it a lot more effective, efficient, and eventually autonomous, one approach is to say: how do you not remove that logic or that set of practices, but have the agentic system absorb it, so to speak, right? And that's where I think a lot of the implementation and design challenges come in. Take a typical agent as an example. If you have an SLO, a system could be observing rising error rates. It notices that, "Okay, this is isolated to a specific service. And then it correlates that to a deployment that happened 10 minutes earlier. Observes a trace pattern that happened during a previous incident.
Concludes that this is a bad config change or whatever." In all of these steps, I would say, there's a lot of historical knowledge and action that an agent can learn from, right? And that's where I think the analogy holds between how coding agents work in a coding use case and how an operational agent works in an operational use case. There are a lot of similarities there.

[0:12:40] SF: Yeah, absolutely. And you have a tremendous amount of experience in traditional enterprise infrastructure, having worked at a number of large, successful organizations. How much do you see building agentic systems as something that's brand new, versus a rebranding of some of the typical things that we would do with any software application?

[0:13:08] KS: Yeah. No, that's a great question, actually. If you think about the evolution of operational systems, traditionally they were monitoring systems: things that were polling and observing state, with manual thresholds, and you acted on them. That was the first generation, so to speak. And then they evolved into things like observability, where the system expressed its state in multiple ways, and there was a way to correlate across those multiple signals. And then there was this concept of AIOps, which is essentially using these signals, coming up with ways to correlate them, and making decisions. Still, I would say, very early on, but that's where it started. Then we started to see copilots emerging, where there's assistive tooling sitting next to all the human decision-making that's happening. And now we're starting to see the early green shoots of agentic AI, where there are agents that can actually act, even autonomously at times, within boundaries and so on.
There's an evolution of how this industry has gone through all of that. And some of that is, I would say, the natural push and pull of technological evolution. But a lot of what has made it almost an existential need is just the sheer exploding complexity, and tool sprawl, and data. And at some point, we all realized that a human alone is not going to scale in terms of maintaining the health of these complex environments, right? That's the evolution that I've been seeing.

[0:14:55] SF: Yeah. I mean, we saw a similar evolution in the world of biology, too. There's just a sheer amount of data that exists in biology. People started sequencing DNA and so on 30, 40 years ago at this point. Technology has played a massive role there. And a lot of people say that the 21st century is going to be the age of biology, because of the fact that now we have powerful enough computers. We have these really powerful models to assist in the data crunching involved in evolving that science, because no one human, no matter how gifted you are, could possibly keep all those things in your head. And I think we're seeing a similar evolution in technology, because in any complex enterprise environment, there are literally thousands of different data systems that might be sending important signals all over the organization. How do you start to be able to parse that? And most businesses are sitting on terabytes or heaps of unused data, where they hold on to it because it might be useful some day, but they don't have a way to unlock that use.

[0:15:56] KS: That's right. That's right. And if you extend that and ask yourself why, in an operational context, this was an aha moment that we had probably a couple of years ago.
We've always kind of loosely had the strategy, but it came more front and center a couple of years ago. If you think through how operational systems have evolved in digital environments, there was a monitoring and observability industry which was focused only on getting signals, showing dashboards, and giving alerts, as an example. And then you had this incident response, or more DevOpsy, type of environment, which saw all those signals but adapted to how teams were operating: incident response, being able to observe and maintain SLOs of services, manage error budgets, and so on. And then you had the IT systems off on the side, with very IT-driven, service-delivery-driven, shall we say, structured processes that enabled enterprises to scale these practices. But what all of that did was create silos in terms of IT operations management, IT service management, etc. And the silos happened in organizations, but they also happened with data. And to your point, when you have all of these massive amounts of data in ingestion - which is something that we've mastered very well now - they end up creating all these massive data silos. What happens is, when you want an agentic system or any kind of AI system to work, data silos become incredibly complex and expensive, and you have this third wheel of separate AI systems that then ingest all the data and have to process it on top of that. And then you have multiple dashboards and consoles. And so, when you're dealing with operational situations and war rooms, there are just so many different consoles spread around everyone's desktop. With all of these challenges, it became critically important that we take another look at how these things work. Internally at SolarWinds, as an example, in our design discussions, we use the left-brain/right-brain analogy, right?
If you take the human brain - talking about biological systems, right? - it's actually the most wonderful biological system for observability ever created. We walk around the face of the earth, ingesting all kinds of signals through our five senses. And you can be in a crowded mall with a lot of noise, and someone says "Sean," and you're instantly paying attention because that's your name. The subconscious seems to have this phenomenal mechanism of being able to process all those signals, decide what's important and what's not, and surface, shall we say, actionable signals to the conscious, where you, as Sean, can say, "Hey, is that really meant for me? Well, maybe that's a different Sean." And I can go about walking around getting cookies in my mall, so on and so forth. When we think about extending that analogy to an observability use case, you have this system of observability on the one side, which is optimized for your mean time to detect, so to speak. And then you have these right-brain systems, where there are actions, and runbook automations, and workflows, etc., all optimized around remediation. And these two come together as a unified system. And then you have the analogy of the subconscious and the conscious, where in the subconscious there's all this processing happening, and the conscious is where the human element comes in. And increasingly, the conscious actions, which are the actionable things, are also taking on various dimensions of autonomy, right? This analogy of the human brain is something that we talk about a lot internally when we think about these systems.

[0:20:12] SF: How does that impact the way that you think about using agents for observability? Traditionally, as we've talked a little bit about, observability is a lot of dashboards. It's metrics.
It's maybe some level of statistical ML to highlight certain issues like anomalies and so forth. But it was really about giving visibility into the systems for humans. Now, with AI agents potentially playing a role there, it's really about giving inputs to machines, not necessarily dashboards for people. How does that change how you think about building these systems? And how is SolarWinds approaching this problem?

[0:20:46] KS: No, I agree. There's a lot there. It may be useful to baseline on a couple of examples of how we've seen this evolution take place, and then maybe I'll go into more of the design choices that we've had to make. In the evolution of how gen AI started to help out with problem-solving, diagnostics, resolution, and so forth, we first saw this copilot phase, where gen AI essentially became a very capable interface layer sitting next to the system. Not necessarily inside it, but next to the system. When you're debugging a production issue, the copilot helps you summarize things, and explain an error pattern, and write a query across multiple metrics, and traces, and so on. Still very useful, but fundamentally still very reactive, right? We have a couple of examples just to illustrate this. We have a configuration agent, as an example. And the configuration agent is interesting because configuration is one of the highest-leverage, highest-risk surfaces in modern systems. If you think about the number of outages that have been caused by poor configurations, it's pretty massive. And the other thing that's super interesting is if you think about DNS misconfigs, which we hear about every other week these days, or certificate expirations, and overly permissive security rules, and so on.
What happens is a lot of them don't necessarily result in a crash. So you don't get an exception that you can go look at. But there's a subtle degradation of service behavior. And so you don't realize that an outage actually happened, because nothing specifically crashed, and the system is just executing to your configuration. What typically happens in a copilot world is that the configuration failure would be discovered after the fact. An engineer would get paged, and the copilot would summarize all of that. But what we're increasingly doing is changing that behavior to where a config agent is continually looking at how the service itself is degrading. And then, when there's a service degradation, it has all the information required to go and correlate that to a config change, and then make a very effective choice about whether to do a rollback or something else, bring a human into the loop, so on and so forth. We're starting to see these types of use cases increasingly. Now, when we think about engineering for a lot of this, there are some very important considerations, right? And the hardest problems actually tend to be architectural in nature. That happens especially when you're dealing with production systems, mission-critical systems. How do you think about building AI software that can act in real-world environments and not just be a copilot? How can you start to build that autonomy? That becomes a very important design choice. Internally, we think of this as AI by design, which is not AI-first or AI-everywhere, which tends to get misunderstood a lot. But how do you design a platform from the beginning with the assumption that AI-driven components would exist, would evolve, and eventually operate autonomously, right?
And I think the mistake that a lot of teams end up making is to treat agents like a feature that you bolt on after the fact. What you end up with is these powerful models, which people have spent billions of dollars building, sitting behind super brittle guardrails. And then you have this problem of, "Hey, I have this best model. Why is it not giving me the results that I want?" Then you realize that you don't have a basic system that is really designed for these things to act. We have this statement that we use: the model can propose, but the platform must dispose. Meaning, treat the model for what it does. It's a great reasoning component, but make sure that the platform is the safety boundary. And so, when you start to build out these types of systems, you have to have specific architectural platform components in place, right? And these become very concrete design choices. LLM gateways are a great example. Early on, teams were experimenting with wiring logs directly to an LLM, as an example. It works great in a demo. Executives are super impressed. But it immediately runs into problems the first time you try to put anything close to it in production, because costs spike unpredictably, sensitive data gets exposed, different teams hardcode different models, and then suddenly you can't change providers, enforce policy, and so on. One of the initial design choices that we had to make was to bring all of that together in a platform service, an LLM gateway, which then handles everything from model selection and abstraction, PII masking, rate limiting, auditability, and so on and so forth. A lot of those shared concerns across your expanding AI use cases are isolated there. That's a very good choice. The other one for us is that you just can't throw MELT data at LLMs, right?
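A minimal sketch of the LLM-gateway idea just described, one shared choke point handling model routing, PII masking, rate limiting, and an audit trail. The masking and routing here are toy stand-ins, and the provider call is stubbed out; none of this reflects the actual SolarWinds implementation:

```python
# Hypothetical sketch of an LLM gateway: every team's request passes
# through one service that routes, masks, rate-limits, and audits.
import re
import time

class LLMGateway:
    def __init__(self, routes: dict, max_rpm: int = 60):
        self.routes = routes            # use-case name -> model name
        self.max_rpm = max_rpm
        self._calls: list[float] = []   # timestamps for rate limiting
        self.audit: list[dict] = []     # audit trail of every request

    @staticmethod
    def mask_pii(text: str) -> str:
        # Toy masking: redact anything that looks like an email address.
        return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[REDACTED]", text)

    def complete(self, use_case: str, prompt: str) -> dict:
        now = time.time()
        self._calls = [t for t in self._calls if now - t < 60]
        if len(self._calls) >= self.max_rpm:
            raise RuntimeError("rate limit exceeded")
        self._calls.append(now)
        model = self.routes.get(use_case, self.routes["default"])
        safe_prompt = self.mask_pii(prompt)
        self.audit.append({"use_case": use_case, "model": model})
        # Real code would call the selected provider here.
        return {"model": model, "prompt": safe_prompt}
```

Because routing lives in the gateway, swapping providers or downgrading a use case to a smaller model is a one-line config change rather than a change in every consumer.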
On the logs point I mentioned, it's one thing to generate a whole lot of logs. Logs tend to be super noisy. So we have to think about how you feed logs to an LLM, or a model, for decision-making. You need to really think about logs: how are you going to compress them, deduplicate them, summarize them before you ever expose them to a reasoning layer in your platform? Instead of asking a model to read half a million log lines, you have a compact representation of what changed, what's anomalous, what's new. That's a classic systems approach that we've applied in other use cases, but you need to bring it into something like an AI system. The same thing also -

[0:27:16] SF: Yeah. I want to go back to a couple of things that you said at the beginning. You talked about these copilot experiences, where even with a really great copilot, you're still finding out after the fact that there's some sort of problem, and it has to essentially wait for a user to prompt it. And it sounds like you're trying to move to a world where you have these more ambient agents that are always on. And I really believe - I'm a big proponent of this - that's going to be the next evolution of the use of agents. There's been a lot of success in the B2C world with AI, especially with ChatGPT. And I think that has been fantastic, but it's also locked a lot of companies into thinking that the only way these interfaces work is as a chatbot, some sort of copilot. Maybe there's an agent behind it, but it's still just a chatbot waiting for a user to ask a question. But in these operational use cases, I don't want to have to ask if there's a problem. I want the system to know that there's a problem, do some work on my behalf, and then loop me in when human decision-making is required. It kind of sounds like you're thinking about that in a similar way.
And I guess one of the questions I had is that there is, in a lot of ways, a substantial difference between chatbots, especially in the B2C world, and the B2B world of solving specific operational use cases. If I want to constrain this to figuring out whether we have some sort of DNS config issue or an expired certificate, the shape of that problem is very different from a chat interface where anybody can ask any unbounded, unlimited set of questions. I'm curious, how does that shape your thinking when it comes to the types of models that you might need to use, or even the way that you are architecting this? Do you need trillion-parameter models in that world, or can you get away with something more like an SLM that's tuned to the particular problem at hand?

[0:29:14] KS: Right. Right. No, I agree. I think the model choice is definitely very, very important. And we do that. That's why this example of the LLM gateway is a very good one, where, depending on the type of use case and the consumption, you can actually make that choice at the gateway level rather than the downstream consumer having to make it. And you're absolutely right. What we initially saw is that LLMs, as they get bigger and better, are able to handle a lot of the generic use cases pretty well. And then you marry that with a RAG system, as an example. And even there, when you think about RAG systems, there are a lot of design choices that you have to make. But marrying that, and then having a system that needs to be able to work in the background - all of those things need to tie together. Which is why initially bolting a copilot on top of a set of existing APIs exposed via MCP, as an example, will get you going with very basic assistive tool calling. And when you wake up at 3am dealing with alert storms, you have no idea where to even start, right?
Having a copilot that gives you some contextual prompts based on the problem or alerts you're looking at makes a lot of sense. But immediately the problem shifts to, "Okay. What now? What next? And how much of this can I do?" And that is where a lot of these types of ambient agents, as we call them, come into the picture. And you're absolutely right. There are agents where you do need, let's say, the heavier models. But for a lot of use cases - for example, we use Claude models a lot; we use other models as well, but Claude models a lot - we use the Haiku models for some very specific use cases. For example, in our ITSM product, one of the agents that we have in our service management product handles what happens when a ticket comes in. Typically, it gets assigned to a human agent, and the human agent has to go review the ticket. And oftentimes, a ticket gets forwarded 15 times before it ends up with the right person. When a human agent gets a ticket, there's all this history behind it, and there's a lot of cognitive load. One of the agents that we have will go look at your previous incident data, go through that whole process of managing it, generate a response, and give you a context summary. When the human agent comes in, they not only see what's currently going on; they see: this is the customer you're dealing with, this is the sensitivity, and here's a summary of everyone who's looked at it before. Sean's looked at it. Sai's looked at it. And here's the summary of what they thought. And by the way, here's a suggested response that you can quickly edit, as the human in the loop, and respond with. These types of use cases and experiences. And if you think it through, the beautiful thing about that example is how many things come into the picture. You have your entire system that is working in the background.
You have all the architectural choices you had to make around LLM gateways and model choices, and RAG systems that have to deal with chunking, similarity measures, embedding dimensions, and so on. It has a user-experience dimension of extending an existing use case so that a human is not having to deal with a completely new workflow. And it's an agent-first example: the agent's done all the work for you and loops in a human right when you need it, at the point where that highest-value decision is required, right? A lot of these design choices come together in a very elegant use case, I would say.

[0:33:07] SF: Yeah. And I think it's a massive operational efficiency win for the company too, where you don't have to spend human cycles routing tickets around. That's probably nobody's favorite job anyway.

[0:33:20] KS: I mean, this was one of those initial use cases that we rolled out. But the amazing thing was that the take rate on something like this was instantaneous, right? And MTTR improves by 30% to 50% just overnight, using a very effective, well-thought-out use case that comes into your natural workflow, right? And I think that's why I'm bullish on a lot of these things. Because if you do take the care to have a systems approach to this problem, rather than bolting flaky agents onto an unstable environment - if you really design it, engineer it, build a platform, build these platform primitives, and take a user-experience-first approach - then you actually can leverage a lot of this. And talking about ROI, you can see instant ROI in these types of agentic use cases up front, while still maintaining all the concerns around boundaries, and choices, and so on.

[0:34:22] SF: Yeah. Also, it's easier to bound the problem when it's more use-case specific.
If you're looking for config issues, for example, then you have more of a bounded input data set to build guardrails around versus just like, "Hey, someone could ask about the weather for their upcoming family vacation, as well as ask it to do log analysis on our systems." That's hard to build test cases for. [0:34:51] KS: That's right. [0:34:52] SF: One of the things you also mentioned there earlier was this idea that you don't necessarily want to just take your raw logs and fire hose them into a model, because it's going to be hard for a model to interpret them. It's probably also going to explode your token cost to some degree because you're putting in a lot of input tokens. And there's just going to be a lot of noise in that data set. And I think where companies kind of struggle sometimes is, because you have data all over the place, the easiest thing to do is try to attach the agent to pull in the raw data from all those different systems. And that might work okay in the demo. But realistically, you don't want the raw data. You need a refined data set, essentially. The data set needs to be massaged into something that's purpose-built for the AI system. Just like you don't take a bunch of raw ingredients and call it a meal. You take your raw ingredients, your spices, your salt and pepper, you cook them together, and you make a meal. I think you need that kind of meal for your data sets as well for these AI systems to be successful. What are some of the things you're thinking about there, and how is that being reflected in some of the tools that SolarWinds is building? [0:36:07] KS: No, I agree with that. And thinking about it in terms of recipes, you need to have a system in place.
And this is where, going back to the platform choices and decisions that we had to make, especially for a data-heavy system like ours, very broadly, we think about our platform in three different planes. And this is not unusual. There's a data plane, which is where we deal with a lot of what we call MELT data: metrics, events, logs, traces, topology ingestion, and normalization. A lot of that happens in the data platform. And we have the idea of a control plane, which of course deals with all your policies and actions. And then we think about what we call a reasoning plane, which is where the intelligence and the agents tend to operate. And that separation is very intentional because, number one, it mirrors how big systems like Kubernetes, service meshes, etc., are already built. But it also goes back to what you were talking about: you're dealing with a lot of very operational concerns like distributed state, partial failures, components that should never have implicit authority, and so on and so forth. Which is why having an explicitly permissioned-by-design reasoning plane is a first-class design choice that we had to make, right? And that distinction actually matters a lot, because the aha moment for us was you want models in the reasoning plane to do what they're good at, which is really be expressive but not be dangerous, which is the key distinction, right? You want your models to do what they're really good at. By separating the concerns, you let the reasoning plane do what it's really good at: explore hypotheses, propose actions. But the execution of those actions, when things require mutation, always happens through a control interface that enforces things like least privilege, right? That's a very, very critical, I would say, design choice that we had to make.
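The separation described here can be sketched in a few lines of Python. This is a hypothetical illustration only, not SolarWinds' actual implementation: the action names, permission set, and class names are all invented for the example. The point is that the reasoning plane may propose anything, but every mutation passes through a control interface that enforces least privilege.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary; a real control plane would define these centrally.
ALLOWED_ACTIONS = {"restart_service", "scale_deployment"}

@dataclass
class ProposedAction:
    """What the reasoning plane emits: a hypothesis plus a proposed fix.
    It never mutates anything itself."""
    hypothesis: str
    action: str
    target: str

class ControlPlane:
    """Executes proposals only after an explicit, least-privilege check."""
    def __init__(self, granted: set[str]):
        self.granted = granted  # permissions granted to this specific agent

    def execute(self, proposal: ProposedAction) -> str:
        if proposal.action not in ALLOWED_ACTIONS:
            return f"rejected: unknown action '{proposal.action}'"
        if proposal.action not in self.granted:
            return f"rejected: agent lacks permission for '{proposal.action}'"
        # In a real system this would call out to an orchestrator or API.
        return f"executed {proposal.action} on {proposal.target}"

# The agent was granted only one permission, so a scale-up proposal is denied.
plane = ControlPlane(granted={"restart_service"})
print(plane.execute(ProposedAction("pod OOM-killed", "restart_service", "api-7f")))
# -> executed restart_service on api-7f
print(plane.execute(ProposedAction("under-provisioned", "scale_deployment", "api")))
# -> rejected: agent lacks permission for 'scale_deployment'
```

The design choice mirrored here is that the model's output is always data (a proposal), never an action; authority lives entirely in the control plane.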
The second one was that when we think about autonomy, don't think of it as flipping a switch. Think of it as a tiered capability. Meaning that having an autonomy level on every type of action is something that needs to be baked in and built in. You could start with something like recommend only, no mutations, right? And then you could evolve to something like execute with approval. And then later on, you can go to execute autonomously, but with certain constraints, for certain low-risk use cases. And then you can also tune it by action type, by environment, by service, by team. Having all these types of bells and whistles as you build autonomy into the reasoning plane for your system, I think, is super important. And then the other one that is very important in terms of actually realizing these things in production is to make sure that you have first-class transaction traceability for all your agentic actions, because things will go wrong. And when they go wrong, you need the ability to go back and look at the why and the what: under whose authority did an agent do certain things? These things are important. You don't realize this upfront, but these things scale very fast. I mean, the moment you give your engineering teams the ability to use these LLM gateways and agentic frameworks for their use cases, overnight you will see 20, 50, 100 different agents being built and contributed. So having these baseline things in place is super important. Then last but not least, being an observability vendor ourselves, we always keep reinforcing the fact that you have to make observability for your autonomy a first-class concern, right? And there's a lot of activity happening in forums like OpenTelemetry, etc., now to support this as well.
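As a minimal sketch of what tiered autonomy plus first-class traceability might look like, here is a hypothetical policy table keyed by action type and environment, with every decision appended to an audit log. The action names, environments, and levels are invented for illustration and are not from any SolarWinds product.

```python
from enum import Enum
from datetime import datetime, timezone

class Autonomy(Enum):
    RECOMMEND_ONLY = 1         # propose only, never mutate
    EXECUTE_WITH_APPROVAL = 2  # mutate only after human sign-off
    AUTONOMOUS = 3             # mutate within constraints, low-risk only

# Hypothetical policy: autonomy tuned per (action type, environment).
POLICY = {
    ("clear_cache", "staging"): Autonomy.AUTONOMOUS,
    ("clear_cache", "prod"): Autonomy.EXECUTE_WITH_APPROVAL,
    ("drop_table", "prod"): Autonomy.RECOMMEND_ONLY,
}

audit_log: list[dict] = []  # first-class traceability for every agentic decision

def decide(action: str, env: str, approved: bool = False) -> bool:
    # Unknown (action, env) pairs default to recommend-only: deny by default.
    level = POLICY.get((action, env), Autonomy.RECOMMEND_ONLY)
    allowed = (
        level is Autonomy.AUTONOMOUS
        or (level is Autonomy.EXECUTE_WITH_APPROVAL and approved)
    )
    # Record who/what/why so post-incident review can answer
    # "under whose authority did the agent act?"
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action, "env": env,
        "level": level.name, "approved": approved, "allowed": allowed,
    })
    return allowed
```

The key properties are the default-deny fallback for unlisted actions and the fact that the audit entry is written on every path, allowed or not, so the decision log is complete even when nothing mutated.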
These are, I would say, some of those things that cross the boundary of not just being guardrails, but also being sound platform decisions that help you achieve these things at scale. [0:40:34] SF: Yeah, I think it's really important to have kind of that running decision log from these systems. And, like you mentioned, when you turn these things on, suddenly you're going to have, I think, a huge volume of use. I made this analogy recently, and I think it kind of rings true. When banks suddenly had mobile apps, they went from a world where, in order to check the balance in your bank account, you had to go to an ATM, where maybe somebody's doing that once a month if they're really motivated. And then suddenly, it's like, I can do that 100 times a day, and their APIs are just getting hammered as a result, and they had to scale those systems massively. I think it's similar when you start to turn these things on; the usage is just going to really escalate. You have to think about that as well from a system design perspective. [0:41:18] KS: Exactly. No, I agree with that. And that's where things like token usage, compressing logs, those types of concerns come in. Having good design decisions about how you're going to distribute the design choices when you're building large-scale systems greatly helps with how you think about all of those things, you know. [0:41:39] SF: Mm-hmm. Looking sort of longer term, how do you see the human role evolving in this space? Are they going to be supervisors, sort of auditors, collaborators with agents? Something else? [0:41:53] KS: It's an excellent question. It's front and center for all of us. Take engineers, as an example. How does an engineer's role evolve in a lot of this? The way we internalize it is what we're seeing is more than just a tool shift.
A tool used to be able to do XYZ, and it's now becoming this. What we're seeing is that there's a responsibility shift. Meaning the role of an engineer or an operations person is clearly changing, or will change, from being the author of the logic to being the one who engineers the context. The shift from writing logic to engineering context is a very, very important shift. And we're starting to see that. And that's why things like coding agents are super useful, because you're already starting to do a lot of that in your day-to-day, I would say. Right? Now, in the traditional way, responsibility was super deterministic: you wrote the code, something broke, and you were responsible. But with agentic systems, that responsibility actually moves earlier and becomes more probabilistic. And engineers, just by our nature, we are super uncomfortable with that. We want everything to be very, very deterministic and want to encode logic. Learning to let go of that is a very subtle, nuanced shift that will happen across the board, right? Engineers have to overcome that discomfort, because agentic systems force you to think in terms of probabilistic engineering. And so that's a very big shift that I think we all need to go through. But I think the shifting of that responsibility from building business logic to engineering context is a very good way to frame how our responsibility shifts in this process. [0:44:01] SF: Yeah, I think that is probably the biggest fundamental shift that we're seeing: what the role of software engineering is. Historically, it really is about writing these precise rules and logic to represent the business logic. And now, essentially, your program in many ways almost becomes your data pipeline. It's that context engineering funnel into the model.
And the business logic is essentially a byproduct of the execution of the model. And some part of that is nondeterministic, which is a hard thing to wrap your mind around and get comfortable with. And then it introduces these new things, like we've been talking about, that become front and center: observability, tracing, decision logs, evals replacing unit tests, and things like that. [0:44:45] KS: And as you were talking, the other thing that occurred to me was one of the things we don't talk about enough: having a sense of emotional resilience when you're dealing with these systems, right? Because these systems can be super frustrating. There's that initial aha, like, I have a superpower, and then soon they start to get very frustrating, because they hallucinate, they make surprising choices. And we've seen that the engineers who actually succeed are the ones who are able to treat that as feedback rather than failure. Right? And it's like classic engineering; you have to go back to the core of your engineering-first mindset. And we see that, for example, with engineers and teams that are doing this better. A lot of teams will look at those initial results and say, "This is garbage. I'll just write this script myself." Rather than saying, "Hey, that's interesting. I wonder what context is missing. What constraint did I forget to encode in my prompts or in my system?" Right? Building that mindset, I think, is very critical, because agentic systems improve through iteration. They're not one-off correctness types of systems, you know? [0:46:00] SF: Yeah. I think that goes beyond just engineering as well. I think that's true of people who are using LLMs for generating marketing material and things like that. I've certainly seen my fair share of people in that world who are more resistant to it. They would put one prompt in, and they're like, "Ah, it didn't work for me.
I didn't get the perfect press release on the first attempt." And they don't know how to take that signal as feedback and figure out how to provide the right context. [0:46:25] KS: Right. Or you end up dealing with AI slop, right? The difference is that different systems have different challenges. Writing some content for SEO is one thing, where you can probably get away with not getting the exact thing right. But dealing with mission-critical systems is a whole other thing. There's a big spectrum there. [0:46:47] SF: Yeah. [0:46:47] KS: And I think going back to how we used to think about even the levels of engineers, that will change a lot, right? It's no longer about you knowing the syntax, the style guides, and the basic principles of writing software effectively; it's more about how you're able to architect systems and how you're able to work the systems to do what you want to do, which is really to build great products that people love to use, you know. [0:47:13] SF: Yeah, exactly. I think there's going to be a shift in terms of where we've historically rewarded really deep expertise in a particular domain. And now, at least for certain classes of problems, I think you can get away with having more of a generalist, wide view of how to build things, because the LLMs are so good at having that deep expertise in the syntax of Rust or whatever programming language you happen to be working in. Actually, one thing along these lines: what are your thoughts on this perceived risk of what this does to the junior developer? Because we are working at a higher level of abstraction now. And I think that initially, with a lot of the copilots, the perception was these are great for junior devs; this kind of levels them up a little bit.
But now, I think, with the agentic systems and this change to context engineering, and really needing to understand how the pieces of the puzzle go together, that's something that generally more senior people in the organization are good at, and they're now getting a lot of value out of these models. Of course, the risk, the challenge, is how do you get the junior people to senior people if suddenly the junior person isn't getting sort of that hands-on training? And I guess what are your thoughts on that? And is that a problem that people should be concerned about, or are we just kind - [0:48:28] KS: And this may be one way. What we're seeing is, of course, this is all super early. And we're learning a lot as we roll out these systems and so on. What we're seeing is both are true. Meaning the junior developers tend to be a lot better at adapting and learning how these systems work, and at being able to get the system to do what you want. They get really good at that. And the more senior devs are better at understanding, "Hey, agentic systems don't replace your basic engineering discipline." They actually expose and amplify wherever a mess already exists. Teams, for example, that have clear ownership, strong data, and good operational fundamentals are able to use agents as force multipliers. And teams without those foundations tend to struggle and tend to experience agents as unpredictable and challenging, and so on. One of the things that we have done internally, which is actually a pretty good practice, is to build these AI communities and have a very, very active, vibrant community where you have senior devs and junior devs actively engaged in sharing knowledge. Because there's a lot of learning happening on a daily basis, and being able to share that knowledge and articulate, "Hey, here's what worked. Here's what didn't work.
And by the way, as we're going through this, here are some architectural foundations and best practices that we need to put in place, on which we can build greater production systems and so on." Doing some of these things greatly tends to help, is what I've seen. [0:50:08] SF: Yeah, absolutely. I think one of the misses companies sometimes make is that they buy these tools and then think that overnight there's going to be a 100% efficiency gain from it, and they forget, essentially, that it's a new skill set that you have to give people time to train up on. And that's a really important part of the process. [0:50:25] KS: The other thing which is interesting that you mentioned, because this is the other challenge for most companies that are adopting these, is having to justify the ROI of spending on a lot of these tools. And that's a common challenge. And the way we try to balance that is to focus on the cases where ROI is very quickly, shall we say, well-established, number one, as a system. For example, if you take customer service, they're highly metrics-driven. They know what works for them and what doesn't work for them. So you can actually measure your ticket deflection rate, or your time to first response, or your time to resolution. These things can actually be measured. Engineering productivity, on the other hand, as you know, is not at all well-defined. There's a lot of - we talk about - [0:51:16] SF: Number of lines of code written. [0:51:18] KS: Yeah. Cycle times, and commit velocities, and all of that, which is good. I mean, we track some of that early on. But what we are really poor at is being able to tie a lot of that to business outcomes. Again, going back there. It's early, but we had to do a lot of work, for example, to be able to quantify certain metrics in terms of whether your lead time is improving, whether your change failure rate is decreasing, etc. A lot of those types of things.
Having that in place also tends to help, right? You know that you're trying something and you're learning. But over a period of time, you're starting to get those signals or green shoots of, "Hey, this is actually helping me do things better or work better." [0:52:05] SF: Yeah. I mean, I think the reality is it doesn't matter how powerful the models are. You still need to do the work, essentially, around putting the right metrics in place to measure what success looks like. I think sales, similar to customer service, is another place. A sales rep, they're going to use whatever tool allows them to hit quota more often. If you can unlock that, there's clear ROI for the business. Well, I want to thank you so much for being here. I really enjoyed the conversation. And the last thing here, is there anything else you'd like to share? [0:52:35] KS: No, it's just thank you again for the opportunity. And we're super excited about where all of this is going. And we're taking a very comprehensive approach to how we look at this. And we're definitely excited about moving the needle, so to speak, in terms of how a lot of these things continue to add value for us even as we adapt and learn. Thanks again for the opportunity. [0:52:58] SF: Great. Well, cheers. Thanks. [0:53:00] KS: Cheers. Thanks, Sean. [END]