EPISODE 1910 [INTRODUCTION] [0:00:00] ANNOUNCER: LLM-powered systems continue to move steadily into production, but this process is presenting teams with challenges that traditional software practices don't commonly encounter. Models and agents are non-deterministic systems, which makes it difficult to test changes, reason about failures, and confidently ship updates. This has created the need for new evaluation tooling designed specifically around the properties of LLMs. Comet is a platform with roots in MLOps that has evolved to support teams building modern LLM-powered applications. The company recently launched Opik, which is an open-source platform focused on evaluation, optimization, and observability for LLM agents. Together, the tools aim to bring the rigor of traditional engineering and ML workflows to the rapidly evolving world of agent-based systems by treating prompts, tools, and workflows as optimizable components that can be evaluated and improved over time. Gideon Mendels is the co-founder and CEO of Comet. He previously worked at Google on hate speech and deception detection, and he founded Groupwize, which trained and deployed NLP models processing billions of chats. In this episode, Gideon joins Kevin Ball to discuss how agent development sits between software engineering and ML, why evals are the missing foundation for most AI teams, prompt optimization as a search problem, and the future of continuously improving agents in production. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent.Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [INTERVIEW] [0:02:09] KB: Gideon, welcome to the show. [0:02:11] GM: Yeah, Kevin, thanks for having me.
I'm a big fan of the podcast, so I was looking forward to this one. [0:02:16] KB: Yeah, I'm excited. Well, let's start with you. Can you give a little bit about your background, how you ended up at Comet, and then some of what Comet is about? [0:02:24] GM: Absolutely. So, originally I started as a software engineer and moved throughout the stack in the first few years. And then about 10, 12 years ago, I shifted to working on machine learning. I was a grad student, and then I went to Google. Funny enough, I worked on language models. This is 2016, so they weren't large nor very good, right? It's pre-transformer days, unfortunately. LSTMs, if anyone still remembers those. And as someone coming from a software engineering background, where we take a lot of pride in how we build software - obviously, a lot of that is changing right now, and I'm sure we'll talk about it - but a lot of pride in how we build software and the tools that we use, joining an ML team with amazing, very smart and talented people, but seeing how the whole thing is a little bit like the wild west, was very, very challenging. We worked on hate speech detection on YouTube comments. If you remember the YouTube comment section back in the day, I think someone called it the worst place on the internet. We had a hard time getting these models to work. And from that point, I was like, "Okay. Look, we had data, we had compute, we had smart people, and we still couldn't do it. What is it?" And it's not considered necessarily a hard ML problem. And I realized it's really about the process of how you drive these projects. And I called my co-founder here at Comet, who I'd worked with on another startup building ML models, and I was like, "Hey, remember you were making fun of my ML workflows and how everything was stitched together? I'm at Google, and it's exactly the same thing, just at massive scale." That's really how we got started. This is 2017, '18.
And we started with specifically what my team and I needed back then at Google, which was around model experiment tracking. You train a bunch of these models. There are all these moving pieces - hyperparameters, data set versioning, all these results - and it's really hard to know that you're making progress and to understand what you're doing next. Collaborating is completely out of the question because no one has access to anything. So we started with that. And then over the years, we expanded that side of the platform: data set versioning, model registries, model monitoring, and such. And then about two years ago or so, obviously, there was quite a big shift in the industry, and a lot of our customers and users started telling us, "Hey, for this use case, we're not going to train a model anymore. We're going to try to build it on top of the OpenAI API. But it's still very similar because we're testing all this different stuff, and we still want to use Comet for that. How can you help us?" At first, we started to add some features to help with that. But eventually we realized, "Okay, there's a lot of similarities, but there are also enough differences not to try to bake it into a slightly different workflow." Then in September 2024, we launched Opik, which is our open-source product focused on teams building agents and any type of LLM-powered application, really focusing on the end-to-end - from early dev through the deployment process to production - and specifically things around observability, evaluation, and automatic optimization of these agents. But yeah, it's been a fun ride. We power some amazing AI teams at Uber, Netflix, Etsy, Shopify, Autodesk. We have roughly 150,000 engineers around the world using our products. Great adoption on the open-source front. Yeah, it's been a fun ride so far. [0:05:46] KB: Well, let's dig into a few pieces of that. I definitely remember MLOps has been a term for a while.
There's just all this operational stuff that is hard and different. And as you highlight, now in the LLM world, everybody gets to play with this. In fact, everybody has to be in this space and start to deal with non-determinism, data flows, and all of this different stuff, but it's also a little bit different. Can we dial in? What are the particular characteristics of development with agents that were different enough that you said, "Okay, this is a whole new product. This is a thing we need to build"? [0:06:22] GM: Yeah. Yeah, that's a great point. I mean, when you're training a model, first of all, you typically have some kind of a training data set. And the algorithms are mostly a commodity, right? Everyone is mostly using the same algorithm. You do spend some time maybe changing some stuff, but it's mostly out of the box. And the majority of the time is spent figuring out what the right data set is, what variations you want to augment it with, the model hyperparameters, and how to formulate the business problem as a machine learning problem, which is not trivial in many cases. And then you get these massive binaries on the other side, the model weights. And how do you do retraining? At that level, the majority of our users and customers consume the LLM as an API. It could be that they're deploying an open-source one themselves, but the majority of them use the commercial ones. So you're no longer in control over the weights whatsoever. You have a few production hyperparameters - temperature, stuff like that - that you can play with. But it's mostly static, right? Sure, OpenAI will release a new version every quarter or so, but it's mostly static. And what you do control as a builder, whether it's a simple LLM workflow or a full-on agent, are slightly different things, right? You control the system prompt, which is the equivalent of the weights if you train the model. You control tool calls, context, vector DBs, all these.
But the similarity is - the reality is you have all these variables: system prompt, configuration of your chunking, the tool call descriptions. You've got all these variables, and you're trying to find a combination that gives you the best results. From that perspective, it's quite similar. In the day-to-day it tends to look quite different. We had a lot of experience in this workflow, but the SDKs, the UI, all those things tend to look quite different. That was the main motivation to separate the two. [0:08:16] KB: That makes sense. Looking now at that development problem, right? You have these different pieces that you can coordinate. And definitely one of the things that I see teams struggling to come up to speed with, or a lot of engineers at least, is grappling with non-determinism. If you're coming from a machine learning background, that may be old hat, right? Machine learning has always been statistical and data-driven. But if you're coming from a traditional software background, which now more and more people are, and still having to deal with this stuff, non-determinism is weird. That's scary. We used to try to get rid of all of that. What changes in the process need to happen? [0:08:52] GM: You've got multiple levels of non-determinism. Maybe some of them are not officially non-determinism, right? Obviously, you've got the LLM. Technically, you're supposed to be able to get a deterministic output if you set the temperature to zero. But there's actually an interesting write-up on why they're not deterministic, and it's actually because of how they're deployed with a mixture of experts. Because at the end of the day, it's a combination of matrix multiplications. It should be deterministic if the input is the same. But I'm digressing. So, you've got that. I think to your point, what is a slightly different concept for software engineers and much more familiar for data scientists or people training models is that in software we write these unit tests.
And at some point, we have pretty good coverage. Of course, something could happen in production, another edge case, and we'll add a unit test. But you have this level of confidence that this new version you built is not going to break everything before you deploy. And with these agents, how do you translate that concept? Because, first of all, in the small details, you can't do string matching on stuff, because you can have semantically the same output but completely different string content, right? You have to think about how to compare or do assertions between two of these outputs. That's a hard problem. But there's also this concept of generalization, which is very common in ML: how do I write a test suite that gives me enough coverage and variance so I have confidence that this thing is working? I think that's a concept that's relatively new for a lot of people who haven't trained models before. And the reality is that building these agents is somewhere in between software engineering and ML. It's definitely not pure ML. It's definitely not pure software engineering. But there are a lot of learnings from both of these paradigms that you can bring together and actually ship stuff that works. That's the reality. It's hard to get this stuff right. But when you get it right, it's magical. I think we're all experiencing that on a day-to-day basis. [0:10:55] KB: Yeah. So let's talk about some of those primitives. And one in particular that comes up a lot, and it's talked about a lot, but then also when I talk to teams, everybody's like, "Yeah, we need that, but we're not really sure what it means," is evals. How do you think about evals? [0:11:10] GM: Evals, if you put in the work, are extremely powerful and useful, and they give you that level of certainty. Now, I completely agree with you that most teams don't do it, right? And I have a very strong thesis on why that's the case.
But evals are really about mapping this concept of a test suite to your application. Okay, in the most basic form, I have a list of inputs and a list of expected outputs. Now, it's not a simple assertion. You typically want some way to measure the distance between the agent's answer and the expected output. You could do it with deterministic metrics - BLEU score, stuff like that - or you could do it with LLM-as-a-Judge. But at the end, you get a score. Hey, this passed, this failed. This is your overall score across the board. I think they're extremely powerful. From our customer base, the ones that put in the work and invested in adding those are definitely among the more successful ones in getting these agents to work well in production. And I think the reason not that many people do them is that it's extremely hard to construct these data sets. It's painfully hard for various reasons. First of all, often the person building the application is not an expert on what the right answer is. If I'm a software engineer and I'm building a tool for the sales team or the HR team, I do not know what the right PTO policy in Norway is for an employee who's been at the company for two and a half years, right? You have a domain knowledge gap, which is hard. In addition, if you're doing this kind of end-to-end testing completely agnostic to what happens in the process, which is one way of doing it, that's maybe fine. But if you're trying to test something that takes into account which tool it should have called, constructing these graph traversals and comparing them - it's really hard work. It's extremely challenging. And look, there are a lot of ways we and others help with that. We have UIs for subject matter experts to annotate stuff and add things. There are attempts, which I'm not a big believer in, to solve this with synthetic data. But I think that the solution is a product solution, or a product approach.
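In code, the basic shape of an eval suite as described here - inputs, expected outputs, a distance metric, an aggregate score - might be sketched like this. It's a toy illustration: `run_agent` and `judge` are hypothetical stand-ins, where a real suite would call your actual agent and might score with BLEU or an LLM-as-a-Judge.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected: str

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in for a real agent call.
    canned = {
        "What is the PTO policy in Norway?": "25 days per year",
        "What is 2 + 2?": "4",
    }
    return canned.get(prompt, "I don't know")

def judge(answer: str, expected: str) -> float:
    # Toy distance metric: token overlap with the expected output.
    # A real suite might use BLEU or an LLM-as-a-Judge here instead.
    a, e = set(answer.lower().split()), set(expected.lower().split())
    return len(a & e) / max(len(e), 1)

def run_suite(cases: list[EvalCase], threshold: float = 0.5) -> float:
    # Run every case, score it, and report the overall pass rate.
    passed = sum(judge(run_agent(c.input), c.expected) >= threshold for c in cases)
    return passed / len(cases)

cases = [
    EvalCase("What is the PTO policy in Norway?", "25 days per year"),
    EvalCase("What is 2 + 2?", "4"),
]
print(run_suite(cases))  # 1.0: both canned answers pass
```

The key difference from a unit test is the `judge` function: instead of asserting string equality, you score the distance between the answer and the expectation and pick a pass threshold.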
The reality is, whether you build evals or not, you put the stuff in production, and at some point, someone comes complaining, right? "Hey, this user tried to do this, and it gave a completely wrong answer," or "this blew up," and so on. And then, as the person owning this, whether you're an engineer or a PM, you go ahead and try to fix it, because that's what we do. And what we're spending a lot of time on is: how can I take this activity that you're going to do anyway and use it to bootstrap the eval data set? I'll give you a specific example. Let's say we take that example. It's an HR chatbot. The user asks about PTO policy in Norway. There's some context injected with their tenure and all those kinds of things. HR comes and says, "Hey, that's the wrong answer." So you would go in there, and you would write it in free text: "If the employee is located in this country and has this kind of tenure, you should look at this document," or "the answer should be this." People will do that anyway and might go and change the system prompt. But what we're trying to do is actually build this test suite based on that. Now you have a new test sample with this input and this output, with an assertion that says the answer should be X, or it should be greater than a certain number of years. And then the next time you try to iterate on your agent, whether you're changing the LLM, the system prompt, or the tool calls, you're going to run through this eval suite automatically. It's a product solution, right? It's not an algorithmic solution to this problem. From speaking to users and customers, I think that's the right solution. And we're spending a lot of time really nailing that workflow, whether it's UI-first or terminal-first, and there are all these questions. Because I do believe in evals. They're very powerful.
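The regression-to-eval workflow described above could be sketched roughly as follows. The trace and annotation shapes here are invented for illustration - they are not a real Opik schema:

```python
# A production regression: the input, injected context, and the bad answer.
regression = {
    "input": "How much PTO do I get?",
    "context": {"country": "Norway", "tenure_years": 2.5},
    "bad_output": "10 days",
}

# The subject-matter expert's free-text rule, stored next to a
# machine-checkable assertion over the agent's answer.
annotation = {
    "rule": "Employees in Norway with over 2 years of tenure get at least 25 days.",
    "assertion": lambda answer: "25" in answer,
}

def to_eval_case(reg, ann):
    # Fold the regression and the expert correction into a new test sample.
    return {
        "input": reg["input"],
        "context": reg["context"],
        "check": ann["assertion"],
        "note": ann["rule"],
    }

case = to_eval_case(regression, annotation)
print(case["check"]("You are entitled to 25 days of PTO."))  # True
print(case["check"](regression["bad_output"]))               # False
```

The point is that the SME's correction becomes a durable test sample, so the next system-prompt or tool change is checked against it automatically.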
[0:15:07] KB: One of the things that I find myself wondering here is - and you said something about, "Oh, are you doing this end-to-end, or are you doing sub-pieces and diving in?" And what you described in terms of a product solution potentially supports either. But how do you think about the different levels at which an eval makes sense? Coming back to the software world, are there rough equivalents to unit tests, versus integration tests, versus system tests, or what have you? [0:15:35] GM: Yeah, absolutely. What I typically recommend customers and teams do is start with the end-to-end, the system test, right? You're not able to test every single scenario that way, because there are side effects and all those kinds of things, but that's typically the easiest data set to compile, and it does give you tons of value. Another thing we're seeing a lot of people do, when you start adding these tools: a very, very common failure mode is that the agent is not calling the right tool. Even if your tool is perfect, if the agent isn't calling the right one, it's not going to work. So we see people build data sets that are essentially a classification problem. Given a certain context - user context, previous message history, anything that came from previous tool calls, basically the entire graph execution up to this point - did it call the right tool? That's relatively easy if you have the right product to help you. You don't have to manually create the graph context. That's relatively easy to generate, and it also provides relatively good results. And then, of course, a certain tool could have an LLM call in it, or it could be a complete sub-agent. Often, you want to test that in isolation as well. I think it does correlate to the software engineering testing concepts that we're familiar with, at a high level at least. [0:16:53] KB: So far we've kind of talked about using it. I think your example was, "Oh, this edge case came in.
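The tool-selection eval described here is effectively a classification accuracy metric. A minimal sketch, with a hypothetical `choose_tool` router standing in for the agent's actual tool-call decision:

```python
def choose_tool(context: str) -> str:
    # Hypothetical router; in a real agent, this decision is the LLM's tool call.
    if "PTO" in context or "policy" in context:
        return "search_hr_docs"
    if "salary" in context:
        return "payroll_lookup"
    return "fallback_chat"

# Each sample captures the graph execution up to the decision point and
# the tool an expert says should have been called.
dataset = [
    {"context": "User asks about PTO policy in Norway", "expected_tool": "search_hr_docs"},
    {"context": "User asks what their salary is", "expected_tool": "payroll_lookup"},
    {"context": "User says hello", "expected_tool": "fallback_chat"},
]

correct = sum(choose_tool(d["context"]) == d["expected_tool"] for d in dataset)
accuracy = correct / len(dataset)
print(accuracy)  # 1.0 on this toy dataset
```

Because the label is just a tool name, this kind of data set is much cheaper to build than full end-to-end expected outputs, which matches the point made above.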
I got a complaint or whatever." My expert looks at this and says, "You should do this." And to me that maps - and I'm going to keep mapping to software engineering - that's like a regression test, right? We had a failure. We're introducing a test to make sure that we don't regress. And then we go and introduce a fix. What other types of use do you see? Is there an equivalent to, for example, test-driven development? Is there an equivalent to other types of process-level utilization of tests for these evals? [0:17:26] GM: The approach I talked about for introducing these - yeah, I agree, regression tests - is mostly a way to try to bootstrap that data set in the most user-friendly way, without requiring you to do a lot of work. But when you put together a true evaluation data set with a subject matter expert, not necessarily based on production regressions, then at that point you can do something very similar to test-driven development, right? You could one-shot the system prompt - just write two sentences, put something together really quickly, which is easy today - test it on a data set, and then: "Okay, it's failing on these samples and these samples. Maybe I need to add another tool." So you do see that. Transparently, I don't see it a lot. I think we're going that way. I'm not going to say that I see a lot of teams do that. But this space is moving so fast. [0:18:13] KB: It's interesting because it reminds me of - are you familiar with DSPy? [0:18:16] GM: Yeah, of course. [0:18:17] KB: DSPy is a very researcher/developer-centric version of this: I'm going to describe the intent of what this thing should do, and go. It's extremely not productized or user-friendly. And I think, at least the last time I looked, it was very optimized for single-inference types of tasks. When you're talking about this at the agent level, what does that kind of loop look like?
And how do you think about - for example, let's maybe walk through your example. I have a particular case. My agent is complex. It has different tools. It's got a sub-agent it can call. And something at the very top level - my eval is failing. How do I step through and figure out the right places to fix or debug this? Does that also involve evals or tooling? How do you see people doing this? [0:19:06] GM: Yeah. The common workflow is you identify a failure and then you use the UI, like the trace view, which shows you a very clear breakdown of what happened at every step - every tool call, every LLM call, every function call internally, right? Typically what we see is people just go through that, and then, "Okay, this is the step where it failed," and why. For example, if the output from the RAG database or the vector DB was bad, then clearly you're not going to expect the right answer. If the output was good, but the LLM still provided a bad answer - it really allows you to pinpoint where the issue is. And that's super powerful, super powerful. But I think, where we are as an industry today, we're going to look back in a couple - I don't know in this space. Maybe not a couple of years. Maybe a couple of months. But we're going to look back and say, "Hey, it didn't make sense that we did that so manually." If you go back to the early days of neural nets, people manually tried to set the weights on the neurons. This is before SGD and that kind of stuff. And now you look at it and it's like, "Huh?" I mean, obviously, they had an intuition that the architecture could work. But the reality is, everything we're talking about is a search problem. You have a system with a bunch of variables, and parameters, and configurations. And if you have an eval test suite, you have a certain score of how well you're doing against it. And we are searching for, hopefully, the global maximum for this search problem, right?
Let's look at it as a search problem, as an optimization problem. This is something we're spending a lot of time thinking about. And we shipped a bunch of stuff on the product. And I truly feel that we're not completely there yet. But I truly feel that's how things are going to be in the future, where we're going to stop doing this manual inspection of a trace to try to figure out where it failed. We'll have this flywheel of an eval suite that continues to grow over time, and then a nightly optimization process that tries to find a new global maximum. [0:21:10] KB: Let's talk about what that might mean, right? Because I think we're now moving mental models in a lot of ways. We had been taking this mental model: okay, this is a unit test. What do I do if a unit test is failing? I'm going to debug, which is kind of a manual process, or maybe a coding-agent-assisted process, but it's still pinpointed - following a thread down based on a failure and trying to correct it. And what you're describing now is something quite different from that, much more algorithmic and broad-facing. Let's flesh out what you mean when you say, "Hey, this is an optimization problem." I completely agree, right? LLMs, because of their fuzzy nature, lend themselves more to this, versus unit tests, which are, by nature, pass or fail. If we know what the variables are that we control - which in agents is typically the system prompt, tool descriptions, if you have a RAG step, the chunking strategy, all that kind of stuff, how many values are returned from the vector DB - you've got all these values. Let's call them hyperparameters, okay? And you have a score on the other side of it. With this tuple of values for these variables, I'm getting 80%. And I'm not getting into how you compute that, but we can get to that in a while. The most naive approach is: let's brute-force search all the possible values to try to get the best result.
I mean, putting compute constraints aside, you would eventually find the best result. [0:22:42] KB: The space of possible system prompts is very large. [0:22:46] GM: Yeah. Yeah. Yeah, it's not actually feasible. But if you think about it as a problem, you would eventually find it. And then you walk through what it typically looks like. There's random search: the user defines, "Here are the ranges that I think make sense," and you randomly sample. And then you've got methods from ML like Bayesian search, which are a little bit more informed about how to search that space. But then, with these LLMs as part of the search or optimization algorithms, you can do things that are a lot more robust and a lot more compute-efficient than searching through all the possibilities - especially for things like system prompts, where even random search is going to be quite challenging. There's obviously a lot of work on this topic. The leading lab, I would probably say, is Stanford. GEPA is a very well-known algorithm in this space. We've built a couple of our own as well. But the idea is: okay, I have a system. I have the configuration, that tuple. I ran through all the inputs from my evaluation suite. I got a result. Now I need to come up with new candidates for these variables. Let's talk about the system prompt. You can use an LLM in that process to look at the ones that failed. So you say, "Okay, here are the 10 samples that failed in my data set. Let's look at the tree, the trace of what happened, where it failed, common failure modes. And let's suggest a new system prompt candidate." It's a suggestion. It's a candidate that we think might fix it. You put this candidate in, you rerun the whole process. Am I doing better? Am I doing worse? And so on and so on. Now, these algorithms get more sophisticated.
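The candidate-generation loop just described can be sketched as a simple hill climb. This is a toy illustration: `score_prompt` and `propose` are stand-ins for running the full eval suite and for an LLM that reads failing traces and suggests a patched prompt.

```python
def score_prompt(prompt: str, suite: list[str]) -> float:
    # Stand-in for running the full eval suite: here the "score" is simply
    # the fraction of requirements the prompt mentions.
    return sum(1.0 for req in suite if req in prompt) / len(suite)

def propose(prompt: str, failures: list[str]) -> str:
    # Stand-in for an LLM that reads the failing traces and patches the prompt.
    return prompt + " " + " ".join(failures)

suite = ["Return only JSON.", "Match the schema exactly."]

best_prompt = "You are a helpful assistant."
best_score = score_prompt(best_prompt, suite)

for _ in range(3):  # a few optimization iterations
    failures = [req for req in suite if req not in best_prompt]
    if not failures:
        break  # nothing left to fix
    candidate = propose(best_prompt, failures)
    candidate_score = score_prompt(candidate, suite)
    if candidate_score > best_score:  # keep the candidate only if it improves
        best_prompt, best_score = candidate, candidate_score

print(best_score)  # 1.0 once both requirements appear in the prompt
```

Algorithms like GEPA elaborate on this skeleton by keeping a population of candidates and merging the best ones, rather than greedily keeping a single winner.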
GEPA is doing this evolutionary approach, suggesting multiple candidates and trying to merge the best ones. We can get into it if you want. But the reality is, if you have a good eval data set, they work extremely well. It's really cool to see. And the best part - this is my favorite part - especially with LLMs, but generally with ML, you need at least thousands of samples. With these algorithms, even 20, 30 samples will take you a very, very long way. It's extremely powerful. I'm happy to give you - we've run some public - [0:25:07] KB: I would love that, actually. Yeah, let's talk about some examples of that and mapping to - because my brain immediately goes to, "Okay, we have an LLM in the loop. We're running LLM evals, and all this." Those token costs are going to rack up. What's the order of magnitude we're talking about there? What's the length of time this tends to take for a loop like this to converge, and all those different pieces? [0:25:27] GM: Yeah. I'll give you one - let's call it a relatively simple, but real-life example, right? Not a toy one. LangChain, obviously a very popular library, has baked into the library a system prompt - a prompt that is used when you're using an LLM that doesn't have structured outputs, which is a lot of the smaller open-source ones. And it basically asks the LLM, "Hey, return your response adhering to a specific JSON schema," right? Because you need to parse it and do stuff afterwards. And we just saw from our user base how often it was failing, so we said, "Okay, let's try to look at this as an optimization problem." We created a simple data set of JSON schemas, essentially, and then a certain output with random values that adheres to each schema. And then we ran it through the original system prompt they had in LangChain. And I think we got about a 12% pass rate - we're just making sure that the schema matches. It's a simple deterministic metric. Then we ran it through the optimizer.
And within two iterations, we got it from 12% to 96%. It was that powerful. Now, you're talking LLM cost - it was less than a dollar. It was a frontier model, but it was less than a dollar. Very cheap. We opened a PR to the LangChain team. It got committed the same day. They were like, "Thank you so much. This was painful." It works very, very well. We have some customers using it for softer evaluators as well. But yeah, it truly depends on how much you believe the metric on your evaluation data set. [0:27:05] KB: Fascinating. Yeah, for structured data, your evaluation can be deterministic, very regular. I think LLM-as-a-Judge is where I start to wonder. It's like, "Okay, I have an LLM-as-a-Judge. I have a subject matter expert that's inserted a bunch of things about why and how they would evaluate, etc." One, it's a little fuzzier. Two, it's probably a little more fragile. And three, it's expensive to run a lot of times. [0:27:31] GM: Yeah. Yeah. I don't disagree, right? In no way am I saying this is perfect. But I'm comparing it to the status quo, which is - [0:27:39] KB: Oh, for sure. [0:27:40] GM: Yeah. And I think that's a big reason why we're not seeing more of these agents out there, right? There are obviously a few that are extremely popular and common. But I would argue you'd expect to see more out there. It's because of the status quo, which is this kind of vibe checking, which is essentially, "Oh, I'll test a few inputs. I'll see what it looks like - it looks good. Let's put it in production," right? And one argument I always think about when people talk about cost is, if you look at the cost-per-token curve over the last three years - [0:28:12] KB: It does keep dropping tremendously. Yeah. [0:28:13] GM: Tremendously, right? But yeah, it does incur additional costs. There's no argument about that.
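The deterministic metric in this LangChain example - does the raw output parse as JSON and match the schema - could look roughly like this hand-rolled check. A real harness might use the `jsonschema` package instead; the schema format here is a simplification for illustration.

```python
import json

# A simple "schema": required keys mapped to expected Python types.
schema = {"name": str, "age": int}

def adheres(raw: str, schema: dict) -> bool:
    # Pass only if the raw output parses as JSON and matches the schema.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict)
            and set(obj) == set(schema)
            and all(isinstance(obj[key], typ) for key, typ in schema.items()))

outputs = [
    '{"name": "Ada", "age": 36}',            # valid
    'Sure! Here is the JSON: {"name": ...',  # chatty preamble, fails to parse
    '{"name": "Ada"}',                       # missing a required key
]

pass_rate = sum(adheres(out, schema) for out in outputs) / len(outputs)
print(pass_rate)  # 1 of 3 outputs passes
```

Because the check is deterministic and cheap, the optimizer can rerun the whole suite on every candidate prompt without LLM-judge costs on the scoring side.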
[0:28:19] KB: Let's maybe talk a little bit about the life cycle around these, then. Because I think one-off or one-time costs are very different from recurring everyday costs, things like that. Let's say you're going all in. You're doing agent optimization, and you have your eval suite. One, how frequently are you building and updating that suite? Two, what are the situations in which you're running it? And three, how often do you do this optimization pass over all of your different agents, etc.? [0:28:48] GM: So I think there's where we are today as an industry, and where I think we're going to be in 6, 12 months. What we see from our user and customer base is that people don't run these optimizations every day, or every week, or maybe not even every month, right? They spend the time to build the evaluation suite, and then they run it in dev, maybe a couple of times, to get to a version that's good enough. But that goes back to the challenge that their evaluation suite is not growing as much as you'd expect, which is the bottleneck for everything here. I think that's where we are today. I do think that if you fast-forward 6, 12, 18 months, what it should look like is a lot more like what ML teams or pipelines used to look like. All of my ML customers retrain their models on a daily, weekly, or monthly basis, right? Obviously, some of these algorithms are completely online. They update as you use them. I think it will look something like that. We have an agent in production. We are getting feedback, whether it's from user feedback or LLM-as-a-Judge evaluators, to flag different edge cases, behaviors, and so on. You either have or don't have a human in the loop to improve that data. It goes into the evaluation data set. And then, once a day, once a week, you reoptimize - retrain - the agent. Because I completely get the cost side. But the fundamental equation with ML is more data equals a better model.
But with these agents, you ship this thing, and whether you have one user or a billion, it's stale. It will not get better unless you do something about it. I think if we are able to close that loop, where we can learn from production data, the sky is the limit. I'm very excited. I mean, I feel very fortunate. To me, it's one of the most exciting problems out there. [0:30:44] KB: So let's dig in, then. One, is anybody doing this today in terms of being able to learn live? I think of this as in-vivo learning, right? It's happening in the flow of what's going on. Are any of your customers doing that? How does that work? Or what are the barriers that are keeping people from doing it? [0:31:00] GM: With this type of optimization - there's also the RL stuff that people obviously do, on-policy and things like that - but I don't think any of my customers are at a point where it's completely autonomous in the loop yet. I would also say, we just launched this optimization open-source SDK about five or six months ago. It's early days for this entire category. Sorry, remind me of the second part of your question. [0:31:27] KB: Oh, what are the barriers? I mean, I can think of a few. Immediately, I go to, "Okay, what about privacy?" Where is this eval suite living? If it's using production data, is there cross-contamination? How do you deal with that? But what are the barriers that you're seeing? [0:31:43] GM: Yeah, there are a lot of them, right? First of all, what we talked about: when you add stuff to this evaluation suite, it's the ground truth. It needs to be right. You can't just put a bunch of noise or garbage in there. Is it human in the loop? Do you have a really good way to do that? That's one fundamental issue. The second one is the operational side of it, right? Where do you store these data sets? How do you split between different users, customers, deployments?
All these operational problems, which are real problems. And then I think the algorithms still have a lot more room to improve. All this work is super nascent, super exciting, but it's very early days. And then the other part is, from a deployment model, how do you test this stuff? For example, one way people do it is they bake the system prompt into their repo, in some configuration or whatever, a YAML file, or even in the source code. They manage it with classic version control. You've got a new prompt you want to test, you redeploy the entire application with the new prompt. That's one way to do it. But if you think about it - we talked about evaluation suite testing, but a lot of teams also A/B test these things in production, or do canary deployments. But it's such a small component. Does it typically require a full-on redeployment? And when you think about A/B testing platforms, you inject the framework into your code. If you're in this test group, do this; otherwise, do that. Or display this button, feature flags, all that kind of stuff. One of the approaches we're looking into is like, "Hey, maybe when your application spins up, it fetches the prod prompt from our platform, for example." Right? And then when you have a new candidate that you want to test, because you ran an eval suite or you manually went and changed something, the next time it makes an inference, it will just get the new prompt. And that suddenly opens up this whole new world of like, "Hey, I can correlate production performance to my eval suite performance." In production, you can often look at the downstream actions. Did they actually book the flight, right? For example. It opens up a lot of interesting use cases. Although I think if you tell a lot of software engineers, "Hey, your application is dependent on a third party to fetch a critical configuration value" - [0:34:06] KB: Well, it's a fascinating question, right?
Because prompts in some ways - a mental model I sometimes use is that LLMs are like this big virtual machine, and you're throwing in a combination of code and data that is all text. Your system prompt is code. Any user input is code, but it's also kind of data, right? It's also like, "Hey, we're using this to shape it." Anything coming back from a tool call is some combination of this is data, or this is code, this is instructions to this virtual machine to keep going. How do you manage it? Do you manage it like code in a repository? I can make arguments for that. It's important to have version control, and this is shaping the core behavior of my system. Or do you manage it like data? Because it also has a different life cycle, and maybe you're optimizing it, or changing it, or tweaking it, or all these other things. It's a really interesting question. [0:35:00] GM: Absolutely. And the coolest - and the scariest - part of this is that usually, when you confront these problems, there are some teams or companies that have been doing it for a few years, right? They're so much ahead that you can ask them or read their blog posts, right? Like, "Okay, they might have seen this." But here, it's so new that even the bleeding-edge teams are figuring stuff out as they go. [0:35:26] KB: Totally. I was having this conversation with an engineer. And he's like, "Isn't there an established solution for this?" And I'm like, "Are you kidding? This whole field is a year old. There's no established anything." [0:35:35] GM: Absolutely. We've seen a little bit of it with the MLOps world. But again, there we had these early customers that had been doing ML in production for 10 years. We leaned a lot on them. But it's funny, a lot of the solutions came from some weird constraint. I'll give you an example, right? When you think about how you manage model binaries: they have their versions, they're tied to a production version because the inference code needs to match.
You've got all these moving pieces. And the biggest limiting factor was Git LFS, which wasn't good or fast enough, right? So model binaries never really made it into repos. Because I would argue, they should probably live in a repo. [0:36:17] KB: I mean, there's coupling. This is another thing, right? Some prompts, you have code that is handling them, that is parsing their output, that is not part of the prompt. So there's coupling between your code and your prompt as code. [0:36:27] GM: Yeah. And most teams on that front use a model registry, which is a wrapper around an object store and does exactly the coupling that you're referring to. But it also allows you to do these hot switches in production, right? That's actually quite powerful in the ML world. A lot of people use that. You don't have to redeploy the entire application unless, of course, the inference code has to change. But it gives a lot of flexibility. Because often, the teams building the models are not the teams deploying the software. Yeah, it's really fun. [0:36:59] KB: Are you seeing that metaphor working out in the LLM world, where you have essentially a registry for this prompt that has whatever associated code? And you say, "Okay, for this prompt, probably you have an eval that explains what type of data can come out of it, so that my coupled registry is able to deal with that." [0:37:17] GM: The first approach we took, because we weren't sure, is, yeah, you could do it either way. You could take the prompt through CI/CD - you could do it in a lot of ways. You decide what you want to do, right? We weren't sure what the right approach was. But now, about a year and a half in, and seeing a lot of teams doing it, that's definitely the approach we're taking. And it's not just a prompt.
What we're building is essentially a configuration manager, which has all the prompts and all the other variables that impact the agent. And then a product manager, who is often the person that owns these agents, can go and change stuff, right? You need to be able to revert to an older version. You need to have some kind of process so that you don't accidentally click a button and break production. You need to make sure that your APIs, as the configuration registry, meet a certain SLA because other people are dependent on them. But then you can start doing all this cool stuff. You can start doing overnight optimization. You can start doing canary deployments of agent versions and so on. I do think that's how things are going to shake out. And again, we have all flavors in the product, and it's clear that it's getting there. And I think it's predominantly driven by how involved product managers are versus just a pure - you're smiling. That makes total sense. But yeah, engineering builds it and kind of throws it to the PM team, and they need to manage this entire thing. [0:38:42] KB: I'm curious. And full disclosure, this is literally a problem that I'm tackling in my day job right now: how do you navigate non-technical stakeholders doing prompt optimizations, doing these things, and the coupling between all the different pieces? So what you're doing is you've got this prompt configuration system. It has versioning, and release gates, and all this stuff. It has with it a prompt, some amount of configuration. Are there dependency graphs? This prompt needs to run, it needs to have access to this type of data injected or these types of tools. Is there an SLA for, like, this is outputting structured data? Is there an SDK that is matching? How do all the pieces fit together? [0:39:23] GM: Yeah. Yeah. We're still figuring some stuff out.
But generally speaking, you have an SDK where you can commit, write, or read prompts, right? And it supports the templating and all the stuff that you want to do there. For example, you'd call something like OPIK.getprompt, and then promptname.latest, right? Something like that. And then that will fetch it when that API or that function call happens. It works when your agent architecture is somewhat stable, but it could have multiple prompts that are definitely dependent on each other, right? It's not just one prompt - we call it a blueprint. It's all together. It might have 10 prompts and a bunch of configuration, like which LLM you're using, all these different things that you want to control. And then, because you're fetching the configuration when someone calls invoke, you can change it in the UI very easily and get a good result. And you can respond to production incidents really quickly. I think it's a very powerful approach. [0:40:26] KB: Yeah, that's super cool. Looking forward then, what is the edge that you're working on? What are the things that you see coming in the next - as you said earlier, time feels compressed. I don't try to look out multiple years anymore. But what is coming in the next few weeks or months, right? [0:40:43] GM: On specifically the OPIK front, or just my prediction of - [0:40:47] KB: Well, let's start with OPIK. But I think you have a window into this space that I'd be curious to look at beyond OPIK also. [0:40:54] GM: I think on our front, we're spending a lot of effort on a bunch of the stuff we talked about. How do we help teams bootstrap eval and regression suites very easily, so that even a nontechnical person can do it? I think that's the biggest blocker at the moment for everything. If we figure that out, I think a lot of people will be much more successful with their agents. That's the first block.
And then really, this concept of blueprints. This concept that you can run an optimization, get a new blueprint or variation, and test it in production on a certain percentage of your traffic. All of that is deep in the pipeline, and it should be coming soon. It's all going to be open source, everything we do. That's that. [0:41:35] GM: And then I think once we start getting customers and users doing this, what you call in-vivo, in production, I think we'll probably learn a bunch more. I don't know how to predict what that will look like yet. That's maybe a few more months in the future. I mean, generally in the industry, I think, first of all, it's clear that these models are still getting better. I think for everyone who used Claude Code somewhere in the December time frame, something changed. [0:42:02] KB: Opus 4.5 possibly? [0:42:05] GM: Yeah. Yeah. Probably. There have been versions before, but this was something more substantial in how good it got. And Opus is a huge part of it, but I think Claude Code as a harness is extremely impressive. They did such a good job there. [0:42:17] KB: Well, and that is one of the things that's really interesting to me here. The models are incredible, and they do a lot of things, and they're continuing to improve. But there's still tremendous nuance in how to build an effective harness and put those pieces together. And what I'm hearing from you is the eval needs to be capturing all of those pieces and looking at all of those. You can't look at a prompt and a model in isolation. [0:42:39] GM: Yeah, I agree. And that's why the SWE-bench benchmark is not actually a good measure - because that usually tests only the model, right? There's a lot there. And one of the things - everyone was talking about vector databases a year or two ago. And Claude Code, I think they tested embeddings at some point, but it greps most of the stuff. And it does it in a really smart way, so it doesn't blow up the context. And it works so well.
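The grep-style retrieval described here - searching the codebase directly at each step instead of maintaining an embedding index - can be sketched roughly as follows. This is an illustrative example, not Claude Code's actual implementation; the file glob, function name, and character budget are all assumptions:

```python
import re
from pathlib import Path

def grep_context(root: str, query: str, max_chars: int = 4000) -> str:
    """Collect source lines matching any query term, capped to a rough context budget."""
    terms = [re.escape(t) for t in query.split()]
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    snippets, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                snippet = f"{path}:{lineno}: {line.strip()}"
                if used + len(snippet) > max_chars:
                    # Stop before blowing up the model's context window.
                    return "\n".join(snippets)
                snippets.append(snippet)
                used += len(snippet)
    return "\n".join(snippets)
```

The trade-off versus a vector database: no index to build or keep fresh, results are always current, and the agent can re-run cheap, targeted searches at every step - at the cost of lexical rather than semantic matching.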
I think that's the other thing that I'm starting to see more and more: a lot of teams started hitting challenges with their agents. As engineers, we're like, "Okay, let's start building some structure around it." We kind of have these user flows. And if you're in this flow, I'll inject this context. And if you're in that flow, I'll inject that context. But I think the trend is you actually want to give the models more freedom versus more constraint, which is interesting. [0:43:32] KB: It's really interesting. And I think it's probably problem dependent, right? I see examples where it depends on how flexible the problem you're attacking is. I have a well-constrained problem where I know what good looks like. Great. Let me lock everything down. Get a reproducible pipeline, a bunch of small, tightly scoped steps. Awesome. Or, I have a general-purpose problem. This person's coding in who the heck knows what language. I can't lock that down. I need to be able to flow with all those different changes. [0:44:00] GM: Yeah. And another thing - this is something I'm asking myself a lot, right? All these companies are building agents. You go to this product, and it pops up in the UI on the right side, a chat agent, all that kind of stuff. And we're going to a world where everyone runs their Moltbot or whatever it's called, whatever the name is today - [0:44:20] KB: Hopefully, one that is less of a self-hacking pathway. [0:44:24] GM: Yeah. Yeah. I set it up, and it was like, "Do you want to set up a connection to your email?" And I'm like, "No." [0:44:32] KB: Let me not only have the supply chain hack of NPM packages, but this thing is actively going out. I mean, we're talking about prompts as code. Let me go out and ask randos on the internet what code I should be running on your machine that you've given me all these permissions on. [0:44:45] GM: Yeah. I think the people behind it were privacy and security oriented. It's sandboxed pretty well.
But I guess the point is, what is going to be the interface? Is everyone just going to expose an MCP, and you're going to use your agent to call it? And if so, how much do you control out of the harness? It starts becoming a different interaction. Or, as a company, you will have your own agent, and people will type in your chat window. And I'm thinking a lot about what the role of UIs is. Everyone who's been in the software industry, we spent so much effort on UIs and UX and all of those things. But is the future one where every UI is generated on demand by your agent to just show you what you need to see right now? I don't know. I don't have an answer to all of these. But exciting times to be building. [0:45:41] KB: There is something there that might be worth digging into when we talk about agent optimization and how to do this. When we talk about UIs, I think one of the really interesting things that LLMs allow you to do in a UI is build more intent-based interactions, right? Traditional software is very imperative. I am going to click on this thing and drag it over here. I'm going to go this way, what have you. And even if you don't have a full-on chatbot - and I think we over-index on chatbots when we talk about LLMs - you have something that can interpret fuzzy direction, whether it's voice-driven, or language, or even, I think, some amount of gesture or other things. It's able to make inferences based on incomplete information and do the imperative pieces for you. And that opens up this tremendous opportunity in terms of streamlining people's experiences, even if your core product has nothing to do with an LLM. [0:46:42] GM: Yeah, I completely agree. I think the question is, okay, I have my application code, and I write all these LLM workflows to try to determine that kind of stuff. But when you think about a chatbot session, it doesn't actually have to be exposed to the user.
Text is great for some things, horrible for other things, right? I'm thinking UIs will need to be in sync with your chat session. It's kind of like what you said. The UI keeps changing based on the context of the session. And for some things, you're going to go and type. For some things - talking about data tables - I'd really rather have a couple of buttons to filter, and sort, and look at things. But they have to be in sync. And to your point, then you can start doing all these things where you infer implicitly what the user is trying to do, so much better than, "Oh, that was the user flow we thought about when we did this product meeting." [0:47:36] KB: Now you have an interesting eval question, too, right? How do I eval this UI that got generated on the fly? And I have seen some fascinating experiments with generative UIs. You can do really interesting things. Some of them are terrible. Some of them are great. But what does that eval look like? [0:47:55] GM: Yeah, I don't have all the answers, right? Obviously, it's a very different modality, but the reality will at least be similar to what we're seeing today, where you're going to have some form of human review providing the feedback. And then you're going to try to align or optimize your LLM-as-a-Judge to be as close as possible to the human evaluator. And that's a continuous thing. It's not scalable to have humans review everything. And LLM-as-a-Judge by itself introduces so much more fuzziness and noise that, on its own, it's not helpful. I think we'll have to land somewhere in between to make this stuff work. But yeah, indeed, it gets hard. This is a slightly different use case, but there are all these evaluation data sets for browser-use models, which are different, but not completely. Yes, the UI is not generated, but the state is determined based on the actions that the browser-use agent took. I think they do it at the DOM level, to be honest.
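Aligning an LLM-as-a-Judge to human evaluators, as described here, usually starts with something simple: score the same sample set with both and measure agreement. A minimal sketch - the toy judge below is a stand-in assumption; a real one would call a model with a judging prompt and parse its verdict:

```python
def judge_agreement(samples, human_labels, llm_judge):
    """Fraction of samples where the judge's pass/fail verdict matches the human label."""
    matches = sum(
        llm_judge(sample) == label
        for sample, label in zip(samples, human_labels)
    )
    return matches / len(samples)

# Stand-in judge for illustration only; in practice this would be an LLM call.
def toy_judge(answer: str) -> bool:
    return "error" not in answer.lower()
```

If agreement is low, you iterate on the judge's prompt or few-shot examples before trusting it to scale beyond human review - the continuous alignment loop Gideon describes.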
[0:48:53] KB: Well, and you could do something interesting with that. One of the models I've seen for doing this is essentially you have a UI layer that is controlled by React, or Redux, or something like that. And then you have a JS sandbox, and you let your LLM just yolo code into the sandbox. And all it can do is dispatch Redux actions, right? But that's enough to run your UI layer. So you've got it sandboxed, it's safe, it's not messing with stuff. But then you end up with React code, which you could test in any of the React testing libraries, or it does boil down to the DOM level, and you could do DOM-level stuff. That's interesting. [0:49:27] GM: Yeah, I haven't seen that. I can follow up later. I'd love to see that. I've seen some attempts. There's a library - I forget the name now. But some attempts to sync sessions between an agent session and a UI. There's a bunch of stuff with in-chat UI components that try to do that. I don't know what the right modality is. But clearly, we have tons of chatbots, tons of UI products. And at the moment, it's just a widget on the right side. I feel like it's going to change a lot. [0:49:58] KB: So we're getting closer to the end of our time. Let's maybe close with this, right? You're seeing a lot of teams building agents, grappling with these problems. Do you have a set of advice, or guidelines, or things you would say? Like, "Hey, you're tackling agents. There are no established years-old best practices, but here's what I'm seeing in the field. You should be doing these things." [0:50:20] GM: There are a few learnings I can share, right? The first one is, there are tens, if not hundreds, of agent frameworks out there. 80% of the people I see are not using any of them. You can spend months testing all these different frameworks, and they do add value. I'm not diminishing them. But the reality is, some of the most successful agents out there are homebrewed, vanilla-built.
I wouldn't spend too much effort on that. My next piece of advice - and it's kind of annoying. It's like when you go to the doctor and they tell you, "Oh, you have to work out more." You know the answer, but a lot of people struggle with doing it - or eating healthier, and all those kinds of things. But spend some time on building a very small evaluation data set. 20 samples. Just 20. It will pay off big time, big time. And then you can yolo the whole thing. Vibe code your agent. You'll be so much more successful. The other thing is - and we talked a little bit about it - I typically don't suggest worrying too much about costs early on, mostly because they tend to go down by roughly 90% year-over-year by design. It's a little bit of premature optimization. Use the best model, use the frontier model, make it work first. And then you can figure out, "Can I make this work with a cheaper, smaller model?" And then just generally, if you're online, if you're on Twitter, it seems like everyone has figured this out. Fortune 500 CEOs say, "Hey, we have 10,000 employees who are actual agents." The reality is everyone in the industry is trying to figure this out, and it's hard for everyone, including OpenAI. They just put out a great post on their data agent. And you can see they're struggling with the same challenges as all of us. Don't feel that pressure that everyone figured it out and you didn't. I would say that's it in a nutshell. [END] SED 1910 Transcript (c) 2026 Software Engineering Daily