EPISODE 1827 [INTRODUCTION] [0:00:01] ANNOUNCER: Glean is a workplace search and knowledge discovery company that helps organizations find and access information across various internal tools and data sources. Their platform uses AI to provide personalized search results to assist members of an organization in retrieving relevant documents, emails, and conversations. The rise of LLM-based agentic reasoning systems now presents new opportunities to build advanced functionality using an organization's internal data. Eddie Zhou is a founding engineer at Glean and previously worked at Google. He joined Sean Falconer to discuss the engineering and design considerations around building agentic tooling to enhance productivity and decision-making. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:01] SF: Eddie, welcome to the show. [0:01:02] EZ: Thanks, Sean. Great to be here. [0:01:04] SF: Yeah, absolutely. I'm really looking forward to this. I'm fairly familiar with Glean. A bunch of my former colleagues at Google have gone on to work at Glean. And I'm curious, how has the original vision of Glean evolved from this enterprise search company to now having reasoning agents as part of its offering? [0:01:26] EZ: Yeah, yeah, definitely. I'd like to think that our vision actually hasn't changed that much, and enterprise search was really just the first way that we could deliver part of that vision. And so the way I like to think about it is with enterprise search, we were meeting knowledge workers in a little slice of their sort of job to be done, their user journey, right? They open their computer, they need to do something, whether they're an engineer writing code, a PM researching other products, a salesperson prepping for a call, they all kind of have some journey that they need to do.
And we were looking basically to make that journey more meaningful, easier, and sort of free them up to do more things, right? And so with search, it was really about, "Okay, if this person can do the work of thinking, 'What is my high-level goal? What do I need to do?'" Identifying that they need to find some information and then going and interfacing with a search product to get that information, right? You've really sort of helped them in that little segment, right? And all we've done as we've evolved from Glean search, to assistant, and now to this agent platform is sort of broaden that segment, both to the left and to the right. Sort of helping them need to do a little bit less work to understand what they need to know to do this task, meeting them earlier on in that journey, as well as to the right, sort of helping them get more done after they've found the information, right? Maybe it's just synthesizing the information on the result page. Maybe it's actually starting to help them do whatever they're going to actually do with that information. I really see that sort of evolution from enterprise search into where we are today as an extension of the same vision we had to sort of help knowledge workers everywhere. [0:03:02] SF: Yeah. I think one of the unique aspects of Glean and how you've managed to position yourself in the market is, with the models themselves, they're very, very smart, but they're really, really dumb about your data, essentially. They know all of this stuff, but they don't really know anything about you and your company, and that's where the challenge is to build actual meaningful applications. And I also think that's one of the challenges with some of the other agent offerings that are available in the market. It's like, it's great, this tool could be amazing, have all this reasoning capability, but where essentially is it getting that data?
And since Glean has already plugged into that from the very beginning, it has access to all the sort of really rich information that you essentially want these agents to have access to. [0:03:47] EZ: Right. Right. And I think that that's an important sort of nuance for people to understand as the underlying LLMs that power these agentic systems get better, because them getting better is useful for everyone, but it's important to define what they get better at, right? They're not getting better at knowing your company's knowledge. That's a binary thing. Have they seen it or have they not, right? And so you're right that figuring out how to inject - we call it context injection. Putting enterprise context at the right places throughout an agentic system's execution is really important to actually getting them to work. [0:04:25] SF: Yeah. Because, as I've seen, you can have a really powerful model, but if you don't have good data, it doesn't really help. You can have essentially a lower-power model that has access to the right data at the right time and outperform the best model in the world. It really comes down to, I think, the central challenge for most companies building any kind of AI experience today: how do I kind of liberate the data and find that subset of data to provide into that context window so I can actually steer the model in the correct direction to generate a meaningful response. [0:04:57] EZ: That's right. This sort of notion of world knowledge and then company knowledge as a layer on top is something, sort of tying back to your first question, that we've been thinking about in the search world for a while. When someone comes to a product in their work life, a search bar or a chat interface, they're not only in the company context. They're an employee of this company, but they've built on top of the foundation of world knowledge. And so we have to build our systems in the same way.
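The "context injection" idea described here - retrieving company-specific context and placing it into the prompt so a model with only world knowledge can answer company-specific questions - can be sketched minimally. This is an illustrative toy, not Glean's API; the keyword-overlap retriever stands in for a real search engine:

```python
# Toy sketch of context injection: retrieve company-specific snippets
# and prepend them to the prompt before calling a general-purpose LLM.
# All names are illustrative; the retriever is a naive stand-in.

def retrieve_company_context(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query (toy retriever)."""
    q_terms = set(query.lower().split())
    scored = sorted(documents, key=lambda d: -len(q_terms & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Inject retrieved enterprise context ahead of the user's question."""
    context = retrieve_company_context(query, documents)
    context_block = "\n".join(f"- {snippet}" for snippet in context)
    return (
        "Answer using ONLY the company context below.\n"
        f"Company context:\n{context_block}\n"
        f"Question: {query}"
    )

docs = [
    "The 2025 holiday calendar lists July 4 and December 25 as company holidays.",
    "Expense reports are due by the 5th of each month.",
]
prompt = build_prompt("What are the company holidays this year?", docs)
print(prompt)
```

A real system would use permission-aware search and embeddings rather than word overlap, but the shape is the same: the model's world knowledge is the prior, and the retrieved context supplies what it has never seen.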
Even in the search setting, if you think about embedding models used for retrieval, you still run into the same dynamic of, "Hey, have these models seen this new data or this data that's specific to the company? If they haven't, they're not going to perform well." How do you sort of layer on that new understanding? How do you adjust these models to account for that? The same thing goes for all the generative products and the generative use cases in the agent world. How do you effectively acknowledge that there is a prior, if you will, in that world knowledge, but you need to augment it with what's there? And it's a really hard problem. I'm not going to claim we've fully solved it by any means, but I think we're really well positioned to keep making progress and building things that give value to people. [0:06:05] SF: So I think it's worth slowing down for a second and actually explaining, from your perspective, what exactly is a reasoning agent. I think there's a lot of variation in terms of what people think about an agent. I know Hugging Face recently came out with this nice framework of kind of like the levels of agentic AI. How do you essentially define a reasoning agent? [0:06:25] EZ: Yeah. And I want to preface this by saying, by no means do we think - we're not incredibly opinionated here. I think there's a lot of framings and frameworks that are being developed and they all have their merits. I think we've sort of come to our own internal viewpoint on this. And even that continues to evolve. Because it does matter when you talk about, "Okay, if we have a reasoning agent and it has access to tools, what are those tools themselves capable of?" And so one framing is, okay, a reasoning agent is something that can, given a set of tools, formulate a plan to satisfy an input, and then go and execute those tools, right? But then there start to be many different questions, extensions, right? Can it use these tools in an iterative fashion?
What is the granularity of these tools? Are the tools themselves other agents, in the sense that they also have the ability to call out further into other systems? Or are they sort of more static? And so I know this is kind of a cop-out answer, but we're trying to not draw a hard line around this while still figuring out how we can keep our framework flexible enough to adapt to how the industry is thinking about agents, and to agents people might build in the open-source world or elsewhere. And how can we make sure they can at least integrate with Glean as the dust sort of continues to settle on this? [0:07:50] SF: When customers come to you and they're asking for an explanation between how are agents different than RAG, how are agents different than some sort of AI application that follows a workflow? What is the explanation there? [0:08:03] EZ: Yeah, I think for the former, in terms of how it's different from RAG, I think the main component here is that we would see agents as an extension of RAG, in the sense that the only tool that a "RAG agent" has access to, or the only flavor of tool, is some sort of read or retrieve type tool, right? It can issue a search, whether it's a search engine like Glean or a federated search engine, and then it sort of generates, right? Retrieval-augmented generation - it generates based off of that, right? And so agents are simply an extension of that where the content that's being generated may not be the response to the user. It might be the next step in a plan, right? And being able to sort of sequence this out so that it's not just a retrieve and then a generate step, but rather something more extensible. Perhaps it's multiple retrieve steps, perhaps it's a generation that goes into the next step that is then another retrieval. It's providing access to more actions that are not just retrieve and read actions, but actually executing. It could be executing code.
It could be interfacing - we call them write tools, but sort of writing out into the world. Those are some ways that we sort of distinguish agents from core RAG. And the second one you mentioned - what was the second distinction you were looking for? [0:09:17] SF: Essentially, an AI application that's following some sort of workflow. [0:09:20] EZ: Yeah. I think in terms of the word workflow - and this is where terminology does get a little bit blurry - we do like to think about things in terms of static and dynamic, right? A workflow might represent something that's more of a fixed execution. It might be multiple steps. There might still be LLM calls that are allowing for some variability, but the actual execution flow of what steps will be executed is fixed. And the "system itself" can't modify that graph. But once you introduce some level of dynamism, where you say, "Okay, for a given query, we're actually constructing a graph, or perhaps iteratively constructing it," that becomes the distinction between an agent and a workflow, or what people are calling a workflow. And of course, you can take graphs that are constructed for a given query and freeze them into workflows that are repeatable, but that's another distinction between these sorts of workflows and more dynamic agents. [0:10:21] SF: Right. It really boils down to what is the control logic? Is it some sort of predetermined, pre-programmed set of steps? Or are you essentially allowing sort of the brain of the agent to determine what that sequence of steps is? [0:10:34] EZ: That's right. And they definitely both have their place. It's sort of high-risk, high-reward, right? The more levers you give the system to create something flexible, the more powerful it can be. You could hope that it generalizes to new use cases, new queries. But at the same time, it can be more unpredictable, right?
And if you want something more predictable, you want something frozen, you certainly can, by sort of freezing that control logic, as you said. And of course, you can use a dynamic system to help you build the initial sort of graph that you want to freeze, iterate on it, and then freeze it and use it indefinitely, right? That's obviously another path you could take. [0:11:12] SF: Do you see most people using some sort of hybrid approach, where some part of the workflow might be agentic - a more dynamic set of reflection steps or something like that - whereas other parts of it are a more predetermined set of orchestration steps? [0:11:30] EZ: Yeah, I think we do see both. And I think everyone wants a fully looping dynamic system. But they often find, once they deploy it, it's a lot harder to sort of get a handle on. And so people end up building more constraints back around their system and sort of freezing different parts of it. But it really depends on the use case and the stakes. If you have something that is high-stakes, you can't afford for that execution graph to change. And for us, I think what's interesting, sort of tying this back to the enterprise case, is when we talk about a concept like reflection, which is, "Hey, given what I've executed so far," you ask this central brain, this LLM decision-making logic, "what should I do next?" And the really tricky thing about many of these enterprise use cases, not all, but many, is the same dynamic between world and company knowledge. It doesn't know what it doesn't know. I mean, this problem is present even in RAG, in a simple retrieve and then reflect-before-generate step. If I'm asking some question, it could be as simple as about the holiday calendar that my company has for this year, it depends on what the retrieval engine does.
If it retrieves the right things or if it doesn't, how can you ask that reflection component to do the right thing afterwards, right? It's sort of beholden to the performance of that previous upstream system if you present it, "Here's the query that was planned, here's some results, what should you do next," right? And you could imagine a case where the search engine didn't return the right results and it decides to respond anyway. Reflection, I think, is a very hard task to do in the enterprise context, because these agents generally don't know what they don't know. And so that's a very careful context injection sort of problem to think about and work on. [0:13:22] SF: How do you deal with sort of the unbounded execution even outside of reflection? Unless you're putting a hard limit on how many times we can kind of loop over this, how do you essentially manage the fact that the execution cycle might be unbounded? [0:13:37] EZ: Yeah, I think, as simple as it sounds, the first thing you mentioned is kind of the easiest way to do it: you might allow for a fixed number of executions at various stages, right? It could be overall the number of steps - you can only create an execution graph with this number of steps. Or within a given set of steps, you can only have it iterate this many times. Those are some good guardrails to put in. I think it also sort of depends on the complexity here. There's probably some neat analog to - there was a paper a week or two ago where people were benchmarking a lot of these thinking models, the LLMs themselves, and sort of measuring their performance as the number of thinking tokens increased. And they found that there's a sharp decrease in performance once the thinking tokens became too long. And the sort of, I don't know, intuitive analog is that, well, it's rabbit-holing, it's spinning, right?
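The "fixed number of executions" guardrail mentioned above is simple to enforce in code. A minimal sketch, with a stubbed planner standing in for the LLM (all names are hypothetical):

```python
# Capping agent iterations to avoid unbounded loops / rabbit-holing.
# `propose_step` stands in for an LLM planning call; names are illustrative.

class BudgetExceeded(Exception):
    pass

def run_with_budget(propose_step, execute, max_steps: int = 8):
    """Run plan/execute iterations, bailing out after max_steps."""
    history = []
    for _ in range(max_steps):
        step = propose_step(history)
        if step is None:  # planner signals completion
            return history
        history.append(execute(step))
    # Budget exhausted: surface it instead of looping forever.
    raise BudgetExceeded(f"no answer after {max_steps} steps")

# A planner that never terminates would otherwise spin forever:
looping_planner = lambda history: "retry"
try:
    run_with_budget(looping_planner, execute=lambda s: s, max_steps=3)
except BudgetExceeded as e:
    print(e)  # no answer after 3 steps
```

In practice the budget can be layered, as described in the conversation: a cap on total graph size plus a cap on iterations within any one stage, with single-digit budgets often being enough to deliver value.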
And you could probably extend that same thing to agentic systems, where if your number of executions - if you're looping too much, it's probably unlikely you're going to reach a good outcome. And I think there's a careful balance there, right? I think there's a lot of value that people can get with simpler agents that execute a much smaller number - we're talking on the order of a single-digit number of executions rather than dozens or hundreds, right? And so I think playing in that smaller space still provides a lot of value, and you can add some simple upper bounds depending on the use case. [0:14:56] SF: Does it help to think about, essentially, breaking up these problems where you want an agent to operate and solve some sort of task - instead of using sort of one monolithic agent to perform that, to split it up essentially into a multi-agent system? [0:15:09] EZ: Totally, totally. And this goes back to how - I'm trying not to be too opinionated here. But personally, I do have some opinion in terms of, for us, we're also thinking about, "Okay, how do you scale, right?" How do you both build agents internally and help others build agents in a way that isn't bottlenecked on a monolithic system? Again, drawing from ML systems and sort of first principles here, a lot of times, ML systems in large products become too monolithic. And there's a downside to them, because then, okay, you have your team of 15 people, and now they can only all work on this one model. And everyone's just trying to work on this one model. So your actual rate of improvement is lower than if you had kept that factored into multiple systems, where you could have multiple people working in parallel. You're sort of giving up that "short-term" gain for a more medium and long-term gain, but you'll reach a better spot.
And I think the same thing applies here, where our internal approach right now is a little bit more, "Okay, yes, you do have a central agent, but the sort of tools" - and I use that term again a little bit loosely - "that it's given access to may be other agents." And so you can delegate more to those other agents. And it's not just about what you can do in the short term - someone full-time focused on making that agent successful is always going to have a better time doing that than you trying to solve all these things at the central level, right? I do think there's a huge role for delegation here, and it's about figuring out the right interface between that central agent and these delegated agents. [0:16:40] SF: Yeah. I mean, I think one of the things I see people sort of missing as they dive into the space and are excited about it is that, as exciting as this stuff is, it really carries all the same challenges that you have with running any large distributed system. And the scale problems aren't simply just about infrastructure scale. They're also about how do you scale the teams. And, essentially, how do I design this in such a way that I can loosely couple these things, treat them essentially like microservices that can kind of operate independently? Even leverage different models and stuff like that, independent of some of these other systems. The teams don't have to necessarily know exactly what's happening within each particular team and have these hard-and-fast dependencies between them. [0:17:21] EZ: Totally. I mean, I know this podcast is Software Engineering Daily. And so experienced software engineers everywhere, I'm sure, have been punching the air, seeing LLM agent systems being designed without thinking about core engineering principles - just like all the things you said - that are still incredibly relevant.
And I think the ability of these systems to show you something really awesome in the short term and give a good proof of concept has made it easier to sort of forget good engineering principles when designing them for the medium and long term. And obviously, the rate of change and the pace has made that hard too. [0:17:55] SF: Yeah, absolutely. And then in terms of an agent - going back to a singular agent - can you break down what the components or the anatomy of an agent is, and how each of those is used in order to essentially come up with a plan, perform these reasoning steps, potentially execute tools, sort of to solve a specific problem? [0:18:16] EZ: Yeah, sure. The way we're thinking about it is we have a central system that has access to a set of tools, and its first step is to develop that first pass at a strategy or a plan using those tools, right? And how it does so is important. You can obviously give descriptions of the tools, or if you're using function calling, whatever that interface may be. But as with all LLMs, having good in-context examples is really important, right? And so this kind of goes back to the team scaling component of how do you influence the central system? Well, you can have teams that are building these sort of golden in-context examples that say, "Hey, I want to make sure that when this central agent sees a query like the one I care about, or sees an input, it can effectively sort of build the graph that represents what I want it to look like, right?" And so I think that the first part is assembling the input to this main - what we call a strategize call - that is composed of available tools, but also curated, important examples that demonstrate how to use those tools, how to synthesize them in this multi-step way, right? And then the output of this system is a sort of graph that can be partially executed or fully executed. But again, that graph has some level of delegation, right?
Each of these tools, the way we're thinking about it, you can defer to them more and more, right? I can say, "Hey, for this tool, I have an objective." I'm not telling you exactly how to accomplish the objective, because that is the tool's, or the subagent's, goal. Or not a goal, but their sort of domain is to figure out how to do it. And so you can sort of distribute the cognitive load, if you will, over the subagents as well. And that's how you have different teams working on subagents - different folks can work on these subagents and make them work well. Yeah. Then you can basically execute your graph and come back to the central system as needed. But that's the rough flow that we're thinking about right now. [0:20:13] SF: Is there a limit to the number of tools that any one agent can handle? [0:20:18] EZ: Definitely. I think there fundamentally is. Even if we're not talking about hundreds of tools, even if we're only talking about a dozen tools, the permutations of chaining them together to accomplish some set of things are obviously, like, exponential, right? And so you do need - and this is the kind of drum that we beat - a lot of it can come back to search. And for us, it's thinking about, "Okay, how can we make sense of all these permutations?" Especially the ones that are these sort of in-context examples, examples of how to use these tools in sequence or in parallel. And take that eventual set of tens of thousands, hundreds of thousands, or millions, and search over them down to something that's much smaller, and guide the LLM to that. You almost have a graph search problem, or just some sort of item search problem, where you can then say, "Okay, now it's a tractable set of things that we want." We've given an assist to this central agent. It doesn't have to fully reason over all the possibilities. You can, again, factor part of that system out into a search problem.
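Narrowing the tool set with a search step, as described here, can be sketched as a simple retrieval over tool descriptions. A production system would use embeddings and learned ranking with company-specific signals; this toy uses keyword overlap, and all names are illustrative:

```python
# Toy "tool search": instead of showing the planner every tool,
# retrieve the top-k tools whose descriptions best match the query.
# Keyword overlap is used purely for illustration; names are hypothetical.

TOOL_CATALOG = {
    "slack_search": "search slack messages and conversations",
    "calendar_lookup": "look up holidays and meetings on the calendar",
    "code_search": "search source code repositories",
    "crm_lookup": "look up customer accounts in the crm",
}

def shortlist_tools(query: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions overlap the query most."""
    q_terms = set(query.lower().split())

    def score(item):
        name, desc = item
        return len(q_terms & set(desc.split()))

    ranked = sorted(TOOL_CATALOG.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]

print(shortlist_tools("what holidays are on the company calendar?"))
# ['calendar_lookup', 'crm_lookup']
```

The planner then reasons only over the shortlist (plus curated in-context examples of chaining them), rather than the full exponential space of tool combinations.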
Do that search separately and then say, "Hey, here's 5, 10, 20 different tools and/or combinations of tools that are probably relevant for this," rather than the entirety of all the combinations. [0:21:29] SF: The idea there is that I have some sort of context about what it is I need to do - maybe data I need to gather. Then I can essentially perform the search. And, essentially, that search is sort of my first line of defense at the main agent that's operating here. And then from there, I know, "Okay, well, Slack has some of this information, Google Docs has some of this information, a Confluence page has some of this information." And then I can essentially limit the set of tools to those three tools to then go and execute against. [0:21:58] EZ: Yeah. That's a good way to think about it. And another thing from our side, the way we think about it, is the other dimension on this is the sort of company-specific dimension, right? Mostly, companies kind of do work or store information in roughly the same way, but that very quickly starts to break down. Companies develop their own ways of storing information. That's what we've been building with search, right? Even the sort of algorithms that you would use, or the nuances of, like, "Okay, what tools or workflows are relevant for company X for this query?" might be different for company Y, right? And so the way that you build that search algorithm can start to be - you need to do query understanding in the same way over, "Okay, when Confluent users are asking about this kind of query, what kind of workflows should be surfaced in this search setting?" Versus if a Glean employee is asking that, it might be different, right? And that kind of goes back to a lot of what we've been doing in Glean, not just around language adaptation, but also around these other signals that are important to understand. Sean is working in these different places.
When he asks this query, again, that search algorithm should be actually personalized to him, and it may return a different set of things, a different set of subagents to reason over, than if someone else at Confluent does. [0:23:11] SF: Right. How do you persist the identity information across all these particular endpoints? [0:23:18] EZ: I mean, in the Glean context, a given request is going to come in and, obviously, it has identity information associated with it, right? So we can make permissioned calls wherever we need to. At the agent level, I guess your question is sort of - okay, I guess if we draw the analog back to documents, it's obvious in that documents have authors and people interacting with them. But now, if the items that you're sort of searching over are other agents or workflows, maybe your question is how do we know who is associated with them and how the identity gets associated there? [0:23:51] SF: Right. If I'm the user that's interacting with the agent, and let's say there's some sort of UI to this agent, presumably the agent needs to know who I am in that organization, so that when it makes a tool call, essentially, my identity information can be factored into that tool call, so it knows that, "Okay, well, Sean only has access to this subset of documents within the organization." [0:24:11] EZ: Yeah. I mean, in that sense, it's no different than how Glean manages identity throughout the whole product. We can re-leverage our entire identity infrastructure and platform. But I think the interesting question is sort of on the modeling front of, "If Sean's colleague has created this agent, should that agent be more likely to be relevant to Sean?" Depending on how closely they work together, where they're working - these same kinds of signals, because we can have that identity metadata and that sort of implicit signal. The explicit stuff is a given, right? You only have access to things you have access to.
Certain tools can only execute if you have access to them. It's not just documents, right? We're able to build on all the same Glean infrastructure we have to make that work. [0:24:50] SF: Given that with essentially any agent or multi-agent system you have a lot of these internal and external dependencies - presumably, it's probably not going to be running all on one server and stuff like that - and on top of that, you have sort of these unbounded execution plans that might have cycles, and a stochastic model at the heart of this acting as the brain - how do you manage the debug process? How do you figure out when errors happen with some of these dynamically generated execution workflows? [0:25:19] EZ: It's a really, really good question, and it's another thing where, as the use cases and the tooling evolve in parallel, we're all trying to build the right tools to give ourselves this ability. For us, it's similar to reasoning about the life of any query, if you will, even in the search context. Most of the things we've been building are compound AI systems. They're composed together, so you need to be able to say, "Okay, at the high level, where can I track this down to? Where in the flow - the input and output of each system - do I think the breakdown is?" and sort of do that trace, right? People talk about reasoning traces. This is similar. For us, when we talk about the graph, it is a lot of, "Okay, how can we figure out where in this graph the output was not desired from the input?" You can always start from the back. You could start from the beginning, either way. But I'm curious to dig into a bit - you said internal and external dependencies. I'm not sure I fully understood what you meant there. [0:26:12] SF: Well, from the agent's perspective itself. I could - this may not be relevant in the context of Glean, but just thinking about agents in general.
I could have internal knowledge systems I need to tap into, which is my Google Docs or something like that. But I could also have to factor in, essentially, external knowledge systems, like DIN or some sort of website or something like that, where I'm actually searching beyond the bounds of my company. [0:26:37] EZ: That's definitely relevant to us, by the way. It's clear that there's a one-way street of internal stuff not going out, but definitely external stuff is coming in. Like I mentioned, people are coming to it. Many queries that come into agents need that blend, like you're saying. [0:26:51] SF: Yes. Then also you have short-term memory, long-term memory, which could be different systems as well. I guess, how are you managing that? Is your long-term memory using a vector store representation of that? [0:27:05] EZ: Yes. I think long-term memory, the way to think about it is, again, you can model that as a reliance on your context injection. How do I know what is the relevant plan for Eddie for this query? There's short-term information I can make use of, like what was his last query. What work was he doing before this? But then that extends further into these signals that we use for search, which are, in many ways, closer to long-term memory, right? What was Eddie doing one month ago, or what team was he working with? These things, again, can be injected into different pieces, different prompts, at different stages in this graph via any method, right? They could be retrieved by whatever means. They could be vector-retrieved. They could be lexically retrieved. I think the more important thing is that they are retrieved in some fashion and sort of injected at the right time. [0:27:58] SF: In terms of the outputs that any of these agents are generating, how do you essentially control for incorrect information, hallucinations - like, put guardrails around it?
What sort of post-processing steps exist to essentially evaluate the response to make sure that it's actually a valuable response? [0:28:15] EZ: Yes, very good question and very hard question to answer. I think for us, our best bet is - look, these systems are unbounded, right? I think the biggest delusion that some folks have is, "I can instruct things not to happen, and they won't happen." But instruction following is well-defined. You can measure and rate how well an instruction is followed. Once instruction starts to bleed into knowledge, though - again, going back to this - you don't know what you don't know. You tell the LLM never to lie. It's not lying, given the context that it's given, right? So this does relate a lot to RAG concepts around, okay, is something correct conditional on the context that it's given? Or is it correct, sort of independently? As an end-to-end system, the user doesn't care. The user needs to make sure they're not getting false information. But from an ML engineer's perspective of diagnosing, it does matter what part of the system is breaking down, right? In terms of the guardrails, I think for doing stuff on the fly, there's some low-hanging fruit that can be done. But ultimately, on the fly is a hard problem, right? You're asking a system to reason about itself: "Hey, is what I just emitted correct or incorrect?" I think that, going back to this knowledge problem, that's a very hard thing to do. You can obviously build in online judges for tasks that are a bit more narrow: was the output that was just created ungrounded in the context that it was given? That's more tractable. But even if it isn't ungrounded, that doesn't mean that it's correct, right? Again, given that context problem. A lot of what we try to do is measure more things offline, in batch, right?
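The offline, batch-style measurement favored here can be sketched as a golden-set harness: run known queries through the system and score the outputs. The `agent` stub and all names below are hypothetical stand-ins for a real pipeline; exact-match grading is a placeholder for an LLM judge or groundedness check:

```python
# Toy offline eval harness: run a golden set of (query, expected) pairs
# through the system in batch and report accuracy. `agent` is a stub
# standing in for the real pipeline; names are illustrative.

GOLDEN_SET = [
    ("capital of france", "paris"),
    ("2 + 2", "4"),
    ("company holiday in december", "december 25"),
]

def agent(query: str) -> str:
    """Stub system under test; replace with the real agent call."""
    answers = {"capital of france": "paris", "2 + 2": "4"}
    return answers.get(query, "i don't know")

def evaluate(system, golden):
    """Score each case. Exact match is used here; an LLM judge or
    groundedness check could be substituted per case."""
    results = [(q, system(q), expected) for q, expected in golden]
    passed = sum(got == expected for _, got, expected in results)
    return passed / len(results), results

accuracy, results = evaluate(agent, GOLDEN_SET)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.67
```

Adversarial cases, as mentioned next, slot into the same harness: add golden pairs where the correct behavior is to back off ("i don't know") and measure how often the system does.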
You can run all kinds of processes to generate things offline, run them through your system, generate things you know to be correct and make sure that your system can achieve them, or create adversarial sets where you say, "Hey, I'm going to make it look like this is the case," and measure whether your system backs off correctly, or whatever it might be. You come at it from that side, and you get a measure of, okay, how good are we at this, and how can we continue improving that? You pair that with what's happening at request time online. But it's really hard. It's sort of an unsolved problem to say, for every input request coming in, do I know if it was exactly correct or not? I can put some guardrails around that, but the strategy we've been coming at it with is larger-scale measurement from the other side. [0:30:29] SF: In terms of all the pieces that make this system possible, from the tracing that you're doing, to some of this offline batch processing to evaluate responses, to actually building and deploying the agents and the way that they communicate, how much of that is built from scratch versus relying on existing tools? [0:30:48] EZ: I would say it's a blend. We are constantly evaluating parts of our system, even ones that were built earlier that now have a great open source alternative, and revisiting whether we can rebuild on top of that. It's sort of a trope that engineers always want to reinvent the wheel, but a great engineer will never do that because they know they can create more impact building on top of what others have built. For us, I think it is a blend. Our principle is to try to reuse where possible. We don't always do that perfectly. I think especially in an enterprise environment, there are also components like - if we care deeply about efficiency, for example, and performance, right?
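[Editor's note: the offline batch-evaluation strategy described above can be sketched as a harness over two test sets. This is an illustrative sketch, not Glean's tooling; the "back off by answering 'unknown'" convention is an assumption for the example.]

```python
def run_offline_eval(system, golden, adversarial):
    """Measure 'from the other side': golden cases check the system
    reproduces answers known to be correct; adversarial cases are
    engineered to look misleading and check the system backs off
    (here, by returning the literal string 'unknown')."""
    golden_pass = sum(system(q) == expected for q, expected in golden)
    adv_pass = sum(system(q) == "unknown" for q in adversarial)
    return {
        "golden_accuracy": golden_pass / len(golden),
        "backoff_rate": adv_pass / len(adversarial),
    }
```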
Are we sure that the frameworks that we're using are pushing the boundary there, right? Can we optimize within that framework, or do we need to roll something ourselves, right? Are there fundamental design decisions in these frameworks that go against our security constraints or our deployment setup? That doesn't come up a lot, but there's this checklist of things we run through to understand, can we sub out a new framework for what we have? At the end of the day, sometimes we're rolling our own, and sometimes we're relying on what's out there. [0:32:03] SF: Do you have some sort of eval framework put in place to make sure that when you are making changes to, perhaps, the system prompts of some of these things, you're actually generating a better result than you were previously? [0:32:15] EZ: Oh, definitely. Definitely. That would be crazy if we were just pushing out changes without evaluation. [0:32:20] SF: I see some crazy stuff out there. [0:32:22] EZ: Totally. I should say that would be crazy. I know probably most people out there are doing that. [0:32:25] SF: There's a lot of putting your finger in the wind and seeing which way it's blowing. [0:32:28] EZ: Totally, totally. I've sort of mentioned this before to other folks internally because it's interesting. The team has been building an ML or AI product for a while, right? So they have built up the muscle of what evaluation means. It's interesting coming from the search side because you have the traditional ML side, the search side, and now this gen AI product side, right? On the traditional ML side, you have some large-scale ML system. You monitor some metric. Your experiment is, "Hey, I changed the way this model trains. I changed this model architecture. Numbers go up. Great." The search world is sometimes that, but also a lot more qualitative. "Hey, I'm looking at individual queries.
I'm trying to understand which parts of the system are breaking down. What can I change?" But I'm always going to run an evaluation, right? When you run an evaluation, you have some parts that are automated that give you a high-level metric, but you're also going to get a qualitative sense. You're going to go look at some queries and do more vibes-based evals, if you will. That was a thing people started with a lot in the generative world, but it's still really relevant. So a lot of it is about pairing with something quantitative at large scale. You can say, "Hey, I ran my evaluation suite on my prompt change, and clearly the metrics went down a lot, so I know there's something to be concerned with." If you run them and they're all neutral, the change might still be the right thing to do. You have to rely on, "Hey, is there enough qualitative evidence here for me to believe that I'm making progress, that I know I'm improving some of these issues?" It's all things in balance here, and then layered on top of that is automation. Once you have a strong enough evaluation signal with enough density, a lot of prompt engineering can be automated, and you can use all kinds of frameworks out there to do that. [0:34:06] SF: In terms of the agent experiences that are available on Glean, I know that there's essentially a no-code experience where I can just fill out some forms and create an agent that way. There are also some existing pre-built agents. I know that Glean has apps as well, but you can build against the APIs. Are there APIs for building in that new agent experience as well? [0:34:25] EZ: I don't want to speak ahead of the product roadmap and make a bunch of PMs and/or engineers frustrated that I said the wrong thing here. I don't know what to commit to here, but certainly we want to support a range of builders, from low code all the way to people who understand how to do these things programmatically.
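[Editor's note: the "pair quantitative metrics with qualitative spot-checks" practice Eddie describes can be sketched as a small prompt-comparison harness. This is an illustrative sketch; `run_fn`, the grader functions, and the result shape are all assumptions for the example.]

```python
import random

def compare_prompts(eval_suite, run_fn, prompts, sample_k=3):
    """Score each prompt variant over the whole suite (quantitative),
    and also return a few sampled (query, output) pairs for human
    review (qualitative), since a neutral aggregate metric can hide
    real regressions or improvements."""
    results = {}
    for name, prompt in prompts.items():
        scores, transcripts = [], []
        for query, grader in eval_suite:
            output = run_fn(prompt, query)
            scores.append(grader(output))
            transcripts.append((query, output))
        results[name] = {
            "mean_score": sum(scores) / len(scores),
            "samples": random.sample(transcripts, min(sample_k, len(transcripts))),
        }
    return results
```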
I think API definitions are important, and it can be tricky in the generative world to say, "Hey, what exactly is this API?" But in the agent-building case, there's definitely talk of it. I don't know if it's committed to, but we want every engineer internally to be able to build powerful agents on the same set of tools that we're giving external users. I guess that's our forcing function to make sure whatever we're exposing in the platform we're building is effective, because if engineers with "full access" internally can't do it effectively, then we certainly can't expect folks outside of Glean to do that. [0:35:21] SF: How are you dog-fooding some of this stuff internally? Are you using some of this technology to essentially make people more efficient within Glean? [0:35:28] EZ: Yes. Folks are building internal agents, finding use cases where they're relevant, and trying to build those using that same suite of tools, from low code to otherwise. They have different levels of traction, and that's the neat thing about user-generated content, right? Some of them take off. Some of them don't. So our product team is constantly trying to understand, hey, what use cases are really shining through? How many of the things that people ask in chat or the assistant itself are really agents or workflows that should be abstracted out and made more repeatable, and how can we make that happen? A lot of it is bringing more structure to a lot of existing usage. [0:36:06] SF: How should people be thinking about measuring the success of any particular agent that they're using? [0:36:12] EZ: Wow, what a loaded question. I wish I could speak for - there are so many different agent use cases, right? I think usage is a decently good barometer. If people are coming back to it, that means they're probably finding value in it. In the long term, that holds true.
You could build an agent that's actually just wrong all the time, and people use it at first. But once they realize that it's wrong, they won't. As long as you measure on a long enough time horizon, I think that's an effective measure. Usage is always king for any product. When it comes to success, it's about measuring the outcomes that come from that, and that starts to get a little bit more specific, right? Are you talking about, "Hey, here's an agent that half of our salespeople used and the other half didn't"? Are we running an internal A/B test to see how many exceeded their quota by a lot or not, right? You could start thinking about some of those outcomes, although that becomes, like I mentioned, really use case-specific. [0:37:03] SF: How should companies be thinking about agents versus some sort of simpler process? What are the scenarios where it makes sense for them to say, "Okay. Well, we're going to go all in on an agent versus some simpler prompt-based approach or workflow?" [0:37:18] EZ: Ideally, they don't need to think about the level of abstraction. Ideally, a single product can cleanly span the gamut of, "I'm putting in a natural language instruction on this use case." Behind the scenes, I don't care what happens. It's going to either do something simple. Or if it detects that it's something complex, then I'll be prompted to say, "Okay, we think this probably merits something more complex. Do you want to help refine or iterate on this agent?" But requiring people to do the pre-work of understanding how complex their task is is a big ask to make, right? At least from our perspective, we'd like to lift that away from folks, get them to try the simpler thing, and push them up the complexity curve as needed. [0:38:05] SF: What would you say is one of the hardest technical challenges with actually building agents today?
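[Editor's note: the "try the simple thing first, escalate as needed" routing Eddie describes can be sketched as below. The keyword heuristic is purely hypothetical; a real product would likely use a model to classify request complexity.]

```python
def classify_complexity(instruction: str) -> str:
    """Hypothetical heuristic: multi-step language suggests the
    request is really a workflow rather than a one-shot prompt."""
    markers = ["then", "after that", "for each", "every week"]
    text = instruction.lower()
    return "complex" if any(m in text for m in markers) else "simple"

def route(instruction: str) -> str:
    # Try the simple path first; only prompt the user toward the
    # agent builder when the request looks like it merits it.
    if classify_complexity(instruction) == "simple":
        return "run_one_shot_prompt"
    return "offer_agent_builder"
```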
What is the gap that's there that some R&D efforts need to be put into in order to solve? [0:38:19] EZ: I actually still think it's tooling, and the evaluation suite of tooling in particular. The engineering side is still a little bit behind ML infra in the traditional ML world. The faster you can give folks the ability to see how their agents are doing, help them evaluate at scale, and push them to create a lot more training or eval data, the faster these agents can actually work, right? Because a lot of it is getting that evaluation signal so you can tune on it. To me, that's a bottleneck I see across the industry, because a lot of people see a small barrier, and they just give up and ship whatever's out there because they're like, "I can't measure this anyways," right? But imagine how much more reliable what they ship could be if they had iterated on it. That's top of mind for us as we're thinking about, okay, how do we help people build not just any agent but effective agents? How do we give them the right toolkit to really measure? Because no one wants to put something out there and then find out none of their colleagues can use it because it's unusable. We want to help them get a pulse on, "Okay, I'm pretty confident this is going to work as I expect it to, because I ran hundreds of queries on it, and that was really easy to do." [0:39:30] SF: In the companies I talk to, of course, there's a ton of interest in leveraging AI. But there is a lot of fear around making any of this stuff customer-facing. So a lot of it is looking internally at how do I augment my existing knowledge workers and find efficiencies there before I ever put something customer-facing. [0:39:48] EZ: Yes. [0:39:49] SF: I agree. I think tooling maturity is a challenge.
I think this is also why a lot of companies that are doing this in production don't trust even some of the existing tools and frameworks that are available, because they're worried about the maturity of those tools. It's not like building on the cloud, which has been around for 15 years or whatever. You're building on stuff that's sometimes only been around for six months. [0:40:11] EZ: Yes, yes. There's definitely that aspect. A lot of the applications of ML before were ones where people understood, "Hey, I have this classifier. It has precision and recall that aren't 100%, but I know what the business outcome is of a false positive and of a false negative," right? For a lot of these applications, a false negative or false positive isn't even defined. It's unbounded text generation. It's a write action that does something that could be really devastating. It's hard to measure the business outcome of that, right? But a lot of people still need to understand that it is still an ML system. There's going to be some stochastic nature to it. It's not going to behave exactly how you want it to all the time. It's the difference between an ML feature and a software feature, in a way, right? It needs to behave like you intended enough of the time. But that uncertainty is fundamentally built into a lot of this. [0:41:01] SF: Awesome. Well, anything else you'd like to share? [0:41:03] EZ: No, we covered so much here. Thanks for some great questions, and it was really awesome talking. [0:41:08] SF: Yes. Well, thanks for being here. I really enjoyed it. Cheers. [0:41:11] EZ: Cool. Thanks, Sean. [END]