EPISODE 1828

[INTRODUCTION]

[0:00:00] ANNOUNCER: Sourcegraph is a powerful code search and intelligence tool that helps developers navigate and understand large codebases efficiently. It provides advanced search functionality across multiple repositories, making it easier to find references, functions, and dependencies. Additionally, Sourcegraph integrates with various development workflows to streamline code reviews and collaboration across teams. 

Beyang Liu is the CTO and Co-Founder at Sourcegraph, where he has worked for the past 12 years. In this episode, he joins the show with Sean Falconer to talk about the frontier of leveraging AI in software engineering. 

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. 

[INTERVIEW]

[0:00:56] SF: Beyang, welcome to the show. 

[0:00:58] BL: Great to be here. Thanks for having me, Sean. 

[0:00:59] SF: Yeah, absolutely. You've been working on Sourcegraph now for over a decade. How have things evolved? And what are some of the big pivotal moments in the company's evolution? 

[0:01:11] BL: Yeah, definitely. I mean, as with many things, it's like some things stay the same, some things change drastically. Obviously, the impact of AI has, I think, introduced in everyone's minds the potential of changing the way that software is built. And there's certainly a lot more bot or computer-generated code now, a lot more use of large language models to write all sorts of functionality across many different verticals and industries building software. But some of the fundamental problems still exist, and they're kind of the same as the ones that we originally started to tackle. 

The fundamental problem is that software development doesn't really have economies of scale. In contrast to every other industry - in every other industry, you build a bigger factory, you get more efficient, the stuff gets cheaper to produce as you scale. Software engineering is basically the opposite. As you grow, things get less efficient. 

[0:02:06] SF: It's like the Mythical Man-Month. 

[0:02:07] BL: Yeah, exactly the Mythical Man-Month still applies today even in the age of LLMs and AI that we live in. And this harkens back to the founding story of Sourcegraph, which is really born out of my co-founder, Quinn, and my work inside very large enterprise codebases. We were both engineers at Palantir. He was forward deployed. I was just a regular engineer, but we were both going to the field a lot and working with a lot of Palantir's Fortune 500 customers. I shouldn't say a lot. We worked with the first two that Palantir landed. We were kind of on this like SWAT team of sorts that was trying to open up a new line of business for Palantir. That was our introduction to very large, messy codebases. I mean, Palantir's codebase itself at that point was probably like five, six years old, so it wasn't as bad as the codebases that we were being deployed into, but it was still enough to have a bit of cruft, a non-trivial amount of tech debt. 

[0:03:08] SF: Was that their desktop version at the time, basically of running some sort of Java suite? 

[0:03:14] BL: Palantir's codebase? 

[0:03:15] SF: Yeah. 

[0:03:15] BL: Yeah. Well, actually, it was kind of a combination of things. Originally, the project that I started on was like a Java thick client. Because in those days, you have to remember that this was before React, before kind of the modern web development stack. Backbone.js and CoffeeScript were new at the time. That was the hot new thing. And so to get the richness of interactivity that was needed for a lot of the data analysis workflows that they were trying to enable, Java Swing was the only option when they got started. 

But then by the time we started, we got put quickly on this upstart team. And so we were using web technologies on that team, frontend JavaScript and that sort of thing. But we had to integrate in the context of these very large banking codebases because those were the customers that we were trying to go after. And that was just a huge pain. It was just like you couldn't get anything done. It took weeks, sometimes even months just to ask around and acquire enough understanding of how the code fit together to even begin to think about what you should be building or how that fit into the broader picture. 

And that was our first introduction. It kind of gave us this insight into the scope of the problem because of where we were being deployed. Palantir was brought in to solve very high-priority problems. We were talking about things that were costing banks millions of dollars per day that were on some level existential if they weren't solved within a finite amount of time. This was all kind of a hangover from the 2008 financial crisis and cleaning that up. 

And so, yeah, it was kind of crazy to us that these institutions had all the money in the world to throw at solving this problem and still could not solve it. And I just think that speaks to the fundamental problem that faces any software engineering effort at scale, which is that software engineering really doesn't scale. That's still the problem that we're here to tackle. We've tackled it for a lot of customers at some level. And I think being able to integrate LLMs and AI into our tech stack makes it possible to actually sort of solve the Mythical Man-Month problem in the next decade or so, which is really exciting. 

[0:05:20] SF: Can you elaborate a little bit on that? How do you see AI helping solve that problem? Because presumably, I think the problem with scaling engineering is you end up with this vast dependency graph with tight coupling essentially between people. Communications break down. Everything just grinds to a halt. 

[0:05:37] BL: Yeah. 

[0:05:37] SF: Could you potentially recreate the digital version of that if you're using AI as well? Because you're going to end up with a lot of these sort of digital dependencies essentially. 

[0:05:46] BL: Oh, totally. And I think like where the attention is today is all about AI-generated code, which is great. I mean, it's truly mind-blowing that you can create a whole simple web application basically in one shot with any of the big frontier models today. You can do a lot with them inside existing code bases too in terms of updating and editing the code with the right context. 

But I think people still don't fully grok the bottleneck in large-scale software engineering. It's not on the writing code side, it's on the understanding and reading code side. And so to your point, it's like, right now, everyone's just taking code out of the LLM and basically pasting it into their codebase. That maybe increases by an order of magnitude the volume of code you're able to generate in a given amount of time. But does that actually move the needle? In the engineering discipline, we tend to view each line of code as more of a liability than an asset. The functionality you get from a line of code is an asset to your users and an asset to the business. But the fact that you had to write a line of code in order to do that is kind of like a liability. It's like a column in the tech debt stack. 

And so I think there's a very real world where the amount of code in the world just balloons and there's all this slop that was never conceived of by a human brain. At least with human-written code, somebody at some point understood how it worked. With AI-generated code, that might not ever be the case. And then it might even exacerbate this problem of like, "Hey, the existing code is a tangled spaghetti mess," to the point where the AI can't work in it, and I as a human can also not work in it. 

And so, yeah, I think that remains the challenge ahead of us. And that's one of the challenges that we want to solve. And that is sort of this fundamental challenge of the Mythical Man-Month. For those who haven't read the book, it's kind of like a classic about the nature of software development. 

Its core thesis - the book was written in the '70s, basically. Someone looked at the progression of software projects inside these mainframe companies back in the day and noted this weird phenomenon where, when they would add people to a software project in order to speed things up, it would actually end up slowing things down, because it introduced more communication overhead and a loss of coherence or unifying vision in the codebase. A lot of abstractions that would clash with each other. Pieces that wouldn't fit nicely together. 

And so the conclusion there was software development, it's more like birthing a baby rather than churning out widgets in a factory. And I don't know if it's from this book or somewhere else, it was like you can't have nine women make a baby in one month, right? There's no way to speed that up. Because there's all these like serial dependencies. And the task is fundamentally - it requires one person to shepherd it along. And the analogy applies to software - it applied to software development back then. It applied to software development in 2019, before ChatGPT, and it still applies today. 

But I think the possibility, the potentiality that we see and are excited about is that, with the benefit of LLMs, you actually have a lever around human intelligence to be able to go to an architect or like a senior engineer and say, "We will give you this tool that you can wield to enforce the type of coherence, and standards, and consistency across the codebase that you need to keep things clean as you add more contributors and you scale your engineering org." 

I don't think there's a way around the scaling because the fact of the matter is, if your software becomes successful, you're going to have more users get on it, you're going to have more customers. Those customers are going to have feature requests that tailor the software to their needs. And so you kind of have to spin up more people to build all those features. The challenge is how do you manage the complexity and maintain that coherence of vision as you add more contributors into the project? 

[0:09:36] SF: In that case, this idea, maybe you can leverage AI to do a lot of the code generation, but then you sort of have this architect that's acting like the moderator or the orchestrator. They understand how the pieces go together. They understand what the standards are, and they're there to do essentially the understanding piece of this where AI is maybe not ready to help us with today. Is that the idea? 

[0:09:56] BL: Yeah, that's exactly right. In an ideal world, you would have one very smart individual reviewing all the code that goes into your codebase every day, ensuring that it fits architecturally into the vision that they had in mind. Now, people don't do that. Because if you tried to do that, you literally could not fit in the amount of reviews you'd have to do. That person would basically have to spend all their time just doing that. And even then, they would only cover a fraction of the code that's being committed into the codebase. 

[0:10:24] SF: On the generation side, I think that AI for coding is one of the areas where AI is finding a lot of success today. And in a lot of ways, that seems surprising in some sense. But I also think it makes a lot of sense in a number of ways as well, because there's a lot of examples, there's a lot of code that's been digitized and is available for training. And the other thing is that one of the hard problems when working with large language models is it's hard to benchmark and eval them. Versus if you're building a purpose-built model, you know roughly what the inputs and outputs should be. So I can essentially create a set where I can evaluate the model. I make changes. I can re-evaluate the model. It's hard to do that with a general-purpose model. But if I'm doing code generation, it constrains the universe and I can actually build a relatively good benchmark and eval set to test against. I'm curious, from your perspective, do you think that is one of the primary reasons why AI in coding is finding success, or are there also other factors at play? 

[0:11:24] BL: I definitely think that that is one of the primary reasons why it's finding such success. I think valid code is a much more constrained domain than valid language, because you have this kind of oracle that tells you first, "Does it compile?" If it doesn't compile, you have something that tells you exactly the error that's preventing it from compiling. And then if it does compile, you also have these things called unit tests that validate the correctness of the function. 

And so, I mean, you can use that in a variety of ways. You can use that at inference time to verify and validate the output that you get to make sure that it's correct. You can also use it at training time. You have a way to generate a high-quality set of synthetic training data by basically taking the output of an existing model, running it, and then checking the errors that it produces when it attempts to generate code that conforms to the user query or prompt. 
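
To make that verification loop concrete, here is a minimal TypeScript sketch of the "does it compile?" oracle, using the TypeScript compiler API to type-check a model-generated snippet in memory. The `checkCandidate` function and the way you would wire it into a retry loop or a training-data filter are illustrative assumptions, not a description of how any particular lab or vendor actually does this.

```typescript
import ts from "typescript";

// Hypothetical oracle: type-checks a model-generated TypeScript snippet in
// memory and returns the compiler errors, which can either gate a retry at
// inference time or filter examples when building synthetic training data.
function checkCandidate(code: string): { ok: boolean; errors: string[] } {
  const fileName = "candidate.ts";
  const options: ts.CompilerOptions = { strict: true, noEmit: true };

  const host = ts.createCompilerHost(options);
  const defaultGetSourceFile = host.getSourceFile.bind(host);
  const defaultFileExists = host.fileExists.bind(host);

  // Serve the in-memory candidate instead of reading it from disk.
  host.getSourceFile = (name, languageVersion, ...rest) =>
    name === fileName
      ? ts.createSourceFile(fileName, code, languageVersion, true)
      : defaultGetSourceFile(name, languageVersion, ...rest);
  host.fileExists = (name) => name === fileName || defaultFileExists(name);
  host.readFile = (name) => (name === fileName ? code : ts.sys.readFile(name));

  const program = ts.createProgram([fileName], options, host);
  const diagnostics = ts.getPreEmitDiagnostics(program);
  return {
    ok: diagnostics.length === 0,
    errors: diagnostics.map((d) => ts.flattenDiagnosticMessageText(d.messageText, "\n")),
  };
}

// Example: a snippet with a type error comes back with a precise message that
// could be appended to the next prompt ("fix these errors and try again").
console.log(checkCandidate(`const n: number = "not a number";`));
```

The training-time use is the same check run offline over many samples, keeping or labeling only the ones that compile (and, further along, pass their tests).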

And so you put those two and two together and you have a vector for making models much more precise and useful at inference time and also a way to improve the models at the training step itself, which I think a lot of the frontier labs have been doing. And if anything, it's almost like the fact that this exists for code, it's great for the coding domain, but it also probably generalizes logical reasoning capability outside the coding domain. 

I remember when ChatGPT was first released, there was a lot of speculation that a lot of the kind of emergent capabilities came when they started training the model not just on natural language but also on code because there's certain logical referential patterns in code that kind of mapped logical reasoning that then can also generalize to logical reasoning patterns in natural language as well, which is really interesting. 

[0:13:06] SF: That kind of helps us with the generation part of this. What is Sourcegraph doing to kind of help with this understanding problem? I'm using AI, or even I have engineers, humans that are generating all this code, how do I actually, as that thing grows over time, maintain an understanding of it?

[0:13:24] BL: Yeah. There's kind of two ways where we differentiate in the understanding domain and make our product much more powerful and impactful on big, messy production codebases. One is at the context retrieval layer. And the other one is at the validation and verification layer. The way to think about this is, if you're trying to use AI to generate code, first you want to give it examples so that it gets within the rough ballpark of what is right or what is reasonable in your codebase. Without the benefit of contextual snippets, it's basically analogous to asking a human to write code for your codebase, but not actually letting them look at any of the other code that exists in your codebase. 

What would a human do in that case? They would just go to Stack Overflow and write code that fits your description, but uses just the common open-source frameworks that are well-represented on GitHub or Stack Overflow. And that just doesn't fly within a lot of messy production codebases. Because the longer a codebase has existed, the more kind of divergent and special-snowflakey it becomes. 

The first thing we do is we use our code search engine, which is our first product, our original product, which is kind of like a Google for code and does much more than search. It also does code navigation, and it has access to a couple more structured databases about code, these ontologies of code and technical context. We bring those relevant snippets into the context window, and that helps steer the LLM toward generating something that's much more within the distribution of what is acceptable in your organization. 

What we find is that there are much higher acceptance rates for the code that we generate inside the organization. It doesn't have to be reworked or rewritten as much because it has the benefit of that context. And then the second side of things is the validation and verification layer. We've been partnering with a lot of our enterprise customers to figure out how to automate a lot of the review that is done inside their organization. 

Code review is kind of like half a box-checking exercise, but half - actually, there are some useful things that we want to catch. And I think the way that humans do it, oftentimes it becomes like 98% box checking and maybe 2% actually catching stuff, because the box-checking part just takes up all the time and everybody wants to get it over with so they can get back to actually building stuff, which is more fun and also something you get more credit for. No one gets a promotion for doing a great code review, right? But machines can do it much more thoroughly and to a much higher quality than we can. 

One of the things that we kind of got pulled into is we found all these customers of ours were basically hitting our APIs and building their own internal code review bots against them. And so we're like, "Wait a minute, there seems to be some commonality across these things." And so we started building a code review agent that's now in early access. I just gave a talk about this at the AI Engineer Summit in New York with one of our partners, booking.com. 

And the interesting thing here is I think it actually opens up a new paradigm for modifying or constraining what code goes into your codebase. Our kind of catchphrasey way of describing this is it's almost like a declarative style of coding where you declare these rules and invariants that you want to hold in different parts of your codebase. These are things that your architect or your senior engineer could specify. I wrote this class this way, it should only ever invoke this other component in this other way. Or if you ever see this pattern of code, this is an anti-pattern, rewrite it to be more like this. Change all your map functions to for loops, or maybe vice versa, depending on what your overall philosophy is. Now you can define these rules in one place and just have them automatically enforced. You don't have to go and manually review every piece of code looking for these sets of patterns and anti-patterns. 
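
As a rough illustration of that declarative style, the kind of rules described here could be sketched as data like the following. The field names, file globs, and rule text are invented for illustration; this is not Sourcegraph's actual rules format.

```typescript
// Hypothetical rule definitions -- the shape and field names are illustrative only.
interface Rule {
  /** Glob selecting which files the rule applies to. */
  appliesTo: string;
  /** Natural-language rule the review agent checks each changed hunk against. */
  rule: string;
}

const rules: Rule[] = [
  {
    appliesTo: "services/payments/**/*.ts",
    rule: "PaymentClient must only be constructed via createPaymentClient(); never instantiate it directly.",
  },
  {
    appliesTo: "web/src/**/*.{ts,tsx}",
    rule: "Prefer for...of loops over Array.prototype.map when the result is discarded; .map used purely for side effects is an anti-pattern.",
  },
  {
    appliesTo: "**/*.ts",
    rule: "Keep NPM dependencies on the latest stable major version unless pinned with a comment explaining why.",
  },
];
```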

[0:17:13] SF: What's the representation of those rules? Are you using ontologies as the underlying representation? 

[0:17:19] BL: The representation right now is mostly just natural language plus a way of specifying which set of files in the codebase a rule should apply to. There's kind of a selector for like, "Hey, this is the subdirectory, or this is the logical project where this rule should hold." And then the rules themselves, I think right now, they're mostly natural language. In the next phase of this, some will stay natural language, and some will be natural language descriptions that are implemented by the invocation of a precise tool. This particular pattern in the AST should never exist. I've described it in natural language, but translate this to a tree-sitter query or something like that. And that's how you can test and verify. 
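
As a sketch of what "translate the natural-language rule into a tree-sitter query" could look like, here is a hypothetical check using the classic web-tree-sitter bindings to flag `.map(...)` call sites. The wasm grammar path and the wrapper function are assumptions for illustration, not part of any product.

```typescript
import Parser from "web-tree-sitter";

// Hypothetical: flag every `.map(...)` call site so a rule like "prefer for...of
// over .map when the result is discarded" can be checked mechanically.
// Assumes the TypeScript grammar has been compiled to wasm and is on disk.
async function findMapCalls(source: string): Promise<{ row: number; column: number }[]> {
  await Parser.init();
  const parser = new Parser();
  const TypeScript = await Parser.Language.load("tree-sitter-typescript.wasm");
  parser.setLanguage(TypeScript);

  // Tree-sitter query: a call expression whose callee is a `.map` member access.
  const query = TypeScript.query(`
    ((call_expression
       function: (member_expression
         property: (property_identifier) @method)) @call
     (#eq? @method "map"))
  `);

  const tree = parser.parse(source);
  return query
    .matches(tree.rootNode)
    .flatMap((m) => m.captures.filter((c) => c.name === "call"))
    .map((c) => ({ row: c.node.startPosition.row, column: c.node.startPosition.column }));
}
```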

[0:17:58] SF: In the context retrieval or the code search, you mentioned that you're using ontologies there. Is that ontology like a base model representation of the concepts? Or are you starting with a base and then making something that's customer domain-specific as well?

[0:18:13] BL: We think of it as like a knowledge graph that is built around the structure of the code. This basic skeleton of the knowledge graph is really the reference graph in the code. You have definitions and you have references. And that's all code is at the end of the day. It's a bunch of definitions and then they're referenced in other places. 

And this is the knowledge graph that you are kind of implicitly walking when you as a human go in and try to understand how something works, right? Maybe you do a couple of searches to get within the rough ballpark. But then after that, it's like, "Go to definition. Go to definition. Find references. Hover over the symbol to see what the docs are." And so that knowledge graph is very useful for humans. It's also very useful for surfacing potentially relevant contexts for LLMs as well. 
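
In code, the skeleton being described is roughly a graph of definitions and references. This toy TypeScript sketch (the interfaces and fields are invented for illustration) shows the "go to definition" and "find references" walks that both a human and a context-fetching agent would make:

```typescript
// A minimal sketch of the definition/reference skeleton described above.
interface Definition {
  symbol: string;   // e.g. "createPaymentClient"
  file: string;
  line: number;
  doc?: string;     // what "hover" would surface
}

interface Reference {
  symbol: string;
  file: string;
  line: number;
}

class CodeGraph {
  constructor(private defs: Definition[], private refs: Reference[]) {}

  /** "Go to definition": resolve a symbol to where it is defined. */
  definitionOf(symbol: string): Definition | undefined {
    return this.defs.find((d) => d.symbol === symbol);
  }

  /** "Find references": every place a symbol is used. */
  referencesTo(symbol: string): Reference[] {
    return this.refs.filter((r) => r.symbol === symbol);
  }

  /** Walk outward from a symbol to collect candidate context for an LLM prompt. */
  contextFor(symbol: string): { definition?: Definition; references: Reference[] } {
    return { definition: this.definitionOf(symbol), references: this.referencesTo(symbol) };
  }
}
```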

And then around that kind of like core code graph, there's other things that also come into play. There's a long tail of technical knowledge that is stored in things like issue trackers, corporate chat, production logs. And we've built a way into our system to integrate these other pieces of knowledge. We had this thing called Open Context, which is basically a simple protocol for saying like, "Hey, there's this knowledge source." Maybe it's your issue tracker. "And here's how you query it, here's how you select a specific item." It has very strong parallels to MCP now, but kind of predates MCP by a year. 

We built an Open Context-to-MCP integration. I think we might merge the two in our platform moving forward. But this was something that we had very early because we recognized a strong need to pull in relevant context from all these different other sources that were ultimately connected back to the code graph in some way. You have an issue tracker that pertains to a certain area of the codebase. You have a production log that has a stack trace, and all those things in the stack trace map to lines of code at some specific revision. Constructing that knowledge graph just allows you to have this map of, "How do I pull in all the relevant pieces of context when a user asks a question or wants to generate some code?" 
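
To give a flavor of the idea, a knowledge-source provider boils down to something you can query and fetch items from. This is not the actual Open Context or MCP spec; the interface, endpoints, and response shapes below are invented for illustration.

```typescript
// Hypothetical shape of a context provider in the spirit described above --
// NOT the real Open Context or MCP protocol.
interface ContextItem {
  title: string;
  url: string;
  content: string; // text that can be placed into the LLM's context window
}

interface ContextProvider {
  /** Human-readable name, e.g. "Issue tracker" or "Production logs". */
  name: string;
  /** Free-text query, e.g. "stack traces mentioning PaymentClient in the last 24h". */
  query(q: string): Promise<ContextItem[]>;
  /** Fetch one specific item by its URL/ID. */
  get(url: string): Promise<ContextItem | undefined>;
}

// Example: an issue-tracker provider (endpoint and response shape are made up).
const issueTracker: ContextProvider = {
  name: "Issue tracker",
  async query(q) {
    const res = await fetch(`https://issues.example.com/api/search?q=${encodeURIComponent(q)}`);
    const issues: { key: string; summary: string; description: string }[] = await res.json();
    return issues.map((i) => ({
      title: `${i.key}: ${i.summary}`,
      url: `https://issues.example.com/browse/${i.key}`,
      content: i.description,
    }));
  },
  async get(url) {
    const res = await fetch(`${url}.json`);
    if (!res.ok) return undefined;
    const i: { key: string; summary: string; description: string } = await res.json();
    return { title: `${i.key}: ${i.summary}`, url, content: i.description };
  },
};
```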

[0:20:13] SF: Yeah, I think having developed something similar to MCP before MCP became public makes a ton of sense. Because if you're doing any of this stuff with relatively complicated systems, you're just going to end up with your AI having to pull in data from all kinds of different places. You can't have essentially a gravitational model for data where you're just dumping it all into a lake and then surfacing it from there. You have to essentially tap it directly from the source in order for it to be the latest information and not have to deal with people like re-architecting the data. You're not going to be storing your source code in your lake anyway. It's not going to make sense somewhere else essentially. You have to liberate the data from all those different places. 

[0:20:52] BL: Yeah, totally. And it helps to have a structured knowledge source, right? These two things are complementary, right? LLMs have this basic form of reasoning and the ability to integrate context. But what they don't excel at is structuring things down to a very precise spec or description. And you saw this especially with the gen 1 LLMs, where they had trouble emitting valid JSON. Now, after billions of dollars of training, they can finally do that, but it's still a very inefficient way to acquire that structure. It's far better to just put it into a symbolic representation and have the LLM kind of walk that and then use that as a context-fetching mechanism to complement its core memory and reasoning abilities. 

[0:21:39] SF: In terms of the commit review or PR review agent, can you walk me through it? What is that workflow? I submit or commit a PR, and this agent is monitoring my PRs. Then what happens essentially? 

[0:21:51] BL: There's kind of like two steps to this. There's the rule definition process and then there's the rule enforcement stage. The rule definition process actually happens in these files, these Sourcegraph rules files. And you can put these in a particular directory, or subdirectory, or repository, wherever they're defined. You can define a pattern of what files these rules apply to. And so that's something that you define in your codebase, and you define it once. 

And once you've defined them, we're actually building multiple hooks into the software development lifecycle where they are enforced. One of the things we're doing is we're actually bringing these rules into context in the editor. When you're generating code with AI in your editor, using our editor extension, the rules are part of the background context of the LLM. And so the LLM can take into account the rules when generating code. 

Now, that's not 100% guaranteed that they'll actually follow the rules. Oftentimes, they ignore them, but it's better than nothing. It saves you some time. It's kind of like shifting left this process of getting feedback and constraining the code that you're creating to be valid or acceptable in your codebase. 

Now, the second place where they're enforced is at code review time. So you push up your patch for review. And then after that, you tag in our code review agent, and it goes and takes a look at the rules that apply to each hunk that's been changed and then posts comments on that hunk noting problems that break the rules or suggestions on how to fix the code so that it conforms to the rules. 
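
Put together, the enforcement step is roughly: for each changed hunk, select the rules whose file selector matches, hand the hunk plus those rules to a model, and post any findings back as review comments. A minimal sketch, where the model call and the code-host comment API are stand-in callbacks rather than Sourcegraph's real interfaces:

```typescript
import { minimatch } from "minimatch";

// Shapes are illustrative; "Rule" mirrors the hypothetical rules sketch earlier.
interface Rule { appliesTo: string; rule: string }
interface Hunk { file: string; startLine: number; patch: string }
interface Finding { rule: string; explanation: string; suggestion?: string }

async function reviewPullRequest(
  hunks: Hunk[],
  allRules: Rule[],
  // Stand-ins supplied by the caller: the LLM judgment call and the code host's
  // review-comment API (e.g. a pull request review endpoint).
  judgeHunk: (hunk: Hunk, rules: Rule[]) => Promise<Finding[]>,
  postComment: (hunk: Hunk, finding: Finding) => Promise<void>,
): Promise<void> {
  for (const hunk of hunks) {
    // Only rules whose selector matches this hunk's file go into the context window.
    const applicable = allRules.filter((r) => minimatch(hunk.file, r.appliesTo));
    if (applicable.length === 0) continue;

    for (const finding of await judgeHunk(hunk, applicable)) {
      await postComment(hunk, finding);
    }
  }
}
```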

[0:23:20] SF: Is my understanding there - basically, every source file has some set of rules that represent its essentially file-related context. And then you can use that as a way to essentially evaluate the commit against the lines that change and what the rules are for that specific line or specific context. 

[0:23:38] BL: It's not every file. It's more, I would say, at the subdirectory level. 

[0:23:41] SF: Okay. 

[0:23:41] BL: Because it might get kind of tedious to have to define. You'd probably have a lot of duplicative rules if you had to define them per file. But you can define it - think of it as roughly analogous to an OWNERS file or a CODEOWNERS file, where you're not necessarily defining that per file, but you can say, "Oh, this subdirectory belongs to this project. And these are the set of rules that should be enforced for this project." Or only apply this rule to TypeScript files or TSX files in this codebase. 

[0:24:08] SF: Is that how you help essentially limit the size of what potentially has to go into the context window? 

[0:24:14] BL: Yes, we're not shoving the entire rule set across the entire codebase into the context window to review a single hunk. We sort of fetch the rules that are kind of relevant to that particular part of the code. 

[0:24:27] SF: And those rules, are those LLM-generated to start with or are they manual? 

[0:24:31] BL: They're human written, yeah. 

[0:24:33] SF: Okay. That's kind of a key step to having something like this set up. 

[0:24:36] BL: Yeah, exactly. Again, the basic idea here is to give your senior engineers, your architects, the people who want to define the vision for how the codebase architecture should be - it basically gives them a lever to enforce that vision. Because the pattern that happens across every large-scale codebase is you have your senior engineers who have enough scar tissue to know the certain patterns and architectural designs that you should incorporate to make the codebase workable, to make it continue to be the type of place where people can contribute. 

Then you have junior engineers who have less experience. They're just focused on building the feature. And they often commit code that kind of breaks those architectural constraints. And it may check the box for what you want to ship in that quarter or in that sprint cycle, but you're taking on tech debt. The code base itself is getting messier and the incremental next feature gets tougher to build. And so it's this kind of like losing battle because the juniors outnumber the seniors. 

And sometimes the worst-case scenario is that when the codebase gets so messy, you have to do something like a large-scale code refactor or code migration. And these can stretch on for months, even years. We have a customer that had a feature flag retirement initiative that was scheduled to take - I think it was like 11 years or something like that. It's kind of mind-blowing. That's at least one era of tech. Who knows whether we'll even still be around then? But those sorts of projects have to be led by senior engineers because they're the only ones who have the historical context and the technical depth to execute them. But the problem is, once you put your senior engineers on those projects, then who's building stuff for the user? Who's actually improving the UX? Your best and brightest minds are focused on feature flag retirement instead of building an awesome user experience. 

And so the vision for the code review agent, and just more generally the declarative system of rules, is to give these senior engineers a way to express these constraints and just have them automatically enforced. They don't have to spend a hundred percent of their time enforcing these rules. They can actually build new things as well. 

[0:26:51] SF: In most of these situations that we're talking about, this is an existing codebase that was built sort of from scratch by humans. 

[0:26:58] BL: Yeah. 

[0:27:00] SF: People have this historical context. They know the decisions that were made to get to the place that they're at. They know sort of those rules that they want to enforce. What about for net new code? How does it work in that situation where I'm starting from scratch and I'm going to leverage AI to do a lot of my generation? And does that impact the person's ability to know what's going on and create those types of rules down the road? 

[0:27:22] BL: Yeah. First of all, I would say in many organizations, no code is truly an island. Even if it's a new project, it needs to talk to some existing system. It needs to integrate with some existing system. Maybe there's like a common framework that is the way you do things. Because if you do it that way, you get observability for free, you get robustness for free, get scale for free inside that organization, right? Because that's the platform that they've built internally. 

And so there, even if you're starting from scratch, it's not truly from scratch. And you still want the benefit of the context of other similar projects so that you can pattern match against those and also the context of what the rules and constraints are. How to use particular APIs? Common pitfalls or foot guns that you want to avoid and things like that. For truly, truly from-scratch problems, then I would say that rules are less relevant, right? If you're just creating a flappy bird clone that you want to share on Twitter and get a lot of likes, then it's like, "Who cares?" Right? 

I was talking to someone at one of our customers recently who had this very insightful thing to say, which was technical debt is only debt when you have to touch the code that's messy. His idea is if you're writing a throwaway piece of code, if all you want is like for that code to do this one thing, you're not going to come back and have to like upgrade it or modify it later, then, yeah, who cares about how messy it is? As long as it does the job, right? 

[0:28:50] SF: It's a one-shot app. 

[0:28:51] BL: Yeah, it's a one-shot app. It's a single-use app. That's not real technical debt because you don't care. You're not going to have to dig through that. It's not going to cost you any time in the future because you're never going to dive into that code again. In those cases, who cares about the rules, right? You vibe it into existence, and if it works, it works. Then you go on and forget about it. 

But most of the software we use day-to-day is not like that. If you're actually building something that is going to be driving millions of users or hundreds of millions in revenue through it, you're going to want to evolve it and change and update it over time. And for that, you do need to think more thoughtfully about the architectural considerations. It's not like you need to define all the rules up front. These things evolve organically over time. But I do think you want to have a system in place such that, when you start to scale, when your software becomes successful and you start to hire more people or bring more contributors into it, you want to have a lever, I think, as the keeper of the vision of that code, to be able to enforce your vision across all the new minds that are going to be ramping up on it. 

[0:29:54] SF: And presumably, a senior person is coming with their own sort of history of projects that are maybe not the same project, but they're going to apply some of those patterns, essentially, in that net new project, essentially. 

[0:30:05] BL: Yeah, yeah, exactly. 

[0:30:06] SF: Around the feature flag retirement problem, where you end up putting your best and brightest on it, because it touches so many different systems, you need someone who has sort of a very comprehensive understanding of the code base and the impact if I remove something. Can AI help us with that? 

[0:30:22] BL: Yeah, it absolutely can. In fact, that was another use case. This is another agent that we're building in partnership with this customer. It's not a review agent, it's actually an agent that can run a large-scale code migration. In this case, the feature flag retirement thing. There's kind of a spectrum of how difficult these problems are to tackle, right? 

Initially, we went into this thinking like, "Oh, it's just dead code removal." You could almost do that pre-AI, right? Just look at if this is referenced and just run one of those standard AST-level checkers and remove it. Why can't you do that? And then we started whiteboarding this out and it became clear to us that this was far from trivial. Because, essentially, all these feature flags had these implicit dependencies. And when they walked us through how humans were cleaning things up, it was like, "Okay, you had to find the location of the feature flag, but then you have to kind of walk up the application stack." You maybe cross an API boundary and see if this thing is triggered. And sometimes it goes through like - you have to identify the set of APIs that are in the middle of the path between the thing that you can change and the thing that actually influences the end-user experience. And then you have to tag those as well. It became this kind of involved task. 

But what we realized was there was sort of an 80/20 rule that applied here. Maybe 80% of the sites that needed updating weren't at the level where you could just AST-remove them, but they were at the level where a simple agent with some kind of heuristic criteria plus a decently intelligent LLM could probably figure it out. 80% were "easy". And then you had maybe 80% of the remaining 20% that were kind of medium, and then you had this final 2% that were really hard. 

And so what we proposed to them was like, "Why don't we just take the 80% that's easy first and that will substantially reduce the amount of tech debt and make it so that you don't have to like -" because tech debt breeds tech debt, right? If you'd hack around an existing thing, oftentimes you add spaghetti code to work around the existing spaghetti code and you just end up with more spaghetti. Why don't we tackle the easy 80%? Then we'll build a more involved agent for the remaining 80% of the 20%. And then for the last 2%, maybe that still needs to be manual. But by now we've scoped it to 2% of the original problem. And so instead of taking like 11, 12 years, we can bring this down to a year or maybe less than a year end-to-end. 
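
For the "easy" bucket, a first-pass scanner is not much more than an AST walk. This sketch uses the TypeScript compiler API to find call sites of a retired flag and apply a crude "is this trivially removable?" heuristic. The `isEnabled` flag API, the flag name, and the heuristic itself are invented for illustration, not the customer's actual setup or the agent's real logic.

```typescript
import ts from "typescript";

type FlagSite = { file: string; line: number; easy: boolean };

// Find references to a retired feature flag, e.g. `flags.isEnabled("checkout-v2")`,
// and mark the ones that look mechanically removable (the whole condition of an
// `if` statement) versus the ones that need the more involved agent or a human.
function findFlagSites(fileName: string, source: string, flagName: string): FlagSite[] {
  const sf = ts.createSourceFile(fileName, source, ts.ScriptTarget.ES2020, true);
  const sites: FlagSite[] = [];

  const visit = (node: ts.Node): void => {
    if (
      ts.isCallExpression(node) &&
      ts.isPropertyAccessExpression(node.expression) &&
      node.expression.name.text === "isEnabled"
    ) {
      const arg = node.arguments[0];
      if (arg !== undefined && ts.isStringLiteral(arg) && arg.text === flagName) {
        const { line } = sf.getLineAndCharacterOfPosition(node.getStart());
        // Crude heuristic: the flag check is the entire `if` condition, so the
        // branch can be inlined or deleted without tracing cross-API dependencies.
        const easy = ts.isIfStatement(node.parent) && node.parent.expression === node;
        sites.push({ file: fileName, line: line + 1, easy });
      }
    }
    ts.forEachChild(node, visit);
  };

  visit(sf);
  return sites;
}
```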

[0:32:50] SF: Yeah. I mean, I think that, also, if 80% of the cases are relatively simple, it's a heavy lift on your resources to have sort of your best engineer doing the easy stuff, right? If you can alleviate that pain through automation, that's usually valuable. I think it comes down to intelligent deployment of automation. What can you do reliably essentially with this technology? And then for the things that are not reliable, rely on the human-in-the-loop expertise to go and do that work. 

In my PhD work, I worked on large-scale data integration problems. And this is years before large language models and so forth. But a lot of what I did there was bringing together essentially standard ML to solve the easy part of the integration problems while leveraging human-in-the-loop to do the hard part of it. I think these systems are highly valuable and they'll probably be here for a long time, sort of these human-in-the-loop AI systems. It's like centaur chess, essentially. How can I use AI plus a person to do something that is not possible for a single person to do? 

[0:33:51] BL: Yes, yes. And at Palantir, we used to call this human-computer symbiosis. I think that was something that Shyam, who is now, I think, the CTO, coined. And it's exactly right. The stuff that you can do with a combination of human plus computer is always going to far exceed what you can do with the computer alone or with the human alone. And I think that's the thing that maybe some people don't realize about AI. AI definitely moves the frontier of what the computer can do. But all that means is that you can just do more with a combination of human plus computer. It's not a zero-sum game. I think it grows the pie of what we're able to do as a species. 

[0:34:24] SF: I was having this conversation the other day that if suddenly you could have essentially billion dollar startups that were a single person because they were able to leverage all this AI to do all these different things. Well, that just means that you have a lot more companies doing amazing things. It's not like there's less companies just because you can do it with less people. 

[0:34:41] BL: It's like, today, if you could teleport them back in time, there'd be a billion-dollar company powered by one person. But in the future, they're going to be in a competitive landscape where if one person can do that, then so can 10 other one-person companies. And so it just means that on the consuming side, we're just going to get a lot more useful stuff. 

[0:35:02] SF: In terms of AI in coding today, a lot of it is - and this is true, I think, even outside of coding, but a lot of it is assistive technology, essentially. It's a co-pilot. There's still a person involved. How far away do you think we are from having fully autonomously written code that's a large part of our codebases? 

[0:35:19] BL: Oh, I think that's already happening. That's what we're working on right now in conjunction with these enterprise partners of ours. This is kind of like automated - actually, I forgot the final step of that rules-based system, which is you have the rules that are enforced in your editor and then at review time. And then it's kind of like, what do you do to keep the codebase in that state? 

And so our vision is just to have all these bots and agents in the background constantly doing things, updating to the latest version. Because a lot of these rules are not static. A lot of the rules are just like, use the latest stable version of the NPM packages, right? In order to keep that rule satisfied, you have to be constantly doing stuff in the background. And right now, you don't do that nearly as often as you should, because that takes time. And oftentimes, it's senior engineering time, which is probably the most valuable, non-fungible resource inside your organization. But if you can have a computer do it, you can just have a computer do that and fix whatever breaking changes happen. 

And over time, I actually think the vast majority of code written today kind of falls into that bucket where it's not really interesting code. It's not creative code, but it's just glue code, or it's configuration code that needs to be updated. And in fact, some of our customers with the highest percentage of AI-generated code are using it specifically for tasks like that, where it's updating configuration or updating these things that are kind of boilerplate-y or not that interesting, but still critical. They're on the critical path for keeping things clean, and secure, and stable. 

I think that world is kind of already here today in a lot of the organizations that we work with. I think what we'll start to see in this year, specifically, is much, much more automation within the editor. I think, now, the latest generation of LLMs - we're recording this on February 28th. And within the past month, there's already been a couple of new models that have dropped that I think substantially move the needle as to what they're capable of. 

And so I think in 2024, we were still very much in the human-in-the-loop-at-every-stage, code assistant mode. I think 2025 is when we kind of shift into, "Oh, now it's more like the LLM taking the driver's seat for a lot of these things." And then a human just needs to pop in every now and then when you need to do something that's very special, or specific, or "out of distribution" of what the LLM has seen in its training. 

[0:37:47] SF: How do you think this changes the nature of the junior engineers' job? If they're leveraging these tools primarily, how do they get to a place where they have sort of that senior level understanding? 

[0:37:58] BL: I think it definitely changes the path to getting there. I think there still is a path, which is, at the end of the day, you have to validate. You as a human pushing the code own the responsibility of ensuring that it's correct. How do you validate that? In the pre-AI world, you validate it through the process of writing it. Going through the process of thinking through how to write the code gives you a certain understanding and a certain confidence in the correctness. And in theory, you write unit tests too to validate that. But people do that infrequently. 

And unit test coverage is always like spotty, right? The pre-AI way of doing this is like, "Well, you wrote it. So your brain must have understood it at some point." We have a reasonable amount of confidence that it's like mostly correct. 

[0:38:37] SF: Yeah. And someone reviewed it. Sort of reviewed it anyway.

[0:38:40] BL: Yeah. Yeah. And then I think the post-AI world looks more like, "Well, the AI - I vibe this code into existence. The AI generated it. I don't really know how it works. But how do I validate that it does work at a sort of a gray box level?" Right? And so then, now, let me generate a unit test with the AI and have the AI explain what the unit test is testing for. And let me augment the test cases. 

I can actually generate a far more comprehensive test suite in a shorter amount of time now with an LLM than I could previously. I'm actually going to do that now. And if it finds bugs, maybe that's the point at which I kind of dive in the weeds a little bit and understand what's breaking and how to fix it. I think there still is - I'm not as pessimistic as some other people who are like, "Oh, the bottom rung of the ladder has been eliminated." I think there's still like a way to kind of like understand what's going on. You're still going to have to do that for some percentage of things because you have to verify the correctness. I just think that the bar for test coverage and correctness verification just gets higher now because higher is now feasible. And we'll just have a different way of learning code understanding capabilities. 
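
As a small, entirely hypothetical example of that workflow: say the model generated a `formatPrice` helper you didn't write yourself. The AI drafts the first test, and the "augment the test cases" step is where you push on the edges you care about. The helper, the file path, and the cases below are all invented for illustration.

```typescript
import { describe, it, expect } from "vitest";
import { formatPrice } from "./formatPrice"; // hypothetical AI-generated helper

describe("formatPrice", () => {
  // The happy-path case an AI draft typically starts with.
  it("formats a simple amount", () => {
    expect(formatPrice(19.99, "USD")).toBe("$19.99");
  });

  // Human-augmented edge cases: the gray-box validation of code you didn't write.
  it("pads to two decimal places", () => {
    expect(formatPrice(10.5, "USD")).toBe("$10.50");
  });

  it("handles zero", () => {
    expect(formatPrice(0, "USD")).toBe("$0.00");
  });

  it("rejects negative amounts", () => {
    expect(() => formatPrice(-5, "USD")).toThrow();
  });
});
```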

I do think that the next generation of programmers is going to be much less good at what I call line smithing, which is writing new code from scratch line-by-line. The more important skill moving forward will just be thinking about it at a higher level. At least at the function level: what are the inputs and outputs I want to verify? And how do I test this properly? 

[0:40:05] SF: Yeah. I mean, it probably changes the level of abstraction. Basically, you're working at a higher level of abstraction earlier in your career than maybe you are now. And that's been going on for a long time. Even as you move to higher-level languages, people were arguing about assembly versus C, C versus Java, Java versus Python, and what you're giving up by using these sorts of higher-level languages. And we've been having that debate for 50 years. 

In a lot of ways, it's a bit of a step function, but it's kind of like the next natural evolution of that is where now the coding interface becomes natural language. It shifts the job to how do I validate that this is correct? How do I stitch these boxes together in a way that makes sense that's going to be scalable and all this sort of stuff? 

[0:40:48] BL: Yeah. What I now just analogize it to is like, when I was in grade school, they still taught us how to navigate the Dewey Decimal System in the library, right? Because like in those days, that was a critical skill for any knowledge worker to have. At some point, you're gonna have to read a book to acquire the knowledge and that book is going to be buried inside some vast, large building that you're going to have to like go to. And then you're going to have to like go through all these index cards. And it's going to take you a while to find that like one book. And these days, it's just who does that anymore? You just go to Google or ChatGPT and type in your question and you get an answer. 

[0:41:21] SF: Yeah. Amazingly, this is the second podcast I've done in the last month where the Dewey Decimal System has come up. I haven't thought about it since I was probably eight years old. 

[0:41:29] BL: Yeah. I mean, it's like anyone who went to grade school in the '80s or '90s, that's probably your pre-training data, right? 

[0:41:36] SF: That was also, funny enough, someone who had previously worked at Palantir. Palantir is incepting the idea of the Dewey Decimal System, bringing it back. Well, we're getting close on time here. Is there anything else you'd like to share? 

[0:41:47] BL: No. I mean, the only thing I'd like to say to folks listening is, we're Sourcegraph. We build developer tools for massive production codebases. I think we are well-positioned to kind of define how software is built in the next era. We want to solve this problem that has plagued software development since its inception. It's like a Mythical Man-Month-level problem. If that's of interest to you, we are hiring. And if you are one of those engineers toiling away inside a large, messy codebase, give us a look. We always love hearing from people who have challenges or pain points that we might be able to solve. 

[0:42:23] SF: Awesome. Well, Beyang, thanks so much for being here. 

[0:42:26] BL: Cheers. Thanks, Sean. 

[END]