EPISODE 1844

[INTRO]

[0:00:00] ANNOUNCER: One of the most immediate and high-impact applications of LLMs has been in software development. The models can significantly accelerate code writing, but with that increased velocity comes a greater need for thoughtful, scalable approaches to code review. Integrating AI into the development workflow requires rethinking how to ensure quality, security, and maintainability at scale.

CodeRabbit is a startup that brings generative AI into the code review process. It evaluates code quality and security directly within tools like GitHub and VS Code, acting as an AI reviewer that complements existing CI/CD pipelines. Harjot Gill is the founder and CEO of CodeRabbit. He joins the podcast with Kevin Ball to discuss CodeRabbit's architecture, its multi-model LLM strategy, how it tracks the reasoning trail of agents, managing context windows, lessons from bootstrapping the company, and much more.

Kevin Ball, or KBall, is the vice president of engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc.

[EPISODE]

[0:01:31] KB: Harjot, welcome to the show.

[0:01:34] HG: Thanks, Kevin.

[0:01:35] KB: Yeah, I'm excited to dig in with you. I'm really excited about what you guys are doing, but let's maybe start with that. So, can you give our audience a little bit of a background on you and on CodeRabbit?

[0:01:45] HG: Yeah, that's great. So, I'm Harjot Gill, and I'm co-founder and CEO of CodeRabbit, which is a startup using generative AI for code reviews, essentially code quality and code security, for users on popular Git platforms like GitHub and GitLab. The company's roughly a couple of years old, but has grown tremendously, non-linearly pretty much. In the last couple of years, we've reached 100,000 developers who are using this platform on a daily basis. It's a pretty popular product, loved by developers across all the industry segments and so on.

[0:02:18] KB: Awesome. So, let's first look at this from a user standpoint. What does this look like? And then, I will be excited to dive under the covers and dig into how CodeRabbit works. But for me as a developer, if I want to use CodeRabbit, what do I do, and what does it look like?

[0:02:33] HG: Right. So, CodeRabbit is a tool that is a nice complement to a lot of the code generation tools which are out there on the market. As you know, a lot of developers are now familiar with Cursor, GitHub Copilot, Windsurf, and so on. They're now using AI to generate a lot of their code, and we know that AI-generated code has a lot of deficiencies in terms of maintainability, and sometimes there are just sloppy errors that AI makes. So, now you've got to bring in AI to review AI, because review is becoming a bottleneck.

So, to consume CodeRabbit, there are a couple of ways. The product primarily works inside your pull request workflow. Essentially, once you are done with your feature branch, you open a pull request before it gets merged into the mainline and gets shipped out to the end customers. That's typically where all the code reviews happen, the human reviews, and where a lot of the static analysis tools that you are running, like linters and unit tests, run.
Your entire CI/CD pipeline runs over there. CodeRabbit sits alongside those tools and uses AI to perform code reviews. Very recently, around a couple of weeks back, we also released a VS Code extension that also works with the forks of VS Code like Cursor and Windsurf, so that developers can review the code before they even push it to the remote Git branch.

[0:03:50] KB: Okay, cool. So then, let's look at what that looks like on the implementation side, because I think one of the things that I've certainly run into with GenAI is naive application of the models. These models are very powerful. They can do a lot of cool stuff, but as you highlight, they get a lot of things wrong. So, figuring out how you feed them the right context and put all those things in place is very important. So, can you maybe walk us through, I guess first, what is the architecture for CodeRabbit behind the scenes?

[0:04:18] HG: I will start by contrasting how different code generation is from code review, and then we'll probably go deeper into how CodeRabbit makes it all work. If you look at code generation, it all started with a lot of these tab-completion-style use cases, auto-complete. Typically, you will see usage of small, low-latency models. So, as you type, you have these suggestions show up in ghost text that you can press tab to complete, right? More sophisticated approaches will use some sort of a vector database to index your code so that you get more relevant suggestions based on your data structures or coding patterns that you're using.

On the other hand, code review is a problem that requires very, very deep reasoning. The workflow that CodeRabbit is sitting on is latency-insensitive, because you're running it in the CI/CD pipeline. That workflow can typically take several minutes to complete. So, a tool like CodeRabbit has to be a lot more thorough in terms of its analysis in order to make it actually work. CodeRabbit, believe it or not, is actually one of the biggest consumers of the reasoning models in the world right now. We're one of the big users of o3 and o4-mini.

That's part of the magic that makes it work, and of course, the workflow we've built around it, how we bring in the relevant context, right? The workflow basically triggers as soon as you open a pull request. So, the context naturally comes from the payload of that pull request, what the diff looks like. Then, you're also bringing in context from the remaining code base, the code graph. So, we understand the impact that code would have on the dependencies that you're using in the code, like other functions which are not even changed, but are now depending on the code that you're changing.

Building the code graph is also pretty critical in terms of context. The other context comes from the Jira or Linear issues that you are trying to solve through that pull request. So, usually, there's some product knowledge or some knowledge about the bug that you're trying to solve coming from the issue systems. Then, a lot of context is coming from past learnings, because CodeRabbit is a very collaborative product. It's a product that people consume at a team level. The way you train CodeRabbit is by chatting with it. So, the more you talk to CodeRabbit, the better it gets over time. So, those learnings that it has learned over user interactions from previous reviews also get pulled in.
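For illustration, here is a minimal sketch of what assembling that kind of multi-source review context could look like. Every name in it (ReviewContext, to_prompt, the placeholder values) is hypothetical, a sketch of the general idea rather than CodeRabbit's actual implementation, and a real system would summarize and trim each piece before prompting a model.

```python
from dataclasses import dataclass

@dataclass
class ReviewContext:
    """Hints handed to the review agent before it starts digging on its own."""
    diff: str                    # raw diff of the pull request
    impacted_symbols: list[str]  # callers/callees pulled from a code graph
    linked_issue: str            # Jira/Linear issue text, if any
    past_learnings: list[str]    # team preferences learned from earlier reviews

def to_prompt(ctx: ReviewContext) -> str:
    """Flatten the context into a single prompt for the reviewing model."""
    return "\n\n".join([
        "## Pull request diff\n" + ctx.diff,
        "## Possibly impacted symbols\n" + "\n".join(ctx.impacted_symbols),
        "## Linked issue\n" + ctx.linked_issue,
        "## Team learnings from past reviews\n" + "\n".join(ctx.past_learnings),
    ])

# Usage with placeholder values standing in for real integrations:
ctx = ReviewContext(
    diff="- retries = 3\n+ retries = 0",
    impacted_symbols=["fetch_with_retry() in http/client.py"],
    linked_issue="BUG-142: intermittent timeouts on checkout",
    past_learnings=["Team prefers exponential backoff over fixed retries."],
)
print(to_prompt(ctx))
```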
These are some of the examples, 10 to 15 different data points that we pull in as context. But it's not sufficient, actually. That's the thing. As you know, these models have very, very limited context windows. Even though we are seeing these context windows expand to a million tokens or so, it's still not effective, because you basically lose the quality of inference as you try to stuff in more context. It's great for summarization, but when you're talking about deep reasoning, you can't really use all that context, right? What we try to do is give the CodeRabbit agent enough hints so that it can get a basic bearing on what's happening in the pull request, where things are going directionally, the trajectory of these changes, and so on.

Then, what we are doing, which is a cool thing and which is so differentiated right now, is we create all these sandbox environments in the cloud. So, we actually do create sandboxes. We clone the repository. Then, we let the AI run an agentic loop to navigate that code base. So, we let the AI run CLI commands, like shell scripts. It can run keyword searches. It can go and read additional files and bring additional data points into the context. It can even run ast-grep queries, abstract syntax tree queries, to read entire functions and bring them into its context. Then, it continues with its analysis in order to validate a bug.

One of the stages of the reasoning process is, okay, it looks like there might be an issue if you're going to change this, but can I go and validate if it's really an issue? So, it's a combination of preloading some context and then giving the agent enough agency to go and find missing information. It even runs web queries, because sometimes you have knowledge cutoff issues. I mean, these models have been trained in the past. Sometimes you have a 2023 cutoff, a 2022 cutoff, which is kind of bad for the coding use cases, because a lot of these libraries and frameworks are constantly evolving.

In a lot of these cases, we try to bring in the context from doing internet searches. So, sometimes it will say, "Okay, this is a new syntax that we are looking at. Is this syntax something that's really out there, or is it incorrect?" So, you will sometimes see CodeRabbit do a web query to confirm against the latest documentation.

[0:08:48] KB: That is fascinating, and I'd love to dive into some of those pieces. So first off, you said you start with the diff and build the code graph from there. Is that something that you are mediating through an LLM, or do you have a static analysis that you're doing, or how do you build that code graph for what's likely to be impacted?

[0:09:06] HG: It's a combination of both, actually. So, that's the nice thing. There's a lot of abstract syntax tree analysis and understanding the relationships. I mean, you're familiar with language server protocols, LSPs; it's kind of similar to what we are doing there. But it's our own proprietary implementation, so not exactly like LSPs, but somewhere in the middle in terms of the memory footprint and everything we need to build that code graph. It's all being done on demand. It's not being pre-indexed like Sourcegraph or something. We just create this live as we're doing the analysis.

The other part is, the large language models are able to then further understand the relevance of that code graph. I mean, a lot of things can be references and dependencies, but which are really relevant for understanding the diff in code review?
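As a rough illustration of the agentic-navigation pattern described here, the sketch below lets a model alternate between proposing read-only shell commands (cat, rg, ast-grep) against a cloned repo and reasoning over their output. The llm callable, the JSON reply format, and the command whitelist are all assumptions for the sake of the example, not CodeRabbit's actual protocol.

```python
import json
import subprocess

ALLOWED = {"cat", "ls", "rg", "ast-grep"}  # read-only commands the agent may run

def run_in_repo(cmd: list[str], repo_dir: str) -> str:
    """Run one whitelisted command inside the cloned repository and capture its output."""
    if not cmd or cmd[0] not in ALLOWED:
        return f"command not allowed: {cmd!r}"
    result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True, timeout=30)
    return (result.stdout + result.stderr)[:8000]  # truncate to keep the context small

def review_with_navigation(diff: str, repo_dir: str, llm, max_steps: int = 10) -> str:
    """Alternate between model-proposed commands and analysis until a verdict is reached."""
    transcript = [f"Diff under review:\n{diff}"]
    instruction = '\nReply with JSON: {"run": ["cmd", "arg", ...]} or {"verdict": "..."}'
    for _ in range(max_steps):
        reply = json.loads(llm("\n\n".join(transcript) + instruction))
        if "verdict" in reply:
            return reply["verdict"]          # the agent decided it has seen enough
        output = run_in_repo(reply["run"], repo_dir)
        transcript.append(f"$ {' '.join(reply['run'])}\n{output}")
    return "No conclusion within the step budget."
```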
So, there's a lot of cleanup on the context as well happening before we trigger some of the more expensive reasoning models.

[0:09:54] KB: That's interesting. So, could you walk me through, maybe, what is the pipeline of steps that you go through? It sounds like there's some amount of static analysis, there's some amount of cleanup with cheaper models, there's some amount of then these expensive reasoning models. Maybe not in full detail, no secrets here, but kind of what are the different types of steps involved, and how do you think about sequencing them?

[0:10:15] HG: Yes. I mean, we have written about it as well. When we started the company, there were a couple of initial blog posts on how CodeRabbit works, and what makes it both cheap and good at the same time, which is hard to engineer in the world of AI. So, one of the things that we do really well is understanding the context. It's not like tools like Cursor where you're picking a model, and then you're running with that model for your entire flow. CodeRabbit is an ensemble of models. We don't even expose which models we are using to the end customers. Sometimes people ask which models we're using, can they choose the models? We don't let them, because they'll most likely make a mistake in picking the right model for the use case we have. So, our team does a lot of work behind the scenes to pick the right kind of model for the different parts of our pipeline and the workload we have. So, we use like seven or eight models, depending on which one's a good fit for which part of the workflow.

A lot of the context preparation is where we use cheaper, faster models, like GPT-4.1 nano or GPT-4.1 mini. Those are kind of the big workhorses. They're dirt cheap, but we still spend a significant amount of money on them, given how much volume we are running through those models. They do all sorts of tasks, from summarizing large contexts like entire files, and previous issues, and so on. So, there's a lot of summarization that goes in before we even get into the actual code review workflow, right? So, there are multiple steps. There's a whole setup process where we're creating a sandbox. We are running a lot of the static analysis tools in them.

So, there's a lot of context being pulled in from your existing tooling. Basically, we go and identify what kind of tooling you have set up on your repository. Let's say you are using ESLint, we will go and detect that. We will use the existing configuration that your DevOps team might have set up. Sometimes people use golangci-lint. So, we pick up all these tools, and we run them for you. That's one of the contexts we bring in.

Then, there's a lot of context we bring in from your CI/CD failures. That's another place where we use large language models to understand your failure logs. So, if you have a build failure or a unit test failure, we understand exactly what happened there. And that context is also used during code review, so that we can provide remediation, one-click fixes for those steps. So, yes, as I said, like seven or eight models for different use cases, for chat. There's a different model for some of the agent verification flows that we run. Those are different reasoning models and so on.
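A tiny sketch of the kind of stage-to-model routing being described: each pipeline stage is pinned to the cheapest model that is good enough for it, and only the core analysis pays for a heavy reasoning model. The table below is purely illustrative, using model names mentioned in the conversation; it is not a statement of what CodeRabbit actually routes where.

```python
# Hypothetical routing table: pipeline stage -> model tier.
MODEL_FOR_STAGE = {
    "summarize_file":    "gpt-4.1-nano",  # high-volume, shallow work -> cheapest model
    "summarize_ci_logs": "gpt-4.1-mini",
    "plan_review":       "o4-mini",       # needs some reasoning, still moderate cost
    "deep_review":       "o3",            # expensive reasoning, reserved for the core pass
    "chat_reply":        "gpt-4.1-mini",
}

def complete(stage: str, prompt: str, llm_call) -> str:
    """Route a request to the model configured for its pipeline stage.

    `llm_call(model=..., prompt=...)` is a hypothetical provider wrapper.
    """
    return llm_call(model=MODEL_FOR_STAGE[stage], prompt=prompt)

# Usage: complete("summarize_ci_logs", failure_log, llm_call=my_provider_call)
```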
[0:12:43] KB: Cool. You mentioned a lot of this is being done on demand, but you also said, "Hey, you can train CodeRabbit, it will incorporate past learnings based on conversations or things that you've done in there." So, it sounds to me like there's some sort of summarization or indexing that you're doing of previous PRs that gets fed in at some layer. What does that piece look like?

[0:13:04] HG: That's right. I mean, this is where we have a very different indexing system, similar in some ways, but different in many ways, of the entire code base. So, we do look at the entire code base and at what got merged over the last thousand, 2,000 commits. I mean, that's how the system works. There, we are indexing not just the code snippets; we also understand that, okay, these are the relevant code snippets. That's how everyone's been doing it: they use abstract syntax trees, tree-sitter grammar rules, to extract out the relevant snippets and index them.

One of the unique things we also do on top of that is we also convert those snippets into docstrings, like natural-language documentation, because a lot of the user queries are in natural language. When you're doing code completion, your similarity search happens on the code snippets themselves, so you have a good match in the vector DB. But when you're going into a lot more agentic use cases like CodeRabbit is, the input is a natural language query, right? So, you have a better match when you're converting code into a natural language representation or a summary of it. So, we do a lot of that at scale.

[0:14:04] KB: That makes a lot of sense. Then, I presume you expose it to your agent. Here's a query framework: you do a natural language query and load up whatever might be relevant.

[0:14:12] HG: That's right. We're bringing in knowledge from the code graph. We're bringing in knowledge from the code base index we have created. It's a very different kind of an indexer than what people have been doing in the space. A lot of that context is also shown to the user, so that people can trust the AI, because the AI is known to hallucinate. So, one of the ways you build trust is to also show the context and how that insight was bubbled up, like what led to that review comment or conclusion. So, all that helps in making a great user experience.

[0:14:42] KB: Let's maybe talk about that exposing piece, because I think that is key for any of these LLM-driven applications, giving you the paper trail of, how did this get here? Why is this here? So, I can, as a human, validate it and detect those hallucinations and things. When you have this long pipeline of context that you're loading in, you mentioned a bunch of different steps from a bunch of different sources. How do you keep track through the process to be able to bubble up the right sets of relevant context?

[0:15:11] HG: Right. So, it's all in the UX. When you are posting these review comments, sometimes we will show what kind of additional context was used to bring up that insight. Sometimes it's just pure LLM logic, like there's no additional context. It's just an issue that was detected on a surface level. But sometimes, it's deep inspection of the code base. Sometimes, the agent will go and read additional files in the repository.
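Stepping back to the docstring-style indexing described a moment ago, a minimal sketch of the idea might look like this: summarize each snippet into natural language with a cheap model, embed the summary, and search against those summaries with natural-language queries. The summarize and embed callables are hypothetical stand-ins, not CodeRabbit's actual indexer.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def index_snippets(snippets: list[str], summarize, embed) -> list[dict]:
    """Store each code snippet alongside an embedding of its natural-language summary."""
    index = []
    for code in snippets:
        doc = summarize("Describe in one sentence what this code does:\n" + code)
        index.append({"code": code, "doc": doc, "vec": embed(doc)})
    return index

def search(query: str, index: list[dict], embed, k: int = 5) -> list[str]:
    """A natural-language query is matched against the summaries, not raw code tokens."""
    qv = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(qv, entry["vec"]), reverse=True)
    return [entry["code"] for entry in ranked[:k]]
```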
So, you can see an analysis chain in the CodeRabbit comment sometimes. If you open that chain, you will see all the thought process and the paper trail, as you said, what kind of commands were executed to come up with a certain insight, and then it will pinpoint the files and locations, even for files that have not changed in the pull request.

So, I mean, it will also bring up insights from your remaining code base. But you can go back and follow the paper trail. And if it ever went off track, you know exactly why it went off track. Then, you can chat with CodeRabbit and explain why its analysis is correct or incorrect. And if it is incorrect, it will remember it for next time.

[0:16:05] KB: That's super cool. So essentially, just to make sure I'm understanding, your agent is outputting its logs of what it's doing, which includes both LLM reasoning and tool calls off in different places, and then the results of what those tool calls are. And you keep track of that and just bubble it up straight to the UI for someone to be able to explore.

[0:16:22] HG: That's right. We think it helps a lot. Actually, we were one of the first companies to pioneer this whole sandbox and CLI approach. Now, we see this becoming commonplace, Codex came out, and all. But CodeRabbit's been doing this for the last two years, since the days of GPT-4. We were the first ones to actually find out that a lot of code-base navigation is a great way of finding issues versus doing pure RAG. Everyone else was prioritizing a lot of code-base indexing. I know code-base indexing helps, but a lot of what makes CodeRabbit work comes from this code-base navigation that happens ad hoc using shell scripts.

[0:16:56] KB: Yes. That's super interesting. I think it is something that we've started to see in a lot of more recent agents of, "Hey, let's just expose programming tools, essentially, to agents and let them figure out the right way to apply that."

[0:17:09] HG: Even today, CodeRabbit doesn't use tool calls. Some people think we use tool calls; we actually don't. I mean, the entire system is based on CLI commands. So, we actually generate code; instead of doing tool calls, we have a sandbox and a CLI, and that's all you need. That's the only tool you need, actually. You don't even need MCPs. Even to open GitHub issues, we use a CLI command, the GitHub CLI, to open GitHub issues. We don't actually use MCPs, because we don't have to when all the tools are available over the CLI.

[0:17:37] KB: That is fascinating. Let's dive into that a little bit more. I think one of the concerns about giving the LLM full access to any sort of code is, how do you sandbox it properly? How do you decide what is in and what is out? So, how do you think about that sandbox, especially if you're giving it web access and access to GitHub systems and things like that?

[0:17:59] HG: Yes, I mean, these are standard techniques on sandboxing; people have been doing it in the past for many use cases. People have been doing dev environments, preview environments. So, CodeRabbit is in a lot of ways standing on the shoulders of giants. I mean, there's some proprietary stuff we have done to make it fast and cheap. We are kind of running these sandboxes at scale while also being very cost-effective in doing so. But yes, I wouldn't say there's any big secret sauce in how containerization, cgroups, and all those things work. I mean, those are standard systems techniques.
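For a concrete picture of the kind of standard containerization being referenced, here is a minimal, hypothetical sketch of running one agent-proposed command in a throwaway container with the repository mounted read-only and basic resource limits. It is a generic illustration of the approach, not CodeRabbit's infrastructure.

```python
import subprocess

def run_in_container(command: str, repo_path: str, image: str = "ubuntu:24.04") -> str:
    """Execute one shell command in an ephemeral container over a read-only repo mount.

    CPU, memory, and process-count limits lean on cgroups via standard docker flags.
    Outbound network access is left enabled in this sketch; isolation comes from the
    resource limits and from the container having no credentials for internal systems.
    """
    docker_cmd = [
        "docker", "run", "--rm",
        "--cpus", "1", "--memory", "512m", "--pids-limit", "256",
        "-v", f"{repo_path}:/repo:ro", "-w", "/repo",
        image, "sh", "-c", command,
    ]
    result = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

# Usage: run_in_container("rg -n 'retry' src/", "/tmp/cloned-repo")
```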
But the main thing is, how do you further block off access? In our case, we don't block off internet access, because that's something we feel the agent should have. Sometimes it will also make curl commands. It will sometimes want to use the GitHub CLI to read other PRs and so on. So, we don't restrict any internet access, but at the same time, we do want to make sure our cloud services are protected. It doesn't have access to our internal systems and so on.

[0:18:55] KB: That makes sense. Do you list out for it, for example, what sets of CLI tools or what permissions or access it has? Like the GitHub CLI, presumably, you have to give it a token to be able to access the appropriate place and things like that.

[0:19:09] HG: Yes. The GitHub CLI has a nice way to authenticate. So, we just provide the token once, and then the CLI works like that. That token is kept in a secure vault and is provided to the GitHub CLI. The main thing is we don't actually have to give the AI a lot of information on these tools, because it's already in the training data. It understands the sed command, the cat command, ripgrep. I mean, those tools are well-known, well-understood by AI. So, there's not a lot of hand-holding in making it understand the schema of these tools, because it's just shell scripts. It's trained on that.

We do explain the scenarios in which certain tools might be handy. So, we try to influence the behavior in some ways on when it can run certain commands. For example, if you're doing a package.json update, let's say, go and read the vulnerability database GitHub has to see if these packages are not out of date. It does that, and it's pretty effective, actually. So, each time you see a package.json change, you will see the agent making a call to GitHub's open vulnerability database to detect whether these Python packages or Ruby packages have any vulnerabilities.

[0:20:13] KB: Yes, I like that a lot. So, in terms of scenarios, and I'm going to explore this because I think, as you highlight, you are one of the most successful examples of these agents in the wild, but it is a technique a lot of people are trying to figure out and explore. So, can you give us a ballpark? Are we talking tens of scenarios? Are we talking hundreds? What does this look like?

[0:20:34] HG: Yes, they are in the order of more than tens, for sure, from what I recall. This is all coming from tribal knowledge. If you are an engineering leader or a good engineer yourself, you're kind of taking what you know best and then programming that as a prompt, taking your own knowledge in many cases. A lot of times, we are just learning from the sheer number of customers we have. One of the reasons why CodeRabbit improved a lot is because we have a lot of open-source usage. That is a great feedback loop. Every few seconds, we review some pull request in open source. A lot of people interact with CodeRabbit, and we kind of observe what they're doing in those pull requests. Some of that behavior goes back into training our agent.
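As an illustration of scenario-style guidance like the package.json example, the sketch below maps changed-file patterns to extra hints that get appended to the review prompt only when they apply. The table and helper are hypothetical; the point is that guidance is loaded conditionally rather than stuffed into one giant base prompt.

```python
import fnmatch

# Hypothetical scenario table: filename pattern -> guidance for the reviewer agent.
SCENARIOS = [
    ("package.json", "A dependency manifest changed: check the new versions against "
                     "GitHub's advisory database before approving."),
    ("*.sql",        "A migration changed: look for missing indexes and irreversible steps."),
    ("Dockerfile",   "A build image changed: check for unpinned base images."),
]

def guidance_for(changed_files: list[str]) -> list[str]:
    """Return only the scenario hints that are relevant to this pull request."""
    hints = []
    for pattern, hint in SCENARIOS:
        if any(fnmatch.fnmatch(path.rsplit("/", 1)[-1], pattern) for path in changed_files):
            hints.append(hint)
    return hints

# Only the package.json hint fires for this change set:
print(guidance_for(["backend/package.json", "src/app.ts"]))
```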
[0:21:16] KB: Yes, that makes a ton of sense. We talked a little bit earlier about the challenges of prompt stuffing when you've got these big context windows and too much is in there. Is the number of scenarios still small enough that it all goes into the base agent prompt? Or do you do some sort of dynamic loading, or figuring out what are likely relevant scenarios at any particular time?

[0:21:37] HG: Yes, that's right. I mean, it's the latter. First of all, we're using multiple models, as I said; there's not a single base agent prompt. It's not like the agentic loop that everyone else has. I mean, it's a pipeline, in a way. And a lot of the work goes into preparing the context, actually; a lot of the money is actually spent there. Because one of the things with the reasoning models is that these models get thrown off track very, very quickly if you're doing RAG and just stuffing in the context without cleaning it up first or re-ranking it. These models tend to go completely off track and haywire, as opposed to non-reasoning models. They overthink.

That is one of the reasons why some companies struggled when Sonnet 3.7 came out. Sonnet 3.5 was working really well for a lot of the coding companies, but when 3.7 came out, they had no clue what happened, what hit them. We were prepared. One of the good things is that we were built with the reasoning models in mind from day one. In fact, even before reasoning models came out, we had a lot of internal reasoning process. We had a lot of stages which were just doing internal monologues and reasoning.

We always benefit each time a new reasoning model comes out. So, there are no big changes to our system, but some companies had to fundamentally rethink how they were doing their prompting with the reasoning models.

[0:22:48] KB: Yes, that makes a lot of sense. So, let's maybe break down the agentic loop a little bit, because I think for a lot of people building agents right now, it's essentially one big system prompt, and tool calls, and a loop around it. You said, for yours, it's more of a pipeline; you have this more dynamic set of things. So, how do you think about the design of your agent?

[0:23:06] HG: Yeah, I mean, we work on large, complex code bases, so a single loop doesn't work for us. We have to figure out how we have a main agent that figures out what kind of things it has to do, and then there's delegation happening. There's a lot of complexity over there as well in how we break up the work. There's a whole task-tracking system where you have a main root task breaking up into subtasks. That's how we do it. We divide and conquer, essentially, the problem with agents. The results bubble up, and the visibility bubbles up. That's how it works effectively on large code bases.

A lot of that is proprietary. It's not like we're using any framework, LangChain or something; it's all in-house. Going back, it is a loop, but the trick with these systems is also making sure that the AI, the large language model, saw the right context. Sometimes, with shell scripts, you know that the quality of the output won't be high enough for you to make a good judgment. So sometimes, there's a lot of suppression happening. Even though the AI would say, "Okay, it looks like there's a bug," you know that it didn't see the relevant context, so this might not be a high-quality inference, and we will just hide it rather than bubble up a lot of noise.

So, we do a lot of cleanup, even on this agentic loop; it's not a pass-through to the user. There's a lot more understanding in our system of what quality of context is going into the pipeline, so that we know whether the decision or the inference we are getting at the end of the day is going to be high quality, or whether we can even trust it.

[0:24:29] KB: Yes, that makes a lot of sense.
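A bare-bones sketch of the divide-and-conquer pattern being described: a root task is split into subtasks, leaves are executed, and findings bubble back up the graph. The plan and execute callables are hypothetical LLM-backed helpers, and a real system would add the suppression and quality checks discussed above.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node in a dynamic review task graph."""
    description: str
    subtasks: list["Task"] = field(default_factory=list)
    result: str = ""

def run_task(task: Task, plan, execute, depth: int = 0, max_depth: int = 3) -> str:
    """Plan subtasks, recurse into them, then roll the findings up to the parent.

    `plan(description)` returns a list of smaller task descriptions (possibly empty,
    meaning the task is small enough to do directly); `execute(description)` performs
    a leaf task. Both are stand-ins for model-backed calls.
    """
    if depth < max_depth:
        task.subtasks = [Task(d) for d in plan(task.description)]
    if not task.subtasks:
        task.result = execute(task.description)
        return task.result
    findings = [run_task(sub, plan, execute, depth + 1, max_depth) for sub in task.subtasks]
    task.result = "\n".join(findings)  # results bubble up toward the root task
    return task.result
```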
[0:24:29] HG: One of the examples is that a lack of output doesn't mean there's a bug. Sometimes you will run a find search on a file, and you won't find that file, which probably means you're looking in the wrong place rather than that file not existing. So, those kinds of scenarios you have to account for. There are many such scenarios, by the way.

[0:24:47] KB: Yes. So, let me make sure once again that I'm understanding. So essentially, you have a top level and it breaks things down into a task graph. It says, essentially, here's the set of things that I think we need to do to dig into this. Then, it delegates those tasks to subagents in some form, which go and do work. Then, as they complete, it kind of bubbles up through the graph to the top-level agent.

[0:25:09] HG: That's right. This task graph is dynamic, as you can guess. I mean, it's figured out by the AI. So, there's a system that figures out what the tasks should be.

[0:25:18] KB: Now, thinking about those tasks, are they fully dynamic? Do you pre-define classes of tasks? Does that connect to how you decide what's going to be relevant context and how high quality it's likely to be, or is it completely driven by the LLM?

[0:25:31] HG: It's a hybrid system. We do know the nature of these tasks, because we let the AI choose what kind of task it's running, and then we know what these tasks should look like when we run them. But the graph itself is dynamic to a large extent. I mean, there's a pipeline; it's a hybrid architecture. There are some pipeline stages which are always hard-coded in the system. These steps have to happen. But then, we give enough freedom to this agent to go and find stuff as well and plan around it.

What we found is that planning is a big part of the quality. The more you plan, the more you give it the agency to first go and navigate the code, that usually yields high-quality outcomes at the end of the day, rather than just rushing into doing something or concluding something. You want to let the AI follow multiple chains of thought, and some of them could lead to a dead end, but that's fine. Maybe four out of five doors were closed, but one of the doors leads to some interesting insight.

[0:26:21] KB: Yes, this is all connecting for me, because as you build out those tasks, they have a classification. That's going to help with what we talked about in terms of picking what are the relevant scenarios to load into the context for that sub-agent to decide what it might check or do. The filtering that you talked about, is that also done kind of agentically by the LLM, where it's judging quality, or do you have some sort of static analysis in there in some form as well?

[0:26:43] HG: Yeah, it's mostly LLM-driven, I would say. There's some static stuff; as I said, we know exactly, okay, these models did not see the relevant context, so it's sometimes very easy to figure that out from the quality of the commands it's running, and the outputs. But in many cases, the validation is done by another kind of judge LLM, which is running online, and which is also able to decide whether the result so far has been accurate or not.

[0:27:08] KB: That makes sense. Then, in terms of what you mentioned about adapting to inputs, as things come back, I assume different layers have the ability to say, "Oh, that was a dead end. Go try this. Let's replan. Let's restructure this as you go."
How do you limit the extent of that, or decide when you're done?

[0:27:26] HG: It's an arbitrary number, I don't know, 10 levels deep. I mean, when it's done, it will just say it's done. But sometimes, we have to have - it's the stack depth problem, like the maximum stack depth you want to allow, and it's a cost thing. Now, I don't remember what that constant is right now. Maybe it was five or 10, something like that. We picked a number and said, "Okay, this is the deepest we want to go into the rabbit hole."

[0:27:44] KB: That makes sense.

[0:27:45] HG: These things tend to loop around, especially the earlier models. There was a lot of this looping behavior where they would go and check the same thing again and again.

[0:27:53] KB: Well, cost does bring up an interesting question. I have a co-worker who is way down in agent land and exploring all sorts of different agents and trying different things. But they tick up in cost pretty quickly if you just let them run. So, you mentioned you've done a lot to try to control costs and keep this contained. How do you approach that?

[0:28:12] HG: It's multiple things, right? One is, the reason we use a lot of the cheaper models is the cost. Yes, you could use an expensive model for everything, even summarization, but that doesn't make sense. It's orders of magnitude more expensive, right? For example, o3 is like five times more expensive than Sonnet. And Sonnet is orders of magnitude more expensive than 4.1 mini or something. So, a big part is mapping the workload to the right model, so that you get the best price-to-performance ratio for the workload that you have in mind.

The other factor is being smart, especially about the incremental thing. One of the things that people love about CodeRabbit is the incremental review. It will remember where we left the review last time, and next time, when it resumes, it will first see whether it really has to review something or not, whether it's a trivial change. Can I skip it? So, we have a lot of prompts that are actually just figuring out whether we need to even do a deeper analysis or just approve it.

[0:29:04] KB: There's like a short circuit, basically.

[0:29:05] HG: There is a short circuit. So far, no one has noticed or complained, because sometimes we do skip. The quality on that has been really high; at least the decisions we have been making there have been very high quality. The other part has been rate limits. You would sometimes see on Twitter people complain, "CodeRabbit has rate limits," but that's one of the ways we control the reviews so that it's fair. Unlike a lot of the AI companies, which are now going into consumption pricing, like you would see with agentic companies - Cursor, for example, has a max mode which is, I was reading the documentation, a 20% markup over the API cost.

So, you're passing on the Sonnet cost, the Gemini cost, to the end user. CodeRabbit, on the other hand, has per-seat pricing. It's all you can eat. But the way we sustain as a business at scale is through a lot of these techniques on the LLM side and rate limits. We have a lot stricter rate limits for our open-source plan versus relaxed rate limits for our paid users and different plans.

[0:30:00] KB: Yes, that makes sense.
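The short-circuit idea lends itself to a small sketch: a cheap model looks at the incremental diff since the last review and decides whether the expensive reasoning pass can be skipped. The cheap_llm and full_review callables are hypothetical; anything ambiguous should fall through to a full review.

```python
def needs_full_review(incremental_diff: str, cheap_llm) -> bool:
    """Ask a low-cost model whether the change since the last review is trivial.

    The skip only happens when the model answers exactly TRIVIAL, so ambiguous
    replies still trigger the expensive pipeline.
    """
    verdict = cheap_llm(
        "Does this incremental diff need a full code review, or is it a trivial "
        "cosmetic change (typos, comments, formatting)? Answer exactly TRIVIAL "
        "or REVIEW.\n\n" + incremental_diff
    ).strip().upper()
    return verdict != "TRIVIAL"

def on_push(incremental_diff: str, cheap_llm, full_review) -> str:
    if needs_full_review(incremental_diff, cheap_llm):
        return full_review(incremental_diff)  # expensive reasoning pipeline
    return "Change looks trivial since the last review; skipping deep analysis."
```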
What would you say some of the most challenging technical areas of building out CodeRabbit have been, and how have you addressed them?

[0:30:11] HG: It's been fun. It's been a different kind of project. It's my third startup now, so a very different flavor than the previous two that I did. The earlier ones were in observability and infrastructure, the cloud infra space, reliability management. This has been a very different kind of product, where we had to unlearn a lot of the way you build software. It's not deterministic. There are a lot of deficiencies in the large language models themselves, but they're amazing in so many ways. The trick has been, how do you hide those deficiencies from the end user? They tend to be noisy, they tend to be slow, they create a lot of slop otherwise - and still build a product that people love.

So, it's a combination of reliable execution of these agents and also a great UX that becomes part of your daily workflow. For example, CodeRabbit sits inside the pull request workflow, and we're one of the very few companies which have been able to successfully bring a product into an existing workflow. A lot of people hate AI, if you ask me. People are trying to bring AI to every workflow you might have, and people hate that. But CodeRabbit has been one of the very few exceptions, where it's actually loved and being pulled in very rapidly by the developers themselves.

[0:31:18] KB: You highlighted a couple of really important things, and I want to go deeper there. One is that these models come with fundamental trade-offs. They have strengths and they have deficiencies. If you want to use them effectively in a product, you need to build around those. You can't just treat them like software. Then, as you also mentioned, many companies are failing to see that and just kind of trying to bolt them onto things without even thinking about, is this a useful use case for this? What are the strengths? What are the trade-offs? How do I do that? I'm curious, through building CodeRabbit, if you've developed principles for how you think about what is going to be a good use case for LLMs and not, or how you build a product around a large language model.

[0:31:59] HG: That's a great question, actually. One of the things that people love about CodeRabbit has been how surprisingly reliable it is, or accurate it is, given the bad experience or bad taste in the mouth that every other product leaves. That is the bar we try to keep up with the new features, which also means tracking where these models are technically, in terms of both price and performance. So, there are a lot of use cases we want to do, but we deliberately don't go and build them, because we know that the capabilities are not there yet. We don't want to lower the bar on CodeRabbit.

For example, a lot of companies are now doing issue-to-PR. But if you give an open-ended prompt, 80% of the time, you're still going to end up with the wrong implementation. So, these are still, I would say, experimental systems, not ready for the large-scale, mainstream use case. CodeRabbit is mainstream. We are being used in even traditional companies, not just Silicon Valley startups, but traditional companies on PHP and Java applications; even older applications are using us very successfully. So, those are some of the principles.

Yes, we could do a lot with AI. Especially with tool calling, it doesn't require a lot of code.
I mean, if you look at agentic systems, they are very simple systems; they're just a bunch of tools coupled together, and it's usually a model like Sonnet doing all the magic for you. But those are not products yet. You need a person who is really expert in prompting and able to drive the outcomes. And yes, Twitter is a different bubble. When people say they're successful with AI, they are good prompt engineers; they know exactly where these models will fail, and they don't even try those use cases. But the rest of the world is not ready for a lot of this prompting and these models. Those are some of the guiding principles.

UX is another one. We do try to make sure that we really understand the user's existing workflow, so that we can seamlessly bring AI into their daily life, versus something they have to remember to use. One of the big differences between the CodeRabbit experience and other tools is that it's not a chat product. Every other product requires prompting and chat. We are one of the very few products that has zero activation energy. There is no activation required of the user. You open a PR, and it gives you insights.

[0:34:00] KB: I think there's something really powerful there, because ChatGPT was so successful that it has kind of made everyone have this mental model of LLM equals chat. To your point, you are not a chat product; that is not at all what you are doing. What is your mental model for what makes a good LLM problem? If LLM does not equal chat, what is it providing for you? If someone else was trying to go through the learning process that you have of, how am I going to apply this in a useful way to create real value, what's the picture that you have of the capabilities this LLM provides?

[0:34:35] HG: That's a great question. You have to first understand where the data is coming from, what the training data looks like. We know that these models are trained on software. They've been very successful because the data has been very easy to obtain, things like shell scripts. Intuitively, hey, we have thousands of repositories; these LLMs are trained on what good shell scripts look like. So, those are the strengths, and you have to play to the strengths. Versus if you suddenly come up with a use case where you know that there's been very scarce training data - even the reasoning models cannot solve every problem. They're good at things that they've seen in the past, or where the data has been there, even for reinforcement learning.

One of the things we have seen is that these large language models don't really make someone who's already 10x become 100x effective; they really make an average, let's say 1x, person become 10x. Because it's bringing a lot of the training data, which is trained on best practices and good use cases, to a more average developer. That's also what makes it effective in automating repetitive work, the toil. Some of these code review comments are actually toil. Most of the time it's the same thing: best practices around security, best practices around null pointer checks. It's again and again the same thing. Or sometimes it's unit test case generation, docstrings, those kinds of things. It's very effective there. Those are the use cases we typically go after, where there's a lot of toil, repetitive work, and we know that people just don't want to do these things. Those are the things you go and automate.
If you ask me, can an LLM create something brand new, or make someone who's already a really good programmer become 100x? I mean, that I don't know yet. But we have seen a lot of people become 10x thanks to large language models.

[0:36:17] KB: One of the things you said a little earlier was around essentially not wanting to build features where the technology isn't there yet. What would you say is kind of the edge right now of the types of things you would do with an LLM, where you think it might get to in the next few months, versus, "Ah, that's not going to happen anytime soon"?

[0:36:38] HG: Yes, we constantly track the envelope. That's the whole idea with the evals. One of the other secret sauces these good AI app companies have is evals, where they're able to track not just the efficacy of the current system or the new models that come out, but also track the limits of these models.

We have some test cases we know that even advanced models like o3 are not yet able to solve for us. So, it's very critical that we track the progress, and we have seen our own benchmarks and our own evals getting beaten progressively from GPT-4o, to o1, to o3, and so on. That gives us a good idea. The second is the price, and how effectively we can offer it, because even these providers don't have enough quota. We have to fight with the providers sometimes to get rate limits. So, even if, let's say, we have a use case in mind, and people are willing to pay for it, we just don't have the capacity for it to be delivered at scale.

So, there are multiple factors which kind of hold us back on some of these frontier use cases that we have in mind. It's a complicated thing, I would say, when it comes to big bets. Overall, in this space, there's massive appetite in the market to bring AI in to, as I said, automate the toil and the mundane work, right? But at the same time, there are the practical limitations on how much capacity you can get, and the capabilities of the AI itself.

[0:38:01] KB: Yes. I think it's the first time I've seen in quite a long time where it feels like the whole industry is capacity-limited. We just can't ship enough GPUs.

[0:38:08] HG: That's right, and it gets expensive as well. I mean, we do see that there's going to be an orders-of-magnitude reduction. But then, again, some of these other use cases will start opening up. It's challenging, I mean, that aspect. Overall, the models are, in a way, designed - I mean, especially with RL, yes, you can make them competent on a lot of use cases, provided you have the right kind of data. It's like recording the usage, not just what's available on the internet, but observing how people do things - that's how the RL thing works. Sometimes it's just synthetic. But the thing is that you have to have the ability to record that data somehow.

That's how other use cases will open up. But for now, it seems like coding is something where you can easily obtain that data, either through code editors, through open source, or by hiring humans. I know these companies are also hiring a lot of contractors to go and solve programming puzzles. So, data is relatively easy to obtain, and that's why we have seen a lot more success in coding use cases initially with AI. But that doesn't mean other use cases are out of reach forever. It's just a matter of time. People will figure out how to obtain quality data to make those use cases reliable.
[0:39:13] KB: You mentioned evals, and that's another place it might be worth us digging for a little bit, because this is something that I feel like there's a lot of chatter about, but I haven't seen big standards coming out yet in terms of how to eval. It feels very company-specific, oftentimes. So, how are you thinking about and managing evals?

[0:39:32] HG: It is indeed company-specific, and we have been burnt in the past by looking at public evals and trusting them. That's what happened back in June, July, where we were burnt even by GPT-4o when it came out. It was actually worse than GPT-4 Turbo, at least for our use case. We didn't have good evals back then, and we saw a lot more of that. The main eval is, hey, are we seeing the same number of conversions? Are people still buying the product at the same rate, from sign-up to paid? Are we seeing a big churn rate? Those kinds of things are the real data points, the business outcomes. As long as you release these models and your outcomes improve, or remain the same, that means something is working. So, a lot of it is like vibe checks as well.

At the same time, you want to still do as much as you can at your end, because if you're rolling out these new models, you don't want them to backfire. We have 100,000 developers; the last thing we want is disrupting their daily flows. So, we try to be careful. We try to curate some of these examples we see in the wild, where we think they'll make a good eval. We're taking more of a pets versus cattle approach. We don't have millions of examples like other companies. We try to curate a golden data set of as few examples as possible, which allows us to track where the AI is today, and where we are also able to compare these models more effectively, very quickly.

[0:40:46] KB: What granularity do you apply that at? Because we talked about how you have this complex and variable task graph and pipeline of things going on. Is the eval at the level of the whole pipeline on a particular code change, or are there more granular things that you are testing?

[0:41:05] HG: It's both. We are taking the end-to-end approach as well, where we are running the end-to-end flow. But a lot of the time, we are also running it as a unit-test kind of thing, assuming a lot of the context we are able to provide is perfect from the other stages of the pipeline: how is a certain stage going to perform? Because it's a complex pipeline, and especially with agentic ones, your errors compound the deeper you go. That's the hard part. I mean, if you have a 5% error rate, it becomes 20% downstream at the end of the day. So, the idea is, how do we decompose this pipeline and test each stage independently as much as possible, by keeping a lot of the other factors the same?

It's kind of a balance. Yes, there are end-to-end tests as well, and at the same time, it's very granular. I wouldn't say we have 100% coverage, because some of the prompts are simple. We don't feel like writing a lot of evals for them. But some of the more complex prompts, where a lot of the classification happens, a lot of the reasoning happens - those kinds of prompts, we have extensive tests for now.

[0:41:59] KB: Are you using any particular framework for that, or is it homegrown?

[0:42:03] HG: It's mostly homegrown. I mean, we do have some visibility in tools like LangSmith, especially from the open source; we don't trace our paid customers' private repositories. But that's where we have a lot of the open-source data coming in, and that provides us live visibility into how the system is performing.
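A minimal sketch of the golden-set idea: a small curated set of review cases, each with issues a good review must flag, scored against whatever pipeline or model configuration is being evaluated. The case format and review_fn are hypothetical, a guess at the general shape rather than CodeRabbit's actual harness.

```python
def run_eval(golden_set: list[dict], review_fn) -> float:
    """Score a review function against a curated golden set.

    Each case is {"diff": str, "must_flag": [str, ...]}: substrings a good
    review is expected to mention.
    """
    hits = 0
    for case in golden_set:
        review = review_fn(case["diff"]).lower()
        if all(expected.lower() in review for expected in case["must_flag"]):
            hits += 1
    return hits / len(golden_set)

# Usage: compare two candidate configurations on the same cases before rollout.
# score_old = run_eval(golden_set, lambda diff: review_with(model="o1", diff=diff))
# score_new = run_eval(golden_set, lambda diff: review_with(model="o3", diff=diff))
```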
[0:42:19] KB: That makes sense. Slightly different direction. You said this is your third company, and I think I saw CodeRabbit's completely bootstrapped. You didn't go the venture capital route or anything like that. I know that's something a lot of developers dream about doing, taking a project and bringing it to be something sustainable. What did that take? How does that look? Were you able to get to something that could sustain you very quickly? What was that timeline like?

[0:42:47] HG: That's an interesting question. Yes, we had success in the past. My first startup was a good exit. The second, not so much. That was in the reliability management space. CodeRabbit was kind of an internal tool that started out there, but then it flourished independently. In this startup, one of the unique things has been just the compressed time frames; things are moving fast. And it's not like we didn't take venture capital money. We are funded by CRV, which is one of the big investors in product-led growth companies. Overall, we raised around 26 million, so it's not completely bootstrapped at this point; there is significant VC money that has been raised in this company. But yes, it did get to [0:43:26 inaudible] without the seed funding round.

We were already at a million dollars in annual recurring revenue last year when we did that round. That was completely on a bootstrap budget. But we could do that given that, yes, there was some prior success, so we could invest. We were at a stage in life where we could take that kind of a risk. That makes sense.

[0:43:45] KB: How did you get your initial sets of customers? I think this zero-to-one phase is one of the most challenging, particularly for developers finding the market. And you're targeting developers, and a lot of us, when we think about it, start with an itch that we want to scratch for ourselves. How did you get to that one million out of the gate, with no outside budget except what you could fund yourself?

[0:44:09] HG: A lot of that is thanks to my co-founder, Gur, who did things I would not have otherwise done, first of all. The first two startups were all enterprise sales, very content-marketing driven, a very different go-to-market. I'm not saying that was ineffective, but that's what those products needed. On the other hand, the developer market is a very consumer-style market. It's a massive market compared to selling cloud infra, for example. The strategies that work here are very different; even things like ads work very effectively in this space.

It was a combination of multiple things: influencers, organic tweets, our users talking about the product. A lot of it is not even us pushing it; it's the flywheel effect of the users who talk about it. So, a lot of our customers who come in, inbound, are primarily coming because of word of mouth. They're not being acquired by marketing anyway. Our cost of acquisition of customers is very, very low for the industry, because it's just a flywheel effect. The key thing we did is, we made the product accessible to as many people as we could. We made the product free for open-source users, so they could try it out. We made the product free for all individual users on VS Code.
The idea is, we know that this AI thing is so new, it needs a massive habit change. The main battle is not building a product or raising money. The main thing is, are people going to form this new habit or not? That was our biggest worry two years back. We saw it coming. Everyone was trying to bring AI products to the market. We knew 90% of them would fail because people are not going to change their habits. So, we saw that early on, and in order to quickly iterate on the product and make sure that we build a habit-forming product, we had to make it accessible. There was no other way, and we kind of innovated a lot on that.

That's what led to a lot of user love, because we could iterate and hammer it to the point where it has a very good product-market fit and gets universal love.

[0:46:07] KB: Yes, great lessons there. I guess we're getting closer to the end. Is there anything on the horizon? What's the big release coming from CodeRabbit?

[0:46:17] HG: We're doing very interesting stuff now, actually. Code review has been a very interesting starting point, getting us through the door in pretty much most companies now. One of the things we are seeing now is vibe coding taking off. Now, we are seeing even more acceleration in our growth. We have been growing like crazy, but the last three weeks have been, I would say, crazier; we have never seen that kind of growth. Because OpenAI Codex came out, there are the background agents Cursor is doing, and Claude Code is there. There are so many vibe coding tools out there. What we're seeing is this huge opportunity in being a tool that can make vibe-coded systems production-ready.

So, there is still some last 20% polishing, or we call them finishing touches, and those are the areas we are focusing on. In the PR, can we eliminate all the deficiencies? For example, if you're missing documentation, and you as a company care about it, can we add docstrings? Can we add missing unit test coverage? Because those kinds of things you're going to discover when you actually open a PR; you're not going to discover that in your Cursor or code editor. You're going to discover that in the CI/CD. That last 20% polishing is what we are focusing on as a company.

[0:47:25] KB: That's super cool, especially because I feel like one of the things I've seen with people exploring vibe coding is, the better your code practices are, the better the AI is able to generate things in it. If you keep things modular and well-named and all these things that get caught in a code review, then you're going to be able to sustain this longer as well.

[0:47:44] HG: That's right. I mean, there are so many things. You're talking about maintainability. You're talking about, can we fix some of the CI/CD failures? There's just so much downstream of a PR as well that needs to happen. We are pretty excited about them. I mean, there's massive appetite, and a lot of these form factors haven't been thought of in the past, and we are so excited to bring all these new ideas to the market.

[0:48:04] KB: That's awesome. Well, anything else that you would like to leave our audience with before we wrap?

[0:48:10] HG: I mean, the only thing I would say is, definitely try out CodeRabbit if you haven't tried it already. I know that a lot of people have heard about it, but it's a tool that will surprise you once you actually try it, because it's that good. So, I recommend everyone at least try it once.

[0:48:25] KB: Awesome.
I think that's a great wrap-up.   [0:48:27] HG: Thanks, Kevin.   [END]