EPISODE 1894 [0:00:00] ANNOUNCER: Visual Studio Code has become one of the most influential tools in modern software development. The open-source code editor has evolved into a platform used by millions of developers around the world. And it has reshaped expectations for what a modern development environment can be through its intuitive UX, rich extension marketplace, and deep integration with today's tooling landscape. Now, in an era defined by rapid advances in AI-assisted programming, VS Code is at the center of a profound shift in how software is written. Kai Maetzel is the engineering manager leading the VS Code team at Microsoft. He joins the show with Kevin Ball to talk about the origins of VS Code, how AI has reshaped the editor's design philosophy, the rise of agentic programming models, and what the future of development might look like. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [INTERVIEW] [0:01:25] KB: Kai, welcome to the show. [0:01:26] KM: Hi Kevin, thanks for having me. [0:01:29] KB: Yeah, I'm excited for this conversation. So, let's maybe start a little bit with you and your background and your journey to leading this VS Code team. [0:01:39] KM: Oh, actually, it started very, very early on. My first internship was already with DevTools. And I never really left DevTools. And then, 10 years ago, I joined Microsoft explicitly for the VS Code effort. There was a promise that there was something that could get traction in the market. And that's the moment I joined. And we pretty much went from no users to a whole lot of them. 44 million by now. [0:02:11] KB: Yeah.
I remember when VS Code first emerged, and I was like, "Another IDE?" And then it kind of took over the market. [0:02:18] KM: Yeah, that's true. I mean, when you think about this, it was a very well-established market, right? There have been editors forever, IDEs forever. But all of us somehow lived in this in-between world, where it's like we're not super happy yet. It's like, "Yeah, I can do this there super, super fast. And I can do this there in a good way, but I have to wait until it starts up. And it has too much stuff in my face." And so on, right? It was really finding the sweet spot in the middle. And that's actually also how we talked about this, right? It's really two ends of the spectrum. Editors on the left-hand side, full-fledged IDEs on the right-hand side. Where's the spot in between? That's really what we tried to find. And I think we really hit the bullseye there. [0:03:08] KB: Yeah, absolutely. And I feel like you were winning for a ways. And now we're in this kind of moment in the tech industry where what it means to write code feels like it's shifting very rapidly. And so I'd love to kind of dig in with you about the ways in which you are thinking about this. I think initially bringing Copilot into VS Code and looking at that. But what has been the sort of VS Code journey to this new agentic coding world we find ourselves in? [0:03:41] KM: Yeah. When you think through this one, we easily forget what we knew and what we didn't know. I mean, you just go six months back, and what our understanding was of how coding should be and what it is today, right? Recently, I had a conversation with someone, and this person said, "Oh, 60 days ago." I was like, "What? I thought that was in March or so." Right? And we have November right now. It's like we're working on very compressed timelines, right? Lots of things are happening. And I just want to keep this in mind when we talk about all of this, right?
In the very beginning, we started working, actually at the time, as a VS Code team with the GitHub Next team. And the GitHub Next team is pretty much the internal research kind of area. And GitHub Next had good relationships with OpenAI. And so this is pretty much where the AI-powered IntelliSense suggestions came from. There were already attempts in other areas before, right? This was not new per se, right? And usually it was integrated with code assist and so on, right? And then they got little stars, for example, saying, "Oh, those are AI suggestions. The other ones are coming from the language servers and these kinds of things." That was the first part, right? And then we're like, "No, no, that needs to change, right? We need a different UI for this." This is then where we really pushed hard into ghost text and these kinds of things, right? And in the beginning, we already had multi-line completions, but then we realized no one is using them because now you have to review code rather than stay in the flow, right? Then we pretty much walked backwards, saying, "Oh, smaller completions." That's pretty much this whole journey of completions, right? Then ChatGPT came along. We started off the year right after ChatGPT launched with a hackathon within the VS Code team, saying we have four days here, and we just go and build what we think we can actually build with these new models, with this new kind of approach. And that was super interesting because that pretty much immediately made clear that you cannot really just put this on from the side, right? It needs to be really part of the tool itself. AI really infuses into every single aspect, right? So you think about the command palette in VS Code, right? You type in there. The very moment you have good AI, you think about that. It should be smart enough to figure out what you actually mean rather than what you type, right? But then you have to find the right spot between "No, I actually meant what I typed."
Compared to "No. Don't guess wildly, right?" And then there were, of course, performance considerations, right? Everything in VS Code is about performance all of a sudden. The AI answers were not that fast, and so on. But it was super clear that this needs to be a core part of the experience, right? Then there were, for us at least, very interesting conversations. Because GitHub, at the time, GitHub Copilot, was already an established brand. VS Code didn't have a sign-in that you needed. You just fire it up. But in order to use GitHub functionality, you needed to sign in. GitHub already had billing and all these pieces in place, right? And then there was an established brand. How do we now find in a way the balance between what comes in through the extension compared to what is in the core? And that journey, this duality, that really took us a while to get right. So then there was a lot with chat. There was a lot with what models do you actually have available. Not just what models, but what capabilities do those models have? How much context window do you actually get, and so on? And I think there is a difference between if you sit in a startup and you think about those problems, or you come from a world that is already profitable. And so Microsoft thinks about this in very different ways. And for example, it took us quite a while to convince others, "No, we need larger context windows. You cannot work with a 4K context window very efficiently." There were these kinds of challenges in the beginning. It was an extremely steep learning curve, I think, for an organization as a whole. And then I think over time, we kind of figured this out. We're still learning. So we're not done here. But I think we figured this out. Yeah. And I just think about the last year, we came with edits. We came with what we call NES. So the tab-tab-tab model. We have the agentic loop. We integrate the cloud agent that GitHub has, the Copilot coding agent. There's now the Copilot CLI.
We integrate all of those now into the VS Code interface. We use the agent sessions view in order to make that right. We're now actively working on improving the agent sessions view because it's still somewhat rough in usage, right? So we're actually improving this. There is a lot of these kinds of things that happened, right? And at the same point in time, the competitive landscape has shifted. The capabilities of the models have shifted, right? It's not only about capabilities now. It's about how long can it run? How fast does it respond? Like time to first token. What model mix are you actually using at the right time, while people actually learn how to use it? It's an extremely dynamic area. [0:09:39] KB: It absolutely is, yeah. An area I'd love to kind of dig in with you a little bit more, you mentioned how, even from the beginning, as you started to look at more advanced tab completion, AI-enabled tab completions, not just language server, you had to kind of find this balance between how much were you showing. How much were you asking to allow the developer to course correct, right? I think one of the beautiful things about these AI models is you can do this sort of intent-based UI where you kind of try to guess what the user is doing and lead them there faster, but you can get it wrong. And so I'm curious, as you've layered on each of those pieces, as you've gone down chat-oriented programming, and now agentic programming, and all these things, how do you think about that balancing act of, "Well, we can do a lot for you. But how do we make sure we're doing the right things?" [0:10:32] KM: I would actually start with the sentence of saying that this is an unsolved problem, right? And it seems maybe surprising because we have been doing this. But for example, you look at the tab completion, NES, so Next Edit Suggestions, there is always this tension between what do you show to the user? How often do you show something to the user?
How often does the user actually accept what is being shown to them? And then, also, how often do they explicitly dismiss it? And that is the space in which you operate. And you try to find, as we discussed before, what's the right thing between an editor and an IDE? You try to find the spot. And there are extremes where you can go. For example, you show everything that the model proposes immediately to the user. And, of course, the absolute number of acceptances goes up and up, right? But you also annoy the user at the same time more and more. So you really have to find pretty much how often do you show something? How many opportunities are there that you actually show to the user? And how many of those does the user not explicitly dismiss but accepts? You really have to find this explicit acceptance, explicit dismissal, and so on. And that is an ongoing kind of fine calibration. Because people also learn, that is the next part, right? A person who actually uses NES for the very first time has different expectations from a person who is actually much better. It also has to do with, for example, the typing speed that a user has. How long do you wait, for example, until you show something? For a slow typer, it might be way more annoying if the model is actually very fast. While a fast typer is annoyed that they don't have the proposal. They pretty much want to type in without stopping. Just hit the tab key in order to accept. Because they anticipate what the model actually will bring. It's really an ongoing effort. And we have quite elaborate dashboards with metrics on this, where we really go back and forth and adjust those pieces, and then run a 5% flight and see: does that actually change how people actually interact with it? If you show a little bit more, does it have a positive outcome, negative outcome? How often do people hit the escape key? And it's really interesting. For example, the escape key hit rates were not that high.
It was around 3% the last time I checked. But when you ask someone, they're actually saying, "I hit escape all the time." And then you look at the data saying, "No, actually you don't." It's really, really interesting. Because in the end, you have to get a happy developer. And happiness is a combination of how productive you feel, of how well you actually thought you could go through your thought processes, how focused you could be, how little annoyed you were. All of these kinds of things. And that is an ongoing kind of process to really get this right. [0:13:48] KB: Absolutely. I will say, as a longtime Vim user, no amount of escape in VS Code ever feels like a lot of escape. [0:13:56] KM: Yeah, absolutely. [0:13:57] KB: I'm curious, you talked about how this can vary across, for example, different typing speeds or experiences. When you're tuning these knobs, are they global knobs applied to everyone? Do you have some sort of adaptive system in there such that, for example, if I'm a faster typer, I get more rapid completions? How does that end up working? [0:14:19] KM: It's mostly global right now. But we're, for example, working on how typing speed is encoded actually in the input that the model gets, so that it actually can take that into consideration. [0:14:34] KB: Got it. It would still be a global model, but this would now become an input with features developed based on it. Interesting. Okay. And looking at that feature space you mentioned typing speed. Is how new a user is, or some sort of representation of that, also encoded in some way? [0:14:55] KM: We don't have a good way to really encode that as an experience level. Because you could argue from - you look at the workspace. And I think the workspace is a good indication. But it doesn't necessarily tell you if the user is new to that particular workspace or not.
It's quite complicated to have a good profile of a user, because each of us actually goes through those different stages depending on what repo we are looking at. If I open a Rust repo, I might be more intimidated than when I'm looking at a TypeScript repo. And these kinds of things also should, in a perfect world, play into what we're doing. Not right now, but we're thinking about those. [0:15:44] KB: And looking at some of the other interaction modes beyond the next edit suggestions, as you start looking at chat-oriented development or even these increasingly agent-loop types of development, what are the interactivity trade-offs that you're exploring there? [0:16:01] KM: I mean, let me put it this way. The original chat interfaces across all of the different tools, ours included, they were interesting. You looked at those and said, "Oh, this is really amazing what it can do." And at the same point in time, "Oh my god. This is bad. This sucks so much." [0:16:20] KB: I feel like this is my experience with all of AI. [0:16:23] KM: All right. And what I mean by this one is we spent years optimizing to go from, let's say, 3 seconds for a particular interaction, to 2 seconds for a particular interaction, right? And we do all of this in order to keep you in flow state. And then we put you into a chat, and now the answer takes 20 seconds in a good case. It might take longer, a couple of minutes in some other cases, and so on, right? And you only tolerate this because you still think that the outcome at the end is quicker, is better than what you could have done yourself. You torture yourself a little bit in order to accept the better outcome, right? And that is pretty much the baseline where we started with these kinds of chat interactions, right? And since then, I think you see that the world actually changed a bit, right? First of all, we have much faster models, right? People have also developed different styles of interacting.
For example, one style of interacting is you use a really fast model in order to do your research. You go. And interactively, you go and try to figure out what you want to do. But this is something where you actively research. And then you kind of know what you want to do, right? You are able to actually put this in a reasonable prompt that you then delegate and let run in a background agent, for example. That's one way. There's another school of thought or another behavior that is almost the inverse of this, where people go and say, "No. I run multiple exploratory asynchronous agents. I roughly tell them what I want. They're responsible for creating a plan for this." And that can take a while. They use a large, slow model for this until you get your plan. You work a little bit on the plan. But then you pretty much use a dumb but fast implementation model. And you do this because you kind of know that AI doesn't get it right all the way to the very end. You are actually helping along. You're perfectly fine to go only to 90% and do the other 10% manually. And there's not such a big difference between going to 92% and doing the last 8%. That's why people accepted a model that is not that sophisticated sometimes for this. But those are really different work styles. In one, you do all of the synchronicity, and speed, and exploring, and thinking. You do this interactively. And then delegate. And then review. But because you actually had the thought process at the beginning of what to do, the review is kind of easier, right? Compared to that, in the other, the model thinks, the agents think, and then I'm helping with the implementation. And that also makes the whole review much, much less work. This is quite interesting. And there is really this kind of tradeoff in what you're saying. The interactivity versus not. We implemented different custom agents. And so, for example, we have the one that only - the ask mode, where you really just go and make no edits.
We have one where you can define yourself what is the scope of the modifications that actually can happen. That's called edit mode. Then we have the agentic mode. We're working or playing around with something that's called interactive mode, where the model becomes, or the agent becomes exceedingly steerable. When you go and say, "Make this change in this file," it will make this change in this file and not go off and fix five other files that actually now have compile errors, right? Because it's you who actually steers it, right? But again, that needs to be super, super fast, right? We have a planning mode; we ship a planning agent. There are all these different kinds of trade-offs. And we're learning at any given point in time. But it's just like what it always was with developer tools. There are different breeds of developers with different interests and different preferences. And you've got to be giving the right tools, the right combination of tools to each of them so that they can find a place where they're happy. [0:21:04] KB: So let's dig in a little bit, because one of the things that you talked about for these different modes and how steerable they are, there's obviously model differences, right? Sonnet loves to edit all the files. It just likes to talk. Whereas some of the other models don't. But there's a lot that you're doing in the agentic harness and how you're defining these agents. Can we maybe dig in a little bit? I think coding tools are probably some of the most advanced agent software pieces we have out there. How do you build it? What's the stack for defining one of these agents, an ask agent, or what have you? [0:21:41] KM: I mean, at the very core, right? And I'm pretty sure you have heard that answer several times, right? An agentic loop is not that particularly complicated. It's like you give it a bunch of tools, you give it instructions on how to use those tools. Most of those instructions are actually with the tool description.
Sometimes they are outside. Each model has certain kinds of preferences. And there is prompt guidance for each of those. Some, for example, like that you tell it, "Oh, give the user an update from time to time." Others actually stop the agent loop when you give such instructions. For example, Codex is one of the models that stops when it wants to give an update to the user. There are all these kinds of differences, but that is the basics. And then you've got to pretty much instruct the agent. That's actually one of the more interesting problems: when is it done? At which point in time should it actually consider itself to be done? And so that is the basics of all of this, right? When you add a custom agent on top, that's actually something where we say, "Okay, in a custom agent, you can define what tool set is available to that agent." Out of all available tools. And they actually can come from different sources, right? There are built-in tools in VS Code. Extensions can actually define tools. And on top of this, you can install MCP servers. You can have a quite large set of tools. So then you can specify pretty much in a custom agent file, you can say, "Oh, here's the tools that you should make available." And then after that, you have pretty much the normal syntax that we use for everything else, which is a markdown-inspired syntax, right? Where you can say how an agent should actually operate. And then many people have seen what Claude skills look like and so on. All of the definitions are pretty much comparable to each other, right? That kind of aspect, right? It's a markdown file where you give references to other instruction files, where you can say what tools to use. Under what circumstances? For example, you would go and say something like, "Hey, because I'm in a workspace that I actually know, I've defined it all. I really like that you use the test runner tool rather than go and do npm run tests, or cargo test, or so," right?
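As a rough illustration of the kind of custom agent file Kai describes, here is a hypothetical example. The field names, tool names, and front matter layout below are illustrative, not VS Code's exact schema; the shape (front matter naming a tool set, followed by markdown instructions) follows what is described in the interview:

```markdown
---
description: Workspace-specific agent that prefers built-in tools
tools: ['readFile', 'editFile', 'runTestRunner']
---

Because this workspace is fully set up, use the test runner tool
rather than shelling out to `npm run tests` or `cargo test`.

Never touch assertions in test files. If a refactor renames symbols,
you may update references in tests, but asserts are untouchable.
```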
You say this explicitly, right? Or sometimes a model goes and kind of assumes, let's say, that your tests are wrong, right? Funnily, I said that was the first moment I thought models really got intelligent. And they pretty much rewrote tests to assert true. In one way or another, it was somewhat obfuscated. But that was pretty much the bottom line, right? I was like, "Oh, all my tests pass." It's good. Yeah, you did, right? Sometimes you just go in and very explicitly say never touch test files there. Everything is good there. If you do a refactor, you can adapt them. But asserts are untouchable, for example. It's pretty straightforward like this. But a lot of work then actually goes in and saying, "Okay, how much instruction do you need for tool usage? How much guidance do you need to give?" And that's where pretty much this whole machinery comes into play, of what evals you run. How many of them you run? How often you run them? How do you really think about actually assessing/evaluating an outcome, right? And that is quite different, right? For example, we run, like the rest of the industry, SWE-bench, for example, right? As one of the benchmarks. And we're not just looking at the resolution rate because the resolution rate is you get from A to B, right? That can be super, super messy, right? We look at how fast do you get from A to B. How many tools did you actually call? What is the amount of tokens that you actually used in order to get there? Did you call the tools that we think you should use, right? For example, if you have a terminal tool, you can use that tool, and you can get to the end of it. And that's perfectly fine if you run as a background agent. But if you run as a foreground agent in VS Code, then the user actually expects that when you say that the tests failed, that they can look at the test explorer and see the failing test and click on there, right? You wanted to use that tool then.
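Stripped to its essence, the agentic loop Kai sketches earlier is small: hand the model a set of tools with descriptions, execute whatever it calls, feed the results back, and stop when it stops calling tools. A minimal sketch in Python; the model here is a stand-in stub, not a real API, and the message shape is invented for illustration:

```python
def run_agent_loop(model, tools, task, max_turns=20):
    """Minimal agentic loop: ask the model, run requested tools,
    feed results back, stop when the model stops calling tools."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model(messages, tools)       # model sees the tool descriptions
        if not reply.get("tool_calls"):      # "when is it done?" heuristic:
            return reply["content"]          # done when no more tool calls
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return "stopped: turn limit reached"

# Stub model: requests one file read, then answers.
def stub_model(messages, tools):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "read_file",
                                "args": {"path": "README"}}]}
    return {"content": "done", "tool_calls": []}

tools = {"read_file": lambda path: f"<contents of {path}>"}
print(run_agent_loop(stub_model, tools, "summarize the README"))
```

As Kai notes, the loop itself is the easy part; the work is in the tool descriptions, the stop condition, and the evals around it.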
If you have a watch task, it should not try to spend a bunch of turns in order to figure out how to build the project. It should call the watch task. It's right there. Right? These kinds of evaluations we actually then run and compare. And then there's the fine-tuning going on, wording changes going in, how you group tools in different categories, right? That actually makes a difference there. There's a lot of that work that actually goes in. [0:27:03] KB: Yeah, it's deceptively complex inside of this very simple wrapper. There's a couple pieces I'd like to dig in on that. So, one, as you mentioned, different models tend to have different preferences, I guess we'll call them, right? In terms of how they invoke tools, how they check in with the user, things like that. In the UI, I have this very simple model switcher. I'm just changing models. All of that is opaque to me. Are you customizing tool descriptions, the core agent instructions, all these different things by model to help them behave consistently? Or how is that all functioning? [0:27:39] KM: There is a current state, and then there is the near-future state. I mean, we're all open source. So you can take the repository, you can look at this, and you actually see. I mean, when you run inside VS Code, we have a log view where you can see every single call that is actually being made to the model. You see every single detail of this. You can actually verify my words here in your own day-to-day experience. The current state is, we actually have a specific prompt for, I would say, roughly every model family, sometimes more detailed. Sometimes we go down and say, "Oh, it's not just a GPT model." We really separate between GPT-5, GPT-5-Codex, GPT-5.1-Codex. We move so fast that I say five rather than 5.1. But as an industry, right? Different entry points pretty much in our main prompt file generation, right? And then we pretty much pick what tools are available for that particular model.
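The eval dimensions Kai lists, resolution, speed, tool-call count, token spend, and whether the expected tools were used, can be combined into a composite score. A hypothetical scoring function; the weights, field names, and decay curves are invented for illustration, not the team's actual metric:

```python
def score_eval_run(run, expected_tools, weights=None):
    """Score one benchmark run on more than just resolution rate:
    reward resolving the task, but also penalize slow runs, excessive
    tool calls, and missing the tools we expected the agent to use."""
    w = weights or {"resolved": 1.0, "speed": 0.2,
                    "tool_economy": 0.1, "right_tools": 0.2}
    resolved = 1.0 if run["resolved"] else 0.0
    speed = 1.0 / (1.0 + run["seconds"] / 60.0)          # decays with runtime
    tool_economy = 1.0 / (1.0 + run["tool_calls"] / 10.0)
    used = set(run["tools_used"])
    right_tools = len(used & expected_tools) / max(len(expected_tools), 1)
    return (w["resolved"] * resolved + w["speed"] * speed
            + w["tool_economy"] * tool_economy
            + w["right_tools"] * right_tools)

run = {"resolved": True, "seconds": 90, "tool_calls": 14,
       "tools_used": ["read_file", "run_tests", "terminal"]}
print(score_eval_run(run, expected_tools={"read_file", "run_tests"}))
```

The point of a shape like this is exactly what Kai describes: two runs with the same resolution rate can still differ enormously in how messily they got from A to B.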
What are additional instructions that need to be given, and so on? We customize the instructions that are outside of the actual tool descriptions. But we don't have a model - we don't have it yet in code, where we actually say, "No, we know exactly that this is the tool description that works better in this particular kind of model." That's actually something that we discussed several times. Never quite made it to the point of, yeah, now it's coming. The last iteration plan, we again had the same conversation, and I was saying, "Oh, we do this right when we ship beginning of December," to have model-specific tool descriptions, right? Where pretty much the prompt file can override, saying, "Oh, if this tool shows up right here, this is actually the tool description that you should use." [0:29:40] KB: Another kind of detailed topic in here is how you assemble context for the agent in terms of you're operating in this type of repo, you have these things. Beyond tools, there are people who do different amounts of like pre-injection of context. Maybe it's not just system prompt, but it's got a whole bunch of additional things. How do you think about the right ways to sort of present things to the agent so it kind of starts out going the right direction, versus everything's in on-demand tool calling? [0:30:13] KM: There are a couple of things here that - and let me actually start with the tools first, right? Most models these days have been trained with a particular tool set, right? Out of the box, they already know a certain set of tools. For example, apply patch for GPT models, string replace for Claude models. Those kinds of things, they all come with those individual tools. And then the next question is, beyond this, how well do models actually generalize, right? And so you want the tools. And then, again, the context, as I said. In what kind of environment is that particular prompt now executing? Is that foreground agents, background agents, and so on?
That's the first kind of question. Those tools, how are they actually represented in your prompt? Then the next one is, let's say you have a couple of MCP servers installed. Some MCP servers have only one or two tools, right? But others come in with dozens and dozens of tools, right? Most models have a limit to how many tools you can actually put in a prompt, right? 128. But then on top of this, there's a lot of tokens that you actually put in there, right? How many tokens do you actually want to spend on tools that are rarely used or only used in particular, specific situations? A technique we're using there is, for example, we go and take all of the tools that an MCP server gives us, and we actually now create pretty much virtual categories of tools, right? And these kinds of virtual categories, they are represented as tools in their own right. We give this to the model. And the very moment the model decides to call one of those virtual tools, then we pretty much expand it, right? But now you have immediately this kind of tradeoff discussion, which is the very moment you do this, you actually - [0:32:27] KB: You've blown the KV cache, among others. [0:32:29] KM: Exactly. Exactly. In some cases, some models actually support that you put this at the end of the prompt. Others actually don't. Now, you immediately have to make this trade-off, right? And that's pretty much where a lot then also of evals come in, right? You run in all of those different configurations. These are optimization functions that you have to hit here, which is like, "Oh, if I blow my cache once or twice over such a long time, I'm still good," right? Or you're saying, "No, actually, I can run with a slightly larger prompt," right? That is fine because I have a cache hit rate of 87%, or whatever, right? And I was like, "This is okay. There's no big advantage here." It's a constant kind of tradeoff. And that is also true for all of the other context parts, right?
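The grouping technique Kai mentions, collapsing a large MCP tool set into a few category-level "virtual tools" that expand only when the model asks for them, can be sketched roughly like this. The category names, tool names, and data shapes are invented for illustration:

```python
def build_virtual_tools(mcp_tools, categories):
    """Collapse many MCP tools into one virtual tool per category,
    keeping the initial prompt small. Expanding a category later
    trades a cache miss for fewer tokens spent up front."""
    virtual = {}
    for name, members in categories.items():
        virtual[name] = {
            "description": f"Tools for {name}: " + ", ".join(members),
            "expands_to": {t: mcp_tools[t] for t in members},
        }
    return virtual

def expand_if_virtual(tool_name, virtual, active_tools):
    """When the model calls a virtual tool, swap in its real tools."""
    if tool_name in virtual:
        active_tools.update(virtual[tool_name]["expands_to"])
        return True   # prompt changed: KV cache past this point is stale
    return False

mcp_tools = {"create_issue": "...", "close_issue": "...",
             "merge_pr": "...", "query_db": "..."}
virtual = build_virtual_tools(mcp_tools, {
    "issue_tracking": ["create_issue", "close_issue"],
    "code_review": ["merge_pr"],
    "database": ["query_db"],
})
active = {}
expand_if_virtual("issue_tracking", virtual, active)
print(sorted(active))   # only the issue tools have been expanded
```

This shows where the tradeoff Kai and KBall discuss comes from: the expansion keeps the initial prompt small, but any mid-conversation change to the tool list invalidates the prompt cache from that point on.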
And there is not a real stable part. I mean, there are some stable parts. You say who the user is. You say what the repository is that the user runs in. But the very moment already, how much information do you give about the project itself? For example, we put this kind of prefix in where we're saying, "Oh, this is what pretty much the top level of the project looks like to the user," right? And we still believe that this is actually reasonable token spend. But then, on top of this comes what we dynamically include, right? And dynamic inclusion is clearly, if you have an AGENTS.md file, we have custom instructions that we actually do support. And custom instructions actually can be tailored in different ways. They can be just in a certain location, and that's fine, right? In the front matter of those custom instructions files, you can say apply to. And then you can actually give glob patterns and say, "Oh, in this particular test folder, in a TypeScript file, this is a file that actually applies." And then there's yet another mechanism where you can actually give a natural language description under which circumstances that custom instruction applies, right? And then we actually start collecting these and actually putting them also in the prompt. And that is actually a process that I think is the one that is the most valuable, right? Because you make sure that the user is in control of how much they pretty much AI-prepare their codebase, right? But when they did a really good job, then they can actually make sure that the agent extremely quickly gets to the right place, knows where to start, knows where to look. [0:35:11] KB: Let's maybe talk a little bit about interactions between agentic pieces and the IDE itself. Now, you mentioned a couple examples of this of if it's running in the foreground, use tools that connect to parts of the IDE so that they're running in the right place rather than using the terminal.
But what is the surface area that you expose to the agent, to the IDE? And how do you think about changes coming from the agent versus coming from a human? [0:35:44] KM: I mean, it's all about how you interact with it, right? Again, the most straightforward form is you actually have a foreground agent running in VS Code. And that foreground agent, we give it actually quite an interesting set of things that it can do, right? It can look at terminals. It can read selections in terminals, all of these kinds of things. It can run tools, watch tasks, right? All of these. Specialized edit tools where we actually are able to run pretty much snapshots at a given point in time, right? So that we can show you, "Oh, here's all the appropriate diffs," and so on, right? There's a good chunk of tools that we actually give a foreground agent. In a background agent, that's quite different, right? In a background agent, we give significantly less. Why? Well, the first thing is if you run the agent in the foreground, you have this kind of expectation that the agent actually is reasonably quick, right? If you think about this more, that is the interactive part. You don't want to sit there and wait two minutes and twiddle your thumbs, right? You want to get the answers relatively quickly, right? And then again, you want to make sure that this all kind of is the extension of what you would do anyways. But when you go and move something into the background, then you clearly don't want it to touch your UI state at any given point in time. We mentioned the example of the test runner a couple of times. You don't want it to mess around with your test runner. You don't want it to open up a terminal on you, so that your mouse all of a sudden clicks a different place and so on, right? There are different tool sets that you're actually giving. A cloud agent is yet different, right?
While a background agent is still running on my local box and is still exposed to me closing the lid, the cloud agent is not. And there, it's about what containerized environment that agent is actually running in. What is the project that you actually have? Can it build successfully in that container? Can it execute a test run in this container? Yes or no? And so on. But again, the cloud agents have significantly fewer tools in order to do so. Remind me of your question again. [0:38:15] KB: Well, my question was kind of how you think about those interactions? And you've given me a fair amount. This actually leads to a curiosity I had as I was listening to you, which is how much does this differential exposure of different types of tools end up influencing how well the agent does? I'm imagining the same prompt in an IDE context, versus a background agent, versus a cloud environment might result in quite different coding behaviors. [0:38:46] KM: The model choice, I think, has a much bigger impact on the actual outcome. What we're trying to do is really straddle the line between the user experience and the success of the agent. Because as we said, if you have a terminal tool, an execute-terminal-commands tool, you can get really far. You don't need an edit tool; a cat command with input redirection will do. You see, this is your edit, and then it writes it to the file system and so on. You don't really need a whole bunch of tools in order to make an agent successful going from A to B. There are some differences. For example, when the industry introduced the to-do tools in order to have longer-running agents self-organize, and so on. But in big parts, when you think through this, you don't need a huge amount of tools in order to make that successful. That's one. When we actually bring it in the foreground and give it more tools, then that's very specific to the environment.
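The "cat with input redirection" point can be illustrated with a minimal sketch - the file names here are hypothetical, and this is not VS Code's actual implementation - of an agent applying an edit using nothing but a run-terminal-command capability:

```python
import os
import subprocess
import tempfile

def apply_edit_via_terminal(path: str, new_content: str) -> None:
    """Write a file using only a 'run terminal command' capability,
    emulating an agent that has no dedicated edit tool."""
    # A heredoc with a quoted delimiter ('EOF') prevents the shell
    # from interpolating variables inside the file content.
    command = f"cat > {path} <<'EOF'\n{new_content}\nEOF\n"
    subprocess.run(["/bin/sh", "-c", command], check=True)

# Usage: the "agent" rewrites a file purely through the shell.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "greeting.txt")
apply_edit_via_terminal(target, "hello from the terminal tool")
```

The same single tool also covers reading (`cat file`), searching (`grep`), and running tests, which is why a bare terminal tool already gets an agent surprisingly far.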
For example, no other agent needs a "hey, maybe you should install this extension to have a better user experience" tool, right? But inside VS Code, that clearly is a tool that is available. It's particularly interesting when you scaffold, for example, a new project, right? You go, you say, "Oh, I want to do this. And now, create this workspace for me." Or you're writing Go, because you told it to use Go, but you don't have the Go extension installed. It makes sense that the agent actually goes and says, "By the way, should I install the Go extension for you?" It's really more about the user experience that we try to give to folks in the appropriate environments they are in. That's really the biggest difference. And then it's also coming back to how you interact, actually, with an agent. I think for a background agent, I don't want to have a lot of interaction. There, I want to have context isolation. And I want to make sure that it's even running in a sandbox environment so that I'm not bothered by tool calls - that I have to approve tool calls and these kinds of things. But a foreground agent is more for when I'm not quite sure yet exactly what I'm doing, right? At least that is the use case I see primarily. People are talking about code. They kind of go and make a selection and say, "Hey, change this." Short prompts, right? They are not particularly long. Sometimes people go and actually use NES in order to start a change, but then they don't finish it. And then they just go to the foreground agent: "Hey, finish this up." And it should then be very quick. Just in this particular file, just do the rest. Or when people create new test cases, right? Test case generation usually is not something that takes particularly long. I mean, it depends. But in most cases, it's pretty straightforward. Rather than pushing this into the background and then coming back after a while to review it and all of this, it's usually more like, "No, no, I'll do it right now."
And then immediately run it. And then let me review it. So that there is no cheating going on - that the tests are not already written to match whatever the actual behavior is. Very different styles of interacting. While the foreground interaction is really short, interactive behavior - talking about code, pointing at code, collecting pretty much the context that you want - that's all very code-specific. With background agents, you don't really do that. You try to be precise the moment you start. And then at the end, yeah, you can follow up a little bit if you want. But it's different expectations, different levels of preparation, and so on. And what I'm actually saying here is there's more to this, which is when I say you talk about code, then I mean you actually are a person who cares about code. And you are actually really working on something where you need to guide in regards to software architecture, certain patterns that you want to enforce, etc. There's this whole other world where you don't care. You don't care what the code looks like. It's really just outcome-oriented, and so on. And those lines, they also shift back and forth within the same project, by the way. [0:43:36] KB: 100%. I care about my core architecture. This tool? Just vibe it. I don't care. [0:43:42] KM: Yes, exactly. It's exactly this. And this is a really interesting point. And I think as an industry, we maybe don't talk about this enough, which is: how do you actually AI-ready your codebases? That is exactly right. If you have a project - I mean, our codebase, the initial commits, and all of this, they are more than 10 years old. And since then, we built on top of this. In order to make our codebase AI-ready, we really have to think about what the core abstractions are that we never want an agent to go and change. If you should change anything here, we tell you. But then there are other parts that are a little bit more peripheral, as you said, right?
Some tool or so that you just want to have on the side - tool in a more generic sense. It's like, "Yeah, just go do it." And you might even just check in the prompt file that you used in order to generate this. It's very, very different. You think about what is untouchable. What kind of lives at the periphery? Where do you care? Where don't you care? Most people really love using test-driven development for a bunch of this. Tests are pretty much my prompts that I use for the implementation side. There's really this great flexibility. And people are operating in quite different ways. [0:45:12] KB: I'm curious, when we talk about these different modes of operating, and the fact that we kind of flow between them, how do we connect the dots? An example that I'm going to bring forward, and I'm very interested in how you would think about this. I'm often working on something in that interactive mode. I'm thinking about it, and an idea comes: "Oh, it would be great if we do this." And I have a set of predefined research-style prompts that I can just kick off. I'll kick off a background agent and say, "Okay, go and research in my codebase what it would look like if I were to do something like this. Write me an analysis doc," and go. It'll go off and do it as I continue on my main line. And at some point, I want to come back and pull, almost suck that into interactive mode. Now, I can do this right now with branches or doing things like that. But I'm curious if there's something in the IDE that lets me kind of - it's almost like I'm pushing ideas onto the stack. And then I want to pop them back into my interactive world. [0:46:06] KM: There are different ways of thinking through this. Actually, it's interesting, because we really just discussed that. And we have a mockup.
We have not implemented this yet, but a similar discussion came up, more about at which point in time do I go back to a background agent, or to a cloud agent, right? And yours is similar because it depends on the output - what the cloud agent generated for you. In your case, you want to see the analysis. You want to look at this and so on. And the way you think through this is, well, you kick them off. And at some point, you've got to go back. So, you need an indication that tells you it's ready for review. But then the interesting thing is that just looking at something does not necessarily mean that you did all of your due diligence. In a way, you need interaction, saying, "Yep, it's ready. You can go there." But then at some point, saying, "I took action on this. It's actually good." You want to have this awareness of those. And when you think about this, I mean, there's really, really - how should I say? Prior art, right? I mean, when you think about email management tools and so on, they have quite similar kinds of characteristics. One thing that we had was pretty much, at any given point in time, when you interact with chat and so on, clearly, you can make things disappear. But you pretty much have awareness about where your background agents are, and which ones are ready to review. If you don't want to, you don't see the running ones, and you don't see the ones you took action on - really, just those that you haven't acted on yet, right? But they are done. They have produced what you asked them for. And then one of the mockups that I just talked about is pretty much where we bring this right into the very top of the title bar, where you have something that can come down as an overlay, just super quick. You just see it, right? And it needs to be a first-class citizen. Whether this UI I describe is what it will look like in the end is a different question.
But it is absolutely clear that you need this kind of peripheral awareness that something else is ready, no matter what you do. I mean, there are workarounds for this, right? And that's actually something where sometimes we're maybe too focused on the one tool that we own and that we operate in, right? But again, I mean, if you want something from me, you Slack me. And I get a notification that you want something from me. Integrations with these kinds of other work environments - tools that actually tell you. We have a Slack channel with your agent. It just comes up and actually says, "Hey, I'm done." It's good. And you have the notification, right? And it fits in your other workflows and so on, right? There's a lot to explore here. That's where I'm going. We can do some things in the IDE. We can do some things with GitHub on github.com. But I think that's not necessarily where the line is, right? If an agent is an actor, there are a lot of other kinds of tools that are already custom-built for actors. And so I think we've got to broaden our way of thinking through these problems a bit. [0:49:44] KB: I love that. And it's a good segue to another topic here, which is - I mean, VS Code has always been very extensible, very plug-in centric, very open. How are you thinking about that within this new world? Are there things that need to change? I know you mentioned MCP servers. That's definitely one way of interacting. Are there changes going on in that landscape as well? [0:50:10] KM: There's an interesting duality here. One is that in order to get new functionality into VS Code, you had to write an extension. But now you actually have direct access to LLMs that can operate some of that, right? You said, "Oh, I have a custom prompt file that does X for me" - to kick off a research background agent. Wow. But you can also add custom prompt files and actually do things in interactive mode in the IDE.
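The kind of prompt file described here might look like the following - a hypothetical sketch, with front-matter keys (`mode`, `description`) following VS Code's prompt-file convention; the prompt text itself is invented:

```markdown
---
mode: agent
description: Research how a proposed idea would land in this codebase
---

Research what it would take to implement the idea described in the
request. Do not modify source files. Write an analysis document that
lists affected modules, risks, and a suggested implementation plan.
```

Checked into the repository, a file like this becomes reusable, shareable extensibility without a single line of extension code.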
And now if you have the capability to put a keyboard shortcut on this, all of a sudden, there's extensibility right there without writing any extension and so on. An interesting kind of duality here. Because for some things, you still want an extension. But for some others, you have a lot of flexibility already without even being required to write an extension. That already changes extensibility. Then MCP servers are a newish concept these days, right? And that is interesting, right? Because the MCP spec covers a lot. But the most interesting part, the most used one, is that you can actually make tool calls. You get pretty much a tool host. And that is clearly - I mean, there we're still very early when you think about this. Some of the aspects are kind of obvious. But now you actually have people who kind of say - Anthropic just posted a couple of weeks ago about programmatic MCP tool calling. And then you're like, "Oh, I can clearly see where it comes from." But now we're really just making different APIs. It's like, why are we not calling them the real APIs? Why is there a differentiation between the normal APIs and the MCP servers, right? You can see that they then start fusing together. And MCP is just the API that you publish, right? And nothing else. And so that would make a lot of sense. But now you put this together with the autonomy of an agent, and you end up in a potentially scary world, because now you need to think through the security implications. You need to think about how you actually control this. What do you allow? What do you not allow? Identity management, permissions for agents, all of these kinds of things, right? What we're doing today is we create a sandbox for some of this, right? But it already starts. People go and say, "Oh, Context7." I should be quite careful how I phrase this so that your takeaway is not, "Oh, this is Context7. There's an issue."
But I give this as an example. You register your website there. It's crawling markdown files. I'm sure there's some sanitization going on. But wherever there's sanitization, you can actually play games with it. Now, you have an MCP server that actually finds those pieces of documentation, puts them in your prompt, and now what? Now, it starts building and executing code and so on. It's this poison-the-well kind of problem, right? You need, in a way, to control all entry points into this. But this creates the most awful user experience. [0:53:46] KB: Yeah. No, this is a fascinating domain. Because in essence, all of these large language models are another form of running computation. It's word-programmed rather than formally programmed. And so any MCP server is injecting code that's running on your box. Do you trust it? Do you trust everyone who's able to get anything into that? [0:54:09] KM: Yes. Right. It's exactly the question, right? In VS Code, for example, consider the way you actually do tool approvals, right? First of all, this is just running the tool. Again, inputs/outputs, right? What you just said is the part where you're saying, "Oh, are you okay with this command being executed?" And it can be local if the MCP server runs locally. It might install packages, right? Via uvx or npx, if it hasn't run yet, right? Interesting. But that's number one, right? Are you okay with this one being executed? Say okay. Do you want to do this for this session, or for this particular call? Are there certain patterns in a command that you actually want to allow? There's a lot of room in order to get this kind of configuration right. But then you make the call. And let's assume this is a remote MCP server. Now, what you pretty much said is, I'm okay that this server is being called, maybe with my authorization token. But now there's a response being computed. And that response also goes - either in a summarized form or in its actual form - into the history of your chat.
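The pattern-based command approval mentioned above can be sketched like this - a toy illustration with made-up allow and deny patterns, not VS Code's actual approval rules:

```python
import re

# Hypothetical example patterns. A real configuration would be
# user- and policy-specific.
ALLOW_PATTERNS = [
    r"^git (status|diff|log)\b",   # read-only git commands
    r"^npm (test|run lint)\b",     # safe project scripts
]
DENY_PATTERNS = [
    r"rm\s+-rf",                   # destructive deletes always need approval
    r"curl\b.*\|\s*sh",            # piping downloads straight into a shell
]

def needs_approval(command: str) -> bool:
    """Return True if the command should be shown to the user first."""
    if any(re.search(p, command) for p in DENY_PATTERNS):
        return True
    # Only explicitly allow-listed commands run without asking.
    return not any(re.search(p, command) for p in ALLOW_PATTERNS)
```

The design choice is deny-first, then a default of asking: anything not explicitly allowed still requires approval, which trades convenience for safety.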
And you're like, "Oh, what now? Now you need to pretty much review everything that comes back." But again, that's awful. You want to use specialized security models to actually do this kind of monitoring. But now you're pretty much in this kind of who-wins, head-to-head race, right? What I haven't even touched on is that this is just chat, one avenue, right? But now you go to the terminal and terminal commands - do you want every terminal command to ask for permission? Now, you've got to go and say, "No, no. Let us actually read and understand what that terminal command is." And if that terminal command actually feels safe, let's do it. But then you also need to give the user control. I mean, a user in an enterprise setup might think differently about which tools should be callable without explicit permission, and which need permission every single time, than if I'm running on a VM in the cloud that I just leased for this particular use case, for example. [0:56:46] KB: I do wonder if it leads towards a world where essentially all development is actually happening inside of a container or VM. [0:56:53] KM: I think if you think this all logically to the end, that is, I think, the place you're getting to. But then, still, you need to control the inputs and outputs to this container. Fetching a web page, right? When you go and say, "Hey, I need the latest version of Node," well, you get an install command that runs in the terminal, and the moment it runs, something comes onto your box, right? Yes, you want this to be safe. You want to be able to close the doors and say you cannot get out of it. But my point here is the problem doesn't go away even if you put this into containers. But you can control the environment in a better way. We still need to think about how to make this a good user experience - how to make it understandable for you and so on. And when I say understandable - we didn't say this explicitly.
But I think our conceptual model here, when we talked, was, "Oh, there are these one or two background agents that I have." But now multiply this by 100. And all of a sudden, that is a very, very different problem. And we need to work through all of this. I mean, that's already starting when you think about this. With cloud agents, for example, let's say you're on GitHub, right? You groom, you assign a bunch of issues to Copilot, or you actually have auto-triaging enabled. You go and say, "Oh, we auto-triage certain ones." Now you get the PRs for this. You review them. Reviewing five or 10, depending on size, might be fine. Reviewing 200 a day, you - [0:58:40] KB: I've reviewed more code in the last six months than I can remember. It's ridiculous. It's wild. I think this gets to kind of where I want to take us towards the end, and we're getting close to the end of our time here, which is: where do you see this going over the next year or two? I hesitate to go too much farther out, because, as you've highlighted, things are moving so fast. But how is VS Code evolving, and this whole world of how we're managing the writing of code? I say it that way because maybe we're not actually writing the code, but we're managing the generation or writing of code. Where do you see it going, and what's coming down the pipe? [0:59:21] KM: I'm not sure I look two years down the pipe, right? Because we might surprise ourselves how quickly we end up in a given place. But when I think about it - we're still learning about the interactivity models. And that's active research, where we go in and say we implement it one way, we implement it the other way, we look at how people actually accept it. Where does the percentage go? In the beginning, it was a lot of tab-tab-tab. Now it's more like, "Oh, the percentage is lower." But it depends on the experience level of people and what part of the code they're in, right?
When you talked about how there's something that is really sacred land, and then there are other things - all of these kinds of things are influencing what those interaction models are. And that will change. And we'll figure this out. I mean, different ideas will come from different areas and so on. But then, there's this whole point about how we use agents effectively. And I think that is also a very hard problem. Because, again, people say, "Oh, we run many agents in parallel." Yes. But now, in what circumstances? If I have a project like VS Code, where we go through 3,000 issues a day - sorry. A month. Not a day. [1:00:46] KB: That's projecting forward two years, right? [1:00:49] KM: Probably. But, yeah, 3,000 issues a month, right? And that was just based on human activity. You put AI into the mix now, and that number needs to go up. Then how much of - you can do more parallel work, right? Because there are different boxes that you can execute on. As a team member, you own a bunch of tags, right? Those are yours. And you can go and have a couple of agents running on each of those tags. And that's kind of fine. But if I go and think about creating something new, then actually running multiple things in the background is way more complicated. Because it's easier to think step one, two, three, where one, two, three build on top of each other, rather than, "Oh, it's one. And then it's 2A, B, C, D," right? It's way, way more complicated. [1:01:47] KB: We hit cognitive limits. Yeah. [1:01:47] KM: Yeah. Absolutely. Right. As humans, we're pretty much the weakest link in the chain. Assuming that we really get to a place where the code that comes out is in good shape. But we're not there yet. But you can see that this is happening. How do you actually really work with this level of parallelism and so on? I think there's work to be done.
I think where we clearly will end up - and just think about real-world, large-scale operations, where you go and say, "Hey, I have to make a change here. I want this new vertical feature to go in. But it actually touches dozens of repositories, different service deployments, all of these kinds of things." Now you end up in a world where you pretty much need to create a plan, almost like a project plan. Agents need to go, and one solution is you have a monorepo, and you just have one agent running around. But what's more likely is that it stays a distributed world, at least for many people out there. And then you need different instances of agents - a main agent delegating to other agents. They are running, and they need to communicate with each other. They need to report back where they are. The reporting back cannot be a markdown document anymore. They potentially need to go back and say, "No, here is really the change tracker. Here are the different issues. They're linked to each other," and so on. Maybe you need to see this on the planning board in order to understand it. This has a lot to do with - again, the human is the weakest link in the chain - transparency. What is actually happening and so on. I think there is a lot that will happen in this particular area where we need to go. Yeah. Then there's one other aspect, and that's the one I personally struggle with the most, which is - it is very easy to say, "Oh, I can code wherever I want. And if I have an idea, I just type it into my phone and send it off," and so on. Certainly true. There are some use cases for this. But I'm not quite sure how - I mean, I want to work on my iPad. That's clear. But this is just a replacement for - I just sit in a different place. I don't want to use my laptop. And so these kinds of things. But really, what's the role of mobile, of smaller form-factor devices, really? How much do I want to do on my phone?
I can see that maybe voice plays a big role in this one. But other than that, I don't want to review code on my phone. [1:04:43] KB: Yeah. I was going to say, kicking things off, great. Reviewing code, miserable. [1:04:47] KM: That's right. That's right. And then I think that last part - and again, it's actually not that surprising when you think about this. We like to be creative. And in what environments are we creative? And I really could see that we're still in very traditional kinds of collaboration forms. And as we said, you could have a Slack channel with your AI agent, for example. These kinds of interactions. But I think the other one is, what is it that we really like as humans? We like to stand at a dashboard, work together. Huddle together and do something together. There are materials on a table that we shuffle around in order to talk, right? These kinds of things. How would you replicate these kinds of things? I think sometimes - Microsoft had this Studio PC, right? It's a large screen. You could fold it down flat, right? Now you take something like this, where you can draw on the screen - particularly when you do UI development, for example. You go, you draw on the screen, you say, "This is what I want," right? And so on. And then you can actually talk to it at the same time while you're drawing and say, "No, this here should be a little bit more over here. What do you think about this? Give me two alternatives," right? And so on. Then all of a sudden, you have a very, very interactive setup that makes us happy as humans. There's a lot of dopamine in this. And it can very well be AI-supported - by voice, by multimodal inputs, and so on. And there might or might not be code involved in this - well, there's code involved behind the scenes. But this I can clearly see. And I think we will see a lot of this coming forward. [1:06:46] KB: I think that's a great cut point. [END]