EPISODE 1898 [INTRODUCTION] [0:00:00] ANNOUNCER: AI coding agents are rapidly reshaping how software is built, reviewed, and maintained. As large language model capabilities continue to increase, the bottleneck in software development is shifting away from code generation toward planning, review, deployment, and coordination. This shift is driving a new class of agentic systems that operate inside constrained environments, reason over long time horizons, and integrate across tools like IDEs, version control systems, and issue trackers. OpenAI is at the forefront of AI research and product development. In 2025, the company released Codex, an agentic coding system designed to work safely inside sandboxed environments while collaborating across the modern software development stack. Thibault Sottiaux is the Codex engineering lead, and Ed Bayes is the Codex product designer. In this episode, they join Kevin Ball to discuss how Codex is built, the co-evolution of models and harnesses, multi-agent futures, Codex's open-source CLI, model specialization, latency and performance considerations, and much more. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [INTERVIEW] [0:01:48] KB: Hey, guys. Welcome to the show. [0:01:50] TS: Hey. [0:01:51] EB: Hey, thanks for having us. [0:01:52] KB: Yeah. I'm excited about this one. You guys are doing some really interesting stuff, and I want to dig in. Let's start with you a little bit. Can you each give a little bit of your backgrounds, how you got involved with Codex, and what you do there? [0:02:05] EB: Yeah. I'm a product designer on Codex. I've been at OpenAI for just over a year. Before that, I worked in robotics and generally at the intersection of design and research. Yeah, I've been on the Codex team for about six months, and with each model release, each product release, I've just gotten more and more into the coding side, and I'm excited to chat about how we use it on the team today. [0:02:24] TS: I'm Thibault. Joined about the same time as you, actually. I've been tinkering and thinking about AI and intelligent systems for as long as I can remember. It's one of the first programs I tried to write as a kid. Then over time, it got more and more fascinating. I feel like, today, it's really come to life, right? Where I finally have the thing that I was trying to build when I was seven, where actually, I'm able to type in my terminal and get an intelligent response back and have this little assistant in my computer. It's actually wild to think that that has come true. Yeah, I joined OpenAI about a year and a half ago. None of this was possible then. We didn't really have reliable agents doing work over many, many hours at a time. I've been tinkering with that at OpenAI since I felt that models were actually capable of it. Late last year, I became obsessed with this idea that model capabilities would continue to evolve, and it was really about getting the right infrastructure and product around them, so that we could continue to benefit and have that step change in utility that you can get from the models compared to just being able to chat with them.
Felt like chat was a bit saturated, and then we're able to express a lot more things. It evolved over time. There was a lot of prototyping earlier this year, and then it really came together as a team. Now we're pushing on Codex with quite a few people over here, and it's more exciting than ever, I would say. [0:03:46] KB: Yeah. I definitely have felt that acceleration across the board in the last couple of years, that is wild to experience in our industry. I'd love to actually dig into a few of those different pieces. One of the distinctions you made there is around the models, the model capabilities and their advancements, and then the infrastructure and the harness and all these different pieces around it. I'm curious, from your perspective, how you and the team think about the relationship between those two. How do they connect and feed back into each other? [0:04:17] EB: Yeah, it's a good question. I mean, I think on the research side, on the infrastructure side, I defer to Thibault. I think one of the really interesting developments that's happened over the past, say, six or seven months, is this co-evolution of the model and the harness. I think it's really come together in our products, in that if you use our models in our harness, it's different than if you use them elsewhere. I think that's really exciting as a product person, as a designer; the idea of not just building a model that you can use in an API and that shows up elsewhere, but really co-evolving these two together, and all the incredible things that that can lead to. [0:04:51] TS: Yeah. Definitely that element of co-evolution. That co-evolution is happening at many levels. There is co-evolution of the harness and the model, co-evolution of the products that need to evolve at a really rapid pace right now. It definitely doesn't feel like we have yet figured out the ultimate form factor of how you interface with an ever more intelligent system that is doing all these things for you on your behalf. If you think about the harness, it's really just your body, right? You have your brain, you have your body, like how you end up acting upon the world around you. Then there is a little bit more to that as well, which is how you act, but safely. One of the things that we do out of the box is that Codex exists inside a sandbox. The network access is restricted, the file system access is restricted. This is really important, because it allows the model to experiment and touch its environment, but without potential negative consequences. This is an important topic, where we view coding agents very much under the lens of alignment and safety. There is this aspect as well of, where does the harness stop, and where does it start to be the world? We're definitely seeing that when we think about the two together, we get much better results, and I think this will continue to be true. Then there is this separate aspect of, what is the right interface to this agent? That's where products really come into play. I think that, yeah, this will definitely need to continue to evolve as we have agents that are never interrupted and just run forever. That's going to be a whole other game at that point. [0:06:20] KB: Sandboxing is an interesting one to maybe just pinhole down into for a minute there, because I think, one of the things that stands out to me - I use all the agents, at least as an aspect of research. Some of them I use every day. I was using Codex to solve a problem for me earlier today.
Some of them I just try and then I say, "You know what? You're not ready, or I'm not using you." But one of the things that stands out about Codex is the strong sandboxing model. Everything is sandboxed to begin with, and that is both good and sometimes frustrating and can cause some awkward user experiences. I'm curious how you think about that balance and how you see this safety question evolving across the ecosystem here. [0:07:03] EB: Yeah, it's a really good question. I mean, I think from a product perspective and from a user experience perspective, as you say, that's where some of these tensions surface, from you always being asked to approve certain commands. Ultimately, these agents are extremely powerful. We have great sandboxing, great safety features, and that's a core part of the product as well, in terms of why you might use Codex above others. Within those constraints, I still think there are some interesting things that you can do around user experience to make it a little bit easier and put some control in users' hands. Users can change their sandbox permissions. They can change the mode. If you use our product in the IDE extension, you can basically choose between agent mode, where it will go off and make changes in your working directory, or just read-only mode, which is a little bit more restrictive and will ask you for permissions in many areas. I think one thing we recently released, which I think is quite exciting, is that as you go along, as you approve certain commands, we give you fine-grained control over exactly which commands you are approving and how they will be saved into your config. We're exploring as well where that sits between you and your team. I think, ultimately, it's about giving users control, but still maintaining that really high threshold of safety. [0:08:11] TS: Yeah, and if we take a step back to where we started with Codex, it was Codex on web, sometimes referred to as Codex cloud, but we started with the idea that all of this should happen in a safe environment. It's a completely isolated virtual machine with its own sandbox. We use Kata Containers under the hood. Then from there, we decided to actually bring that to your machine, through Codex CLI and the Codex VS Code extension, and definitely keep true to that principle that it should be safe by default. It doesn't matter how convenient it is to run outside of a sandbox. Ultimately, if you were not using a sandbox, you are giving control to a very capable and intelligent entity to do whatever it would want to do to your machine, using your own credentials, with any consequences that this can carry. By default, we prefer to be safe. Obviously, there are use cases where you don't want to use a sandbox, and we do caution against that. But it's also something that we do support if you do know what you're doing. [0:09:13] KB: Yeah. Well, and I will say, Codex has never tried to delete my database, which is not true of every coding agent I've tried. [0:09:20] TS: Sometimes it can be that the agent does something inadvertently and that has negative consequences on you as a user. It could also be that it's been instructed to. There are obviously prompt injections and other risks to think about. Ultimately, if you do give an agent control over something that's quite sensitive, where it could either delete it or take some other nefarious action, that is something worth thinking about as a user.
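To make the sandboxing and command-approval discussion above concrete, here is a minimal, hypothetical sketch of how a harness might gate the commands an agent wants to run and persist "always allow" decisions into a config file. It is illustrative only, not Codex's actual implementation; the real sandbox and approval logic lives in the open-source CLI.

```python
# Hypothetical sketch of command approval with a persisted allowlist.
# Not Codex's implementation; see the open-source Codex CLI for the
# real sandbox and approval behavior.
import json
import shlex
from pathlib import Path

CONFIG = Path.home() / ".myagent" / "approved_commands.json"  # made-up path

def load_allowlist() -> set[str]:
    return set(json.loads(CONFIG.read_text())) if CONFIG.exists() else set()

def approve(command: str) -> bool:
    """Ask the user before the agent runs a command outside the sandbox."""
    allow = load_allowlist()
    program = shlex.split(command)[0]  # approve at the granularity of a program
    if program in allow:
        return True
    answer = input(f"Agent wants to run `{command}`. Allow? [y/N/always] ").strip()
    if answer == "always":
        allow.add(program)  # fine-grained approvals saved into your config
        CONFIG.parent.mkdir(parents=True, exist_ok=True)
        CONFIG.write_text(json.dumps(sorted(allow)))
        return True
    return answer.lower() == "y"
```

The point of defaulting to a sandbox is that a gate like this is only a second layer: even a command that slips past approval still runs with restricted network and file system access.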
We do really feel the responsibility that we have there - to make sure that there are no unintended negative consequences. [0:09:55] KB: The thing that you mentioned in terms of the different ways of running Codex brings me back to another thing I'd love to hear from you guys, which is how do you use Codex internally? Are you running it all through Codex web, or cloud, whichever one you're calling it now? Or do some of you use the IDE? Are you CLI geeks like I am? How does that play out internally? [0:10:12] EB: That's a really good question. Yeah. I mean, I think there's a bit of a meme, which is everything is Codex, right? We have a bunch of Codex models, you have Codex web, and then we have the CLI product. We think about it as the same coding agent that shows up in different spaces. Internally, it's been really cool to see how it's evolved over time. As Thibault said, we initially shipped the web product earlier this year and got great excitement internally for it. I think the really cool thing about it as well is you connect your GitHub and your team settings, and you can go in and, without touching a line of code, you can ask for something. It can do pretty amazing things. That's super empowering for, perhaps, a UX copy team, or maybe go-to-market folks who want to change some string about pricing. They can do it themselves. They don't have to bug some front-end engineer to do that. I think that's really cool. That was one of the first use cases that we saw. Then I think the CLI is really popular, right? We have a bunch of incredible developers across the company, and developers often live in the command line. I think that's become really popular. Also, personally, I use it in the IDE extension a lot. I prefer the GUI. I prefer being able to click around. Also, that's just my go-to development environment. Some other really cool things as well that I think we've seen recently: we've shipped a Linear integration, we've shipped a Slack integration. What you will often see now in threads is you might be chatting back and forth, maybe about a piece of customer feedback, or some new feature that people are discussing. Someone can just hop in and @codex, and basically, that will kick off a task in the background. It will route it through all of our different integrations, Slack or Linear, and it will just ping you with a task that you can click, and you can open it in the web. That's cool. It's seeing it surface within threads, and you can assign issues in Linear as well, which is super fun. I'd say, yeah. It's one of those everywhere things, but I feel like the CLI is pretty popular. [0:11:55] TS: Yeah, there's a lot of different use cases among technical staff, but we also have a lot of ambient intelligence, where it's all around you, including code review, where every single PR that is written at OpenAI is always reviewed by Codex, and it acts as this safety net, where it's hard to think about a world where we wouldn't have that safety net anymore, given how many critical flaws it catches every day, and how much time it saves. It's really able to go much more in depth than the time that we have when we're reviewing each other's code, especially now that generating code is so cheap. The cool thing as well is it's not just about technical staff; more and more people across the company are using this tool to do a lot more than just writing code. [0:12:45] EB: Yeah.
I think one really cool trend that I've seen over the past few months is within the design team, right? We have a few of these Slack groups, these work-in-progress groups, where people will post work, and I've seen this basically slow, well, not that slow, change over the past few months from static images from Figma to these interactive prototypes, even sometimes links that you can click into and use yourself, which is cool. I've DM'd a few people who post them, like, "I didn't know you could code," and they're like, "I couldn't, until I tried Codex." There's this range, right? It's for professional software developers, who obviously have a very high bar of code review standards to go through. It's for these throwaway prototypes that designers can play with, so you can test responsiveness and all of these edge cases that you can't in a static prototype. It basically collapses the boundary between disciplines, which has been slightly artificial over the past 50 years or so of recent history in technology, because of these disciplines and often, even in organizations, boundaries of which staff can access which technology. It's a great equalizer, I think. [0:13:47] KB: It might be worth us going through - you mentioned a few different points in the software development lifecycle where Codex is taking hold now, or speeding things up, or simplifying, or collapsing boundaries. Have you thought rigorously across that whole lifecycle? If we look at our industry, we're all trying to figure out how to adapt. I think the process of developing software has probably changed more in the last year and a half than in my 20-year career before that. It is wild. How are you adapting across all of those different points using Codex? [0:14:17] TS: It maybe goes back to that co-evolution, where you can do a lot of first-principles thinking and trying to understand how exactly you should structure the teams and the work to best benefit from this as it's going, or you can just stay very flexible and learn every day as you're co-evolving the ways that you work as a team, as an individual, and as an organization together with the coding agent. That's definitely a lot of what we're seeing, where, for example, small teams that have a lot of energy and ambition are able to achieve so much more, and are highly effective because they can iterate and learn much faster. We've seen this with Sora. We've seen this recently with Atlas as well, where entire parts of the code base were able to be spun up just based on an idea and a few individuals that were really steering a whole series of Codex agents. But then, also, it's clear that bottlenecks are moving around, so code generation is almost maybe solved right now, and the bottleneck is moving to code review, moving to deployment, also moving to planning and bringing in a lot of these ideas and the user feedback. We're thinking about how to solve those bottlenecks. With the Codex team, we're definitely not just focused on code generation. This is why we started to invest very early on in code review, because we identified that this was going to be a bottleneck. There's a lot to the story here, to the picture. Some of the bottlenecks we anticipated beforehand; some of them were like, ah, we hadn't really thought about this, and now this is breaking, because everything else has gotten so productive. [0:15:53] EB: Yeah.
I think the thing that has really surprised me since joining OpenAI is just how small some of these teams are that build these products that reach billions of people. I remember chatting to a designer who was on, I think it was deep research, one of these products, and it was like, one PM, one designer, a few engineers, a few researchers, and it's purposefully small by default. I think internally, the way that we're able to do that is that we're co-evolving as co-workers with the models as well, right? We're building the models and we're able to access them immediately and really integrate them into people's workflows. I think that's very cool to watch. Also, yeah, the way that we're building products is we're building for professional software developers, which means thinking through the entire lifecycle of product development, which, as Thibault says, is not just writing code. It's the planning process at the beginning, right? It's using tools like Linear, or Slack, and meeting people where they work, where they speak, where they plan work, integrating coding agents there. It's at the code review point as well, which Thibault has already spoken about. I think an interesting thing to look at in the future is thinking through the full lifecycle of software development, and where you can support beyond just code generation. [0:17:00] TS: There are parts there that are easier to crack. Going back to the safety and the sandboxing part of the conversation, clearly code generation is one of them; it's easier for it to happen in a sandbox. If you're thinking about what happens next around deployment and being on call for a service, now you enter a whole other realm for this agent. If we want intelligence to be driving this, if we want agents to be driving this, they need to act in a way that also carries a lot of risk. How do you do this? How do you achieve this? This is still very much, I think, an open question of how to achieve this safely. [0:17:36] KB: This goes to another question I have. As we talk about applying this in a wide range of things, one of the things that I've definitely observed in my work and working with a bunch of different things is that different models seem to be better at different things. When GPT-5 came out, we do a lot of work in Go and it is phenomenal at working with Go. It is phenomenal. Hands down, blew away every other model we were using. Sometimes it's less good at working with HTML and CSS. We still sometimes go to other models, maybe even non-OpenAI models, for some of that work. How do you think about the multi-model aspect of this, and the extent to which you're aiming for a model that can do everything versus going to the right model for the right things? Are you imagining a multi-model future? How do you see that ecosystem playing out? [0:18:22] TS: Yeah, we're definitely aiming for the holy grail, or one model that is spectacularly good at everything. Then you don't need to ever think again about which model to choose. In practice, what we do think it's going to evolve into is more like a multi-agent type of world, where you don't necessarily have to be the one deciding, hey, what is the right underlying setup of which precise model, which configuration, which tools in order to achieve that job? Maybe you will get help there as well. Realizing that, as much as humans collaborate in order to achieve useful things in the world, maybe it will also be the same for agents, where they have to collaborate together and use the specific strengths that they have.
There is a whole series of issues there of, as a model, how do you disclose your strengths? Is it something that the model even knows? Is it intrinsic to the model and a knowledge that the model possesses? Or is it something that needs to be discovered by you as a human, or by other models, in order to be able to understand, hey, this is actually the strength of this particular setup versus this other setup, which achieves maybe similar results at lower cost, or maybe this one achieves better results, but at higher latency. There are all these trade-offs, where I think it's going to be this beautiful world of collaboration between agents, but hopefully, also much simplified for you as a user. [0:19:48] EB: Yeah. I think that the meme in the design world is all designers are redesigning the composer, right? We work with this tension of how much do you expose the capabilities, right? These different modes, these different amazing things that they can do. Image generation, for example - a model like 4o is natively multimodal. You can just ask it stuff and it will do it, right? But how do you expose that in the UI? The same with the model picker, right? This meme as well, as we go back and forth, not just us, everyone: do you list out a thousand different options, and you yourself have tested them, so you know which one is exactly right for your use case, or do you simplify it? As Thibault said, I think, obviously, we're aiming for the ideal, the single model. But yeah, how we get there is to be seen. [0:20:29] KB: Now, if I open up - I have Codex CLI running here and I do /model, I see a list of five. You're clearly not falling into the show-everything trap. One differential I'm going to ask about here is, I see, for example, it's defaulting to GPT-5.1-Codex-Max. There's also 5.1 Codex. There's also 5.1. If we were to peel back the covers, how would you describe the difference between any 5-series generation and the Codex version of that? [0:20:55] TS: Yeah. Where we started, when we got significant traction with Codex and Codex CLI, it was roughly three months ago when GPT-5 came out. Just saying that, I'm like, I have to do a double take. That was three months ago. Three and a half months ago. Yeah, GPT-5. Then we had been training on the side another model which was even more effective, specifically within the Codex harness. This is how to think about it: you have GPT-5 and then you have GPT-5 Codex, and GPT-5 Codex is a version that will be more at ease within the harness that Codex provides and be able to achieve better results. This is always the model that we recommend. You have the same for 5.1 and 5.1 Codex. Then with 5.1 Codex Max, we were able to have a few research breakthroughs that we packed into that model, which made it even more effective and able to work for longer. We published a benchmark there with better results across different tiers. It's able to achieve stronger results, but also using fewer tokens and being cheaper on average, which allows us to just pack a lot more into the same subscriptions; whether you have a Plus or a Pro subscription, you just get more out of it. At the end of the day, it's really about how much economic value you are able to achieve, right? Either in a unit of time, or in a unit of cost. This is really what we're striving to provide. We've restricted the model picker to the few models which we think work very well in Codex. Then there is a default as well, which is the one we recommend by default for folks.
If you don't really want to think about it, just use the default and you'll be well off. [0:22:34] KB: What goes into making the model - you said it works better with the Codex harness. I will say, within the Codex CLI, I always use the recommended one, and it seems to just work. [0:22:46] TS: That's great. [0:22:47] KB: When I'm using, for example, Cursor, I will also use GPT-5.1 or whatever. Actually, in that context, I found just the bare model, 5.1, often works better for me than the Codex model. Now I'm like, what is it you're doing that's connecting it to that harness? [0:23:00] TS: Yes. It's really thinking about that co-evolution of the harness and the model, and thinking about it as one entity and one agent. Fundamentally, what we're building as Codex, the Codex team, is an agent. Then we figure out where to put it to work. The agent isn't just a model itself. The agent is the model together with the set of tools, and the way that it's going to handle its context and be able to think and reason through which actions it should take. It's pretty clear that if you co-evolve and co-train these two things, you can achieve better results, which is what we're achieving. [0:23:35] EB: I think one cool thing as well is Codex, the CLI product, is completely open source. To your question of what's going on under the hood, the great thing is we have a really vibrant open-source community who contribute a lot of great ideas and issues, and you can just go and look at the system prompt. It was also a funny thing. When we released the new model, there was this tweet which was, "system prompt leaked." It's like, yeah, it's in the open-source repo. [0:23:58] TS: It's just on there. There's nothing to hack. [0:24:01] EB: Yeah. I think in terms of continued capabilities, or tools, you can go and have a look, which I think is super exciting. [0:24:07] TS: There's a lot of effort and research that goes into what are the optimal tools in order to get the results that you want. Oftentimes, we're actually quite proud of how simple the harness is and how simple the set of tools is. This is something that we strive for, that simplicity, being able to have the harness scale with the continued capability jumps that we expect to see over the coming months and years. It's something that if you don't optimize for, it eventually comes back to you, because you have hyper-optimized something in the short term that doesn't scale with continued capability improvements. Then by being so close to the - Codex, we run it as one unit. We have product, we have engineering, we have research, we all sit together, ideating a lot and using some techniques from research to put them in the harness, and using parts of the harness and using that in training. There's just this zeitgeist there and the sharing of ideas and always zooming in on what will make the agent perform better as one unit. It's not about optimizing the model in isolation, it's not about optimizing the harness in isolation, it's finding that combination that works the best together. That's what the Codex series of models offers as well: that guarantee that we have considered how well it actually works in Codex. That's the best that we can do. [0:25:30] KB: If it's not too much secret sauce, how do you consider that? Is that related to the reinforcement training that you're doing? Is it a different initial data set? What is actually causing it to behave differently there?
[0:25:42] TS: It's really about thinking about the model as not just needing to be intelligent, but needing to be an efficient agent. If you think about what an agent is, it's going to be a model that gathers its own context and acts in its environment in order to achieve a goal. If you set out to train a model to be extremely good at that and be an extremely good coding agent, you're making different trade-offs. You'll find that you're able to take different trade-offs at the research level, be it at the post-training or the RL, or the specifics of the training, which we're not going to go into, but the trade-offs are there. You're able to achieve efficiency gains and move up the performance curve. [0:26:24] KB: Now, I mentioned some of the models I see. There's one other model that I see in this list, which is GPT-5.2, which I think was not there when I looked at this a week ago. What's that about? [0:26:35] TS: We just released it yesterday. It's been very successful. More so than maybe we anticipated. Actually, some of the team was up all night, shuffling compute around and making sure that it kept working and we were achieving the latency targets that we have. For agents, the latency of the model and the reasoning, and exactly where the compute is, is more important than ever before. This is because you have that latency element between that GPU that we run somewhere and then your computer where the tool calls are run. You always have this back and forth. Then obviously, if the model is able to perform and we're able to sample more tokens per second, that's going to translate into a shorter amount of time to get that result. 5.2 is a particularly exciting model launch, I would say. It's a significantly higher jump than one might expect, compared to 5.1. GDPval captures this fairly well. I think a lot of the benchmarks these days are saturated. A good way to think about it is the economic value that you're able to create in the world. On GDPval, I think we see a more than 20% jump there. I definitely recommend trying it out in Codex. It's quite exciting. Depending on when this podcast goes out, you might have something even more exciting to try out, but we'll see about that. [0:27:53] KB: I think that is interesting. Thinking about that interaction between local and data center - so how are you all - I mean, some of that, I'm sure, is proprietary, but how are you thinking about locality in this? Are you pushing compute? What does that look like for someone, at the scale you're at? [0:28:12] TS: The closer the compute is to your laptop, if that's where you run Codex CLI, the better, right? Because you reduce that round trip. Another way of doing it is to bring the execution environment closer to the compute, right? To bring, for example, virtual machines and have those effectively be as close to the GPUs as possible. That's the approach that we take with Codex web. But then, if you're running locally within your VS Code extension, and the agent runner is effectively running on your machine, then you do want that GPU to be as close to you as possible. There's an element of, where in the world is that running for you? Sometimes it might be that you're somewhere in the middle of nowhere on an island, and we're not running a data center there. You're just going to feel that actual latency.
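To put rough numbers on that round-trip point, here is a back-of-envelope latency model: each turn of the agent costs one network round trip between the machine running the harness and the GPU, plus token sampling and local tool execution. All figures below are invented for illustration, not Codex measurements.

```python
# Back-of-envelope model of agent session latency. All numbers are
# illustrative assumptions, not measurements of Codex or OpenAI infra.

def session_seconds(tool_calls: int, rtt_s: float, tokens_per_call: int,
                    tokens_per_s: float, tool_exec_s: float) -> float:
    """Each tool call costs one harness<->GPU round trip, plus token
    sampling on the GPU, plus running the tool locally."""
    per_call = rtt_s + tokens_per_call / tokens_per_s + tool_exec_s
    return tool_calls * per_call

# 200 tool calls; 150 ms RTT (far from the data center) vs 20 ms (near).
far = session_seconds(200, 0.150, 100, 100.0, 0.2)   # 270 s
near = session_seconds(200, 0.020, 100, 100.0, 0.2)  # 244 s
print(f"far: {far:.0f}s, near: {near:.0f}s")
```

The network term scales with the number of tool calls, which is why co-locating the execution environment with the GPUs (as described for Codex web) and sampling more tokens per second both translate directly into shorter wall-clock time.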
[0:28:59] KB: Let's actually dive in a little bit to the guts of the agent, because I think most of the software development world right now is trying to figure out how to build effective agents, and I think coding agents are really at the frontier there, pushing the edge of what that looks like. Can we break down, just first, the very high-level pieces that you think of that go into this agent - the software layer, not the model? [0:29:24] TS: Yeah, what are the higher-level pieces of it - we touched a lot on it already. You have the model and the inference. That's going to be the intelligence that's driving the rest of the software stack. You have this interesting combination of a piece that is non-deterministic and a piece that is deterministic. At least for now, a lot of the harness is considered to be deterministic. It's quite simple. If you look at it under the hood - it's all open source for Codex - there isn't that much magic. It's a for loop and then a bunch of tool calls, and then tools that have been designed to work well for coding. It's a pattern that you can apply, essentially, to any other discipline, any other agent: control going from the model back to its environment, executing an action, then taking what's been observed in the environment and pushing that back to the model in order to decide the next action, and then doing that over and over and over again, maybe hundreds of times, until the point where the model believes that the desired outcome has been achieved and decides to stop. At the very beginning, you have a prompt, or an intent from a user. Then you give control to the model, which decides on the next tool call, goes on and on and on, and at some point has achieved the goal, or is unable to achieve it, and decides to yield back control. That's when the agent has finished its job. Delightfully simple: it's just a couple of tools, a for loop, and then a model that's given control over that. The really exciting thing, I think, is really the products around it that allow you to have control, steer, and supervise those agents. Then, as well as thinking about the agent being its own little system that will continue to evolve, becoming increasingly more complicated and able to perform increasingly complex work, it's maybe not a single agent that's going to be at work. Maybe it's going to be multiple agents that are going to be at work. I think a really exciting thing is how do you interface with this ever more complex system that is doing work on your behalf? [0:31:39] EB: Yeah, totally. I think there are some parallels if you look at ChatGPT and when it was released, right? It's very, very simple. You go online, there's a text input, you type some intent, some message, and the model responds back. As Thibault says, in this world where we have an agent loop and an agent is carrying out work for you, maybe it's delegating to other agents, it's collaborating with other agents, it's speaking to external agents even, I think the user experience changes. It goes from this back and forth to a little bit more how we interact with other humans in the world today, right? If I ask Thibault to get me a glass of water, it's going to take him a little bit of time. He's got to go outside, he's got to do a bunch of stuff, right? If I'm collaborating with a colleague, I might ask them to do some significant task. Say, build some new infrastructure project. It'll take time. They'll have to go out. They'll have to coordinate.
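As a concrete rendering of the loop Thibault describes above - a model, a few tools, and a for loop - here is a minimal sketch. The names (call_model, TOOLS, the message format) are illustrative stand-ins, not the open-source Codex internals.

```python
# Minimal agent loop: control passes from the model to the environment
# (a tool call), and the observation is pushed back to the model, over
# and over, until the model decides to stop. Illustrative sketch only.
import subprocess

def run_shell(args: str) -> str:
    """Run a shell command (ideally inside a sandbox!) and capture output."""
    result = subprocess.run(args, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"shell": run_shell}

def run_agent(user_intent: str, call_model, max_steps: int = 200) -> str:
    # Context holds the prompt plus every action/observation so far.
    context = [{"role": "user", "content": user_intent}]
    for _ in range(max_steps):  # "it's a for loop"
        action = call_model(context)          # model picks the next step
        if action["type"] == "finish":        # model yields control back
            return action["summary"]
        observation = TOOLS[action["tool"]](action["args"])
        context.append({"role": "tool", "content": observation})
    return "stopped: step budget exhausted"
```

Everything interesting - which tools exist, how context is handled, when to stop - lives in the model and the handful of tool implementations; the loop itself stays deliberately simple.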
I think we're moving to longer and longer tasks with more and more complexity, and models that are increasing in capabilities, and I think the interesting question from a product perspective is then how do you design those interactions in a way that is simple, maintains simplicity, also fits into everyday workflows nowadays, and also exposes these incredible capabilities of the model in a very simple way. [0:32:59] KB: One of the things that's interesting to explore in this domain of user experience of agents, especially as we're going on is, when I wrote code in the olden days, right, it was doing multiple things for me. It was creating this runnable artifact that somebody could interact with. Okay, that's great, that's useful. It was updating my mental model of the system that I have, that I'm working with. It was also doing some amount of problem solving and updating of my mental model, probably, of the user's problem, or at least how to map that user problem into my system. There's this cascade of mental models that I'm updating, as well as this final artifact that's being generated. Now, as we get into this world where we're delegating more and more of the work of generating the artifact, there is still this very real need for us to update our mental models across the board. How do you think about, or see that working in this agentic world? How does the product facilitate it? What does that look like? [0:33:52] TS: Yeah, I think that's a super important point. It ties a lot, in my mind, into what we've seen people use coding agents for primarily over the last, say, three to six months, which has been a lot of solving and writing code on their behalf. I think there's a much deeper role that agents have to play in the future, which is to understand, hey, what do you care about? How can it help you understand the state of the world efficiently around you? Maybe it should send you something every day: here's how the code base changed, here's what users are thinking about the product, here's how to really explore this topic a little bit more. You go much further than just code generation. You're helping with planning. You're helping with ideating. You're helping with understanding user feedback. You're bringing a lot more context into play than just code itself. In a way, if you were just to focus on code generation, you would miss out on a lot of the opportunity here. We're thinking about this broader set of things that we can help people with. I think it's going to be ever more important. Maybe code generation actually will be a very small part of what agents end up doing for you. We're definitely thinking about this as a product. [0:35:08] EB: Yeah, yeah. I mean, it's interesting. A very small, maybe, example on the team is, I think, when a new starter comes on board, right, it often takes a long time to get used to a code base; you have to really get to understand it, as well as writing code. I've seen new engineers on the team just speaking to Codex and really deeply understanding the code base, going back and forth. That means that they don't need to tap on their colleagues' shoulders as much anymore. If they do, it's for some really high-value touch point. Yeah, as you say, I've seen people use it for all sorts of things: writing notes, code understanding. It's, yeah, really beyond just code generation. [0:35:42] TS: Yeah.
There's this awesome thing about giving Codex to someone who just started on the team and being like, "Hey, explore the code base with the help of Codex." Then we barely write documentation about how things work, because that's just in the code itself. What we tend to document more is why things exist. I think there is going to be an evolution there as well of, how do we maintain the knowledge base? How much of it is redundant? Definitely, when you have intelligence and a little buddy that you can just send off in order to explain something for you, you tend to find that maybe it also shifts what you want to write down. [0:36:21] KB: Yeah, absolutely. Well, and there's an interesting thing there. One of the techniques that we found works really well for us with agents is actually documentation that is maybe transient in some form. It's like, here's this problem that I'm solving; gather all the relevant pieces and documentation and link to the relevant files, so that I have one short piece of context. Okay, now use that to get me to a solution on this particular thing. It's much more temporary documentation than permanent documentation, but giving the agent this map that it can work with. [0:36:54] TS: Yeah. Is this more like a design doc, where - [0:36:57] KB: Depends. [0:36:57] TS: Or like, how does it differ? [0:36:59] KB: We can dive into this in a couple of different ways. I'll use a very quick example of one of my common practices, and I've done this with Codex or other agents. Say, I have a problem to solve. I know roughly the area of the code base involved, but maybe I don't know it that well. I'll say to Codex, for example, hey, I'm going to be wanting to muck around with this subsystem. Please do an analysis of how that system works today. Write me a document that includes file names and symbols and all these other things. Conceptually, to me, what I'm doing is I'm creating a map of the territory. It's essentially a context condensation, right? It doesn't need to read all those files all the time, but it needs to know roughly where everything is, so when it needs something, it can pull it. Okay, now I have that subsystem. I say, okay, I'm looking for a solution that looks something like this. Can you map out three different variations of that, have an argument about which ones are better, whatever, maybe make some characters debate it, map out the solution space. Now I have these two very rich documents, and I can say, okay, based on these, look at this, look at this, pick which is the best solution. Write me an implementation plan. Okay, pretty good. Break it down into a set of task lists. Go. I'm, in some ways, manually managing the process of this, but guiding it towards here's all the relevant parts of the code base, with me in the loop often to be like, no, you missed something over here. You've got to go look at that again, or something along those lines. [0:38:16] TS: The workflow that you're describing is extremely powerful. It's all based on files and you deciding, hey, this workflow is actually very useful for myself. You discovered it by yourself, maybe talking to other people, and there is this sharing element right now of recipes for how to work with agents. It's not necessarily that the product is prescriptive about it. It's delightfully open-ended actually right now, where you can ask Codex to do anything for you, and there is this creative aspect of, what do you actually ask it? How can it help you?
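One hypothetical way to script the map-then-plan workflow KBall just described - each stage writing a document the next stage consumes - might look like the sketch below. The run_agent helper, the document paths, and the billing example are all invented for illustration; wire it to however you invoke your agent (CLI, SDK, etc.).

```python
# Hypothetical staging of the map -> options -> plan workflow described
# above. run_agent, the doc paths, and the subsystem are placeholders.

def run_agent(prompt: str) -> None:
    raise NotImplementedError("wire this up to your coding agent")

stages = [
    # 1. Map the territory: a condensed index of the subsystem.
    "Analyze how the billing subsystem works today. Write docs/billing-map.md "
    "listing the relevant files, symbols, and how they relate.",
    # 2. Explore the solution space against that map.
    "Using docs/billing-map.md, map out three ways to add proration, argue "
    "their trade-offs, and write docs/billing-options.md.",
    # 3. Pick a winner and turn it into an executable checklist.
    "Based on both documents, pick the best option, write an implementation "
    "plan in docs/billing-plan.md, and break it down into a task list.",
]

for prompt in stages:
    run_agent(prompt)  # stay in the loop: review each artifact before continuing
```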
Maybe it's through this complex workflow of planning and then ideating on different options and then going and performing some implementation of it. We like that a lot, that flexibility. We try to also be very mindful when we introduce more opinionated frameworks into the product that could also restrict that flexibility. That's definitely not something that we want. [0:39:16] EB: Another interesting angle as well is just seeing how some of the maybe more non-technical people across OpenAI use that, or people in disciplines other than traditional software engineering. There's one person that I know who basically uses it for everything. He uses it to write documents. On the design team, I know a lot of the designers and the product managers might do some coding, but they'll use it for lots of other things. Ideation, as you say, planning. Some very cool things as well on the data science, go-to-market side: a lot of just data analysis, crunching through numbers, looking through a CSV. Yeah. I think from a product perspective, we've been deliberately opinionated about keeping things simple, just like ChatGPT, right? It's this general-purpose interface that you can go to and you can ask it to do anything. With ChatGPT, that might be generating images, answering a question, searching the Internet. I think for us, the amazing thing about a coding agent is it's extremely general purpose. You really want to keep it as simple as possible and then let the user, let that creativity go wild. [0:40:13] KB: On that note, and looking forward a little bit, one question I'd love to ask is, are there any plans for enabling an embedded Codex SDK, or hooks, or some other way to integrate? Because, for example, I mentioned I have this workflow, and I'm figuring out like, oh, I want to steer it in this way to fit my workflow. It would be great if, for example, every time there was a context pull, or something like that, it reinforced this, or other different ways to nudge and tightly control the context that's going on, to fit esoteric workflows that you don't want to have in the core agent. [0:40:48] TS: Yeah. What you're getting at is really what we're seeing with the power users; even within OpenAI, some of our most prolific users maintain their own fork of Codex. That's one of the awesome parts of it just being code that's open source as well. If you want to change it, you can just fork the code. If that happens to seem advanced, it shouldn't be too daunting either; Codex can help you change it in a way that's productive for you as well. It is written in Rust. Sometimes we get some comments on that, but we want it to be very robust and performant as well. It's quite delightful when you just have - you type codex and it opens instantly. That's what we get from putting a lot of effort into that. Hooks are something that we're debating. We'll get there eventually. What we're super excited about right now is building the right set of primitives for the agent to be able to perform increasingly complex work. You can think, what will happen if you're able to run an agent for an entire day, or maybe an entire week, and steer it as it goes? Is that a different thing? Does that require different product thinking? Then we touched on multi-agent as well.
This is something that we think is extremely exciting and is going to emerge in 2026 for sure as something that is not at a prototype stage, like you're seeing across the industry right now, where maybe folks are excited about their little subagent, but it's going to be these really robust networks of agents collaborating together in order to achieve something for you. That's the stuff that we're really excited about right now. Hooks, maybe at some point. [0:42:29] EB: Yeah, nothing massive to add. Just to say, we have a Codex SDK. It's available now, we have a documentation page up on it, and you can start to play with it. Yeah, to Thibault's point, I think there's this tension between catering for very specific workflows and then thinking about what these primitives are, these building blocks, so that you can build on top of that and build some really incredible experiences. [0:42:49] KB: What would you say the missing primitives right now are? [0:42:52] TS: We have a long list of GitHub issues. Some of them are top voted. They're actually the ones that we tend to prioritize. One of the things really that's been requested is subagents. We're actively working on how to think about multi-agent networks. Then a lot of it is still product overhang, I think, where it's not about the agent itself, but how can we make the product more delightful and more interesting and better suited for managing, steering, and supervising agents at scale? That's what's keeping us very busy right now. [0:43:29] EB: Yeah. Again, without going too deep into the roadmap, I think an interesting provocation to think about is, as using agents in a multi-agent world becomes incredibly complex, how do you stay on top of that? How do you keep track of what different agents are doing, what actions they're taking, whether you need to give any permissions and the like along the way, whether there are any artifacts that they've created, whether that's code or elsewhere - keeping track of that and staying on top of it. For me, as a designer on the team, that's a really interesting interaction design problem, right? Say, we're moving from this world where you watch a rollout of a minute to, as I say, this 10-hour job. How do you stay on top of that? How do you keep it delightful? How do you meet users where they are, so you're not context switching all the time between all of these different things? Yeah, as well as the cool primitives from an engineering perspective, those are the problems that we're thinking about on the product side. [0:44:23] KB: One other thing you talked about there was all of the non-technical use cases. I think one of the most amazing things I've seen as coding agents and LLMs, just as coding assistants, have grown is the extent to which subject matter experts are now able to build at least their own prototypes and often their own applications to help them in their workflows. Are there any aspects, from either a product or technology standpoint, that you're thinking about particularly for those non-technical users looking forward? [0:44:53] TS: Yeah, we're thinking about it, especially since it's been a natural thing that's been happening, where we see increasing amounts of non-technical people inside OpenAI and outside of OpenAI use Codex in their terminal and get it to do cool things for them. It definitely got us thinking about how we can do this better.
Also, there is this pull for generality, as ultimately, the very best coding agent is a general agent that's able to reason across much more than just code. The Codex agent and models are extremely good at instruction following. People find it very useful for data analysis, for editing spreadsheets, for doing market research and these things. It's definitely something that we want to lean into and cater to at some point. At the same time, right now, we're also laser-focused on making Codex the very best tool for professional software engineering. There's this tension of, hey, if we want to be really good at this, should we also think a lot about these other things? Ultimately, we see it combining very well. It also got us thinking - it is very satisfying to see Codex used for more and more things, in addition to just being an extremely good tool for coding. [0:46:12] EB: Yeah. From a product perspective, right, there are some things - we're building an agent, for example, a coding agent, on the web, right? You have to set up an environment. There are some things that you just can't get around, which are pretty technical. If you're a software developer, you need to go to these places. I think from a core product experience perspective, there's also more that we can do. This is what I'm focused on, which is just, what's that first experience? Is it delightful? Is it simple? Can you, as a non-programmer, rock up and just get involved? How can that be an on-ramp for you to learn more about coding and to get deeper into it yourself? This is something on the design team, for example - we had this offsite, and we had a few other people on the team who code going around and basically onboarding everyone into Codex, into the CLI product, into the extension, depending on where they worked. To be honest, for some of the non-coders, it's a little bit intimidating to get into the terminal. They were installing npm and these things that might be a little new for them. Once they got onboarded, and I think once they saw the work that the model was doing and started to learn a bit about it, some of the people who had just dipped their toes in, now I see them coding more and more. I think it's also a really cool opportunity to expand the aperture of what is a software developer and create a really great on-ramp for people to learn more and go and dig deeper themselves. [0:47:26] KB: Awesome. Well, we are getting close to the end of our time. Is there anything we have not talked about yet today that you think would be important to leave folks with? [0:47:35] TS: One thing we haven't talked about is the mindset that's important to continue to adopt. I feel it's an amazing time to have problems, as solving them has never been easier. Then there's also the aspect of, hey, this really helps with answering questions. There's this curiosity that gets super rewarded right now. Definitely be able to try and get interested in changing your approach to how you're going about your day and thinking about solving the problems that you have. Maybe you have ways of doing that that were effective two years ago, and you've stuck to them. I definitely feel it's the right time to question everything and try new things. Personally, I find it super exciting. Having always had many, many ideas and unsolved problems, finding that the amount of problems that are unsolved reduces with time is just - I hope someday agents will be able to also creatively come up with super interesting problems that I should be thinking about.
Because we're not there yet. What a time to just try these things and get a ton of new things done. [0:48:44] EB: Yeah. I think, also, what a time to be a creative, as a designer on the team. When I'm speaking to young designers, or occasionally teach or mentor young folks, the main thing I say is just, get involved and give things a try, because there's never been a time where curiosity has been better rewarded by really just getting your hands dirty, pushing yourself out of your comfort zone, and very quickly realizing that you're able to achieve way more than you might have thought before. Even on a week-by-week basis, if you look over the past few weeks with all of these model releases, it's just crazy the acceleration that's happening. Yeah, just stay curious and get involved. [0:49:18] TS: Yeah, it's not long ago, six months ago, where you would show static Figmas, or slides, and just be like, "Hey, this is an idea of mine." Now, it's fully functional little products. Then I'm like, whoa, this is better than what we have shipped in production. It's like, we better get this out soon. It's that step change in what you're able to achieve solo as a designer; I don't know if even referring to you as a designer does it justice anymore. There's this blurring of roles that's quite delightful. [0:49:51] EB: Yeah, there's never been a better time, I think, to be a software engineer, or a designer. [END] SED 1898 Transcript (c) 2026 Software Engineering Daily