EPISODE 1924 [INTRODUCTION] [0:00:00] ANNOUNCER: AI coding tools have gone from novelty to core infrastructure in under three years. Today, many devs use AI daily. A substantial share of new code is AI-generated, and expectations for automation are rapidly increasing. Sonar is a company specializing in analysis of code quality and security, and they recently released a new survey, the State of Code Developer Survey. The survey provides a deep examination of how developers are using AI in real production environments, and whether real-world gaps and risks still exist. Chris Grams is the VP of Corporate Marketing at Sonar, and Manish Kapur is the VP of Product Marketing and Developer Relations at Sonar. In this episode, they join Matt Merrill to discuss what the survey reveals about AI-assisted development, why 96% of developers still don't fully trust AI-generated code, how deterministic verification layers fit into agent-driven workflows, and what engineering leaders should prioritize as AI shifts from experimentation to production infrastructure. Matt Merrill is a software engineering leader with over 20 years of experience building and scaling software teams across enterprise and product-focused organizations. His background is in back-end development, cloud architecture, and distributed systems design. He currently architects and delivers software products and leads a team of engineers at DEPT Agency. You can learn more about his work at code.theothermattm.com. [INTERVIEW] [0:01:44] MM: I am here with a couple of folks from Sonar to talk about the State of Code Developer Survey Report. Before we get started, Manish and Chris, would you mind introducing yourselves and talking about your background and what you do at Sonar? Chris, you want to go first? [0:01:58] CG: Sure. Nice to meet you, Matt, and thanks for having us on the program. Chris Grams. I'm the VP of corporate marketing here at Sonar, which might immediately make people want to stop listening, except for the fact that I would say, I'm also our resident data and survey nerd, so I probably know the survey results coming out of this better than most. I've been in enterprise tech for a long time. I started out early in my career at Red Hat, where I spent about a decade, then went on to work with a bunch of other software companies in a consulting role, and then was one of the early employees at Tidelift, which is a company that sponsors open-source maintainers and was acquired by Sonar about a year ago. I've been with Sonar in this role just over a year. [0:02:42] MM: Nice. Well, nice to meet you. Manish? [0:02:44] CG: Yeah. Nice to meet you. [0:02:45] MK: Hi, Matt. Nice to meet you. Thanks for having us here. I'm Manish. I'm based in Austin. I've been with Sonar for about two and a half years, almost three years now. Like Chris, I have a long history with an enterprise software background. I started with Sun Microsystems, then I joined Oracle, and here I am at Sonar. I've worn different hats in my life. I am the VP of product marketing and developer relations at Sonar right now, but I've worn the hat of a product manager, a developer relations lead, and other technical roles in pre-sales as well. I'm fairly hands-on, technical. Looking forward to speaking with you today. [0:03:22] MM: Cool. How about Sonar itself? For anybody that's not familiar with Sonar, do you mind talking a little bit about what the company does and what products you offer? [0:03:31] CG: Yeah.
So, our main product is a product called SonarQube, which has been around for quite a long time and is used by over 7 million developers around the world. Basically, you can think of us as the essential verification layer for code, whether it's AI-generated or written by developers. We help ensure the quality and security of all of the code that's created by developers, by AI, and now increasingly by AI agents. To give you a sense for the scale of Sonar, we analyze 750 billion lines of code every day. We view our mission as helping organizations ensure the code they're shipping into production is high quality, secure, and well maintained. [0:04:18] MM: Awesome. That's a great overview. Thank you. I read over the survey report and it was really interesting. My background is in engineering leadership and, prior to that, mostly backend engineering and things like that, so my immediate question was, okay, another survey report. How does this one differ from the Stack Overflow one, which is pretty well known? As I dove into it, I was like, "Wow, this is really unique and I really enjoyed it." If you could, what do you think we can learn from this survey that you may not be able to learn from the Stack Overflow one? [0:04:51] CG: Well, first off, I would say, it's nice getting mentioned in the same sentence as the Stack Overflow survey, because we aspire to be part of the pantheon of big developer surveys, like Stack Overflow's and GitHub's State of the Octoverse report, I think they call it. To us, those are the ones that help ensure people have the information they need to make smart decisions as developers and development leaders. If we are successful with this, we'll become part of that pantheon of great developer surveys. I would say, where we wanted to be additive is we asked, what unique perspective do we at Sonar have to bring? I think it really comes down to the fact that we know code pretty well - like I said earlier, we analyze 750 billion lines of code a day. Earlier this year, we started this State of Code series, for which we did a lot of analysis. Hopefully, we'll get a chance to talk more about some of the work we're doing on understanding the coding personalities of the leading LLMs. We also analyzed code from the perspective of its maintainability and its security, and released a set of reports on that. What we wanted to do with this particular report is we said, okay, we've looked at code through the lens of how maintainable, how secure, how reliable it is. We looked at the code that LLMs are creating. Now we want to look at the perspective of the developers who are using these new coding tools every day. What do they think of the current state of the art? This is almost the human side of Sonar's version of the State of Code, going in and really understanding what people think of the world as it's changing extremely quickly right now and trying to get a sense for where we are. One last point I would make on that is that we fielded the survey in October, I think, of last year, and the world's changed quite a bit even since then, too. As we go through this, I think we'll want to spend a little bit of time talking about the stats that we feel are holding true. I think Manish and I can give you a perspective on the things that we think have changed as well since we fielded the survey last fall. [0:07:01] MM: That will be good to talk about. I love what you said about the human side of it.
That was what came through to me, is that you really tell a nice story with this data. I'm curious, can you talk a little bit about how you collected and analyzed it? Because the report itself is lovely. [0:07:20] CG: Again, I've been doing these for a long time. I actually started doing these sorts of developer surveys probably 20-plus years ago at Red Hat, where I did some of Red Hat's first-ever research. By the way, I appreciate what you said about it telling a story well, because that's exactly what we were going for. The thing I've always felt about good research is, I always come into these projects from the point of view of, don't give me data, give me knowledge. What is the takeaway, or set of takeaways? We also go into the design of the survey with a strong point of view, some hypotheses about things we think may be the case. We want to go test and see whether people agree with us or not, whether that's their perspective as well, or find some places where maybe we're wrong. There were a couple of examples and places where our hypotheses were wrong in this, too. We come in with a perspective on the stories we want to tell. Then sometimes we come out of it with a whole new set of stories. In this case, we had a combo of both. We had some data points that I think were validated by what we found and some that surprised us. [0:08:26] MM: I would love to hear about the ones that surprised you as we go along. If you think of it, yeah, just definitely drop that. All right, we've done enough meta talk. Let's actually get into the results. Let's start with a bang. I think this question is for each of you. Chris, you can go first. What's your biggest takeaway right out of the gate? What was the thing that resonated with you all the most? [0:08:46] CG: Well, I think just at the top line, there are a couple of data points that, at the time when we first saw the data, were staggering. For example, 72% of developers who have tried AI are already using it every day. I wouldn't be surprised if now that number is even higher than 72. But last fall, when we first looked at the data, 72% seemed pretty amazing and probably a validation of what we expected to be true. In addition to that, we asked developers to tell us what percent of the code they're writing today is AI-generated. We asked them both about that today and about the future, what they would expect it to be in a couple of years. What we heard is that 42% of developer code is AI-generated, or assisted, today already. 42%, which is crazy. By 2027, developers expect that number to jump to about 65%. As you look at that, again, this is one of those things where, even last January, when I started at Sonar, I think most developers were still looking at the code coming out of these AI tools with a little bit of a skeptical eye, like, was it doing a really good job? The fact that 42% of code was AI-generated as of last fall already is pretty interesting. Here's the other data point I would add to that. When we asked developers, how much do you trust that the code AI generates is factually correct? 96% of developers said they do not fully trust the code that's coming from AI. You have this juxtaposition of 42% of code being written by AI, going up within the next couple of years to 65%. But developers don't really trust it.
That creates a verification gap, or a trust gap, that needs to be solved. That's probably the top takeaway, I would say, from the survey. [0:10:53] MM: Right. That rings true with me based on my experience and my job. I'm hearing more and more about agentic code reviews and almost, this isn't the right way to put it, but pitting agents against each other to verify and things like that. That is a very interesting finding. Manish, what was your biggest takeaway? [0:11:10] MK: Yeah. I agree with what Chris said. Mostly, I was surprised. The one thing I'm surprised about is how quickly AI coding has been adopted as a use case. If you look back, it was just about two and a half years ago when the first GPT versions came out. Here we are, talking about more than 72% of developers already using it every day. It's a daily affair for them right now. This number is from last quarter. I'm pretty sure it's gone up higher now. It'll go to 80%, 90%. Developers are using it every day. It's just how quickly this use case has taken off and how much it is being used across the industry. What I've also seen is that while AI is speeding up code generation, it's slowing down everything that happens right after the generation part, which is a lot of software engineering work that still needs to happen. Code generation is the first part of the equation. Next comes reviewing the code, verifying the code, debugging the code, integration testing, long-term maintenance of the code. Those things, we are seeing, have not caught up yet with the speed at which code is being produced. Agents will be an even bigger force multiplier. If you look at coding tools and then agents come around, just like what you mentioned, Matt, then there will be a series of agents, a legion of agents talking to each other. It's basically how quickly things will evolve. We still have to see how it pans out. I'm just waiting for what happens next. I think for me, writing code is no longer a challenge. It's about what happens after the code is written, and what happens in a world where agents are talking to each other and developers are offline, or not so much involved in the day-to-day coding process and the CI/CD process. [0:12:49] MM: Yeah. I am waiting with bated breath myself. Though, I have to say, I'm pretty excited. Every day, I get more and more excited about what I see. Speaking of that fast adoption, one thing that I'm encountering in my day-to-day life at work is that developers want to use these tools. First of all, they're getting pressured. Second of all, they're starting to become more interested in it. People are using personal accounts for work things to be able to do this, just because they want to move faster, or they want to try. You did have some findings there in the report that were pretty interesting. Do you want to share some information about what you found? [0:13:26] MK: Yeah. We were surprised by this. There's a significant amount of shadow AI happening. By shadow AI here, I mean, we found that almost 35% of developers are using their personal accounts to access AI tools, rather than using the corporate-sanctioned ones, right? I think why this is happening is because, well, you are a developer, Matt. I come from a technical background. I've been sort of a developer, not truly a developer, but I've written enough code to be deemed a developer. Developers by nature are builders and tinkerers, right?
We all want to tinker with the latest and greatest technology. We want to explore the most modern tools that are out there. We want to stay at the cutting edge of things. I think that is part of the reason developers want to go outside of the sanctioned tools as this world is changing so fast. With agents coming around, it goes even further, where developers are beginning to use agents for scanning the repos, for generating migration scripts, for refactoring modules in the old code, and all of those things are happening every single day. This also means the code, the prompts, the data, the context information they are sending to these third-party tools, the shadow IT tools, or non-sanctioned tools. You're basically taking a risk with the IP and data privacy of your organization when you use those tools. You do need governance, and the governance landscape is not very well defined, unfortunately, as of now. It is going to get very complicated with agents coming along and legions of agents working together. It's going to be challenging for some time, but I'm sure we'll solve it like we have solved other problems. One thing which remains common in terms of governance is that anything being generated, whether you use corporate-sanctioned tools or your own personal tools, the code that is being produced has to be verified. The next steps still have to take place. You have to apply even more due diligence in terms of verification and assurance that the code is reliable and production-ready. [0:15:22] MM: Just today, I had one of our privacy counsel folks giving us information on, if you're going to use these tools, make sure these are covered in contracts and things like that, which matters for a software agency. It definitely feels like the tail is wagging the dog, right? Everybody expects you to use it, but there definitely is not that forethought yet in terms of how it gets governed. That's awesome. [0:15:46] CG: I agree. Yup. [0:15:46] MM: Chris, is there anything you want to add about those findings? [0:15:49] CG: The only other thing I might add is just in terms of the number of tools that people are using internally. It seems like people are testing lots of different things, right? I think we saw that the average individual used four different tools. [0:16:04] MM: Oh, wow. [0:16:06] CG: People on either side of the curve went higher, or lower, than that. If you think about that too, what it says is that, as of when we fielded the survey last fall, we hadn't really settled out who the winner is on the AI tool side. Although, you could make an argument that maybe Claude is making a pretty strong play for it here in the last month or two. People are still testing a lot of different things. [0:16:31] MM: That rings true for me, too. Yeah. I mean, even up until November, I was like, I'm not sure which. But now that I've seen what Claude does, it's pretty incredible. Also, companies can't react that fast to these changes, and people are bound to just sign up for something and try it. It's no wonder it's happening.
[0:16:50] CG: If your company had contracted to use a tool that is now the third or fourth best tool in the market, and you have to go through the whole enterprise procurement cycle to get another tool, versus being able to do your job 10 times faster by doing something on the side, you can see why, even though from an enterprise risk avoidance standpoint it's maybe not ideal, you can see why it's happening. [0:17:14] MM: Yes, definitely. Let's pivot a little bit. One of my favorite things that I saw in the report - there's two favorite things, but one of them is this concept of the great toil shift. Can you talk a little bit about what that is and what you found there? [0:17:28] CG: Yeah, Matt. This is one of the things we were saying earlier surprised us. We expected to go in and have people tell us that AI had reduced a lot of the toil work they dealt with. Examples of toil work would be things like writing documentation, writing tests, stuff that just takes up an enormous amount of time. That is the bureaucracy of writing code. When we directly asked people the question, do you feel like AI has reduced the toil work that you do on a regular basis? The vast majority of people said yes. 75% of people said, yes, AI has reduced toil work. The answer was actually a lot more complicated, because we later asked them more questions based on whether they were using AI a lot in their job, whether they used it every day, or whether they used it less. The people who were using it less still had a lot of those traditional toil work tasks, like the ones I just mentioned. But the developers who were using AI every day had a whole new set of toil tasks that were very different. The kicker is, they both reported that they were spending about the same amount of time on toil work. It was just different types of toil work. The new toil, as opposed to the old, like writing documentation - AI writes documentation great. That just took that off the table. Then all of a sudden, people have this new task, which is what Manish was talking about earlier: now they have to verify the quality and security of all this code that's being generated at hyper speed. The verification process becomes the new toil work. Because AI is never going to be held accountable for the quality of the code it produces. The human is going to be held accountable for the quality of the code it produces. To me, this is one of the biggest challenges when it comes to toil work. If you know that the person is being held accountable, all of a sudden, going and checking all this code that's being written by the robots, you've got to do it, and it's not going to be the most pleasurable work necessarily. But you have to do it, because as far as accountability goes, it's just like you wrote that code yourself. [0:19:40] MM: It's so interesting, too. I've used this analogy with people that I've talked to. It might not be quite an apt one, but it works for me, which is, when spreadsheets came along, accountants didn't go away, right? It just changed the type of work they needed to do. I feel like it's exactly the same thing, but I love the naming of it, the toil shift. I'm probably going to use it. [0:20:01] MK: Yeah. 38% of developers say that reviewing code written by AI is harder than reviewing human-written code. That's another thing, right? Going back to the reviewing part of it. That was something that stood out to me as well. [0:20:14] MM: Yeah.
It feels harder to me to put yourself in that context and follow the thread through. Yeah, that makes sense. [0:20:21] CG: We've also found that sometimes there's a little bit of a needle-in-a-haystack challenge when it comes to AI-generated code, too. As these models are getting better and better at writing more performant code, and even more secure and higher-quality code, sometimes the issues in that code just get harder to find. As an actual human going into this, not having written it yourself, and not reviewing a peer's code written by somebody you know, there may be fewer issues, but the issues that are there can actually be more pernicious, because they're harder to find. [0:20:59] MM: Yes. Definitely makes sense. As a company where one of your main products is a static analysis tool, can you talk to some creative ways you see your customers using static analysis to combat some of these things? [0:21:13] MK: Yeah. For the last 17 years, folks have used us and considered us to be the de facto code quality standard. We actually do much more than code quality. Our analysis engine reviews your code not only for code quality, but also code security, code reliability, maintainability, and complexity. Now even architecture. We also look at the architecture of your code base to see how quickly it is drifting from a good architecture to a bad one. Yeah, our customers have been using us in quite a few different ways. In the new agentic world, we have added support for all the modern AI-native IDEs, whether it's Windsurf or Cursor, or Copilot, any of those IDEs that you're using, or any of the CLIs. CLIs, as you know, Matt, are becoming very popular now, whether it's Gemini CLI, Codex, or Claude Code, or any of these things. Basically, we have support for all of these. When I say support for all of these tools, we are available as the deterministic verification layer for them, an independent, deterministic check. With AI writing code and AI reviewing code, you will basically end up in situations where the AI will say, whatever I have written is the right code. They may not catch everything. They are trained on the same data set they use for writing the code and for reviewing the code, so they may not be able to. We have a deterministic way of reviewing the code. It is embedded right into the modern SDLC, where the AI is writing code, whether it's in the IDEs, the CLIs, or even in the pull request, where agents are sending us pull requests for review, we can do it. We are integrated tightly into the SDLC. We also have an MCP server, the SonarQube MCP server, which I'm seeing quite a few of our large customers using. It's basically a gateway into the code analysis of SonarQube for the agents, in the protocol that agents like to use. That interface is available for agentic IDEs, CLIs, and whatnot. Apart from analyzing the code, we are also looking at how we can improve our detection engine to detect AI-specific issues. We have a facility to add custom rules. I have seen some customers looking at adding custom rules within the product to detect AI-specific patterns. They can do that. We have added several rules to detect AI-specific issues, like the prompt injection attack and the rules file backdoor attack. These are AI-specific issues that AI introduces, which typical human developers will not introduce. [0:23:47] MM: What was the last one that you said there? [0:23:49] MK: Rules file backdoor attack. It's a vector which is referred to as the rules file backdoor attack, but basically, what it is is the configuration files, the .mdc files and the .md files, the rules files that are used by these coding agents and IDEs. You can have hidden Unicode characters in those. Those are hard for AI to detect. We have rules to detect those issues, which are introduced through your configuration files, such as a rules file, and these are called rules file backdoor attacks. Then the LLM prompt injection attacks, also. We have a rule for that, which can detect issues pertaining to LLM prompt injection. These are things we have added in addition to what we already do. These are specific issues tied to LLMs. [0:24:33] MM: That backdoor attack, that would be, you copy a skills file from Claude, somewhere else, and you unknowingly copied a bunch of Unicode characters that poisons that prompt, or something like that. Is that the - [0:24:44] MK: Exactly. Exactly. [0:24:45] MM: That's really interesting. [0:24:47] MK: Exactly. [0:24:47] MM: Okay, note to self. [0:24:48] MK: Exactly. A bad actor can introduce some hidden Unicode characters in the rules file.
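As a rough sketch of the idea behind such a check (this is an illustration, not Sonar's actual rule; the file pattern and the set of flagged Unicode categories are assumptions for the example), a scan for invisible characters in agent rules files might look like this:

```python
import pathlib
import unicodedata

# Characters in these Unicode categories are invisible in most editors, which
# is what makes them useful for a rules file backdoor: "Cf" (format) covers
# zero-width and bidirectional control characters, "Co" is private use.
SUSPICIOUS_CATEGORIES = {"Cf", "Co"}

def find_hidden_characters(path: pathlib.Path) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each invisible character."""
    findings = []
    text = path.read_text(encoding="utf-8")
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                findings.append((line_no, col, name))
    return findings

# Scan the rules/config files a coding agent would load, e.g. .md and .mdc.
for rules_file in pathlib.Path(".").rglob("*.md*"):
    for line_no, col, name in find_hidden_characters(rules_file):
        print(f"{rules_file}:{line_no}:{col}: hidden character {name}")
```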
[0:24:53] MM: That is really interesting. As I put this together, right, if I'm writing a skill, or I'm describing what my CI/CD process would be like, I can use your MCP server, or other integrations, to basically say, as part of my checks, I want to run static analysis and report out on the results. If the results look like XYZ, fail the build, or whatever it might be. Is that correct? [0:25:17] MK: That's absolutely correct. We have quality gates, and you can define the policy conditions for when your build should fail or pass. [0:25:23] MM: Nice. [0:25:24] MK: There will be applications that you build which are internal-facing, which are not external, not sensitive, not meant for production. It's for a small team. You may not want to have a very strict enforcement policy for that one. Whereas, if you're writing a banking application, or a medical application, where the cost of that application going down, or being breached, is very, very high, you would have a higher set of policy conditions applied to that. The gate will be set at a higher level. You fail the gate even if there's a single security bug in it. That's the condition. Whereas, if you have a low-priority bug and you want to pass it for an internal application, which is not mission-critical, you can do that. You can define your gates and policy conditions based on your needs. That's a use case that our customers very often use as well. [0:26:08] MM: Very cool. Thank you.
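To make that workflow concrete, here's a minimal sketch of a CI step that fails the build based on the quality gate verdict, using SonarQube's project status web API. The host, token handling, and project key are placeholders you'd adapt to your own setup:

```python
import json
import os
import sys
import urllib.request

# Placeholder settings; point these at your own SonarQube server and project.
SONAR_HOST = os.environ.get("SONAR_HOST", "https://sonarqube.example.com")
SONAR_TOKEN = os.environ["SONAR_TOKEN"]
PROJECT_KEY = os.environ.get("SONAR_PROJECT_KEY", "my-service")

# Ask the server for the project's quality gate verdict after an analysis run.
url = f"{SONAR_HOST}/api/qualitygates/project_status?projectKey={PROJECT_KEY}"
request = urllib.request.Request(url)
# Recent SonarQube versions accept a bearer token; older ones use HTTP basic
# auth with the token as the username, so adjust this to match your server.
request.add_header("Authorization", f"Bearer {SONAR_TOKEN}")

with urllib.request.urlopen(request) as response:
    status = json.load(response)["projectStatus"]["status"]

print(f"Quality gate for {PROJECT_KEY}: {status}")
# A non-zero exit code fails the CI step, and therefore the build, whenever
# the gate's policy conditions (e.g. zero new security issues) are not met.
sys.exit(0 if status == "OK" else 1)
```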
[0:26:10] CG: Yeah. One thing about that I find particularly interesting: a couple of months ago, we were talking to a very prominent technology analyst about the process for code review in the AI world. One of the things they observed is, they said, look, our recommendation as a leading analyst firm is that people use the same review process for AI-generated code that they did for human-written code. If you had a good process in place, with all the things like quality gates that Manish was talking about, if you had a good process in place for your human-generated code, you can actually use the same process effectively for AI-generated code, because it's still, ultimately, just code. That really resonated with me. You had mentioned at the beginning AI code review tools. That's one of the things that I'm scratching my head a little bit about. It seems sometimes like people all of a sudden feel that just because code is written by AI, there needs to be a new AI review process for reviewing that code. Maybe in some cases that's more helpful, but some of the standard, tried-and-true static analysis processes for reviewing code actually work with AI-generated code as well, and can give you repeatable results over and over. [0:27:31] MK: Yeah. I think one of the key things here is the false positives. I've used some AI code review tools, and obviously I use our own product all the time. The false positives in certain situations can be a lot higher with AI code reviews. You need to have the best of both worlds, is my thinking. Whatever the ideal use case is for each type of technology, we should use it for that. Yes, absolutely, what Chris said: our customers actually want to use something deterministic, which is always consistent in its results and has low false positives, as a verification layer. Some use cases might be exactly to the point that you and Chris mentioned earlier. Writing documentation, writing PR review descriptions, those are better suited for LLMs for sure. LLMs can do a much better job at that, and we should exploit the strengths of LLMs for those kinds of use cases. [0:28:24] MM: Yeah. One of the things that you guys mentioned and link off to in this State of Code Developer Survey is a report that you have on LLM coding personalities. I thought that that report was fascinating. I'm hoping that maybe you can talk a little bit about that and what you guys are finding with that. [0:28:42] MK: A year back, we started working on evaluating LLMs for code quality, code security, and code complexity. The history here is, we have industry-standard benchmarks that almost every LLM vendor comes out with, whether it's HumanEval or any of the standard benchmarks testing the coding performance of LLMs. While those benchmarks are very good, relevant, and a starting point, they're half the equation, we think. Why do I say that? Benchmarks basically check LLM output for correctness. They check whether the model is able to write a particular algorithm, or solve a particular problem, and whether it's correctly solved or not. What they don't check for is how each LLM produces the code for solving that algorithm, or solving the benchmark problems. We have a proprietary evaluation criteria, where we evaluate 4,400 coding problems. We give them to all the LLMs and we evaluate how they perform on those coding problems. These are problems which the LLMs are not aware of, because they're not part of the standard benchmarks. These are unknown problems, new issues for them. When they produce that code, we score them just like any other benchmark in terms of whether they have a good enough pass rate or not, whether they produce functionally correct code or not. Then we take it a step further. We look at how many bugs it produced per thousand lines of code, 10,000 lines of code, a million lines of code. How many security issues did it produce? How much complexity did it produce? What does the cognitive complexity look like? What does the cyclomatic complexity look like of the code they generate?
Benchmarks are a good starting point, but we took it to the next level, because we are all about code health and reviewing code. That's what our whole business is built on, right? We took the analysis further to figure out how each model behaves. Based on our findings, we assigned personalities to the LLMs, the first six or seven LLMs that we evaluated. We found that they had unique personality traits. We call them archetypes of personalities. Some of them were writing really good code; the functional performance was superb, but the cognitive complexity and cyclomatic complexity were way too high. For others, the cyclomatic complexity and cognitive complexity were much lower, and they were writing fewer lines of code, but they were also correct. We assigned personalities to these LLMs. It's a matter of how each LLM produces code: how abstract it is, how well generalized the code it writes is, how it does error handling, how much duplicate logic it has, what code smells it has, what kinds of security issues it has. We did a deep, thorough analysis of each and every type of problem they create, right? We started with the personalities report, but we have now evolved into an LLM leaderboard that you can find at sonar.com/leaderboard, where we have about 35 models as of today that we have evaluated for security, reliability, maintainability, cyclomatic complexity, cognitive complexity, issue density. All of those things are noted there. Not only that, we also tell you, if it's producing 10 security bugs, what types of security bugs it produces. Whether it's path traversal issues, or leaked secrets issues. What kinds of issues are there? We have a detailed report. Perhaps we don't have enough time to go through it, but I encourage you to take a look at sonar.com/leaderboard. You can dig into all 35 LLMs and evaluate them in terms of the traits they have. [0:32:09] MM: You mentioned the personalities. Can you talk a little bit about the personalities? What's a quirky example, or an interesting example, of one of the tools and how it was categorized? [0:32:19] CG: When we first did this last summer, we were really focused on that idea of the personalities themselves for each of the models. What increasingly became clear is that the models were evolving so quickly, and new models were coming out. Actually, one of the reasons we switched over to the leaderboard and away from the personalities is because the personalities were changing so fast that it was hard to just assign something to Claude Code. I forget even what we said Claude Code's personality was last summer, but its personality now is best coding expert in the world. I think the biggest thing about each of the personalities, like Manish was saying, is that we found as the models got more performant, they also got more verbose, writing more and more lines of code, which added to the cognitive complexity as well. We found that to be the case all the way up until about probably November of last year. When all of a sudden, some of the models, you'll see this as you're looking through the results on the leaderboard, some of the models are actually both performant and not getting more complex. They're decreasing the number of issues.
Where it was linear for a while there last year, where the smarter the models got in terms of performance, the more lines of code they would write and the more complex the code got, it's getting a little more nuanced than that now. Some of these models are getting really good. Really good. I would say, with the personalities, at least at the top, you can see where the cream is rising, and the top personalities are all getting to the point where they're expert coders. I don't know, Manish. What would you add to that? [0:34:04] MK: Yeah, I think you're right. We had the personalities defined for the first six models that we evaluated back in the day, when we evaluated GPT-5, Claude Sonnet 4, Claude Sonnet 3.7; we assigned personalities to those. For example, we picked relatively small language models, like OpenCoder with 8 billion parameters, and we picked a large model, like GPT-5. We looked at both ends. Based on that, we created some personalities. OpenCoder, a relatively small language model, we called a rapid prototyper. Why did we call it a rapid prototyper? Because it was solving issues relatively fine, using fewer lines of code, but it was not thorough in terms of the number of issues it was introducing. It was like a prototype. A prototype does have bugs, but it can be a quick and efficient way to test a proof of concept. So, we called it a rapid prototyper. Whereas the Claude Sonnet models were like a senior architect. They took a lot of things into play, like scalability of the application, how many users there will be, performance, and all of that criteria, in writing the code. So, we called that one a senior architect. We had given names to the first six models, but now that we have over 35 models, it's hard to give a name to all of them. We're moving away from giving them personality names as new models keep arriving. [0:35:22] CG: It was a lot of fun, though. It was endearing to anthropomorphize these LLMs. [0:35:27] MM: That is a lot of fun, and it's a nice way to remember that. Now we're getting into cattle, not pets, right? We're getting into that territory, yeah. For anybody listening to this, I'm looking at this leaderboard, and Opus 4.5 thinking is slightly ahead of Opus 4.6 thinking, which is really interesting. But then, when you look at it from a security point of view, which is security vulnerability issues per million lines of code, GPT 5.2 high is at the top. Reliability, which is bug severity and issues per million lines of code, Gemini 3 Pro high. That's fascinating to me, all those different aspects. I think one thing that stands out to me, too, is that this was done specifically with Java. That's what it looks like. [0:36:17] MK: Yes. [0:36:17] MM: There is that aspect to it as well: these models could be trained differently on different languages. [0:36:23] MK: Correct. [0:36:24] MM: But still, very fascinating. Anything else you guys want to add on the leaderboard, or the personalities, before we move off that topic? [0:36:31] MK: I think if any of your listeners want to get a model tested, there's a form in there. We can test it for them. [0:36:36] MM: Oh, cool. [0:36:37] MK: We keep getting requests. We try to keep up as much as we can, but with a couple of new models coming out every couple of weeks, it's hard. If there's something missing, we can look at it if there's a huge amount of interest in it.
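The per-million-lines normalization the leaderboard uses matters because models differ widely in verbosity. A small sketch with hypothetical numbers (not actual leaderboard data) shows why a model with fewer raw issues can still score worse once you normalize by how much code it wrote:

```python
# Leaderboard-style metric: issue density normalized per million lines of code
# (MLOC), so verbose and terse models can be compared on equal footing.
def issues_per_mloc(issue_count: int, lines_of_code: int) -> float:
    return issue_count / lines_of_code * 1_000_000

# Hypothetical results for two models solving the same problem suite.
models = {
    "model-a": {"security_issues": 18, "loc": 410_000},
    "model-b": {"security_issues": 11, "loc": 150_000},
}

for name, stats in models.items():
    density = issues_per_mloc(stats["security_issues"], stats["loc"])
    print(f"{name}: {density:.1f} security issues per MLOC")

# model-a: ~43.9 per MLOC; model-b: ~73.3 per MLOC. The model with fewer raw
# issues ranks worse here, because it packed them into far less code.
```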
[0:36:49] MM: How long does one of those benchmarks take to run, just out of morbid curiosity? It seems like it would take a while, but maybe not? [0:36:57] MK: No. Initially, when we started doing it, it did take a while, but now we have made it a framework which can evaluate models rapidly. [0:37:05] MM: Cool. [0:37:05] MK: The bottleneck for us in the last few tests was when Opus 4.6 came out, there were a whole lot of API requests being made to the model. It was flooded with requests, so it was not fast enough to respond, or sometimes our tests would time out, so we had to restart the test. [0:37:22] MM: You DDoS'd Anthropic, basically. No, I'm just joking. [0:37:27] MK: Yeah, not just us. The whole world was DDoS'ing it. [0:37:31] MM: That's awesome. [0:37:32] CG: The big takeaway, I would say, from our work on this over the last six months is, as you're evaluating which LLMs to use, don't just look at performance alone. Look at it through a more holistic lens: how verbose is the code that's being written, how many security issues are being created, not just how well it performs at completing a coding test. There's more nuance to it than that. Some of the models perform very well until you take into account all of the other things, like the cognitive complexity and the verbosity of the code. Even the cost of running the models is obviously a very big thing too, right? Not every model - [0:38:11] MM: Very real thing. [0:38:12] CG: - is created equally, just from the cost of the tokens alone. You have to take all those things into account, not just the sheer performance aspect, which I think had been where a lot of the focus was previously in evaluating models. [0:38:26] MK: And the reasoning, in addition to cost. You can go to different reasoning modes for each model. These have two to four reasoning modes. The higher the reasoning, the higher the cost and the longer it takes to solve the problem, but it's more thorough. [0:38:39] MM: The skeptic in me is also thinking about how steeply these prices are going to increase, right? Are they just trying to get people on the hook to be the leader, and things like that? That is definitely on my mind as I think about my own organization as well. All right, let's pivot over to years of experience and how that affected survey results. I've been doing this a long time, 20-plus years. I thought that that was really, really interesting. Can you talk about how the perception of these tools comes into play with years of experience? [0:39:12] CG: Yeah, I can start, and then Manish, maybe you can go into a little bit more detail. But this was another one of the results that was fairly surprising for us. What we learned was that there's a big gap in terms of how people at different levels of experience use AI. Junior developers told us that AI makes them 40% more productive, but then 66% of them also admit that the code it writes looks correct, but is actually broken. They're just more ready to roll in and start writing the code, but then they're also scratching their heads, not sure whether they can actually trust the results of what came out. Now, senior developers were a little bit more measured. I would say, they're using it in different ways.
Again, this is data from last fall, but 65% of them said they were using it mostly to understand old, complex code, or write documentation, doing things where they're cleaning up the past and using it in ways to maybe check the accuracy of things, versus the junior developers, who are ready to roll and let the AI do all the coding. Now, again, I would caveat this by saying that I think a lot has changed in terms of the quality of the code that's created. If you're following along with the conversation about this on Twitter, there's a general consensus, it seems, that around mid-December, maybe when everybody was taking a holiday break and got a chance to play around with some of these new models, they realized that these things had gotten really good. Even some of the more senior developers were starting to do more sophisticated things than they were doing before. But I would say, that's the big gap: junior developers are ready to jump in and try the tools, and senior developers maybe are trying them in more measured ways, because they're also maybe a little bit more jaded. They know the risks of low-quality code getting into production and how it's going to hurt them later. I would say, that's probably the key difference, and maybe a little different than what we expected. But Manish, anything else you'd like to add there? [0:41:18] MK: No. I think you covered it well, so I don't have much to add. Yeah, senior developers use AI tools as reasoning assistants, and they understand the code that's being produced and whether it all makes sense. There's more of a trust factor when it comes to junior developers. They jump right into it and develop with it, and that's just the nature of experience. An experienced developer wants to question everything, or look at all angles, whereas the newer, younger generation of developers are more trusting of the new technology. [0:41:48] CG: It's got to be really scary right now to be a junior developer, too, and see the thing that you spent maybe years learning how to do, and a machine can go and do that part. It's the higher-order engineering tasks, around designing the architecture of an application, or things like that, that maybe they don't have as much experience with, that now are the things separating us from the machines. For senior developers, I think it's a really interesting time, especially if they can figure out how to turn into orchestrators of AI. Manish was talking about the idea of legions of agents and how you can orchestrate them. That's another conversation on X/Twitter these days: people playing around with how to build their own army of agents out there doing individual tasks, and who's doing the best job at coordinating all the agents working together. I think, increasingly, junior developers are going to have to move beyond their coding skills and figure out some of those upstream skills in order to stay relevant. [0:42:49] MM: One of the things the report pointed out, I have a note here that said, junior developers report higher job satisfaction related to the use of AI coding tools. That makes intuitive sense based on what you guys are saying. However, what you said about using the holiday break - I'm not on Twitter, or X, but Anthropic gave you those free credits to do exactly that. It's exactly what I did, and I walked away from it with a completely changed opinion about how effective it was, and I've been doing this a long time.
I found it absolutely fascinating that it sounds like I was not the only one who had that experience. [0:43:26] CG: You're not alone. You're not alone. We're seeing this. I mean, every day, it feels like I see some blog post, or something, where a senior, or well-established engineer with a great reputation says, "Hey, if you're not doing this and you're not starting to build your own team of agents to do the work for you." Like, I've seen several people say, "I don't write a single line of code by hand anymore." Most of my work is the architecture-level work, or training individual agents to do particular tasks, giving agents each their own tasks so that they can work with each other and review each other's work. It's really fascinating how quickly that's moved, literally. What is it, February 17th here as we record this today? A lot of this has just happened in the last month and a half. [0:44:14] MM: Yeah, it's truly astounding. You mentioned that passion, or enthusiasm, excuse me, that junior developers have for these tools. I honestly walked away from it being much more enthusiastic. I'm really curious. We talked about the differences from October to now. I'm really curious if that's going to change with senior developers. Have you guys seen anything like that since October? [0:44:38] CG: Yeah. We're actually thinking about doing another pulse version of this survey. Just because things have changed so quickly, we feel like we need to get out there in the field and do a little bit more research, so we can compare and have a baseline to see what's changed. Stay tuned. If we're able to make that happen, we'll be happy to share the additional data on it, too. Again, we walked into it with hypotheses last fall. I have some new hypotheses that I want to test as well. A lot of it is around that conversation on Twitter/X, and good for you for staying off of that and out of that world. Some part of me wonders whether the people having that conversation are way, way out on the cutting edge, or how much of the mainstream of enterprise engineering groups are already harnessing agents. We did have some data on that as of last fall. We saw that a pretty significant number of organizations had already been playing around with agents. Again, those numbers have probably changed just in the last month and a half. But I think we need to test it and see for sure. [0:45:44] MM: From my experience working with larger clients, it's ticked up in the past six months. It's ticked up quite a bit. Yeah, it's really interesting. Another thing that I noticed that was really interesting is the use of AI in greenfield versus brownfield projects. Greenfield as in new code, brownfield as in legacy code. What did the report say in terms of clues as to where it's used most effectively and where people thought it was used most effectively? [0:46:10] MK: Yeah. AI is best for projects which are starting from scratch. We found 90% of developers use it for new projects. It's much less effective when you have to work with an existing code base, particularly if the programming languages used in that project are not commonly used these days. If you look at LLMs, they are generally very good with the flagship languages, like Python, JavaScript, TypeScript, Java, and so on.
But if you go back to some of the legacy applications and old code bases, which have some legacy code in there, and I won't name languages or frameworks, some of those legacy projects they're not so good at. Only 43% of developers find AI effective in updating and optimizing code that is already in use, particularly for older frameworks and older languages. That's one of the observations. I think it also comes down to correctness. AI excels in greenfield, because the surface area is small when you start, and correctness is generally high with the newer state-of-the-art LLMs. With brownfield applications, they sometimes struggle with legacy API assumptions. There are non-obvious couplings in those applications, or implicit rules that are not very well documented, or hard to figure out just by looking at the code base. That's less of a use case. Currently, LLMs are being utilized less for brownfield applications. [0:47:30] MM: That makes intuitive sense to me as well. Have you heard anything since - this is something I'm curious about, but haven't tested - about the use of an agents MD file, or something analogous to it, that can help give hints for legacy applications? Have you heard about anything like that since October? [0:47:49] MK: Yeah. I have not heard of it specifically for legacy applications, but that does make sense. It's logical. With Claude Code having the concept of rules, skills, and hooks, hooks can enforce certain guarantees and rules can give you some constraints. With those things in place, I think there's definitely a chance that we might see usage of AI in older, brownfield applications as well. [0:48:16] MM: I'll be really curious to see. One of the clients we're working with has a 20-year-old system that's mostly written in C++, and the tech lead on it is struggling with it. I was like, you should run an experiment. See if you can embed some of the tribal knowledge that is in that app in some of these agents files and see. For the listeners, we'll report back on that, as well as what you find in your pulse check. [0:48:39] MK: Yeah. I think context is very important, as you said. The context in the files and information that you give to an agent should help with brownfield applications. [0:48:50] MM: Just real quick, I am curious. You had respondents for this all over the world. Did you see any interesting geographic patterns at all? [0:48:58] CG: Nothing that we wanted to report on. I mean, I think everybody is living in the same world, basically. We looked to see whether there was anything we could tease out that was statistically significant, and there wasn't enough to make it into our top findings of the report. I think we're all going through this together, regardless of where we live in the world. [0:49:19] MM: That makes sense. I was just morbidly curious. All right, let's start wrapping it up. I think I learned a ton from the survey. For me, it's been a while since I've been a boots-on-the-ground developer. I've been in leadership for a while. I still code, but I'm in leadership. As I put myself in the shoes of a developer who's having executives, or managers, push these tools, sometimes those asks are grounded and sometimes they're not. For anybody listening that might be getting what they think is an unrealistic goal for using AI, what's your advice based on what you saw?
[0:49:54] CG: I think the big thing would be to tease out whether your organization is focused on the speed of writing code using AI, or the speed of shipping code using AI. The latter, shipping code using AI, is harder. That's why you see many organizations not really getting the promised benefits of AI: the promised benefit they're focused on is that you can write code a lot faster. But as Manish was saying earlier, the name of the game is being able to verify the quality and security of that code before you ship it, or you're not really gaining as much speed as you think you are. I think it's an education process. For developers working at organizations that already get that, like I was saying earlier, if you had a really robust code review process for the age of developer-written code, you're actually probably set up pretty well to succeed in the AI-generated code world, where you have automated code review in place, and you have tools like SonarQube, or whatever you were using before, to check the quality and security of code. Or if you don't, then ensure that you do your part to educate the leaders on your team that just because AI can write code really fast doesn't mean it's good code that you want in production. It's potentially going to increase your organization's risk, or create spaghetti code that's hard to maintain, or some of the other challenges we talked about. I don't know. Manish, anything you would add? [0:51:39] MK: No. I think you said it right. Don't lose sight of quality. Ship it faster, but don't lose sight of code quality, software quality, and application quality. Don't compromise on that. [0:51:48] MM: Be careful what you ask for is coming to mind. Also, what's coming to mind is Lucille Ball on the chocolate line eating the chocolates. It's like, if they start coming out faster, you've got to, yeah. [0:51:58] CG: Exactly. Exactly. [0:52:00] MM: If I'm a developer, what do you think the most important takeaway from this survey is? [0:52:05] MK: Yeah. For developers, the most valuable skill is no longer knowing how to write code. Writing code is pretty much a solved problem. It's more about understanding the code, making sure the code is correctly written by the agents, or the tools that you're using, making sure it's being reviewed, and putting some guardrails in place. Ultimately, developers are still accountable, irrespective of who writes the code. Like I said, you want to do the right things. You want to put some guardrails in place. You want to validate the code. You want to make sure the code is written well. It's not so much about learning a new programming language anymore. [0:52:43] MM: Anything you want to add, Chris? [0:52:44] CG: Maybe just repeating what I said earlier. If you're a developer who is trying to figure out how to stay ahead and keep your skills relevant, probably the best thing you can do right now is to gain the ability to manage, orchestrate, and train agents. If you can do that really well, and you can harness the skills of these leading-edge LLM tools, and you can stay on top of the knowledge, be curious about learning, because these things are evolving so fast. What you learned during your Christmas break may not even be the current state of the art anymore.
Staying on top of that means spending at least a few hours a week just learning and testing and trying new things. I read something today where somebody said, this may be the most important year of your career, because things are moving so quickly right now that if you're not paying attention and you're not staying curious, you could quickly fall behind in your career and have a hard time catching up. One side of me was like, oh, am I doing that? I was like, I think I am. But it was a little bit of a call to action for me, too, of making sure that in that hour at the end of the day, when you're winding down and everything, you try a couple of new things every day and just see, did you get better results on it today than you did yesterday? Increasingly, I'm finding you are. [0:54:05] MM: Yeah, I'm hearing two things which ring true with me. Manish was saying, you can't lose sight of the basics. You've got to ensure quality. But at the same time, you've got to dance on the bleeding edge in order to stay relevant, quite frankly. Yeah. Then last question. If I'm a development leader, or an engineering leader, what do you think the most important takeaway is? [0:54:27] CG: You can't ignore the trust problem in code. AI is getting better, but still, that stat I mentioned at the beginning is the big takeaway for me. 96% of developers still don't fully trust the quality of AI-generated code. Not necessarily because it isn't good, because in many cases, the code is really good and continuing to get better. It's because they know AI isn't going to take the fall when the code fails, or there's a security issue. If you're a leader, you have to ensure you still have human accountability for the code that gets shipped into production. You have to build the right systems and processes to ensure that you're able to verify the quality and security of the code before it ships. Don't just close your eyes and ship AI-generated code. Unless you're in some bleeding-edge, vibe-coded startup where you can risk doing that. If you have actual customer data and customer information and things like that, it probably makes sense that you go and really figure out how to trust the code that you're shipping. That, to me, is the 2026 problem of AI-generated code. In 2025, the problem was, how do we generate even more code? That's, like Manish said, a solved problem. Now we can generate pretty good, high-quality code, and it's continuing to get better. But ensuring that there's a human who's willing to put their stamp on it and say, "I'm willing to ship this into production and take all the risks that entails," that's the biggest challenge for 2026. [0:55:59] MM: Yeah. I think, also, if I'm a leader, the thing I'm thinking about is, okay, you're telling me that 96% of people don't fully trust the code. I need to put that human stamp on it. That's a good thought-provoking thing to end on, I think. Is there anything else that you guys want to cover that we didn't tonight? [0:56:14] CG: Matt, I appreciate the time. Thanks for inviting us on. Manish and I and the whole team, we're passionate about doing this research, and the LLM leaderboard project that we talked about earlier is a continuing passion. I think we go look at it every day and see the new models that go up and the results, and just watch these things getting better. It's a wild time. I've been through some different turns in my career, as I'm sure you both have, but I'm not sure I've seen anything quite like this.
It's a fun time to be alive. A little scary, but it's a lot of fun, so I appreciate you having us on. [0:56:50] MM: Yeah. Thank you so much for being here. [END]