EPISODE 1596 [INTRODUCTION] [0:00:00] ANNOUNCER: Jodie Burchell is the developer advocate in data science at JetBrains, which makes integrated development environments, or IDEs, for many major languages. After observing the rapid growth of the AI coding assistant landscape, the company recently announced the integration of an AI assistant into their IDEs. Jodie joins the show today to talk about why the company decided to take this step, the design challenges of adding AI tools to software products, and the team's particular interest in auto-generating code documentation. Jodie also talks about the different types of language AIs, how AI tools will impact software development, and more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:00:58] SF: Jodie, welcome to the show. [0:01:00] JB: Hi, I'm so happy to be on, and thanks for inviting me. [0:01:03] SF: Yeah. I'm really excited for this conversation. I've been a long-time IntelliJ user. Of course, JetBrains does many, many things, but I'm sure a lot of people are familiar with the company through their rather famous IDE at this point. Let's start with you. Who are you and what do you do? [0:01:17] JB: Yeah. I've been at JetBrains for about a year and a half. I'm the developer advocate in data science. But prior to that, I had quite a long career, about seven years as a data scientist. I spent most of that career working in natural language processing, which is obviously very relevant to today's chat. Then prior to that, I was an academic. My PhD is in psychology. [0:01:41] SF: Oh, wonderful. You've made a couple of transitions. You went from academia to data science to now developer advocacy. How has that transition been for you, moving from doing day-to-day data science work to, I'm sure you're doing some elements of data science work today, but probably a lot of it's also focused on educating people about how to do data science with the JetBrains suite, or conveying, essentially, new product updates and feedback cycles, and so forth? [0:02:07] JB: Yeah. It's a question I actually get quite a lot, because I think developer advocacy in data science, particularly, is relatively new. I actually didn't know what developer advocacy was before I ended up applying for this job. A friend of mine works at JetBrains and she suggested that I apply. The reason that I ended up applying for the job is, back when I was an academic, something I was very passionate about was teaching. This felt like it was an opportunity to return to that a bit. Obviously, there's internal stuff that needs to be done, but I would like to think a huge proportion of the time that I spend on the job is just educating people, not just about our tools, but about data science, and making maybe slightly inaccessible topics a bit more friendly. [0:02:58] SF: Yeah. I actually think academia to advocacy, or developer relations, is a great, very natural transition. I also made a similar transition. I know during my time at Google, they would actually, a lot of times, try to recruit people that had some academic experience, because when you're working as a professor, or teacher in a university, you're spending a lot of time trying to, essentially, convey really complex technical concepts to an audience and have them understand it and be able to digest it. That's a lot of what you're doing when you're working as an advocate as well.
It's just a little bit more product and industry focused, maybe, but a lot of those skills, I think, are transferable. [0:03:38] JB: Yeah. It's funny. This job, I think, reminds me more than any other job I've had of my time in academia. It's also because I've got a lot of freedom to do what I like, which I also had in academia. I'll be honest, I really liked all of the jobs I've had, but what I like is how long my rope is in this job. It's nice to have that discretion to make your own decisions about what will be valuable. [0:04:02] SF: Yeah. I also think that that is a common theme as well that I see with people who are working in developer relations. You have a little bit more ambiguity, and there's a little bit less of a fixed structure where you're running maybe a two-week sprint in a software development cycle where it's really, I need to deliver this feature for this larger product offering, or something like that. There's a little bit more freedom to choose your own adventure, I think, in developer relations, which probably speaks to the person who's gone through a PhD, or done academic work, where you might be working on a project that's an undefined thing for three years and that nobody really understands except for you. [0:04:39] JB: Yeah, exactly. It surprised me how comfortable it felt, actually. [0:04:44] SF: Awesome. Well, I think that given your background in academia and NLP, and now that you're jumping into the world of JetBrains and coding environments, there's this huge intersection happening right now around AI coding assistants. That's where I want to spend a bunch of our time today: on these AI coding assistants. Before we get too deep into that world, I wanted to try to lay some groundwork for LLMs. First, can you start off by breaking down some of the relationships between AI, machine learning, generative AI, LLMs, and GPT? I feel like there's a ton of confusion when I talk to folks, especially folks that are maybe not super well versed in this world. I know in my day job, we spent multiple months trying to tell our sales team that ChatGPT is not the LLM. It's GPT that's the model, and so on. People have a hard time dissecting what's actually going on here and what these different things mean. [0:05:40] JB: Yeah. I think part of it, too, is with the current hype cycle, a lot of these terms that are established academic terms have become conflated with marketing terms. Yeah, I think this is a really important place to start. Let's start with AI. AI is a very, very, very broad field. Basically, it's an attempt to create, let's say, artificial systems, or machines, whatever you want to call them, that can mimic human cognitive abilities. This can mean a lot of things. It can incorporate everything from, say, knowledge representation to social intelligence, to the one that everyone thinks of, which is artificial general intelligence, where you have a machine that has the exact same capabilities as a human. The core idea is you're trying to create systems that can do parts of human reasoning. This doesn't have to be everything that a human can do. It can be part of what we can do, like translation between languages, or making decisions independently. With the broadness of the field, you could arguably say that AI is hundreds of years old. But formally as a field, it was started in the 1950s. Then we move to machine learning. Machine learning is, I suppose, a subset of artificial intelligence.
What happens with machine learning is algorithms are trained on data sets to automatically perform some specific task. They really tend to be optimized for specific tasks, the thing that they're trying to do. That might be, I don't know, classifying images, or being able to predict stock prices. These are all machine learning models. These models tend to show a little brittleness when you try to get them to do tasks that are too far outside of what they were trained on. Then that leads to generative AI. Generative AI, I suppose the way that we're talking about it right now, is a subset of machine learning. Basically, you have these types of machine learning models, which are called deep learning models, or neural nets. They're used to create new content based on patterns that they see in training sets. I've seen it mentioned that the first generative algorithm was this computer program from the 1960s called ELIZA, which was basically designed to mimic a Rogerian psychotherapist, like, tell me how you feel. It was, of course, not a neural net. It was just a rule-based program. But this is seen as the first system that could mimic a human and generate text, and people thought it was acting in a human-like manner. Generative AI is broad at the moment. It's mostly centered on text and image generation. There's also speech generation, there's video. I'm sure you've all seen deepfakes. Even things like music, or data, can be generated by these generative algorithms. Then we have large language models. Shall I keep going, or do you want to jump in? Yeah. Large language models are the latest generation of machine learning algorithms in natural language processing. Not all large language models are actually generative AI. Basically, they are mostly built on a particular type of deep learning model, which is called a transformer model. These models are very, very efficient at extracting the meaning of words in context. When you feed these models, when you train them on huge amounts of text, they get super good at creating internal representations of how language functions. This means you can use them for a whole range of natural language tasks. You can use them for things like text classification, summarization and translation. Then that leads finally to the GPTs. The GPTs are a particular subset of large language models. It stands for generative pre-trained transformers. The way that they're trained is you basically get the model to predict a missing word in a sentence, or the next word in a sentence. A GPT would be trained, say, for example, if you feed in a sequence something like, the cat is on the, based on all of the examples it's seen of sentences before and trying to guess the next word, it would probably predict mat, or something that comes very commonly after that. Interestingly, actually, this is something I really like as a fact about the GPTs, they weren't initially created as text generation algorithms. They were actually created as a way to scale up the amount of data that we had for training these models. Because normally with machine learning models, you have to do some pre-processing to get the data into the form you need. If you're just feeding in sentences and breaking them into pieces, you can really scale up the amount of training data that you have.
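To see the next-word objective Jodie describes in action, here is a minimal sketch in Python, assuming the Hugging Face transformers and torch packages are installed. It loads the small, publicly released GPT-2 checkpoint and prints its top guesses for the word that follows "The cat is on the".

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the small public GPT-2 checkpoint (a decoder-only transformer).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Feed in an unfinished sentence and ask for the scores of the next token.
prompt = "The cat is on the"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The model's guess for the next word is simply the highest-scoring token.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)
for token_id in top.indices:
    print(repr(tokenizer.decode(int(token_id))))

Training a GPT is essentially this prediction task run at enormous scale: the model's weights are adjusted until its guesses match the words that actually appear next in the training text, which is why plain sentences are all the training data it needs.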
But the side effect, of course, is that we now have these really amazing models that have these really rich internal representations of language, and are also very good at generating text, because that was what they were trying to do. That's the landscape. It's a little complicated, but I hope that gives everyone a bit of perspective on how we got to where we are at the moment. [0:11:09] SF: No, I think that's wonderful. I think it's a good reminder, even going back to what you started out with, talking about how the foundations of some of the AI research really come back to the early 1950s, and some of the pioneering work by Alan Turing around the imitation game, which eventually became the Turing test. He was really the first person to put forward this idea of, could we build a machine, essentially, that mimics human behavior? Unfortunately, he wasn't around. He lost his life early, so he wasn't able to continue. But a lot of people picked it up from there. He also laid the foundation of, essentially, the Turing machine, which is what we're communicating on today. There's really a long history there. In terms of some of the early work, or not necessarily early work, but the things that we're used to doing in machine learning around these probabilistic models, or classification, people have been using them for fraud detection and spam detection for years, really since the dawn of the Internet. I remember building Naive Bayes classifiers for spam detection back in the early 2000s and so forth. Has there been, essentially, a particular innovation, or something that happened that allowed us to make this leap forward in terms of what we're able to do in more of a generative AI sense? I think what we see from LLMs, and what seems so impressive, is that we can really ask something as if we would ask another person, and we get a response as if it was written by another person. Versus, I think, over the last 20 years of using things like Google search, we've programmed ourselves in how to interact with a keyword-based search, which is performing AI to some extent behind the scenes, but it's not how you would ask somebody about the restaurants in Nashville. You wouldn't type Nashville restaurants, or walk up to somebody and be like, "Nashville restaurants." They'd be like, "What is this rude person talking about?" But that's how you would ask, essentially, Google search to find results like that. [0:13:03] JB: Yeah. There are actually three pillars of how we got to where we are at the moment. The first is we needed processors that are really good at doing the computations that you need to do when you're training neural nets. It's essentially matrix multiplication. The details don't really matter. CUDA was developed, I think, in around 2006. This was a way of being able to turn GPUs into these matrix multiplication machines that we needed to create these big models. Then there was data. I've already talked about the fact that the GPTs just use sentences. Where they ended up coming from was a data set called Common Crawl. It's basically a dump of petabytes of web data. Well, we don't know exactly what ChatGPT and GPT-4 were trained on, but the earlier GPTs were trained on Common Crawl. Then the third was these transformer models I was talking about. With natural language processing, at least, one of the biggest challenges that we had was learning how to represent words in context. This is really, really challenging. There were a lot of earlier attempts at doing this.
The simplest models just don't do it. They're called bag-of-words models. Obviously, in order to understand even things like disambiguation, like, what does run mean in different contexts? What does bank mean? Up to things like sarcasm, or jokes. You really need to understand this context. We had an earlier type of neural net called an LSTM, long short-term memory models. Basically, they were the first ones to try and understand words in their sequences. They had some technical limitations. They were really limited in how big we could make them and how many words we could get them to process in a sequence. The successor of those was the transformer models. They've really allowed us to scale up, because they're not bound to sequential processing. They're really good at processing longer sequences. They can take in huge amounts of data. They can grow really big. This is what's allowed us to end up with these all-purpose natural language machines, because you can just train them on so much data that they learn so much about the world they've seen through that text, that they become almost like, well, I suppose the phrase is stochastic parrots. Yeah, they become really good at imitating us and seeming almost human-like. [0:15:36] SF: Yeah. I mean, one thing that had to happen before that was even possible was we needed to generate enough digital data to even have the corpus to train these models on. 15 years ago, we probably didn't have enough stuff online to even come close to where this is now. [0:15:50] JB: Yeah, that's actually a really great point. I suppose it's something that I often forget about. I grew up at not quite the dawn of the Internet. I still used dial-up when I was in high school. [0:16:02] SF: Me too. [0:16:04] JB: Yes, yes. It's easy to take for granted how much data we have now. In my previous job, I worked in programmatic advertising. It was not natural language processing. We were basically selling auctions for advertisements on applications. In terms of the amount of traffic, we had about 170 billion transactions a day when I was working there. That's just one little company working in this space. The amount of data that companies like Google, or Microsoft, or the big ones have, it's crazy. Yeah, the amount of text data is just readily available. It's all there for the taking. [0:16:41] SF: Yeah. Then you've touched on this a little bit around what the original GPT models were trained on. Can you give a little bit of background on the history of the GPT models? What are the major differences between the different versions? I think most people became part of the zeitgeist with ChatGPT and GPT-3.5. It's where, I think, a lot of people's heads naturally go, to OpenAI and GPT, when we think about generative AI and LLMs. Of course, there's a lot going on in the space, but I think that is probably the main one to maybe talk about before we jump into some other stuff. [0:17:13] JB: Yeah. As I said earlier, the GPTs are not the only large language models. Large language models, well, I should say, transformer-based large language models, were initially actually designed for machine translation. You had this input unit called an encoder, which is designed to learn about the source text. Then you have the decoder, which is designed to learn about the target text and then generate the translations. Taking each of these units on their own, the encoder and the decoder spawned two different branches of large language models.
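To make the bag-of-words limitation Jodie mentioned a moment ago concrete, here is a minimal sketch using scikit-learn's CountVectorizer. The two sentences mean opposite things, but their bag-of-words representations are identical, because word order and context are thrown away entirely.

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the dog bit the man",
    "the man bit the dog",
]

# A bag-of-words model only counts how often each word appears.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())  # ['bit' 'dog' 'man' 'the']
print(vectors[0])  # [1 1 1 2]
print(vectors[1])  # [1 1 1 2], identical despite the opposite meaning

Contextual models like LSTMs and, later, transformers were developed precisely to keep this kind of ordering and context information rather than discarding it.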
Those of you who have any interest in large language models might have heard of a model called BERT. It's a foundational model, which is basically encoder based. It's designed to do a number of – I won't get too much into BERT, but it's designed to do a number of tasks, which make it really good at, again, general natural language tasks. The other path that was taken was the decoder path, and this was the GPTs. The original GPT was really just stacking a bunch of these decoder units and feeding in a bunch of this web data. This was first developed by OpenAI in 2018. They're actually relatively old, I guess, compared to the recent zeitgeist. GPT-2 was 2019. It was about 13 times bigger. I was around for GPT-2. I was working with a bunch of computational linguists when that model came out. We used to actually send each other the completions of prompts from GPT-2. Both of those models, they're good at imitating grammar, but they tended to produce just word salad. They did not sound realistic, let's say. The real breakthrough was GPT-3. GPT-3 was 117 times bigger than GPT-2. This was the point where we started having models producing convincing text. The big problem was that the models learned a lot about the world, and sometimes the things that they learned were not very nice. This is where the models really started showing problems with bias and toxicity. It's also where you started seeing this hallucination problem, where the models really tended to lie. This actually leads us to ChatGPT. It was initially a project called InstructGPT. The idea was, they were like, "Okay, can we design a secondary system that will steer the model into being more truthful, less biased, less toxic?" The first step was they took a GPT model, GPT-3.5. What they did was they got a whole bunch of people, gave them prompts, and got those people to write exemplar answers. These answers were designed to be truthful and non-toxic, non-biased. Then what they did was a process called fine-tuning. Fine-tuning is where you take a general-purpose model, like a GPT, and you train it a little bit further on a smaller data set, which is more targeted. In the case of the GPT-3.5 model, it was fine-tuned to mimic these exemplar answers more, to produce better quality answers. Then they added in one final component. That's a reinforcement learning component. Essentially, what they then did is they took this fine-tuned model, they got a bunch of prompts, they fed those prompts through the model four times, then they got another set of human raters to rate how good each of those answers was, again, based on how truthful and non-toxic they were. They trained another model, a reward model, using those answers. That produced a model which could predict how good an answer was, as a score from one to seven, based on what the raters had given similar answers in the past. Then this whole system is glued together. What you have is, you enter some prompt into ChatGPT, it will output an answer from that fine-tuned GPT-3.5 model. That answer will be fed into that reward model. It'll spit out a score. Then that score will be used to tweak that fine-tuned model. The idea is that over time, you're getting a model that moves more and more towards truthful, non-toxic, high-quality answers. This is ChatGPT. We think that GPT-4 works the same way. It's just a bigger GPT model under the hood. [0:21:53] SF: Yeah.
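The loop Jodie just described, a fine-tuned model producing answers and a reward model scoring them, can be summarized in a few lines of toy Python. The classes and functions below are made-up stand-ins, not OpenAI's code, and the real system updates the model with a reinforcement learning algorithm (PPO) rather than this placeholder.

import random

class ToyPolicyModel:
    """Stand-in for the fine-tuned GPT model that answers prompts."""
    def generate(self, prompt):
        return prompt + " ... (model's answer)"
    def update(self, prompt, answer, reward):
        pass  # in the real system, PPO adjusts the model weights here

class ToyRewardModel:
    """Stand-in for the reward model trained on human ratings."""
    def score(self, prompt, answer):
        return random.uniform(1, 7)  # raters scored answers from one to seven

def rlhf_step(prompt, policy, reward_model):
    answer = policy.generate(prompt)             # 1. the model proposes an answer
    reward = reward_model.score(prompt, answer)  # 2. the reward model rates it
    policy.update(prompt, answer, reward)        # 3. nudge the model toward
    return answer, reward                        #    higher-reward answers

policy, rater = ToyPolicyModel(), ToyRewardModel()
print(rlhf_step("Explain what a transformer model is.", policy, rater))

Run over many prompts, this feedback loop is what steers the underlying GPT toward the more truthful, less toxic answers described above.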
It sounds like, just breaking down some of the history there, some of the things that happened got us to the place where we are now. One is just bigger models, more data processed, better quality data. I'm sure there's some stuff that they're doing around improving the quality of data. Then also, factoring in a combination of other types of models to help fine-tune and refine what's actually there, plus a human-in-the-loop process to make sure that we're steering away from things that are toxic, or unethical. There's actually a lot of interesting work, not to derail things too much, around using fine-tuning, this is coming out of research, to delete parts of the model, too. Because one of the challenges with all these models is they're designed to learn, not unlearn, so how do you unlearn from the model? How do I make a model forget what an apple is, or something like that? Or maybe more relevant would be, forget my social security number, because I accidentally made it part of the training corpus. [0:22:48] JB: Yes. Yeah, the selective amnesia, or more like selective forgetting. I'm really fascinated by this. The reason it's so important is because training these models is so expensive. Also, from an environmental perspective, using GPUs for that long has its cost. Rather than needing to trash everything, just because something was included in this enormous data set that you just didn't spot, it's actually, I think, a really sensible approach to being able to refine the models. [0:23:18] SF: Yeah. I want to start to transition from some of this pure AI stuff that we're talking about, which is super fascinating, and I think you did a great job of breaking a lot of that stuff down, to talk a little bit about the rise of AI assistants. There's, of course, GitHub's Copilot. Salesforce is doing its own work in the space around Einstein. There are even people using ChatGPT as an assistant. From your perspective, what impact are these tools having on software developer productivity and on the role of being a software engineer? [0:23:52] JB: Yeah. It's a really interesting question. To be honest, even before, spoiler alert, JetBrains started developing their own AI assistant, it was a question I discussed a lot with people, because the hype around these models really kicked off when Copilot came out. Then when ChatGPT came out, it really took off. People are really scared about losing their jobs. They're afraid that these models will replace them. Something I want to say is, I truly believe we are nowhere near the point where these models will replace developers. I believe it's just another evolution in productivity tooling. It requires a partnership between the developer and the tooling to get the best out of the tooling, but also the best out of the developer. Well, not necessarily, you don't need to use it as a developer, but this tooling by itself is not going to replace developers. The reason I believe this is the role of a developer is not to code. The role of a developer is to solve problems. Solving problems in a business context is really complicated. It requires not just dealing with the messiness of human requirements and things like that, but it requires architecture decisions that are not straightforward. It requires communication through code, which is potentially team specific. Just in terms of where I think it fits, this is my preamble.
Where I think these tools can be helpful, from talking to people, from reading discussions, from feedback on our tooling and others, the pillars of where these tools can be useful seem to be speeding up boilerplate, or common tasks. Maybe you need to, I don't know, write an API. You can get it to do the skeleton, and then you can go through and refine the code. Other people have told me that they really like using these tools to help with learning and debugging, sort of like the idea of an interactive tutor, or a rubber duck, and then helping with things like, maybe, refactoring suggestions. Particularly if you're new to a framework, or new to a language, it can help smooth that transition when you don't have the experience yet. [0:26:21] SF: Also, I feel like the concern over potentially losing your job because of some sort of technology innovation isn't something that's solely restricted to AI. I feel like with any jump, I mean, there was a time, in the early parts of my career, when you paid somebody, this wasn't the only job they did, but you paid somebody to basically copy and paste files from your staging server over to your live production servers and create backups of those things. Then eventually, we developed real systems for actually automating that entire pipeline. Those people still had jobs; they just transitioned their roles to some extent. There's always been fear, as we've introduced automation, especially in engineering, that it might remove someone's role. Now, this feels like a step function in terms of what's been there before. I also think that there's, for whatever reason, a little bit more fear around things like robotics and AI, things that feel uniquely human. I guess, what are your thoughts on that, in terms of just managing the psychology around this? [0:27:24] JB: Yeah. I think, firstly, I don't know, looking at the findings of where these models tend to be useful, there's not a whole lot of research yet about the interactivity between developers and these tools, but the literature that I'm seeing is really coming out with the idea that, well, yeah, maybe it can replace developers in terms of some basic coding tasks, but it also tends to get things wrong a lot. That means that it's definitely not at a point where you can reliably automate this, create an LLM-based pipeline which will be an automatic developer. I'm not really seeing, I guess, any convincing remedies for this hallucination problem. That alone, I think, is a deal breaker. Again, the way that I see these tools fitting in is, where they really seem to benefit people is they help speed up, or reduce, that gap between juniors and seniors. You'll have some coding task, and it really helps developers onboard a little bit faster with the basics of using a language. I've seen it also, actually, with non-coding tasks; it can help people onboard faster with writing reports, or things like that. It's really just the fundamental parts of the job. You're not going to be able to replace things like, I don't know, knowing how to create a maintainable application. What I see is the large language models can help juniors learn this basic stuff, the basics of using the language, by themselves. Then the seniors can better spend their time tutoring them in the more complicated, the art part of the job.
I actually think it's probably going to be a net benefit for everyone, that there's going to be less time spent doing the boring crap, and more time spent doing the creative, interesting, building part of the job. [0:29:34] SF: Yeah, absolutely. I mean, I think that even in my own personal life, or working through a career, if I'm in full do mode all the time, where I'm just cranking through tasks, it doesn't leave a lot of cycle time for thinking from a higher-level strategy perspective. In theory, if you can use coding assistants to relieve you of some of those rote tasks, then you have more free cycles to be thinking about things like the deeper-level problems, strategy, how do we overall make a better system. I have seen people express concerns about an over-reliance on AI as well, where it could potentially lead to a loss of core coding skills. I think in the world of programming, and from my own experience, there are a lot of times two camps on how to learn to code. There's one camp that's like, hey, you need to suffer through the school of hard knocks of memory allocation issues and bit twiddling and really get to these low-level compiler errors in order to truly understand what's going on. Then there are other people who are like, hey, you can skip all that stuff. Let's have a gentler entrance into this and start with something that doesn't require compilation, like JavaScript or Python, maybe a little bit more digestible in terms of reading, lower the barrier to entry, get people excited about it. Then maybe they can find their own way to bit twiddling and dealing with memory allocation errors. If we abstract that away even further, where we don't even need to do any of those steps, or we become over-reliant to the point where we forget how to do some of these things, do you see concern in terms of those issues? [0:31:09] JB: Yeah. I think it's a very parallel argument to Stack Overflow. Copying and pasting from Stack Overflow, we've all done it. It's fine. In the end, you are responsible for the code. Stack Overflow can't do your job for you. In the same way, large language models can't do your job for you. You cannot get out of being responsible for the things that you put into production. Therefore, you cannot get out of having to at least understand what this function is doing, having to understand the side effects. I do also think that there are some, maybe some traps in these models. We've already talked about the hallucination problem. There are a couple of other problems. In our own evaluation of our own AI assistant, we found the assistant will come up with good code most of the time, but it can throw up issues like outdated frameworks, or libraries, or outdated usage of APIs, or just general poor code. These are side effects of the fact that these models have a training base, or a training data set, which is locked at a particular period of time. It goes up until 2021. Maybe your framework came out after that. I actually think this really speaks to why this needs to be a collaborative endeavor. You have yourself as a sanity check, but you also have team members who are going to be doing PRs on your code. You have your compiler, or your interpreter, that's going to be doing that check. You always need your critical thinking. You need to be going, okay, it's produced this, but I'm pretty sure we're up to framework version 12, or whatever your version is. Maybe I should just go back to the documentation and check that, because there could also be vulnerabilities, things like that.
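As a hypothetical example of the kind of outdated suggestion Jodie is warning about: an assistant whose training data stops in 2021 might still suggest pandas' DataFrame.append, which was deprecated in pandas 1.4 and removed in 2.0 in favor of pd.concat, and it is the human reviewer who catches it.

import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "total": [9.99, 24.50]})
new_order = pd.DataFrame({"order_id": [3], "total": [5.00]})

# What a model trained on pre-2021 code might suggest
# (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
# orders = orders.append(new_order, ignore_index=True)

# The currently supported way to do the same thing:
orders = pd.concat([orders, new_order], ignore_index=True)
print(orders)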
It's a collaboration, in the same way that you wouldn't just take something from Stack Overflow uncritically and put that in your code base. [0:33:22] SF: Yeah. Ideally, you're not just copying and pasting from Stack Overflow and then committing immediately and pushing the change. It's what you were saying about people having a responsibility, at the end of the day, for what they're copying and pasting, or pulling from another source. I mean, I guess it's the same as if you were using a GPS, and your GPS was steering you into a river because it made a mistake, but you're still responsible for using your human intelligence to realize, "Okay, there's no road here and I probably shouldn't drive into this water." [0:33:51] JB: Yes, exactly. Exactly. Like I said, you're not getting paid to code. The LLM can't replace you. You're getting paid to solve a problem. You also maybe just shouldn't accept the first code that the LLM outputs. Maybe you have a senior, or maybe you've seen it done a different way, which you think is more maintainable. In that case, refactor it. It's fine. It's just a starting point to maybe help you get over a little bit of a bump and not have that blank slate problem, or blank page problem. [0:34:27] SF: I would think, too, context is a big part of it. If you're working on a big code base, maybe even within your own company you have your own conventions, and you're going to have your own knowledge that's probably not part of the model, that you need to use to inform, or refine, what you're actually generating. I wanted to talk a little bit about what JetBrains is doing in the space. Can you talk a little bit about the JetBrains AI Assistant, and how it perhaps differs from the other AI coding assistants that are on the market? [0:34:55] JB: Yeah. What I want to talk about is why we decided to do an AI assistant. We've been doing developer productivity tools for 20 years. We sat back and watched what was happening in this space and carefully thought about whether it would be a net benefit to the tooling. After seeing some of the evidence coming out, some of the papers coming out, and trying it for ourselves, we decided that, yeah, we did feel that this could be an increase to developer productivity. We really wanted to make it useful, though, not something that takes developers out of their workflow, and to weave it into the workflow as much as possible. Something that's quite advantageous for us, I guess, is that we basically control the entire environment. What this means is, obviously, when we're indexing the project, we have access to the full context. We have access to the structure of the project files. We have access to all the libraries you're using. We obviously know what programming language you're using. We also record some behavior, although I should say this is opt-out; if you don't want to do this, you can opt out. Then obviously, we know which file you're using. This means we can actually collect all of this information and pass it as part of the context to the large language models that we're using as part of our AI assistant. We'll come back to that later, I think, when we get into details about specifically how we're doing things. This is, we realized, quite a unique opportunity that we have, which helps us really target the actions and get the best and most relevant output. Then the other thing is, we wanted to be flexible about the models we're using, because things are moving very fast.
I could not have really predicted a year ago that we would be where we are. We also have a variety of models that we're using in what we're calling our AI service. Different actions are connected to different models. We're constantly doing research to try and get the best out of the models we're using, and to check if we could use a different model that might do things better. Yeah, it's definitely an ongoing project at this stage. I think it's an interesting direction for us to go in and something we really see as quite promising. [0:37:25] SF: One of the things you mentioned at the beginning there was, you're trying to make sure this feels natural, not just something bolted on. What are some of the things that you did? What are the design decisions, I guess, that you made to incorporate it in a way that feels natural to an engineer's workflow? [0:37:43] JB: Yeah. This was something we wanted to be super careful about, because the hype is real. We didn't want users to feel, I don't know, that we were selling out and just jumping on a bandwagon. We've done our research. What we've found is that people find large language models most helpful when they can take routine, or boilerplate, tasks and make them easier or faster. I talked about that a little bit earlier. I talked about prototyping, or refactoring. Something else that we found people really find a bit of a barrier when they're doing development, and that they don't like doing that much, is things like creating documentation, or communicating clearly with their code. It can feel a little bit like an afterthought. We've tried to integrate that as part of the AI assistant workflow. We've got context actions for writing documentation with some of the languages, or doing variable name suggestions, or just commit message suggestions. The way that we've integrated it is we've really tried to make it just another context action. When you're using the tool, you have your favorite context actions. This is really just another one. We do have an AI chat to the side, but we've tried as much as possible to get things to be in line and part of the normal workflow. [0:39:11] SF: Yeah. I feel where IDEs can be really successful and really powerful is when, basically, the hands never really need to leave the keyboard. Because that's where you're living when you're coding. If you have to go and click around to an external tool, or something like that, it just breaks the whole workflow. It almost looks like someone, I don't know, playing a musical instrument if you have somebody who's really, really tuned in with an IDE and they can navigate the whole thing and speed through it. I think that makes a lot of sense; if you can make it part of that workflow and someone gets used to it, they can really increase their productivity 10X. [0:39:46] JB: Yes, absolutely. It was actually really interesting. We've released this as a limited preview. Some of the feedback we've been getting so far is, "Oh, I used to go and use ChatGPT and then copy and paste from the other browser." Even just having the actions inside the IDE is obviously saving you a lot of time. On top of that, we've really tried to integrate it as much as we can with the tooling that people are already using. [0:40:18] SF: Then you mentioned writing documentation directly. I think that's a really exciting thing.
That's actually one of the things I've talked about since this whole explosion: hey, if you could take documentation off people's plates, then people would love you for it. Because it's really important, but people hate doing it, and it can be painful to get people to do it. How do you go beyond just really basic documentation, where it's just re-explaining what's already there in the code? [0:40:44] JB: Yeah. Trying it out, I've mostly tried it in Python, because this is the language I program in, because obviously, I'm a data scientist. First of all, I'll just give you my impressions of how it comes out, and then I'll explain a bit of how we do it under the hood. Basically, what I found is the docstrings tend to be really quite rich. They're very much in the format you would expect a Python docstring to be. You're getting your definitions of your variables and your outputs. You're sometimes getting explanations of how to use it, and they're accurate. It's actually super rich. I was really surprised, because it's better than the docstrings I normally write myself. That was really cool. In terms of how it works, like I told you, a whole bunch of information is collected as part of building the prompt. You'll have information, obviously, about the function that you want to write the documentation for. You're going to have information about the entire file. You're going to have every single file that is relevant to that function. You're going to have the libraries that are installed. You're going to have the project structure. You're going to have all this information that's included. Then what happens is, within the IDE, we have a bunch of traditional machine learning models, and they're used to actually classify which parts of the context are most relevant to that function that you want to write documentation for. Then what we do is we use all of that to create relevant information for the prompt. Then we basically add to that instructions to create documentation in the style of the relevant programming language for this specific function with these specific parameters. We then call our JetBrains AI service. We have a security and anti-fraud check, just to make sure that there's no malicious code being passed. Then, at the moment, we're using third-party models. We call the API of the model, passing our prompt, retrieve the output, and then pass that in-line into the script. You can imagine, I'm really impressed, actually, at the ops side of things. This is all happening in a number of seconds. I hadn't realized until recently how much was happening under the hood with the prompt building. Yeah, it's this whole process to even get the prompt. You have models determining the most appropriate context. It's really fascinating. [0:43:20] SF: Well, how do you benchmark some of this stuff? You mentioned that you're trying to be model agnostic, because so many things are moving in this space, and you want to be able to potentially plug and play a different model if a better model comes along. How do you tell that you're moving in the right direction with something that is this broad, where even from an experimentation standpoint, being able to recreate the same experiment is difficult, because when you give a prompt, you're not necessarily going to get exactly the same response? How do you know that what you're doing is actually better than what you did in the past? [0:43:52] JB: Yeah. This is a really great question.
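To make the flow Jodie walked through a moment ago a little more concrete, here is a schematic sketch of that kind of documentation pipeline. All of the names below (build_docstring_prompt, ToyRelevanceModel) are hypothetical placeholders rather than JetBrains' actual implementation, and the real system also runs security and anti-fraud checks before the prompt is sent to the hosted model.

class ToyRelevanceModel:
    """Stand-in for the traditional ML models that rank project context."""
    def is_relevant(self, item, function_source):
        # Toy heuristic: keep a context item if it shares a word with the code.
        return any(word in function_source for word in item.split())

def build_docstring_prompt(function_source, project_context, relevance_model):
    # 1. Keep only the context the relevance model considers useful:
    #    related files, installed libraries, project structure, and so on.
    relevant = [item for item in project_context
                if relevance_model.is_relevant(item, function_source)]

    # 2. Add instructions for the target language and documentation style.
    instructions = ("Write a Python docstring for the function below. "
                    "Describe each parameter and the return value.")

    # 3. The finished prompt is what would be passed to the hosted model,
    #    whose completion is then inserted in-line in the editor.
    return "\n\n".join([instructions, "\n".join(relevant), function_source])

context = ["import requests", "BASE_URL = 'https://api.example.com'"]
source = "def fetch(path):\n    return requests.get(BASE_URL + path).json()"
print(build_docstring_prompt(source, context, ToyRelevanceModel()))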
It's something that we've been working on internally for a while, even with the ML for code completion work that we've been doing for years. Actually, what a lot of people don't realize is when, say, you have your context actions, or your suggestions for particular methods, you have potentially different rankings depending on the context of the file that you're in. That's all powered by machine learning, but it's more traditional machine learning. It's something we've been thinking about for a while, because obviously, they're all non-deterministic models. We haven't done, I would say, super extensive research in terms of quality benchmarking in that area yet. I would say, the perceived quality. It's mostly been survey based, based on the people who are participating in our limited preview. What we have found from them, we got about a thousand responses, is that the overall perception of the code quality was really good. About 73% of the participants were rating it at least a four out of five, or a five out of five. That was consistent across languages. At least from that perspective, that's pretty solid. At the moment, we're really more focused on trying to refine the outputs based on feedback that we're getting from the beta. For example, people were commenting that our generated commit messages were too long. We're trying to get them shorter and test specific hypotheses. Can we do this with this prompt? We obviously want to move more into benchmarking other qualities down the track as this becomes more of an established tool. [0:45:36] SF: Yeah, I would think that one type of metric that you'd want, particularly in a space like developer productivity, is basically, what is the baseline? Then if we introduce this new thing, does it move developer productivity in the right direction? [0:45:50] JB: Yeah. I've been looking around at other studies in this space. Obviously, GitHub has been doing a bunch of their own. One that I saw that they released on arXiv found that people could complete a task about 55% faster using Copilot versus not. Then there was this fantastic little study, it really made me laugh. Do you remember that ChatGPT was suddenly banned in Italy? [0:46:15] SF: Oh, yes. I know about it. [0:46:18] JB: Yeah. They did some research and they found that the number of software releases went down 50% in the two days following the ban. Then it increased, but correspondingly, the use of censorship-bypassing tools also increased. Obviously, they couldn't directly connect everything. We really want to have a look at this stuff down the track as well, because we believe in it. We believe that it's something that will help developers. Otherwise, it's not something that we would want to invest in. I'm quite sure we'll be able to replicate those results. [0:46:57] SF: Yeah. I mean, I think that there are strong enough signals that you're seeing in the market right now. I mean, the fact that someone's willing to go to ChatGPT, put in a prompt, copy and paste it and bring it back over to their IDE shows that there's some value there, because that's just adding a bunch of friction to the whole process, versus just spending time on the keyboard and generating it yourself. Are there specific industries, or use cases, where you think that AI coding assistants are particularly impactful, or maybe are not the right fit for those particular verticals, or use cases? [0:47:29] JB: Yeah. Obviously, a big concern right now is privacy.
What we've found is, well, I think everyone's found this: large language model ops is hard. It's really hard. A lot of the value that OpenAI and other large providers are giving is that they have best-in-class models. A lot of those exist as open-source models now. They have really managed to solve this problem of efficient inference. You can pass in a prompt and, in a matter of milliseconds, you get back an output. These models are so huge. It's just so impressive that they can do that. It does mean that pretty much everyone, I don't really think I've seen anyone who's not, is bound to these large third-party providers, or maybe they're doing their own in-house stuff. If you have strong concerns about privacy, and I would say particularly in fields like health, or medical, or financial, anywhere where there is very sensitive data, you need to be careful about the use of this. We're looking, for our release, to do on-premises models. It'll be an evolving thing as models in this space evolve. For now, obviously, we all need to exercise some caution about what's passed. If there are really sensitive pieces of information you're working with, you need to be careful about this. [0:49:00] SF: Yeah, absolutely. I mean, this goes back to what we were saying earlier around how these models are designed to learn, not unlearn. In my day job, this is a space that I work in heavily, like regulated industries, especially healthcare. We deal with a lot of people in the LLM space trying to innovate here. How do I do this with doctor's notes and other super sensitive data? It really comes down to, there aren't, essentially, efficient ways of deleting information from models today that are cost effective. You have to put a lot of time and thought into, how do I build these training pipelines in such a way that I'm not accidentally sharing information? How do I control, essentially, who sees what, when, where, and in what format, and so forth? There are some really hard problems to navigate and solve still. [0:49:41] JB: Yeah. It's something we're really cognizant of. We have an agreement with our providers at the moment that they're not going to use our data for training. Anything that's passed won't be used for that. Obviously, a lot of companies do want to err on the side of caution. Obviously, having come from a health and medical background, I totally applaud that. I think it's better to be safe. What I'm hoping is, in the next six months or so, we can see ways where open-source models can be deployed and have efficient inference. I think that would be the ideal solution for everyone. [0:50:19] SF: Well, as we start to wrap up, Jodie, is there anything else you'd like to share? [0:50:23] JB: Yeah. What I would advise is, if you're interested in this project, keep an eye on our blog. We're constantly posting updates. We're just working on refining the features that we currently have. We're looking to implement highly requested features, but we're just really trying to do a good job with a small number of features. Yeah, just keep a look out. We're hoping to release this in the coming months. [0:50:48] SF: Awesome. Well, thanks so much for being here. I really enjoyed this conversation. I think we covered a lot, going back to really setting the stage for the space and what's happening in it, and then moving into the AI coding assistant world.
I think that you're, in my opinion, taking the right approach to this, where it's, let's see how we can innovate in this space, or leverage these models, but do it in a way that's natural to developers. Not gimmicky. Let's not force this on them. Let's make it part of their workflow, so that we're constantly enhancing productivity, which feels like the place JetBrains has founded their company around; for the last 20 years they have really been working in the developer productivity space. [0:51:31] JB: I really hope that when people try it out, they can see that intent, because it's how I feel about it as well. Yeah, the create documentation thing, I also went crazy with this feature. It's so cool. Yeah, hopefully, you'll be able to see that potential as well. [0:51:46] SF: Awesome. Well, thank you so much, and cheers. [0:51:49] JB: Yeah, thank you so much. [END]