EPISODE 1687 [INTRO] [0:00:00] ANNOUNCER: LLMs have become one of the most important technologies to emerge in recent years. Many of the most prominent LLM tools are closed-source, which has led to a great interest in developing open-source tools. Antonio Velasco Fernández is a data scientist and Jose Pablo Cabeza García is a lead data engineer both at Elastiacloud. This episode recorded in 2023, they joined the podcast to talk about LLMs and the importance of community development for LLMs. This episode of software engineering daily is hosted by Jordi Mon Companys. Check the show notes for more information on Jordi's work and where to find him. [EPISODE] [0:00:50] JMC: Hi, Pablo, Antonio. Welcome to Software Engineering Daily. [0:00:54] AVF: Hello. How are you doing? [0:00:55] JPC: Hello. [0:00:55] JMC: So, we are here with Pablo and Antonio because they gave a talk just a couple of weeks ago, a bit more maybe in Open Source Summit, Bilbao, Europe 2023, that took place in Bilbao, Spain. And they talked about LLMs, the ecosystem around LLMs, and how, in particular, open-source communities can expand and make the capabilities of such powerful technology more open and potentially better. What are your concerns about the current LLM landscape? You mentioned a few things about certain, maybe the possibility of the monopolistic behavior by some players. Could us describe what the ecosystem looks like right now and what your concerns were about? [0:01:42] JPC: I reckon this moment is a key point on the story of development of LLMs. That's because OpenAI has made very successful marketing stunt releasing GPT-3, and that's taking sort of the lead in the scene, in the industry right now. However, these developments has a lot, thanks to the open source community. There is a risk now that these companies on the lead risk of monopoly or even oligopoly with the most powerful developers, as they will claim that these tools are dangerous, and therefore they need to be regulated. The problem is they could be dangerous. That's fair. But the issue is, the regulation that they are advocating for is a revelation where they hold the power, or they hold the already developed tool, and they place barrier for new entries or for the open source community to actually be informed about how they work. I think that's the risk. The risk is where we, as the public, lose control of what these tools do, instead of a risk coming from, well, anybody could create one, which I reckon is a good thing and it's what is at risk for the business. [0:02:58] JMC: Okay. So, first of all, were you not impressed by OpenAI? Because you said it was a marketing stunt in a way. I get it, of course. They probably do marketing really well. But for the layman, from trying yourself, who was not like following AI events closely. I was deeply impressed by LLMs and OpenAI in particular. But were you specialists in LLM? So, people that have been following AI technology throughout the years, were you not so impressed than you thought? Do you think that this is more of a marketing campaign rather than, "Oh, were you actually impressed by the release of GPT-3?" [0:03:37] AVF: I totally was impressed, to be honest. It's incredible. It was a huge leap. What I mean by that is the fact that they decided to go with for a public release, and make it as broad as possible to the uncommon or to the general public. That was genius in the sense of making it, going from the specialist to whole world, right? 
But if you go to the technology they use, it's based on encoders, decoders - so it's decoder only at this point - that comes from Attention Is All You Need. That was like the milestone paper that created all of this, and that's a few years old. When was it, Pablo? It was 2017? [0:04:19] JPC: 2017, yes. [0:04:21] JMC: Before we move on to the description of the technology, I know none of us is a lawyer or probably a legal expert, but what is it about the regulation that they are proposing that, if it gets approved and then applied, would actually limit the ability of open source technology to influence LLM technology development, or just actually contribute to it? What is it specifically about what OpenAI and other closed-source companies are proposing that would sort of hinder the open source approach to it? [0:04:54] AVF: Yes, so they have in the past promoted conversations to have the - this was in the US, but to have the Congress of the United States limit who can develop this tool. Basically, limit the actors to some pre-approved ones, which would be them and these big companies. So, they stop new actors, like startups, small companies, or even the broader community, from participating in these activities. So yes, it raises the barriers of entry to this kind of tool a lot, for developing new tools and research, if you actually need to go through all the legal hoops that they want to set up there. [0:05:35] JMC: So, the foundations of this technology are mostly, as you said, open source, open science, right? Because you mentioned the paper Attention Is All You Need from 2017. Could you summarize the findings of that paper? And then, if you would be so kind, describe the encoder-decoder system, and maybe move on to transformers. But at a high level, what are those three things? [0:05:56] JPC: So, at the beginning, what the state-of-the-art was doing was an encoder. The way it works is to take each piece of natural language, slice it into small pieces that you call tokens, and assign a certain numeric value to each one of them. That way, you are starting to create vectors, or numerical representations, in which an algorithm can manage these words in a way that lets it understand them. So, Attention Is All You Need, the paper, brought this new way of doing this, which was the encoder-decoder split. I think it was no longer only about translating one token one-to-one from one side to another - for example, when you were doing a translation from one language to the other, you would translate one token to the equivalent in the other language. In this case, what you would be doing is taking into account the whole context of the phrase, paragraph, or even text, which meant that the quality of the answer is much better now, as it stops misunderstanding words taken just as they are, and instead reads them in their context. [0:07:10] JMC: So, just for the record, at that point in time I was not aware of that paper, but I used to work in the language technology industry, and specifically in machine translation. I remember vividly that the encoder-decoder system was actually really popular for machine translation. I would never have imagined that, down the line, it would end up enabling general LLMs, right? Because LLMs can be used for so many things other than machine translation. It's so funny. [0:07:40] JPC: Yes. I think the ecosystem has evolved a lot since then.
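To make the token-and-vector idea Pablo just described concrete, here is a minimal, illustrative sketch in Python. The toy vocabulary, vector size, and hash-derived values are assumptions for the example only; real systems use learned subword tokenizers and learned embedding matrices.

```python
# Toy illustration of tokenization + numeric representation.
# Real models learn both the tokenizer and the embedding values during
# training; here we fake them just to show the shape of the idea.
import hashlib

VOCAB = {"the": 0, "population": 1, "of": 2, "london": 3, "is": 4, "<unk>": 5}
EMBED_DIM = 4  # real models use hundreds or thousands of dimensions

def tokenize(text):
    # Slice text into small pieces ("tokens") and map each to a numeric id.
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def embed(token_id):
    # Turn a token id into a small vector of numbers. A trained model learns
    # these values; we derive them from a hash purely for the demo.
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return [b / 255.0 for b in digest[:EMBED_DIM]]

tokens = tokenize("The population of London is")
vectors = [embed(t) for t in tokens]
print(tokens)   # token ids, e.g. [0, 1, 2, 3, 4]
print(vectors)  # one small vector per token, ready for the network
```

The point of the attention mechanism from the paper is that, when producing each output token, the model can weigh all of these per-token vectors together, rather than mapping tokens across one-to-one.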
So, there is some trivia there - if you look at the different systems, you have, as I was saying, the encoder-decoder, decoder only, encoder only. It has evolved since [inaudible 0:07:54]. There was BERT, and then there is a - Google did a lot of the work on that branch. But I think the genius from OpenAI was the approach they took with GPT. You have GPT-1, GPT-2, up until GPT-3. So, it's that approach of being decoder only, being much leaner. I mean, Antonio is here, the data scientist. He can explain that better than me. But they simplified the approach a lot, made it much leaner, with it being decoder only, which allowed them to make these leaps. [0:08:29] JMC: Can you explain the difference between the three types - encoder-decoder, encoder only, and decoder only, which seems to be the winning approach? [0:08:40] AVF: Encoder only uses - so, the encoder and decoder are neural networks, basically. So, the encoder is the one that is assigning a numerical value to each of the tokens. Encoder-decoder was designed in the [inaudible 0:08:53]. You have two neural networks back to back. One is creating the numerical translations of the tokens, the vectors. And the other one is interpreting them. The decoder only is basically the same concept, but instead of using a neural network to create the encodings, what it's using is the hidden state of that decoder as the vector. The hidden state is inside the neural network, the current layer, with the hidden numerical values, the neurons of it. Using that as the encoding makes it better in the sense that it is flexible, instead of something fixed by the output of - [0:09:31] JMC: What about transformers? Because GPT stands - the T stands for transformers, right? [0:09:36] AVF: This is transformers. So, transformers is what came from this paper, this technique of using back-to-back neural networks. [0:09:44] JMC: So, you also discussed the - you talked about agents and expert systems and how these relate to the limitations of LLMs right now. Could you elaborate on that? Did I get that correctly? [0:09:55] JPC: Yes, sure. As you said, in the end, an LLM is an autocomplete system, right? That's what we always say. So, basically, given the training it has, it will give you the answer with the highest probability. It does that very well. So well that it actually confuses us and we think it's a person answering instead of an LLM model. But it's only that. So, once it is trained, that box is fixed, right? You would need to either fine-tune it, that is, adding layers on top of it, or you need other ways of working with that black box that is now your trained model, right? If you look at GPT, OpenAI and their models, you can see that they have snapshots. So, if you look at GPT-3.5, you will see that they have the ID after the model. If you, for example, go to the playground, click on the drop-down, you will see that you find a snapshot. So, those are the different trainings that they have for that model. To add information to it, there are different approaches. The open-source community has taken the LLM, some using papers that were written to work with LLMs before GPT-3. A few of these were before GPT-3, [inaudible 0:11:11]. The idea is that you take that black box that is an LLM and you find a way to work with it to be able to persist information. So, there is the concept of memory. Because the LLM is a black box that takes input and gives output. It doesn't remember anything.
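A minimal sketch of the "autocomplete" behavior Pablo describes: once trained, the model just keeps choosing the most probable next token given everything it has seen so far. The tiny probability table below is made up purely for illustration; a real model computes these probabilities with the trained network.

```python
# Toy "autocomplete" loop: repeatedly pick the most probable next token.
# NEXT_TOKEN_PROBS is a made-up stand-in for a trained model's output.
NEXT_TOKEN_PROBS = {
    ("the",): {"population": 0.6, "weather": 0.4},
    ("the", "population"): {"of": 0.9, "is": 0.1},
    ("the", "population", "of"): {"london": 0.5, "paris": 0.5},
    ("the", "population", "of", "london"): {"is": 0.95, "was": 0.05},
}

def generate(prompt_tokens, steps=4):
    context = list(prompt_tokens)
    for _ in range(steps):
        probs = NEXT_TOKEN_PROBS.get(tuple(context))
        if not probs:
            break
        # Greedy decoding: take the single most probable continuation.
        context.append(max(probs, key=probs.get))
    return " ".join(context)

print(generate(["the"]))  # -> "the population of london is"
```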
So, if you want to keep the LLM remembering, you need to give information back every time, which, in the end, runs into what is called the context window. So, you end up going over the context. There are a lot of strategies. One of those is agents. There have been several papers on agents, like ReAct and the MRKL systems. So, basically, they are systems where you have the LLM being a reasoning machine. The LLM is the one that is taking actions, making decisions. But then you have this other system on the side that is called the experts, or tools, depending on which framework you're using. Basically, this whole system is one that can actually give you extra information, remember information for you, and also interact with the world, right? Because at the moment, LLMs don't have that capability. So, if you need to interact with any external system - let's say, I mean, the usual example that you see in the documentation of LangChain, for example: what's the current population of London, right? So, the LLM will make an assumption, but it doesn't really know the answer to that. So, if you put an agent on top of it, what you can do with the agent is you will ask the LLM, how could I answer this question? The LLM will tell you, "Oh, you need to do a Google search for this." And then you have a tool that is actually the search tool that will go do this Google search and give the results back to the LLM. So, the LLM now has that information and can make extra assertions. This is just a simple example. [0:13:10] JMC: Yes. Let's elaborate on this, because my knowledge of LLMs is very limited in general, as you can see. If I ask an LLM, OpenAI's 3.5 or whatever, any GPT model, what's the population of London? Basically, what it's trying to do is guess the next word, right? So, it will start with the first word or the first sentence. Let's say that the LLM knows that the answer should start with "the population of London is", right? But each one of those words is a guess, given the question, the prompt, the context window, I guess, although there was none there. I literally started the conversation with one question. What is the population of London? Then, this thing will be guessing the next word, the probabilities of the next word. The population of London. So then, when it comes to giving the figure, how much can we trust that figure? Does the LLM know that the training data it's been trained on probably has enough data points of the real population of London, so that the LLM can say, "Okay, among my training data, I've got eight million as the most probable data point. So, I should tell this person that the most probable population number of London is eight million." Is that how it works? [0:14:31] AVF: This, in a sense. So, there are two things at play here. The first thing is these models are so massive that [inaudible 0:14:40] it's not a biased population of data. They are trained with all the data available on the web, literally. So of course, when you ask something so massive, or such common knowledge, as the population of London, you can sort of trust the answer without providing any context, as the model has probably been trained on census data, or Wikipedia, et cetera, et cetera. The caveat is maybe only that the data it was trained on goes up to 2021. So, the answer may be a bit out of date, but probably it's going to have the exact answer in its mind, since it was trained on that.
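A minimal sketch of the agent-and-tools loop described above, with the population-of-London example. Both `fake_llm` and `search_tool` are hypothetical stand-ins, not LangChain's or any real library's API; they only show the shape of the reason-act-observe cycle.

```python
# Minimal sketch of the agent loop: the LLM decides what to do, a "tool"
# does it, and the result is fed back in as context for the next step.
# fake_llm and search_tool are stand-ins, not any real library's API.

def fake_llm(prompt):
    # A real agent would call an actual LLM here. We hard-code the two
    # decisions the conversation walks through.
    if "Observation:" not in prompt:
        return "Action: search | Input: current population of London"
    return "Final answer: London has roughly 8.8 million inhabitants."

def search_tool(query):
    # Stand-in for a web search; in practice this would hit a search API.
    return "Population of London (2021 census): ~8.8 million"

def run_agent(question, max_steps=3):
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        reply = fake_llm(prompt)
        if reply.startswith("Final answer:"):
            return reply
        # Parse the requested tool call and append its result as an observation.
        tool_input = reply.split("Input:", 1)[1].strip()
        prompt += f"\n{reply}\nObservation: {search_tool(tool_input)}"
    return "No answer within step budget"

print(run_agent("What's the current population of London?"))
```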
As you said, it's just a stochastic model. It's trained to guess the most probable next word, and the most probable word in this case is going to be London's population, the actual one, quite probably. This can lead to issues, as it can obviously hallucinate and create information that is not true, just because it's maybe mistaking one city for another, or, if we get into more complex fields, it can start to make up sentences which are the most probable answer, but they are just built from all the similar contexts, not that one. The way the AI agents work is your LLM is no longer having to come up with ideas like a charlatan in an exam, if you know what I'm talking about. Rather, it is at the helm, thinking: what do I need to do? I need to Google for it. I Google for it. I get the actual number from Google, I include that as context, and then I answer. I'm answering now having the actual context. So, my answer is now 100% - well, as much as the search engine is going to contain the correct answer. [0:16:28] JMC: Yes, exactly. You then trust Google instead of a stochastic parrot. I came across a tweet the other day that said that everything that an LLM produces is a hallucination. Some hallucinations are true, are right, but all of them are hallucinations. I thought that was kind of - it's kind of true, because they are not self-conscious. But obviously, for questions that are common knowledge like these, they tend to be correct, or right, because their dataset is based on data that is well-known. [0:17:00] AVF: Yes. So, the other issue with LLMs and answering this question is about tying all that to the source, right? So, if you were to ask the population of London and then, "Oh, where did you find that number?", it won't be able to answer that. Because in the end, everything is mixed together in the model, in the soup that is the model, so it won't be able to trace back to any sources or anything like that. I mean, I have seen a lot of people trying to use it for research and things like that. But it's pretty hard to trust it as a source. So, that's another issue with LLMs. [0:17:29] JMC: Thanks. The example that you've given about agents, I love it, right? It sort of grounds the potential hallucinations of an LLM. In the case of the population of London, it would probably go out, fetch the information and provide it as context. But I can very easily imagine agents having too much latency, right? Being able to write prompts for an LLM, and just keep going on and going on. So, has anyone explored the scenario in which they run wild? Because they've got this, as you said, this agency and this ability to run tasks in the real world, right? The combination of LLMs and so forth might run wild. What are the chances of that happening? [0:18:12] AVF: Are you talking about the singularity, where it's starting to - [0:18:15] JMC: Is that the singularity? Did I just capture that? Okay. No, but really, you've mentioned the example of LangChain. Can it run a bit wild? I'm not talking about life-threatening scenarios. [0:18:25] AVF: I wish it would go somewhere that would just create Skynet or something like that. I think we are far from that. The first thing is, yes, it can be a risk issue, since some agents are capable of actually writing commands for a console.
A basic Python agent, for example, is where an LLM will be prompted with a task or a problem, and the LLM would write a piece of code that would solve that problem, and then it did have a tool that allows to execute that piece of code into your machine. Then from the output of that, it will reflect and say, "Oh, well, I have this error, so maybe I need to install this library." Well, two things here. Yes, it's a risky issue. It's running code on your machine. It can delete files. It can style malware or anything that could potentially be an issue for generating your data, et cetera, et cetera. But it's not able to do anything that is out of their knowledge and scope, in the sense of all the code that LLMs, GPT, et cetera are always providing to the users, is code that has been created by humans before, because it needs to have been trained on assumption. Maybe it's a combination of a lot of really small problems that humans solved before, and then it's just able to put them together in a clever way. But that's the scope of what it can do. Any developer that uses Copilot or GPT-4 to help in this day-to-day coding as we do, will tell you that as soon as you get to a complex problem, or in the edge of what is already been done, or a new technology, it fails to help you, or it just hallucinates, and give you a piece of code that says, "Yes, this will do." No, it doesn't. It doesn't because it's something that is not common knowledge or not very easily find in the Internet. So yes, it's unable to start calling itself until it gets me - well, but yes, risk issues, which are not as exciting as the science fiction we would like to have, that they are there are there. Of course, they have purpose. [0:20:38] JMC: It can happen, right? Down the line, let's see. [0:20:41] AVF: The first thing I felt was, well, why doesn't anybody in OpenAI just brought a brand that is improve yourself? [0:20:49] JMC: Exactly, exactly. The next GPT-5.0 or whatever. [0:20:54] JPC: Something that I have seen theorized that is going to happen in the near future, and it's going to be an issue for LLMs and for us that are going to try to use our LLMs, is that as far as more content is being generated by LLMs, they will try to use that content to train themselves. Then, you start in this feedback loop of basically everything becoming very homogeneous, right? Because the same output that they generate, they're using it to train themselves. So, I think we will probably see that being an issue or we will need to deal with that in the future where we need to actually filter out to increase the quality of LLMs, because more and more percentage of the content of the world is going to be generated by LLMs. [0:21:37] JMC: Yes. [0:21:37] AVF: That Internet was that curl, right? [0:21:40] JPC: It had something like that. [0:21:42] JMC: So, we've already represented a bit the closed source. Let's call them this the closed source camps position. They're concerned about potential issues. They proposed regulation. The way in which that regulation is at the minute presented, impede, might add friction layers that you guys think that would stop innovation coming from open source. Can you represent the open-source community's take on this? I presume the open source community wants it regulated but once AI or LLMs, in general, to be regulated in a different way. Could you describe that? And also, how do you see personally or the open source community see the collaboration between these two camps, right? 
The open source LLM camp and the closed-source or commercial camps. [0:22:29] JPC: Yes. So, I think in terms of the open-source community, you need to think that there are multiple approaches to it. If we think about the models themselves, there are companies, there are private companies that are doing work with opens source. So, I can think about Mosaic is one that we are closer. So, Mosaic is, it was bought by Databricks a while ago. Basically, they have this hybrid between open source model with commercial license. So, you can I actually use their model commercially, because one of the issues that are with the open source community right now is that models are either open source or commercial, right? So, you cannot do this journey, starting from an open source model and then building something with it, and deciding, "Oh, actually I want to make money with this open source projects, trying to make money with it." Because that's most of them others like Llama, Llama 2 have this limitation. So, there are companies that are trying to that. I think, that they are able to do that, because they have this flexibility of being able to train their models like MPT models are the ones from Mosaic, but there are others, to be able to train their model, fine-tune their model, and work with the LLMs themselves. I think that's one of the parts working with AI. I think there is another that this all the tools and frameworks that are on top of it, like LangChain, LlamaIndex, all these. These are frameworks and tools that are built on top of LLMs. Of course, I think the more democratized LLMs are, the more these tools will thrive, because the problem is not if you build a tool that only works with ChatGPT or OpenAI LLMs, then you are tied to that. So, there is an aid of building like an open ecosystem of LLMs. So, these tools can actually - so you are not tied to a single LLM. You can actually take your agent, for example, that was built on top of one of OpenAI, and move it to, I don't know, MPT or File Kong or any of all that LLMs in the server. There are more parts here on the legal side, that is allowed about legislating on the data itself. So, there is a lot of issues right now with the copyright of the data, the training data, access to it. We are seeing more and more companies closing their data, so data that was open to the public, like Reddit who has one of those. So, Reddit [inaudible 0:24:55] because it was being basically used by a lot of LLMs. So, they now have their full control over the data and who can use it for training. We are seeing that everywhere. So, we are seeing more and more silos, as a reaction of this. So, I think all these are issues that are happening with LLMs and open-source community that are making it more close. It makes harder for open source communities to participate. [0:25:21] JMC: Yes. I agree. In fact, this is a really interesting problem. More for lawyers. I've got an interview with Van Lindberg was an expert on this. But it's a really interesting problem, because by definition, and you just explained it before, LLMs create everything new, right? They are guessing the next word. But it's true that they can reproduce - if the training data of a specific use case, of a specific technology, of a specific implementation of anything. 
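Going back to Pablo's earlier point about not being tied to a single LLM backend, here is a minimal sketch of that abstraction. The backend functions and their outputs are hypothetical stand-ins for whatever OpenAI, MPT, or Falcon client you would actually call; only the "same interface, swappable backend" idea is the point.

```python
# Sketch of the "don't tie yourself to one LLM" idea: put every backend
# behind the same tiny interface so an agent or chain can swap them out.
# The concrete completion functions are hypothetical stand-ins.
from typing import Callable, Dict

def openai_complete(prompt: str) -> str:
    return f"[OpenAI-backed completion for: {prompt!r}]"

def mpt_complete(prompt: str) -> str:
    return f"[MPT-backed completion for: {prompt!r}]"

def falcon_complete(prompt: str) -> str:
    return f"[Falcon-backed completion for: {prompt!r}]"

BACKENDS: Dict[str, Callable[[str], str]] = {
    "openai": openai_complete,
    "mpt": mpt_complete,
    "falcon": falcon_complete,
}

def answer(prompt: str, backend: str = "openai") -> str:
    # The calling code (your agent, your chain) never changes;
    # only this lookup does.
    return BACKENDS[backend](prompt)

print(answer("Summarize our release notes", backend="falcon"))
```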
If the training data of that is very narrow, let's say, that there's just one project out there that has this specific implementation of C++, wherever, it's very likely that the LLM, when asked about that, is going to tell you exactly that, right? That it's going to reproduce in a way. So, that on the one hand. Then you've got Microsoft, and by extension, GitHub, that belongs to them, that are saying, "We will pay the legal costs of anyone challenging your code or the code that GitHub or Copilot has generated, that they claim is a reproduction of their code." In case it's licensed restrictively. So, they are assuming that this will happen. If they say, "We will protect you with our money and so forth." Then, the you've got another camp of LLMs that are saying, "No, we are only training our LLMs with permissive license." So, that code, that training dataset, that doesn't care about their code being reproduced, because it's permissively licensed. Which is a bit of a shame. I can see Stack Overflow also reacting to this because they've got, in the software engineering world, they arguably have one of the best datasets of question and answers pair. So, it is a shame that things are getting closed. But I can also see, again, Stack Overflow or the users of Stack Overflow, the community of Stack Overflow, Reddit whatnot, another saying, "Hey, we've put a lot of value into this a lot of time and I would not like this to be, not to take a piece of the cake from you. I don't know if you have thoughts on that, too. Antonio or Pablo? [0:27:23] JPC: Yes. That's basically the now in the mind of anyone trying to do any project. If anyone wants to build a data set out there, the moment you make it public, it's now fodder for training LLMs, right? So, you are now having the issue of, "Okay, how do I control who can use this data?" As I mentioned, if you ask ChatGPT where did you get this figure, it won't be able to tell you, because that's now lost. That's the issue that even if you want others to be able to use the information, the problem is that they won't be able to trace that back to you. [0:27:57] JMC: By the way, AWS's CodeWhisperer claims that they can connect the - I can't remember the name of feature they call it. But CodeWhisperer is their Copilot, their Cody, whatever their code assistance for software engineering. They claim that they can know - the way this topic came from and the type of license that that snippet of the training dataset is licensed under. So, it's a bit weird, but they claim they can do it. [0:28:23] AVF: That's a good thing and I think that ties to what we are discussing about what the issue here is. If they do that, they probably have some strategy behind it. Right? So, we have trained with only this dataset for this precise language. So therefore, when you are getting a suggestion for that language, or for that problem, we know where it comes from. I'm talking my mind now. Maybe it's not that strategy, but something similar. The thing is, when we talk about revelations is those are the kinds of things that need to be open or regulated under an open license, so that we as the public, know what we are dealing with, because it's not the same as a promise that you need to trust, as a promise that you can go and actually test or reproduce. The issue is not about whether this is led for you to defend your value creation or your code. Of course, that's your time and that's your - you should be able. You should be protected against that. 
The issue is the protection going to mean that only a few companies are going to be able to keep making profit and claim that the protecting you from the castle or glass castle? Or are those revelations going to ensure that the general public is able to already see the process and the development of this very powerful tools. We have a public understanding of what they do and how they protect our rights. Because I think, the main issue here is not whether they are regulated or not, but whether the fees regulation is going to be some common knowledge and they are going to belong to the general open source developers, or whether these LLMs are going to end up being some silo hidden in a bunker tool that a few very powerful entities can develop and manage. I myself seeing that the second scenario, it's very much, it's a huge risk of accountability. The first scenario they claim is at risk of security in the sense of well, anybody could make up information or even go and create harmful things for society. I think that's more like trying to scare the general public into, "Oh, wow. Skynet or any evil robot is going to come and harm me and my family", when the actual real risk is who is in control, this powerful tool, is going to be in control of the next - of the wealth that is going to be created by it. Maybe the public is not even going to be able to held them accountable. So, that's the discussion and that's what was defended in this open source summit. The main message was, they need to regulate it, but they need it to be an open [inaudible 0:31:05], and all that sort of open regulations that, yes, that belongs to you. But I can see what works and yes. [0:31:14] JMC: So, you've mentioned already a couple of specific libraries and platforms that the open-source community contributed to expand and extend the capabilities of LLMs. LangChain being one of them. I think you also - Pablo mentioned LlamaIndex. So, what did these things do? LangChain as an agent. You've already explained that unless you've missed anything or left anything behind. What is the LlamaIndex? Is it another type of library? Another type of platform? How does it expand LLMs? Are there more? [0:31:44] JPC: I mean, there are loads more. But yes - [0:31:48] AVF: Way too many. [0:31:51] JMC: That's the beauty of open source, right? It's creative chaos. You don't know out of where a new project will crop up and just add new capabilities. [0:32:01] JPC: Yes. So, LangChain does more of that. They are only agents, right? Agents is one of the tools, I think they started as kind of aggregator of all the new research development to be able to create a framework to do those implementations. Now, they are a bit more structural, so they have agents, but they have this. I mean, they're called LangChain because they build chains of various - to-chain prompts. As if you were having a conversation with LLM. But the [inaudible 0:32:28] is more programmatic. But they have lots of tools. In the end, it is giving you a framework, so you can do this, what I mentioned about implementing this agent or this tool, but is not tied to a single LLM, because you are not using it specifically. So, LangCain has connections to different implementations of different LLMs. So, you can actually change the backend, let's say. LlamaIndex is one that we have used a lot, because it was one of the first that did an implementation of this approach of using a vector database to be able to build an index on top of an agent, right? 
To be able to store a lot of information there, and then be able to retrieve that information based on how similar it is to what you are trying to ask. So, if you think about how an LLM works, an LLM builds this numerical representation, which basically is a vector. You can actually use vector operations and see, are these two vectors close or are they far? So, you can actually find the pieces of text that are closer to one another, and they tend to represent similar meaning. So, in that sense, you can now use LLMs and their vector representation, these embeddings. You can use these embeddings to search for information. So, let's say that you have, I don't know, a Confluence space, or a wiki, where you are storing all the documentation of your project. Let's say that you loaded all the pages into LlamaIndex. LlamaIndex will build the embeddings for all those documents. Now, once you ask a question - let's say, I mean, what's the cause of this error that I'm having, right? Or what's the solution for this error? It will go and try to find the closest pages in your documentation and then use those to feed the LLM. Because, as we said, the context of the LLM is quite a lot smaller than the huge corpus of data that you have. So, that's where LlamaIndex and other similar tools come in; they allow you to do this retrieval of information. LlamaIndex is one of those. It's one where I have worked a lot, because it was one of the first. But there are a lot of other tools, like say, [inaudible 0:34:44], that is quite big. [0:34:47] AVF: We have [inaudible 0:34:46]. We have SkyPilot. We have Scikit-LLM. There are a lot. We've forgotten a lot of them, but each of those tools covers a different area. They address a different issue, a different problem that eventually you will have to solve if you want to apply one of these powerful tools in the industry. The good thing about the open-source community is that you don't have to go and reinvent the wheel. It's already been done. It's very efficient. You are not going to do it better than another programmer already has. Just leverage that. [0:35:19] JMC: So, it seems like some things come across really clearly. The context window is a very big limiting factor for LLMs to know state, to have state, right? To have memory and so forth. What about the prompting itself? I think you guys, in the talk that you gave, insisted on test-driven development for prompt engineering. Did I capture that correctly? What did you mean by that? What is it about prompting itself, or prompt engineering, or the art of prompting, that is limiting, and how do you suggest it could be less limiting? [0:35:55] AVF: Yes. So, of course, the prompt is, at the end of the day, the instruction you give to your LLM. It affects everything. But that's the thing. These are highly stochastic models, and as such, the answer is never going to be - it's never going to be the same. You can play with different parameters and so on, and try to always get the same answer. But there are a lot of things that are outside of our control. So, even the development of prompts. This was - we first read it from the creators of Promptify, another open-source library, which advocates for that. The idea behind this is you could write a prompt, get the result, and then you're eyeballing your answers, thinking, well, this is good. Maybe if I change this word - then you check the answer again. Maybe it's a little bit better. But you are not being rigorous about that.
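A minimal sketch of the embedding-similarity retrieval Pablo describes, using a crude bag-of-words "embedding" and cosine similarity as stand-ins; this is not LlamaIndex's API, just the idea of finding the closest documents and pasting them into the LLM's context.

```python
# Sketch of embedding-based retrieval: embed the documents, embed the
# question, return the closest documents to stuff into the LLM's context.
# The embedding here is a crude bag-of-words stand-in, not a real model.
import math
from collections import Counter

DOCS = [
    "Error 1042: database connection refused, check the firewall rules",
    "How to deploy the service to staging with the release pipeline",
    "Error 2003: out of memory when loading the embeddings index",
]

def embed(text):
    cleaned = text.lower().replace(":", " ").replace(",", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# The retrieved pages would then be pasted into the prompt as context.
print(retrieve("what is the solution for error 1042 connection refused"))
```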
You make sure that when you develop any code, you follow test-driven development, because you want your solution to meet that quality standard. So, that's what you need to do with prompts as well. You're introducing a stochastic piece into your solution. You need to account for it and write tests - proper integration tests, not only a unit one that checks that it is actually answering and working, but an integration one that checks that the answer is good. Why is that necessary? Well, prompts, by the nature of natural language, are something that can change. But there is also the fact that if you use an API, like most of the models that come from OpenAI, they are being retrained behind the scenes, behind that API, out of your control. So, you need to have something in place that accounts for that. Maybe something enters your system, like it changes the context, or it changes the fine-tuning that you might be doing. Then, that's going to affect the quality of the LLM and definitely mess up your prompt. So you have to have some check in place for that. The techniques are very simple. They are basically sequence matching or text similarity. So, a percentage, of course, is present. But having in place some sort of nice framework, which allows you to see, how good are my answers? What's my quality? That should be what gives you feedback on how good your prompt is. And that's the whole point, right? Your development of that prompt, your prompt engineering, should be test-driven. If you want to know [inaudible 0:38:18] for that, I'd recommend [inaudible 0:38:20] and Promptify. [0:38:22] JMC: I'll finish the conversation - we're very close to the end - with how to get involved in open source contributions, so you can actually highlight those then. But yes, what about the context window? Do you reckon - I mean, this is something that the user cannot work on. The LLMs come with a certain context window, I believe. So, is this true? Do you reckon that getting a wider, deeper context window will be something that future technology will bring? Is this a very limiting factor that companies are working on, or that LLM projects are working on expanding? Or is it not that important? [0:39:01] JPC: Yes. I mean, the context window, for most consumer models, you cannot change that, right? Basically, you need to fine-tune your model, basically retraining it, to be able to increase that window. We did this, for example - when they released the 8K model, alongside the 4K model that they have with GPT-3.5, we had an application where we were using the 4K model. We decided to increase the window because it would be better for us. I think it was like 16K. But the thing is that the results were now completely different from the 4K model. So, it's basically a different model now. The issue is that increasing the window changes the weights that are used in the model; it needs to potentially track more information, so it basically changes the way it works. So, it's not as simple as moving a slider and now you can put in more information. You would need to retrain. It's also a computational issue. The bigger the window, the more computational resources you need. I mean, we have already seen that there is a lack of GPUs and a lack of compute resources due to everyone trying to do things with LLMs. If you try to increase the window a lot, you are going to run out of GPUs at some point, I think.
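A minimal sketch of the test-driven prompt check Antonio describes, using the standard library's sequence matcher as the similarity score. `call_llm`, the reference answer, and the 0.7 threshold are assumptions made up for the example; in practice you would call your real model and tune the threshold per use case.

```python
# Sketch of a test-driven prompt check: run the prompt, compare the answer
# against a reference with a simple similarity score, and fail the build if
# quality drops. call_llm is a stand-in for whatever client you use.
from difflib import SequenceMatcher

def call_llm(prompt: str) -> str:
    # Stand-in for a real API call.
    return "London has a population of about 8.8 million people."

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def test_population_prompt():
    answer = call_llm("In one sentence, what is the population of London?")
    reference = "London has a population of roughly 8.8 million."
    score = similarity(answer, reference)
    # Threshold chosen arbitrarily for the example; tune it per use case.
    assert score > 0.7, f"Prompt quality regressed, similarity={score:.2f}"

test_population_prompt()
print("prompt check passed")
```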
[0:40:22] JMC: Okay, yes. That's true. It's funny that - I mean, the most obvious thing is what you just said at the end, that it will increase the demand for compute in the backend, and that, I get, is a physical limitation. But I didn't expect that increasing the width of the context window from 4K to 16K would actually produce different results, instead of just more accurate, or more precise - [0:40:46] AVF: It depends on your solution, right? So, in our case, we were looking for an exact piece of information inside a text. If the text is [inaudible 0:40:56], you're just making it harder to find the needle in the haystack. So, in this case, this specific case, using LlamaIndex to pinpoint the exact part of the text where the needle is, and having the LLM focus on that, was easier than trying to leave the LLM to do the whole thing. But that's completely dependent on your solution and what you need to do. You have the tools. Some people are going to find it better to have more context, for a summarization task, for example. [0:41:28] JPC: But the other thing is, in the end, when an LLM is trying to give you an answer, it's going through the possibility tree, right? So, 4K tokens has far fewer possibilities than 16K tokens. You have a much broader spectrum of possibilities. For example, something that works very well when working with an LLM is this chain-of-thought process, where you actually tell the LLM, before answering this question - so, for example, let's say that I give a coding task to an LLM, and I ask the LLM, how can I do this with code? Actually, if you ask the LLM, "but before doing that, think about the steps that you would take to solve this task and then provide the code", this gives you a much better result, because you are basically limiting the space in which it gives you a code answer, right? So, you are basically telling the LLM: okay, think first. If you think about the usual search strategies, you can go deep on the tree, or you can go wide, right? What you are doing by going step by step first is that you are limiting the tree, and then you have far fewer possibilities. That's kind of similar to what is happening with this context window. By increasing the context a lot, you are diluting the context, and you have many more possibilities. [0:42:46] JMC: I see. I see. Okay, well, that was brilliant. I will link, in the show notes, the talk in which you go over and deeper into these topics that we've just touched upon in this interview. It's on the Linux Foundation's YouTube channel. Very easy to find, anyway. But yes, you mentioned a few libraries to test your prompts. But in general, how does any individual interested in getting more actively involved in open source LLMs, or the LLM community, get their hands dirty? What would you suggest? What projects? What communities? What technologies do you suggest to someone with this mindset, with this intention? [0:43:30] AVF: I recommend starting with LangChain. It's not only an open-source library. It's more like a marketplace of ideas where people just go and add new things. From there, you can - you're probably just going to bump into any other libraries that you need. [0:43:43] JMC: What about you, Pablo? Any other suggestions? [0:43:46] JPC: Yes. I mean, if you go to our talk, one of the links that we have is called Awesome LLM. There is always an "Awesome" GitHub repo that is basically a list of all the resources and papers that are linked to it. There are a couple of those. So, we have a pile of those.
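Circling back to the chain-of-thought framing Pablo gave a moment ago, here is a small illustration of the two prompt styles. The exact wording is an assumption; only the "list your steps before the code" structure is the point.

```python
# Two ways of asking for the same coding task. The second, chain-of-thought
# style prompt asks the model to lay out its steps before writing code,
# which is the "narrow the tree first" idea described above.
direct_prompt = (
    "Write a Python function that removes duplicate rows from a CSV file."
)

chain_of_thought_prompt = (
    "Before writing any code, think step by step about how you would remove "
    "duplicate rows from a CSV file: how you will read it, how you will detect "
    "duplicates, and how you will write the result. List those steps first, "
    "then provide the Python function."
)
```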
There you have lots of resources to start. You have papers, you have libraries. There is a lot of things that you can start. I would say just something you said, LangChain or any other tool and just start playing with it, and then you will find the space that you think is most interesting to you. [0:44:19] JMC: So, if anyone has questions about this, can they find you on Twitter, Mastodon, Bluesky, LinkedIn, anywhere? [0:44:28] AVF: I think I'm more present in LinkedIn right now. But yes, any doubt, any question, I'd be more than happy to. [0:44:33] JPC: Yes, same for me. I'm also in LinkedIn, so anyone can reach me out there. [0:44:37] JMC: Okay. I'll link those profiles in the show notes too, in the summary of this conversation. Thanks so much. It was a really interesting, it's a fascinating topic, and this was just the beginning of it. Although, this technology or the theory behind it has been, as you said, I mean, the paper, the seminal paper came up in 2017. But again, this AI technology again, the theory of has been laid out for ages. But anyway, it seems that it is exploding now. So, it's a great time to get involved. Thank you both for being in Software Engineering Daily and I look forward to meeting you again. [0:45:11] AVF: Jordi, thank you for having us. [0:45:12] JPC: Thank you for having us. [END]