EPISODE 1615

[INTRODUCTION]

[0:00:00] ANNOUNCER: Vespa is a fully-featured search engine and vector database, and it has integrated ML model inference. The project was open-sourced in 2017 and since then has grown to become a prominent platform for applying AI to big data sets at serving time.

Vespa began as a project to solve Yahoo's use cases in search, recommendation and ad serving. The company made headlines in October when they announced they're spinning VespaAI out of Yahoo as a separate company.

Jon Bratseth is the CEO at Vespa and he joins the show to talk about large language models, retrieval augmented generation, or RAG, vector database engineering and more.

This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:00] SF: Jon, welcome to the show.

[0:01:01] JB: Thank you.

[0:01:03] SF: Yeah. Thanks so much for being here. I'm excited to get into the details about augmenting LLMs today. And actually with somebody who is a real expert in the area and is actually the CEO of an AI company. Not, I don't know the best way to say this, one of the AI charlatans that are out there spieling things online. But before we get there, let's start with some basics. Who are you and what do you do?

[0:01:25] JB: Yeah. My name is Jon Bratseth. I'm the architect of the open-source platform we call vespa.ai. I was the architect of that for 20 years. And now, finally, we have spun it out as a separate company, and I'm now the CEO of that company.

[0:01:41] SF: Amazing. And then how did the company start?

[0:01:43] JB: You probably need to be above 40 to remember. But at one point, there were multiple search engines competing for dominance, right?

[0:01:51] SF: Yeah. Lycos, AltaVista.

[0:01:53] JB: Exactly. Yes. We were one of those here in Trondheim, Norway. We started with – or I wasn't around at that time. But it started in the late 90s. We built a search engine called alltheweb.com. Eventually, in 2002, we were sold to Overture, the guys that invented sponsored search and got completely loaded from that invention. And then they were acquired by Yahoo. And Overture also acquired AltaVista at the same time. We were sort of in the middle of merging those two search engines when we were acquired.

When we went to Yahoo, we were tasked with building a general platform for solving, initially, all their search use cases. We did that and took over those gradually. At that point, Yahoo was a significant fraction of the entire internet and about as chaotic. It was fun for a while.

We eventually took over all they did on search except web search, which they eventually outsourced to Bing. And then we started moving into doing recommendation, personalization, ad serving, all those kinds of things that, if you squint a bit, are kind of like search, right? We did that. That was fun for a while as well. And then we started looking at doing similar things for the rest of the world as well. We started with open-sourcing the platform. We did that in 2017. And now, finally, we have spun out as a separate company around this platform.

[0:03:16] SF: Amazing. Yeah. I think for people – depending on your age, and I'm probably dating myself here, it's like hard to convey how big Yahoo was at one point where it really was like a destination website. And they ate up a lot of those early search engines with acquisitions and so forth.
And then at some point, they sort of transitioned into a mobile company. And I think some of their challenges were around – they really sort of positioned themselves as a media company, not necessarily like a pure sort of engineering company. They saw themselves a little bit differently than, like, the Googles of the world that came in and kind of started to dominate search after Yahoo really established itself.

[0:03:52] JB: Yeah. All that is correct. They used to be really, really big. It's kind of hard to remember now. But –

[0:03:58] SF: Exactly. Yeah. In the moment, sometimes it seems like these giants, there's no way that they could ever sort of disappear from being the giants that they are. But it did happen to Yahoo. It's happened to a number of other companies.

[0:04:10] JB: Yeah. But they never quite go away, right? Yahoo still has 800 million monthly users.

[0:04:17] SF: Yeah. It's like once they get a foothold, it's very hard to disrupt, especially when you have, I think, essentially like network effects for a company. And as they grow, they get more and more sort of lines of business. They don't go away necessarily completely. But they might not be the hottest company in the world to work for like they were at one point.

[0:04:33] JB: Yeah. Exactly.

[0:04:34] SF: What is VespaAI today with this spin-off?

[0:04:38] JB: Yeah. It's a platform for – sort of if you generalize from search, and things like online personalization and so on, you realize that all of these use cases have something in common. It's that you want to look at lots of data and you want to apply AI on that data. And you want to do it online, meaning low latency, typically less than 100 milliseconds, and typically high throughput. Because there are actual users on the internet waiting for that response, or some other system on the internet that is taking actions based on what you get back, or something like that. It's a platform for all these use cases where you combine lots of data, AI and doing something online.

[0:05:18] SF: Okay. I get it. I want to talk about retrieval augmented generation or RAG modeling, which is I think a big topic right now. I'm sure if you search the Googles of the world for RAG, you're probably going to get 100,000 articles back at this point. Just everybody's kind of interested in this topic. But for those that are listening that maybe are not super familiar with this concept, maybe a good place to start would be what is RAG? How do you sort of describe that to somebody as a tool for working with LLMs?

[0:05:49] JB: Yeah, the acronym stands for retrieval augmented generation. It doesn't really tell you that much. But once we started with the large language models, the generation here is the stuff done by the large language model, which generates text. Typically answers in ChatGPT and so on. And they have a lot of internal knowledge. But there's also lots of stuff they don't know.

Sometimes you want to retrieve information that is useful for this large language model to generate the answer or whatever text you want it to generate. Retrieval augmented generation is just retrieving information from somewhere and putting it into the prompt so that you can have your language model come up with a better response.

[0:06:37] SF: And what specific use cases or problems is this solving for people?

[0:06:44] JB: Yeah. One typical use case is you want the model to handle some data it has never seen before. Hopefully, it hasn't seen your emails, for example.
That's one use case. You want it to actually look at your email and find the answer to a question you have about your emails or something like that. And you can have the same kind of problem in a company. Hopefully, all your company information is not on the web, so the large language model hasn't seen it. But you may want it to deal with all your internal documentation, or HR information, or whatever it is and come up with an answer. Then you need that as well.

And then you have time-sensitive information. It doesn't know the latest news or what's happening right now that is relevant to whatever application you're building. That's the third case. And the fourth case is even if the model has the information in there somewhere, it can be very, very hard to elicit that information from the model. In some cases, it's a lot better to just find that information by doing a search and feed it in as part of the prompt.

[0:07:55] SF: Yeah. Essentially, you're providing some potentially up-to-date context for whatever particular problem you're solving. Because, generally, these foundation models are fixed at a specific epoch essentially. And they have training material up to a certain time. And then they're not necessarily continually trained. They're sort of fixed in time. And then this is a way to keep it up to date, but also potentially augment it with specific information for the type of problem that you're solving. Could this also help with essentially like specific areas of interest? Like disambiguation of particular concepts? For example, like the word CRO – or sorry. The acronym CRO means chief revenue officer in certain contexts. But if you're talking about healthcare, it can mean contract research organization. Is it a way to also help with essentially disambiguating domain-specific terms like that?

[0:08:49] JB: Yeah, I wouldn't say that RAG itself is used for that. But that's a problem you quickly run into when you are doing RAG. How do you disambiguate and separate these kinds of things? But if your only problem is that you want to disambiguate to the model, you can just do it like you would with a human. That's one of the neat things about these models. You can just say in parentheses what you mean by some term or whatever.

[0:09:15] SF: Right. You could make it potentially part of the prompt to add that context. It's like in this context or for this prompt, consider CRO to be chief revenue officer or something like that.

[0:09:24] JB: Yeah. Usually, these kinds of dumb solutions work with an LLM just as they would with a person.

[0:09:30] SF: All right. Well, I'm glad I can contribute to the dumb solutions. How does RAG compare to something like fine-tuning? Why would you use one versus the other?

[0:09:38] JB: Yeah. Fine-tuning typically can't add factual information to the model, right? It can just influence its behavior. The common example is you go from a model that just generates text to a model that is sort of biased towards providing helpful answers to questions. Or it can be more specific than that.

Fine-tuning can do those kinds of qualitative, behavioral changes to the model. But it can't really insert new factual information, which is what you typically use RAG for. That's the distinction there, I think.

[0:10:14] SF: And then in terms of how you would actually go about, you know, fine-tuning versus creating a RAG model, can you kind of walk through, I guess, the first – maybe starting with RAG.
What is the process to actually create a RAG architecture versus – and then maybe we can take a look at the fine-tuning and sort of see what the difference is.

[0:10:35] JB: Yeah. When you create the RAG model, it really has three components. There's the language model itself and some physical instantiation of it. Either you use the API and give somebody a lot of money, or you run your own copy and then you need some machines with GPUs to run it.

And then you have the retrieval part, which is really a search engine that has some kind of information that you think may be relevant to the problems at hand. And typically, people host that themselves because it's particular to the problem. But it could also be just a web search. That's also RAG if you use a web search engine to come up with results.

And then you need a third component, which is the thing that orchestrates this and puts these two pieces together. That's typically stateful, kind of similar to a web application that just talks to these two things based on what it's trying to do right now and puts everything together.

[0:11:30] SF: Basically, you're combining standard sort of information retrieval methods to provide this additional context that's going to be fed into the prompt. And then sort of orchestrating both the information retrieval system and the LLM through some sort of orchestration tool or orchestration layer to add this context to the prompt for whatever it is that you're trying to retrieve from the model. Is that right?

[0:11:51] JB: Yeah. Exactly. And to get some names on this, typical language models are things like ChatGPT. Or you can run your own Llama model or whatever. For the retrieval part, it's either something that's traditionally called a search engine, like Google. Or if you host it yourself, maybe based on Elasticsearch or something. Or it can be one of these newer vector databases like Pinecone or something. Or it can be Vespa, which sort of can do both of those things in combination for you. And then for the orchestration part, the most common one is LangChain.

[0:12:28] SF: What value do you get from using, say, a vector database for storing your embeddings versus some of these other methods?

[0:12:37] JB: Yeah. When you do information retrieval, you have the traditional methods, which are text-based. You put in your documents, if it's text-based, and you create an index, which is based on this word being in this document and so on. And typically, you do some kind of normalization so that you understand that cars is almost the same as car and stuff like that. But it's on the text level.

And then a couple of years ago we started getting these new vector methods in addition, where you don't do any of this. But instead, you take your text and create a vector embedding, which technically is just a list of numbers, right? But the list of numbers has a meaning in a sort of semantic vector space. It kind of semantically indicates what a particular piece of text is about.

You take your documents and create vectors from the documents. And you do the same with the queries in the same vector space. And then you can retrieve by the similarity of these vectors. And that's what you get from one of these vector databases.

[0:13:43] SF: And then what does the pipeline look like to essentially transform some of these documents into the vector storage? Essentially, we need to be able to take these documents, convert them into vectors, store them in the vector database. What does that typically look like?

[0:13:58] JB: Yeah. It's really quite simple if you look away from scaling issues and stuff like that. You take your text. And similar to what you do when you generate text, you give that text to a machine-learned model and it spits out a vector for you, a list of numbers. And then you put that list of numbers into your database.
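
To make that concrete, here is a minimal sketch of that kind of embed-and-store pipeline. The sentence-transformers library, the model name and the in-memory "database" are illustrative assumptions, not anything specific to Vespa.

```python
# Minimal embed-and-store sketch. The embedding model and the in-memory
# "database" are illustrative placeholders.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

documents = [
    {"id": "doc1", "text": "Vespa serves big data sets with AI applied online."},
    {"id": "doc2", "text": "RAG feeds retrieved documents into the model's prompt."},
]

# Stand-in for a real vector database: a list of (id, text, vector) records.
index = []
for doc in documents:
    vector = model.encode(doc["text"])  # a list of numbers representing the text
    index.append({"id": doc["id"], "text": doc["text"], "vector": vector})
```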
[0:14:20] SF: What is the difference between using purely, like, a vector index versus using a vector database? What value am I getting using the vector databases above and beyond using something like a vector index?

[0:14:31] JB: Yeah. You can host your vector index yourself, if that's what you mean, using a library or something. If you do it on a small enough scale, that's good enough, at least for a demo. But at some point, you want to scale it higher and make sure that this service is always running and you can feed new vectors all the time and stuff like that. And then you usually need something on top of your basic vector index.

But what you really want to do here – I talked about the text indices and the vector indices. And both have strengths and weaknesses. If you do text indices, then you can get very precise matches. If you're looking for the name of somebody in your mailbox, to use that example again, it will find that precise name. That's the strength of that approach.

While with vectors, you sort of get a kind of blurred thing that represents the concept. And that's a strength sometimes because it allows you to find stuff that describes sort of the same thing but with a different vocabulary, stuff like that. But it's a disadvantage in other cases because it's never very precise.

What you really want to do – and we see that in all these academic competitions where people try to come up with strategies to get the best responses – is to combine both approaches. Have both a text index and a vector index on the same data at the same time and retrieve using both methods. And then look at the information you get back from both sides and combine them into your final opinion of how relevant this document is to that particular query. And typically, vector databases can't do that really well. But Vespa can.
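
As a rough illustration of that hybrid idea (and not Vespa's actual ranking machinery), one naive way to fuse the two sides is a normalized weighted sum of a text score and a vector similarity. The rank_bm25 library, the random stand-in embeddings and the 50/50 weights are all assumptions for the sketch.

```python
# Naive hybrid retrieval sketch: fuse a BM25 text score with cosine
# similarity of embeddings. Library choices and weights are illustrative.
import numpy as np
from rank_bm25 import BM25Okapi

docs = ["vespa serves big data with ai applied online",
        "rag feeds retrieved documents into the prompt"]
doc_vectors = np.random.rand(len(docs), 384)  # stand-in for real embeddings

bm25 = BM25Okapi([d.split() for d in docs])

def hybrid_search(query: str, query_vector: np.ndarray, k: int = 2):
    text_scores = bm25.get_scores(query.split())
    sims = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector))
    # Normalize each signal to [0, 1] before mixing so neither dominates.
    t = (text_scores - text_scores.min()) / (np.ptp(text_scores) or 1.0)
    v = (sims - sims.min()) / (np.ptp(sims) or 1.0)
    fused = 0.5 * t + 0.5 * v
    return sorted(zip(docs, fused), key=lambda pair: -pair[1])[:k]
```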
[0:16:25] SF: Yeah. The vector database is essentially abstracting some of that complexity around combining sort of standard keyword search with this vector similarity search. Because I think my understanding of essentially vector search or vector similarity is you have these points in space and you're kind of trying to figure out what other points are similar or occupy a similar area. And that tells you something about their semantic similarity based on the encoding of the actual vector. But you're not necessarily going to get an exact match. It's more like fuzzy, approximate matches, as you mentioned.

And then on top of that, there are also, I think, challenges from a performance perspective around how you do really high-dimensional vector similarity. Can you talk a little bit about, sort of under the hood, how vector similarity measures are performed? And then are there tradeoffs where you might sacrifice some accuracy in exchange for better performance or vice versa?

[0:17:19] JB: Yeah. Yeah. This is a big topic that gets really, really technical. It's what vector indices, databases and so on are all about. The first thing you need when you want to compare vectors like this is a vector similarity metric. You need to decide how to compare them. Do you compare the angles? Or do you compare the distance between where they are pointing, and stuff like that? And that is really something you need to decide when you train a model that outputs these vectors. Because you can get vectors as a service from things like OpenAI. But they're not that good really. What you really want to do is to train the vector embeddings for solving your particular problem.

And once you have these vectors that are trained to distinguish between what's similar and not similar using a particular distance metric, you can put them in a vector database or vector index. And the sort of naive approach to find the most similar vectors to the vector that represents your query is to just look at all the vectors you have and calculate how similar they are. It's very simple. Brute force.

And that works pretty well up to a couple of tens or hundreds of thousands of vectors. But at some point, it gets really expensive. If you have a good system that can distribute your data over multiple nodes automatically, you can still do it fast enough, but it becomes expensive. What you want to do then is to have a smarter approach. And that's what really has happened over the last 5 years, I would say, that we have gotten these smarter approaches to this problem.

And the most common one is something called HNSW, where you create a graph of vectors, where these points in vector space point to other points that are close in vector space. You start by just picking a vector and then you can walk this graph to find vectors that are similar. And the neat trick about this particular way of doing the graph is that the graph has many levels. So you can walk fast when you are far away and then more slowly when you are closer to where you want to be in the graph. That's what people are typically doing. And that can be really cheap and work great up to billions of vectors. But the disadvantage is you never find all the close neighbors, right?

Our intuitions about spaces only go up to 3D spaces. If you walk into a room, it's very easy to see what's close to you, to determine what is close to you. But once you get to many dimensions, that property of space goes away. And these vectors typically have hundreds or thousands of dimensions.

And then there's a lot of stuff that is close in some dimensions. So you can never find all of it, right? And this can be a problem in some applications. For example, if you are searching your personal data and you're using one of these approximate approaches, then you may miss that particular one email that you're looking for, or asking the model for. This is something that people, in my opinion, don't think enough about: how many of the vectors do you actually retrieve when you're using a particular approach?
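
To ground that brute-force-versus-HNSW tradeoff, here is a small sketch using the hnswlib library; the dimensions, dataset size and the ef parameter are arbitrary illustrative values.

```python
# Exact (brute-force) vs. approximate (HNSW) nearest-neighbor search, and
# measuring how many true neighbors the approximate walk actually finds.
# hnswlib and all parameter values here are illustrative choices.
import numpy as np
import hnswlib

dim, n, k = 128, 100_000, 10
data = np.random.rand(n, dim).astype(np.float32)
query = np.random.rand(dim).astype(np.float32)

# Brute force: score every vector. Exact, but cost grows linearly with n.
exact = np.argsort(-(data @ query))[:k]

# HNSW: walk a multi-level graph of near neighbors. Much cheaper at scale,
# but approximate -- recall depends on parameters such as ef.
index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # higher ef = more accurate but slower queries
labels, _ = index.knn_query(query, k=k)

recall = len(set(exact.tolist()) & set(labels[0].tolist())) / k
print(f"recall@{k}: {recall:.2f}")
```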
[0:20:52] SF: Yeah, I would think that it's also probably, like, a hard thing to test, right? You're going to have these certain corner cases like you mentioned where it can't find the email that you would want based on a specific sort of query and whatever the rules are of walking the graph.

When they're doing the graph walk as well, I'm imagining you're essentially doing, like, some version of a breadth-first search across these different dimensions. But you have to cut that off at some point too, because then you're going to be pulling back too much information. How is that cutoff point set? Is that something that, like, the vector databases are providing as a default? Or is that something that I would have to take into consideration when I'm actually doing one of these similarity searches?

[0:21:34] JB: Yeah. It's something you should take into consideration. Typically, the vector databases set a default. But you can tune it. That's one of the key tuning parameters. If you have a good engine, you can also just tell it to do the brute force version. And it's not that expensive or slow. When you're testing, you can use that to compare against the approximate method that you will actually use in the real production system.

[0:21:59] SF: Yeah. That makes sense. In a lot of projects, any of the sort of machine learning projects that I've done over the years, they've all been small enough that I can basically do, like, a brute force of comparing all vectors. And I didn't need a really sophisticated approach. But obviously, as you start to scale to billions of potential vectors or points in space with lots and lots of different parameters, you're going to need essentially a specialized database or technology that's designed for indexing and walking those graphs and getting back a result in a reasonable amount of time.

You mentioned this idea of sort of figuring out how to tune the vector database or how you're doing similarity search based on the type of problem that you want to solve. Because there's different ways of comparing vectors as you mentioned. You can compare the angles. You can compare magnitudes, distance between the points and so forth. Is that something that really comes down to just, like, experimenting to see what works for your model? So you're kind of just trying different things out to see, "Okay. Well, this particular similarity measurement for whatever reason works really well for my use case. So I'm going to stick with that."

[0:23:01] JB: Yeah. As far as I'm aware, there are no clear-cut answers to this where people say this method is best for this kind of problem or something. I think we are still a bit early in how people are using these kinds of things. People just try different things as far as I can see.

[0:23:16] SF: It's been a while since I was sort of engrossed in the world of, like, academics and machine learning. But back when I was, it was a lot of just sort of experimenting with whatever parameters you had available and trying to get the best sort of precision and recall on the model that you were testing. And there wasn't always necessarily a lot of rhyme or reason as to why a particular configuration worked better than another. There's still a lot of guesswork. And it feels like we're still in a similar place even though it's been a number of years.

[0:23:45] JB: Yeah. All the modern numerical AI is like that. Nobody knows what they're doing. They're just trying lots of stuff, and something like that.

[0:23:53] SF: All right. That's comforting at least for my own self-awareness. Going back to essentially performing inference using a RAG model. Stepping away from some of the nitty-gritty details around the vector similarity search, can you kind of just take me through what is the information exchange that's happening when I prompt the model with some sort of question or I'm asking it to perform some sort of task? What does that inference cycle look like that takes into account the information retrieval piece as well as the LLM piece?

[0:24:25] JB: Yeah. That's a good question. And here, people are also doing lots of different things depending on the application. In some applications, you have explicitly pointed to some kind of data that is relevant. While in others, you sort of try to prompt or fine-tune the model to know about search and care about search, and have your machine-learned model tell you, in this case, it's a good idea to search for these terms. Stuff like that.

Lots of different approaches there as well. But at some point, you figure out you want to do a search as part of generating the next prompt for the model somehow. And then, as part of this orchestration component we talked about, which could be LangChain or something, you generate a query by either taking, as I said, some data from the last output of the model, or from your user interface, or something you provide as context always, or whatever.

And then you do that search and you get results back and you take the N highest of those. Could be one. Could be 100. Whatever. And then you extract the actual text from those, or some of the actual text. This is also a problem, right? You get a lot of text. What pieces of text do you take? And then you stuff that into the prompt, and you usually want to delimit that from the other parts of the prompt so that the model knows these are results, these are the instructions, things like that. Right? And then that becomes your next prompt to your model and you give it to the model and you hope you get something good back. That's basically it.
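
A minimal sketch of that loop: take the question, retrieve the top-N documents, delimit them from the instructions, and send the assembled prompt to the model. The search and generate functions are hypothetical stand-ins for your retrieval backend and LLM API, and the prompt layout is just one common pattern.

```python
# Retrieve-then-prompt sketch. `search` and `generate` are hypothetical
# placeholders for a retrieval backend and an LLM API.
def search(query: str, n: int) -> list[str]:
    """Return the text of the n most relevant documents (placeholder)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Call whatever large language model you use (placeholder)."""
    raise NotImplementedError

def answer(question: str, n: int = 5) -> str:
    passages = search(question, n)
    # Delimit retrieved results from the instructions so the model can tell
    # what is evidence and what it is being asked to do.
    context = "\n\n".join(f"[Document {i + 1}]\n{p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```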
[0:26:06] SF: And then you mentioned there are limits essentially to how much we can kind of stuff into the prompt. How do the token limits come into play when creating a prompt? What are some of the strategies people use to make sure that they're actually providing the right instructions or right contextual information based on all the information they could potentially be stuffing in there?

[0:26:27] JB: Yeah. This all comes down to relevance, right? It's sort of exactly the same problem that we have been working on in search or information retrieval for the last 50 years, which is you get a lot of data, or you have a lot of data available, but you want to find the most relevant among that data. Right? And that's the core search problem that Google has a thousand people working on and –

[0:26:54] SF: At least.

[0:26:55] JB: Yeah. Probably a lot more. And has had for a long time. That's a hard problem for sure. And really, you can improve and go a long way beyond the basic approaches like just comparing vector similarity. That doesn't give you that good relevance. But then you can just keep investing more and get better results back if you have somebody that knows what they're doing, with basically no limit.

What people typically do here, once they get beyond the demo stage, is they don't just do the vector similarity. They also use the text information and do a text search and use signals from both sides, both the closeness of the vector and lots of text signals. Usually, they start with BM25, which is a text similarity metric that somebody came up with in the 70s. But you can go a long way beyond that by looking at more signals, like signals that look at the proximity of the words in the text and stuff like that.

And in addition, something we haven't discussed yet but which is also important is that you typically have metadata about your data. Maybe you know something about the quality, or the source of the data, or how recent the data is. Lots of stuff like that. And you also want to take all of that into account, right?
You end up with this situation where you have lots of signals. The closeness. The text similarity metrics. Various metadata signals and stuff like that. You have lots of signals. And you want to come up with a single number, which is how relevant is this to the particular query? And what do you do when you have lots of signals and you want to come up with one number? Use machine learning.

Here we get another instance of machine learning, with a different set of models than the large language models, that is figuring out the relevancy for you. And this typically also gets just fractally more complex. You may have a cheap machine-learned model that you run on a couple of thousand documents per node of your data. But then you have a very expensive model, typically a deep-learned model, that just looks at the very best candidates. And so, if you do all this, you can get really good relevancy for your results. And therefore, better results when you give that to your large language model.
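
A toy sketch of that "many signals in, one number out" idea with two phases: a cheap hand-weighted score over many candidates, then a more expensive learned model over only the best few. The feature names, weights and the scikit-learn reranker are illustrative assumptions, not how any particular engine does it.

```python
# Two-phase ranking sketch: cheap linear scoring over many candidates,
# expensive learned reranking over the top few. Everything here (features,
# weights, the regressor, the random training data) is illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each candidate: [bm25, vector_closeness, freshness] -- hypothetical signals.
candidates = np.random.rand(2000, 3)

# Phase 1: cheap weighted combination, applied to every candidate.
cheap_scores = candidates @ np.array([0.5, 0.4, 0.1])
top_ids = np.argsort(-cheap_scores)[:50]

# Phase 2: a more expensive learned model, applied only to the best 50.
# In reality this would be trained on relevance judgments; it is fit on
# made-up data here purely to show the shape of the pipeline.
reranker = GradientBoostingRegressor().fit(np.random.rand(500, 3), np.random.rand(500))
final_scores = reranker.predict(candidates[top_ids])
ranked = top_ids[np.argsort(-final_scores)]
```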
[0:29:14] SF: Yeah. I talked to someone from JetBrains recently. And they're building out, like, a coding assistant integrated into IntelliJ. And one of the things they talked about was using traditional machine learning models to figure out what is the right context to essentially feed into the LLM. And that's essentially what you're talking about here.

I think, on the surface, building a RAG model sounds relatively simple. I take some documents. I feed them into a vector database. I've got my embeddings. And then I do a search based on my prompt, pull in the embeddings, add some context, and then I'm off to the races. But there's a lot of nuance I think to doing this well that you're sort of starting to unpack, in terms of areas where you might need to apply traditional machine learning to really differentiate yourself and give really high-quality, relevant results. Do you think it's this sort of nuance, and the hard things that are kind of hidden under the hood of making sure that you're feeding in the right context, that's going to be the differentiator for people in the space today that are applying RAG to solve different types of problems?

[0:30:20] JB: Yeah, I think so. I think it's an important differentiator, at least depending on the problem and how much data you really have that is relevant to your problem. And even if you have these models that can take really long context windows, the research shows that this becomes overwhelming for the model. If there is stuff in there that is not relevant, they easily get distracted, and they also mostly look at stuff that is at the beginning and the end of the prompt. It still matters what you put in there even if you can put in a lot.

[0:30:53] SF: And I would think too that getting this right is not the only problem, but it is one of the keys to essentially moving beyond demo. I was at Snowflake's Snow Day the other day and they made a bunch of announcements around their new Snowflake Cortex, which allows you to do a lot of LLM stuff directly within Snowflake. It looks really cool. And you can create an embedding model really simply. And they showed, in 15 seconds, essentially creating like an embedding model and running that. And it looks great in a demo. But I'm sure that to move beyond this 15-second demo that they packaged up to something that's actually going to serve real users, that's where the sort of hard work comes into play.

Outside of all the things around getting the right context for relevance, what are some of the things that someone needs to be thinking about, or problems that they need to work through, in order to move beyond demo to being essentially, like, ready for production?

[0:31:46] JB: Yeah. Some of the other problems that people run into that we see a lot are, first, scaling. It works well in your demo where you're the only user, or you and some other guys in your company. But what happens if you have 10,000 queries per second or 100,000? Then you need a lot of architecture to make it scale, right?

[0:32:08] SF: Or a big bankroll for OpenAI.

[0:32:10] JB: Yeah. But even then, you have these other components that you need to scale. And you need all of these working together at scale. The other problem there, which is similar, is if you have lots of data. As you mentioned before, if you have a couple of thousand documents, it's kind of trivial. You can just build your index on your local machine whenever you start up. And that's it.

If you have millions of documents or billions of documents, then it becomes a completely different game. You need a distributed system. Feed the data to multiple nodes. Replicate the data in multiple copies over those nodes. And then you run into all the interesting and fun stuff about distributed systems. Some of these nodes will die at random times and you will lose disks and you need to redistribute the data and route around problems and all of that fun stuff. To solve that, you really need a platform that handles all this distributed-systems kind of stuff for you.

And that touches on the second problem, which is high availability. You don't care about that when you're doing a demo. But in a real production system, you need high availability. You need redundancy, which is always fun when you combine it with state, which you do here. You need to be able to upgrade your system while you're running without going down, and lots of stuff like that.

[0:33:30] SF: Yeah. You can't really escape all the sort of standard infrastructure challenges that you run into around performance and scale with any kind of cloud-based system. You're going to run into these distributed system challenges as you scale and try to meet certain performance metrics. How are people also looking at actually, like, testing performance and looking at ways of improving performance? Because repeatability in machine learning and AI is a challenge. And how do you know essentially that you're moving the model, and the responses that you're getting, and the relevance in the right direction as you essentially iterate on it?

[0:34:07] JB: Yeah. That's a tough question where I don't think we have good answers yet. For the relevancy part, it's something the information retrieval community has been working on for 50 years. There we have standard methods for evaluating how good your relevancy is and so on.

We don't really have that, I would say, for large language models at this point. There are lots of evaluation metrics and so on that people are trying. And you have leaderboards on Hugging Face and so on. If you actually look at them, they don't really tell you much about how this performs.

There, I think people that want to build actual practical systems need to come up with their own approach to evaluate in their context. But at least they can separate out the relevancy part and just use the standard approaches there.
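
For the relevancy part, a minimal sketch of the kind of standard evaluation Jon refers to: take a set of queries with known relevant documents and measure, say, recall@k for whatever retrieval setup you're testing. The judged queries and the retrieve function here are hypothetical placeholders.

```python
# Standard retrieval evaluation sketch: recall@k over queries with known
# relevant documents. The judgments and `retrieve` are placeholders for
# your own data and system.
def retrieve(query: str, k: int) -> list[str]:
    """Return the ids of the top-k documents for a query (placeholder)."""
    raise NotImplementedError

judgments = {
    "how do I reset my password": {"doc_17", "doc_42"},
    "expense report policy": {"doc_3"},
}

def recall_at_k(k: int = 10) -> float:
    per_query = []
    for query, relevant in judgments.items():
        found = set(retrieve(query, k))
        per_query.append(len(found & relevant) / len(relevant))
    return sum(per_query) / len(per_query)
```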
[0:34:54] SF: One of the things that I've seen with essentially both the growth around LLMs and generative AI in the last year, but also, I think, the hype cycle that we're in, is there's lots of talk around, like, how quickly we're going to reach artificial general intelligence, AGI. And there are people who are predicting it in the next seven years. If you look at research from, like, the 1950s, it was like Marvin Minsky and Claude Shannon, they also predicted that in, like, the next seven years they were going to have essentially reached general human intelligence. And now we fast-forward 70 years and we're still not there. Do you think – based on all your experience in the space, are we actually getting close to that? I mean, just even in this conversation, there's a lot of problems that people need to solve beyond just essentially hitting an API endpoint to get something that's even relevant to solve specific problems. Do you think that we're going to see that in our lifetime?

[0:35:48] JB: Yes. I actually do. But I was one of the very many people who predicted that we would not get this far at all with the current approach of just predicting the next token. I'm probably not the right person to ask here. But I still think we do need something more than what we're doing currently. But I guess somebody will come up with the right stuff that we are missing.

And I think, probably just by scaling up what we are doing currently, we can get to general AI that is smarter than most people on most mundane tasks at least. They won't be reliable. But lots of people aren't reliable either. We have social mechanisms for dealing with that. And I don't see why they wouldn't work well for these models as well.

I don't think we will get to super-intelligence at all with these approaches. I think there are real fundamental issues with these models. They are computing really shallowly if you look at what they're actually doing. They are computing broadly. They're doing lots of stuff at the same time, but it's all really shallow. You can't really use them to carry out a long multi-step chain of reasoning. And this is what LangChain and these orchestration layers try to achieve. But I don't think that will really work well before we have something that is trained to do this kind of multi-step reasoning. And nobody really knows how to do that yet. But hopefully they'll figure it out.

[0:37:20] SF: Yeah. I mean, I think one advantage that we have right now, maybe in comparison to other points in history, is, one, we have a lot of digitized information that didn't exist even 15 years ago. Essentially, that is the fodder for the training of a lot of these models.

And then, also, I think now the sort of speed of learning or the speed of, like, execution is accelerating really, really fast. That's where you get these kinds of, like, exponential leaps in performance that could potentially lead us to a place where we're seeing AGI in the next seven to 10 years or something like that. I'm not putting a stamp on that. That's not my prediction necessarily. But I can understand why people, experts in the space, might be thinking that, even though we've kind of made predictions like that in the past and been grossly wrong.

But around some of the challenges that exist around, like, super-intelligence, I think that you're right. There's probably something fundamentally different that we need to come up with.
And maybe it's AI and some of the methods that we're using today that we can actually leverage to figure out what that next thing is. We can essentially use these models and these techniques to help inform wherever we're going to go in the future around AI.

[0:38:30] JB: Maybe.

[0:38:31] SF: In terms of RAG, do you think that it's here to stay? Or do you think that there will be other strategies that come out of industry or research that are likely to replace this approach? Is this a Band-Aid for where we are right now?

[0:38:43] JB: No. I think it's a fundamental method that is here to stay. And it works pretty well with people as well. Even though we are still, some of us at least, a bit smarter than these large language models, we still need to search to find information that is not stored in our heads. And that's basically what you're doing with RAG. And so, therefore, I think it will always be here.

[0:39:06] SF: Yeah. It's kind of like external memory essentially. The same way that a human might use external memory. Look something up in a book. Leverage technology to essentially jog a memory or search for something to essentially augment our own understanding or decision-making.

[0:39:22] JB: Yes. One thing that might happen though is – it's kind of funny right now. Because with these large language models, text is the universal interface. You just talk to them and they talk back. And we use the same approach when we do RAG. What we put in and get out is just text-based. But for some reason, people think that when you are looking up some data that you are putting into the model as text, you need to convert it back and forth to vectors first and do a vector search. But you can also just do the standard old-school text search, right? Which can be in many cases a better approach to start with and definitely a lot simpler. Because you don't need to do the embedding and so on.

But where embeddings could be a fundamental augmentation of these models in the future is where, rather than using them to retrieve text and input that text, you use the embeddings to either add in or replace some of the embeddings in some layer inside the model. There's very interesting research on doing that. I'm not a machine learning researcher. I can't really judge these approaches. But it seems it can be an interesting approach that augments the kind of RAG that we have today.

[0:40:42] SF: Interesting. And then as we sort of come up on time and we start to wrap up, is there anything else that you'd like to share in terms of the work that you're doing at Vespa, or in relation to people better understanding what's going on in the world of RAG?

[0:40:55] JB: Yeah. One wrinkle, or whatever we call it, to this that we didn't get into, which I think is kind of important, is that a lot of these use cases with RAG are around some kind of personal data, right? Your documents. What's on your laptop or your mails? Or something like that. And typically then we're dealing with a pretty low number of documents, right? Do you really need these advanced vector indices and stuff like that then? No. You don't.

If you're only searching your personal data, you can stick with the simple approaches. The kind of stuff that you were talking about that you were doing before, right? You only need the distributed systems that handle all this complexity for you once you want to scale this to many, many users. Because then you're back to handling billions or, even in the email case, typically trillions of documents.
And then it's again a very complex distributed system. But still, then, you don't want these vector indices. It's much faster to just store all the vectors on some disk. And when you need that user's vectors, you just read them from disk and do the brute force approach. And that has an additional advantage, which is that it's not approximate. If you use it for cases like the email or your personal documents, you are guaranteed to find the most relevant ones. That's also something Vespa can do for you, which I think is quite interesting.

[0:42:25] SF: Yeah, I think that's a really good point. I think we sometimes, especially with the advancement in different technologies and toolchains, we can get a little bit involved with the new technologies and we want to build out these sorts of complex pipelines and systems because we want to try all these different things. But a lot of times, just sort of looking at the actual problem that we're trying to solve, a simpler method can work for us. We don't necessarily need to go to the most complex toolchain or even the most complex machine learning methods to solve some of these problems.

If you're dealing with thousands of documents, there's no reason why you couldn't just brute force the vector search and actually end up with higher relevance, or essentially be able to create more relevant responses from your LLM, than you could using some of these more complex approaches.

[0:43:12] JB: Yeah. That's exactly right. Do the simplest, stupidest thing first and then only make it more complex if you need to.

[0:43:18] SF: Yeah. And it's going to be probably less expensive to operate that way as well. Well, Jon, thank you so much for being here. This is super interesting. I really enjoyed the conversation and discussion. I think we got into a lot of interesting sort of nuance around actually building RAG models and essentially being able to get to something where you might be able to move this to production. And hopefully, those listening that are kind of exploring the space understand better now kind of all the challenges and nuance that's actually involved with doing this for production systems.

[0:43:48] JB: Yeah, I hope so too. It was a pleasure talking to you, Sean. And lots of interesting discussions for sure.

[0:43:53] SF: All right. Thank you. And cheers.

[0:43:55] JB: Thank you.

[END]