EPISODE 1928 [INTRODUCTION] [0:00:01] ANNOUNCER: Vector search has risen to become a foundational tool in modern search and retrieval systems, including the RAG pipelines that power many AI applications. However, the demands on retrieval systems are growing more sophisticated, which is revealing the limits of relying on a single vector similarity score. Vespa is a popular open-source search and data-serving engine. Central to Vespa's architecture is tensor-based retrieval, an approach that represents data as tensors rather than simple vectors. Tensor-based retrieval enables richer mathematical operations and more flexible ranking functions that can surmount the limitations of a single vector similarity score. Radu Gheorghe is a Software Engineer at Vespa with a background spanning nearly 12 years of consulting and training on Elasticsearch and Solr. In this episode, Radu joins Sean Falconer to discuss why vector similarity alone falls short in production, how tensor-based retrieval generalizes to support richer ranking functions, the trade-offs in chunking and multi-stage re-ranking architectures, and where AI search is headed next. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:32] SF: Radu, welcome to the show. [0:01:33] RG: Hi. Thanks for having me. [0:01:35] SF: Yeah, absolutely. I'm glad you're able to be here. I interviewed Vespa's founder and CEO probably a couple of years ago, so it's great to catch up again on everything that's happening over at Vespa. A lot has changed in the world of AI, and I'm sure in the world of Vespa, over the last couple of years. [0:01:50] RG: Yup. [0:01:52] SF: You've been working in this space for a while. I guess, what's your origin story? How did you end up working in search infrastructure and ultimately get involved at Vespa? [0:02:02] RG: Yeah, this was two jobs ago, working at an antivirus company, and we needed to centralize logs, and that's how we got into Elasticsearch. Then I moved on to a company that was, at the time at least, doing mostly consulting on top of Elasticsearch and Solr. I did that for almost 12 years at that company, consulting, training, that sort of stuff, for Elasticsearch and Solr, and then OpenSearch. [0:02:26] SF: Then what ultimately brought you to Vespa? [0:02:29] RG: Well, I guess mostly curiosity, because of Vespa not being based on Lucene and having different internals, a different distribution model, different tradeoffs that it makes. I talked with a bunch of people at conferences and got more and more curious. Yeah, that's how I got into it. [0:02:48] SF: Yeah. I mean, Vespa as a company, or at least its origins, has been around for a long time. It's been working on search-related problems for over 20 years. Can you share a little bit about some of the origins of the company? What was the original problem they were focused on? And how much of that originating DNA is there today? [0:03:08] RG: I know quite a few things, but only from other people, because I wasn't around, of course. I've only been here for a couple of years. As far as I know, the origins are pre-Yahoo. There used to be a company called FAST, which is a recursive acronym. It comes from Fast Search and Transfer. They were doing web search and search in general. I think there were a few other things, but the idea was large-scale search. Then, through a series of acquisitions, it ended up in Yahoo.
In Yahoo, they were serving lots and lots of use cases. Vespa still does serve lots and lots of use cases within Yahoo. Not sure which of them can be told publicly, but the idea is you have a bunch of verticals that you can serve, some smaller scale, some really huge scale. I think this implies that a lot of the problems that Vespa needed to solve were quite generic, as well as large scale. I think you'll see this in Vespa today. A lot of the solutions we adopt tend to be over-engineered, if you will, because we expect them to be used and then pulled in all sorts of directions. [0:04:18] SF: What do you mean by that, in terms of over-engineering? Can you give an example? [0:04:22] RG: Well, tensors, I think, are a good example of that, because you don't only support vectors and distance functions and all that, you support all sorts of math on top of all sorts of numerical structures. So then, when you come up with a new use case, it's just that much easier to add it, because there are lots of things that are already supported and already thought through to be scalable and fast. [0:04:50] SF: I see. You're talking more about having some first-principles thinking around search that generalizes to all sorts of problems, versus attacking every one as a narrow, brand-new problem that you have to go and engineer a specific solution for. [0:05:04] RG: Yeah. For me, coming from my consulting background, I'm used to solving specific issues. At Vespa, when I look at an issue and I'm like, okay, how do we solve this? People are like, "Wait, wait. Let's make sure we don't bump into something three months later, where we have to do this all over again. Is there something generic that we can do? Will this perform at scale and all that stuff? Does this align with how Vespa is used in general?" and things like that. That for me is a bit of a shift. [0:05:32] SF: Yeah, I think there's always a trade-off when moving from a consultancy, or even a forward-deployed engineer type of role, where you're really trying to help solve specific things for a customer and unblock them, versus being part of core product and R&D, where you're thinking beyond just a singular customer: how do we generalize this thing to maybe all of our customers? [0:05:55] RG: Yup. [0:05:57] SF: Vespa's been fairly vocal in writing about vector search and how it's reaching some of its limits. For some people, maybe that's a bold claim. Vector search and vector databases have been around for quite some time, and they've really gotten a lot of traction, certainly in the last few years. Can you talk a little bit about the core argument behind the dialogue coming from Vespa in terms of vector search reaching its limits? Are there certain things that vectors are good at, and where are things starting to break down? [0:06:25] RG: I think the general idea is that people want good relevance, right? You have a corpus, you're searching in it, you need the most relevant things to surface. For that, vectors are only one - in general, vector distance is one signal, right? You may have N other signals like, is this document recent? Does this document match well on lexical search? Which chunk is more relevant? Do we care about the top chunk? Do we care about the average chunk, or the average of the top 10 chunks, or whatever business rules we may have? In practice, what we see is that a lot of people end up having really complex algorithms for computing what ends up being a relevance score.
Having flexibility around this, I think, is important. I don't think there's anything wrong with vector similarity. I just think that vector similarity in itself is not enough. Even if you look at what are traditionally called vector databases, they add stuff onto it. It's not just vector similarity that they care about. I think it's just a natural trend that when people start to use these things, they care about multiple signals and also how to combine them. [0:07:40] SF: Yeah. If you're only looking at something like vector similarity, what are some of the things that you might end up getting wrong in some of these use cases, or scenarios where you're limiting yourself to the singular signal? [0:07:53] RG: Well, one thing that comes to mind is lexical search. I think there's a lot of, let's say, memes in the search world about BM25, which is, let's say, the most popular algorithm behind lexical search. That BM25 actually performs really well as time goes by. It doesn't seem to die. We had a recent blog post where we benchmarked a lot of embedding models. Most of them, in most of their flavors, would outperform BM25. First of all, actually, let me take a step back. When I say most of those models outperform BM25, I mean, those models off the shelf outperformed BM25 off the shelf. In reality, nobody actually uses that. Most people will tune both their embedding models and their BM25 implementation. But for argument's sake, okay, most of those models would outperform BM25. But hybrid search, so just BM25 combined with all those models, would outperform the models themselves. Lexical search is a signal that I think is here to stay and has been proven over and over again. That is just one example. Another one that comes to mind is if you have long texts - I mean, vector similarity on the whole text becomes quite meaningless, because you can't capture the meaning of a blog post, or a book, in one vector. I think that's where chunking comes in, and then how do you combine chunks and stuff like that, and metadata. Again, in practice, what we see is people end up adding more and more stuff that becomes their compound signal. [0:09:34] SF: Can you talk a little bit about how this vectorization process works? What do you lose when you turn a document into a vector? What are you giving up by using that representation? [0:09:45] RG: Compared to what? [0:09:46] SF: I mean, there's other ways that you could potentially represent a document. You could just store the text of a document, for example, and do some sort of text-based search over it. [0:09:58] RG: Then you do lose things like exact filtering. That's one of the main complaints people have about vector search: where's my threshold? Where's the cutoff between a relevant and an irrelevant document? With lexical search, this is usually quite easy to figure out. A lot of people in the Lucene world use some minimum match. It's like, okay, if you have three words, then all three need to match. But if I have 10, then seven out of 10 is good enough. Then I have a reasonable cutoff. Okay, it's not perfect, but it is somewhat intuitive, explainable, and most of the time it works. With vector search, you don't really know. For something like cosine similarity, you can say, okay, a decent similarity score is 0.7, I'm going to cut it off at that. But what counts as a decent similarity changes when you're running different queries.
In other words, it's very hard to figure out what that cutoff point is and what that cutoff point means. That messes with faceting. If you want to analyze your result set, then what are you looking at? Because with vector search, unless you have this artificial cutoff point, you're going to match everything. [0:11:15] SF: Yeah. I mean, I think that would certainly be true if you're not doing any chunking. The whole point of breaking a large document down into smaller chunks is that you have more tightly coupled, semantically meaningful chunks. If I take an entire book and I turn it into one singular vector, then I'm creating a single point in high-dimensional space that represents this whole book. There's no way I can encapsulate all the meaning of that book and have all the points in space that are similar to it in some reasonable way. I'm going to end up losing a lot of the specifics, I would think. But if I break it down by paragraphs, or sections, or chapters, or whatever it might be, then at least I have more tightly coupled dots in high-dimensional space that are probably going to have a sphere of similarity around them to other dots in that space. They're probably more semantically meaningful. The more data I'm essentially trying to stuff into the vector, the more I'm generalizing, essentially, the ultimate meaning of that thing. Is that fair? [0:12:16] RG: Yeah. I think that's fair. You have a limited number of data points you can store, effectively the dimensionality of your vector. The more meaning you have in something big, the more you're going to compress and the more loss there's going to be. [0:12:31] SF: Yeah, exactly. It's a very lossy format, especially as you're getting more and more text stuffed into the singular vector representation. There's also, if we look at the use of RAG and vector databases over the last couple of years, typically the RAG systems are getting more and more complicated. We have the pipeline where we're chunking these things with different chunking strategies, we're indexing it in a vector database, and then when we're actually retrieving, we're also doing multiple steps, where we're maybe retrieving relevant documents and then using a re-ranking model as well to re-rank the results. In terms of this two-stage architecture where we're decoupling the search from the re-ranking, is that problematic? Are there challenges around not tying those things together and decoupling them? [0:13:21] RG: I think it is problematic in the sense that if you have a lot of data to re-rank, then you're going to have a lot of traffic moving back and forth. That can become a bottleneck. [0:13:30] SF: It's primarily an efficiency problem? [0:13:34] RG: Yeah, and efficiency is really important, because if you're more efficient, then you can essentially afford to do fancier stuff. Right, let's take this re-ranking example. If you have a really good re-ranker that performs really badly, you can only throw a few results at it, because otherwise, you're going to have an unacceptable latency. If your re-ranker is super-efficient, then you can throw all your results at it and you're going to have great results. But this, I think, applies all over the place, right? If you can, in general, have a base relevance function that performs well and does a lot of stuff, then you're going to have a really good baseline to work with.
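To make the efficiency trade-off Radu describes concrete, here is a minimal sketch of the two-stage pattern in Python with numpy: a cheap hybrid score computed over every candidate, and an expensive re-ranker spent only on the top-k survivors. The function names, the alpha blend, and the rerank_fn callback are illustrative assumptions, not Vespa's actual API.

```python
import numpy as np

def first_phase(bm25_scores, query_vec, doc_vecs, alpha=0.5):
    """Cheap hybrid score over all candidates: blend a lexical signal with
    vector similarity. Assumes bm25_scores are pre-normalized to [0, 1]."""
    cos = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return alpha * bm25_scores + (1 - alpha) * cos

def two_stage_search(bm25_scores, query_vec, doc_vecs, rerank_fn, k=100):
    """Score everything cheaply, then spend the expensive re-ranker
    (a cross-encoder, say) on only the top-k candidates."""
    base = first_phase(bm25_scores, query_vec, doc_vecs)
    top_k = np.argsort(-base)[:k]        # cheap stage picks the candidates
    rescored = rerank_fn(top_k)          # expensive stage sees only k docs
    return top_k[np.argsort(-rescored)]  # final order by re-ranker score
```

The efficiency point above shows up directly in the code: the cheaper the first phase and the faster the re-ranker, the larger a k you can afford, and the better the re-ranker's view of the candidate set.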
[0:14:20] SF: I guess, all these problems that we're talking about are perhaps even more amplified in the multi-modal world. It's one thing where we're talking about compressing text into a vector. What happens when we compress images and video and things like that? Do we lose too much in using this lossy format, especially when we're talking about rich media? [0:14:40] RG: I don't know that I have enough experience with things like audio or video, but I know for things like PDFs, it's going to be really hard to put all that information in a single vector, because you can have N pages and all sorts of content on those N pages. Even if you just extract the text, we have the problem that we talked about earlier, which is how we cram a lot of text into a vector. Now, if you have diagrams in it and other things, good luck. [0:15:09] SF: Yeah, and tables, which, I would suspect, even the approach to doing similarity measurements between those things is probably going to be quite different than what you would do for traditional text. [0:15:21] RG: Yeah. What I've seen with PDFs is people doing vectors per page, or rather per patch per page. You have models like ColPali and so on that can do this stuff. [0:15:35] SF: Do they handle tables and images differently though? [0:15:37] RG: They don't, actually. They're pretty generic. You just throw the image of a PDF page at it and they give you a vector per patch. You'd typically have 1,024 patches, so 32 by 32, and you're going to have one vector for each of those patches. It's all very well coordinated. In the end, because you can throw a text query at the same model, so that they live in the same vector space, it can actually figure out whether something's in a table, or on a graph. You can have a graph of, I don't know, energy consumption by month, and you can say, what was the consumption in July? It can highlight that for you. [0:16:16] SF: Okay. I want to get into a little bit of this topic around tensor-based retrieval for the listeners that, perhaps, at this point, know what a vector is. I think, generally, because of everything that's been happening in AI over the last couple of years, people who didn't know what a vector was three, four years ago perhaps know what a vector is today. They might not be super familiar with the concept of a tensor. Can you explain, essentially, what is the difference from vectors to tensors, and why does it matter for search? [0:16:46] RG: Yeah. A vector is a list of numbers, right? The data type can differ; it can be a float. Normally, it's a float natively, but we can quantize it, so basically compress the float into, let's say, a 16-bit float, or an integer, or even a bit. That is a vector. A tensor is a more flexible way to represent numbers. The simplest thing could be to just represent one number. You can have an array, which would be a vector. You can have named dimensions. Like the patches I mentioned earlier for ColPali, you can say we have a patch ID. For each patch ID, we can attach a vector. Now we're going to have a map of vectors. Or you can have a sparse tensor, let's say for personalization, right? I go to a clothing store and I prefer black pants and blue t-shirts and stuff like that. Those could be named dimensions in my tensor. Based on my preferences, I can store numbers. Let's say, a heavy preference would be closer to one. Maybe if I hate something, it should be a negative number.
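As a rough illustration of the shapes Radu just listed - a single number, a dense vector, a map of vectors per patch, and a sparse preference tensor with named dimensions - here is how they might look in plain Python; the field names and sizes are made up for the example:

```python
import numpy as np

scalar = 3.7                              # a tensor holding a single number
vector = np.zeros(128, dtype=np.float32)  # a plain dense vector

# A mapped dimension combined with a dense one: one vector per patch ID,
# as in the ColPali example.
patch_vectors = {f"patch_{i}": np.zeros(128, dtype=np.float32)
                 for i in range(1024)}

# A sparse tensor with named dimensions for personalization: values near 1
# for strong preferences, negative for dislikes.
preferences = {("pants", "black"): 0.9,
               ("tshirt", "blue"): 0.8,
               ("tshirt", "yellow"): -0.6}
```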
I can perform all sorts of math on top of these numerical structures, these tensors, and I can get the results I want. For example, with vectors, we can do the similarity search that we all know and love. We can do personalization, for example, by doing a dot product between my preferences and what a specific item of clothing would be. Or we can do ColPali and we can sum up things. We can do MaxSim. All sorts of things can be done on top of tensors. I'm not sure if that answers your question. [0:18:30] SF: Yeah. Every feature, or thing that you want to describe, needs to map into a numeric representation, right? That could be a vector, it could be a singular value, but some numeric representation. [0:18:40] RG: Right. In the context of tensors, yes. I mean, with Vespa, you can do much more with ranking than just using tensor math. But tensor math is a really flexible way to represent a lot of things and then do those interactions quickly. [0:18:55] SF: Right. By representing things as tensors versus just purely vectors, you have a whole set of tools, essentially, that you can use to perform these different types of searches using tensor math, which you wouldn't be able to support if you were just doing, essentially, cosine measurements between two different vectors. [0:19:14] RG: Correct. Yeah. Also, I think most importantly, we are very - I wouldn't say completely, because nothing is complete, but very future-proof. For example, when ColPali models came in, we could just natively support that, because you can have these patch vectors modeled in a tensor and then you can implement MaxSim using tensor math, and there you go. You have all the MaxSim stuff. You don't need to come up with a whole new feature of how do we deal with this? How do we deal with multiple tensors? How do we combine them in the way that they're supposed to be combined? [0:19:47] SF: This was something that Vespa was already supporting? [0:19:50] RG: Yeah. I mean, not before the model and the technique came into existence. It's not like we were supporting it. But yeah, we were supporting it from day one, because all the plumbing was already there. You just needed to write the correct expression and there you go. [0:20:03] SF: I see. Okay. [0:20:04] RG: Another good example is Bayesian BM25. There's a new technique to normalize BM25 scores, because one of the main problems with BM25 is that you don't have a predictable score that you can use to then combine with other scores. It's ideal if you can normalize it between zero and one, and then you can treat it much more uniformly. When that technique came out, we were like, okay, how do we implement this in Vespa? It turns out, pretty much everything was already there. All the sigmoid calculations we could already do in the rank profile math. This was impressive even for the author. [0:20:40] SF: Can you walk me through what is the process for doing a tensor-based search in Vespa? [0:20:48] RG: The process would depend on exactly what the tensor looks like. Any type of tensor, in fact, you will define in the schema. It's like, okay, this is the shape of the tensor. This is the data type. Then you feed the data, which should match that shape, right? If it's a map of arrays, or whatever that is. Then when you run the query, you typically also have a query tensor. You can construct tensors at query time from the signals that you may have, like, I don't know, chunk similarities; you can construct tensors from that.
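A loose sketch of the three steps just described: declare the tensor's shape, feed documents matching that shape, then pass a query tensor built at query time. The tensor type string follows Vespa's notation, but the surrounding dictionaries are simplified stand-ins rather than real Vespa config or API payloads:

```python
import numpy as np

# Step 1: the tensor's declared shape and type (in a real deployment this
# lives in the schema): a mapped "chunk" dimension of 384-dim vectors.
field_type = "tensor<float>(chunk{}, x[384])"

# Step 2: a fed document whose tensor matches that shape.
document = {
    "id": "doc-1",
    "chunk_embeddings": {f"chunk_{i}": np.random.rand(384).tolist()
                         for i in range(5)},
}

# Step 3: a query carrying its own tensor, built from whatever signals
# are available at query time (here, an embedded query string).
query = {
    "yql": "select * from docs where true",
    "input.query(q)": np.random.rand(384).tolist(),
}
```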
Then you would have something called a rank profile. In the schema, you would say, this is my rank profile. The rank profile expresses how the similarities - how the score of the document - should be computed. Let's say, we do a dot product between two tensors, or we do a similarity between a bunch of vectors, and you can iterate: we can take the average of that similarity, or whatever you want. Take the top N vectors' similarity and average that. Whatever math you can think of should be there, or at least a lot of the relevant things are already there. You can construct your relevance function that way. [0:22:06] SF: How do I know what relevance function to use? [0:22:10] RG: I think that is very much up to you and how you, let's say, tweak your relevance. It would depend on the use case. I think most people would just start from something simple, like lexical search. Then, okay, we can find a decent vector model, like an embedding model that works with my data. Then I can think of, okay, what are other business-relevant signals that I want to incorporate, all sorts of metadata. People typically iterate. I think it's very important to have some golden set that you can evaluate against and see whether my quality is going up or down. That's a very generic approach. [0:22:52] SF: It sounds like there's maybe some additional complexity involved with getting this set up and working. You're trading off some level of technical investment and complexity upfront, but the advantage is that you get better results. [0:23:08] RG: This is valid for any system. I don't think it's particular to tensors. If you add more signals and you want to combine them, that engineering investment you were talking about will happen everywhere. Maybe tensors require a little bit more understanding of some math. Not crazy. I mean, my math stopped at high school and I can still grok it to some extent, so it's not too, too scary, but it's a little bit more than, at least, what I'm used to. [0:23:37] SF: How much is Vespa abstracting away some of that math for you? [0:23:43] RG: There are some helper things. For example, we talked about ColPali. There are aliases. You can just multiply two tensors, for example, X*Y, and then that's going to do a dot product for you. You can also write it the unfurled way, as the full expression. I think more interestingly, we have a bunch of helper, let's say, frameworks, if I can say that. There's something called Tensor Playground, where you can go and click around, and you have some examples, and you can also come up with your own and fiddle with tensors and see what the results are. We also did something the last couple of years: in December, we had this Tensor Advent challenge, where the idea was, okay, let's have some thematic challenges that you have to solve with tensors, like how much Santa has to pack and how far the elves have to travel and stuff like that, that you would just solve with tensors, just to get a feel for that math. Then there's quite a big repository, which is called Sample Apps in the Vespa GitHub, which has lots and lots of examples of use cases. You can see the rank profiles there and you can see the schema. A lot of people will take one of these sample apps and just change it to what they need, and I think that's useful. It's rare that you just start from scratch on a path that nobody went down before.
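As one concrete example of the rank-profile math discussed above, here is the MaxSim (late-interaction) score used with ColPali-style models, sketched with numpy; in Vespa itself this would be expressed as a tensor expression inside the rank profile rather than in Python:

```python
import numpy as np

def maxsim(query_tokens, doc_patches):
    """Late-interaction (MaxSim) score: for each query token embedding,
    take its best dot product against any document patch embedding,
    then sum those per-token maxima.
    query_tokens: (num_tokens, dim); doc_patches: (num_patches, dim)."""
    sims = query_tokens @ doc_patches.T   # all token-vs-patch dot products
    return float(sims.max(axis=1).sum())  # best patch per token, summed

# Toy usage: 8 query tokens against 1,024 page patches, 128 dims each.
score = maxsim(np.random.rand(8, 128), np.random.rand(1024, 128))
```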
[0:25:18] SF: You mentioned this a little bit earlier, this concept of named dimensions. Vespa's tensor framework supports named dimensions, like token, region, timestamp. What does that give you? Why does that design choice matter? [0:25:32] RG: It matters because it's very quick to - let me step back here and try to come up with an example. One of them is, you can have attributes that you care about for ranking. Let's say, you're searching for cars. You may have things like, is this car expensive? Is this car cheap to insure? Does this car use a lot of fuel? Is it new? Whatever. Things that maybe I care about when ranking. Even if you don't have tensors, you can still take those into account when ranking. You can take the mileage, you can take all those dimensions, and you can come up with a formula that takes all those dimensions and comes up with a final dimension, which is the score of my document. But it is quite expensive: assuming that you store this in multiple fields, you need to get the value from all those fields and do whatever math you need to do at some high level. Hopefully, you don't have to bring it all the way to the application, because that's going to be horrible. Even if you do this in some high-level script, like Painless with Elasticsearch, for example, that can be very slow. By contrast, if you have this natively in a tensor, then you can simply take the user's preferences, take those car attributes, and do a dot product, which is super, super fast. This will scale a lot better than taking those attributes manually. I think this is what it gives you, in essence, because it comes back to what we discussed earlier about efficiency allowing you to do fancier things. Because at some point, you will not be able to do things in other search engines, even though the capabilities are there. If, at your scale, they don't make sense, you're not going to use them, right? It doesn't help you that they're there if you can't use them. But with tensors, it's different, because a lot of those tensor operations are super-fast. They will scale, and people do use them at really large scale. [0:27:41] SF: Yeah. I mean, one of the things that seems unique about Vespa around some of this efficiency stuff that you're speaking to, with the tensor computations happening on the content node where the data lives, is the idea of, do you bring the data to the computation, or bring the computation to the data? Data is expensive to move around. If you can bring the computation to the data, then it's going to save you some cost in terms of time moving this data around, which then gives you probably more compute cycles that you can spend on trying to get good results out of the search. [0:28:15] RG: Exactly. I think this comes at two levels. One is the computation that you do on all the documents. I think it's just unfeasible to bring all the documents somewhere outside where the data lives, right? You will have to take some top N. Unless you have a tiny data set, you just cannot afford to take all the data out of the content nodes and into something external. That is one thing. If you can do some tensor computation on that first level, then you're going to save a lot of cycles. But then, the other thing is, you can do the first re-ranking phase - basically, the second phase. The first phase runs on all documents; the second phase runs on the top N. That second phase will run on the content nodes as well. You can bring a more sophisticated model.
It could be a GBM, a tree model, it could be an ONNX model. It's usually not something super big, but it can be complex enough, for example, to handle multiple signals with multiple ranges, such as - you may have similarity, and then, as we discussed, BM25 and recency and all the things that maybe matter to you - and come up with a coherent score. That still happens on the content node, without moving data. Only later, you can maybe move a much smaller set to what we call global re-ranking, which happens on a stateless layer. That can, again, have its own model, maybe a bigger, more complex model, and can also run on a GPU to do the final re-ranking. There are stages to that. [0:29:55] SF: I guess, one of the things I was wondering about, too: there's the concept of RAG with vectors, or even some other search technique like the tensor stuff that we're speaking about, which was very, very popular a couple of years ago. Now, with AI agents, I think there's some dialogue around how relevant this is today. Can you talk a little bit about where some of these concepts fit into the agent world? [0:30:21] RG: Right. For agents, this would be just a search. I don't think they care all that much about what happens under the hood. If what happens under the hood gives them good results quickly, then that is, I think, even more important than it is for humans, because agents would typically run multiple searches. The problems, I think, compound, with latency definitely, or with bad results. [0:30:50] SF: If you're 90% accurate in isolation, and then you do that 10 times, then it's 0.9 to the power of 10 for being successful every time. I don't know what the math works out to exactly, but you're probably down around 35% success with that compounding factor of searches, right? The more you can get the accuracy up on the search in isolation, the more the accuracy goes up in the aggregate as well. [0:31:13] RG: I think the other thing is that models, at least to my knowledge, to this day, aren't as good at figuring out how to filter the context. If you give them bad results, they will tend to hallucinate more, because now they have bad context to base their hallucinations on. [0:31:35] SF: Yeah. Or if it's too much, right? All models degrade in performance the larger the context that you give them, due to context rot. I mean, just based on the way the attention mechanism works, they can only pay attention to so many things. They might pay attention to the thing that you don't want them to pay attention to if you give them bad results. [0:31:52] RG: Yeah. [0:31:53] SF: How does Vespa handle updates to data? If you have a knowledge base that's changing every minute - there's news, there's pricing, there's inventory - how does the indexing or re-indexing of that information work? [0:32:07] RG: To talk in general terms, Vespa is real-time. Meaning, when you make an update, the moment you get your acknowledgement as the application, that thing is searchable. Most engines would be near real-time, meaning there has to be some commit happening. There's always a trade-off. There's no free lunch, right? This is the trade-off that Vespa makes. It assumes that you need your data to be available right now, so you won't have some of the caches that you have with other engines. But the upside is that, for things that are moving quickly, such as pricing for e-commerce, which is a very frequent example, or how much you have in stock, you can change them a lot.
If the data you're changing is an attribute, so effectively, the price or the in-stock thing that is kept in memory, that is super, super quick. This contrasts with other systems where you have a commit, and then you would effectively need to re-index the document in order to change one value in it, which can be prohibitive. But in Vespa, that's the advantage, that you can quickly update things. [0:33:27] SF: How does that technically work? I'm not sure I'm following. If I have a new update, how does Vespa handle that in real time? A continuous flow of new information, how does Vespa handle making that available in real time? [0:33:40] RG: If you have an in-memory attribute, like a price, and you want to change it - I mean, it will be backed by disk, right? You have all the persistence, the write-ahead log, all that stuff. But you send the update, it's changed in memory, and it's also replicated to all the other nodes. When all the other nodes have gotten the update request, you get the acknowledgement at the client. Also, this happens at the operation level. If you want to update, let's say, 10 product prices in one go, the way you typically do this with Vespa is with HTTP/2. You're going to send - we have libraries that do this - you're going to send, effectively, 10 updates individually, and they respond individually. Each of them, the moment it responds, you know it's already flipped in memory, so you see the new price. Every search that runs after that will see the new price, or whatever you updated. [0:34:38] SF: Vespa has been in this search world for a long time, like we've talked about. Over a 20-year journey. What's next for search? If we fast-forward ahead three, five years, what are some of the problems that need to be solved that haven't been solved today? [0:34:54] RG: I don't know, to be honest. There's so much work in the short term that I find it hard to look further out. Because things are moving so quickly, it's hard to tell. I do have a feeling that multi-modal search will become more important. We'll have visual cues here and there that will become more important, depending on the use case. I would think that the ability to explore data in real time will also be increasingly important. I think people, and even agents, are not necessarily happy with seeing top-N results. They may want to know what else is in that result set. That brings up, yet again, the question of what is the result set? Where is that threshold between relevant and irrelevant results? Yeah. I think there are also problems that have been there since before I got into search, which was 15 years ago, and are still not really solved. Like, how do we get a good golden set? How do we measure search effectively? How do we get that feedback loop going? How do we improve performance - not performance in the sense of latency, but relevance - without breaking other things? Yeah. I think if those have been around for more than 15 years, I would assume they'll be around for the next five years as well. [0:36:20] SF: I think the golden data set problem is a huge one, even outside of search, just in AI in general. Whatever AI system I'm building, if I don't have a good data set to essentially test against, how do I know that the investments I'm making are moving in the right direction? I see a lot of companies and projects skip that step, probably because it's hard, but it's really hard to know whether the things that you're doing are actually useful if you don't have any way to test against them.
People skip that step, because there's not an easy way to achieve it, essentially, right now. [0:36:55] RG: Yeah. I feel like it's also a chicken-and-egg problem. Even if you do it, which, as you said, not everyone does. But even if you do it, it's like, how do you know your testing thing is good? How do you make sure that - because that's, I think, the main difference between what we see on the Internet, when people publish, oh, this is the new state-of-the-art model, this is the new state-of-the-art technique, this and that, or academia: they have a golden set. That golden set is the benchmark. The assumption is that the golden set works. But if you're starting your e-commerce shop, or book search website, whatever search use case, and you start from scratch, like, now what? How do you know? [0:37:37] SF: I mean, I think that's the advantage that some of the stuff around coding has, in that companies typically have a history of things that they can build benchmark data sets around. There are issue trackers, there's prior code that engineers have built. There's been, essentially, a history of creating stuff that they can mine for creating these golden test sets. But if you're starting brand new in a new field, where the measurement of what good is is far more subjective than just compiling and running something against a unit test, it's really, really hard to create those data sets. Even if you put the work into creating them, to your point, how do you know whether they're good or not? Radu, thank you so much for being here. It was a great conversation. [0:38:20] RG: You're welcome. Thanks for having me. [END]