EPISODE 1709

[INTRODUCTION]

[0:00:00] ANNOUNCER: DataStax is a generative AI data company that provides tools and services to build AI and other data-intensive applications. Ed Anuff is the Chief Product Officer at DataStax. He joins the show to talk about making Apache Cassandra accessible, adding vector support to DataStax, envisioning the future application stack for AI, and more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:00:40] SF: Ed, welcome to the show.

[0:00:42] ED: Thank you.

[0:00:43] SF: Yeah, thanks so much for being here. I'm glad we were able to pull this off. Maybe we can start with an introduction. Who are you? What do you do?

[0:00:50] ED: Sure. I'm Ed Anuff. I'm the Chief Product Officer at DataStax.

[0:00:55] SF: I think DataStax is a really interesting company. It's been around over 10 years now. As an outsider looking in, I feel like the company itself has managed to reinvent itself, maybe a few times along the way, which makes sense - if you want to stay relevant as a technology company, you've always got to be iterating and moving with the times - but starting with building services on top of Cassandra, then real-time data, and now, I think, a heavy focus on AI. Can you give a little bit of background on the company, and some of the product history there?

[0:01:25] ED: Sure. We are the company that has been the main sponsor of Cassandra, which is the scale-out database, originally created by Facebook, that's used by a lot of folks who are just trying to handle very large amounts of data in real-time. From that perspective, it's made a ton of sense for us to be really keeping track of, or keeping abreast of, where people are using data and figuring out how to make it relevant to whatever sorts of applications they're building. Over time, that's involved a lot more machine learning and AI type use cases.

[0:02:04] SF: Yeah. I mean, data is really the fuel, or as I sometimes say, the love language of AI, so it really starts there. In terms of when Cassandra spun out of Meta and was released as this open-source project, if I'm a company and I want to just run Cassandra on my own, what is hard about doing that on my own, versus going and working with a company like DataStax?

[0:02:26] ED: First of all, many folks do run it on their own. That's the great thing about open source. I'd say, the first thing that becomes really challenging is the fact that it is a distributed database, right? Most of us, in our experiences with databases, are not actually using distributed databases. Whether we're using PostgreSQL, or MySQL, or even, God forbid, Oracle, or something like MongoDB. These are mostly fairly monolithic databases. I get a server and I install the database on it and I run it and everything's good. Cassandra is a distributed database fundamentally at its core. What that means is that as you have more data, or you need to serve more requests, you add more nodes, which is a really powerful thing to have. In fact, you're not going to run an Internet-scale service without that. Operationally, now you're talking about a distributed database, right? Most developers may have had experience in doing distributed compute. Most folks, if they can help it, are using something like Kubernetes, or what have you. In fact, most developers actually sit on top of that and use some platform as a service that handles all that complexity.
That's a pretty big step function for a database. Because of that, particularly for companies that weren't used to doing DevOps, or cloud operations, they really saw a lot of value in going to a company like ours that just packages all that stuff up and gives you this turnkey control plane. In fact, in many ways, a lot of this preceded things like Kubernetes. In fact, over the last few years, we've done a lot of work to make Cassandra play really well with Kubernetes and other types of cloud management tools. If you think about it, when Cassandra came out of Meta, they put the database out there. Think about where you were 10 years ago; this idea of having a farm of servers that were handling state was a new idea for most folks.

[0:04:28] SF: Yeah, absolutely. Even with that analogy around distributed computing and Kubernetes, as you mentioned, most people are probably not managing their own Kubernetes clusters. They're using a managed service on top of that to abstract away some of that complexity. Then you throw in an even newer, harder concept that a lot of people are unfamiliar with, essentially, a distributed database. Then it makes a lot of sense to go with, essentially, someone who can abstract away some of that complexity.

[0:04:56] ED: Again, the way I look at it is, there's plenty of folks who are deep into that. Open-source Cassandra is thriving. We just shipped 5.0. It's great. A lot of options for people. It's been about 15 years since the open sourcing of it, and still, I talk to startups all the time that are like, "Yeah. No, we have a massive data set. We're using Cassandra." It's the only game in town if you don't want to be locked into something from one specific cloud vendor. Healthy community, really robust technology. We're not the only contributors to it. Folks like Apple and Netflix and others are really heavy contributors into keeping this thing state of the art.

[0:05:38] SF: In terms of a company's evolution and adoption of either DataStax, or Cassandra directly, are people building with that level of scale in mind as a scale-up startup? Or do they hit a certain point in maybe the growth of their company where they start to hit scale issues, some of this complexity around scaling maybe a more traditional monolithic database, and then they start to go and look for alternatives that allow them to do this, essentially, in a distributed fashion?

[0:06:07] ED: Sometimes, it depends on what they're doing, but definitely scale is a day-two concern, right? What we've tried to do, and it's been a big part of our product strategy over the last four years now, and I'm really happy with how far we've taken it, is we've gone and said, "Look, let's not make it a tradeoff." Let's not go and say, "Oh, well. Only use this if you've got a lot of scale and otherwise, go use something else." We went and looked at all the things that made it hard to use, because I think that was the historical thing, was that we were really very focused on going and saying, how can we go and handle - you're going to launch a new social network, you should be using this to handle it, right? That was the way we were thinking about it, right? That's great. But then a lot of people would be like, "Well, maybe someday. But right now, I'm just trying to get this thing out the door. I just want to build an MVP, or what have you." What we said was, that's perfectly reasonable.
Let's make this thing as dead simple, as turnkey as possible, for people that just want to do it wherever they are, whatever they're building around. The good news is that when you look at it today, if you have some inkling that you're building something that a lot of people are going to want to use, you don't have to go and say, "Well, I'll cross that bridge when I get there." You can get started. We've got the cloud service. Open source is a lot easier now. All the stuff is in there. On the tradeoffs, you're really not having to make a hard tradeoff. Yeah. I mean, the way you put it was exactly right. That's the way a lot of people - when I would talk to developers, and I've been using Cassandra for a long time now, even myself, where I would go and say, okay, is this a simple app that I'm building, or is this something that I want to be a big deal sometime? You can't always tell up front which is which. The goal right now is just, yeah, we're just as easy to use as anything else out there, and then you won't outgrow it.

[0:07:54] SF: Yeah. I guess, I think, if you look at something like containers and distributed computing through the movement towards Kubernetes, I think it's gone through a similar transformation. It's basically a little bit ahead of the same transformation that's probably happening in data. This might be, more or less, you fast forward five years, 10 years, everybody's building this way from day one, if you plan on building a growth stage startup at all.

[0:08:17] ED: Yeah, I definitely think so. I mean, I think that compute is always a little bit further ahead of data. Scale-out compute now - five years ago, everybody was like, "Well, why do I need Kubernetes?" Or whatever, actually, more than that. Now it's like, "Oh, I've got this thing and I've created a container image and let me throw it into an elastic compute service, or something like that." It's very simple. It doesn't get much easier than that. You also don't worry about scale. The same thing is happening with data. I like to say, we're part of that, which is to say, can we make the scale-out part work in a way that doesn't have too high of an operational complexity? I think a lot of folks are coming from that direction, because ultimately, what you want to do is have this very elastic approach to whatever you're doing, whether it's compute or data. You just need the same underpinnings for both.

[0:09:10] SF: I think that vision of things makes a lot of sense, aligned with the investment now in large language models and the things we're trying to do there, because those - you're talking about massive amounts of data in order to train these things. I guess, what is DataStax doing in the context of AI? Where do you see it sitting in this new emerging LLM stack?

[0:09:31] ED: It fits in a couple of different places. Obviously, we see a lot of folks use Cassandra as a staging database for things like training, and so on. Where we're predominantly being used is as part of these RAG applications. To some degree, that's true of all of the - well, I wouldn't say all of them, but all of the databases that are worth using are seeing that they're playing a pretty important role within RAG applications. Cassandra has the advantage in that a lot of the data that people are serving at their user experience tier is also the same data that they want to use within the RAG applications. We're seeing a lot of that. We've definitely seen it over the last year and a half.
For example, for our cloud service, more than half of the people that sign up for it are doing some form of RAG application that's using vector retrieval, right? You can actually do RAG without vector retrieval, but just more than half of our signups, when I go and look at the data, they're actually using vector retrieval, which is pretty cool. That definitely was not the case a year ago. It just shows how much activity is around this in terms of people trying to build these types of apps.

[0:10:48] SF: Yeah. I mean, I think a year ago, most people didn't know what a vector was.

[0:10:54] ED: Yeah. The timing has been pretty interesting. I mean, I think the year before last, most of us that were following this stuff pretty closely were like, okay, vector embeddings and doing vector-based retrieval out of your data sets, some similarity search, was proving really, really interesting. Now, of course, it took the LLMs, because you need these embedding models to generate embeddings that then allowed you to do a reasonable search. The thing was that even the publicly available embedding models back in '22 were already good enough that you could build a better search engine with vector search. A lot of folks, us included, were already building towards vector search retrieval, even without the idea of the chatbot interface. That's why you had folks like us doing what we were doing. You'd seen HNSW implementations, and you had various folks. It was the reason why you had even these folks like Pinecone and Weaviate, and so on. It wasn't like everybody got up and said, "Hey, I'm going to go and build databases for building RAG applications with large language models." But this idea of doing a vector-based retrieval of data in the database using embeddings was already something people were thinking a lot about. In many ways, when RAG appeared, that was more a matter of adapting that work. Like I said, vector search was something that had been going on for a few years - a lot longer than that, really, but in practice for a couple of years at database companies, and so on, like '21 and '22. That was always a big part of our roadmap. I'm sure it was the same for other folks. That's why, as we got into '23, you saw that in addition to the people who only did vector retrieval, you had folks like us come out very, very early in the year with our vector retrieval stuff. Over the course of the year, you had others, like Mongo and OpenSearch, following. It's just because a lot of this activity was already underway with that convergence.

[0:13:15] SF: Yeah, it makes sense. I mean, there are now, I think, at least 60 different databases that offer some form of vector support, from the specialized vector databases to your PostgreSQLs of the world, and so on.

[0:13:27] ED: I would hope so. I'd hate it if they didn't. The funny thing is there actually are some that - like, I marvel at that. But I have seen a couple of these databases that just recently, like in the last three, six months since the start of the year, added vector as if it was this new thing. I was like, "Well, I'm glad to see that." I can't imagine having a database that people would want to use that didn't have a really strong vector-native data type. That's one of those things you would feel. Now, that said, it's not the only way to retrieve data. If you look at what a lot of people are doing with RAG, they're combining vector with additional terms and other columns and stuff, and we could talk more about that.
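[Editor's note: a minimal sketch of the embedding-plus-similarity-search idea Ed describes, in Python. The sentence-transformers model, the toy documents, and the in-memory search are illustrative assumptions, not how any particular database implements it.]

```python
# Minimal sketch of vector-based retrieval with embeddings (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model would do

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice of embedding model

documents = [
    "Juicy smash burgers and crispy fries in SoMa.",
    "Quiet ramen shop near the Embarcadero.",
    "Neapolitan pizza from a wood-fired oven in the Mission.",
]

# Embed every document once; normalized vectors make cosine similarity a dot product.
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, k: int = 2):
    """Return the k documents whose embeddings are most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q              # cosine similarity against every document
    top = np.argsort(-scores)[:k]         # indices of the best-scoring documents
    return [(documents[i], float(scores[i])) for i in top]

print(search("Where can I get a really great burger?"))
```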
Certainly, it's a foundational capability that you need to have.

[0:14:05] SF: Yeah. I mean, it's going to be the standard, basically. Just like you support a varchar or a datetime field. One of the things that you said a few minutes ago that I want to go back to was, I think you were talking about how you saw this pattern, essentially, with the people using Cassandra, where they maybe are using that as, essentially, their transactional database for the application. Then the data that they're collecting, they also want to leverage for LLMs, and other, probably AI applications. Rather than thinking about those two worlds as completely separate things and having a completely different data pipeline, if you have one platform that can manage all that, then you can probably keep the data where it is, basically, and it's an easier transition. Is that essentially the idea based on that pattern that you're seeing from these folks?

[0:14:50] ED: Yeah. I mean, I'd like to imagine that everybody's seeing what we're seeing. Hopefully, we're seeing it a little more clearly than others. That's the usual thing about infrastructure and architecture; you want to make sure you're getting the patterns right. We're not trying to go and say, you should build it this way and the rest of the world is over here. The thing that we're seeing is that not everybody is doing RAG the same way right now. What I mean by that is that a lot of people are doing very first-generation RAG stuff. What that largely consists of is - and a lot of folks have a shorthand for that. They just call it chat with PDF, PDF documents. That's basically your whole world of building a RAG application, and it's not a bad place to start. If you're a developer and you're trying to figure out, like, what's this RAG thing all about, that's what I would recommend, because you probably have a whole bunch of PDFs that you have, or that your company has, or whatever. You just throw a bunch of them in a folder and you use some off-the-shelf stuff to go and read those in and chunk them, because you've got to turn them into chunks of data that will, one, fit within the context window of the LLM, but two, will represent a discrete semantic conceptual block, meaning that you want all of the text that you're grabbing to ideally be like a paragraph, because when you reduce that down to a vector, you want that vector to encode the concept that best describes that chunk of text. When you talk to people who are doing RAG, you hear this chunking thing a lot, and that's, frankly, a lot of trial and error data engineering, which is, I'm taking my source data and turning it into these chunks. That's great. People can spend - we talk to businesses and stuff, and they can spend six months learning the tools of the trade just to get that basic chunking and knowledge retrieval. When you're done, what you have is, you basically have a better search engine, and you can put a chatbot UX around it. By the way, in some circumstances, that's extremely cool. What we're finding is that in practice, you want to do a little bit more than that. What you want to do is move to the second generation of what you do with RAG. That starts to involve multiple sources of data beyond just, let me chunk up this knowledge content.

[0:17:24] SF: I think, too, back to the chunking thing, even that gets really complicated, especially if you have a table, like how do you chunk the table in a meaningful way?
I mean, and there's companies that basically are structured, and a handful of others that -

[0:17:37] ED: Yeah. Structure is awesome. LlamaIndex has been focusing a lot on that particular problem, a whole bunch of other folks. Like I said, that's one where, in fact, I think for people that are doing RAG, the chunking problem ends up being the most unfamiliar piece, just new to what they're doing. You lose a lot of time in just learning, how do I do chunking for RAG? The thing about RAG is most people have this idea where they use the prompt and the context interchangeably, right? We generally have this mental model, which is, I've got my LLM, and I put this input into the LLM, and that's typically called the prompt, although properly speaking, the prompt is a component of the context. You put that into the LLM and you get a response. What RAG says is, I can take that question - maybe I actually use that question itself that I go and ask, like, "What's the best place to go to lunch?" I take that, and before I give it to the LLM, I actually do a separate thing, which is I take that question that you asked, Ed, and I turn it into a vector, and I do a query against a known set of data. That known set of data is those chunks that we just talked about, right? Now, maybe I have all of these restaurants, and I've got restaurant descriptions, and I've taken every one of those restaurant descriptions, and I have turned it into a vector. I've taken that question that I asked, which is, where should I go to lunch if I want a really great burger? I've turned that into a vector, and I run through the database and I compare those two vectors and I get the ones that are closest to it. Now, what I have is a set of relevant data that maybe is this set of great restaurants here in San Francisco, where I've gone and done that vector comparison. That's a similarity search. Now I have this set of data, before I even get to GPT-4, for example. Here's this set of stuff. Further, when I go and do that retrieval, because I'm doing it off of a vector database, like DataStax, that allows me to combine that vector similarity with maybe that geo - with a geo query. It's not just like, hey, what's a great place to get a great burger? But within half a mile of where I am now. I have that set of data, and I've got the original question. Now, what I do is I take all of those and I put them into a context. That's what I go and give to the main LLM, what I give to GPT-4. With an additional thing, which is my system prompt, which goes and says, "Hey, the user asked this question, and when you prepare the response, use his question, as well as this set of data that we retrieved, that we've pre-retrieved and we know is relevant to his question." Now, the LLM is going to prepare that response. What it's going to do is it's going to draw really heavily on the content that we supplied it in that context, right? Then, the LLM knows everything it's been trained on, but it also has this specific data that is now personalized, because remember, we did that geo retrieval, and it's also, perhaps, some special content that's not on the web. Maybe we've got this database of restaurant reviews that we could draw on. So now, the response that the LLM gives back is grounded, because this data came from the database, and it's personalized. That's where the retrieval augmentation of RAG comes from, right? That pre-retrieval that I did, I'm now using to augment that generation. That's what RAG is all about.
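[Editor's note: a compressed sketch of the retrieve-then-generate loop Ed walks through above, chunking included. Everything here is illustrative: the naive paragraph chunker, and the embed(), store, and llm() helpers are hypothetical stand-ins for whatever embedding model, database, and model API you use; the point is the shape of the flow, not any particular product.]

```python
# Sketch of a first-pass RAG loop (retrieve, then generate). All helpers are
# hypothetical stand-ins; swap in your own embedding model, vector store, and LLM client.

def chunk(text: str) -> list[str]:
    """Naive chunking: one paragraph per chunk, so each vector encodes roughly one concept."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def ingest(documents: list[str], store, embed) -> None:
    """Embed each chunk and write it to the store alongside its text."""
    for doc in documents:
        for piece in chunk(doc):
            store.insert({"text": piece, "vector": embed(piece)})

def answer(question: str, store, embed, llm, user_location=None) -> str:
    # 1. Turn the question itself into a vector.
    q_vec = embed(question)

    # 2. Retrieve the closest chunks; optionally combine similarity with a
    #    metadata filter (the hybrid part, e.g. restaurants near the user).
    hits = store.similarity_search(q_vec, limit=5, filter={"near": user_location})

    # 3. Build the context: system prompt, the retrieved chunks, and the question.
    context = (
        "Answer the user's question using the retrieved passages below.\n\n"
        + "\n---\n".join(hit["text"] for hit in hits)
        + f"\n\nQuestion: {question}"
    )

    # 4. The generation is grounded in the retrieved, personalized data.
    return llm(context)
```

The "second generation" Ed describes next is mostly about step 2: pulling in more sources, filters, and personalization, not just chunked PDFs.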
A lot of people are just doing that very first-generation stuff, which is like, let me just return some stuff from the PDF. But those additional pieces that I threw in there, like my geographic location, or maybe my personalization, right? We have people that are building recommendations for vacations, for travel. You start to do that. Now, you're getting something that isn't just going to the memory of the model, but it actually is now really knowledgeable, or at least the responses that it's generating are really knowledgeable about a specific domain. A lot of that comes from pulling in more data than just those - the chunk data is important. That's the knowledge piece, but you can add personalization and you can add a lot more data from other sources. That's what we're seeing the more advanced folks doing. I think you go a year from now, and that's where the majority of folks are going to be, because doing a thin layer just on top of the model, like retrieving some static content, is interesting, but it's not super compelling.

[0:22:39] SF: The advantage there as well, with some of the things that you mentioned, is it's up to date, where a foundation model is basically fixed at a certain epoch. If you want to do anything like this in an e-commerce retail use case where products are changing all the time, then you need to, essentially, combine traditional database type of search, so I know that the products are actually in stock, available for my location, all that type of stuff, along with being able to combine that with similarity search, so that if I'm looking for blue shirts, but maybe there aren't blue shirts available, I can get the next best thing, which is a purple shirt, or something. I want to have that level of flexibility in the results that I'm feeding into the LLM.

[0:23:18] ED: Yup. That's a little bit of why we talk about a lot of this stuff, because a lot of companies use Cassandra for that. It just ends up being one of the types of things that Cassandra is really good for: let me have all this stuff, like all my products, but with inventory and what's close by. For example, when you go to Home Depot and everything in the app and it tells you that there's 25 of this thing in aisle H7, that's all a Cassandra query.

[0:23:46] SF: Is Astra DB, essentially, the vector version of -

[0:23:48] ED: Astra DB. Yeah. Astra DB is our cloud service. What we've done is we've taken all of these things, and we've made them really as easy as possible to use. It's vector-based. We have vectors as a native data type. We built that at a very low level within Cassandra, so you have a lot of performance. We, in fact, open sourced that as something called JVector. We're really happy with that. Astra makes it very turnkey to use. You can just sign up for the cloud service. You can have your database ready to go in 10 seconds. All the good things you would expect. But behind the scenes, you have all of this power and scale.

[0:24:25] SF: What was involved with introducing that vector type? What was the scale of that project in order to be able to extend Cassandra in that way?

[0:24:34] ED: A couple of things. First, we had been doing a lot of work for a number of years in improving the indexing system of Cassandra. The reason why is because we are a NoSQL database. With a NoSQL database, when you think about, how do I find data in a database? That's really one of two ways.
One is the relational model, meaning that, like, oh, I have a customer ID and I've got the customer record and I want to find all their orders. I go from here, look up the customer, get the customer ID. I just do that as a relational query, right? The other option is you have a very robust indexing system. That's the approach that NoSQL databases, like Cassandra, Mongo and others, have taken, which is to go and say, "Hey, we can have a much more performant database with a much more developer-friendly data model." But what we're going to do is we're going to give you a more powerful indexing system to make that possible. We've been rebuilding the indexing system really for the last five years, this major effort in the project. As part of that, we made the whole data type system extensible. What this allowed us to do was very quickly go - and so, this was early last year, we wanted to add vector capabilities. What we did was we did the same thing, by the way, that most of the other vector databases did, which was our first generation was using the HNSW, hierarchical navigable small world, vector similarity search that originally was created - at least the vector comparators and such were created - as part of the Lucene project. That was the same thing that pretty much all the other databases did, whether they were so-called "pure-play vector databases," or others. The first generation that we put out there was based on that, which, by the way, was a good answer, but what we found was, it wasn't a great answer. We followed that up in fairly short order, relatively speaking, but I think it was - so, we did that first iteration. We had that in March of last year. Then around September of last year, we switched to JVector, which is our implementation of DiskANN. DiskANN, the advantage in this, as the name would imply, is it's designed to do much better where you're actually doing this storage-based usage of indexes. There were a lot of things within the HNSW implementation that were just assuming that things were all in memory. We needed to be able to do a scale-out implementation. That's why we adapted that. We released that as a standalone open-source project. That was done by Jonathan Ellis, who was one of the main founders of the Apache Cassandra project. He was a co-founder of DataStax. He personally worked on that. This was a pretty big deal. This release of JVector was one of the trending projects on GitHub for a long time. It's being adopted by OpenSearch now and others. This is what gave us, really, the performance and the relevancy and recall that we were looking for. Now, we're actually in the process right now of taking that and layering something on top of it called ColBERT, which actually does even better retrieval. I mean, it's just been an ongoing thing for us. You're going to see continuous iteration on this stuff. Vector-based retrieval has been a thing within databases for a while now. People have been talking about this stuff since the nineties probably, at least as long as I've been directly involved in this stuff. I'm sure it goes way back before that. Just major, major activity as a primary data type. You're going to continue to see that for some time to come. Now, one of the things that has been really important, and that has been an area of some debate, is do you just use a vector database, or do you need what's called a hybrid database? We obviously think you need a hybrid database.
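[Editor's note: for a sense of what "vectors as a native data type" looks like in practice, here's a rough sketch against Cassandra 5.0-style CQL using the Python driver. The keyspace, table, column names, and dimensions are invented, and exact syntax and driver support vary by version, so treat this as an illustration of the shape rather than a copy-paste recipe.]

```python
# Rough sketch: a table with a native vector column, indexed for ANN search,
# queried through the DataStax Python driver. Names and details are invented.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_keyspace")

session.execute("""
    CREATE TABLE IF NOT EXISTS restaurants (
        id uuid PRIMARY KEY,
        name text,
        city text,
        description text,
        embedding vector<float, 1536>   -- the native vector data type
    )
""")

# Storage-attached indexes make both the scalar column and the vector column searchable.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS restaurants_city_idx
    ON restaurants (city) USING 'StorageAttachedIndex'
""")
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS restaurants_embedding_idx
    ON restaurants (embedding) USING 'StorageAttachedIndex'
""")

# "Hybrid" retrieval: an ordinary predicate on city combined with
# approximate-nearest-neighbor ordering on the embedding.
query_vector = [0.0] * 1536  # in practice, the embedded user question
rows = session.execute(
    """
    SELECT name, description FROM restaurants
    WHERE city = %s
    ORDER BY embedding ANN OF %s
    LIMIT 5
    """,
    ("San Francisco", query_vector),
)
for row in rows:
    print(row.name, "-", row.description)
```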
The idea is that, going back to my example of the geo query, I want to be able to go and say, "Here's this question asked, give me all the restaurants that have a great burger," and I do that vector query around it. I also want to say, and within half a mile of this latitude-longitude coordinate, right? I want to combine those in a single query. Now, by the way, in the example I just happened to give you, it's interesting, that second thing is also a vector query - it happens to be a two-dimensional vector query rather than a 1,500, or 1,536-coordinate vector query. You're actually doing two different vector comparisons and merging those results. That's the type of thing that you want your database to be able to do. At a certain point, this idea of, is it a vector database, or is it a hybrid database - which, I'm not sure is the best term, but that's what people are calling it - meaning, to do traditional database results and vector database results, I think most people will lean to this idea that I'd rather have more of this data in these records and be able to query by multiple, essentially, multiple criteria.

[0:29:40] SF: Yeah. I mean, I think there's certain use cases, like some of the ones that we spoke about, that are pretty difficult to solve for unless you have this hybrid approach, because sometimes you need an exact match, or a bounding box match, not necessarily the nearest neighbor type of search. In terms of the HNSW versus JVector, what were the limitations of HNSW that you ran into and how did you, essentially, move beyond that? What is the advantage that you're getting using JVector?

[0:30:09] ED: Well, I won't do full justice to this, but we started to see the breakdown on larger data sets. What we found was that everybody did and does a reasonable job when you're dealing with small data sets, to the extent that a lot of folks you could read on Hacker News would be like, "Why are you even using a database? I'm doing all my vector retrieval in memory," right? Which was true. It's like, if you had a couple of thousand vectors and you were just trying to do a very fast similarity search on them, you could do it in memory, and you'd just be in a much better place. What we started to find was that people would come in and they'd start off and they'd get really great results, and then they would start to load much larger data sets. What we found was that very quickly, at least within HNSW - and again, your mileage may vary around this, but at least in the stuff that we were seeing, and this was mid last year - you'd hit a wall that would directly start to impact the relevancy of your retrievals. Not to mention, of course, there were performance issues that were also pretty significant. That's why we started looking at DiskANN and doing an implementation around that. It worked out very, very well for us. I'm pretty sure at this point that most of the databases that are fairly serious about the vector use case have moved beyond the original HNSW approach. Again, it was the starting point for everybody. I don't want to name names, but everybody in the industry was doing some variant of that, because it was the main public domain - not public domain, but open-source - reference implementation that people would look to. That was your first-order implementation of it. Now, that's where most people are.
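[Editor's note: one way to see the "relevancy wall" Ed describes is to measure recall directly - run the same queries through exact brute-force nearest neighbors and through your ANN index, and compare the overlap. A small sketch, where ann_search is a hypothetical stand-in for whatever index is under test, and vectors are assumed normalized so dot product equals cosine similarity.]

```python
# Recall@k sketch: compare an ANN index's results against exact nearest neighbors.
# `ann_search(query_vec, k)` is a hypothetical stand-in returning ids from the index under test.
import numpy as np

def exact_top_k(query_vec: np.ndarray, vectors: np.ndarray, k: int) -> set[int]:
    """Ground truth: brute-force similarity over every stored (normalized) vector."""
    scores = vectors @ query_vec
    return set(np.argsort(-scores)[:k].tolist())

def recall_at_k(queries: np.ndarray, vectors: np.ndarray, ann_search, k: int = 10) -> float:
    """Fraction of the true top-k neighbors that the ANN index actually returned."""
    hits, total = 0, 0
    for q in queries:
        truth = exact_top_k(q, vectors, k)
        approx = set(ann_search(q, k))   # ids returned by the approximate index
        hits += len(truth & approx)
        total += k
    return hits / total
```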
I forget the name of the one that - Google's using a slightly different one now in their databases. For example, I mean, I think Microsoft has moved to DiskANN as well for their stuff. Then, our JVector implementation is just - we said, look, from an open-source standpoint, people look for high-quality reference implementations that we wanted to - for one thing, we wanted to make sure that open-source Cassandra was - I mean, again, yes, we want people using our cloud version, but we wanted people going and saying, "Oh, which [inaudible 0:32:36] should I use?" We wanted Cassandra to have the reputation for being a really great vector database. That necessitated us open sourcing that implementation anyway. We chose to package it up in a way where we said, let's make it easier for other projects that are struggling with this setup.

[0:32:53] SF: In terms of when people start to run into performance issues, is that typically down to how they're building the vector indices, or is it related to what they're using potentially for, essentially, vector similarity comparison? Or maybe it's a combination of these types of things. What is that, essentially, bottleneck that people run into?

[0:33:13] ED: That's a good question. Let's split it out a little bit. Right now, I think that there's definitely a performance continuum that people see in terms of just their data set size. That is a problem. Most users experience that in terms of cost, because that's, at this point, how I'd say most of the - well, let me explain why that is. Depending on your implementation, it's going to translate into your compute cost, which, since the majority of folks doing this stuff are using some form of cloud database, what that means is that different size data sets and different vector implementations are going to show up in compute. I'm not even talking GPU. I'm just talking about that. Because most of the databases are not using the GPU for the vector comparison. They're using it in different places, for embedding generation and such. What ends up happening is that different implementations have different levels of compute intensity, and obviously, the cost gets passed to you as a developer, right? Like, your cost per query. What will end up happening is, I've got a very large data set of vectors, and I throw it into three different databases, and the bill ends up coming back very differently. That's your first problem. I'm not trying to do a commercial. My recommendation would be, try these things out and see what happens, and you're going to see a different result. I think we will do pretty good in that, right? That's why we have a business, because we do pretty good in the comparison. My recommendation would be, you as a developer building these things, you should test these things out. The second piece, though, let's assume you get within your cost envelope of what you want to see. What you start to see is, particularly with the size of the data set, you start to see that your relevancy ends up falling off a cliff. Because what ends up happening is, and I was alluding to this earlier, the different algorithms will do a better, or worse, job with matching what you're doing to other clusters of close-by vectors. What will end up happening is, I'll ask that question, and I may get a bunch of - I won't really get the ones that are the closest. Or if I do - because typically, I'm asking for five at a time or something.
Maybe there'll be something better that's not in that set that I got. There's different ways to score that, and you've got your F1 scoring and such that people use, because you've got your relevancy and your recall - like, how relevant is it, and then, did it just work, did some elements that should have been in there just not get recalled? You're going to start to see those differences. Then the knobs for tuning those may have different cost considerations at the database tier. Now, the other knobs that you have to turn around that tend to be two other things. One is how effective was your chunking strategy, because a lot of this is garbage in, garbage out. If I do very naive chunking, I'm also going to have really crappy relevancy, right? Because if that chunk has more than a very small number of central concepts, then maybe I don't get the right piece, right? Then the second thing that will also screw me up - or won't necessarily screw me up, but can have a lot of variability - ends up being what embedding model I use, right? Those end up being different factors. All of those - so, going back to the original question about performance. Your performance becomes one issue. How well did it perform? That typically is like, how well did it perform at a certain cost level? In terms of just, let's say, raw, wall-clock performance, queries per second against a certain data set, right? Then, what you find very quickly within these projects is it's not just that. It's like, and then how good were the results? I did this thing, and did I get crap results back, or was it like, "Wow, it did a great job and really nailed it in its response"? Those two end up being related to each other to some degree. Then, of course, there's the third piece, which is whatever I do there, I throw into this language model, which can also have a lot of variability in what it does with it. When we talk about performance - and I went a little long-winded there - this is what trips people up on these projects, is that performance is multi-dimensional.

[0:37:43] SF: Right. Yeah. In terms of, given that Cassandra is a distributed database, how does, essentially, the sharding work in the world of vectors? What is the partition key, and how do you figure out how to group similar vectors in a way that's going to satisfy queries without, essentially, having to visit multiple nodes?

[0:38:02] ED: You're essentially creating a hyperring. I mean, you've got - each individual node ends up having its own vector space, and you've got that, and you still end up having to go and do a lot of merging and stuff at the coordinator node that handles the query. It's a variation of the same problems you have with distributed queries, but it does get a little bit more complicated. We actually will, in a future version, be doing vector-aware partitioning. Right now, we don't, and it's not a problem, but it's an opportunity to actually do something that's a lot more interesting.

[0:38:36] SF: Yeah, especially as the requirements around vector storage scale. You're talking trillions of vectors, or something like that.

[0:38:44] ED: Yeah. The issue becomes that. Remember, these databases, like Cassandra, end up repartitioning and rebalancing based on certain operational conditions, nodes being added, and so on.

[0:38:56] SF: Right. Like a RAG battery.

[0:38:58] ED: Yeah. What you don't want to do is have - vector comparisons are expensive.
You don't want to have to go and trigger thousands of vector comparisons every time you rebalance a node. That's one of the reasons why we're being a little bit careful about how we think about these things. You're going to see a lot of cool stuff around this. Again, I'm not the person to do it justice. In fact, Jonathan Ellis is probably - he's going to call me up after this and be like, "Ed, you completely misstated that." You may want to do a follow-up, have him on, and he can get into very low-level detail on some of this stuff. The point is, suffice to say, this is an area that, for anybody that's doing a database now, has quickly become a primary consideration as you're thinking about this stuff.

[0:39:42] SF: No, this was super fascinating. Of course, I'd want to nerd out with Jonathan and talk more in depth about vectors. As we start to wrap up, is there anything else you'd like to share? What's next for DataStax?

[0:39:54] ED: Yeah. Well, so a bunch of stuff. One of the main things that I think we could talk more about, and we probably should talk a little bit more about, is the application stack. Because everything that - a lot of this, and I love geeking out on the vector level. The reason why all this is interesting is because people are building a new type of application, and they're building these AI RAG applications, or knowledge apps, or whatever you want to call them. People are still figuring out the term for it. Last time we had something like this was when people discovered 12-factor apps back in the Heroku days, right? Or mobile apps, right? It's a new application template. Databases are interesting, but what makes databases interesting beyond all the things we just talked about, and actually, what really makes the difference in databases, is what's happening at the application tier. Everybody uses Mongo today. I'd love for them to be using Cassandra, but why are a lot of people using Mongo? Because what ended up happening was, about 15 years ago, suddenly, people started building dynamic applications that were built on JSON data and documents and stuff, right? You ended up having these JSON documents, you had JavaScript on the client, you had JSON on the wire, you had JavaScript on the backend, you had this data type, and so it made sense to go and say, "Let me have JSON in the database," right? 10 years before that, or 15 years before that, you had people going and building Java applications with all these ORM systems, object-relational mapping, that needed SQL databases. So, people were using PostgreSQL and MySQL, and then they were building client-server and all that stuff, right? If you want to know where a database is going to go, you've got to look at this application stack. For all the stuff that we talked about right now, we spend as much, if not more, time going and saying, what's the application stack for AI? Working with folks like LangChain, and we have an open source - we recently acquired a startup called Langflow that makes a flow engine for LLM apps with a visual tool and all this stuff. This whole AI stack is really the most important thing to pay attention to. All the underpinnings of how we do the retrieval and stuff, there's a lot of important stuff there, but that is all influenced by how people are building these apps, and figuring out what it is that they're building, and how we can optimize how we organize the data, and how we make access to that data possible in a way that naturally maps to these AI app patterns.
That is an equally important topic. The vector database piece, I won't spend time on it, because people are still learning what that means. Yeah, this is what most developers are trying to wrap their heads around, like, "Okay, I want to build one of these things and I've got this framework and I've downloaded LangChain. What do I do?" That's what we're trying to solve.

[0:42:49] SF: Yeah. It's also, I think, the way that you differentiate at the database level is by showing what's possible at the application level. Are you able to essentially enable something at that application stack that no one's been able to do before? Because when it comes to demoing a database, what is the demo? You run a SQL query. Wow, I saw that 30 years ago against an Oracle database as well. It is not a compelling story. The compelling story is, what does this enable you to do that you couldn't do previously in terms of creating an application that actually serves a real need for people?

[0:43:23] ED: Yeah. No, exactly, exactly. That's the piece that, again, I'd love to come back and talk to you a little bit more about. I think that's, again, where a lot of the action is.

[0:43:33] SF: Yeah, absolutely. Sounds great. Well, Ed, thanks so much for being here and hopefully, we can do a follow-up to dive into that topic as well.

[0:43:39] ED: Awesome. Cool. Thank you.

[0:43:41] SF: Cheers.

[END]