EPISODE 1766

[INTRODUCTION]

[0:00:00] ANNOUNCER: DataStax is known for its expertise in scalable data solutions, particularly for Apache Cassandra, a leading NoSQL database. Recently, the company has focused on enhancing platform support for AI-driven applications, including vector search capabilities. Jonathan Ellis is the co-founder of DataStax. He maintains a technical role at the company and has recently worked on developing their vector search product. Jonathan joins the show to talk about his passion for being in a technical role, where AI fits into the DataStax platform, and developing vector search. He also reflects on his gradual adoption of AI into his workflows and where he thinks AI development is headed in the coming years. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:00] SF: Jonathan, welcome to the show.

[0:01:01] JE: Thanks, Sean. Glad to join you.

[0:01:03] SF: Yeah. Absolutely. Thanks for being here. You've been working on DataStax for nearly 15 years. What's kept you engaged all that time? What's kept you excited?

[0:01:15] JE: Writing code. My career arc traversed going in the executive direction and then realizing that did not spark joy for me. And then coming back to writing code pretty much full-time. Most recently, on our vector search products. What makes me happy to get up in the morning is looking forward to taking code and, at the end of the day, having it do something that it couldn't do before.

[0:01:40] SF: Yeah. I think it's good that you're able to recognize that. Because I think it's a typical path that a lot of people follow - especially if you're an entrepreneur or you start a company. You begin in technical positions. And then if you're successful, a lot of times you end up getting promoted away from doing a lot of the technical work and end up doing a lot more people management. And there might be technical aspects. But it's definitely a different skill set. And you have to find joy in different ways than actually day-to-day building something.

[0:02:06] JE: Yeah. I tried. And I think I might have said this on a Software Engineering Daily episode. But I really tried to convince myself that it was just as fun to build companies or build teams as it is to build code. But for me, it isn't. At the end of the day, I had to stop trying to pretend and embrace my inner code monkey, I guess.

[0:02:29] SF: Yeah. Still, I feel like I have those inner battles every six months or so myself as well.

[0:02:34] JE: Oh, shoot.

[0:02:36] SF: You mentioned your latest work with adding vector support to DataStax. What is DataStax doing now in the context of AI? Where is it sort of thinking about itself fitting within this modern world of the LLM stack?

[0:02:49] JE: You used the word stack. And that's in our name. And that's what we're trying to provide: a one-stop stack for building generative AI applications. We built vector search for Cassandra. We acquired a company called Langflow. We are partnering with Nvidia to provide embeddings computation and other models through Nvidia GPUs. And so, we kind of approached this a bit at a time, where we started with the database and we started with the vector search. And we saw people saying, "Well, okay. But how do I host the rest of my application? You've got this hosted database. But how do I deploy the rest of this?
And how do I get the embeddings into the database instead of having to pull together OpenAI's embeddings model and manually stitch that together with the Astra database?" You can just tell us, "Here's my OpenAI key," and we'll go and integrate with them. Or you can say, "Hey, I want to use these Nvidia embeddings," and we'll compute them on the GPUs for you. We're really trying to remove the complexity as much as possible and let you focus on building your application.

[0:04:03] SF: Yeah. It's more of a platform approach than being essentially like a point solution for vector storage.

[0:04:10] JE: Exactly.

[0:04:11] SF: It's like serve all needs, essentially. And there are a lot of things that you have to stitch together to build one of these applications and actually put it in front of users.

[0:04:18] JE: That's what we're seeing. And not only are there a lot of things to stitch together, but there's kind of a lot of unspoken knowledge. Or it's not necessarily clear what the best practices are when you have LangChain offering five zillion ways to chunk your documents to embed them. What should my default be? What should I start with? What am I most likely to succeed with? And so, we're trying to bring those into the mix as well through the Langflow platform that we're offering.

[0:04:49] SF: Yeah, that's interesting. Because I think you're right. If you're building something like a RAG pipeline, there's so many decision points that you have to make. And at each decision point, essentially, you're sort of potentially giving up something in terms of accuracy. You're making some sort of compromise. And those series of compromises could lead to kind of a disastrous result. And it's very hard to trace back what part of that decision-making process led to the bad result and go back and try to figure that out. It ends up being a lot of tinkering and massaging the various inputs and stuff like that. And for somebody who really just wants to say, "Hey, I just want to build this application and integrate AI into it," having a great default experience would be hugely beneficial, reducing the friction and barrier to entry.

[0:05:34] JE: Right. And that's the dark side of generative AI. The good side is it's a magic wand and it just works. Except when it doesn't, how do you debug that? And often, it's because of garbage-in, garbage-out. You made a mistake in your pipeline and it's not getting the context that it needed to give you useful answers. Yeah. That's why we acquired Langflow: it's this visual environment that gives you best practices in reusable components that you can easily connect together to build your application.

[0:06:07] SF: This direction with the product, did it feel natural or did it require some kind of rethinking to realize that this approach made sense for DataStax?

[0:06:18] JE: I think the rethinking was primarily around the fact that we were a database company for 13 years. And so, realizing that we needed to embrace a wider role in providing infrastructure for gen AI applications. That was the main rethinking, I think.

[0:06:36] SF: Okay. I read this really interesting article that you wrote about how AI helped us add vector search to Cassandra in six weeks. And one of the things that you say in that article is that you're never going to go back to writing everything by hand. First of all, prior to that project, what was your level of skepticism or enthusiasm for using AI to actually help you code?
[0:07:00] JE: By nature, I'm a little bit of a late adopter. I've told so many people, "I spend enough time debugging my own code. I don't want to debug other people's code too. Let other people shake the problems out. And I'll be happy to use it once it's stable and production-ready." And using AI for coding has been a real exception to that for me, because it's just so useful that it's worth putting up with all the sharp corners and rough edges. When OpenAI launched ChatGPT - in October of '22, I think it was - people started throwing it, "Hey, can you write some code to do this?" This was GPT-3.5 at the time. And it could solve small problems pretty well. And so, that was really a big light bulb for me that, "Oh, wow. This is going to change my job." And then when GPT-4 came out a couple months later, it went from being, "Okay, this is going to be useful someday," to, "This is useful now." And so, I think I wrote that article that you're talking about in June or July of '23. At that point, I'd been using AI to help me write code for around six months. GPT-4 in January of that year was when it started getting real for me.

[0:08:17] SF: Yeah. I feel like I was pretty skeptical about these things at first. And now, I mean, it's pretty undeniable how valuable they are. I remember over the summer, I wrote my first program ever in Go. And I certainly could have struggled through getting that program to work by reading through all the documentation and references and using Stack Overflow and the sort of traditional places you would go. But it would have taken me way longer to get that program working than it did leveraging just ChatGPT and throwing prompts at it for what I needed, or even taking some of the code that I had written and asking for it to help improve it and stuff like that. Or even you can write in another language and say translate this over into this other language. And it'll do that with a reasonable output.

[0:09:04] JE: I think I see it being useful in two areas primarily. One is to get up to speed in a domain that I'm not familiar with. I've been writing Java code for - oh, man. Longer than I want to think about. Actually, it's 25 years now. But I recently wanted to experiment with hybrid search in Python. And so, having Claude write most of that code for me just really helped me get up to speed, in terms of it's importing all of these packages that I would have had to read up on manually. And it definitely sped me up by at least a factor of five and maybe 10 in that use case of getting up to speed on code that's not very complex, but is in a language or using libraries that I'm not familiar with. And the other is, even if I am familiar with something, if it's a bunch of boilerplate code or scaffolding, it's really good at taking that out of my workday and making code more fun. And so, that's part of what I talked about in the article: not only am I more productive, but I'm having more fun, because I've got this AI intern to do kind of the boring parts. And I can concentrate on the interesting parts. The interesting parts are still there. Just today, I was writing some code to fine-tune an open source embeddings model. And Claude got me 90% of the way. But then it misunderstood the dimension of the tensors it was using and it couldn't figure that out. I tried a couple, three times, saying here's the error it's getting. And it couldn't resolve it. It's like, "Okay, at this point, it's time to just dig into the code and solve it the old-fashioned way."
It's a good mix. I'm really, really happy with the challenge and the intellectual puzzles of what programming with AI in 2024 looks like.

[0:10:53] SF: One of the other values I've heard from people who are kind of new to programming, too, is that it gives you sort of a non-judgmental assistant to ask questions to, where you don't feel someone's going to be mean to you because you're asking a question you perceive is not an intelligent question to ask or something like that. It kind of makes for a psychologically safe zone to ask questions that maybe otherwise you would hold back on if it was a person that you were trying to ask.

[0:11:19] JE: Yeah. And that's something that - and it's not just questions about like how do I write this code, but also questions about a code base, right? Cassandra is - off the top of my head, I'm going to guess that it's roughly probably between 100,000 and 200,000 lines of code. Not huge. But it's big enough that it's tough to wrap your mind around all at once as a new developer, or even as an experienced developer. I haven't touched the Cassandra compaction code in 12 years, give or take. And so, it's really, really useful to have an AI assistant where you can say, "Hey, how does Cassandra make sure that it doesn't compact away a data file that's actively being used by the read threads?" And so, you can't just paste your whole codebase into GPT-4 or Claude. But what you can do is use one of the tools that uses vector search to provide appropriate areas of the code to the LLM to answer your question from. Cursor is very good at this. Augment Code is also very good at this. And there's the open-source, free-text-mode AI code authoring tool called Aider - I believe the author is French - that's A-I-D-E-R. It's not quite as good as those other two. But it is better than nothing. And all of those will get you an answer in seconds. And if that answer is right 60% of the time, great. I've saved my time that much more often. But if it's not, then you can still do it the old-fashioned way and ask a co-worker. But having that assistant to answer questions, the non-judgmental thing, that's great. Absolutely. But, also, just the speed, the latency of getting the questions answered now. And this is part of the category of things of like, "Yeah, it's not perfect. It gets it wrong 40% of the time." But it's much easier for me to verify the answer than to try to generate that answer in the first place manually. And so, that's still a big win even if it's only 60% accurate.

[0:13:24] SF: Yeah. I mean, I think those are kind of like the perfect use cases for where gen-AI is right now. If you're looking at something like sort of the needle-in-the-haystack search or summarization. And there are tons of industries where this is widely applicable. If you look at legal, there are whole teams of paralegals whose job is to go and look stuff up in case files and stuff like that. And to be able to pull that back in seconds, to your point, even if it's only 60% accurate and you can verify it, well, great. Now you've reserved your resources for those 40% of cases where you can't actually answer it, or your own personal time to dig in and so forth. In terms of the project to extend Cassandra to support vector search, what were some of the big challenges that you ran into with actually integrating vector search into Cassandra?

[0:14:11] JE: The first one was just adding a vector type to start with, which isn't something that Cassandra had before.
I guess there's two components, I think. There are two interesting and challenging components. One is just the vector index in the first place. How do you map vectors to their nearest neighbors and do that in logarithmic time at search time? That's a problem that has a fair amount of prior art around it. And in fact, that's what we reached for. We reached first for HNSW and then DiskANN, which is a more advanced index type that lets you scale outside of memory. But then the other piece is, how do you wire that into the rest of the database? How do you build a query execution engine that can do a vector search but also say, restricted to documents that contain the word red, or restricted to documents that Jonathan authored last week? And that's something that Cassandra hasn't traditionally been good at: doing multiple predicates like that and saying, "Okay. Now take the result and order it globally by this other index." And so, just building - we built a cost-based query optimizer. We built models of how expensive a vector search was going to be versus a keyword search, versus a numeric predicate. There was a lot of work on both of those sides. Both the vector index itself and then integrating it with the rest of the database.

[0:15:43] SF: And then what's this do in terms of sizing? Presumably, a vector, an embedding, could be a couple thousand parameters in length. All floating point numbers. It could be larger than the entire record associated with it, essentially.

[0:15:59] JE: Yeah. And especially with indexes, you're not normally indexing values that are four kilobytes or more in size. And so, I mentioned that we started with HNSW. And HNSW says, "Hey, all my vectors fit in memory. That makes things simple." But that doesn't work so well when your vectors are that large and you run out of memory relatively quickly. And so, the DiskANN design that we moved to says, "Hey, let's leave those raw full-size vectors on disk. And then we'll keep a compressed version in memory. And we can push the compression up to 64x." A lot of people are really excited about binary quantization, which gets you to 32x. You're taking a float32 and turning it into a one or a zero. But you can actually get to 64x with product quantization on a lot of these real-world data sets. The OpenAI embeddings, the Cohere embeddings. I believe, if I remember correctly, Google's Gecko embeddings. You can compress all of those at 64x and still get accurate enough results that reranking them from disk gets you good results. And so, now you've turned a problem of, "Hey, I need expensive memory for these large vectors," into, "Hey, I need cheap disk for these large vectors." And so that's a much more tractable problem. And we're pretty happy with where we ended up on that.
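To make those compression ratios concrete, here is a quick back-of-the-envelope sketch in Python. It assumes a 1536-dimension float32 embedding purely for illustration; the exact product quantization settings DataStax uses are not specified in the conversation.

```python
# Rough memory math for the quantization ratios discussed above.
# Assumes a 1536-dimension float32 embedding; the PQ settings below are an
# illustration, not taken from the interview.

DIMS = 1536
FLOAT32_BYTES = 4

raw_bytes = DIMS * FLOAT32_BYTES          # 6144 bytes per vector at full resolution

# Binary quantization: one bit per dimension -> 32x smaller than float32.
binary_bytes = DIMS // 8                  # 192 bytes
print(f"binary quantization: {raw_bytes / binary_bytes:.0f}x")        # 32x

# Product quantization: split the vector into M subvectors and store one
# byte (a 256-entry codebook index) per subvector. M = 96 gives 64x here.
M = 96
pq_bytes = M                              # 96 bytes
print(f"product quantization (M={M}): {raw_bytes / pq_bytes:.0f}x")   # 64x

# The full-resolution vectors stay on disk and are only read back to rerank
# the small candidate set that the compressed in-memory search produces.
```

At a million vectors, that is roughly 6 GB of raw float32 data versus under 100 MB of compressed codes held in memory, which is why keeping only the compressed form resident and reranking from disk makes the problem tractable.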
[0:17:14] SF: Okay. In the article that you wrote that talks about this project and the different AI tools that you used to assist you - you mentioned a lot of different tools in the article. And I'm curious, in terms of your experience with things like GitHub Copilot, what was that tool good at? And what are some of the limitations?

[0:17:33] JE: Yeah. That article is over a year old at this point. And so, there's newer tools that we can talk about. But GitHub Copilot is still part of my arsenal. And really what they've targeted it at, and I think it's really good at doing it, is guessing what you're going to type on the current line, or maybe the current line and a couple more. But you're not giving it instructions like you do with ChatGPT. Rather, it's looking at your code. It's looking at what you're typing and inferring from that what you might want to type next. And so, it's autocomplete on steroids is what it is. And so, this is another place where, "Hey, maybe it's wrong 40% of the time." But verifying that is just a few milliseconds, a few tens of milliseconds. It's very, very quick to read what it proposes and decide whether to hit tab to complete it or to say, "You know what? That's not what I wanted," and just keep on going on your own. But I have noticed, and I've seen other people comment on this as well, that you kind of get used to working with an AI partner like Copilot. And so, I'll start writing a line and then I'll just think to myself, "Okay, Copilot should be able to take it from here." And so, I'll pause for half a second to see if it jumps in with a suggestion. And sometimes it doesn't and I'm disappointed and then I have to keep going. But you kind of get an intuition for what it's good at completing and what it isn't, and so when it's an appropriate time to pause. And so, I realized on my last plane trip, which was two weeks ago, that I don't like coding offline anymore, because I don't have Copilot. I don't have Aider and Claude. I don't have Augment Code to ask questions about my code base. Yeah. Fortunately, that's happening at the same time that airlines are getting better and faster internet connectivity. But, yeah, I can still write code the old-fashioned way. It's just not as much fun.

[0:19:22] SF: I think for myself, there are certain languages that I only ever learned using an IDE that had code completion and things like that. And you get so used to those things that you don't necessarily build up the muscle memory around like, "Oh, I know exactly where this library is in order to import it." And there are other languages where my main coding interface was Vim or Emacs or something like that. And syntax was not a problem. I could write it from scratch. And with those I'd be more comfortable in offline mode, even though most of those languages I never used outside of, I don't know, programming competitions and school projects, essentially. I totally understand. And I think AI is just kind of supercharging that dependency on some of these tools.

[0:20:03] JE: Yeah. And it does become a dependency, right? If I went to find the first occurrence of a substring in Java, is it string.indexOf? Or is it string.substring? I couldn't tell you off the top of my head with certainty what that is. I would take the String object and then I would put a dot in my IDE and look at the methods that it proposed and say, "Oh, okay. That's the one that I want." And in the same way, I do think that the part of my brain that used to write code manually all the time is atrophying a little bit. And I'm relying on the AI to do that for me, which makes me a little bit uncomfortable. But at the same time, I remember I had a coworker who was writing Java in Vim for years at DataStax. And I kept trying to convince him, "Use IntelliJ. Use IntelliJ. It will make you more productive." And, finally, he did. Finally, he started using IntelliJ. And he was 30% more productive. As brilliant of an engineer as he was writing code the hard way - writing Java the hard way in Vim - using a good tool did make him more productive. And I'm sure that the Vim-using part of his brain atrophied a little bit.
And he doesn't have that mental encyclopedia of Java methods quite the way that he used to. But it's probably the right trade. I don't see how it's not the right trade. There might be a point where I'm just giving instructions to the AI to the point where it's like, "Oh, now it's like I'm managing a team. And this isn't fun anymore." At which point, I'll start a movement for artisanal handcrafted code. But right now, it's a productivity enhancement. And it's a fun enhancement.

[0:21:39] SF: What are your thoughts on the potential impact, I guess, to people who are students and trying to learn? Even when higher-level languages like Python and stuff like that have been introduced, that's incited certain riots within the world of engineering and computer science, where some people feel like, "Oh, you have to suffer through memory leaks, and compiler errors, and kernel dumps and stuff like that in order to earn your stripes and really understand what's going on." And that stuff gets abstracted away by certain languages. And now, if you start to become very reliant on some of these AI tools, you're even further away from necessarily understanding the guts of it - or at least having to understand the guts of it in order to be successful.

[0:22:24] JE: You know, yeah. That's a good question. I think right now, one of the things that junior engineers struggle with - at least in my limited experience coaching junior engineers. And, actually, I taught high school CS for half a semester, so I've got a little bit of experience with the very, very junior end as well. One of the biggest mistakes they make is just kind of looking at a problem and saying, "Oh, maybe this is the solution." And then, immediately, they start trying to implement that instead of slowing down and thinking a little bit harder, like, "What's my world model here? And what would have to be correct for this to be the actual solution?" And AI can make that worse, right? You can just say, "Okay. Hey, AI, try to make this change. Okay. That didn't work? Try this. Okay." Or at the extreme, you're just pasting stack traces into ChatGPT and then pasting what it says back into your IDE and hitting the test button again. It can definitely exacerbate bad habits. But at the same time, there's never been a better tool for, like you said, non-judgmentally saying, "Hey, this is happening that I don't expect. What's going on?" And I hope that junior engineers make the right trade-off there. But I could see that it would be possible to overuse it. The reason why I'm optimistic on that is that, if you are overusing it, then it's self-limiting. You hit those limits where the AI gets stuck and can't solve your problem, at which point you do need to be able to solve it yourself.

[0:23:54] SF: Do you think that there's certain aspects of coding that should never be AI-powered? For example, writing unit tests or something like that? Should those be handcrafted or other parts -

[0:24:03] JE: Oh, man. Unit tests are the first things on the chopping block for me. Like, "Hey, Claude, test this code for me." I do find that if I just say test this code, it does a terrible job. But if I say write tests that exercise this path, or write tests that use this kind of data, then it's much better. It does much better when you give it a little bit of direction beyond just "write tests."

[0:24:26] SF: Yeah.
I mean, I think to get value out of most of these tools, whether it's from a coding aspect or even other things like just writing a blog post or something like that, you have to have enough knowledge to give it very clear instructions in order to get the thing out. I don't think my mom, who has no engineering background, is going to go and be able to build an app with any of these tools right now that would actually do anything useful. They're not there. You still need enough knowledge about how systems actually work, how to program things, in order to get value.

[0:24:57] JE: Coming back to what I was saying earlier about being optimistic, if I were to be pessimistic, I might be a little bit pessimistic about the next generation of engineers having that intuition of how to direct the AI appropriately. In other words, the AI's first inclination is often to write the simplest possible solution to the task it's been given. Which means that a dozen tasks in, you've just got - kind of you've got what we used to call spaghetti code. I had a situation today where I was directing Claude how to refactor some code that had gotten a little bit out of hand. And because I had these 25 years of experience writing code the hard way, I was able to say, "Here's what the API should look like. Now go make it conform to that." But how successful is it without that experience to say, "Here's what it should look like"? I don't know. But I guess over the next couple years we'll find out. Maybe there is a way to get that experience coding with AI all the time. I hope there is.

[0:26:06] SF: You mentioned Claude a couple times. Is that your main tool that you're using these days to help -

[0:26:11] JE: Claude is my go-to LLM. And they just released an update two days ago that I don't think I have enough data points personally to say is better. But the Aider benchmark that Paul Gauthier puts out says that it's better, and that's a pretty high-quality benchmark. Yeah, Claude is better than GPT-4 in general for writing code. And, yeah, that's my go-to. And then I usually use it through this tool called Aider, which lets you say, "Here's the files that I want the LLM to edit." And then I'm going to give it this - what Aider calls a repo map, which is basically it makes a graph, a network of how your code's connected. And then looking at the file that you're editing, it says, "Okay, here's the classes that it calls and the classes that call it." And so, I'm going to give those connected pieces to Claude as context rather than trying to give it the entire codebase. It's a really powerful approach. And that lets it edit the files in place without having to copy back and forth into a browser window. That's my go-to today.

[0:27:18] SF: Is that the main sort of AI tool stack that you're using? Or are you using other stuff as well?

[0:27:22] JE: Yeah. I mentioned the others. Copilot for the autocompletion. And Cursor and/or Augment Code for asking questions of the codebase. And so, the reason I put that footnote there is that my understanding is Augment Code is still in closed beta. DataStax is talking to them. And so, I have access to that. And it integrates with JetBrains IDEs like IntelliJ, which is the killer feature for me, but it's also very good at answering those questions about the codebase. But if you don't have access to Augment Code, then Cursor is also good at that answering-questions use case.
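The repo map idea Jonathan describes can be pictured with a toy sketch. This is not Aider's actual implementation - Aider builds a ranked dependency graph with far more sophistication - just a minimal Python illustration of extracting class and function signatures from a few files so they can serve as compact context in an LLM prompt.

```python
# Toy illustration of a "repo map": summarize class and function signatures
# from a handful of Python files to use as compact LLM context.
# An illustrative sketch only, not Aider's algorithm.
import ast
from pathlib import Path

def summarize_file(path: Path) -> str:
    """Return one line per top-level function/class, with method names for classes."""
    tree = ast.parse(path.read_text())
    lines = [f"# {path}"]
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            lines.append(f"class {node.name}: methods = {methods}")
    return "\n".join(lines)

def repo_map(paths: list[Path]) -> str:
    """Concatenate per-file summaries into a single context block for the LLM."""
    return "\n\n".join(summarize_file(p) for p in paths)

if __name__ == "__main__":
    # In practice you would pick the files connected to the one being edited
    # (callers and callees), not every file in the repository.
    print(repo_map(list(Path(".").glob("*.py"))))
```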
[0:28:00] SF: How do you think we get to a place where some of this stuff is a more consolidated, unified experience, rather than having to use three, four different tools?

[0:28:10] JE: Right now, it's early enough. This is the industry pendulum, right? When there's something new, you need to use best-of-breed tools for each individual use case, because it's new. Nobody's consolidated them into one tool that does everything. And then gradually over time, you do get that consolidation, until someone figures out, "Oh, here's this new aspect of the problem that I can deliver an order of magnitude better benefits for." At which point, you start over. We are at that beginning stage right now. I do think that we'll get to that consolidation stage over the next couple years. How quickly that happens? I don't think I could guess.

[0:28:50] SF: I think you're exactly right. That's the sort of normal path for all new technology innovation. People forget that there was a point when - even with early messaging clients, like peer-to-peer - you had ICQ, AOL. And then people would build these super apps that aggregated them all together. Eventually, now you have maybe a handful of those things that all provide kind of a similar level of experience.

[0:29:15] JE: Certainly, Cursor would tell you that they can do all three use cases today. I personally prefer mixing and matching. But, yeah, that's where we are.

[0:29:25] SF: I want to switch gears a bit and talk about this project, ColBERT Live. Your team introduced this, and my understanding is it aims at making vector databases smarter. Could you explain a little bit about the project and how it actually helps or enhances -

[0:29:39] JE: There's a little bit of background here. The problem that ColBERT - there's ColBERT Live, which is the name of the library that we open sourced. And that's based on a project that I'll call Stanford ColBERT, which is a series of research papers and an open source library produced by - I'm not going to risk saying his name because I'm probably going to get it wrong. But he was a grad student at Stanford when he wrote it. And so, the problem that it solves is that vector search is really good at capturing semantic similarity. "It is sunny outside" and "it is a bright day" - those would compare as semantically very similar even though they don't share keywords. Sunny and outside, and bright and day - those are completely different words. If you're trying to do a text-based search, you will not see those appropriate points of comparison. The flip side is that vector search is not good at keyword search. And it's not good at terms that the embeddings model was not trained on. Terms that it's not trained on, or that it doesn't see often in its training data, include proper names, for instance. Jonathan Ellis - it might be okay at searching for that. Those are both pretty common names. Sean Falconer - I would guess that it would do less well at searching for your name just because it's less common. And so, right now, the state of the art in the industry is to say, "Okay. Well, let's take the vector search results and let's take the keyword search results that you get from something like a BM25 algorithm. And let's give both of those sets of results to a reranking model and then let it sort out which ones are actually best." That does work pretty well. Actually, surprisingly well.
Especially with the most recent generation of reranking models that you've seen from - Voyage AI, for instance, released one in September that in my opinion is the most accurate available right now. But the problem is that these reranking models are expensive, and not just in the sense of, "Hey, I have to pay for the service to use them," but expensive in terms of time. Voyage's rerank-2 model takes just about exactly half a second - a little bit under half a second - to rerank a list of 40 documents and say which ones are the top five results for a given query. I mean, it's certainly acceptable for a lot of use cases when the alternative is not getting the right answer. But that's definitely slower than you'd like when your underlying vector search is 50 milliseconds and your BM25 search is probably faster than that. What Stanford ColBERT does is it says, "Hey, instead of representing our documents or our passages with a single vector, let's create a semantic vector for each token that's influenced by the surrounding tokens in the passage. And we'll index all of those." And then we'll do the same thing with the query. We'll break up the query into tokens and create a semantically-influenced vector for each of those query tokens. And so, now I'm comparing 16 or 32 query vectors with my database of all the vectors that were indexed for this passage. And what that does is, now instead of having a single vector for "it's sunny outside" or "it's a bright day," I have a vector for sunny influenced by outside, and a vector for bright influenced by day. And so, I get the best of both worlds in my search results. I get the semantic matching. But I also get something that's very similar to keyword matching. When you do that, now I don't need to throw in BM25. Now I don't need to rerank anymore. And so, it's both a more theoretically satisfying approach as well as potentially a faster one. With Stanford ColBERT, what you get is a standalone vector index that's specialized for doing ColBERT searches. But you can't do any predicate filtering. It doesn't integrate with anything else out there. And so ColBERT Live says, "Hey, bring me your vector database - DataStax Astra, Apache Cassandra, PGVector, SQLite-Vec, whatever it is," and ColBERT Live wraps the intelligence of how to do ColBERT-style indexing and searches with multiple vectors around a standard single-vector database.

[0:34:29] SF: Is that overall going to be a better search for vectors than using sort of traditional vectors? Is that like a wholesale replacement of the traditional vector search?

[0:34:39] JE: I mean, gazing into the future, is it going to be a wholesale replacement? Probably. If there are use cases where single-vector search is giving you adequate results, then ColBERT won't replace that. Because ColBERT isn't adding value - you're already getting adequate results. And it's inherently slower, since it's dealing with multiple vectors per query. But if you do need more relevant results than you get from single-vector search, which I think is most use cases, then I do think ColBERT-style search has the potential to replace those.

[0:35:18] SF: Yeah. Essentially, if you need to rely on a reranker today, then potentially this is a better, faster option that gives you a more accurate result.
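For readers who want to see the late-interaction scoring Jonathan is describing, here is a minimal sketch of ColBERT-style MaxSim scoring in Python with NumPy. The shapes and toy data are made up for illustration; in a real system the token vectors come from a ColBERT checkpoint and the candidate passages come from an ANN search.

```python
# Minimal sketch of ColBERT-style "MaxSim" late interaction:
# score(query, passage) = sum over query token vectors of the maximum
# cosine similarity against any of the passage's token vectors.
# Toy shapes for illustration; real token vectors come from a ColBERT model.
import numpy as np

def normalize(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def maxsim(query_vecs: np.ndarray, passage_vecs: np.ndarray) -> float:
    """query_vecs: (num_query_tokens, dim); passage_vecs: (num_passage_tokens, dim)."""
    sims = normalize(query_vecs) @ normalize(passage_vecs).T  # token-to-token similarities
    return float(sims.max(axis=1).sum())  # best passage token per query token, then sum

rng = np.random.default_rng(0)
query = rng.normal(size=(16, 128))                             # e.g. 16 query tokens, 128-dim
passages = [rng.normal(size=(n, 128)) for n in (40, 80, 25)]   # candidate passages

# Rank candidates (found earlier via per-token ANN searches) by MaxSim.
ranked = sorted(range(len(passages)), key=lambda i: maxsim(query, passages[i]), reverse=True)
print(ranked)
```

Because the maximum is taken per query token, a passage scores well when every query token finds some closely matching passage token, which is what gives the keyword-like behavior on top of semantic matching.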
[0:35:28] JE: We've been talking about just pure text passages. But the advantage of the ColBERT approach is magnified even more when you're talking about multimodal data. A group of French researchers created a model called ColPali which applies the ColBERT approach to searching images. I can give a text query and it will match that with images whose vectors are similar to the query that I gave it. And the way it does that is they trained a model to map both the text queries and the images that it's indexing to the same vector space. The alternative, what people are doing in industry, is they're taking those images and running OCR against them. And they're training models to recognize tables and charts, pull those out, and describe those in a way that can also be indexed with traditional vector search. And so, you've got this really complex pipeline that's a little bit slow and a little bit fragile. Versus, "Oh, I can use this ColPali model and I can index it with ColBERT Live. And all of that complexity goes away." I think even more than the traditional vector search and getting better results, I'm more excited for the image search side of things. And ColBERT Live supports both of those.

[0:36:57] SF: What's the state of the project right now?

[0:36:59] JE: As far as I know, I'm the only person who's actually used it. I would love to get some feedback and hear, "Here's what was useful. Here's what was hard." It supports Astra and it supports SQLite-Vec today. What I did in an attempt to smooth the learning curve is I created two cheat sheets. One is for, "Hey, I'm working with Claude or I'm working with GPT, and I want it to help me use ColBERT Live." And so, you can give it the cheat sheet, which has the API and the docstrings and so forth. It's a Python library. And the other is for, "Hey, I want to use Weaviate, or I want to use Qdrant, or I want to use Pinecone, or I want to use some vector database that you haven't implemented." I also have a cheat sheet for, "Here's how you extend it. Here's how you implement the database class for ColBERT Live." Because it's an Apache-licensed project, the intent really is for it to be more than just DataStax.

[0:37:55] SF: How does the integration work with these various vector stores? Presumably, it's creating an index outside of them. Then how does it actually use the index to find the data that's stored within the vector store?

[0:38:09] JE: This is something where I wish I could find a better way to do it. I haven't found a better way. What I did was, there's an abstract database class. And it has two methods that you need to implement. One is for running the individual vector searches per query. And then the other is, given a unified list of documents that I've identified as the best candidates, fetch all of the vectors associated with those documents so that I can compute the ColBERT MaxSim score for each of those. And the reason why it's at such a high level is that I want you to be able to add in predicate filtering. I want you to be able to add in any other aspect of your database that you want to - ACLs, whatever. And so, it's hard to find a common denominator across Cassandra, which is a very different animal from PostgreSQL, which is a very different animal from Pinecone. I left it at this fairly high level. And then if you drill down into the Astra implementation, there it gets more opinionated about here's what your code should look like. And, similarly, I did it for SQLite. It's a little more opinionated there. And so, that's why I have that cheat sheet for, "Hey, I want to implement this for this other database." There's some examples for you to follow. And hopefully, that helps.
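As a rough picture of the two-method abstraction Jonathan describes, here is a hypothetical sketch in Python. The class and method names are invented for illustration and are not ColBERT Live's actual API; the real interface is documented in the project's cheat sheets.

```python
# Hypothetical sketch of the two-method database abstraction described above.
# Names are invented for illustration; this is NOT ColBERT Live's actual API.
from abc import ABC, abstractmethod
import numpy as np

class VectorStoreAdapter(ABC):
    @abstractmethod
    def search_token_vector(self, query_vec: np.ndarray, limit: int) -> list[str]:
        """Run one ANN search for a single query token vector and return candidate
        document IDs. Predicate filters, ACLs, etc. can be applied here."""

    @abstractmethod
    def fetch_document_vectors(self, doc_ids: list[str]) -> dict[str, np.ndarray]:
        """Fetch all stored token vectors for each candidate document so the caller
        can compute its MaxSim score."""

def colbert_search(db: VectorStoreAdapter, query_vecs: np.ndarray, limit: int = 10) -> list[str]:
    # 1. Gather candidates from one ANN search per query token vector.
    candidates: set[str] = set()
    for qv in query_vecs:
        candidates.update(db.search_token_vector(qv, limit=limit))
    # 2. Fetch each candidate's token vectors and rank by MaxSim
    #    (assumes vectors are already L2-normalized, so dot product = cosine).
    doc_vecs = db.fetch_document_vectors(list(candidates))
    def maxsim(passage: np.ndarray) -> float:
        sims = query_vecs @ passage.T
        return float(sims.max(axis=1).sum())
    return sorted(candidates, key=lambda d: maxsim(doc_vecs[d]), reverse=True)[:limit]
```

Keeping the abstraction at this level is what lets each backend (Astra, Cassandra, pgvector, sqlite-vec) apply its own filtering and access control inside the two methods.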
[0:39:29] SF: And in terms of where we started the conversation, around what DataStax is doing in the AI space, becoming this full end-to-end platform, what is the state there? Can I come to DataStax today and do everything that I need to do in order to build an AI-powered RAG application?

[0:39:48] JE: Everything is a big word. But you can definitely build AI applications completely on top of DataStax today, especially if your application has a family resemblance, shares some DNA, with chatbot applications. Because that's what Langflow was originally created to target. And so, if you wanted to build a backend for Cursor to compete with them today, you'd definitely need to do some customization. And that's not going to be something that we're going to do out of the box. But we will give you the building blocks and we can help you figure out how to do the missing pieces.

[0:40:25] SF: Yeah. I mean, I think that's going to be the case with any fully-fledged platform today. If you need to do really advanced stuff, you're going to have to roll up your sleeves, I think, and get your hands dirty and do some custom work. What's next for DataStax?

[0:40:39] JE: We see the journey in the industry overall as going from 2023 being a year of experimentation and testing this new tool that we have, to 2024 really being where people have been able to successfully turn that into production applications. And we believe that 2025 is where people are going to go from automating and enhancing their existing products and existing workflows to addressing things that weren't possible before. And just as an example, a very simple example, I mentioned earlier that I was fine-tuning an embeddings model. And the way that I got the data to do that fine-tuning was by asking Gemini Flash to OCR a bunch of PDF documents for me. And a couple years ago, building a data set to train or fine-tune a model was considered one of the most difficult things that you could do. And it required an army of human labelers to do it. And now, you've been able to reduce the time to do that by three orders of magnitude. Maybe four. I think that you're going to start seeing - it's just like the internal combustion engine. It started off - I think they called it a horseless carriage, right? They just thought, "Oh, I know things with wheels - carriages. That's what we're going to build." Now, a modern Tesla has very, very little resemblance to a carriage. And so, I think that's the trend that you're going to see. And DataStax wants to help people make that transition.

[0:42:11] SF: I think that's right. If you look at any kind of big technology shift that's happened in terms of how consumers interact with technology, whether that's the internet and desktop, to mobile computing and the cloud, usually, the first couple of years is kind of experimentation and setting up infrastructure. And then it takes a couple years for the really net new consumer experiences that are native to that technology to happen. Uber and Instagram, these landmark mobile-first companies, didn't happen immediately when the iPhone was released. It took several years. Because you had to build all of, essentially, the tooling and infrastructure to be able to even serve and create that kind of use case, and have people thinking about it who grew up with the technology and stuff like that. And I think 2025, 2026, that's when I think we'll start to see that in the AI world as well.

[0:43:03] JE: Right. I think so too.
[0:43:05] SF: Well, Jonathan, this has been really interesting. Thanks so much for being here.

[0:43:08] JE: All right. Thanks again, Sean.

[0:43:10] SF: Cheers.

[END]