[0:00:01] JB: Hello. I'm really excited to have with us today here on Software Engineering Daily, Paul Masurel, who is the creator of Tantivy and founder of Quickwit. Paul is a French software developer, living very far from France, for all the right reasons, so he can be a very good spouse. He spends a lot of time thinking about and working on complex search and analytics queries that can work directly on cloud storage, but also have sub-second performance. We're going to talk to him a little bit today about Quickwit, a little bit about the architecture and creative technical decisions he's made to make sure that he has a very competitive product, and we're going to talk a little bit about his new release that just came out as well. Welcome, Paul.

[0:00:48] PM: Hello, Jocelyn. Thanks for having me.

[0:00:51] JB: Let's get started with a couple of quick questions. I'd like you to just talk a little bit about Quickwit. What are the challenges today for existing search engines? Why can't I just use those to search object storage very quickly?

[0:01:09] PM: Yeah. Maybe before jumping into this interesting question, we need to tackle a little bit the main marketing challenge that I've been having. When we say that we are a search engine, sometimes people will project us into a space of search engines that we are not very good at. There are very different types of search engines. You cannot choose Quickwit, for instance, to back an e-commerce website, or something like that. We have specialized in very large amounts of append-only data, and if you are not sure what append-only data is, usually it's logs. We do basically nothing to try to return the most relevant results. We just try to search into a large amount of data very fast, and usually people are okay with sorting by timestamp. That's the problem we are trying to solve. Then there are a lot of challenges there. Usually, the main one revolves around decreasing the cost, because even the smallest company today, if it is a little bit successful, can end up with terabytes and terabytes of data. Just being able to search into that is really hard. If you just try to brute force your way through this data, you will probably end up with a cloud bill that is much higher than what you can afford. Decreasing cost is a first challenge. Then you have some challenges around reliability and managing your cluster. With most solutions today, if your cluster is a little bit big, you will probably notice that you end up having a team of DevOps in charge of the cluster, and it really feels like babysitting. I think it's okay if your search engine is a money maker. If it's the search engine that backs your e-commerce service, that's fine. If it's for your logs, you really would rather have your team of engineers working on something else. Then there are a bunch of sub-challenges to that. So those would be our goals: diminishing cost and improving the manageability of your software. Then, there are different challenges associated with those. Should we jump into them, or?

[0:03:56] JB: Yes. The three things you're focused on are append-only data, cost containment, and cluster management. I haven't talked about this in a while, but append-only data is a particular problem to work with, right? Can you talk a little bit about why it is a particular problem?

[0:04:16] PM: Super interesting question, but a very tough one. I'll go for the simple answer in our case.
The way we store our data is we receive the logs, we produce pieces of the index, and we upload them to Amazon S3. It works great. Amazon S3 is really cheap. It's extremely reliable. You enjoy the eleven nines of durability and everyone is happy. The trouble there is, let's say that someone comes and says, "Oh, I'd like to remove all of the documents that contain this token, or all of the documents associated with this tenant, typically, or this user." GDPR is asking us to remove all of the data associated with this user. Now, by design, you have a physical constraint, which is that those documents are in every single piece of the index that you produced before. You are being asked to modify a large number of objects, basically. That's the crux of the problem. Quickwit does handle deletes, but only in a very asynchronous way. What you can do is typically address those GDPR requests, because of course, people have those. You will just put them in a queue and it will be processed maybe once a day. Batching the requests like this removes the inefficiency, because instead of processing one request after another and having to rewrite all of the files many times through the day, you do that only once, so it's fine.

[0:06:14] JB: Okay. I got a little distracted with the append-only data, because I remember we had to deal with that in a past experience of mine, where it gets very difficult to backtrack and make changes. I just appreciate you explaining that a little bit. But let me backtrack a little bit to the higher level, where you said, hey, you're focused on append-only data as part of the core architectural decision that you've made about things that you're going to focus on. You've also said, reducing cost and making it easier to manage your cluster. Do you want to talk a little bit about how Quickwit addresses each of these buckets?

[0:06:49] PM: Yes. I will also explain one of the problems that you have. There are some super fascinating technical problems there. Usually, in the realm where we are working, people have a super large amount of data and they do not search that much. It's like the joke that you made, I think: write once, read once. It's typically what we observe. People can have 200 terabytes of data and it's only one person searching them once they have an issue. The most extreme case would be people in the security realm. They want to index everything, all of the data that they have, and keep it for three years. It's only one person who, if they have some kind of security issue, needs to investigate the past and see what possibly leaked, what actually suffered from this security issue.

[0:07:48] JB: You're the expert here, but let me just plus one. That is such a common problem in large enterprises. It is a very common thing that you'll see massive amounts of logs and materials that cyber, in particular, is hanging onto in case they get one question.

[0:08:05] PM: Yeah, exactly. Exactly. At one point, the index is not even there to be able to search into the data. It's there so that someone can sleep at night. Because maybe at one point, they will have to search in the search index and it will really save their job in the company. It's very critical stuff, but a very rare occurrence. Back to the problem. In this case, your cost structure is very different than if you were just trying to run your e-commerce website. Usually, your cost will be mostly storage and indexing, just because you don't search that much.
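To make the delete story concrete, here is a minimal Rust sketch of the batched delete queue Paul described a moment ago. All names and types are illustrative assumptions, not Quickwit's actual API; the point is only that draining the whole queue once means each index piece is rewritten at most once per batch, instead of once per request.

```rust
use std::collections::VecDeque;

/// A pending delete request, e.g. "remove everything for this tenant".
/// (Hypothetical type for illustration.)
struct DeleteRequest {
    tenant_id: String,
}

struct Split {
    id: String,
    tenants: Vec<String>, // tenants whose documents appear in this split
}

/// Instead of rewriting every split per request, queue requests and
/// apply them all in a single daily pass over the index.
fn apply_batched_deletes(queue: &mut VecDeque<DeleteRequest>, splits: &mut Vec<Split>) {
    // Drain the whole queue once: each split is rewritten at most once
    // per batch, no matter how many delete requests accumulated.
    let doomed: Vec<String> = queue.drain(..).map(|r| r.tenant_id).collect();
    for split in splits.iter_mut() {
        if split.tenants.iter().any(|t| doomed.contains(t)) {
            // In a real system this would rewrite the split on object
            // storage without the deleted documents; here we just note it.
            println!("rewriting split {} without tenants {:?}", split.id, doomed);
            split.tenants.retain(|t| !doomed.contains(t));
        }
    }
}

fn main() {
    let mut queue = VecDeque::from([DeleteRequest { tenant_id: "acme".into() }]);
    let mut splits = vec![Split {
        id: "split-0001".into(),
        tenants: vec!["acme".into(), "globex".into()],
    }];
    apply_batched_deletes(&mut queue, &mut splits);
}
```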
Now, your challenge is, how can I index my data really fast? Indexing is very tricky to implement. It's a complicated problem. You want to maximize the use of your CPU when you do that. That's something that we are very good at. To explain a little bit of the challenge: to increase your indexing throughput, given a single machine, what you will want to do is to use your CPU at all times at the maximum possible capacity. It could be [inaudible 0:09:34], or it could be Tantivy. Usually, the way it works is you will do some batching. You are not in your typical SQL database, where there is a write-ahead log, and then after every single transaction, all of the data is made visible. What you do is you accumulate data for, let's say, 30 seconds, and you produce an index artifact after 30 seconds. You push that, in our case, to S3, and then it's visible. This batching unlocks a lot of the optimizations that you can do in indexing. If you do that, you are building something strange. I'm a classical engineer at heart, and it feels like you are building some weird engine in the mechanical sense of the term. It's like a cycle, right? You ingest data in, you crunch the data using your CPU, and then you write stuff to your disk and you start again. In our case, we write stuff to the disk, we upload it to S3. What else do we do? We talk with the metastore, to tell the metastore, "Okay, the data is available." All of these steps, they don't all use the CPU, right? They don't all use your IO. Every single step is consuming some specific resources of your system. When you're writing to disk, you are wasting your CPU, which is staying idle. When you're uploading to Amazon S3, you are wasting your IO and you're wasting your CPU, and so on. Where I'm going with this is that you want something that is streamlining your process entirely. You need to have a nice little pipeline where you are indexing one batch, and you are uploading the index artifact that is associated with the previous batch. Everything happens at the same time, so that you are taking the best out of your hardware. That is something that, yeah, we are pretty good at.

[0:11:44] JB: I think that's such a great way to talk about it in terms of a mechanical analogy, because when you describe at a high level the way your architecture works, it reminds me of driving a manual vehicle, where you give just enough gas and engage the clutch at just the right moment. That's the kind of interconnectedness of your architectural components. When I read your documentation, that seems to be a huge part of how you're able to get a lot more speed and efficiency.

[0:12:11] PM: Yes, yes. Exactly.

[0:12:13] JB: Maybe we could talk through your architecture a little bit, too, when you're ready.

[0:12:16] PM: Sure.

[0:12:17] JB: Because I think those are things people would love to understand about Quickwit. I certainly found it, well, I wouldn't say I understand it, but I found it very interesting when I read about it on your site. We'll just do it verbally, no pictures, but just left to right, going around your architectural components: indexing, the metastore, the control plane. Could you help us understand what happens?

[0:12:41] PM: Yes. I'm going to even try to make a good transition to come to that. We were talking about the challenges and CPU usage. I'm going to pick one challenge that brought us to this architecture. On the indexing side, if you want to reduce cost, you want to use your CPU at maximum.
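The pipelined indexing loop Paul describes, where the CPU indexes one batch while the previous batch's artifact is already uploading, can be sketched with threads and channels. This is a toy illustration under assumed names, not Quickwit's internals.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (raw_tx, raw_rx) = mpsc::channel::<Vec<String>>();      // ingested batches
    let (artifact_tx, artifact_rx) = mpsc::channel::<String>(); // index artifacts

    // Stage 2: CPU-bound indexing.
    let indexer = thread::spawn(move || {
        for batch in raw_rx {
            // Pretend to build an index artifact for the batch.
            let artifact = format!("split-with-{}-docs", batch.len());
            artifact_tx.send(artifact).unwrap();
        }
    });

    // Stage 3: IO-bound upload (e.g. to S3). Runs concurrently with
    // stage 2, so the CPU keeps indexing while IO is in flight.
    let uploader = thread::spawn(move || {
        for artifact in artifact_rx {
            thread::sleep(Duration::from_millis(50)); // simulated upload
            println!("uploaded {artifact}, now visible to searchers");
        }
    });

    // Stage 1: ingestion. Batches accumulate for ~30s in the real
    // system; here we just push a few toy batches.
    for i in 0..3 {
        raw_tx.send(vec![format!("log line {i}"); 1000]).unwrap();
    }
    drop(raw_tx); // close the pipeline
    indexer.join().unwrap();
    uploader.join().unwrap();
}
```

Because each stage consumes a different resource (CPU for indexing, network for upload), overlapping them keeps the whole machine busy, which is the "engine cycle" analogy from the conversation.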
Every SRE or DevOps listening to this knows that this is a goal that is in opposition with the idea of reducing your search latency. Because if your CPU is at 100% and one request comes, it has to wait behind a queue of huge tasks. You are into the statistics of queues: if a system is working at capacity, the latency will be bad. Where I'm going with this is, ideally, to solve this problem, you would run indexing and search on different hardware, which leads me to the architecture. Yeah, we totally decouple our indexing and search. The way things work is your data is coming from somewhere; we'll probably discuss that. For simplification, I would say that you have some Kafka topic that contains all of your data. You need to ingest maybe, I don't know, 2 gigabytes of data per second.

[0:14:26] JB: Like a massive stream of logs, right? They're all the logs for the whole company, because I'm in cyber and I'm just keeping all of this massive stream.

[0:14:35] PM: Yes. You need to index this massive amount of data. What you will do is you will have a set of indexers. It can be any number of indexers. In the case of Kafka, they will consume the topic as a consumer group, so that you get your exactly-once semantics there. What they do is they work on their own. They don't need any coordination. Kafka does the job of rebalancing and they work on their side of things. When they finish one batch, they upload it to Amazon S3, and they will publish their so-called splits by talking to the metastore. Today, the implementation of the metastore is just a PostgreSQL instance. The load is much lighter on the metastore. Typically, we have 10 million documents per split. You might be scared, but it's not like we are indexing documents in PostgreSQL. Every single row in PostgreSQL is handling 10 million lines of logs, so you have some nice leverage there.

[0:16:01] JB: Those are big numbers. Those are big numbers.

[0:16:04] PM: Yes. With one million rows, you reach a petabyte-scale index without any problem, and one million rows is nothing for something like PostgreSQL. Then you have a very separate world, the world of searchers. Indexers do not talk with searchers at all. The searchers are stateless. That was an interesting part of our challenge. With every single searcher, you do not need persistence, a hard disk, or anything like this. You just start your nodes, they join the cluster, and they are ready to work. When a search request arrives, it can hit any searcher. That searcher will act as the so-called root searcher. It will coordinate the work that needs to be done in a distributed way. The first step that it will do, of course, is connect to the metastore, and it will then dispatch the work amongst its friends. There is no Raft, no coordination per se. The searchers just have to know who the other searcher nodes are, and we do that through gossip. It's extremely light and it's extremely easy to manage. There is no ZooKeeper, for instance, involved in this world.

[0:17:40] JB: The notions of splits and searchers, those two ideas, where did those come from? Is that novel to your offering, or did they come from a previous architecture, or thought process?

[0:17:53] PM: No. I would definitely not call it novel.
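Before Paul continues, here is a rough sketch of the leverage he just described, where each metastore row stands for one split of roughly 10 million documents. The struct fields and the bytes-per-document figure are assumptions for illustration, not Quickwit's actual schema.

```rust
/// A sketch of the kind of metadata one metastore row might hold per
/// split (field names are illustrative, not Quickwit's schema).
#[allow(dead_code)]
struct SplitMetadata {
    split_id: String,
    num_docs: u64,      // ~10 million log lines per split
    size_bytes: u64,
    min_timestamp: i64, // used later for time pruning
    max_timestamp: i64,
}

fn main() {
    // The leverage: each row stands for ~10M documents, so 1M rows
    // (trivial for PostgreSQL) already covers a petabyte-scale corpus.
    let docs_per_split: u64 = 10_000_000;
    let rows: u64 = 1_000_000;
    let bytes_per_doc: u64 = 200; // rough log-line size, an assumption
    let total = rows * docs_per_split * bytes_per_doc;
    println!("{} rows cover ~{} PB of raw logs", rows, total / 10u64.pow(15));
}
```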
The idea of doing search by writing files that are write once, read many, is actually at the core of [inaudible 0:18:13] to begin with.

[0:18:15] JB: I guess I'm asking about the stateless quality of the searcher.

[0:18:18] PM: Oh, yeah. The stateless quality, I don't know anyone else who is able to do it, because there are a bunch of technical challenges behind that. But I can probably go through those. All of this architecture, the way I just described it, came very naturally. It was like music. Oh, we have these challenges. How do we do this? Oh, of course, we want to separate searchers and indexers. Of course, it would be nice if you could have a stateless searcher, because it means that you can scale very rapidly. You could have only one searcher, add 10 searchers, and then shut them down. In terms of manageability also, it's a breeze, because you don't have this state. You don't have this trouble of, oh, my node failed, and now I need to rebalance all of the data that it was hosting to the other searchers, and my search performance will be terrible. We don't care about that. All of the data is stored on object storage, usually Amazon S3, but a lot of people are using us on other clouds. Yeah, it works on any object storage. So that, of course, was the obvious goal. Being able to pull it off would be great: super low cost. Just to give you a figure: storing 1 terabyte of data on Amazon S3, replicated, where you don't care about anything, only costs you $25 per terabyte. That's much cheaper than having it on your local SSDs and having to replicate it yourself. You don't have to pay that cost multiplied by three, plus the hardware. It's really brilliant.

[0:20:15] JB: Yeah, right. Because I don't know if everyone listening knows this, but for low latency, there's often at least three copies.

[0:20:21] PM: Yeah. And for durability as well. You want to not lose your data if one node crashes, so you will have to copy it. Usually, you have at least two copies, and three if you actually like your data. Yeah, so it adds up. It becomes very expensive. Okay, so that's what we wanted to do, but it's actually impossible. The trouble there is Amazon S3 is slow. Extremely slow. I think it's very important to have the figures in mind. You can actually read in the middle of a file in Amazon S3. You don't have to read the whole file; you can ask for a bunch of bytes. Asking for a bunch of bytes, the equivalent on a hard disk would be called a random seek. It has a latency of typically 70 milliseconds. In search, when you start a search engine, because we are talking about stateless nodes, right? We assume that they don't know anything about the index when they start. When you start a search engine, it will read a bunch of footers and read here and there on every single file in the index. It takes a huge number of reads, and the footer will point to another part of the file. You have a chain of dependent reads that makes it so that if you wanted to do that on Amazon S3, for your searcher to be ready to search, it would take 30 seconds, or something like that. It's absolutely impossible to do it the usual way, the trivial way. We had to deal with this problem of latency. I will talk about the other problem and then I will explain the solution. The other problem with Amazon S3 is your throughput is bad as well. The throughput is about maybe 70 megabytes per second, and that's also very bad.
Of course, we are an inverted index, and it's a very nice way to reduce the amount of data that you have to read. It's extremely compact. But still, 70 megabytes per second. To give an idea to people listening, SSDs nowadays can show several gigabytes per second of throughput when you read. They're more expensive, but they are way, way faster. In terms of latency, it really varies, but you can go below the millisecond with an SSD. A spinning disk, like the ones that we used to have in our desktops, the ones that make a lot of noise, typically, they were considered already very slow at the time, and the latency was 10 milliseconds. Seven times faster than Amazon S3. It's very challenging.

[0:23:41] JB: That's a great analogy. That's a good one.

[0:23:43] PM: The throughput is about the same. We are trying to make that work. It seems like a stupid challenge. Actually, we tackled it, so I can explain a little bit. For the simple part, the throughput: the throughput is 70 megabytes per second, but it's possible to run as many reads as you want at the same time. What we do is that we just read from several splits at the same time, and we streamline the whole process. When you search, it downloads data from many different places at once. Over the time of your query, from start to finish, on average, we can get the throughput above 1 gigabyte per second. We are able to reach the throughput that you would expect from a bad SSD, which is awesome.

[0:24:40] JB: You want to be able to search all this. Let me just review where we're at, though. I'm getting lost. Hold on. You want to be able to search very, very quickly over something like S3, but it's a problem, because of the way S3 works. It's going to burn a lot of time going through this long chain of reads. It's going to have poor throughput, something else. I can't remember what you said. There's another problem with it. Your premise was, we're going to make search really fast on this thing that fundamentally doesn't work quickly and is expensive when you try and really rev it up. The way you're addressing that is through these three core components: the indexers, the metastore, and the searchers. Do I have that right?

[0:25:22] PM: Yeah.

[0:25:25] JB: You can tell me that's wrong.

[0:25:27] PM: We had a bunch of obvious goals, and the obvious architecture was, okay, we are going to have all of our data on S3, and we will have stateless searchers, and everything will be great. Then we had one big technical trouble, which is, okay, Amazon S3 is slow. How do we deal with this? That's where we are in the discussion right now. What does it mean, being slow? Latency is bad, and throughput is bad. Throughput is bad; that's actually not exactly true. You just have to read from different parts at the same time. It's a very well-known trick. Then your throughput can be much better. You just need to reorganize the way you run your queries to be able to do the IO all at once. The CPU work, you won't do all at once, because that would be terrible. You don't want to do that. The CPU doesn't like to jump from one thing to another. We just do our scheduling in a much smarter way than search engines that do not need to care about that. Then latency, yep.

[0:26:43] JB: You go ahead.

[0:26:46] PM: Latency, it's another beast. A latency of 70 milliseconds is only a problem if you do a lot of hops.
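Before following Paul into the latency discussion, here is a sketch of the throughput trick he just described: issuing many byte-range reads concurrently instead of one sequential stream. `fetch_range` is a hypothetical stand-in for a real ranged GET (an HTTP `Range: bytes=...` request) against object storage.

```rust
use std::thread;
use std::time::Duration;

// Stand-in for a ranged GET against object storage: ~70ms first-byte
// latency per request, modest per-stream throughput.
fn fetch_range(object: &str, start: u64, len: u64) -> Vec<u8> {
    thread::sleep(Duration::from_millis(70)); // simulated request latency
    println!("fetched {object} bytes {start}..{}", start + len);
    vec![0u8; len as usize]
}

fn main() {
    // 16 x 8MB ranges, all requested at once.
    let ranges: Vec<(u64, u64)> = (0..16).map(|i| (i * 8_000_000, 8_000_000)).collect();
    // All 16 reads fly concurrently: total wall time is roughly one
    // round trip plus the slowest transfer, not 16 sequential trips.
    let handles: Vec<_> = ranges
        .into_iter()
        .map(|(start, len)| thread::spawn(move || fetch_range("split-0001", start, len)))
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap().len()).sum();
    println!("read {} MB concurrently", total / 1_000_000);
}
```

The aggregate of many 70 MB/s streams is how a query can average above 1 GB/s, as Paul notes, provided the query planner schedules all its IO up front.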
If you have some kind of chain problem, so if, for my query, I need to read in one place, and then I need to go somewhere else, and then somewhere else, then you will have to pay 70 milliseconds multiplied by the number of hops that you have. You need to rethink your search to reduce the number of hops that you are doing. We are able to do that in four hops today. Our critical chain is four hops, which means that, in a sense, we will never be able to return a search faster than 300 milliseconds. I'm rounding, but this is an interesting property that we have. We are very fast if you have a large amount of data. We can do sub-second search on several terabytes of data. If you have only 1 gigabyte, there are other search engines out there.

[0:27:59] JB: Yes. I think I heard you say that. I was looking at the materials you had shared before, and that's right. I didn't understand that before. Thank you for explaining that. Let me ask you a couple quick questions, because you've touched on indexing and querying. I'd like to ask you a few questions about that and go into it a little bit more deeply. Before I do, what haven't I asked you about the product that I should have at this point, so that people understand the kinds of problems and challenges you're addressing?

[0:28:28] PM: Great question.

[0:28:30] JB: Have I missed anything? We can move on. We can come back to it if you think of something I forgot. I don't want to step over anything, because you have done a nice job of laying out the challenges. We've talked a little bit, at a high level, about some of the architecture, and I'd like to talk a little bit more about indexing. Can you explain a little bit more about your indexing process? Because it has some complicated notions that I wasn't familiar with, such as pruning, was it? And inverted indexes, which I was not familiar with. I think some of your material talks about how that can drive a full-text search faster. Can you help me understand a little bit about what is different in your approach to indexing?

[0:29:16] PM: Yes. Pruning is more something that happens at search time, but it's something that I totally missed, so thank you for reminding me of it. The idea is quite common. You see it a lot in columnar databases. The idea is your index is a union of many smaller pieces. We call them splits, but everyone has a different name for this. One product that everybody knows that does this really well would be Snowflake. In their case, they call them micro-partitions, I think. The idea is, when a query comes in, usually you can extract from the query some predicates that will help you reduce the number of splits that you will search into. The one obvious property is time. Naturally, as you are ingesting your data, if it has a timestamp field, we will store the minimum and the maximum timestamp associated with each split in the metastore. If your query says, I am searching into data within this time range, then we will be able to tell, okay, we only need to search into these 100 splits, which improves the search performance. That's what the pruning is about. We just add extra metadata in the indexing process to be able to do pruning on the search side.

[0:30:59] JB: Okay, I didn't understand that the terminology was so similar across Snowflake. I get that a little bit better. Then the inverted index for supporting full-text search. That's special to you, or?

[0:31:10] PM: No. No, no.

[0:31:12] JB: No, okay.
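The time pruning Paul describes can be sketched as a simple overlap filter over per-split min/max timestamps. A toy version, with illustrative names:

```rust
// Each split's min/max timestamps live in the metastore, so a
// time-bounded query only opens splits whose range overlaps the
// query's range.
struct SplitMeta {
    id: &'static str,
    min_ts: i64,
    max_ts: i64,
}

fn prune(splits: &[SplitMeta], query_start: i64, query_end: i64) -> Vec<&SplitMeta> {
    splits
        .iter()
        .filter(|s| s.max_ts >= query_start && s.min_ts <= query_end)
        .collect()
}

fn main() {
    let splits = [
        SplitMeta { id: "split-a", min_ts: 0, max_ts: 999 },
        SplitMeta { id: "split-b", min_ts: 1_000, max_ts: 1_999 },
        SplitMeta { id: "split-c", min_ts: 2_000, max_ts: 2_999 },
    ];
    // Query for [1500, 1600]: only split-b survives pruning.
    for s in prune(&splits, 1_500, 1_600) {
        println!("search {}", s.id);
    }
}
```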
[0:31:14] PM: Okay. Usually, most people, when they talk about a search engine, they mean an inverted index. If you Google "inverted index", it's the data structure that is driving search engines. It's a very old and classical one, and it's the most efficient way to do search. Still, and that's an interesting debate, all of the stuff that I explained about indexing and the cost of indexing, and how hard it is to do that on a super large amount of data, it's such a big problem that there has been a trend that started maybe, I don't know, eight years ago, or something like that, of what at the time was called index-free search. Some people just decided, okay, the problem is too hard, and we don't search that much anyway. What we're going to do is, we won't build an index, and we will just grep. That's an interesting way to do things. The idea is pruning will save the day. The time pruning that I described will reduce the amount of data that we have to search into, so we can probably reduce a lot the amount of data we need to grep through, and then distribution will save the day. We'll just have 200 searchers, and it will work out. This is a trend that is actually quite popular these days. Loki, for instance, works like that. It doesn't have an index. Then you might have seen that people have decided to handle their logs using ClickHouse, which is a columnar database. That's basically what they do. They use a columnar database, and it totally makes sense. The whole reasoning there is both about engineering and economics. Is it really worth indexing if you don't search that much? That's basically what the reasoning is about. We are basically the revenge of the search index, the inverted index. We are bringing back the inverted index. We say, we solved all of these problems. Look at us, we can search in a large amount of data 40 times faster than what you would have gotten with ClickHouse. Interestingly, ClickHouse is the fastest solution to do that today. Interestingly, one of our customers had trouble searching data in ClickHouse. They do a lot of business analytics, and ClickHouse is absolutely brilliant at that. But search is not its strong suit, and even simplified greatly to be as fast as possible, it was too slow. It's actually possible today to use Quickwit as a secondary index on ClickHouse. ClickHouse is quite –

[0:34:43] JB: Oh, that's a great use case. I was just going to ask you about some use cases. That's a great one.

[0:34:46] PM: Yeah, that's one use case. But yeah. This use case is unfortunately not that often replicated. Most people are not interested in this use case, so I don't usually talk a lot about it.

[0:35:01] JB: They're not? Why not? I feel like they should be.

[0:35:05] PM: I agree. Yeah. ClickHouse is quite expensive as it is, and we have just a few very cheap instances, and our data is on S3. Still, plugging us in makes search forty times faster. The message that I'm trying to put out there is, yeah, inverted indexes, they work really well. That's the big news.

[0:35:35] JB: Yeah. If you're looking to spend less money and get faster results, that might be your choice. Yeah. I do want to talk a little bit about use cases, because we started out talking about logs and traces.
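To contrast the two approaches Paul discusses, here is a toy example: grep-style scanning reads every document for every query, while an inverted index looks up a precomputed posting list for a token. The data and names are illustrative only.

```rust
use std::collections::HashMap;

fn main() {
    let docs = ["error in payment", "user logged in", "payment ok"];

    // Build the inverted index: token -> list of doc ids.
    let mut index: HashMap<&str, Vec<usize>> = HashMap::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        for token in doc.split_whitespace() {
            index.entry(token).or_default().push(doc_id);
        }
    }

    // Grep: cost proportional to total bytes, no matter the query.
    let grep_hits: Vec<usize> =
        (0..docs.len()).filter(|&i| docs[i].contains("payment")).collect();

    // Inverted index: one lookup, then only the matching posting list.
    let index_hits = index.get("payment").cloned().unwrap_or_default();

    assert_eq!(grep_hits, index_hits);
    println!("docs containing 'payment': {index_hits:?}");
}
```

The index-free camp bets that time pruning plus massive fan-out makes the grep path cheap enough; the inverted-index camp bets that paying the indexing cost up front makes each search dramatically cheaper.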
Can you just share with us some of the business narrative? Not telling us anything about your customers, obviously, that you don't want to share, but just at a high level, some business narrative about how Quickwit has been used inside of a complex setup, like the ClickHouse one you just described.

[0:36:13] PM: Yes. When we started, for me and for Adrien... So, we have three co-founders. I should have told you about my other co-founders as well, but –

[0:36:23] JB: I'm going to ask more about that later, but yeah, that's good.

[0:36:25] PM: For me and Adrien, it was our first startup. Our third co-founder had been working in different startups before, so he had more experience with this. But me and Adrien, we had to understand a little bit more about product-market fit, finding the right product positioning, stuff like that. Usually, the startup playbook is, you try to solve a problem, you discuss with your customers, you iterate, and you find the right positioning. Ideally, at first, you should take on a niche problem, a specific problem, as much as possible. We are basically trying to compete with Elastic, which is a product that does everything. Unfortunately, every single person we discussed with seemed very different. There was no emerging pattern at all. We didn't have 30%, or 70%, of the people wanting exactly the same small use case. Yeah, we ended up chasing many rabbits at the same time.

[0:37:45] JB: I think that's so interesting that you mention that. We could do a whole separate show on that, because I think that, yeah, that's the typical pattern. Talk to your customers, understand their pain. But this works for iterations of known solutions. You could iterate: oh, we had this problem, we're going to make it a little better. When you're trying to do something really new, sometimes that doesn't work.

[0:38:06] PM: It's tough.

[0:38:08] JB: Right? Because when you're doing something new and something large scale, sometimes it doesn't quite work to just ask people what they want.

[0:38:13] PM: Yeah. I suspect that the effect was coming from the fact that we were competing with Elastic. If Elastic didn't exist, maybe it would have been much easier and much harder at the same time. The existence of Elastic makes it so that the existence of product-market fit is not even a question. Other products already exist.

[0:38:41] JB: Too much noise in the system. There's too much noise in the system; you can't see what the product-market fit is when Elastic is on the horizon. But I've seen you on YouTube talk about Elastic and the scene, right? And that you're a fan. Why would people move off Elastic? Why would they? Why would they change? What's different now?

[0:39:03] PM: Elastic does a lot of things. They will back your blog search engine and your e-commerce search engine with 100 million products. They will also back your logs and they will do your APM.

[0:39:21] JB: Swiss army knife.

[0:39:22] PM: Yeah, exactly.

[0:39:22] JB: Swiss army knife, right?

[0:39:25] PM: I guess, the main reason we started Quickwit is that we spotted that the architecture is good, but it's not good for append-only data. It's actually extremely bad for append-only data. One very simple case where what they do is very bad is so-called document replication. The way they replicate their data is that the ingested data is sent to different servers. The work of indexing the data is done on every single replica.
Just straight away, right there, they are doing two or three times more work than is necessary. This is a huge problem. That would be one example, but there are many. Yeah, we just did some back-of-the-envelope computation, and we saw that it was possible to do something that was 10 times cheaper, and faster, for our amount of data. We went for that.

[0:40:42] JB: For you, an append-only log of transactions, let's say, that cyber might be keeping, makes sense. Would one of your use cases be append-only for a ledger situation, or?

[0:40:57] PM: Yeah, exactly. We usually say we are a log search engine, because people don't really think of all of the use cases. They don't understand what append-only means. We sometimes just say logs. But yeah, a ledger works as well.

[0:41:18] JB: Append-only, let's just go back to that, though. I'm sure it is a very sophisticated audience here, but we like append-only because it gives you great consistency, and you have these immutable objects. This thing happened, and then this thing happened, and then that thing happened, and you can always look back and see it. That's what I think of when I think of append-only. What else should I know?

[0:41:42] PM: Yeah, that's right. At the core, the idea is you don't have many problems with consistency. You don't have trouble worrying about what should happen if you have two transactions happening on different nodes. What should be the eventual outcome? If you start doing deletes, then there is, oh, did the delete happen before, or after you added the document? You have trouble like this.

[0:42:15] JB: I'm more familiar with that than the log situation, because I worked in finance. Because if you have ledgers being written, say, at multinational banks, writing into one account from all over the world. Let's say, you're in Paris buying something, and I'm your sister, and I'm buying something on the same credit card in Detroit. Which one writes first? Do we have enough money in the account? It actually gets tricky really fast. Even though you would think it's just one thing being written at a time, it gets extremely tricky, and very hard to search.

[0:42:45] PM: Yes. Then there is the problem that I described before, which was, you need to write in many places. Yeah, deletes are tricky. We didn't talk about Tantivy, but Tantivy handles deletes, and it was such a headache to get it right. It's just so hard to have fast indexing and still be able to delete, and to keep this idea that whether a delete happens before or after you added a document should determine whether that document is present in the index.

[0:43:25] JB: This deleting notion is really, I don't know. It's like home dentistry. We never used to do it, right? We never used to have this notion of deleting, until we had all this discussion around GDPR and permanently removing. Permanently removing anything from the data is always so tricky to do by itself. Then the downstream implications can be many.

[0:43:50] PM: Yeah. For people who are scared of using Quickwit because of the lack of deletes, there are two things we handle. The first is, obviously, retention. You can set up, "Oh, I'd like to keep my data for, for instance, three months," or whatever you want, and the system will automatically delete your data when it moves out of retention.
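Retention-based deletion, as Paul describes it, is cheap precisely because splits are immutable: a split whose newest document has aged out of the window can be dropped wholesale, with no rewriting. A minimal sketch, with illustrative names:

```rust
struct Split {
    id: &'static str,
    max_ts: i64, // newest document's timestamp, in seconds
}

fn enforce_retention(splits: Vec<Split>, now: i64, retention_secs: i64) -> Vec<Split> {
    splits
        .into_iter()
        .filter(|s| {
            let expired = s.max_ts < now - retention_secs;
            if expired {
                // In a real system: delete the object from S3 and the
                // corresponding row from the metastore.
                println!("deleting expired split {}", s.id);
            }
            !expired
        })
        .collect()
}

fn main() {
    let now = 10_000_000;
    let three_months: i64 = 90 * 24 * 3600; // ~7,776,000 seconds
    let splits = vec![
        Split { id: "old", max_ts: 1_000_000 },
        Split { id: "recent", max_ts: 9_500_000 },
    ];
    let kept = enforce_retention(splits, now, three_months);
    println!("kept: {:?}", kept.iter().map(|s| s.id).collect::<Vec<_>>());
}
```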
We do handle large requests, like GDPR requests, or churn, because we also have an interesting property for people doing multi-tenant search. We are quite interesting for people who are running a SaaS, or stuff like that. They need to address the problem of, oh, my customer left and I must remove all of the data associated with this customer. We handle that as well.

[0:44:43] JB: Is that around GDPR requirements, for the most part? Or why else would we be deleting?

[0:44:50] PM: Yeah. Yeah, GDPR is one thing. Then we have this churn thing, which is another one. The churn is just, okay, they left, and they probably want us not to keep their data. Also, it costs us to keep it, as the retention may be set to three years. We don't want to keep that around.

[0:45:13] JB: Oh, just regular retention rules. Okay. Yeah. Okay. Let's talk a little bit about the company and your founding of the company, because this is your first startup, or no?

[0:45:25] PM: Yes, it's my first startup. All of my career, I've been working in different companies building search engines. It's my first startup.

[0:45:34] JB: Yeah. Tell me a little bit about your background before you founded Quickwit. You have been very focused on search your whole career. What is it about search?

[0:45:47] PM: Yes. It happened randomly. I put my first foot into search when I was young. I got a job at this company called Exalead. You probably don't know it, but it's a very interesting company. It was a French search engine company. Enterprise search used to be a very hot subject in the startup and VC world. Most of the startups were actually located in Europe. A lot of them were in England, Norway and France. A lot of them still exist. One of them was Exalead. They were building a search engine that scaled very interestingly. They had a web search engine just to showcase the fact that they could scale to web scale. Actual web scale, not joke web scale. It was a lot of fun. A lot of great engineers. The company got acquired by Dassault Systèmes. A lot of the great engineers left and spawned many startups. For instance, you might know Algolia. Algolia was spun out by two engineers from Exalead that I knew as well. Also a unicorn. A lot of stuff got spawned out of Exalead. I worked there a little bit before leaving for Japan. I worked at Daiki after that for a little bit. My wife is Japanese and she didn't like living in France. She wanted to come back to Japan, so I kept an ear out for what kind of job opportunities there would be in Japan. A recruiter from Indeed Japan contacted me. What's funny is that they didn't know that I spoke Japanese. They didn't know that I had studied in Japan. This was pure coincidence. He was just looking for search engineers. I was like, "Yeah, I'm interested, actually." I went and worked at Indeed Japan. Then I moved to Google Japan, into a search team in both cases.

[0:48:37] JB: That's meant to be. That's fate.

[0:48:41] PM: Yes. When I was at Indeed, I started this project called Tantivy. It's an open-source library that is nowadays quite popular, especially amongst search engineers. Maybe not everyone knows about it.

[0:49:01] JB: I wish I had started there. It is incredibly popular. I've seen you present about it a couple of times in recordings. I mean, it's an amazing accomplishment just to have so many people using it. I thought I heard you say you started it because it was a bit of an experiment for you.

[0:49:20] PM: Yeah.
I wanted to learn Rust. Everything at Quickwit is developed in Rust. I just wanted to learn the language. I don't like hello world tutorials. You will build some love for a language without testing the stuff that actually matters. You will be like, "Oh, this [inaudible 0:49:38] implementation, it looks quite –" But actually, what I wanted to test in the language is, can I build actual real-life stuff? Is the IO story good? Is error handling good? Can I do multithreading in a safe way? Those kinds of questions. That's what I really wanted to test. At the same time, I was a Lucene user, and I wanted to check that I really understood all of the internals of Lucene, because it's extremely important to know how it works to actually get the best performance out of it. I said, okay, I'm going to implement a search engine library and we'll see how it goes. After, I don't know, maybe a few months, I had something that was working. It was able to index Wikipedia. It took 30 minutes, which was not great, but it was working. It was already quite nice. Interestingly, I never restarted from scratch. It's my first Rust project, and I never went, okay, let's throw all of this code away and restart now that I know Rust. Rust is supposed to be a very hard language, but it's very good. It's possible to refactor it really nicely. Yeah, it ended up working out.

[0:51:12] JB: We should do a different show. I want to do a show on Rust. I forget who I was talking to about that. Because everything in Quickwit is Rust, right?

[0:51:21] PM: Everything in Quickwit is Rust. Yeah. I think this language is brilliant. I suspect that my opinion on it is not necessarily a big consensus. For instance, a lot of people are saying that –

[0:51:41] JB: I don't think it's controversial. I think all the cool kids are using it. All the cool software developers love Rust.

[0:51:47] PM: No. But for instance, a lot of people say that Rust cannot be used in many places because the learning curve is too hard. I keep telling people that I was more proficient in Rust than in, say, C++ within weeks. Two weeks of doing Rust –

[0:52:08] JB: Why do you think that is?

[0:52:10] PM: I was more efficient, more productive than in C++. Maybe more productive than in Java. The learning curve is not that bad. People just think that I'm lying when I say that, or boasting. I need to explain another thing. I've never been able to learn GoLang. It's too hard for me. It's supposed to be a very simple language. I don't get it. It just doesn't click. Maybe one reason why I picked up Rust rapidly is because I knew C++ beforehand; it's cheating. A friend of mine was a bit of a genius at learning languages. He studied in Japan at the same time as me. I studied at the University of Tokyo. He always had the worst possible advice for learning Japanese. For instance, he was like, "Okay, if you pick a class, just always try to pick one or two levels above your current level. If you are a beginner, just go for intermediate, or expert. Because anyway, after one month, you will be the best kid in the class." It's like, that's not advice. It only works for you. It makes no sense.

[0:53:34] JB: That's like a prescription. It's a prescription for heartache. It's just feeling bad about yourself.

[0:53:37] PM: Yeah, exactly. Then the second piece of advice that he had was, "Oh, Japanese is relatively easy. The trick is you learn Chinese first." I cannot tell people the trick to learn Rust is to learn C++ first, of course, because C++ is extremely hard.
[0:54:00] JB: That's right. That's a high bar. It's a high bar.

[0:54:04] PM: I can advise people who are used to C++ to try Rust. They will see that it's not as hard as they think. It's much, much simpler than learning C++, by a very, very high margin.

[0:54:17] JB: I like that piece of advice for people who have been using C++. I also like the advice of creating a project that really helps you understand the inner workings of the entire system, rather than doing a hello world type of tutorial, which doesn't teach you as much and can be tedious. That's good advice.

[0:54:41] PM: I was mostly working in Java at that time. People are surprised when I say that I like Java. I think it's a good language. It scales nicely if you have a huge team. The compiler is fast. Tooling is good. It's okay. But Rust gives you this feeling of safety that is even higher than Java. For instance, you don't have any null pointer exceptions. You don't have trouble writing multithreaded code. Java is supposedly much better than C++ in that regard, but it's really hard actually, and I've seen so many people writing multithreaded code in Java that was incorrect. I don't know how they sleep at night.

[0:55:41] JB: It's very easy to write a lot of very mediocre code in Java that's not efficient.

[0:55:49] PM: I was supposedly a seasoned software engineer, but I was scared every single time I wrote multithreaded code, because every single time I did, I had a bug. I wouldn't discover it right away; it would take time. The semantics behind keywords, like final –

[0:56:07] JB: Where does this show up?

[0:56:08] PM: – it's very complicated. It's really tricky. Yeah, Rust removes all of this. You can write your code, put it in a box in a small module, and you get this feeling of safety. You know you won't have to reopen that box anytime soon, so you can focus on architecting your code the right way. Then the big, big thing that I loved was performance, of course. It's not just that the language is producing better machine code. It's also that it's very easy for me to look at the assembly code generated by the most important functions in Tantivy. I do that a lot. I use Godbolt a lot, a lot. I can always tweak my code a little bit and make sure that the assembly code produced is what I want. In Java, you will never be able to do that. It's very hard to just know what assembly is produced. The JIT will get in the way and you don't have that many knobs to turn to actually get it to produce what you want. Yeah, that was very important for performance.

[0:57:32] JB: Here's what I've learned about you. You really like to know how things work, right down to the assembly line, right?

[0:57:36] PM: Yes, absolutely.

[0:57:39] JB: You may have had a different life as a race car driver, because you have a great love of speed.

[0:57:47] PM: You said you knew Algolia, right? The CEO of Algolia, [inaudible 0:57:52], is now driving racing cars. He's gotten into it.

[0:58:02] JB: Oh, my gosh. Oh, my gosh.

[0:58:04] PM: It's so funny that you say that.

[0:58:05] JB: See, I knew it. I knew it. Well, it's been really great to have a chance to talk with you. Let me ask you one question before we go. What is coming up next for Quickwit? I know you have had some releases just recently. Can you help us understand where you're going?

[0:58:23] PM: Yes. We just released Quickwit 0.6. It completes a lot of the missing features for users who want to use Quickwit as an open-source product to index their logs and traces.
The most important part was support for Grafana. We now have a Grafana plugin, so you have an actual UI to look at your logs. It's both parts: the Explore part, to look at your logs and drill down and do stuff like that, and also the dashboarding part, because we do handle aggregations and stuff like that, so you can also build graphs and populate your Grafana dashboards using Quickwit.

[0:59:15] JB: Those are some big features. Those are some big features I know enterprises will love.

[0:59:21] PM: Then one huge feature that we added is an Elasticsearch-compatible API. It was requested by one of our customers. Yeah, I think it will help a lot of users who want to migrate from Elastic to Quickwit.

[0:59:43] JB: I know I said I was going to let you go, but now I have more questions. How's business going? How do companies typically start working with you? On a small project, or a big effort? What is the pattern?

[0:59:56] PM: Yes. Right now, we have signed two customers. It's slowly going.

[1:00:05] JB: Early. Early days.

[1:00:06] PM: Early days. Yes. One of them is quite big. I cannot give too much detail, but it's a big contract and a huge amount of data. It's challenging. Right now, our main objective is to finish this project and put it in production. That's our main objective.

[1:00:34] JB: I was just curious, because of what you said about how the bigger the data, the more effective your architecture is. However, people with a ton of data have a hard time saying, "Yes. Let's start this huge data project."

[1:00:49] PM: Once we have them in production, we will be able to answer that.

[1:00:55] JB: You can show that value.

[1:00:56] PM: When someone comes with 1 petabyte of data and asks if we can do it, we can tell them, "Yeah, we're in production with 1 petabyte of data and it works well."

[1:01:09] JB: The cost savings alone probably is the turning point, right?

[1:01:14] PM: Yes, that's our main objective at the moment. Usually, people come to us; we don't have any sales. People come to us from listening to your podcast, for instance.

[1:01:27] JB: Of course. Of course. That's going to be the main channel now. I think that's going to be the big channel for you. This podcast.

[1:01:34] PM: Yeah. We have a small pipeline of people with very interesting use cases.

[1:01:41] JB: Like design partners and potential customers. Then the pattern I was describing is that they'll do a test with a couple petabytes of data working with you, see the outcome, and then identify larger use cases, or potential contracts down the road.

[1:01:55] PM: Yes. Usually, the test is with a few terabytes, not petabytes, which is – we had some –

[1:02:03] JB: Oh, sorry. I misspoke. Yeah, that would be hard to support. Sorry.

[1:02:08] PM: Then sometimes we see, obviously, people using us in the open-source world as well. It's a little bit difficult to track this information. We have telemetry, but it's possible to disable it; it's opt-in. Sometimes we spot gigantic users with an insane amount of data. People indexing 200 terabytes a day. They never talked to us. Good for them. They just disappear. It's a bit sad. Yeah, we have a little bit of open-source traction. I hope that it goes well, especially with tracing these days.

[1:02:59] JB: Well, in the data space, we spend so much time on these explosions of data, right? The data is just exploding.
Even though I've been working in data for so long, sometimes it's almost intellectually hard to grasp how much there is. It's just growing. I think a solution like yours is riding the fundamental economics, right? There's just so much more append-only data being built out there. Enterprise alone; I don't even know about medical, or government use cases, but I know finance, and it's massive. I can only see a great trajectory and more appetite for replacing Elastic with something that specifically focuses on this sort of problem.

[1:03:43] PM: Yeah. I'm sure a lot of people will come with ever-growing –

[1:03:49] JB: I'm only telling you what you already know. Just to say that there's still more. There's still more coming, not just dealing with what we have today. Because I was going to say, okay, he knows Japanese and Rust and C++. That's enough things already for you to know. Somebody else does the music in the household; that's good. Well, thank you very much. Looking forward to maybe having you come back in a few months and letting us know how it's going.

[1:04:16] PM: Oh, I'd love to. You should do an episode about Rust. Yeah, I'd love to.

[1:04:21] JB: Yes. Would you help me with my dream of doing a show about Rust, even though I'm not a Rust developer? I feel like I can pull it off.

[1:04:28] PM: Yes, of course. It will be awesome.

[1:04:29] JB: All right. Well, thank you so much.

[1:04:30] PM: Thank you, Jocelyn.

[END]