EPISODE 1719 [INTRO] [0:00:00] ANNOUNCER: DuckDB is an open-source, column-oriented relational database that was first released in 2019. It's designed to provide high performance on complex queries against large databases and focuses on online analytical processing workloads. Hannes Mühleisen is the co-creator of DuckDB and is the CEO and co-founder of DuckDB Labs. He joins the show to talk about drawing inspiration from SQLite, why DuckDB was written in C++, the novel data processing scenarios it enables, and more. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com. [EPISODE] [0:01:19] LA: Hannes, welcome to Software Engineering Daily. [0:01:21] HM: Thanks for having me, Lee. [0:01:23] LA: Great. So, we're talking about DuckDB today. After reading a little bit about DuckDB, I have to be completely honest, I had not heard of DuckDB before I started preparing for this episode. But it sounds actually quite interesting. My first thought was, it sounds kind of like SQLite on steroids. Now, you probably cringe when I say that. But what do you think about that comparison? What would you say to people who make that comparison? [0:01:50] HM: I think that's actually a pretty good comparison. In fact, we took a bunch of inspiration from SQLite when we started making DuckDB in the first place. One of the things that we really loved about SQLite, there are many, but the first thing we really loved about SQLite was its ease of use. Maybe you have used SQLite. You know this, but there's no server to manage. The database is just a library. It's a piece of code you glue to your application. Then, the database file itself is just a file. You can instantiate the library, talk to the file to run queries, do table stuff, it's all fine. I think it's really impressive how easy SQLite is to use. That's one of the inspirations that we actually took to heart when building DuckDB, that the look, the feeling should be kind of similar in the sense of simplicity of use. Being able to just use a database as a library, one that runs in the same process, is really, really interesting, because it both simplifies handling and has a bunch of really great advantages in terms of performance. Then, the steroids part, I'm not offended by the "on steroids" concept. We often call DuckDB the SQLite for analytics ourselves. So, there's maybe something to understand here, that there's this difference between the transactional workloads that SQLite was built for, which is changing rows, updating rows, inserting rows, deleting rows, changing individual values, and the analytical workloads that DuckDB was built for, which are more like, "I'm going to run a query on a million rows or a billion rows, and I'm going to join ten billion-row tables together." That's more the analytical use case. SQLite is not designed for it. It's not built for it. Its engine is not designed for it, but DuckDB's engine is.
We took some of the inspiration from the architecture, but we basically swapped the engine to an analytical engine. In general, it's a complete rewrite - not a rewrite, it's a complete sort of separate-from-scratch implementation. We did borrow their shell, however. So, if you use the DuckDB command line interface, it's actually a patched version of the SQLite shell. [0:04:01] LA: Oh, cool. Cool. It'll look very familiar to people actually using it. Is there API compatibility with it? I know, obviously, from an SQL standpoint, that we can talk about the details there. But the in-process API itself, is that compatible? [0:04:16] HM: Yes and no. Well, that's the best answer. So, we do have something called a SQLite API wrapper, which makes the DuckDB API look and behave exactly like the SQLite API. In fact, that's what we're using in the shell. The shell is actually using the same sort of API calls it normally uses to call into SQLite to talk to DuckDB, and there are some layers of software that make that happen. So, we actually have a compatibility layer. If your application runs on SQLite, chances are you can just swap that out with DuckDB without too much hassle if you're using the C API. But DuckDB itself has two main APIs. There's a C API, which is different from the SQLite one. Again, that has to do with - I mean, why make a new API if there's one that people already use and like? Well, there's a reason for it. Because DuckDB works with analytical datasets, we use a columnar data representation, and that means that our client API also needs to be columnar, because one of the cardinal sins people commit with data systems is to have a columnar system and then slap an old-school row-wise API on top, like the Postgres API and protocol, for example. Then you kind of lose a lot of the advantages as data is going in and out of that engine, because you have to convert. For that reason, we actually designed a new API that is columnar, sort of designed to access data that comes out of the engine and to put data in. This is why we have our own API that's separate, different from the SQLite API. People that use the SQLite C API will probably recognize a bunch of things, though, because it looks a bit similar. But DuckDB itself is written in C++. So, the C API is also another wrapper around the internal C++ API that you can also use to talk to DuckDB. And so, that's the main API that people program against, although we are trying to move a bit more to the C API at the moment. [0:06:02] LA: So, built in C++ vs. C, and SQLite was built in C. Why C++? [0:06:09] HM: Database systems are old school in a way, right? If you think about things like SQLite, indeed, or Postgres, or MySQL, or pretty much all of them, they're all written in C. That has to do with, I think, two things. First of all, when many systems were created, C++ simply did not exist, or C++ for a long time had, like, a bad rep if you want. [0:06:32] LA: A bad rep. [0:06:33] HM: For some time, there were some issues with the performance of exceptions, and things like that. So, data systems people, being somewhat traditional and conservative, I think it's fair to say, went for the old-school C. I've spent a lot of time in my career writing data systems in C, or working on data systems that are written in C, and C99, great stuff.
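[For illustration, a minimal sketch of the row-wise versus columnar result access Hannes describes, using DuckDB's Python client; the query and values are made up, and NumPy is assumed to be installed.]

    import duckdb

    # A lazily evaluated relation over a small generated table
    rel = duckdb.sql("SELECT range AS i, range * 2 AS j FROM range(5)")

    # Row-wise access, SQLite-style: a list of tuples
    print(rel.fetchall())

    # Columnar access: a dict with one NumPy array per column,
    # much closer to the engine's internal layout
    print(rel.fetchnumpy())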
But when we started DuckDB in 2018, modern C++ had sort of been around for long enough that it was actually a zero-regret abstraction, where you could use things like exceptions, and smart pointers, and things like that without fearing horrible performance regressions. Still, you have to be careful. The old-school inheritance that C++ gives you is something that we cannot use in the hot path of our data system at all, because there are all these virtual function calls that just basically don't work there. It's very tempting to write an abstract sort of database system that is polymorphic on the types. There are different data types, and you have to do different things depending on which type. Sounds wonderful, but it doesn't work in practice, because of the overhead of virtual functions. So, you do end up writing C++ that looks a lot like C in some ways. But we are very, very happy with C++. We use things like templates that make it very easy for us to generate code at compilation, which means that we can expand code into simpler code that the compiler can then efficiently turn into machine code, these kinds of things. I think C++ was kind of a great middle ground between the new world and the old world at the time in 2018, in that we could avoid some of the horribleness of C with regards to error handling and memory management stuff, and not pay the performance penalty, or the portability penalty, of some of the more modern languages, right? [0:08:20] LA: Yes. In my early career, I did mostly C++. In the nineties, in the early 2000s, I did a lot of C++. Back then, you're right, it was starting to get a bad rep. There were some very specific, what I consider more corner-case, use cases that just didn't have the performance that you wanted them to have. But for the most part, it performed just as well as C, and it gave you some additional features. Better memory management was one of the main things that I really loved about it. Having the ability to build something that could be memory-leak free, or at least as close to it as possible, without having to go through a lot of gyrations and a lot of hoops, was one of the things I really liked about C++. [0:09:08] HM: Yes. We also really liked that. We make heavy use of smart pointers in DuckDB. As a result, memory leaks have not really been an issue. They pop up every now and then, usually in client APIs that have to talk to some other language ecosystem that then does its own memory management, and then things go wrong. But inside DuckDB itself, it's quite rare, because we can use smart pointers. The destructors cleaning up memory generally work well. You can have fairly complex data structures that just clean up after themselves. It's great. It's something that the C world still doesn't really have. C world sounds like SeaWorld, right? No relation. The C world doesn't have it, and it means you spend so much more time. I think, from my perspective, it's a bit more like, okay, the productivity of developers is higher in C++ because there's a bunch of stuff that you don't have to deal with. Still, C++ isn't the most modern language out there. There are, of course, newer ones like Go and Rust and all that stuff, that try to fix more things. But for us as sort of low-level systems people, which I think we are, it's a good compromise. We are actually also quite conservative in C++. DuckDB uses C++ 11.
You would say, "Why on earth are you on a 13-year-old version of C++ with your top-notch data system?" There's a very simple reason for it. We run on a bunch of platforms where GCC four-point-something is the newest compiler available, and that can only do C++ 11 reliably. So, we actually made a conscious decision to be on a slightly older version that has most of the nice things, but not the crazy nice things, just because we want to maximize compatibility and portability of the system, as opposed to having the latest and greatest in the language. There's also an advantage to simplifying a bit, or just staying on slightly simpler technology on purpose, which is that the code becomes way more accessible, right? If you're doing crazy metaprogramming hacks in templates, it will be much harder for people to just jump into your codebase and maybe contribute, or fix a bug, or track down a bug, things like that, as opposed to, "Yes, it's C++ 11. It's well understood. The tools like it. The debuggers like it. It all works really well." [0:11:21] LA: There's something to be said for following simple guidelines that work well - especially for an open-source project like this, which we're going to get into in a little bit, the open-source philosophy behind it. But let's stay with the SQLite comparison for a little bit here. If we look at the use cases for SQLite, there are really two rather extreme, independent use cases. One of them is simple startup of development environments. You can start something and use it and it's easy to use, and it just works. As a development tool, it's good enough. You can do a lot of really neat things with it. The second use case is in production systems, for small-footprint embedded software applications, on-device applications, or in-field devices, and the database usage in those cases. Those are really the two main use cases for SQLite. What are the main use cases for DuckDB? And how does it compare there? [0:12:22] HM: Yes. I think that's an excellent question. I would say there are maybe three use cases, as opposed to the two that you just mentioned for SQLite. I think we share the aspect of people using it as a prototyping environment, because, I don't know, Snowflake is expensive, and you might as well run your queries locally on DuckDB to iron out most of the problems. I mean, why not? People do that a lot. But then there are two more. One of them is, we don't really have this kind of tiny device thing, because with DuckDB, realistically, you need a couple of gigabytes of RAM to really take advantage of it. I mean, it can run on 500 megabytes or something, but it's really pushing it. SQLite can even run on statically allocated RAM. It can run anywhere. We can run on many things. People run DuckDB on phones, on watches, and these kinds of things. But these are pretty powerful computers at this point. I think the second use case is to do data analysis on your local machine, or your Jupyter Notebook that you run on a cloud device, whatever. This is the local data analysis aspect. So basically, when we started DuckDB, we kind of built it for this case, where we said, people are doing a lot of stuff locally. In fact, many, many datasets do fit on single machines. Now, let's give people a tool, a local data system that you can run, instead of spinning up yet another Spark cluster. For many problems, you can just run a local instance of DuckDB on your machine, on your laptop - they are pretty fast, and they can go pretty far in terms of data set sizes.
People use it a lot for that. I would say it's a very big use case. It's just people writing Python scripts, people writing R scripts, people writing their own thing in Go or whatever, using DuckDB on a local device to just crunch through some data: to read a CSV file, to convert it to a parquet file, to read a parquet file, to write, I don't know, run an aggregation, run a big join, these kinds of things. Ad hoc, I would call this. It's ad hoc analysis. The third one, and that's one that kind of surprised us, is that because DuckDB is so small and nimble, and you don't need to manage a lot of state or anything like that, people also use it a lot in sort of enterprise data pipelines. So, when you are running some sort of weird, long data pipeline, the one thing you don't want to do is to run a cluster next to it that runs something else, and then your pipeline services have to call into the cluster, you get this load-balancing issue, and it's not very pretty. You kind of would rather put this data processing - the reshaping part, cleaning part, quality assessment, all that stuff - into the pipeline itself. And DuckDB allows you to do that, right? Because we take a couple of milliseconds to spin up that database instance from scratch. You can run arbitrary queries with a state-of-the-art query engine - we haven't talked about that, but we can talk about that, it's a state-of-the-art query execution engine. Then, you can, I don't know, hand the data off to the next step. That's the use case where it surprised us a bit how popular it was. That's, I think, the third big use case of DuckDB, where it becomes a component. People even run DuckDB on things like lambdas, right? That's a use case, because you have a lambda. Okay, that's fine. You wouldn't start Postgres in your lambda. That makes zero sense. Postgres takes, I don't know, five seconds to initialize. This is way too long for a lambda, but a couple of milliseconds for DuckDB, fine. One thing that people sometimes forget is that DuckDB can run on anything. It's almost like one of these tank engines that can take any kind of fuel. We don't have to ingest data into our own storage format before we can run queries. DuckDB can run on CSV files. It can run on parquet files. It can run on Arrow structures. It can run on God knows what. It's pretty versatile in that sense. You don't have to actually ingest data, which means that if your data pipeline happens to produce a parquet file, and you need to filter out these three annoying rows, it's like a one-liner in DuckDB, and people use that, almost, to just plop it in and use it as a component. That's really cool to see. We were kind of surprised by that. [0:16:19] LA: So, you ingest the parquet file, make - [0:16:21] HM: You don't have to ingest it. You just run the query on top directly. [0:16:25] LA: Really. So, you actually aren't - so I know one of the benefits you touted was the ability to read and write multiple file formats. I assumed that was an ingestion and export process. It actually runs natively on top of those file formats. [0:16:40] HM: That's right. So, our query processing engine is built with an abstraction on the data source. We have our own storage format. It's a great storage format. It's the best storage format. But if you have data somewhere else, in, let's say, parquet files, also a pretty good format, or CSV files, not such a great format, but still very much used, right? You don't have to actually ingest.
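[A small sketch of the ad hoc, run-directly-on-files style of use described above, in DuckDB's Python API; the file and column names are hypothetical.]

    import duckdb

    # Filter out the "three annoying rows" a pipeline produced and write a clean copy,
    # without ever loading the Parquet file into a database first
    duckdb.sql("""
        COPY (SELECT * FROM 'pipeline_output.parquet' WHERE status <> 'invalid')
        TO 'pipeline_output_clean.parquet' (FORMAT PARQUET)
    """)

    # Convert a CSV file to Parquet, another common one-liner
    duckdb.sql("COPY (SELECT * FROM 'input.csv') TO 'input.parquet' (FORMAT PARQUET)")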
So, you basically can point DuckDB to those files. They can also be on S3. It doesn't really matter where they are, they can be local, they can be remote - you point it to those files, and say, "I want to run this query." The engine, DuckDB, will actually directly execute the query on the file without ever ingesting. It will do clever things - on a parquet file, for example, if you're not using a column in your query, it will never read those bytes from disk, because it doesn't have to. From the structure of the file, it knows where the interesting things are. It will use statistics to prune entire sub-trees of the query, because it knows that this file has no values that can fit this part of the query, so we might as well not do that. It's very clever. Our own storage format - the DuckDB file format doesn't have a name, it's just the DuckDB file format - is slightly better here and there, because we have slightly better statistics. But in general, you can run DuckDB on almost anything. It works really well on things like parquet and Arrow. It also works really great on CSV files, although the optimization opportunities are a bit more limited due to the nature of the file format, right? But I think we have the world's best CSV reader at this point, because we really care about you being able to read data without ingesting. [0:18:14] LA: This is really cool, because one of the things I do a lot is I have applications that import from CSV files, and having to deal with all the different file formats and all the different structures and stuff is kind of a pain to deal with in an application. But I imagine that, with DuckDB, I could just read the file with whatever structure it has, and then interpret the fields and do the magic work I need to do from within an SQL database, instead of with some sort of CSV library that's trying to read it in. [0:18:44] HM: That's absolutely true. You can point the engine to the CSV file. As I said, I think we have the world's best CSV reader. There's actually a reason for it. This is not just because we deeply love CSVs. We don't. [0:18:56] LA: Nobody deeply loves CSVs, but that's okay. [0:18:58] HM: There might be some weirdos out there. Who knows? [0:19:01] LA: They're useful in important places. [0:19:02] HM: It's true. People have a love-hate relationship with them. I think that summarizes it fairly. But we have the world's best CSV reader, because it's the first thing you do with your data system. And if that doesn't work, you're immediately frustrated. So, we realized, hang on, we've got the installation covered. That's pretty easy, because we have this old SQLite-style deployment. Let's deal with the second thing people do, which is this ingestion part, right? Not just for copying it into a table, but also just for running queries on. If that is a gigantic pain, then you're already frustrated. It has certainly happened to me like 20 times. I start trying a new system, I have some files lying around, I want to ingest them, and it's like six hours of screaming before that works. Which is why we invested heavily. Actually, one of our guys, Pedro, has been working on that for a long time now. I think he's done a really awesome job in getting a great CSV reader together. It's also fast. Have I mentioned that? [0:20:02] LA: I've already, in my mind, just with some of the things I'm doing for reading CSV files, I see huge advantages where I'd be able to use it.
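[A sketch of querying files in place, including on S3, as described above; the bucket, paths, and columns are hypothetical, and either a public bucket or configured credentials are assumed.]

    import duckdb

    # Remote files need the httpfs extension
    duckdb.execute("INSTALL httpfs")
    duckdb.execute("LOAD httpfs")

    # Aggregate directly over Parquet files in S3; only the columns the query touches
    # are read, and statistics let the engine skip irrelevant parts of the files
    duckdb.sql("""
        SELECT event_type, count(*) AS n
        FROM read_parquet('s3://my-bucket/events/*.parquet')
        GROUP BY event_type
    """).show()

    # The CSV reader sniffs the dialect, header, and column types for you
    duckdb.sql("SELECT * FROM read_csv_auto('messy_export.csv') LIMIT 5").show()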
Is it overkill to use DuckDB to read CSV files and then do manipulation through SQL? Is that overkill? [0:20:22] HM: No, I wouldn't say so. I think that's perfect. I think there are very few things that you can do with DuckDB that are overkill, right? Because it is so nimble, the binary is like a 20-megabyte thing. You can just install it from pip, or you can download the CLI or whatever, and it's up to you. But if you're running a query that does a single filter on a CSV file, I think it's already a much better user experience than anything else that's out there, right? You're going to use grep for that? Sure. That's not going to be fun. We see that a lot, that people just use this very trivial query: from this file, do this minor transformation and write out this other file. Our engine is perfectly happy with that, we're happy with that, and it will be parallelized, which is otherwise also a gigantic pain to organize. It will be automatically parallelized. It will be using highly efficient primitives from our execution engine. It will also be streaming, so that means it doesn't actually have to materialize any of that in memory. It will just go through in a streaming fashion. I don't think there's a use case too small. And I think that's also what sets DuckDB apart from the competition, if you want. There are all these big iron systems out there. But as you intuitively said, is there a use case for which it would be overkill? The standard thing is like, if you wanted to just read something from a CSV file, setting up Postgres - if the Postgres CSV parser were any good, which it isn't - would be massive overkill just to read a CSV file. But it's not overkill at all, in my opinion. Obviously, I'm biased. But in my opinion, it's perfectly reasonable to just run a single query and only ever do that, and do it all in SQL. Why not? Right? [0:22:02] LA: The startup time is so short, who cares? [0:22:04] HM: There's also nothing to really do, right? What does initializing a database mean? Maybe parsing configuration? I think that's pretty much it, right? [0:22:13] LA: You don't have to read the file in at initialization time. [0:22:18] HM: We also will be lazily looking at metadata, these kinds of things. It will not do anything unless it has to. So, I think it's an interesting sort of philosophical difference. There's this company called Tailscale. I don't know if you've heard of them. [0:22:33] LA: Yes. I've heard of them. [0:22:33] HM: Did you know that the name was a joke, actually? Because that's the lore I heard. The company was started by Google people and inside Google, so I'm told. I've never worked for Google. They were making these jokes about the tail-scale problems, like the tail scale, the small problems with a long tail that we shouldn't be talking about because we're Google and we only care about like the - [0:22:55] LA: The big part, yes. [0:22:56] HM: The beginning of the tail, the beginning of the, I don't know, whatever it's called, the head of the distribution. [0:23:01] LA: Low variation, high use. [0:23:04] HM: Then this company, Tailscale, was founded to be like, "Hang on. If we fix the problems of the people that are in this long tail, this is also worth something, and they will probably give us loads of money." It's true. With DuckDB, it's kind of the same thing.
We set out to solve the data problems for the 99% of us, because it's also, in my world - I come from academia, where it's like, you read the papers, who has the biggest data, or something like that, right? It's so common to solve the problems of the one-percenters. I know this is a bit of a war cry. It has been a bit. But the point is, people love solving the problems of the one-percenters. People love solving Google's data problems in their research, in their work on systems. Because, yes, if it doesn't scale, it's useless. But we completely forget that 99% of data problems don't actually need any of that. We have just made the life of people that have these problems so much harder, because our systems only work if you run them on 100 nodes. Or I don't know what the minimum Spark cluster size these days is. But it's big, right? By focusing on this, on the tail scale, if you want, I think we can provide a lot of happiness. I love data systems. I think they should be fun to use and make people happy and not make people frustrated, which is, I think, the state of things in our - [0:24:19] LA: Makes a lot of sense. Now, a lot of this you can do as far as, like, the streaming of the incoming data versus storing it in memory, et cetera, because you're focused on the analytic side. So, you're doing a lot of select calls, analytic select calls. But I imagine things like inserts and updates then become problematic, or at least slower. For instance, how do you insert into a CSV in a performant way? I'm assuming you can do those things. But I imagine that's less performant in those cases. Is that a fair statement? [0:24:53] HM: Exactly. So, we can't actually insert into a CSV file, because when I mentioned we can run on data formats, we can read them in a select query. We can't necessarily run transactions on them, right? This is a different problem. So, we can read from them, and we can write into other files in some formats. As I said, parquet files, for example, they don't really like being changed. It's very difficult to append at all to a parquet file. It can be done, and we have some ideas, but it's not trivial. [0:25:16] LA: And a parquet file in an S3 bucket is even harder to - [0:25:20] HM: Yes. Like, how do you do that? You have to read the whole thing, then rewrite. It is not fun. It's not a fun thing. In order to update, we have our own file format that I mentioned before, the DuckDB file format. If you want to make changes to your data, then it needs to be in the DuckDB file format. Or there's another cool feature - for a couple of months, or probably a year or so now, we've had connectors to other data systems. For example, you can attach a Postgres database, or a MySQL database, or a SQLite database to DuckDB. Then, we can sort of treat them as our storage. So, there are some really cool aspects of this. For example, we can run queries on SQLite files faster than SQLite itself. Why is that? Because we can parallelize queries and SQLite can't. Our data access is as fast as SQLite, because we can just use their reading code on their data files. It's actually a great idea. And there we can run transactions. But to come back to transactions, it's actually an interesting question. What do updates actually mean for analytical data systems? We thought about that quite a bit.
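[A sketch of the connector feature Hannes mentions above, attaching existing SQLite and Postgres databases from DuckDB's Python API; the file names, connection string, and tables are hypothetical, and the sqlite and postgres extensions are assumed to be available under those names.]

    import duckdb

    con = duckdb.connect()

    # Query an existing SQLite file with DuckDB's (parallel) engine
    con.execute("INSTALL sqlite")
    con.execute("LOAD sqlite")
    con.execute("ATTACH 'legacy.sqlite' AS legacy (TYPE SQLITE)")
    con.sql("SELECT count(*) FROM legacy.orders").show()

    # Attach a running Postgres database and treat it as storage
    con.execute("INSTALL postgres")
    con.execute("LOAD postgres")
    con.execute("ATTACH 'dbname=shop host=localhost' AS pg (TYPE POSTGRES)")
    con.sql("SELECT * FROM pg.customers LIMIT 5").show()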
Because, I mean, from the third quarter of your database undergrad course, you remember that there was this thing with a bank, and the bank needed to send money, and there was a transaction, and there were two updates, and yada, yada, yada. But if you think about what this means for analytical systems, do we really have these kinds of queries in analytics? The answer is no. We don't have that. What we have are interesting queries. Like, "Hey, read this two-terabyte file into a table." Bam, that's the transaction, the update. But now, this is really annoying, because now you have to read two terabytes into a table. Okay, I'm exaggerating. But it could be. Actually, it happens. So, we have done that. Other people have done that. That's okay. So now, you're reading two terabytes into a table. That's a transaction that you have to be able to abort and roll back again. But how do you roll back a two-terabyte transaction? It's not obvious. It has actually created a sort of collection of interesting problems to work on: how do you actually deal with transactionality in the presence of these crazy updates? Okay, so we had the volume. So, you have to be able to abort an ingestion of two terabytes, and basically roll back as if nothing happened. Fun stuff. Another interesting aspect is, it's very common in analytics to add a column to a table that's computed somehow from the other columns, right? You run some expression, it cleans up something, creates a new column. Now, that means that we change every single row in that table, because we just added a column. In traditional systems, this means a full rewrite of the table, because Postgres can't deal with that. Postgres cannot, in its storage format, deal with adding a column to an existing table, because it's a row format - there's no space behind the row to put anything. So, rewrite, right? Not great if the table is, again, two terabytes tall. We have actually designed, I think, one of the first multi-version concurrency control schemes for SQL systems that is specifically tailored to bulk columnar updates. So, if your changes are somehow more columnar than row-based, and they tend to be in analytical systems, and if your updates tend to be more bulky - you're adding a thousand rows at a time, not one - then our system will outperform anything else out there. It's still transactional, which is really cool. Many data systems don't have that; they kind of traded away transactionality for some idea of performance. But we have managed to get all that done in such a way that if anything goes wrong, we can go back to the status quo ante without too much trouble. That's, I think, pretty unique, that you can do that. [0:29:00] LA: Right. You talked at the very beginning about columnar access being more important. That's really the reason why - it's because of these sorts of transactions. That's cool. Now, that's, of course, with the native file format. [0:29:13] HM: That's with the native file format. That is the only one that we know of that is designed for these kinds of things. [0:29:20] LA: Cool. But you can still import from, let's say, a CSV file into your data file format, and put it back out again. You can do that. [0:29:29] HM: You can. You can also run a select on a CSV file, no problem. It's just, if you think about what it means to delete a column from a CSV file, it probably means rewriting the file.
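[A sketch of the bulk, columnar style of update discussed above - adding a column computed from existing columns inside a transaction that can be rolled back. Table and column names are hypothetical, and this is just the plain SQL route, not a description of DuckDB's internals.]

    import duckdb

    con = duckdb.connect("analytics.duckdb")  # hypothetical DuckDB database file
    con.execute("CREATE TABLE IF NOT EXISTS sales (price DOUBLE, qty INTEGER)")

    con.execute("BEGIN TRANSACTION")
    try:
        # A bulk, column-wise change: add a derived column and fill it for every row
        con.execute("ALTER TABLE sales ADD COLUMN revenue DOUBLE")
        con.execute("UPDATE sales SET revenue = price * qty")
        con.execute("COMMIT")
    except Exception:
        # If anything goes wrong, the whole bulk change is rolled back
        con.execute("ROLLBACK")
        raise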
So, that you can also do. That's not an issue. But that means that you read the whole thing, you change it, and you write it back out. [0:29:45] LA: It's not as performant, but it is possible to do. [0:29:48] HM: Yes, exactly. I mean, I don't think there's a performant way of deleting a column from a CSV file at all. If you think about it, how do you do that? The operating system, like the block APIs, will not allow you to do that in any way, shape, or form. I'm not a wizard. I cannot do magic with operating systems. [0:30:07] LA: If you're dealing with a CSV file that's so large that the performance of rewriting becomes a problem, you should be using another format anyway. But that's a different podcast episode right there, I think, to have that - [0:30:19] HM: I've seen things. [0:30:20] LA: I have too. I have too. Should doesn't mean do, but definitely. So, let's talk a little bit about you and the start of DuckDB. Why did you build this? [0:30:33] HM: Well, I had a pet duck, and I thought, I have this pet duck, and I'd better build a database to - anyways, in seriousness. DuckDB was built because I was doing research at the Dutch National Research Lab for Mathematics and Computer Science in Amsterdam, together with Mark, Mark Raasveldt, who is the co-creator, and also co-founder, of DuckDB, and who was then my Ph.D. student. So, we built DuckDB because we saw this need for what I mentioned earlier, for the tail scale, for solving the data problems for the 99%. We saw that there was this massive sort of lack of interest in single-node processing - in focusing on performance on a single node - and also a lack of tooling. It was pretty bleak. The result was that people were just building their own data systems left and right, and we thought, "Hang on, we know a lot about data systems. But why don't these people use data systems?" Because of the various issues in sort of ergonomics, and just ease of use, and these kinds of things, people were just ignoring them. We thought, maybe they're ignoring us for the wrong reasons. That's a good way of putting it. So, we set out and thought about what an analytical data system would have to look like, in the long, long ago of 2018, to actually be useful to the data scientist. That's how the first version of DuckDB came around: we started engaging with the community of data scientists. We still talk a lot with data scientists about what would work for them. We did a bunch of prototypes and things looked promising. So, we sat down, and we started writing a database from scratch, which is, I'm not sure if you have heard this, but it is a monumental undertaking. I'm one of these people that kind of leaps, and then starts looking. It's probably good that it's like this, because I would have never done it otherwise. I have to admit, I greatly underestimated the effort of making a database management system. It's good, because otherwise I would have never started it. I mean, now we are six years down the road, and almost there, actually, within a couple of months. We are just getting ready to release our DuckDB 1.0 in the next couple of weeks, months, we'll see. So, that's kind of how long it took us from starting to build something from scratch, just Mark and me, later joined by more people. Now, we're about 20 people working most of the time on DuckDB.
But that's how long it took from nothing to something we're happy enough with to call a 1.0. That has to do, I think, with the fact that, because database systems have been around for so long, there are a lot of expectations. There are some very reasonable ones, like you shouldn't crash, you shouldn't lose my data. That's very reasonable. Then there's also just this incredible breadth of SQL. I'm still finding out things about SQL I've never heard about. Nothing teaches you SQL as much as writing your own data system, because you have to write the engine that interprets other people's crazy SQL queries, right? We learned things that we never knew existed. For example, the other day - have you ever heard of UNION ALL? [0:33:44] LA: I've heard of it. I'm not sure I could use it. [0:33:46] HM: So, in SQL, you have UNION and UNION ALL. You can take two queries, union them together, and you kind of glue the rows together from these two queries. Sounds pretty benign, right? [0:33:56] LA: I don't know what UNION ALL is. So, UNION vs. UNION ALL? [0:33:59] HM: Yes. UNION actually has set semantics, so it removes duplicates from the result, and UNION ALL doesn't. UNION ALL has bag semantics, and doesn't do that. That's pretty well-known. If you're writing SQL queries, you know the difference. I knew that difference. SQL also has something called INTERSECT. I thought, okay, that's sort of set-theoretic intersection. Sounds legit. Then, a couple of months ago, somebody came to me and said, "Hey, did you know that there's also INTERSECT ALL? Do you support that?" What does this even mean? What do bag semantics mean for intersect? So, it actually exists, and we had to implement it in order to support it. But we learned so much about sort of the craziness. And maybe you're not aware of this, but SQL is actually a Turing-complete language. There is something called recursive CTEs, recursive common table expressions, and if you implement them correctly, you have a Turing-complete language. So, we have accidentally built an execution engine for a Turing-complete language, which is terrifying, right? So, to come back to my earlier point, there's just so much that you have to do, that you have to get right. You have to get the transactions right, because people expect transactions, rightly so in my opinion. You have to get this crazy language, SQL, under control, which is gigantic and has things that you've never heard of, and you have to implement them, which is much harder than writing them. It's so much harder to get something right in all circumstances than to kind of use it. You're the person who has to think of all eventualities. What if this isn't there? What if that isn't there? What if this happens? That took quite a long time. There are also just aspects that nobody thinks about so much, like parallelism, right? We had to come up with a way of running queries in parallel, and we had to come up with a way of not crashing when we run out of RAM, and all these interesting things. Actually, that's a point I wanted to come back to, because in the very first sentence, you said it's an in-memory system, and that's only part of the story. DuckDB is actually a larger-than-memory system. There's this thing that, traditionally, database systems tried to be very conservative with memory usage. If you run Postgres in the default configuration, it uses like 32 megabytes of RAM. It's ridiculous.
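[Small examples of the SQL corners mentioned above - UNION vs. UNION ALL, INTERSECT ALL, and a recursive CTE - runnable through DuckDB's Python API; the values are illustrative.]

    import duckdb

    # UNION has set semantics (duplicates removed); UNION ALL has bag semantics
    duckdb.sql("SELECT 1 UNION SELECT 1").show()      # one row
    duckdb.sql("SELECT 1 UNION ALL SELECT 1").show()  # two rows

    # INTERSECT ALL: intersection with bag semantics, duplicates preserved
    duckdb.sql("""
        SELECT x FROM (VALUES (1), (1), (2)) AS t(x)
        INTERSECT ALL
        SELECT x FROM (VALUES (1), (1), (3)) AS t(x)
    """).show()  # 1 appears twice

    # A recursive common table expression: the numbers 1 through 5
    duckdb.sql("""
        WITH RECURSIVE nums(n) AS (
            SELECT 1
            UNION ALL
            SELECT n + 1 FROM nums WHERE n < 5
        )
        SELECT n FROM nums
    """).show()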
Then there were in-memory systems, like, I don't know, SAP HANA and a bunch of other systems that used only memory, and they would crash if they ran out of memory. That's not great either, right? So, what we have is actually, like, a third-generation system, where we use memory. We will use your memory, no problem. If it's there, and it's free, we'll use it. But if things stop fitting in memory, we don't just crash. We actually have wonderful mechanisms to use as much memory as we can, and then gracefully degrade performance as we start using disks, and these kinds of things. These are also some of the, what should I say, orthogonal issues that you don't really think about, but that are extremely critical to get right, like parallelism, like transactions, like storage. You just spend an enormous amount of time on them. I'm super grateful for our team here at DuckDB Labs that has people that are experts in all of these things, and they spend all their time working on that. So, that's really cool to see. But it takes an enormous amount of time to get all that right. For example, also the storage format. I've mentioned our storage format. We have this DuckDB file that is just a file, a lesson we took from SQLite. You can email that file around. Now, you have this file, and you update DuckDB. Now, what happens to the file? In the past, it meant that you had to reload the data, because new version, things change. But now, since the last version, DuckDB 0.10, we actually have backwards compatibility on the storage front, which is a whole other bucket of pain. How do you define a storage format that you still need to be able to read in 10 years? We promise that every version from now on, until the infinite future, will be able to read the files that you create today. Well, first of all, it's a big commitment for a software project. But it's also something that, from a technical perspective, is really tricky to get right. Forwards and backwards compatibility is tricky. So, there you easily spend many months dealing with these aspects. Then, it's not that surprising anymore that it takes six years to get to 1.0. [0:38:09] LA: Right. Yes. That makes perfect sense. You're almost at 1.0 now. By the way, DuckDB, I don't think we've talked about it yet, but it's open source, is that correct? So, you do have community contributions and it's available for free, for people who want to use it, et cetera, et cetera. All the normal open-source things. But many companies struggle with the monetization side of open source. What is your solution to that? [0:38:36] HM: That's an excellent question. I think we are all too burned by the recent rug pulls that have happened in that space. So, DuckDB is free and open source under the MIT license. You can go ahead and do whatever you want with it and there are no restrictions, which is, I think, really important to gain wide adoption. We have a GitHub. You can send us pull requests, that's perfectly fine. How do we monetize? So, we actually don't monetize DuckDB itself. DuckDB itself is actually owned by a non-profit foundation, the DuckDB Foundation, which has it in its statutes that it owns the name, the trademark, and all the IP in the implementation. So, the project itself, it's just open, there's no money to be made there. Our company, DuckDB Labs, employs most of the people that work on DuckDB - Mark, me, a bunch of other people. We finance ourselves not with VC money, which is what usually happens.
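[To illustrate the memory behavior described earlier in this answer - use RAM while things fit, then spill gracefully to disk - a sketch of the relevant settings in DuckDB's Python API; the limit, path, and files are illustrative.]

    import duckdb

    con = duckdb.connect()

    # Cap how much RAM the engine may use (illustrative value)
    con.execute("SET memory_limit = '4GB'")

    # Where intermediate results may spill once the limit is reached
    con.execute("SET temp_directory = '/tmp/duckdb_spill'")  # hypothetical path

    # A large join/aggregation now degrades gracefully to disk instead of crashing
    con.sql("""
        SELECT l.id, count(*) AS n
        FROM 'big_left.parquet' AS l
        JOIN 'big_right.parquet' AS r USING (id)
        GROUP BY l.id
    """).show()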
Usually, what happens is that you start a project and you get VC money until that runs out, and you have to make the hard decisions. No. We are actually self-funded. We bootstrapped. We had the big advantage that we were working at the research institute for the first three or four years of the project's existence, which meant that we didn't have to pay our own salaries. We got research grants from a bunch of organizations. That's fine. But it means that we could get the project close to being commercially useful. So, when we spun off the company, we already had customers lined up that were ready to sign up as customers, to basically pay us to add features, pay us for support for DuckDB. With that, we fund the further development of DuckDB. So, DuckDB Labs today is doing well. We are very happy. We still haven't had any investors. So, it's still a company that's just owned by, basically, the team. [0:40:19] LA: I wonder if there are any other major open-source companies that have self-funded a for-profit entity that focuses on just the support aspects, and not on other products of their own and things like that. That's a pretty amazing accomplishment. [0:40:36] HM: Well, thank you. I didn't set out in life to become a businessman. But alas, it's really interesting to see. So, the company right now - and you will forgive me the advertisement - if you use DuckDB, and you need help, call us. It's 0-900-quack, quack, quack. No, it's not. But basically, we sell consulting and support contracts. So, if you need commercial support for DuckDB, if you need your favorite feature to be added, then you can do that. We have customers big and small. We work for the biggest companies out there. We work for small companies. And that's working out really well. That basically funds the whole show, funds all our team. Yes, it's been super interesting to try to build this. But as I said, what really helped was that we were coming out of this research world, which meant we had a couple of years where we could just work on it and not worry so much about money and things like that. Yes, we got some research grants, as I mentioned. But we didn't have to deal with getting desks and getting an accountant, and, I don't know, lawyers - all the fun stuff that comes with running a company. So, when we were ready to spin out from the institute, it was good enough to basically charge money for supporting it. We're still very grateful for the customers that signed up back then in 2021, and they're still around, lo and behold. We're still working with them and we're very grateful, because it really made a big difference and it allowed us to not take investment and basically keep control of our roadmap. [0:42:03] LA: Makes sense. So, we're unfortunately rapidly running out of time here, and I've got more questions to ask. But let me end with one open question, and that is: what haven't we talked about that's important, that's going on, that you would like to make sure we talk about and bring up? [0:42:21] HM: I think what we haven't talked about is maybe the extensions. I mean, you briefly mentioned it in the beginning, I think, that DuckDB is extensible. I think that's a super interesting future direction for the project and for us. So now, we're all working on getting the 1.0 out of the door at some point in the future.
But what happens after that is, I think, we're going to spend a lot of time on basically building an ecosystem, where you, or anyone else, really, can basically use DuckDB as a platform to build their own stuff, to build extensions on, and where DuckDB kind of becomes this fabric of efficient operators and efficient data representations, all that stuff, with which you can glue various extensions together. And it's all accessible from sort of a unified interface like SQL. In that sense, I think we are trying to go a bit where Python is, where Python itself is great, but the real value of Python comes from the ecosystem of contributed plugins. We're working very hard to make it easier to build, and ship, and manage your own DuckDB extensions in multiple languages. So, for example, we're going to add support so you can write these extensions also in Rust, and maybe Go, and C++, and anything else, so that anyone can really extend it. I think that's going to be the next big thing for DuckDB, this ecosystem around it. But that's how it is at the company - we have sort of two timeframes. One is pre-1.0 and one is past-1.0, and they're very, very different from each other. It's really fun. It's also just been a lot of fun working on this. It's been a blur of six years or so. I mean, obviously starting with Mark, and now, it's just an absolute pleasure to work with everyone. [0:44:11] LA: That's great. Sounds like a great project. I can always tell when something is really interesting when I go from, "I don't even know what this thing is," to, "I've got to try this out on my next project now." I definitely will be doing that. [0:44:23] HM: Please do. [0:44:24] LA: I already see a couple of use cases in things I'm working on right now that this would be a perfect fit for. I think we talked about CSV import, but also some things on, I forgot what the term is, the AI distance comparisons and things. [0:44:38] HM: We have support for that now as well. We have a plug-in for vector search, things like that. [0:44:44] LA: Exactly. Yes. [0:44:45] HM: An extension, you see. [0:44:45] LA: Exactly. So, I might use it for that as well. So anyway, very interested. I'm going to learn a lot more about it and start using it. I appreciate your time. This has been great. My guest today has been Hannes Mühleisen, who's the CEO and co-founder of DuckDB Labs. And did I get that pretty close on the name? [0:45:04] HM: You got it right. Thank you. [0:45:05] LA: Wonderful. DuckDB is a modern, extensible, open-source, in-memory and larger-than-memory analytics database engine. Hannes, thank you very much for joining me on Software Engineering Daily. [0:45:18] HM: Thank you, Lee, for having me. [END]