[00:00:00] L: Dor, welcome to Software Engineering Daily.

[00:00:02] D: Hi, Lee, and hi, everybody. I'm excited to be here.

[00:00:04] L: Great. Let's start out with the question that's gotten me for a while. I've always pronounced it ScyllaDB, until I found out recently that Scylla is the more correct pronunciation. But I understand that there are other ways to pronounce it as well. What's the story behind the name? And what's the story behind the pronunciation?

[00:00:26] D: Well, we have a couple of interesting stories. Thanks for asking. It's pretty much always the number one question. So, the story behind the name is that we pivoted early on. When we started the company, we did something else. We needed to choose a different name. We were a distributed company from day one, and we used a Google Doc to collect ideas about names, and Scylla was one of them. Now, we wrote Scylla in the Google Doc along with other names, and everybody pronounced it in their mind. But we didn't meet. So, it was all written down without pronunciation. Eventually, we voted and we selected this name, and then we had a meeting, two weeks before the launch, and we realized, hey, everybody has a different idea about how to pronounce Scylla. At the end of the day, we have a rule. If you use it in production, you get to pronounce it the way you want to.

[00:01:27] L: That makes sense. It’s cool. So, the particular pronunciation that you prefer, though, is Scylla. Not Skyla, is that correct?

[00:01:37] D: Correct. But I was also thinking about Brian saying it Skyla, but it's just only pronunciation.

[00:01:47] L: Yes, makes sense. So, let's talk about Scylla then. It's a highly competitive database in a very crowded product space. There are a lot of products that do similar sorts of things. Why another database? And what does Scylla provide over and above DynamoDB, and all of the other similar modern NoSQL databases?

[00:02:15] D: Indeed. Very good question. Actually, it's a question my co-founder and I sat on before we started the company. We listed multiple ideas for a startup, and one of them was the database with a design for assists these, and we crossed that, we've marked out as office, maybe because we believed back then that everybody had already done it. So, we selected another idea. Eventually, it was an operating system from scratch. Eventually, we had to pivot away from it. The [inaudible 00:02:46] still lives, but that's another story for a whole episode.

So, when we looked for another idea, and actually, we're using the OS that we created, that's supposed to boost databases and other workloads compared to Linux, we stumbled across Cassandra and Redis, and multiple other products. Hadoop, back then it was late 2014, and we realized that with Cassandra, when we ran Cassandra, we didn't manage to show the improvement over the OS that we created from scratch, versus Linux. When we did the same with Redis, for example, we boosted Redis by 70%. We had a super cool kernel with our own TCP/IP stack and other things. So, Redis was relatively a slim application in C++, and basically, it's more for lightweight cache.

Cassandra was a full-fledged, relatively heavyweight Java process, and we realized the complexity lies within Cassandra. And if you change the underlying engine, it wouldn't move the needle that much. Back then, Cassandra was among the 10 most used databases. So, we did one bulk plus one, did additional research, and we realized, okay, let's rewrite Cassandra from scratch in C++, bring all of the knowledge we have from our operating system creation – Avi, my co-founder, had the novel idea of how to create the KVM hypervisor, and I managed the implementation team.

So, we had a lot of opinionated ideas of how to create a low-latency infrastructure, and it happened to be that we were targeting the database space, and Cassandra was screaming of inefficiency. Back then, Google published a paper saying that they used 330 machines to get a million operations per second, with all of the replication and all. In 2012, KVM, the hypervisor we created, broke the world record for the number of IO operations, just file system operations in a single VM. It did 1.6 million operations. So, it's really a comparison of apples to oranges, but it was screaming of inefficiency.

We realized, let's keep the really good interface Cassandra has, also derived from others. It stands on the shoulders of giants, the Dynamo paper and Bigtable. Let's keep all the goodies that come from this type of wide column structure. Just change the implementation. Basically, this is what we've been doing since.

[00:05:51] L: Yes, so your intent was a high performing Cassandra, and that's really what you've accomplished. But your biggest competitor right now, really, I guess it is still Cassandra, but it's mostly, I would say, DynamoDB. Correct? Is that your biggest competitor at the moment?

[00:06:12] D: Today, as you said before, there are so many databases. There's no one gigantic target that we just target. But basically, we solve problems at scale. Some databases cannot scale at all. They're extremely good in what they do, like relational databases, but it's extremely complex and problematic, though not impossible, to scale them. Some databases can scale but only to a level, and they may have additional problems while they scale, sometimes stability, and some of them can scale, but they are extremely expensive.

So, we solve problems at scale, and depending on the pain that our users are having with other platforms, at the end of the day, if they do not have pain, they can stick with what they've got.

[00:07:08] L: It's kind of an interesting space. The database storage space, because it requires – it's the one part of an application that fundamentally requires that people make decisions early on in the product development process about exactly what type of database they want to use. Maybe they can – they don't have to get the specifics. You can always move from, like moving from MySQL to Postgres is usually pretty easy, and things like that.

But you have to understand the type of data architecture you want to use at the very beginning of your application development, so you know whether or not you want to use – whether you need a SQL based database, or a NoSQL database, or key value, wide table, whatever type structure you need. You need to decide that pretty early on in your product development process. So early, in fact, that you really don't know your scaling needs. What advice do you give to someone who's trying to decide, should I – basic question number one, should I build my application using SQL or NoSQL? And why? What's the fundamental advice you give to someone in that quandary? That's a very common quandary that early development processes are in. But what do you give as advice to those people?

[00:08:39] D: At the end of the day, let's say it's about the workload and your anticipation from the usage patterns. If the workload isn't that big, and it can sit, workloads that can run on SQL types of databases, go ahead and pick SQL. They're the best databases. They give the best flexibility and the best query capabilities. So, that'll be the easiest if the amount of data is not gigantic.

Also, it's not just about the amount of data, it's also about the latency, the throughput, and the high availability and disaster recovery. If you need active-passive, maybe relational databases can still work. If you need active-active and you need to sustain a region that goes down, then probably SQL is not your solution. You can go back to the CAP theorem, where you need to select two out of three, or sometimes both approximations today, sometimes the boundaries get blurred. But at the end of the day, you need to model your failures and what you'd like to support with your applications.

Now, some applications are really web-based properties. If you're planning to compete with Discord, with the Spotifys and the Disneys of the world, then you better start right, and you expect to have a 24/7 mission critical application with lots of data and lots of users. If you're working on, let's say, a small startup with 10 people and the product, let's say, is just doing credit card transactions. I was just talking about this with our original VP of sales. So, how many credit card transactions are you going to have a second? Probably not that many. Actually, such a company's product will probably evolve over time. They'll add a fraud prevention application with a database that will need to do many more database transactions in order to validate whether the user is a real one versus a fraud. There, there may be many more database activities, and you may need NoSQL there.

While at your core, the activity may be sufficient with SQL. But we've seen that users can be flexible and there is not necessarily a need to do premature optimization. If you need to optimize it from the very beginning, if you need scaling, then go ahead, pick the most scalable database you need. If initially you care about speed of programming, and you don't even have product-market fit, you can also try to do what makes sense. Discord, for example, one of our users, started off with MongoDB, because they needed the agility, and I believe that they also did the pivots early on. Once it got picked up, they moved to Cassandra, but back then we were pre-GA. They looked at us, but we didn’t have a GA release, and a couple of years later they moved to us. So, it is possible to change while you're growing.

[00:12:31] L: Yes, you're right. You certainly can change technologies while you grow. But moving from Mongo to Cassandra to Scylla is a lot easier than moving from let's say MySQL to Cassandra or MySQL to Scylla. Because it's a fundamentally different model in how you think about data, and how it works. You're right, you can make class of decisions, final decisions. I love what you said, you mentioned, active-active versus active-passive. I liked that because that's a very concrete thing that you can evaluate at the beginning of your lifecycle and say, “Yes, active-active is going to be important to us someday, whether it's important today or not.” So, we want to build an architecture that allows us to easily have an active-active multi region setup. That leads you in a certain direction, in a direction towards someone like Scylla.

But I think what some people struggle with is the high-performance, high-scalability, yes, NoSQL is better for that. But how much? My application is going to be very, very, very big someday, but I don't know how big, and I don't know exactly when, and what sorts of guidelines can you give people as far as, if you're above this size of an application, or this size of a data set, or this number of transactions per second, et cetera, et cetera, et cetera. This is the area where NoSQL really starts to shine. Can you give any guidance in those areas? Or is that just too open-ended of a question?

[00:14:20] D: I'll try, but at the end, I can easily be wrong by 400%. It’s still 400% and not 47%. So, it's about as possible to go, to be totally wrong. Number one is the amount of volume. With relational, most of the time, you're going to fit it into a single machine that needs to have a certain amount of volume. Let's say, a terabyte, or 10 terabytes, even if the machines have more capacity, it's really hard for a relational database to deal with more than a terabyte. Roughly speaking. Again, every object and every partition can be different. But more than a terabyte –

[00:15:06] L: Terabyte, a number to be thinking about.

[00:15:08] D: You can also think about whether your data is shardable or not. Sometimes, let's say if you have multiple clients, and you don't really care to cross match all of the different clients, so all of your queries will always have a single client ID, and you wouldn't join multiple clients together, then you can say, “Okay, I need my biggest client to fit into a single computer.” That's also possible. But we can also see it as sometimes a single client becomes like, worse, and they have sub clients. So, it can get trickier and trickier over time.

One terabyte and, let's say, 10,000 to 50,000-ish operations per second is good for relational, and anything above it is good for NoSQL. We have small customers who use tens of thousands of operations per second, and bigger ones who can do multimillion operations per second and keep petabytes.
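The rule of thumb from this part of the conversation can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above (roughly one terabyte and up to about 50,000 operations per second for relational, plus the earlier active-active remark), and the speaker is explicit that such numbers can be far off. The names `Workload` and `suggest_database_family` are hypothetical and are not part of any Scylla tooling.

```python
from dataclasses import dataclass

# Rough thresholds quoted in the interview; treat them as a guesstimate only.
RELATIONAL_MAX_BYTES = 10**12          # about one terabyte
RELATIONAL_MAX_OPS_PER_SEC = 50_000    # upper end of "10,000 to 50,000-ish"


@dataclass
class Workload:
    """Hypothetical description of an application's expected load."""
    data_bytes: int
    peak_ops_per_sec: int
    needs_active_active: bool = False


def suggest_database_family(workload: Workload) -> str:
    """Return "relational" or "nosql" based on the rough rule of thumb."""
    # Active-active across regions was named as a reason to leave SQL.
    if workload.needs_active_active:
        return "nosql"
    # Anything above the quoted volume or throughput points to NoSQL.
    if workload.data_bytes > RELATIONAL_MAX_BYTES:
        return "nosql"
    if workload.peak_ops_per_sec > RELATIONAL_MAX_OPS_PER_SEC:
        return "nosql"
    return "relational"
```

A 200 GB application at 5,000 operations per second would come out as "relational" here, while the same application with an active-active requirement would come out as "nosql".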

[00:16:19] L: Yes. To be clear, like you've been very clear here is that it depends on the application, it depends on the use case, and lots of variables there. I was pushing for numbers simply because that's what people want. But it's a thousand plus or minus 100,000 of guesstimate. The variability of these numbers is so variable. Are there ever cases where you would either consider using SQL for an extremely large application with large amounts of data and lots of transactions? Or alternatively, consider using a NoSQL database for a very small application?

[00:17:01] D: Use SQL for a large application – one option is an application that you can easily shard. So, that's example number one. The size wouldn't necessarily matter and you'll be able to do multiple shards. Also, queries may be relatively cheap, even with our sharding, depending on the data set and on your schema.
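The "easily sharded" case described here, where every query carries a single client ID and clients are never joined together, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and hashing the client ID with SHA-256 is just one common way to get a stable mapping, not something described in the interview.

```python
import hashlib
from typing import Dict


def shard_for_client(client_id: str, num_shards: int) -> int:
    """Map a client ID to a shard index using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # hashlib gives the same result across processes, unlike built-in hash().
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def largest_client_fits(client_sizes_bytes: Dict[str, int],
                        machine_capacity_bytes: int) -> bool:
    """Check the condition "my biggest client fits into a single computer"."""
    biggest = max(client_sizes_bytes.values(), default=0)
    return biggest <= machine_capacity_bytes
```

Because each query names exactly one client, it can be routed to `shard_for_client(client_id, num_shards)` and answered by a single relational instance. The approach stops working once one client outgrows a machine, which is the caveat raised earlier about clients that grow and acquire sub clients.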

[00:17:26] L: For the schema too, and what types of selects you need to do, what type of queries, yes. Lots of variability there.

[00:17:34] D: Correct. We also have NewSQL and Distributed SQL. Actually, if someone is interested, and I'd rather not make it a product pitch, there was a hot Distributed SQL database, great technology and product. We tested it versus Scylla to figure out, okay, let's see what Distributed SQL can do. We used the same three nodes, the same hardware, and we compared Scylla and that other technology. We used one billion keys, and the other database just crashed when we used one billion. So, we reduced that to 100 million, and Scylla still used one billion. Then, we compared the two, with Scylla holding 10 times the data, and we managed to do 9x the throughput, and the latency was 4x better. Altogether, a very big advantage, and the other technology is great. So, there is a big advantage going to NoSQL. You're also leaving good functionality on the table. SQL vendors keep on adding functionality, and NoSQL vendors also add functionality.

We, for example, are in a transformation to move from eventual consistency to full consistency. The worlds kind of merge over time.

[00:19:18] L: Yes. I'd love to get into a conversation. I think this is probably another episode though of full consistency over highly distributed databases, and how you accomplish that, and then the techniques involved in it. It's an interesting conversation in general, but probably more than we have time for today.

But let's talk a little bit – you mentioned that everyone is growing in their capabilities more and more. And in some cases, that's making the difference between SQL and NoSQL less and less significant, right? You look at a database like CockroachDB, which is very much focused on the same sort of feature set that a lot of NoSQL databases are focused on. Things like large datasets, distributed over geography, large applications, active-active, et cetera. All those same sorts of things, yet they're a SQL based database. So, they're actually going deeper and deeper into NoSQL capabilities from a SQL database. You also have databases like Redis, which has historically been an in-memory database. It's very, very, very fast. But now they're focusing more and more on persistent databases, the ability to persist using the same API.

So, each of these companies with their own focus is merging into the space of the other ones. Are we creating a universe where all databases are going to be partially good at everything, but not really good at anything? Because we’re all kind of this mess in the middle? Or are we really truly making it so that the choice of database is less important? And choose for whatever other reasons you want? Maybe performance or whatever. But the specific choice is less important, because the capabilities exist everywhere. What's your take on that industry change?

[00:21:27] D: It's an interesting question. The thing is, on one hand, the greater common denominator keeps expanding, so more databases are more consistent, can deal with more data, with more transactions, all of them, and have more and more APIs. On the other hand, it's not that the workloads remain static and just the databases are improving. Usually, these workloads also keep expanding, sometimes at a higher pace, and data may grow 2x year over year, and the amount of data will grow.

We also see it, for example, there are cases where customers, let's say, keep a history of 30 days, and they'd like to keep a history of one year or seven years and run analytics on it. So, we're expanding into usage of S3, and not just us, but also other companies in the market, both relational and non-relational databases. Everything moves, and what was sufficient even for a baseline is not sufficient in the future. Even though the database capabilities keep on expanding, the industry requirements are also expanding. Today, it's a standard to have three zones and region replication and SaaS, in general, database as a service with bring your own account, bring your own key, et cetera.

[00:23:13] L: So, in other words, even though the capabilities of all the databases are growing, so are the customer needs, essentially, which makes sense, and the net result is there's still differentiation in the database space, so I think there's still going to be differentiation in the database space for some time to come.

[00:23:33] D: Correct. I think there always will be. It's a hard domain, it's the biggest market in software. In two years, it will be a $100 billion market. It will attract lots of vendors, and these vendors could come both for fun, because it's a fantastic domain to work on, and also for profits. So, it will attract lots of improvements and lots of competition. But although there are many databases, there is no one winner that takes it all, so it's possible to do multiple things. There are clear winners in several segments, and you can name them, and sometimes there are niches, but even a niche, let's say graph databases, is large too. It's large enough to be a billion-dollar market.

[00:24:28] L: Yes, a small is still huge in this environment. So, yes. Well, one of the hot topics in the last several years, well, actually, probably about a decade now, but it's certainly growing. Sorry about that. Speaking of Alexa, by the way, which was what I was about to get into, one of the hot topics is AI and machine learning. And that's created – that's interjected itself into almost every application development process, in almost any industry, large or small all over the place. And one of the things that AI requires is huge, huge, huge amounts of data. So, the amount of data we're storing, and making available for use for things like machine learning and AI, is causing an increase in need of even more data. You mentioned, the customers want to keep the data for a year or seven years, versus 30 days, a lot of that is like you say, for analytics, but a lot of that is also for machine learning, and artificial intelligence needs.

How does this increase in quantity of data that the world requires for technologies like AI? How does that fit into Scylla’s long-term plans and long-term requirements? And how does Scylla fit into – what is the future of Scylla in that world? I guess, is what I'm trying to ask.

[00:26:14] D: AI is a direct continuation of the digital transformation of everything. So, it's another kind of – before, let's say, transformation was relatively basic, now it's smarter. I bet today, ChatGPT usually use the term as an answer below average, the answer that ChatGPT gives you, because you can't verify it. But still, if you look at big data, sometimes it's good enough. So, obviously, lots of our customers are into AI, whether it's generative or non-generative AI, and use feature stores to figure out what's going on. Everybody expands towards AI, and makes sure that they run machine learning models, so they have all of the training, and all of the history to train the model. Later on, sometimes it's too slow to ask AI. So, they need to place all of the decisions from the model ahead of time, to pre-populate them in your profile, so they are there once you get into your shopping cart, or to the recommendation for your video, or audio streaming, or anything else. Basically, half of our usages are recommendations. You can look at it this way: e-commerce is a recommendation to buy something.

Serving is a recommendation to continue to watch and to watch something similar. Fraud is a negative recommendation of, “Oh, I do not recommend this deal with the user profile that AI computed before.” It's everywhere, basically, and it will just, let’s say, if today it’s half, in the future, it'll be 99% or so.
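The pattern described here, running the model ahead of time and pre-populating its decisions so that serving is a fast lookup, can be sketched as below. This is a hedged illustration only: a plain Python dictionary stands in for the database, `toy_model` is a made-up placeholder for a real trained model, and none of the names come from Scylla or from the interview.

```python
from typing import Callable, Dict, Iterable, List


def precompute_recommendations(user_ids: Iterable[str],
                               model: Callable[[str], List[str]],
                               store: Dict[str, List[str]]) -> None:
    """Offline step: run the (slow) model and write results keyed by user."""
    for user_id in user_ids:
        store[user_id] = model(user_id)


def serve_recommendations(user_id: str,
                          store: Dict[str, List[str]],
                          fallback: List[str]) -> List[str]:
    """Online step: a single key lookup, with a default for unknown users."""
    return store.get(user_id, fallback)


def toy_model(user_id: str) -> List[str]:
    """Hypothetical stand-in for a real recommendation model."""
    return [f"item-for-{user_id}-{i}" for i in range(3)]
```

In a real deployment the dictionary would be replaced by a table keyed by user ID, so that the request path never has to wait for the model.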

[00:28:26] L: Makes sense, yes. So, if someone wants to start evaluating Scylla, either for a new software project they're starting or for an existing project where they need to migrate their data somewhere, what advice do you give them and what should they do?

[00:28:47] D: There are two approaches. One approach is just a simplified approach, just grabbing a Docker container, running it on your laptop, and playing with it a bit. We have a Scylla University with lots of nice, super quick Docker Compose apps that you can run to emulate a complicated environment super quickly with one command. There is the other approach that says, “Okay, let's go ahead and let's imagine our most complicated use case.” Let's say you need a million ops, let's see if it can run a million ops, and in order to run a million ops, you need a couple of good machines. Then use your favorite cloud, provision them, and just try to see if it answers the hardest problem first, instead of delaying it to the end.

[00:29:47] L: Well, makes sense. Makes sense. Now, you mentioned the Docker container. Obviously, you have an open source version, but you also have a cloud hosted version and you have an enterprise offering as well. Do you want to talk about the difference between those at all? Your chance for your marketing message here now, if you want to.

[00:30:07] D: Sure, thank you. Well, open source is the base. So, you get the fully functional database that can do a lot. It's open source. For good and bad, you'll get community help with it. It's pretty good, and it's an easy starting point. You don't need to speak with sales. Who likes to speak with sales? I hope sales do not hear the podcast. Then, there is Scylla Enterprise, which is based on the open source product. It has a longer longevity and we harden it more, and it has a couple more features. I'm not trying to sell here. But if someone needs support and some other things, they can purchase it and they manage the enterprise product on their own. They provision the machines and if a machine dies, then they need to come up with a replacement or code that will replace it, or Kubernetes. The SaaS product is a service, so we are responsible for everything, exactly like DynamoDB, and it's easy to use, no installation, just a service. That's of course the easiest.

About 80% of our new users, or new customers, other clients, are interested in the service because it's just easier. They get to focus on their app, and not to maintain a distributed database.

[00:31:44] L: Right. Thank you. Thank you very much. So, Dor Laor is the Co-Founder and CEO of Scylla, a highly scalable, cloud centric NoSQL database, and he was my guest today. Dor, thank you so much for coming and being with me on Software Engineering Daily.

[00:31:59] D: Thanks. It was a pleasure.

[END]