EPISODE 1647 [EPISODE] [0:00:00] ANNOUNCER: Starburst is a data lake analytics platform. It's designed to help users work with structured data at scale and is built on the open-source platform Trino. Adam Ferrari is the SVP of Engineering at Starburst. He joins the show to talk about Starburst, data engineering, and what it takes to build a data lake. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:00:40] SF: Adam, welcome to the show. [0:00:41] AF: Hi, Sean. It's great to be here. [0:00:42] SF: Yes, thanks so much for being here. Let's start with some basics. Who are you and what do you do? [0:00:47] AF: Yes. So, my name is Adam Ferrari. I lead engineering at Starburst. I'm relatively new to the organization. I joined in October. So, I'm pretty new in my journey, but excited to be part of it. [0:00:57] SF: Awesome. How's the experience been so far? [0:00:59] AF: It's a lot. We're doing a lot of really interesting stuff across a lot of big organizations. So, there's a lot to learn in pretty rapid order. A lot of interesting technical stuff going on, a lot of interesting business development going on. So, it's like jumping onto a rolling freight train. [0:01:14] SF: What were you doing before you joined Starburst? [0:01:17] AF: Yes, I've been in tech in the Boston area for over a couple of decades, and most recently, I was leading engineering at a company called Salsify, which was a data platform for retail data; we were helping brands get their content to all the major retailers. I've been in data, pretty much throughout my career. In the 2000s, I was the CTO of a company called Endeca, which was well known for creating the first version of faceted search and transforming how we search, and also for using search for analytical applications, which got me into the data analytics space. So yes, that's a bit of it. [0:01:50] SF: Awesome. So, I talked to your CEO, Justin Borgman, about a year and a half ago. I encourage anybody listening to this that didn't listen to that episode to go check it out. We had a great conversation about some of the history of Trino, the open-source project that Starburst was built on, and how Starburst transitioned into becoming a managed service available to the enterprise. For anyone that missed that episode, can you maybe start off with helping them get their bearings and explain what Starburst is? What is it used for? What problem are you solving? [0:02:19] AF: Yes, absolutely. It's a great question. I think we still struggle to get the identity well known out there, except among the people who really are in the data space. Starburst is fundamentally a data lake analytics platform. If you're doing anything around having a larger-scale data lake or federating across your different data assets, you probably know about Starburst. The core technology is an open-source package called Trino. It's pretty amazing. That gives us all of our superpowers. So, on one hand, we're a data lake analytics platform. Another way to think about us is as the enterprise provider of what we think is the best way to consume Trino for your big data needs. [0:02:57] SF: So, when you talk about it being essentially designed for data lake analytics, when I think of data lake, I immediately think of unstructured data. So, is this only designed for unstructured data?
Or would this also work with something more traditional, like an RDBMS or something like that? [0:03:15] AF: It's definitely oriented towards structured data. I think people think of a data lake as unstructured only because, in a data lake, it's common that the data is in object storage, like S3. But in reality, what's in most of that object storage is structured data. It's going to be in a format like Parquet, or one of the more modern table formats, like Iceberg or Delta Lake, that is fundamentally structured. So, even the data lake component of it is structured. Then, the really interesting thing about Trino and Starburst is that it's got a really great architecture for connectivity to other structured data sources. So, you can be connecting to your MySQL, your Snowflake, any structured database, and then federating that together with your object storage-based structured data. So, it really is about structured data and SQL, especially providing one unified way to SQL across all of that, but spanning really traditional databases and object storage-based models. [0:04:11] SF: Okay. Then, in terms of the concept of a data lake, it's been around since about 2010. But if you compare that to something like data warehouses, or even databases, it's a relative newcomer in the world of data. So, from your perspective, what led to the growth and interest in data lakes? [0:04:30] AF: Yes. It's so interesting, because I think it was the changing nature of data problems that led us here. Data warehouses have been around for decades, right? Because people wanted to take all their corporate data and be able to drive fast queries on it. Transactional databases were much more balanced for read/write; they weren't suited for that task. But then, with the advent of the Internet age, there were these new sources of machine data, like all the user logging that happens on these massive Internet sites. That became a class of data that was beyond the realistic capacity and scalability of traditional data warehouses. You needed something that was way more scalable than that, Internet scalable. I think that's where the concept emerged from. We wanted a way to land stuff in object storage, which is fundamentally easy to scale; you can just keep dumping data, and it's very cheap to maintain. Then big data systems like Hadoop emerged to let you do something with that, crunch it. But fundamentally, in the original conception, MapReduce is a totally different model of accessing data than SQL, which is a lot more approachable, and also has connectivity to all these kinds of tools: your dashboarding tools, BI tools, et cetera. That's really where the data lake came from: the reality of this data we were amassing cheaply and scalably, but wanting it to be usable in the way that our database or data warehouse data was. That's where technologies like Trino came from: to bridge that and create the idea of a data lake being actually consumable, usable in a more data warehouse-oriented fashion. [0:05:59] SF: You mentioned MapReduce and Hadoop. I feel like that era of big data was something that we applied because we basically didn't have a better method for dealing with the scale issues at the time. It was like a band-aid or, I don't know, some chicken wire or duct tape that we put on top of it all, and it was really complicated.
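To make the federation Adam described a moment ago concrete: with one catalog configured for a relational database and another for object storage, a single Trino query can join across both. Here is a minimal sketch using the open-source trino Python client; the host, catalogs, and table names are invented placeholders, not anything from the episode.

```python
import trino

# Connect to a Trino (or Starburst) coordinator. The host, port, user,
# and every catalog/table name below are hypothetical placeholders.
conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One SQL statement spanning two systems: a customers table in MySQL
# joined against order data sitting as Parquet files in S3 (exposed
# through a Hive catalog). Trino pushes filters down to each source
# and federates the results.
cur.execute("""
    SELECT c.region, count(*) AS order_count
    FROM mysql.crm.customers AS c
    JOIN hive.web.orders AS o
      ON o.customer_id = c.id
    WHERE o.order_date >= DATE '2024-01-01'
    GROUP BY c.region
    ORDER BY order_count DESC
""")
for region, order_count in cur.fetchall():
    print(region, order_count)
```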
And luckily, it kind of went away fairly quickly. But it was huge. If you weren't part of that era, it's probably hard to conceptualize how big a thing it was during that period. [0:06:29] AF: Yes, absolutely. It is interesting. I think it was the answer at the time, when there weren't other, better answers. Fundamentally, if I think about MapReduce, it came out of very specific problems. I was doing search stuff during those years, and it really emerged at Google for a couple of reasons. One was just basic indexing. You've got these heaps and heaps of unstructured documents that you're crawling off the web and you need to index them. That's a MapReduce problem. Then there's the idea of PageRank, of getting link weights and stuff. That's again, how do you consolidate all the information? But I remember using the original versions of SQL on top of that, to make the more structured data more generally usable, and they were slow. Often, you'd get a query that would fail with a giant stack trace. It was kind of disappointing. I remember the first time I experienced the predecessor of Trino, which was called Presto, and using it. The experience was game-changing. It felt like a database. It was instantaneous, and it worked query after query. So yes, there were those early days that got us doing something with this massive data, but it absolutely was ripe for a next-level treatment, which is, I think, where we're at today. [0:07:33] SF: Then, you mentioned Presto there, which was the precursor to Trino. What is the history of that? Where did those projects come from? And when did Trino spin off into its own thing? [0:07:44] AF: Yes. So Presto, which was the predecessor of Trino, was built at Facebook. This goes back more than a decade, and I'll just call out our CTO group here: Dain Sundstrom, David Phillips, and Martin Traverso. These guys and a group working with them at Facebook created Presto, and they did it out of necessity. Facebook was running a gigantic cluster, I think one of the biggest Hadoop and Hive clusters out there, to try to analyze all their data, which was massive, unprecedented scale. A perfect example of why data lakes exist. It was obviously slow and cumbersome. I think they looked at commercial things, but nothing was at that scale, so they set out to develop this. These guys developed this amazing thing for the most demanding data warehousing problem out there, brought it into the open source, and basically, in the evolution of that, to take it more independent, they transitioned that project into Trino and continued to evolve it. It's actually really vibrant. It's one of these open-source success stories: a really wonderful community, tons of people, tons of big organizations that utilize Trino, continue to contribute to it, and evolve it. It's super vibrant, with excellent governance on that community, which I think fosters it staying healthy, staying fresh, and frankly, getting new capabilities at an incredibly rapid rate, which is wild to see. [0:09:02] SF: Then, back to the data lake. So, on the surface, I think a data lake sounds great. You take basically all your data, structured, semi-structured, and unstructured, throw it in the lake, and make it usable through BI, ML, whatever, R&D. But the reality, I think, is probably a lot uglier than the Instagram-story version of running a data lake.
So, what are some of the challenges with actually building, maintaining, and keeping a data lake usable? [0:09:29] AF: Yes. I mean, at the end of the day, it is more complicated and more DIY. You're assembling a lot of different technologies to source that data, format it, and maintain it. It's a lot more of a choose-your-own-adventure architecture, compared to a data warehouse, which is fairly opinionated: you're going to put some stuff around it, you're going to put ETL on the front, and you're going to put reporting and dashboarding on the surface of it. But it's very opinionated, versus a data lake, where you've got to make all these decisions. Frankly, I've gone down that road. I've built around a data lake architecture in previous lives to great effect and got great adoption. But it was one of those things where it does require more investment, more documenting of processes that you have to make up on your own, because the opinionation isn't there. When I look at the evolution of Starburst, and what we're trying to do around Trino, it's to bridge that divide and create more of a complete platform. Today, I think we're still in the era where data lakes are for people with the more demanding data problems, and therefore the value is there. So, the investment is there to put a reasonable-sized team on pulling together those technologies that you need to make a real data lake, and to do the ongoing maintenance, documentation, and data curation that make it successful. If you're not there, I mean, a data warehouse is an easy button in a lot of ways. You can get it going and it's very opinionated, which takes you down the road. Now, there are real benefits to a data lake architecture beyond scale. There are fundamental agility, flexibility, and evolvability benefits. But today, they come with this hard trade-off: you're going to go on your DIY adventure to make a complicated thing. You'll get those benefits, but they come at a cost. I think a lot of those costs are things we can beat down by having more of an opinionated platform that helps walk you down that path, so you just end up with a slightly different architecture in terms of your data platform. [0:11:19] SF: Yes. How would you even think through data modeling when you're dealing with a data lake? When you're talking about a warehouse, like you mentioned, there are strong opinions about how this should be done. There are tons and tons of references and books that you can follow. But with a data lake, there are far fewer frames of reference, and it is more of this DIY approach. So, how do you think through those initial data modeling challenges, and make sure you don't end up with something too rigid that shoots you in the foot later down the road when you need to adapt it to something new? [0:11:49] AF: Yes. The beauty there is that all of this prior work and thinking and literature on data warehouse modeling is completely applicable. In fact, in the limit, you could approach using a data lake exactly like a warehouse and model everything exactly the same way. You shouldn't, because you would forego all these awesome benefits of agility that come from a lake, but all the same modeling concepts come into play. The idea with a lake is, it's not a big bang. You don't have to do all that upfront. You can start out very flexible, very unstructured.
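As a concrete flavor of that "start flexible" approach: dropping a raw CSV into object storage and putting just enough metadata on it to query it might look like the following. This is a hedged sketch against the open-source Hive connector; the coordinator host, bucket, schema, and column names are all invented, and note that Hive's CSV format reads every column as varchar.

```python
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# Register minimal metadata over a CSV file that was simply dropped
# into S3. No upfront modeling beyond naming the columns; Hive's CSV
# format treats them all as varchar, and you cast later if needed.
cur.execute("""
    CREATE TABLE hive.scratch.signups_raw (
        signup_ts varchar,
        email     varchar,
        plan      varchar
    )
    WITH (
        external_location = 's3://my-lake/scratch/signups/',
        format = 'CSV'
    )
""")
cur.fetchall()  # consume the result so the statement completes

# The file is immediately explorable with plain SQL.
cur.execute("SELECT plan, count(*) FROM hive.scratch.signups_raw GROUP BY plan")
print(cur.fetchall())
```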
The truth is, these query engines are powerful. If you need to morph the data to look at it in interesting ways, you can start to build that into your query, and that's for more of your trailblazer users. Then, what a data lake lets you do is this. Let's face it, you're going to have your pillar data sources, your CRM, your ERP, your sales data, whatever it is in your organization. Those are going to be the first datasets that get super well curated in your lake. They may not even be native in your lake. They may be federated in from a warehouse, because you're already there. So, that's great. That gives you a core to your data model. Then, the way the lake operates, there are all these other data sources. You can acquire data. You can put more telemetry in whatever online presence you have. There are a million ways you can get additional data in. But the thing with a data warehouse is, it's expensive. You do have to do the modeling upfront and the ETL. It's expensive. So, for speculative data sources, you'll often make the decision like, "Okay, I've got to prioritize and do the ones that are more obvious from a business case." The data lake makes it super easy to create a world where you're like, "Hey, let's put that in. It's as easy as dropping a file and getting a little metadata on top of it." Then from there, you can start querying it. It may not be modeled perfectly. It may not be perfectly appropriate. But you can start to find out if there's signal in there. The datasets that are interesting, well, you put more effort into, and then you curate those as you go on. If they're really interesting, you go even further. You'll turn them into a data product with better documentation and sample queries that make them more consumable. For some of them that don't turn out to be paydirt, that's okay. You didn't pay a lot. That's actually one of the fundamentally wonderful things about a data lake: it creates a culture. It's not even that it's data-driven. It's sort of data-seeking. You're asking questions that are out of the scope of what a traditional data warehouse organization will ask, because they're too expensive to even pose in that architecture. It's one of these hidden things. I think, so often, we talk to customers that are looking at the TCO of their existing problems. They're like, "I've got my Snowflake bill. It's really expensive. How do I beat that down by maybe taking some of this stuff and putting the workload into a data lake?" That's awesome. I'm not going to take away from that value. That's real money that you could invest in other stuff. But to me, the real transformation of organizations that embody a data lake is that they actually see and collect more data. They're better at seeking out the data that answers interesting questions and actually using it to drive their business. That's actually a wild transformation to watch. I've been part of it. This is part of my transformation, part of why I'm at Starburst: it's the religion of the true believer, having lived this. And it's amazing to watch. It has wonderful effects. And you see more users. It's viral. You see more users in your organization seeking out data, trying to learn and become more literate in analysis. It's really an amazing thing to watch from a transformational perspective. [0:15:09] SF: Yes.
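The "promote the winners into a data product" step Adam describes can be as lightweight as a CREATE TABLE AS with documentation attached in-band. A hedged sketch, continuing the invented names from the previous snippet and assuming an Iceberg catalog named iceberg:

```python
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# Promote an interesting raw dataset into a curated, documented table.
# Catalog, schema, and column names are hypothetical.
cur.execute("""
    CREATE TABLE iceberg.curated.signups
    COMMENT 'Daily product signups, cleaned up from the raw CSV drop'
    AS
    SELECT
        CAST(signup_ts AS timestamp(6)) AS signup_ts,
        lower(email) AS email,
        plan
    FROM hive.scratch.signups_raw
    WHERE email IS NOT NULL
""")
cur.fetchall()  # consume the result so the CTAS completes

# Attach column-level documentation for downstream consumers.
cur.execute("""
    COMMENT ON COLUMN iceberg.curated.signups.plan
    IS 'Pricing tier selected at signup time'
""")
cur.fetchall()
```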
Do you think part of the value of the data lake is that there's a separation between the actual data that you're storing in object storage and, essentially, the metadata that you're going to use to make it queryable? So, there's a natural abstraction happening between the actual raw data and the metadata attached to it. Does that give you more flexibility, in terms of doing something one way and then throwing it away down the road when it's not working? [0:15:38] AF: Yes. That's part of it. I'm not going to lie, the cost model of using object storage for stuff that is speculative is pretty amazing. I think part of what's driven data lakes is the evolution of more and more efficient and capable file formats, and now table formats, on disk. You're getting the cost of storing your data in S3, which is dirt cheap, right? Compare that to storing it in a first-class data warehouse. And yet, you're getting these formats. With Parquet, we got a columnar representation that was incredibly fast to scan, so suddenly the performance trade-off changed. Now, with the modern formats, the Hudis, the Delta Lakes, and Iceberg, which in our view is really the front-runner in terms of openness and capability, you're starting to get a really nice set of semantics: the ability to do updates, the ability to take snapshots. It's really incredible. So, I think one of the unlocks in the space has been smarter and smarter formats that still let you achieve this decoupling of your compute and storage. You're paying the S3 rate for your data, which is amazing. [0:16:46] SF: Yes, absolutely. I want to walk through an example. Let's say I have a bunch of data being stored in object storage, like S3 or something like that. Then, I want to go through the process of actually creating a data lake. What am I using to keep the data organized and queryable? What are my first steps to turn that into something that I can actually use for analytics and drive insights from? [0:17:13] AF: If you've already got it in S3, well, then you've got an amazing leg up. In many cases, there might be a prior step there. In the simplest cases, like when you've got something coming off of a SaaS application, it might be as easy as getting an export. But it's going to come in some format that's usable, but not ideal. It's going to come in CSV or something like that. For more interesting sources, you might have data you want to replicate off a database, and you need to do some ETL, or Change Data Capture, that kind of thing, to get it in. If you've got streaming data, you've got stuff going over Kafka, and you'll need to figure out a way (and this is stuff we're working on building into the platform) to connect Kafka streams and actually land them in the lake. So, in many cases, prior to what you pose, there's work to do to even get something into S3. From there, your next juncture is making the data useful in a query engine, and that's getting metadata on it. It's understanding what the format is, what the actual schema represented in it is, and how it's actually laid out on disk. It's all that metadata that makes it consumable by a query engine, and getting that available in a metastore, in some kind of catalog. It could be Glue; it could be the Hive Metastore.
You've got to actually get that available so that a query engine can come up and say, "Okay. I can look at the definition of that table, and now I know what I'm going to find on disk when I go there." Again, this is where we're bridging things: having built-in catalog capabilities and schema-discovery-type capabilities. This is stuff we're building into the platform to ease those steps. These are DIY steps that we're trying to sand the work off of and have be more built-in. From there, you're golden, because now you can write a SQL query, and there's a ton that you can do there. If you're just exploring the data for the first time, maybe you're done at that point, and you're actually just posing some queries. It's just you, as an analyst, and you're trying to figure out what data is there. Maybe you do figure out there's data there, and you're like, "Oh, I've got to make these queries fast. They've got to show up in dashboards. They've got to be made available to interactive users." From there, you might want to transform the structure or the format. Maybe you want to turn it into an Iceberg table. This is the kind of stuff where there's a lot of capability in SQL itself, in an engine like Trino, to do that work right in the engine. Once you've got the data, once you can get that connection into the data from the engine, the engine is amazingly capable at doing ETL workloads, transformation workloads. That's the kind of stuff that you can do right in the environment. So, in a way, the game with the lake is about getting a toehold into your operational query engine environment, and from there, you're really golden. You can pretty much do anything very efficiently and at scale. [0:19:49] SF: Yes. I remember when I talked to Justin last year, one of the things he mentioned was that he was somewhat surprised that people ended up using Trino, and the Starburst enterprise version, to actually perform these transformations from one data format to another. [0:20:03] AF: Yes. In a way, I think it is actually a delightful surprise. In another way, if you look at the evolution of ETL, it started out with way more of these specialized platforms, the Informaticas of the world, and it really evolved into ELT: get your data in and then use SQL. Why do you need more languages and representations to actually work with data? Once you've got a toehold on it, use the environment. So, surprise, surprise. Once SQL became native to the data lake, this concept transported, because it was in everybody's mental space that that's how you work with data. So, I think it's played out in a way that's great for the user experience, and it also makes smart reuse of all the technology investments that are being made. [0:20:48] SF: Yes. And it's less expensive than moving your data around. [0:20:51] AF: Exactly. Exactly. [0:20:53] SF: So, once I have things flowing into object storage, and we've figured out how to define the metadata so that we have some way of making it queryable, how do you actually keep up with the performance demands? As this thing scales, how do we actually make it performant? [0:21:10] AF: Yes. That is one of the key investments, I think, in terms of lake maintenance. If you're successful, you're going to drive consumption of the data. If you drive consumption of the data, you're going to hit demands on performance. And you're going to hit other demands, too.
What I've experienced is, you hit demands on consumability, just people understanding: what is this, and what are the definitions? It's non-obvious stuff. You might be looking at your Salesforce data, and you've got an ARR column, and you're like, "Oh, what does that mean if that's a multi-year term?" You start getting these little things. They're small, but they're meaningful definitions in terms of getting the right answer out of a query you make. So, there's that. For performance, there's a whole set of activities that you want to do. Probably the most basic one is getting the data into a higher-performance file format. You might land your data in whatever format it arrives in; getting it into Parquet, or ideally one of the modern table formats like Iceberg that build on it, will get you a certain amount of speed. Then there's traditional database design: do you have appropriate partitioning and ordering? Those are considerations based on the kinds of query workload that you anticipate. A smart query engine like Trino is really powerful, not just in its ability to process the data in parallel, but also in pushing down predicates. The more leverage you can get from the underlying store and how it's laid out, in relation to your query workload, the better query latencies you're going to see for the common queries that you're doing. Then, there's other stuff. Because it's never-ending, beyond those basic design considerations, at Starburst, in terms of making this easier and having more tricks up your sleeve for delivering performance, we've got capabilities we're building into the Starburst platform around Trino. For example, one of the things that you can do to make object storage access better is smart caching. We've got a caching and indexing layer called Warp Speed that's built into the platform. So, caching for access to be faster, and indexing so that there's more interesting predicate pushdown that can retrieve just the relevant data when that kind of query comes through. You've got a WHERE clause, and it's very selective. You're just looking at the last day of data out of this mountain that's got years of data. That's great. That should be leveraged to turn that query around really, really fast. But you've got to have it laid out and indexed. So, having something like Warp Speed is a way that we put more smarts and easy buttons in the box, so that you can get that kind of performance out of workloads with those characteristics. [0:23:40] SF: Yes. I think, also, for some of these types of things, the performance demands are different. It's not like you necessarily need real-time performance, or sub-10-millisecond query operations, for some of these things. If you're basically unlocking access to data that you never had access to before, then it could be okay that the computation takes multiple minutes, if it gives you insights that you never had before as a business. [0:24:10] AF: Yes, certainly. I will say, though, it's maybe not for the most demanding real-time uses, but it's amazing. The Trino engine is blisteringly fast.
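The "better format plus sensible partitioning" advice might look like this in practice. A sketch, assuming an Iceberg catalog named iceberg and an invented raw events table with an event_ts timestamp column:

```python
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# Rewrite raw data into a day-partitioned Iceberg table (Parquet
# underneath), so selective queries only touch the relevant files.
# All catalog, schema, and column names are hypothetical.
cur.execute("""
    CREATE TABLE iceberg.analytics.events
    WITH (
        format = 'PARQUET',
        partitioning = ARRAY['day(event_ts)']
    )
    AS SELECT * FROM hive.raw.events
""")
cur.fetchall()  # consume the result so the CTAS completes

# A selective predicate like this can now be answered from a single
# day's partition instead of scanning years of history.
cur.execute("""
    SELECT count(*)
    FROM iceberg.analytics.events
    WHERE event_ts >= current_date - INTERVAL '1' DAY
""")
print(cur.fetchone())
```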
So, I used it for years before joining Starburst, as an interactive user, just trying to get at my data, and it's really amazingly conducive to data exploration, getting quick answers to questions. I remember a lot of times being in meetings and a question coming up, and just being able to quickly type some SQL and answer a business question in real time. So, it's great, especially I think for that kind of human data access. It's amazingly fast. Again, it depends. It depends on how smartly you've set up your data, and on all those variables. But in terms of what you can deliver on it, it is actually a pretty amazing technology. [0:24:49] SF: You've mentioned Iceberg tables a couple of times, and I feel like it's something that is becoming more and more popular. It feels like the weight of the industry is leaning towards Iceberg tables and moving away from other types of formats. What is it that Iceberg tables are doing that has made them popular? And how are they different from some of the formats that have been used in the past? [0:25:11] AF: It's really beautiful in that it layers on all the prior work on formats. It layers on things like Parquet. If you look at an Iceberg table, it's actually a whole directory of files underneath, and there are different segments and updates and parts of that table structure in there. If you drill down into one of the underlying leaf files, it's going to be a Parquet or similar. What it's fundamentally done is, on top of the actual raw data, it's maintaining metadata. It's got a manifest that's tracking versions and updates, and it's laid out in such a way that you can atomically land updates. It's laid out in a way such that, if you haven't pruned out a version, you can time-travel back into that version. Because fundamentally, it's just about which underlying files in the directory constitute that version. You can then get that snapshot version, which is actually pretty amazing. Honestly, one of the perennial problems in analytics is dueling numbers. You get different numbers. Being able to say, I want to see the data as of this moment, at a specific time in the past, is a really, really cool property to have out of the box from the table format. And it's done just by intelligently managing the set of underlying files, having a manifest on top of them that keeps track and does all the bookkeeping appropriately, such that you can keep all those versions straight, in addition to putting in new versions. Honestly, this goes back to the earlier discussion we had about the evolution from Hadoop. I feel like the data lake space has been on a journey. It's gotten more and more database-like along the way. It started out with Hadoop, looking nothing like a database. It was a bunch of giant data sitting in files with general-purpose computation and MapReduce on top of it. It ended up with SQL. Now, with Iceberg, which has really great integration into Trino, you start to be able to do all these inserts and updates, and it becomes much more like a SQL database. But you're still operating in the underlying architecture, where the compute layer and the storage layer are independent, and you have all the great properties and flexibility that come from that. It's wild.
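The snapshot and time-travel behavior Adam describes is queryable directly in recent Trino versions. A hedged sketch, reusing the invented events table from the earlier snippet:

```python
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# The Iceberg connector exposes metadata tables; "$snapshots" lists
# the versions the manifest is still tracking. Names are hypothetical.
cur.execute("""
    SELECT snapshot_id, committed_at, operation
    FROM iceberg.analytics."events$snapshots"
    ORDER BY committed_at
""")
for snapshot_id, committed_at, operation in cur.fetchall():
    print(snapshot_id, committed_at, operation)

# Time travel: query the table exactly as it looked at a past moment,
# which is one way to settle "dueling numbers" between two reports.
cur.execute("""
    SELECT count(*)
    FROM iceberg.analytics.events
    FOR TIMESTAMP AS OF TIMESTAMP '2024-03-01 00:00:00 UTC'
""")
print(cur.fetchone())
```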
I'll be honest, in my most recent organization, where we were using a lake, we weren't on Iceberg. I wish I had known about it, because it cleans up so many of these basic problems, and it adds so much richness without cost. It's almost like magic. It's this layer of intelligence that isn't really adding that much cost to anything. [0:27:39] SF: You mentioned the lake went from MapReduce to now essentially everything being queryable through SQL. I think it's similar to, if you look at something like MongoDB in the document store world: originally, they were called NoSQL stores, because they actually didn't support SQL. But then they added SQL, because they found out, "Okay, well, people actually want SQL." So, it feels like everything that starts out as something different eventually converges to SQL as the way to interface with it. So, do you think that the lake and the warehouse are converging, that these concepts are essentially coming together in some fashion? [0:28:15] AF: I do think that. I mean, it'll be interesting to see. A warehouse is a specific, highly integrated storage-and-compute system, and there's stuff you can do with that, I'm not going to lie. The value of that stuff, and what portion of your workload really benefits from maybe those last bits of performance, I think that remains to be seen. I really do believe, if you look at the overall set of things we're doing to add more opinionation and completeness to the lake platform, that for many use cases, it does start to cut into what you would've wanted to do in a warehouse. Data warehouses are one of these long-term, highly incumbent and entrenched technologies. I think for many organizations, especially if you've got anything like a greenfield or a replatform, today, you're making a trade-off choice to go lake versus warehouse. I think, depending on the space of your requirements, that's going to become less and less true. You're going to see many, many cases where a normal organization, for which today it would be too exotic to go with a data lake architecture ("I'm going to take the out-of-the-box, structured warehouse"), goes the other way. It'll start with the more interesting organizations that have more complexity to their data, but I think it'll reach down into organizations with more standard data. So, I totally agree. I think there is a bit of a sorting out there. And again, it goes back to, for me, the benefits of the lake being flexibility. It really opens up what you can do and how quickly you can land new data in there. You look at what many organizations do: their answer, instead of the lake, is to do an export into a spreadsheet. They take the new data that they're trying to understand and put it in a spreadsheet. It's scale-limited. You end up with dueling spreadsheets that have different representations of different data. It's just so messy. There's such a better way to do that version of data experimentation that people intrinsically want to do, which is to create a lake, make it super easy to experiment with new data, make it easy to then promote the winners into the main lake, and make them available to more and more users. I do think that there's a very natural evolution there. And it is funny that you mentioned SQL.
SQL is one of these technologies where there's clearly a bit of manifest destiny to it. There's something about it: its learnability, its expressiveness, its simplicity. I went through those NoSQL years and thought NoSQL had a lot of legs, and then SQL turned out to be the terminator, it seems. [0:30:41] SF: Yes. Every time you think SQL is dead, it comes back stronger than ever, essentially. [0:30:46] AF: Exactly. [0:30:47] SF: So, we talked about how, in the Hadoop era, MapReduce was kind of what we had available to us, because we didn't know a better way of solving the problem. The warehouse, in a lot of ways, is not a new technology. We moved it to the cloud, but it's not that different from what existed in the on-prem world. So, the point I'm trying to make is, with the warehouse, we had to use that technology because we didn't know how to do this with the lake architecture. Now, we're at a point where we can actually apply the lake architecture, but also make it performant. Is that essentially the future that we're moving towards? Because companies are going to have more and more complex forms of data, and there's a lot of work to take data from one place, export it in one format, and transform it into something so that we can stuff it in an expensive warehouse, versus just putting that raw data wholesale in the lake, and then using the other technologies that we have available to us to make it queryable and actually drive insights from it. [0:31:52] AF: I absolutely agree that that's the direction. The thing I'll say, and maybe one of the more interesting aspects of Trino and what it's capable of, is that I think what you're describing is going to happen. But if that had to be a forklift transformation, if you had to suddenly say, "I'm going to go lake now," and that means offloading everything from your warehouse, I think that'd be a huge barrier to entry. But with federation capabilities, if you've got a giant cloud data warehouse, you can leave it all in place. Trino is incredibly capable at querying that, pushing, for instance, predicates down so that you only retrieve what's necessary, taking those results back, parallelizing, and combining them with stuff from the lake. It gives you this idea of data federation, or data mesh as the concept is sometimes called, as a practical, evolutionary route to the lake. The lake is where all the benefits really are, I think, in terms of flexibility, agility, and cost. But you can't say you've got to go there in a big bang. You can offer an incremental path instead. And by the way, if you've got relatively slow-changing, well-managed stuff in databases or warehouses, it may take you a long time before you ever bother to move that, because who cares? It works perfectly fine. So, honestly, a lot of what I think about in this space has a gravity towards the data lake, but is practical about the reality of data platforms: they're very heterogeneous, and there's a lot of existing investment that's been sunk into them. I think a platform that's going to drive the journey to the lake has to be something like what we're looking at, where it offers a federation capability that gives you an incremental journey and lets you benefit from the investments you've already made.
Just being practical. [0:33:36] SF: In terms of Trino, what's going on under the hood to actually make it scalable and performant, working with these kinds of technologies? [0:33:44] AF: At Starburst, including the CTO group who created the technology, many of the maintainers are here. Not all of them; there are maintainers and contributors at many other big companies and such. But we definitely have an amazing crew here. So, I won't claim to be the expert after my first four months in the organization, by any stretch of the imagination. But some things are very clear about the technology. A, it's very, very good at parallelizing the workload. It's really good at chopping the work into granular tasks and scheduling them across the cluster in a really efficient way, and you can operate this at massive cluster scale. You can throw tens or hundreds of nodes at a giant data problem. This came out of Facebook, so it really was built for cutting up the work. This is the strangely counterintuitive thing that I had to come to understand about the data lake. You're like, "Oh, how could this possibly be efficient? You're storing this data in inefficient object storage." But it just cuts it up. If you chop up the work, the access, and the processing into granular enough bits, you can do amazing things in terms of turning around queries very quickly. The other thing is the architecture. They really developed a beautiful architecture for extensibility. They have these extension points called service provider interfaces, SPIs. That's where you plug in things like data sources, underlying databases, and object storage systems. Or even NoSQL: Elasticsearch will plug into it, or Mongo will plug into it. You've got these connectors to systems. What they do, and I've already kind of alluded to it, is, A, all the semantics matching. You have to actually come up with one set of semantics, which involves how you handle the types and semantics of the underlying databases. It does that. But it's also smart. In that SPI, it has knowledge of what it can push down in terms of predicate filters, or aggregations that can be handled by the underlying source, so that you're pulling back more distilled data. Between parallelizing and processing in place as much as humanly possible, you really get amazing performance and scale out of the thing. [0:35:48] SF: So, normally, in a traditional database, you kind of have these two systems. You have the piece that's in charge of the query engine, and then you have the actual storage. With Trino, essentially, it's like a database that only does the query engine piece, and the storage is separate and handled by different technologies. So, how does something like a join work across what could be different, separate data stores, through one single query interface or query engine? [0:36:19] AF: It's a great question, and it absolutely does that. So, first of all, it's got a smart, cost-based query planner. If there's work that can be pushed down below the join, especially filtering, right? If I can pull back less, there's less stuff to join. That's great. So, it's got a smart query planner that will do that kind of optimization. But then, it's able to pull back the data into the worker nodes in the cluster. It runs on a cluster architecture.
And it'll actually pull that data back, pulling data from other sources. So, in one query, you might be pulling data from multiple different kinds of sources, chopping those into small bites. Obviously, there's data shuffling. It'll do smart things around join selection. If you're joining a very small table against a big table, you might broadcast that; you do a broadcast join. It's going to do smart algorithm selection for the join. But at the end of the day, it's going to shuffle the data such that you're doing joins in small nibbles, with the stuff that matches up in the way that it needs to. It's just very clever. It's a combination of really nice architectural design with some of the core, long-standing distributed database concepts of how you do a parallel join, combined with the fact that you've abstracted those underlying sources. From the engine's perspective, once it's doing the join, it almost doesn't matter where the data is coming from. It's abstracted away, which is beautiful. [0:37:32] SF: What about something like data governance, or controlling access to these different data points? How do you do that consistently across what could be a bunch of different types of technologies underneath? [0:37:46] AF: You can have access control at different levels. Even hitting a lake, if I'm hooking up to data in, say, my S3, I'm going to create an IAM role that will let the system get into that, with whatever minimal privileges I want to provide. So, there's that underlying system access that you can use to lock down what's available for access. But then, on top of that, it is really very much like a database, with a full access management system where you can go in. It's got roles that you can define and a whole permission system. This is, again, in the space of opinionation: providing an out-of-the-box access control system that's opinionated and database-like. It's one of these things that, in my past building this, you would have DIY'd. You might have used an open-source access management thing like Ranger. Again, you're in the business of assembling the platform. With Starburst, we're trying to put more of that right out of the box, so that you don't have to invent it. It's something that's built in. [0:38:43] SF: Then, you mentioned Warp Speed earlier, which is the caching layer. How does something like invalidating the cache work? That's one of the classically hard caching problems. [0:38:54] AF: The simple answer is, the Warp Speed technology lives right in the Trino cluster. It's built into the Trino processing. It's inserting itself between the actual query evaluation and the data access, and it's looking at the traffic, caching stuff and indexing stuff based on the workload. Based on analyzing the workload, there's simple stuff, just like what's been recently used. But in terms of continuing to evolve that technology, there is smarter and smarter workload analysis that you can do. Looking back further, looking at the trends, looking at what has value, so that you can have a smarter and smarter cache replacement policy. Today, it's simple but effective. But I think this is an area where we're continuing to invest and make it smarter, so that you make the best use of your limited storage on the cluster.
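On the governance point a moment ago: where SQL-standard access control is enabled for a catalog, grants read like ordinary database administration, regardless of what storage sits underneath. A hedged sketch; the role, catalog, and table names are invented, and the exact statements depend on how access control is configured in a given deployment:

```python
import trino

# Connect as a user with administrative rights. All names below are
# hypothetical, and this assumes SQL-standard access control is
# enabled for the iceberg catalog.
conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="admin")
cur = conn.cursor()

# Define a role and grant it read access to a curated table. The same
# statements apply whether the table is backed by S3, a database, etc.
cur.execute("CREATE ROLE analyst IN iceberg")
cur.fetchall()
cur.execute("GRANT SELECT ON iceberg.curated.signups TO ROLE analyst")
cur.fetchall()

# Review what has been granted on the table.
cur.execute("SHOW GRANTS ON TABLE iceberg.curated.signups")
print(cur.fetchall())
```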
[0:39:42] SF: What are some of the other types of investments that you're making at Starburst? [0:39:46] AF: Yes. There's some really, really cool stuff. We've been on a journey. To me, one of the more game-changing things in terms of building forward on what Trino can do: Trino introduced an abstraction layer where you can persist intermediate results mid-query. This is a new mode of execution for the Trino cluster called fault-tolerant execution. Trino is fundamentally an in-memory engine. But what happens if you lose a node mid-query? Historically, the answer would have been, you have to start over, which is a bummer. But if you hang on to intermediate results, buffer them, you can restart midflight. So, we're working on better efficiency for executing queries that way, making it more and more competitive with the purely in-memory execution mode, offering more resiliency, and better handling longer-running, ETL-oriented workloads that might run for a long time. We talked about Iceberg a little bit. That's one we're doubling down on. I would expect us to continue to track that and expose the capabilities of Iceberg, but also build in more out-of-the-box ways to hydrate and populate your data in the lake. I mentioned the first one of those: a streaming ingest, which we've got in a sort of preview right now, to pull data off Kafka and land it in Iceberg files. I expect us to keep doing things in that neighborhood. As I mentioned at the very top, there's a lot going on; it's like jumping onto a moving freight train. There are a lot of areas of investment, and we're trying to focus them. I think the vision, if it's come through along the way, is to really strive for completeness, and make that data lake concept accessible to more mainstream customers who may not have the wherewithal to imagine and construct their own data platform. So, I've omitted about a thousand things that we're doing; those are the little ones that stick out to me, because they're part of that journey. [0:41:28] SF: Yes. I mean, the data space isn't getting smaller. It's getting bigger and more complicated. So, you mentioned streaming there with Kafka. What kind of new challenges does integrating with a streaming service add to the complexity of managing a lake? [0:41:43] AF: Well, actually, I think it's not really the lake maintenance; the problem in and of itself is challenging, just to keep up. You've got to actually be catching streams, and streams can come in very high-volume varieties. So, you've got to keep up. It was one of these features that, on the surface, sounds simple, but it ended up being a pretty substantial investment in the actual processing architecture to consume the stream and efficiently turn it into Iceberg. Also, if you really are flooding data into Iceberg, you've got to be smart about how you're applying those updates, to keep the Iceberg tables consumable and efficient to process. So, it brought new architecture challenges. But at the end of the day, I think it's so valuable as an example of having more ingest capabilities, because streaming data is one of the motivating use cases for the lake. It's something where you're like, "Okay, it's just going to be a pain and so expensive to keep up with that in a warehouse." So, of course, we land it in object storage. How much nicer to be able to then query it readily than to have a separate ETL step?
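Starburst's managed streaming ingest itself isn't shown here, but the shape of the Kafka-to-Iceberg idea can be approximated with the open-source Kafka connector, which exposes topics as queryable tables. A hedged, batch-style sketch (not true streaming); the topic, catalog, and column names are invented, and it assumes the topic's message schema has already been mapped to columns:

```python
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="etl")
cur = conn.cursor()

# Land recent Kafka messages into an Iceberg table. The Kafka
# connector's hidden _timestamp column carries the message timestamp.
# This is a periodic batch INSERT, not Starburst's managed streaming
# ingest, and every name here is a hypothetical placeholder.
cur.execute("""
    INSERT INTO iceberg.analytics.click_events
    SELECT user_id, page, event_ts
    FROM kafka.default.clickstream
    WHERE _timestamp > current_timestamp - INTERVAL '5' MINUTE
""")
cur.fetchall()  # consume the result so the INSERT completes
```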
And all the hard decisions about what do I make queryable, and how do I manage that at reasonable cost? It can all be part of the platform, with good cost, and all queryable, depending on what you need to do. [0:43:04] SF: What do you think the big problems or challenges in the data lake space are that have yet to be solved? [0:43:10] AF: That's a great question. I do think that there are plenty of meaty problems. I mentioned really being able to execute longer-running queries without a big performance sacrifice. Honestly, I don't think it's so much a deep technical challenge as it is a product challenge: fitting the right shape of platform completeness to unlock the lake for those more common customers, the customers who aren't going to be doing the most complex stuff. I think there's a never-ending set of performance challenges. There's absolutely keeping up with the space of the lake formats, like Iceberg. There are a lot of meaty things in terms of technical challenges. I don't know, there are probably big-concept ones out there that I'm missing, but there's certainly plenty to keep us busy in the meantime. [0:44:02] SF: Yes. It feels like one of the most relevant challenges at the moment is putting the right technologies together to make the lake actually usable, to basically lower the barrier to entry. [0:44:13] AF: Yes. I think, from a product perspective, as a vendor trying to build a product, it's an interesting balance. Part of the brilliance of the lake is that it's been built in a very open environment, and there is composability, and there's room for lots of different technologies that add value. We don't want to lose that composability and customizability. Yet, at the same time, I'd love it to be the case that, to get going, there's a box you can open that's got the main pieces, or at least basic versions of them, and you can very rapidly start building on a data lake and get going. So, that's a bit of a balancing act. At the end of the day, I do think that in this space, there's a whole bunch of different companies and technologies that are partnering together and solving the big problems, and you never want to lose that connectivity. Even as I was joining, one of the things I observed, and I was really proud to be joining the company, was I started looking around at all the relevant tools: all the BI tools, your ThoughtSpots and Tableaus and such, all the tools for creating and managing your data, like dbt, and all the ETL products. And everywhere I went, I saw Starburst at the top of the list of integrations, up there with the big guys. I was like, "All right, this is great. We've had a really smart partner organization that's out there, working with the other technologies." Because I think it's so essential. It's not going to be a one-company show. But yet, I do hope we can build something that stands on its own and really lets you get going without a lot of complexity. [0:45:39] SF: I mean, I think having those partnerships and those simple integrations is what helps lower the barrier to entry, because if you have to go and do a bunch of DIY work to pull this thing together, and plug that thing in over here, and so forth, it just adds more and more complexity, and also probably more risk of making a mistake. [0:45:59] AF: Yes, 100%. [0:45:59] SF: Well, Adam, thank you so much for being here. This was a really fun conversation. I think we covered a lot.
[0:46:04] AF: Great questions, and it was fun to get to talk Starburst and Trino for a bit. [0:46:07] SF: Cheers. [END]