EPISODE 1681

[INTRODUCTION]

[0:00:00] ANNOUNCER: Sammy Steele is a Senior Staff Engineer at Figma and the Tech Lead for their Databases Team. She previously worked at Dropbox, where she built out their petabyte-scale metadata storage and search systems. Sammy recently published a blog called How Figma's Databases Team Lived to Tell the Scale. The blog went viral and made it to the top of Hacker News. We invited Sammy on the podcast to learn more, and she is our guest today.

This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:00:46] SF: Sammy, welcome to the show.

[0:00:48] SS: Thanks so much for having me, Sean. I'm really excited to be here.

[0:00:51] SF: Yeah, I'm excited to chat. Let's start with some basics. Who are you? What do you do?

[0:00:55] SS: Hi, I'm Sammy Steele and I'm a Senior Staff Engineer at Figma, where I'm the lead for our databases team. In case people don't know what Figma is, it's really a collaborative suite of software that lets people go the whole way from having an idea to designing that idea and turning it into a fully realized product. We're probably most well-known for Figma Design, which is where folks can build apps and websites and build all of the UI for that inside it. As an engineer, I really use FigJam, which is our whiteboarding, collaborative brainstorming tool, all the time. I'm not a Figma power user, but I love FigJam. That's what my company is.

As far as my team, our secret sauce here at Figma is that we have this really collaborative real-time experience in the browser, where people can comment on each other's docs and have interactions with each other to build better things together. To power all that, you have to store a lot of state in databases, things like what users exist, or what files exist. They have to be really fast, they have to be reliable, they have to be scalable. That's really my team's job, to make sure that our whole database platform works seamlessly, and then the whole rest of the company can focus on building really cool products.

[0:02:19] SF: Yeah, I think that's a great overview. I've been a Figma user for a number of years now. I think one of the things that really blew me away about it is there are certain applications that, through my 20-plus years of being in tech, I couldn't have imagined transitioning from desktop applications to being a web-based experience. Take something like Photoshop, for example. I started playing around with Photoshop as a kid in school in the 90s, and I could not imagine that ever being something that you could do in the browser. I think that's really what blew me away the first time I used Figma. Then you tie that into the interactive stuff that you can do, the collaborative elements, the comments, the ways that we would typically think about modern systems, like Google Docs, or even collaborating on an open-source project, and bring that to design. It was just mind-blowing for me the first time I experienced it.

[0:03:17] SS: Yeah. I'm not a designer, but one of the things I really love about working at Figma is being at a company where so many people are really excited about our product. It's just really cool as a backend person to build something where I'll be talking to people on the street, and I mention I'm working at Figma, and they're like, "Oh, my God. I love Figma." I definitely feel very lucky there.
[0:03:37] SF: Well, so how did you end up there and what made you want to join in the first place?

[0:03:41] SS: Yeah. A little bit of my story. Before Figma, I had spent my whole career at Dropbox. Both of my college internships were at Dropbox, and I was at Dropbox for five years as a full-timer. I spent all of that time on various backend infra teams, but I think I was really lucky to be there at a time when the company was growing really fast and reaching petabyte, exabyte scale, while the engineering team was quite small. I was part of a five-person team that built a petabyte-scale search system from scratch, because back in 2012 to 2015, Elasticsearch and OpenSearch didn't really scale, and so we had to build it ourselves.

Then about three years before I joined Figma, I moved over to databases. That was really the first time I fell in love with databases, because at Dropbox, we had built our own database system in-house. We had actually horizontally sharded on top of MySQL, but Dropbox was getting to the point where that sharded MySQL wasn't scaling to our size. We had a single customer so big that it was filling up a database. We ended up building our own key-value store in-house, and having to figure out how to migrate a petabyte of data, and that was about a three-year project with just a phenomenal team. Actually, that story has never really been told, because all of us at the end of three years were so ready to do something different that a bunch of us left Dropbox.

I was the first one to leave from my team, because I had known for a while that I wanted to try something different. I wanted to go to a smaller company that was growing, where I could feel I was having more of a say at the organizational level and be able to change stuff. I really wanted to go somewhere where I knew I could build a really strong career. It was 2022, the height of the tech bubble. I probably had conversations with 50 startups and companies when I was trying to figure out where to go. I come from a VC family, and I grew up in SF, so I tend to naturally be a little bit skeptical, just because I've seen so many great people with good ideas whose companies don't work out. I basically narrowed it down to three companies that I felt had really good engineering culture, really good interesting problems, and actually had really good business fundamentals and revenue. One of those three was Figma. I interviewed at all of them, and I just felt Figma was the strongest of them. I'm so happy I came here, because it's ended up being everything I wanted, and a lot of things I didn't even know to look for.

[0:06:12] SF: That's a very VC-like framework that you're using for evaluating companies. Maybe there's some influence there from your family members. I also made a transition in 2022. I think probably a lot of folks did, and it's true, I probably talked to 50 different companies during that process as well. It is a lot of finding the right combination of things that is a good fit for you and a good fit for your ambitions and what you want to do. So, I want to talk a little bit about the database systems at Figma. When you started there, what was there as you stepped into the backend systems? What was the database setup at that time?

[0:06:50] SS: Yeah. Well, maybe first, I'll give a little bit of color about where Figma was as a company, because I think so much has changed over the last few years. Really, the database story starts a few years before I got there.
But when I joined, infrastructure was 30 people. Since then, we've at least tripled; I think we're a little over 100 folks now. That growth has been really great. I mean, especially when so many companies are doing layoffs, I feel really lucky that we're still growing, but it has definitely changed a lot. I think I got to have a lot of say and got to do a lot of cross-functional, cross-organizational things.

A fun story from early in my days at Figma: last year, we were having our annual user conference, which is called Config. Config happens every June. This year, it's June 26th and 27th. It is by far the biggest day of the year for us. All of our product launches are planned around that. Last year's Config was so much bigger than anything we'd ever done before. It was seven times bigger than our previous Config event. It was really exciting from a product perspective, and it was live at Moscone Center in SF with 200,000 people joining in remotely. But from an infra perspective it was terrifying, because suddenly, we're launching this big conference, we're going to be doing live demos of the features, and we want to GA all of these features from 0% to 100% in production during a keynote speech that our CEO is going to give. I was like, "Oh, my God." Before, we didn't plan around Config at all. We just did things and it worked, but at this scale, we couldn't do that anymore. We needed reliability plans. We needed backup plans. We needed this to work. We ended up doing a lot of planning, so that every product team had an infra partner and we could make sure things were reliable, and we all sat in a war room, and thankfully, it all went off without a hitch. With all this exponential growth, even outside of databases, you end up with a lot of cool problems.

[0:08:50] SF: Yeah. Obviously, Figma has gone through explosive growth over the last couple of years. Going back to when you started there, what were some of the problems from an infrastructure and database standpoint that you were seeing from a [inaudible 0:09:05]?

[0:09:05] SS: Totally.

[0:09:06] SF: That were going to stop, essentially, the scale and growth that you wanted?

[0:09:09] SS: Yeah. Okay, so really the problems started in 2020, about two years before I joined the team. At that point, infra was something like 10 people, and there were no teams. There was just infrastructure. They realized that their database was on the largest instance that Amazon offered, and that it was rapidly running out of space. I think people have trouble realizing what that means. If your database fills up and falls over, and you can't just pick a bigger machine, your company is going to go down. The website will go hard down every day at peak traffic for five or six hours until you fix the thing, and the fix, if it's gotten that bad, is probably a month or two of work. Your company probably won't exist if you get to that point.

[0:09:51] SF: Yeah. Was this AWS RDS running on –

[0:09:55] SS: A 24XL.

[0:09:56] SF: - X3 instance, or something like that?

[0:09:57] SS: Yeah, so we had managed to scale for 10 years on a single RDS PostgreSQL instance, which was a 24XL. I think it's cool for all the startups out there: you can get a very long way and be very successful on pretty simple architectures. I think that was a cool lesson from Figma. But clearly, we had outgrown it at that point.
What was then a really small team of four or five people, most of whom had never worked in databases before, basically had to figure out how to go from having almost negative headroom on the database to adding replicas, adding caching, and moving and re-architecting some really expensive workloads, just to buy some headroom. It has been a four-year journey to get from having zero headroom to having what is essentially infinite headroom for us now. But along the way, there have been a lot of scary challenges, because a lot of times people don't understand what exponential growth actually means. I'm a math major and I still sometimes have trouble wrapping my head around it. Because if your database is close to falling over and you come up with an idea that's going to reduce CPU by 30% for two months of work, that sounds really good, right? It would seem like you should take that trade-off. But the math comes out to: if you're growing two or three x every year, those two months of work actually only buy you about a month of runway, so you may have actually lost runway on that project. The only way you can actually keep up with exponential growth is to come up with solutions that let you scale out exponentially as well, and that's where horizontal sharding came in.

[0:11:29] SF: Yeah. These band-aid solutions that maybe were a first course of action, like, we'll add read replicas, separate writes and reads, we'll add caching, so then maybe we don't have to hit the database quite as hard all the time. They're only going to get you so far, especially when you're talking about this exponential growth, because at some point, you need, basically, a first-principles approach to solving the problem that is probably fundamentally different from whatever you're doing today, because you just can't get those exponential gains otherwise.

[0:11:58] SS: Exactly. I think a lot of what the early team did was brutally prioritizing so that we could get the headroom to make the bigger foundational investments. Because horizontal sharding, it still took our team nine months to shard the first table. If we had had three months of headroom, or even six months, that's a non-starter. You can't even do that. That's partly why it was a four-year journey, because we kept having to buy time: now we have six months of runway, now we have nine months, now we have a year. Also, that's constantly changing, because when you're launching a product, there's a clear date. You're like, "April 15th, this product must be done." But when your database is trying to scale, there is no one telling you, in exactly 3.5 months my database will fall over. You're growing at a rate, but someone launches a product and somehow you're growing twice as fast now, and then maybe you don't grow for a month, but then the growth spikes up. You're racing against this clock, but you don't even actually know where the clock is.

[0:12:52] SF: Yeah. I definitely want to get into the details of the horizontal sharding. But before we get there, as you started to embark on this journey of trying to fundamentally solve this problem at infinite scale, how did you balance, essentially, keeping things going? Presumably, you're still growing exponentially during this point, where you're investing nine months into trying to solve the actual problem, but you need to keep the business running and the database operational, while also investing a lot of time into solving the fundamental issue.
How did you balance the requirements of, things are on fire right now, but we also need to solve this long term?

[0:13:31] SS: Yeah, totally. I mean, I think part of it was that luckily, by the time we got to actually doing horizontal sharding, we had about six months of runway, and we had some ideas for how to extend that, like some things we could push off to other product teams. That's actually my first piece of advice: the first thing we looked for was how to get other teams to help us out here, because we were never going to ship this if it was just the databases team, which when I joined was still only seven people. We couldn't have done this alone. A whole lot of other teams chipped in and even lent us resources. We had two people join our team to help out temporarily. I think you have to make a lot of hard trade-offs, but anything you can push to other teams when you're dealing with an existential threat to the business, do that.

Then really, it's a risk trade-off at the end of the day. We did defer reliability risks that we knew at some point we'd have to invest in, but we felt scale was a bigger risk and that we could probably wait a year to actually build those things. Now, I'm really happy we're getting to make a lot of reliability investments that we knew we wanted to make, but they just felt a lot less urgent than the burning fire of scaling.

[0:14:36] SF: Right. Before you essentially arrived on the idea of doing horizontal sharding, what were some of the initial requirements that you thought through? What was, ultimately, the goal state that you wanted to reach as you started to investigate a solution to this problem?

[0:14:53] SS: Yeah. Obviously, we're not reinventing the wheel here. A lot of companies have horizontally sharded in some way or another before. I think what's fun about working in databases is that every company brings its own set of unique challenges and requirements. For us, I think our top overall goal was to minimize developer impact, because at the same time, Figma's product was taking off, and we wanted to keep people excited about it, which means we wanted to keep building cool features people want to use. If we had stopped the whole company for two years to scale, we probably also wouldn't have been a successful business. That was very top of mind.

As infra people, we always wanted to maintain reliability and consistency, because those are things that our core product is built on top of. We also never wanted to get into a one-way migration. You never want to end up on a new system with no way to get back to the old system, because inevitably, there will be unknown unknowns. Unexpected things will happen. When that happens, you want to have a way to get back to the old system safely, instead of just having your website down for hours until you fix it. I think those were three of our big core things: minimize developer impact, avoid one-way migrations, and maintain reliability and consistency. The other was probably play to our strengths. We knew the things we were and weren't good at, and we really tried to take a pretty tactical approach based on what we knew we could ship quickly and successfully.

[0:16:17] SF: Because you didn't want to do a one-way migration, was that one of the reasons that you decided to stick with PostgreSQL, given all the distributed database options that are out there, like Spanner, CockroachDB, or even a document store, like a Mongo, or a Dynamo?

[0:16:34] SS: Yeah.
That was certainly a big part of it. I think universal database scalability is probably, to me, the most unsolved problem in all of the cloud offerings. If you need a monitoring solution, most people are going to use Datadog. If you need caching, there's Redis, there's Memcached. Data analytics is mostly Snowflake these days. But I think the best way to start a flame war with a whole bunch of infra engineers is to ask them to pick the best database, because everyone is going to have a completely different answer. Like you said, there are all of these options out there, which a lot of companies have really successfully made work at scale. I guarantee you, for every one that has worked, there's probably also a company that tried the exact same database and didn't have it work for their use case.

Actually, this is a fun story that didn't make it into our blog post. We spent about three months exploring one of the database technologies that you mentioned. I'm not going to tell you which one. There's nothing wrong with the database. It's a great database, but it didn't work with our very specific performance and reliability needs. You have a couple months of headroom, and you've just burned three months on what was basically throwaway work, where you hit unknown unknowns. We were really, really risk-averse after that. That's a big reason why we decided to build incrementally in-house, versus trying yet another option.

[0:17:51] SF: Yeah. You don't want to burn another three months on something that you find out is not going to work for whatever your specific requirements are. Versus, at least you know that you're familiar with the technologies that you're using, so then can you essentially make them scalable by applying the tried and tested approach of horizontal sharding?

[0:18:08] SS: Totally. Plus, I think a lot of people really underestimate how many big challenges there are to adopting any new database technology. First, the migration, which you touched on. If you want a reversible migration, that is going to be really hard. Even if you don't, just migrating consistently with low downtime is pretty challenging. Then you have all the performance characteristics, the workload things that are very specific to a particular database. Then you have team expertise, which my team had spent years building around RDS. When I was at Dropbox, we were performing a similar migration, because we were moving from sharded MySQL to this new key-value store we had built. We spent probably an extra year, because Dropbox cared so much about consistency, trying to migrate in a way where we could guarantee 100% consistency by building atomic cross-shard transactions across two different databases, which was pretty cool. But also, it took an extra year. We certainly didn't have time to do something like that here at Figma.

[0:19:05] SF: Yeah. Dropbox, because of the nature of the business that they're primarily in, file storage, is going to yield different requirements at the database level than probably what you're dealing with at Figma.

[0:19:16] SS: Totally. As far as team expertise goes, I think people assume that if you have an open-source or a managed product, it just works. But at scale, that is pretty much never true. I think most people who have worked with these technologies at a big company know that you actually end up having to become a bit of an expert, because you will hit weird edge cases that no one else has hit before.
Some examples of this: my team has personally had to dig into the source code of PgBouncer, which is an open-source connection pooler on top of PostgreSQL, and even into the PostgreSQL source code to understand vacuuming problems that we were running into, where CPU was spiking for us at the very end of our vacuums on our largest tables. Every single time, CPU would hit 100% for about a minute. This was happening five times a week, and we were trying to figure out why. We ended up reading the vacuuming source code about 20 times before someone realized there's a special kind of query planning that only happens at the end of vacuuming, and that's five times more expensive than normal query planning. Weird things like that, you actually have to learn if you're operating any of these systems at scale.

[0:20:23] SF: Yeah. I think that's why you see a lot of these, like a Spanner, or some of these other distributed systems, come out of companies that are operating at massive scale, like, for example, Google. Because nothing off the shelf was going to solve this problem for them, so they had to essentially deploy their own internal resources to solve it fundamentally for their use case. Then eventually, they rolled that out as something that other people could use. In terms of some of the challenges that you encountered as you started to build out this horizontal sharding plan, what were some of those initial choke points, or technical challenges, that you had to overcome?

[0:21:02] SS: Yeah. There were definitely a lot of them. I'd say two big ones. The first was figuring out the bare minimum feature set we had to build to ship our first table. Because we still had a very small team, and we had about six to nine months of headroom. We knew we needed to do something. We also were afraid that if our first table takes a year, and then our next table takes another year, that's pretty risky for the company. We wanted to pick the simplest possible feature set. That meant picking a table that had a very simple data model and let us shed a lot of features that we could then build six or nine months down the road.

The table we ended up picking was our user file recents table, which is where we store the data for, when a user clicks on a file, we store that this is a recent file. Then people can go see what they've viewed recently. It was a good candidate for a few reasons. It was a very simple data model. Sharding by user ID was really easy, since basically every query included the user ID. It also allowed us to use a UUID as the primary key, which meant we avoided building a whole system, which we've since had to build, for general global unique ID generation. Also, at the end of the day, it was much easier to do a table that wasn't so obviously critical. We don't ever want to lose any user data, but of all the tables at Figma, if a person is missing one row in their recent files, they're probably never going to notice. Obviously, we tried our very best, and we were successful in not doing that, but it felt a lot less risky than picking the files table.

[0:22:40] SF: Yeah, absolutely. People are going to notice when their design disappears, but maybe not so much a recent file.

[0:22:46] SS: Exactly. I think the other cool one was cross-shard transactions, because this is something we touched on very lightly in the blog. With a lot of these NewSQL options, they give you atomic cross-shard transactions by default.
For folks who aren't familiar with cross-shard transactions: basically, when you have one database, you get all of this atomicity for free. If you're doing a migration where you move a whole bunch of org data from one company to another inside Figma, you can do that all in a transaction, and you're guaranteed all the data will just move from one company to the other. Once you're moving across two different databases, you can imagine the writes could succeed on one and fail on the other, and you could actually lose data in the middle of that, which is absolutely terrifying. Figuring out what to do about that was one of the first really interesting challenges we had to tackle.

[0:23:36] SF: You chose, essentially, this table of recent files. Then the user ID was the shard key that you used to shard the table, is that right?

[0:23:44] SS: Yeah, that's exactly right. Then the reason transactions came in is because you'd think this would be a table where not a lot of interesting business logic happens, but actually, this table gets updated tangentially in a number of other business-critical use cases. We went through a couple of different approaches here. Initially, our thinking was that we would actually let application teams deal with this. The two options we initially thought of were, one, have product teams rewrite their code so that it's robust to what we call a partial commit failure, where one database write succeeds and another fails. That seemed like it would push the problem off of the database team's plate and let product teams solve it themselves. The polar opposite would be to build atomic cross-shard transactions in-house, which we could have done. Actually, collectively on our team, we have probably implemented four or five different kinds of atomic cross-shard transactions, because a bunch of people from my team built different kinds of them at Dropbox. We could have taken that approach, but it felt like a lot of work. The product approach was our initial thinking.

What actually happened was we got a whole bunch of people in a room. It was me, two senior staff engineers who are some of the most senior product people at the company, who have been there a long time and know the product inside and out, and another really senior infrastructure engineer. We were walking through one example of one transaction and whether it would be safe in this horizontally sharded world. We talked for an hour and we couldn't definitively say whether it was safe or not. At that point, it was this aha moment of, if the four of us can't figure out whether this is safe, there is no way we can expect every engineer at Figma to reason about this on a daily basis.

We ended up going with a much more tactical approach, where we basically classified all our transactions as low risk, medium risk, and high risk. Low risk is like, no one's really going to notice if this breaks; a feature will look slightly glitchy for a second. High risk would be, it's a consistency problem or a security risk if this goes wrong. We really focused on the high-risk ones and made sure those all had good fixes. For low and medium risk, we did the math and realized this is probably going to happen once or twice a year at our scale. We'll just monitor, and we'll fix them as they come up. That's worked out well for us. I think we've seen one in the past 18 months.
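To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the partial commit failure Sammy describes. The shard handles, table names, and the org-move operation are illustrative assumptions, not Figma's actual code; the point is only that two per-shard transactions are each atomic on their own, but the pair is not.

```go
// Illustrative sketch only (hypothetical tables and IDs, not Figma's code):
// what a "partial commit failure" looks like when one logical operation
// spans two shards and there is no atomic cross-shard transaction.
package sharding

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumes a PostgreSQL driver
)

// moveOrg re-homes an org's rows that live on two different shards.
// Each shard's transaction is atomic by itself, but the pair is not.
func moveOrg(userShard, fileShard *sql.DB, orgID, newOrgID int64) error {
	txA, err := userShard.Begin()
	if err != nil {
		return err
	}
	txB, err := fileShard.Begin()
	if err != nil {
		txA.Rollback()
		return err
	}

	if _, err := txA.Exec(`UPDATE users SET org_id = $1 WHERE org_id = $2`, newOrgID, orgID); err != nil {
		txA.Rollback()
		txB.Rollback()
		return err
	}
	if _, err := txB.Exec(`UPDATE files SET org_id = $1 WHERE org_id = $2`, newOrgID, orgID); err != nil {
		txA.Rollback()
		txB.Rollback()
		return err
	}

	if err := txA.Commit(); err != nil {
		txB.Rollback()
		return err
	}
	// Danger zone: txA is already durable. If this second commit fails
	// (crash, failover, network partition), users moved but files did not:
	// a partial commit. Without two-phase commit you can only detect and
	// repair it, which is why classifying transactions by blast radius
	// (low/medium/high risk) is the pragmatic alternative.
	if err := txB.Commit(); err != nil {
		log.Printf("partial commit while moving org %d: %v", orgID, err)
		return err
	}
	return nil
}
```

In this framing, a high-risk transaction is one where the window between the two commits could corrupt something consistency- or security-critical; a low-risk one just leaves a briefly glitchy feature.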
[0:26:08] SF: How did you classify those different transactions into those buckets?

[0:26:11] SS: Yeah. We built this whole shadow framework as part of the horizontal sharding system, which has actually been really useful for product teams who want to analyze, if they're preparing a table for horizontal sharding, what kind of work they're going to need to do. The way this works is, our architecture at a very high level is, we have an application layer, and then we have this database proxy layer, which is a Go service that we actually built in-house. I'll talk more about that in a bit. Then we have our databases below that, with their connection pooling layers. Inside of this DBProxy service, we built out something called shadow reads and writes, which would basically act as if a table was already horizontally sharded, even though it wasn't. We would shadow read traffic and write traffic, and we would log what would have happened if the table were sharded under a variety of different shard keys, like user ID or file key. Then we could send all of that data off to Snowflake for offline analysis. We built out this really nice report that basically says, "Okay, you have five queries that are not compatible with horizontal sharding. You have 12 cross-shard transactions. You have three scatter-gathers, which are expensive; you need to audit these." Once you fix all these things, your table is ready to be sharded.

[0:27:29] SF: Okay. Then in terms of the shadow setup, were you creating, essentially, views on top of the tables to mimic the idea of going to different databases as if they were shards?

[0:27:40] SS: That part, confusingly, is another kind of shadowing that we did. We did create views; I'll talk about that in a second. But this specifically was never actually executing the queries. It was only at the query planner layer. We have a planner inside of DBProxy. It would essentially fork off a process that would asynchronously query plan under these different topologies and then send off that data. The queries are never actually executed. It's just logging what would have happened.

[0:28:07] SF: Then you're running that, essentially, in production against real traffic to see what would be the consequences if we actually set up the sharding for this particular query.

[0:28:17] SS: Exactly. Later on, when we were figuring out whether the queries were correct, we did what you were just describing, where we actually would shadow live traffic through views to act as if we were already sharded. That was once we'd picked a sharding key and done most of the work to prepare a table for sharding. This application readiness thing came very early in the process. It's a lot of what helped us pick the right sharding key, because you could compare. You could say, if we use this sharding key, the table is basically already shardable. But if we pick this other one, we're going to have to do a ton of work to make it ready to be sharded.

[0:28:48] SF: Yeah. That's a good way to have a pretty good sense of this project actually working. It's not like you're just going to say, "Hey, we spent all this time setting up the shards. Now, we're going to roll it to production." It's basically a way for you to test the whole thing end-to-end and have a really, really good sense of whether it's going to work or not.

[0:29:04] SS: Yeah. I think the other big advantage is we had to ask a lot of product teams to disrupt their roadmaps.
We tried to keep it to a minimum, but there was at least one team in particular at Figma that owned a lot of our early mission-critical tables, and they probably spent eight or nine months just partnering with us on this stuff, with at least one to two engineers. For them, it was helpful to have an estimate of, this is how much work it's going to be. Now, as we're going on to ask a wider set of teams to do horizontal sharding on new tables, we can say to them, "We expect this is only a week of work for your team. It's really easy." Or, "This is a deeper partnership. It's probably going to take a quarter for you guys to do this with us."

[0:29:42] SF: Can you walk me through the process of what happens when I execute a query? I want to write a query, and essentially, how does that query get processed? Then how does it, essentially, end up pulling data from one or multiple shards?

[0:29:55] SS: Yeah. It starts at the application layer. Figma isn't microservices, but it's also not a pure monolith. We have, I'd say, about 10 different services that are talking to our database layer. A lot of these power that real-time collaborative stuff I was talking about. LiveGraph is a Go service we have that powers our commenting systems, and it lets you basically live-tail changes and subscribe to changes at the UI layer, so that they get propagated in real time up to you. Originally, when we were vertically partitioning, which was our first step before horizontal sharding, all of those clients would have to know about the databases, and they would have to know to change and talk to a new database. That was a lot of coordination and pretty painful.

As part of this process, we ended up building out an intermediate layer, which we called DBProxy. Basically, you have the application layer up top, and all 10 different services talk to DBProxy through a gRPC client. Then DBProxy is a Go service that does a whole lot of different things, but it can be thought of at least partly as a query engine. It knows how to parse a query, understand what is happening in that query and what tables and shard keys are involved, which then tells it where to route the query down to the different databases that lie below.

[0:31:19] SF: How does it know which part of the query is a shard key? Is it looking that up, or do you have to keep, essentially, a map of what the shard keys are and which tables are sharded?

[0:31:28] SS: Yeah, exactly. That was an early challenge for us, because initially we had this config file, before we horizontally sharded anything, that just mapped from each table to the physical database shard where that table lived. It was causing us a lot of struggles, because it would get out of sync, or people would forget to update it. It was probably the number one source of toil on my team for a while. When we horizontally sharded, that was going to get a lot worse, because shard splits would happen all the time. We decided to build this service called Topology, which is basically a config-serving service that understands the state of the databases: both the physical topology of what databases exist, and the logical topology of what shards exist, what tables exist, and where these tables belong. Because we came up with this idea of what we call a colo. A colo is short for a colocation, which is basically a set of tables that all share the same shard key. All the tables related to user data share user ID, or all the tables related to file data share file ID.
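As a rough sketch of that routing idea (table to colo, colo to shard key, shard key value to a logical shard, logical shard to a physical database), here is a hypothetical Go fragment. The type names, the hash choice, and the index-based logical-to-physical mapping are illustrative assumptions, not the actual Topology or DBProxy implementation.

```go
// Hypothetical sketch of topology-based routing: every table belongs to a
// colo (a group of tables sharing a shard key); a shard key value hashes
// to a logical shard, which maps to a physical database.
package topology

import (
	"fmt"
	"hash/fnv"
)

type Colo struct {
	Name          string
	ShardKey      string   // e.g. "user_id" or "file_key"
	LogicalShards uint32   // fixed number of logical shards
	// Physical has one entry per logical shard; many logical shards may
	// point at the same physical database.
	Physical []string
}

type Topology struct {
	TableToColo map[string]*Colo
}

// Route returns the physical database that owns a given shard key value
// for a table. A shard split only changes the logical-to-physical mapping;
// the application-visible logical sharding stays the same.
func (t *Topology) Route(table, shardKeyValue string) (string, error) {
	colo, ok := t.TableToColo[table]
	if !ok {
		return "", fmt.Errorf("table %q is not in the topology", table)
	}
	h := fnv.New32a()
	h.Write([]byte(shardKeyValue))
	logical := h.Sum32() % colo.LogicalShards
	return colo.Physical[logical], nil
}
```

The fact that many logical shards can share one physical entry is the same property Sammy describes later, when a table can be treated as "logically sharded" before any data has physically moved.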
That's all contained in the topology.

[0:32:38] SF: Then for each shard, are there, essentially, read replicas as well as a writer there, so you're also able to share the load from a read-write perspective?

[0:32:48] SS: Yeah. I think that's been a really important part of our scaling, because as we scale more and more, one thing we've really done is push people towards having most reads come from replicas by default. That is huge from a reliability perspective, because if one of our primaries goes down, we're still serving reads. Two, we have two read replicas for every primary, and we send requests to both of those on every request. If one of those databases is down, we can actually still serve traffic. Over the last year, actually, we had a big effort to rewrite all of our P0, most critical routes to only talk to replicas: they can't do writes and they can't talk to primaries. This was a pretty big effort from a whole bunch of heroic folks here at Figma. As a result, if a primary goes down today, we can continue to serve all the most critical traffic at Figma.

[0:33:41] SF: Then in terms of performance, because of this slightly, or maybe significantly, more complex setup, where you have to do this query planning and figure out which shard to go to, is there any performance hit? Or from a cost perspective, is the infrastructure cost higher than using a simpler setup?

[0:34:01] SS: Yeah. Performance was definitely one of the early bottlenecks for DBProxy. What we found is that the most expensive step was the initial query parsing. The client sends a SQL string in. It says, select star from table where XYZ. Then we need to turn this into an AST, or abstract syntax tree. That's a graph representation of the query that we can work with in code to understand what's happening. That was the really expensive part of our process. We ended up adding an in-memory query cache inside of DBProxy. We generally have a much smaller set of query shapes, because all our queries, for security reasons, are parameterized. Even if the user ID in a query changes, what's being sent to DBProxy doesn't. We can reuse the same query plans over and over again. That was a huge performance win for us.

[0:34:52] SF: Is the sharding setup elastic, where if you outgrow, essentially, the number of shards available, you're automatically adding a new one for growing beyond where you are now?

[0:35:06] SS: That would definitely be our north star. It's not where we're at today. I think we've made a lot of really good progress over the last year in making a failover operation much more seamless. Even today, it's still maybe two weeks of work to do a sharding operation. Any time you're having to move data from one database to another, you have consistency versus availability trade-offs. You try to make the system as robust as possible to both, but with the CAP theorem, you're never going to win. There's always some slight risk that you hit an unknown unknown and you end up getting stuck in a state where you're halfway failed over, not serving write traffic, and you don't know how to recover. We spent a lot of time DRT'ing that and coming up with all these playbooks, so we had worst-case scenarios covered ahead of time, which has really paid off for us. One time, actually a couple of weeks ago, we ran into an unknown unknown, and my teammates got the site back within two minutes. They were like, "Okay, this failed unexpectedly." They ran four steps.
A second later, the site was back up, which was really cool to see. You can imagine that having that happen in the background with no humans involved is still scary. Hopefully, we'll get there in the next year or two. Right now, we're mostly over-allocating, so that when we do a shard split, we expect it to get us six to 12 months of growth. To your point, we certainly are paying more, but that's a trade-off we're very happy to make, because reliability and scale still matter more to us as a company than focusing on cost right now.

[0:36:37] SF: One of the other requirements that you mentioned at the beginning was lowering the impact that this would have from a developer experience perspective. What impact did this actually have on the application developer? Do they need to understand what's happening at the database level? Or is it all transparent and handled for them?

[0:36:55] SS: Yeah. Well, any reshard operation that happens is fully transparent. What we wanted to have is, basically, a one-time cost to sharding any table. Some work will be needed from the application team to understand where the bottlenecks are and fix things that aren't supported, because we can't support 100% of functionality in a horizontally sharded world. It's never going to be no work. But you should pay that cost once. From there, it should just work. The database team should be the one scaling out behind the scenes and making sure that all of that happens transparently to users. They don't notice if we double or quadruple our capacity. From that perspective, it's quite seamless.

We also made some changes to our APIs to make it more obvious when people are doing things that are more expensive. There's this kind of query called a scatter-gather. In a SQL query, if you include the shard key, we know exactly where to route it. We can parse that, and we say, okay, the shard key is user ID, and the user ID is 10, and 10 belongs on shard 5. We know exactly where to route that query. But if you don't include the sharding key, we actually still wanted to support quite a few of those queries, because we wanted to minimize rewriting. In that case, what we do is scatter the query out, where we send it to all the database backends, because we don't know where that data is. Then we gather up the results at the DBProxy layer, and DBProxy aggregates them and sends them back to the user. You can imagine, that gets quite complex if you're doing a cross-shard join, and now you're implementing almost a PostgreSQL query engine, which we did not want to do. We don't support full SQL, but we support about 90-plus percent of the queries that our users are writing. That's probably the most visible thing: if someone tries to write a really complex query, they may have a test that fails, and that test will say, this query is no longer compatible.

[0:38:50] SF: In the case where someone is writing a query, or there's an existing query, that doesn't contain the shard key, and you're doing this scatter method and aggregating results, are you notifying, or logging, that information somewhere so that someone can go and essentially figure out a way to improve the performance by including a shard key?

[0:39:10] SS: Yeah, definitely. The great thing about DBProxy is it lets us have a lot of observability into what queries are flowing through our system. I think that's been another huge win from controlling that proxy layer. We also have PgBouncer per database below that, but PgBouncer doesn't have a lot of observability by default.
We add a lot of metrics at DBProxy, which we then expose back to our users. That's one way. Also, we have actually updated some of our APIs, although this is still a bit of a work in progress, so that it shows up as a scatter-gather. You basically have to say, this is an allowed scatter-gather; I know that I'm doing this. We're still playing around with, is that useful, or do people just ignore it? What's the ultimate answer? How do we push people to just include the shard key by default, without thinking about it? I think that's a big unsolved problem that my team still wants to explore.

[0:40:02] SF: Once you got everything ready, how did you go about actually migrating the production data over to the horizontally sharded setup? And were you able to do that without any downtime?

[0:40:15] SS: Yeah. First, as we talked about, we were shadowing read traffic for quite a while beforehand. We did that for about three months before we started serving any live production reads. Then, before we even did a physical failover, one of the cool things about how we implemented sharding is that we separated out the logical sharding layer from the physical sharding layer. What that means is that multiple logical shards can exist on the exact same physical shard, whereas separate physical shards are always separate databases. We can, in essence, logically shard and pretend our data is sharded, so that from the application perspective, it looks like everything is sharded, but we haven't moved any data around. All we've done is really flip a feature flag to maybe a 1% rollout. That made it a lot less risky to do this rollout, because we could roll out to 1%, or 10%. Then if we had problems, we would just roll back.

We did actually have problems at one point. We rolled out to 1% and it was fine. We rolled out to 100% and it was also fine. Five minutes later, all of a sudden, we looked up and CPU was climbing really high. We immediately rolled back and investigated. We found out that we had basically run into some livelock connection pooling interactions from having four times the load on a single physical database. Because we were acting as if it was sharded, all those scatter-gathers were being sent four times instead of once, and that led to some weird behavior. I think it was really cool that we actually caught a bug and immediately rolled back and it didn't cause downtime, which was great to see. That's the logical layer. Then the physical failover only happens once we've been running in a logically failed-over state for a month or two, and we're really confident everything is working as expected.

[0:42:05] SF: What's the typical bottleneck? You mentioned a couple of times CPU getting too high. Is it generally going to be CPU, or is it memory, or something else? Or does it depend, essentially, on the types of queries and operations that you're running?

[0:42:20] SS: Yeah. It's been a bit of a moving target for sure. Early on, when we just had one database, CPU was the limit. For probably two or three years, all we were afraid of was CPU on our one biggest database, because it just kept growing 3x every year. Then as we started getting ahead on that, we started running into different bottlenecks. Last year was all about the write rate; our write rate and our IOPS were actually starting to head towards the limits of what Amazon could support.
Even before we hit a hard limit, we were actually hitting soft limits, where there were reliability issues and more frequent outages, because we were pushing the limits of what people typically do on RDS. We were trying to stay more conservative: let's not max out PostgreSQL, let's not max out RDS, because we think there's a bigger risk of SEVs if we run above a certain write rate. That was last year. This year is actually about our biggest tables, because now we have tables that are 16-plus terabytes. Those run into issues with vacuuming, because PostgreSQL (kind of an odd design choice, and one thing MySQL did a lot better) has an integer transaction ID, which means eventually, you're going to run out of transaction IDs. If that happens, your whole database stops. The way you prevent that is you have to vacuum really regularly. Vacuuming is not just a for-fun cleanup operation to save space. It's an essential must-do, or your database dies. Vacuuming on large tables gets really expensive. Those CPU spikes I mentioned earlier, those were our largest tables hitting vacuum limits. That's why we now want to shard our biggest tables as well.

[0:43:55] SF: Yeah. You talked a little bit about MySQL there. Obviously, there are some pros and cons to MySQL versus PostgreSQL, but which do you prefer working with, especially at scale, and why?

[0:44:09] SS: Yeah. I worked on a petabyte-scale sharded MySQL system at Dropbox, and now I've worked on PostgreSQL at Figma for two years, so I definitely have some hot takes. Obviously, both have their pros and cons. I think we ran into some pretty interesting consistency problems with MySQL at Dropbox. Dropbox cared a lot about consistency, more than probably almost any company out there, because they cared so much about not corrupting data. In the project I was working on, if we found a consistency issue on a single row, we would spend six to 12 weeks trying to debug it. In the process of that, we actually found multiple bugs in the MySQL source code where MySQL was not returning consistent results. You would send a query and you would get back a result that was not the correct result, using read committed, or whatever consistency level you're relying on. One crazy thing was a bug in descending scans, where the implementation didn't have the same bug in ascending scans. The way we fixed it was flipping our entire data representation front to back, so that our most common operation, which was a descending scan, turned into an ascending scan, and then the issue just went away. I have some battle scars from that.

I also think one thing PostgreSQL does really well is that it's actually very backwards compatible from version to version. In MySQL, often a major version upgrade can be a two-year-long project. Even a minor version upgrade can be scary. Whereas with PostgreSQL, for minor version upgrades, we're like, it's going to work; we're not even slightly worried. Major version upgrades, we can do those in a month, and mostly it's just going to work, because the open-source folks are really thoughtful about making sure things don't change. I'm a little more on the PostgreSQL side, although maybe that's just because that's what I work on today. I mean, certainly, MySQL has some nice features too that PostgreSQL lacks. That transaction ID one I just complained about is probably the biggest.
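Since transaction ID wraparound comes up here: the risk Sammy describes is visible in any PostgreSQL instance through the built-in age(datfrozenxid) value in the pg_database catalog. The sketch below is a hypothetical Go monitoring snippet (the connection string and warning threshold are placeholders), not anything from Figma's actual tooling.

```go
// Hypothetical sketch: watch each database's transaction ID age, which
// vacuum must keep well below the ~2.1 billion wraparound hard limit.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumes a PostgreSQL driver
)

func main() {
	// Placeholder DSN; point this at the instance you want to watch.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// age(datfrozenxid) is standard PostgreSQL: how far each database's
	// oldest unfrozen transaction ID lags behind the current XID.
	rows, err := db.Query(`SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	const warnAt = 1_000_000_000 // placeholder threshold, well short of the hard limit
	for rows.Next() {
		var name string
		var xidAge int64
		if err := rows.Scan(&name, &xidAge); err != nil {
			log.Fatal(err)
		}
		if xidAge > warnAt {
			fmt.Printf("WARNING: %s xid age %d -- vacuum is falling behind\n", name, xidAge)
		} else {
			fmt.Printf("%s xid age %d\n", name, xidAge)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

In normal operation autovacuum keeps these ages far from the limit; the catch Sammy points out is that on 16-plus terabyte tables the vacuuming needed to do so is itself expensive.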
[0:46:10] SF: Would there have been value in, instead of using RDS directly, running PostgreSQL yourself? Or not necessarily yourself, but essentially an installed instance of it, so you could take advantage of some of the extension support that it has, and more directly extend it or modify it?

[0:46:30] SS: Yeah. I mean, I think [SiTest; inaudible 0:46:31], I've heard really good things about that software. It was certainly a route we could have taken. I think at the time, because we had a team of five people, and most of them didn't even know how databases worked at the time, it would have been a really big lift to manage our own databases. At Dropbox, all of our databases were in-house, all of our hardware was in-house, and we owned our own data centers. Coming from that background, I really would not underestimate the cost of having to manage your own databases. You can do it, but you're now managing: is the database healthy? How do I promote it? How do I make sure there's really low downtime? I think being on a managed service let us operate with a much leaner team than we would have needed to manage our own databases.

Also, we've developed a really good relationship with the AWS folks. We're actually on good speaking terms with members of the RDS team. I think at one point, they gave us their phone number, because they were like, "You had a hard time reaching us during an issue. We're just going to give this to you, so next time you can text us if it's a real problem." That relationship building is also a big reason why we've chosen to stay, because we'd have to build that from scratch with a new company.

[0:47:40] SF: Yeah. I'm sure if you're storing petabytes of data with RDS, then the tier of custom support that you're getting from them is maybe a little different and a little more bespoke than what the average customer gets. As we start to wrap up, what advice would you give to other tech teams out there that are facing similar scaling issues with databases and are trying to figure out, how do we actually fundamentally solve this problem? Where do we even get started?

[0:48:05] SS: Yeah. Well, I think my first piece of advice would be, don't just copy what we did. Think very carefully about what problem you're actually trying to solve. Also, think really carefully about building in-house versus buying. I know that sounds a little bit hypocritical, considering that's what Figma did, but we were really, really aware when we were setting off down this path that it was pretty ambitious, and it was very much because we were under tight time constraints. I think that if we were starting the project today, with the experienced team we have and with the amount of headroom we have today, we would probably use a managed solution. I think it's really important to understand your own problem area and understand: what are your bottlenecks? What are your workloads like? What are the biggest risks for you? What expertise does your team have? All of that is going to matter a lot more than just copying our approach. It worked for us, but it's not going to work for everyone.

[0:49:01] SF: Yeah. Then, I guess, what's the future of this project? You're going to continue, I assume, to choose other tables to shard, but what is the natural evolution of this beyond just rolling out horizontal sharding across the entire database?

[0:49:16] SS: Yeah. I think my team has a lot of cool challenges coming up.
Certainly, horizontal sharding will still be a big part of our next, probably, two years. I think that's how long it will take to do every table at Figma. Because of the incremental approach we took, so far we haven't even tackled schema migrations. We've only migrated tables where the schema basically never changes. In the next six months, we need to build out schema changes that are compatible with horizontal sharding. Similarly, we've gotten away with not having cross-shard transactions, but that probably won't last forever; we'll have to build that. I think that will be a big part of the next two years.

But also, it's really exciting that we're moving on from horizontal sharding and from these scale problems. We're getting to do things like actually tackle really important reliability risks for the company and build a much more robust future, or power a lot of new revenue-generating features. For some of the things we're launching at Config, my team has been really involved in helping with pretty critical database-layer stuff that will unblock them. Also, the future of our APIs. I mentioned we've done some pretty tactical things to make it easier for folks to understand what horizontal sharding is. One thing we're thinking about is, should we build an entirely new API layer, a new ORM layer? What would that look like? Do we migrate the whole company? Do we just have new use cases use it? I think there are a lot of fun unknown unknowns, which is a big part of why I think databases are always throwing another problem at you. That's one of the things I love about being on this team.

[0:50:47] SF: Yeah, that's awesome. I mean, I think now that you're no longer fighting these fires day to day, you've earned the right, in some fashion, to solve some of these other problems. You had to, essentially, figure out how to solve this so that you can move on to some of these other challenges, like reliability, and also make investments that go beyond where you are as a company right now.

[0:51:09] SS: Exactly. I think as a tech lead, a big part of my job has been leading horizontal sharding for the last few years. But it's also things like being able to think about the long-term vision. Our two-to-three-year roadmap is something that I now get to build out. It's making sure that at any moment in time, we're actually prioritizing building the right things, at the right level of quality, with the right designs, and also helping grow all the engineers on my team. Because I really want to leave Figma better than I found it. Not that I'm leaving anytime soon; I'm going to be here a long time. I want to build the next generation of leaders.

When I joined the team, there was this engineer, Dylan, who had been leading the team for two years. He kept the team together through all the really hard times. He looked around as this horizontal sharding was happening and saw that I had a potential opportunity to lead it. If he stayed on the team, he would probably just lead it himself; it would be easy for him, and I wouldn't have a chance to grow. He ended up deciding to leave the team. He's still at the company, doing really cool things. I always thought it was really wonderful that he looked around and said, "I have a chance to help someone else grow." That's what I really want to do for our team: grow the next generation of leaders at Figma.

[0:52:18] SF: Yeah, that's awesome.
I mean, if you don't put responsibility in someone else's hands and give them a chance to succeed or fail, then you're never ever building, essentially, the next generation of leaders.

[0:52:28] SS: Exactly.

[0:52:29] SF: Well, Sammy, I want to thank you so much for being here. Also, for anyone that enjoyed this, definitely check out Sammy's awesome blog post, which is titled How Figma's Databases Team Lived to Tell the Scale. It's really great, with lots of visuals explaining how you did all this stuff, and it goes into even more detail than we were able to get into today. We'll link that in the show notes. Thank you again for being here, and cheers.

[0:52:50] SS: Yeah, thank you so much for having me, Sean. Hope you all enjoyed it today. My team is hiring, so if you're excited about these problems, please shoot me a note.

[0:52:59] SF: Awesome. Thank you.

[END]