EPISODE 1591

[INTRODUCTION]

[0:00:01] ANNOUNCER: When Adam Berger was at Uber, his team was responsible for ensuring that Uber Eats Merchants correctly receive and fulfill orders. This required them to think hard about engineering workflows and state management systems. 

Six years of experience at Uber motivated Adam to create StateBacked, which is an open-source back-end system written in TypeScript. The platform is oriented around using state machines to model application logic and automatically handles the associated persistence, infrastructure and consistency. 

Adam joins the show to talk about state machines, why they're the right paradigm for managing global application state, and the practical advantages of using state machines and a back-end platform. 

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. 

Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[INTERVIEW]

[0:01:30] LA: Adam Berger is the founder of StateBacked and he's my guest today. Adam, welcome to Software Engineering Daily.

[0:01:37] AB: Thank you so much, Lee. It's really great to be here. And I'm excited for our conversation.

[0:01:41] LA: Great. I've been looking forward to this one. This should be a lot of fun, I hope. In its basic form, StateBacked is a database. I know that definition really doesn't do it justice. It's a whole lot more than that. But what's a better definition? 

[0:01:56] AB: Yeah. It's kind of interesting how we got to the point of thinking about StateBacked as a database. It might be helpful to kind of talk about how we ended up there and give a better definition. We started off – I guess, first, what we're doing is we are running state machines on the back-end and we're making them globally available, accessible via API. You can subscribe to real-time updates for each instance of your state machine. 

And when we really started this, we thought about state machines in a very process-oriented way. We were thinking about kind of orchestrating a set of steps that you wanted to take within your back-end. 

And as we built more and more systems on top of that, we realized that, in a lot of cases, we weren't thinking about these state machines that way. We were thinking about these state machines as representing the entities in our systems. We were thinking about this machine not as much as a workflow, as a user entity, for example. Or this machine as a document entity. Or this machine as an organization entity. 

And that's really when we started shifting our perspective more from we still kind of support this idea of workflows and that's a big use case. But in a lot of instances, we think about our state machines as analogous to a table in a data store. And each instance of these state machines is really analogous to a record in your data store. 

Kind of the interesting thing about thinking about it that way is, in a typical data store, you have a lot of different facilities to define what the static shape of your data should look like. You can say I want to have these different columns or these attributes. And here are the types that they should have. And we should have foreign keys in this way. 

But when you start thinking about a state machine or an instance of a state machine as a record in your database, you add in the ability to really understand how your data evolves over time. And that's been kind of the exciting thing for us and thinking about it as more of a data store.
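The table-and-record analogy can be sketched in a few lines of TypeScript. This is only an illustration, not StateBacked's API: the ticket machine, the in-memory `Map`, and every name below are hypothetical. The machine definition plays the role of the schema, each named instance plays the role of a record, and the definition also says how a record is allowed to change over time.

```typescript
type TicketState = "open" | "inProgress" | "closed";
type TicketEventType = "START" | "RESOLVE" | "REOPEN";

// The machine definition is like a table schema, except that it also
// declares which changes are legal from each state.
const ticketArrows: Record<TicketState, Partial<Record<TicketEventType, TicketState>>> = {
  open: { START: "inProgress" },
  inProgress: { RESOLVE: "closed" },
  closed: { REOPEN: "open" },
};

// Each instance is like a record, keyed by its name.
// (An in-memory Map stands in for durable storage in this sketch.)
const ticketInstances = new Map<string, TicketState>();

function createTicket(instanceName: string): void {
  ticketInstances.set(instanceName, "open");
}

function sendToTicket(instanceName: string, eventType: TicketEventType): TicketState {
  const current = ticketInstances.get(instanceName);
  if (current === undefined) throw new Error(`Unknown instance: ${instanceName}`);
  // Events with no arrow from the current state leave the record unchanged.
  const next = ticketArrows[current][eventType] ?? current;
  ticketInstances.set(instanceName, next);
  return next;
}

createTicket("ticket-1");
createTicket("ticket-2");
sendToTicket("ticket-1", "START");
sendToTicket("ticket-2", "RESOLVE"); // Not allowed from "open", so nothing changes.
```

A plain table could store the same `TicketState` column, but nothing in its schema would stop a row from jumping straight from "open" to "closed".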

[0:04:03] LA: That's a great way to put it. And that actually helps a lot with some of the questions I've had after I looked at your site and figured out a little bit about how I would use a product like yours. And let me see if I can wrap my head around this a little bit more just to make sure. Your customer comes to you. They have an application they're building. They build the front-end code, say, JavaScript for the web or whatever for iOS, Android. Some mobile application. Whatever it is. They build the front-end and they completely own that. 

And the idea is your StateBacked provides the entire back-end for the application. That means not only state storage, but data storage. All data storage. State transitions provides a connection to the front-end. It does all the communication with the front-end. You talked about notifying and change and those sorts of things. All of that capability provides, I'm assuming – we haven't talked yet about it. But I'm assuming security, and those sorts of connections and all those sorts of things.

What else does it provide? Typically, when you think of the back-end of an application, there is real logic that does go on. You can do it all in the front-end. But that's often not the most efficient place to do it. You want to do in the front-end what the front end is good for. And you want to do in the back-end what the back-end is good for. 

And I think for risk of getting on a little soapbox here, I think a lot of engineers focus on doing too much in the front-end or too much in the back-end depending on whether they're a front-end engineer or a back-end engineer. But that aside, most applications have both. And so, you are trying to completely replace the back-end with your capability. Is that correct? 

[0:05:47] AB: That's right. We certainly can replace kind of the full back-end. And if you're starting an application from scratch, I think that we can replace a large percentage of what you would typically be building on the back-end and do it in a way that allows your code to be really easy to understand and evolve over time. 

But you don't necessarily need to build the entire application or the entire back-end for your application in StateBacked. There is kind of a nice way to incrementally adopt StateBacked for some parts of your application where it makes sense if you have an existing app.

[0:06:18] LA: Great. Can you go into that in a little bit more detail? How exactly does that work? 

[0:06:23] AB: I think one thing that's kind of interesting about a lot of back-ends as a service is they tend to be bundled with authentication, right? It tends to be you use us for authentication and then we're able to authorize your requests and do a lot of the work on the back-end for you. 

We decided we wanted to take a bit of a different approach. We wanted to disaggregate authentication and the rest of the back-end that we were providing. We allow you to keep whatever identity provider you have today. Authenticate however you'd like. And then we allow, essentially, token exchange. You get your identity token. You can exchange that for a StateBacked token. And then we're able to do full authentication, full authorization based on your existing identity provider. 

What that means is if you have an application that's built using one of the existing identity providers and you want to build a feature that you think would be a good fit for StateBacked, you can essentially build one or a few different state machines. And these state machines can communicate with each other to encode whatever application logic you want. And you can authorize access to these different state machines based on who the user is. And you can deploy one feature if you want to StateBacked. You can deploy these back-end machines and interact with them directly from your front-end within your existing application. 

[0:07:43] LA: And you can interact with it from the back-end as well, right? 

[0:07:46] AB: Oh, yeah. Definitely. 

[0:07:47] LA: You still have authenticated to the front-end user mechanism because it's simply a token exchange at this point. You can use that token on the back-end and communicate and talk back-end to back-end as well.

[0:08:00] AB: Exactly. That's right.

[0:08:01] LA: Would you call that the primary use case that people are using your product for today where it's kind of an incremental to their existing back-end? Or is it the back-end replacement? And the related question is where do you want people to get to? 

[0:08:16] AB: For sure. What we're seeing right now is it's mostly incremental adoption, right? Because I think, just by nature, there are more applications that exist who are looking to add features than there are kind of new applications that are being built from scratch. 

We're seeing a lot of cases of people trying this out on a particular feature. Though we are seeing some folks who have heard about StateBacked and got excited about it and just happen to have a new application that they're starting up. Whether it's a multiplayer game. Or we've seen a couple different things in that area. A few different kind of workflow-oriented applications that are starting to come up that are using StateBacked as kind of a more of their core back-end for everything. 

We're happy to be a part of projects kind of however it makes sense. I think that there are – I would have a hard time saying someone should rewrite everything that they have right away to use StateBacked. But I do think if you're building a new feature, we want to make sure that we can support your exploration and ability to try StateBacked in a really convenient and natural way.

[0:09:20] LA: Let's talk about types of applications that would work really well in StateBacked. Now, like you say, you can completely replace a back-end. You could write almost any application using almost any type of database including what you guys have. But let's talk about the focus and what you're good at. 

I can easily see the use case with video games where the state management and the workflow logic that's typical in a multiplayer online video game would work very well in StateBacked. I can also see basic workflow apps like you see very, very commonly in large enterprise apps. That would work very well in that environment. And I can point to some government apps in particular that I wish you would rewrite for us. But that's a whole other situation. Can you go into some more detail in the types of applications? What types of workflow apps work well for you? What other types of applications would be ideally suited for – not that they would work in StateBacked, but are ideally situated, ideally use the strength of what you provide? 

[0:10:27] AB: Sure. I think there are a couple different categories that kind of come to mind. I think one is a lot of online transaction processing works really well in this model. That's your typical e-commerce site or your typical SaaS product where you have a couple of different entities or potentially many different entities. And each one of those entities is kind of small or medium-sized. 

You have things like users, and documents, and processes, and workflows and policies. That type of thing. And for those things, we're able to represent those entities as state machines that can communicate with each other. It just ends up being a pretty natural way to write those types of applications and to be able to really look at it and understand how everything's working and kind of get your head around how the system works well. 

I think another type of application that works well within StateBacked is any type of long-running workflow, when you have short steps – for example, if you have some processing job that takes 3 days to process just a single step, that might not be the best fit for StateBacked today, right? I think that's more of a – there are other big data platforms where that works well. 

For us, we're really focused on a workflow that might take days, or weeks, or years to complete. But each step is a shorter duration where you might be reaching out to one service that you integrate with and taking the data that comes back from that and storing it in your data store and then reaching out to a series of other external parties. That type of flow, where you want to make that really reliable, tends to work very well on our kind of model. 

[0:12:12] LA: Anytime when the request processing time for an external request takes a significant period of time would be a good use case. Whether that's – it doesn't return right away. You have to wait for an asynchronous response. And therefore, flow through a state machine. Or if the external process has a human involved would be another good example. Any of those sorts of cases would be a great use case for a state machine like what you provide.

[0:12:39] AB: Definitely.

[0:12:41] LA: I'm thinking like fulfillment centers. Perfect example. You've talked about e-commerce, order processing. Obviously, the order processing side, but also the fulfillment side works very well with this. Credit card processing companies. You can imagine how they would take advantage of those sorts of capabilities. Because a lot of the workflows there often are very fast and straightforward but have exceptions. And those exceptions can take hours, or days, or longer even. And that works very well in a typical workflow like what you're talking about here. 

Let's talk about – we've been using the term, well, state transitions, state machines. I'm not sure if we actually used the term state machines yet. But let's talk about state machines in particular. For our listeners, let's define exactly what we mean by a state machine. What is a state machine and why philosophically is a state machine important? Kind of a step back here.

[0:13:37] AB: Definitely. And, yeah, it's super important to kind of contextualize that. I think the first thing is, when we talk about state machines, technically, we're really talking about what's called state charts. And the idea with a state chart is states can be nested. And you might be in multiple states at one time. You can have what's called parallel or orthogonal states. 

When we kind of define a state machine, a state machine consists of three things. You've got your states. And you can imagine those as boxes in a diagram. And that identifies where your system is at a particular point in time. And then you have your transitions between states. And you can imagine those as arrows between those boxes. And those transitions happen in response to an event that comes in that the state machine processes. And they might have conditions that say, "Hey, if this is true, don't execute this transition. Or if this is true, do execute this transition." So, they can be guarded by those conditions.

And then, finally, when we talk about state machines, we want to make sure that we have some piece of arbitrary data that that state machine owns. And when you have that, that allows you to represent any type of computation that you want, any type of system within the state machine. Not just things that can be represented with a finite number of states. So, now you're able to represent really any class of system with a state machine that looks like that. 
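The three ingredients described here (states, guarded transitions, and owned data) can be written down as a small, dependency-free TypeScript sketch. The subscription machine, its event names, and the `subStep` helper are hypothetical and chosen only for illustration. This is not XState or StateBacked code, and it leaves out the nested and parallel states that state charts add.

```typescript
// States are the boxes in the diagram.
type SubState = "freeTrial" | "active" | "cancelled";

// Events are what trigger the arrows.
type SubEvent =
  | { type: "UPGRADE"; cardOnFile: boolean }
  | { type: "CANCEL" };

// Context is the arbitrary data the machine owns.
interface SubContext {
  upgradeAttempts: number;
}

interface SubSnapshot {
  state: SubState;
  context: SubContext;
}

interface SubTransition {
  target: SubState;
  // Guard: the transition is taken only when this returns true.
  guard?: (ctx: SubContext, ev: SubEvent) => boolean;
  // Pure update of the owned data.
  update?: (ctx: SubContext, ev: SubEvent) => SubContext;
}

const subTransitions: Record<SubState, Partial<Record<SubEvent["type"], SubTransition[]>>> = {
  freeTrial: {
    UPGRADE: [
      {
        target: "active",
        guard: (_ctx, ev) => ev.type === "UPGRADE" && ev.cardOnFile,
        update: (ctx) => ({ upgradeAttempts: ctx.upgradeAttempts + 1 }),
      },
      {
        // Fallback arrow: stay in the trial but record the attempt.
        target: "freeTrial",
        update: (ctx) => ({ upgradeAttempts: ctx.upgradeAttempts + 1 }),
      },
    ],
    CANCEL: [{ target: "cancelled" }],
  },
  active: { CANCEL: [{ target: "cancelled" }] },
  cancelled: {},
};

function subStep(snapshot: SubSnapshot, ev: SubEvent): SubSnapshot {
  const candidates = subTransitions[snapshot.state][ev.type] ?? [];
  // The first arrow whose guard passes (or that has no guard) wins.
  const chosen = candidates.find((t) => !t.guard || t.guard(snapshot.context, ev));
  if (!chosen) return snapshot; // No valid arrow: the event is ignored.
  return {
    state: chosen.target,
    context: chosen.update ? chosen.update(snapshot.context, ev) : snapshot.context,
  };
}

const subStart: SubSnapshot = { state: "freeTrial", context: { upgradeAttempts: 0 } };
const subDeclined = subStep(subStart, { type: "UPGRADE", cardOnFile: false });
const subUpgraded = subStep(subDeclined, { type: "UPGRADE", cardOnFile: true });
```

Here the first `UPGRADE` fails its guard and leaves the machine in `freeTrial`, while the second moves it to `active`. The `upgradeAttempts` counter shows why owned data matters: it can grow without bound, which a finite set of named states alone could not express.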

I think the big question is, like, "So what?" Right? Why is that a good way to structure your code? And I think that the key thing in my mind is really understandability and readability. In my experience, and I'm sure in most folks' experience, if you have – when you're building a system, when you're building a product, maybe you have some English description or some diagrams that define here are the requirements or here's at a high-level kind of how the product should work. 

And then you have your code. And in most cases, the code is so detailed with so many kind of additional complexities just dealing with making sure that the code itself is structured properly. That it's really hard to get a sense of, at a high level, what is this thing encoding? How is the product supposed to work just from looking at the code? 

And when you look at the product requirement document or you look at some of these diagrams, they leave out too much detail to really understand at a sufficient level of detail how the product works. And when you start to build a state machine, you can show that to someone who's non-technical and they'll look at it and really understand the behavior that that state machine encodes. 

And it really helps us to be able to look at our code structured this way. And it ends up being a really nice kind of intermediate language for how your product works that a lot of people can understand. Not just engineers. But for engineers, it really helps you think through all the different ways that your logic might cause your data and processes to evolve. 

[0:16:32] LA: It's a way of documenting business logic in a way that is both consistent with the implementation and hard to get deviation from the implementation. Because the diagram is part of the implementation, so to speak. Virtually anyway. But it also provides the business logic in a format that non-developers are able to understand. Because everyone understands a flowchart, basically. That's what we're talking about here.

[0:16:59] AB: That's right. Executable documentation. 

[0:17:02] LA: Exactly. Yeah. That's one big advantage of state machines. I know that there's other advantages. One of them that is kind of near and dear to my heart is the ability to scale, right? It's very easy – since a state machine represents a given entity and where it is in the flow, and you can have multiple independent entities at different spots within the state machine all working independently, it becomes a very easy process to be able to scale that from having 10 people in the state machine concurrently to a million people in the state machine concurrently or to 100 million people in the state machine concurrently. And it works exactly the same without necessarily huge scaling issues behind it. 

Because, typically, the state isn't very large. Scaling up the number of users doesn't involve scaling up large quantities of data, at least within this business logic itself. As well as the execution, since most of it is asynchronous, it's not highly tightly timed. And so, if you get sudden spikes of number of people using the state machines, there might be a little delay in getting things done. But it doesn't have a direct impact as much as it would in a traditional programming environment where you keep control even when you're not doing something, so to speak. Your resources are allocated more often. 

And maybe a good way to describe it is you only allocate resources when the state machine is doing something. And when it's not doing something currently, which is the vast majority of the time, it uses very, very little resources and such. As such, you can scale it very easily? 

[0:18:39] AB: That's right. 

[0:18:40] LA: You agree with that. Good. Good. What other advantages are there with state machines? 

[0:18:46] AB: Sure. I think an interesting one, kind of playing off of what you just mentioned, is instances of the state machine are inherently isolated, right? Which means that you have this nice failure domain. You don't see cases where too many requests going to one instance are causing lots of issues with other instances and things like that. They're very easy to separate. 

I think one of the other pieces that we find really important is state machines provide some constraints around how your code executes effects and how your code might update data. 

Essentially, what happens is, when your state machine makes a transition, those are really the only times that you can execute some effects, right? Maybe you're going to call out to charge a credit card or update the data that you own as a state machine. 

And what that allows us to do is to provide consistency guarantees for the state machines that we're running where we can say, "Hey, we can ensure this linearizable history of transitions that any individual instance of your state machine has taken." And we can make sure that your state in the external world never gets out of sync with your state within your state machine. You always have a consistent view of did you charge that credit card or did you not charge that credit card? Which becomes really important when you want to build reliable systems. 

[0:20:08] LA: That's a great way to think about it. You got on to something there which I hadn't thought of before, and that's the synchronization aspect. Are you saying it's easier to synchronize data when using a state machine? Or is it more a matter of that synchronization isn't as necessary when you use state machines? 

[0:20:26] AB: They interact in an interesting way. There are two ways of looking at it. I think from an application developer perspective, when you're writing a state machine, you don't have to think about synchronization at all, right? You just have a very easy execution model, which is I'm defining what happens when I transition from one state to another and I'm defining which transitions are valid. 

From a platform implementation perspective, working on StateBacked, we are able to take advantage of the fact that that's the execution model to ensure that we can preserve that consistency, right? We know exactly when any effects might take place, so we can make sure that we only execute those effects once we've stored the fact that this transition has occurred and that you're now in this current state. That allows us to really preserve those invariants for applications where applications don't need to think about them.
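The ordering described here, storing the transition first and running effects only afterwards, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `TransitionLedger` is a hypothetical in-memory stand-in for durable storage, and `commitThenRun` is not a StateBacked function. How a real platform retries or reconciles effects that did not finish is outside what the conversation covers.

```typescript
interface LedgerEntry {
  instanceId: string;
  fromState: string;
  toState: string;
  effectsDone: boolean;
}

// Hypothetical stand-in for a durable store of transitions.
class TransitionLedger {
  private entries: LedgerEntry[] = [];

  append(entry: LedgerEntry): number {
    this.entries.push(entry);
    return this.entries.length - 1;
  }

  markEffectsDone(index: number): void {
    const entry = this.entries[index];
    if (entry) entry.effectsDone = true;
  }

  // Transitions that were recorded but whose effects never finished,
  // for example because an effect threw or the process stopped.
  pendingEffects(): LedgerEntry[] {
    return this.entries.filter((e) => !e.effectsDone);
  }
}

function commitThenRun(
  ledger: TransitionLedger,
  instanceId: string,
  fromState: string,
  toState: string,
  effects: Array<() => void>,
): void {
  // 1. Record the transition before anything touches the outside world.
  const index = ledger.append({ instanceId, fromState, toState, effectsDone: false });
  // 2. Only then run the effects tied to that transition.
  for (const effect of effects) effect();
  // 3. Mark completion so a later recovery pass knows nothing is outstanding.
  ledger.markEffectsDone(index);
}

const demoLedger = new TransitionLedger();
const chargedUsers: string[] = [];
commitThenRun(demoLedger, "user-1", "freeTrial", "active", [
  () => { chargedUsers.push("user-1"); },
]);
```

Because the record is written first, an effect that fails partway leaves an entry in `pendingEffects()` instead of an untracked side effect, so the stored state and the outside world can always be compared.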

[0:21:24] LA: Is this a validation layer, so to speak? Since you can only do certain state transitions from some states to other states, and that's part of the business logic definition for an application, you can control what happens, which keeps the application from getting out of sync with that. 

[0:21:42] AB: That's right. 

[0:21:43] LA: Is that a good summary? 

[0:21:44] AB: That is. That's a good summary.

[0:21:46] LA: Okay. Well, let's go into that a little bit more. I know one of the things you talk about on your website is that, by using state machines for your state management system – and I use that very loosely. But you know what I'm talking about there.

[0:22:02] AB: I got you.

[0:22:02] LA: By using that, you're able to make assumptions about the system. And those assumptions help you in the back-end become higher performant, higher – I'm not even sure what all advantages you get. But you get advantages and you also can pass some of those advantages to your customers and higher performance, that sort of thing. Can you talk a little bit more about, specifically, what are the restrictions that using state machine requires and what benefits you get out of that?

[0:22:32] AB: For sure. I think the core thing is all of your logic is centered around defining your states and transitions. And then you can define different effects. And effects could be updates to the data that the state machine owns. Or effects could be external calls that you're making to other systems. But those effects can only occur at certain times. Those effects might happen when you make a transition. Those effects might happen when you enter a state or when you exit a state. And we know that those are the only times that you can initiate any kind of effect that might change something in the world. 
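As a rough illustration of effects being tied to exits, transitions, and entries, the sketch below collects effect names in the conventional statechart order: exit effects of the source state, then effects on the transition, then entry effects of the target state. The document machine and all effect names are hypothetical. Effects are represented as plain strings so the ordering is easy to see, whereas a real machine would attach functions.

```typescript
type DocState = "draft" | "published";

interface DocStateNode {
  entry: string[]; // Effects to run when the state is entered.
  exit: string[];  // Effects to run when the state is exited.
  on: Partial<Record<string, { target: DocState; actions: string[] }>>;
}

const docMachine: Record<DocState, DocStateNode> = {
  draft: {
    entry: ["startAutosave"],
    exit: ["stopAutosave"],
    on: { PUBLISH: { target: "published", actions: ["notifySubscribers"] } },
  },
  published: { entry: ["indexForSearch"], exit: [], on: {} },
};

function docTransition(
  current: DocState,
  eventType: string,
): { next: DocState; effects: string[] } {
  const arrow = docMachine[current].on[eventType];
  // No arrow for this event: no state change, and therefore no effects at all.
  if (!arrow) return { next: current, effects: [] };
  return {
    next: arrow.target,
    effects: [
      ...docMachine[current].exit,
      ...arrow.actions,
      ...docMachine[arrow.target].entry,
    ],
  };
}

const docResult = docTransition("draft", "PUBLISH");
const docIgnored = docTransition("published", "PUBLISH");
```

The second call shows the constraint from the other side: an event that triggers no transition produces an empty effect list, so nothing in the outside world can change.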

It turns out that that's kind of a natural way to model systems. When you're building a state machine, that doesn't end up being a burdensome constraint. It actually kind of ends up fitting with the way that you think about modeling whatever it is that you're building. 

But on the platform side, that's what allows us to ensure that you never make an external call to charge a credit card and end up with the user still in the free trial period state, for example. You know that you have consistently evolved your internal state and made whatever external calls might affect the rest of the world. We're able to use that to make sure that we provide this consistent history of transitions for you. 

The other piece that's interesting is we're able to essentially give you long-running timers and reliable timers. If you want to say, "Hey, in one day, I want to make sure that I send out this email." You can just schedule an event very easily to happen in a day within our system and then we'll deliver that event. And we know that that can be done reliably, et cetera. That's kind of the way that we're thinking about these constraints giving us the ability to provide some guarantees to applications.
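One way to picture a reliable long-running timer is as a stored event with a due time, checked against the clock, instead of an in-process timeout. The sketch below assumes exactly that. `DelayedEventQueue`, its method names, and the explicit `nowMs` arguments (used in place of a real clock so the example is deterministic) are all hypothetical, and a real platform would persist the queue.

```typescript
interface DelayedEvent {
  instanceId: string;
  eventType: string;
  dueAtMs: number;
}

class DelayedEventQueue {
  // In-memory here; the timer is just data, so it could be stored durably.
  private pendingEvents: DelayedEvent[] = [];

  scheduleEvent(instanceId: string, eventType: string, nowMs: number, delayMs: number): void {
    this.pendingEvents.push({ instanceId, eventType, dueAtMs: nowMs + delayMs });
  }

  // Returns every event whose due time has passed, earliest first,
  // and removes those events from the queue.
  deliverDue(nowMs: number): DelayedEvent[] {
    const due = this.pendingEvents.filter((e) => e.dueAtMs <= nowMs);
    this.pendingEvents = this.pendingEvents.filter((e) => e.dueAtMs > nowMs);
    return due.sort((a, b) => a.dueAtMs - b.dueAtMs);
  }
}

const ONE_HOUR_MS = 60 * 60 * 1000;
const ONE_DAY_MS = 24 * ONE_HOUR_MS;

const reminderQueue = new DelayedEventQueue();
reminderQueue.scheduleEvent("user-1", "SEND_REMINDER_EMAIL", 0, ONE_DAY_MS);

const dueAfterOneHour = reminderQueue.deliverDue(ONE_HOUR_MS); // Nothing is due yet.
const dueAfterOneDay = reminderQueue.deliverDue(ONE_DAY_MS);   // The reminder is due now.
```

Since nothing has to stay running while the day passes, this shape also fits the idea of instances that use almost no resources between events.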

[0:24:30] LA: The restrictions really aren't restrictions on what you can do. They're restrictions on how you think about designing your business logic. Is that a fair – 

[0:24:38] AB: That's right. That's right. 

[0:24:39] LA: Okay. And then by thinking about how you design your business logic in this model, you naturally build in these constraints. And these constraints are what give you a higher level of reliability and also helps you with performance and other benefits. 

And now, the benefits, getting to the benefits, there're benefits to you as a platform, but also benefits to your customers. To your customers, the benefits are a truer understanding of what your system is actually doing. Because it's well mapped out and well – it's doing what you said you wanted it to do. And you have evidence that it's doing that. And you are providing that as a service to help make you more reliable. You're gaining performance. You're gaining readability. What other benefits does the customer get? And then what benefits do you get on the back-end? How is building a state machine platform easier than a generic database platform, for instance? 

[0:25:36] AB: For sure. I think one of the benefits touches on something that we spoke about earlier, which is we know when your code is no longer doing anything, right? We know when we have to keep your code running actively. And we know when we can hibernate your code. So, we don't need this process running anymore. And that allows us to obviously get higher density. That allows us to get better utilization. And allows us to, in the end, kind of provide a cheaper service where your code runs when it needs to. And we make sure that it is able to run whenever it needs to. But we're able to kind of very seamlessly hibernate it when it's no longer needed. These things are still cheap to turn up, but they don't need to run all the time. That's one kind of piece of it. 

I think one big piece we haven't spoken about much is we're able to kind of visualize your system in some interesting ways. We're able to show you the set of transitions that have happened. And we're able to show you what your machine looks like at any given point so that you can better understand how your data is evolving and how your system works.

[0:26:41] LA: You have reports that allow you to analyze where things are. Here are your state machines. Here's how many customers are in this state. How many customers are in that state. That sort of thing. 

Okay. Those are all great benefits. That makes a lot of sense. And I love the – it's kind of the serverless mantra, too, where it's like we only run when we need to. And we share resources very, very effectively. And that reduces cost. To the point where it sounds like, and correct me if I'm wrong, that for you to implement a single customer state machine, it's very, very, very inexpensive. So, it's very easy and cheap for customers to spin up state machines and not have it be a huge burden. 

[0:27:19] AB: That's right. I think one interesting kind of side note on that is, if you want to allow – if you're running a SaaS where your customers should be able to define some logic on their own, that can be a very difficult thing to implement as a SaaS provider. Because state machines are very cheap and because instances of those machines are cheap within StateBacked, that's something that you could kind of offload to StateBacked. You could have a customer define a state machine that represents their custom logic for how they should interact with your SaaS and create a state machine on StateBacked to spin up instances of that kind of in ways that make sense for your application. That's kind of an interesting area that we're starting to explore as well. This idea of allowing customer code or customers to define their logic within state machines and deploy that to StateBacked. 

[0:28:10] LA: What you just described is very near and dear to my heart. Because I'm actually working on an application myself right now that requires me to define SSL certificates for my customers. My customers are using a SaaS application. And so, I need to get those SSL certificates. 

And if you're familiar with Let's Encrypt, it's a free service to get SSL certificates, but there's a process involved in getting them that requires verification, validation, a whole bunch of steps. And it can be kind of an involved process that I have to execute over and over and over again for every single customer. It's a state machine processing. 

And I ended up having to hardcode this mechanism. But in hindsight, if I would have just waited another couple of weeks and I could go and use something like what you have. Because that's exactly the use case you were talking about. I have a workflow that I need my customers to use that has really nothing to do with my business logic. But it's an important workflow for my customers to go through in order to get something they need in order to use my SaaS product. And that is an SSL certificate. 

And so, building that out with a workflow just for that could take it completely out of my code. So, I don't have to worry about it at all. It lets you deal with the state management. And I know, because I've built the StateBacked machine that describes the flow, exactly how it's going to work. And that it will work. 

And so, all the issues that I was dealing with were things like, "Well, it failed the first time. Now I need to retry." How do I do a retry without staying in the loop and staying connected? How do I do that in this sort of model? And it gets kind of complicated after a while if you aren't thinking of it generically like with a standard state machine like that. That's a good use case for this sort of a product where it's an incremental use case. It's not all of my data. But it's a good part of my application that would work very well there.

[0:30:01] AB: I think that's a great use case. Yeah, anytime you want to create this long-running reliable process, that can be very difficult to do. But representing that as a state machine, super easy to just go and deploy that as a machine running on StateBacked. 

[0:30:14] LA: I imagine things like notification delivery is another example. I want to notify my customers of some event. And now each customer has a different way of getting the notification. Whether it's an email or an SMS message. And each of those takes various periods of time to execute. And sometimes they want more than one. That whole thing can just be a process flow.

[0:30:33] AB: Definitely. And if you implement that, you will end up batching, and doing different rate limiting and all of that. And I think that that's a good example of something we could help with.

[0:30:42] LA: Exactly. Exactly. Talk a little bit more about what's the process for creating these state machines. What does a state machine in your product look like?

[0:30:53] AB: Sure. We've actually adopted XState as our machine definition library. XState is a really popular library on the JavaScript side of things and TypeScript side of things. Has really nice TypeScript support. 

You basically define your state machine as an XState machine. And what that looks like is either a very data-oriented kind of object that you define within JavaScript or you can actually design it visually. There are some visual editors that are really good to describe how your state machine should look and create all the transitions and things like that. 

Then you'll take that state machine as a file in your Git repo probably and you'll basically define two other functions to turn it into a StateBacked machine. And those two functions are allow read and allow write. And that's it. You've got your state machine. And then these two functions allow you to authorize different operations based on the context of who's making the request. 

We take all the claims about a user that we know. Things like what's their user ID, for example. And the state of the machine. So, what state are you in? What event are you trying to send? What does the kind of owned data within that machine look like? And you can decide, "Is this a legitimate request or not?" And that's the whole definition of a StateBacked machine.
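
The two authorization functions described above might look roughly like the following. The transcript says only that they receive claims about the user, the machine's current state, the event being sent, and the machine's owned data; the exact parameter names and shapes below (`authContext`, `state`, `context`, `event`), the document example, and its rules are assumptions made for illustration.

```typescript
// Assumed shapes for the information the authorization functions receive.
type AuthContext = { sub?: string }; // claims about the caller, e.g. a user ID
type DocContext = { ownerId: string; collaboratorIds: string[] }; // hypothetical owned data
type DocEvent = { type: "EDIT" | "SHARE" | "ARCHIVE" }; // hypothetical events

type ReadRequest = { authContext: AuthContext; state: string; context: DocContext };
type WriteRequest = ReadRequest & { event: DocEvent };

// Reads: allowed for the owner and for any listed collaborator.
const allowRead = ({ authContext, context }: ReadRequest): boolean => {
  const userId = authContext.sub;
  if (userId === undefined) return false;
  return userId === context.ownerId || context.collaboratorIds.some((id) => id === userId);
};

// Writes: the decision depends on who is asking, which state the machine is
// in, and which event they are trying to send.
const allowWrite = ({ authContext, state, context, event }: WriteRequest): boolean => {
  if (authContext.sub === undefined) return false;
  if (state === "archived") return false; // nothing may change once archived
  if (event.type === "EDIT") return allowRead({ authContext, state, context });
  return authContext.sub === context.ownerId; // only the owner may share or archive
};
```

Keeping these as pure functions of the request means the same rules apply no matter which client sends the event.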

And then we have either a command line tool or you can use the web dashboard where you can upload that file. Or essentially, we kind of build that into a bundle and then upload that bundle. And at that point, you have a machine that lives in the StateBacked cloud. And then the next step would be integrating that with your application. 

We have a client SDK that you can use, as you said, either on the back-end or on any of your clients to create instances of that machine, to send events, and to read the state of any running instance of those machines. 

[0:32:53] LA: That's how it works when you're talking from a front-end to StateBacked. How does that differ when you're adjusting your back-end to take advantage and use StateBacked? 

[0:33:04] AB: I think it is essentially the same. The only real difference comes in the authorization side of things. For example, we talked a little bit about token exchange earlier. That's kind of our preferred method to authenticate and then authorize from the front-end. 

On the back-end, because you have your own private back-end, you can just mint tokens yourself to make whatever claims you want about your service. Whether you want to make requests acting on behalf of a user and make sure that that authorization is kind of end-to-end, you can implement that. Or if you want, you can create a token that says, "Hey, I am service A, B, C." And then within your authorization logic and your machine you can say, "Well, I accept requests from service A, B, C. They're allowed to send whatever event to me and I'll process that."
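
The distinction between user claims and service claims can be sketched as below. Signing and verifying the token itself is left out; the claim shape, the `billing-service` name, and the allow-list are hypothetical and exist only to show how authorization logic might treat the two kinds of caller.

```typescript
// Hypothetical claim shapes: a user acting through the front-end, or a
// back-end service identifying itself, optionally on behalf of a user.
type Claims =
  | { kind: "user"; userId: string }
  | { kind: "service"; service: string; onBehalfOf?: string };

const TRUSTED_SERVICES = ["billing-service"]; // assumed allow-list

// Decide whether the caller may send events to a machine owned by `ownerId`.
function mayWrite(claims: Claims, ownerId: string): boolean {
  if (claims.kind === "user") return claims.userId === ownerId;
  if (TRUSTED_SERVICES.indexOf(claims.service) === -1) return false;
  // A trusted service acting for a user is held to that user's permissions;
  // a trusted service acting as itself is accepted.
  return claims.onBehalfOf === undefined || claims.onBehalfOf === ownerId;
}
```

With this arrangement, end-to-end user authorization and service-level trust are both expressed as claims, and the machine's own logic decides which to accept.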

[0:33:57] LA: Got it. It depends on the application and how you want to do the authentication, but you allow that flexibility. That makes a lot of sense. Let's talk about concurrency. We started talking about that a little bit. And we've talked about how individual state machines essentially are separate instances that run independently. And that allows lots of scaling benefits and concurrency benefits. But let's look at the individual state machine itself. How much is its performance impacted by the performance of the rest of the system? What I mean is the rest of the state machine and the other instances. How parallelizable is your service and your ability to handle these requests? I think that's really what I'm asking there.

[0:34:40] AB: Sure. Got it. That's something that we're investing a lot in kind of over time. Right now, we are able to kind of get down to the granularity of a machine if we want and could run kind of individual machines within individual servers in our clusters. We don't do that, right? We do collocate machines within servers. But we have a lot of protection in place to make sure that we have security boundaries between customers. We kind of have a process per customer model where we never run different customers' machines within the same process. And between Spectre, and Meltdown and all that, we like to have that separation. 

But we don't restrict a customer to one process, obviously. You might have many, many processes running within our back-end for any particular customer. And each one of those is running potentially many machines for that customer. 

[0:35:33] LA: I think I'm asking something a little differently. That's a great answer. But I think I'm asking something a little bit different than that. Philosophically, it's easy to see how state machines are scalable. Let's look at a very specific instance. I have an application that I just built and it uses a single state machine. Make it nice and simple. 100 states, whatever. And everything runs with it. Everything's wonderful. It's a great app. It goes viral. And the very next day, I have a million instances of it running. That's a huge increase in usage in a very, very, very, very short period of time. That's not typical. Usually, you get much more ramp-up than that. But it does happen. Those sorts of spikes do happen. How scalable is the state machine method? And I'm separating it out from your implementation for the time being. But I want to get into both. How parallelizable is the state machine mechanism to allow that sort of a spike and not affect the individual user, the individual person? Of the million people, how are each of those independently impacted by the fact that there are 999,000 other people doing the exact same thing? And then how does that work within your application? And what levels of scalability like that are you able to do today? And what do you expect you'll be able to do with technology? 

[0:37:04] AB: Sure. I think there are kind of two pieces when you think about scaling state machines. The first is to think about how you're scaling the data storage and the updates to the data that each state machine owns. And that, you would apply the standard set of techniques in terms of things like sharding and kind of using scalable data stores where writes to a particular record don't affect writes to all of the other records. That is kind of a well-studied area that we're confident we can scale quickly on. 
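
The sharding idea mentioned above can be sketched generically: hash each machine instance's ID to pick a shard, so that a write to one instance only touches that instance's record. This is a textbook hash-sharding sketch, not a description of StateBacked's storage layer; the hash function (FNV-1a), the in-memory maps, and all names are assumptions for illustration.

```typescript
// FNV-1a, a simple 32-bit string hash, used here to spread IDs across shards.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// The same instance ID always maps to the same shard.
function shardFor(instanceId: string, shardCount: number): number {
  return fnv1a(instanceId) % shardCount;
}

// In-memory stand-in for a sharded store: one map per shard.
class ShardedStore {
  private shards: Map<string, string>[] = [];

  constructor(private shardCount: number) {
    for (let i = 0; i < shardCount; i++) {
      this.shards.push(new Map<string, string>());
    }
  }

  // A write touches exactly one record in exactly one shard.
  write(instanceId: string, state: string): void {
    this.shards[shardFor(instanceId, this.shardCount)].set(instanceId, state);
  }

  read(instanceId: string): string | undefined {
    return this.shards[shardFor(instanceId, this.shardCount)].get(instanceId);
  }
}
```

Since each instance's state is a single independent record, adding instances adds records rather than contention, which is why this part of the problem is considered well studied.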

The second piece is how do you scale the processes that are running your state machines? This would be kind of akin to if you're running a functions as a service platform, well, what's your cold start time, right? How quickly can you create a new instance of this function? And do you reach a point where you've exhausted your quota and you have too many functions running at once, right? I think those are kind of the two vectors that we think about scalability in terms of or the two kind of pieces of our scalability story. 

We on the data store side – as I said, that's kind of well-studied and I think we have a good answer in terms of how we can shard that and very quickly scale the number of instances of state machines that we're running. In terms of running individual state machine instances and processing that client code that's provided to us that defines that state machine, that's something where right now we're running a significant number of state machine instances on each one of the machines in our back-end. But each machine in our back-end does take a number of seconds to a minute or two to come up. 

In terms of ramp-up, we're able to ramp up fairly quickly, but not instantaneously. And that's something that we're spending a lot of time thinking about now where there's some pretty cool things that we're thinking about in terms of, "Hey, how can we scale up in one second to accommodate another few thousand state machines?" That's kind of an interesting area of research that we have some cool ideas for in the coming months.

[0:39:13] LA: Great. Great. Yeah, that's exactly the sort of thing that I was thinking about. Getting back to the data scalability standpoint a little bit, you mentioned sharding. And one of the things with sharding is sharding is an excellent mechanism for scaling data when all of the data is roughly the same size for each instance. Once you start having large disparities in the size of the data set, that sharding becomes a lot more difficult. 

Now, I got a hunch that that's not a problem you have. That most of your individual state machines are probably reasonable in size as far as the amount of state that they actually have. And you're not talking about storing large quantities of data within a single state machine. Is that a true statement and is that effectively one of the "restrictions"? May not be a formal restriction, but a logical restriction on how you use your services. It has to have a rational amount of size to the state. 

[0:40:15] AB: That's right.

[0:40:15] LA: What do you consider rational? 

[0:40:17] AB: Yes. And it is actually a formal restriction that we have. We think that by putting a limit in place in terms of how large the data for any instance of your state machine can get, we're able to provide kind of the set of guarantees that we want to provide. Right now, we've basically kind of looked around at similar services. And I think DynamoDB is kind of a good example where you can store an infinite amount of data in DynamoDB. But each record can only be 400k. 

We've kind of adopted a similar limit on our end where we've said, "Hey, we're going to allow 400k of data within an individual instance of a state machine. Store as much state as you want across instances. Create as many instances as you want." But we think that that helps to kind of give people a good idea of what types of data you want to store within your state machine versus what types of data you might want to store elsewhere. And you can still interact with it from within your state machine. 
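
A per-instance size limit like the one described above could be checked along these lines. The sketch assumes "400k" means 400 × 1024 bytes and that size is measured on the JSON-serialized, UTF-8 encoded data; how StateBacked actually measures and enforces the limit is not stated in the conversation, and the function names are hypothetical.

```typescript
const MAX_INSTANCE_BYTES = 400 * 1024; // assumed interpretation of "400k"

// Count the bytes a string would occupy when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair: one code point outside the basic plane
      i++; // skip the low surrogate
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// True if the instance data, serialized as JSON, fits within the limit.
function fitsInInstance(data: unknown): boolean {
  return utf8ByteLength(JSON.stringify(data)) <= MAX_INSTANCE_BYTES;
}
```

Data that fails a check like this is a candidate for storage elsewhere, with the state machine holding only a reference to it.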

[0:41:18] LA: So, that helps with your sharding issue very, very effectively. Those sorts of limits, those sorts of very small limits. I mean, in the grand scheme of things, 400k is a tiny limit for a data set. But that limit is practical from what you're trying to accomplish. Yet, it allows you the flexibility of lots of different sharding options and lots of different data-sharing options. And that's great. 

It does beg the question, and I know I know the answer to this, but I got to ask it anyway just to make sure. An individual application can have multiple different distinct state machines running simultaneously. And those state machines can interact with each other. 

[0:41:58] AB: That's right. 

[0:42:00] LA: One state machine can trigger another state machine and transfer data back and forth or at least flow of control back and forth if not data back and forth.

[0:42:10] AB: Exactly.

[0:42:10] LA: And that's all true, I'm assuming.

[0:42:12] AB: That is true. And we think of that – like we said earlier, we kind of think of these state machines as you would think of entities in your system, right? You might have a user state machine that interacts with an organization state machine, that interacts with a permissions state machine, that interacts with a document state machine. Each of these entities generally has a small amount of data. But in aggregate, they can represent kind of any size data that you want. And they can communicate in this structured way by sending events to each other. 
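
The idea of entity-like machines that communicate by sending events to each other can be sketched with a small in-memory event bus. Everything here is hypothetical: the bus, the entity names, the event names, and the handlers are illustrative stand-ins, not StateBacked's mechanism for machine-to-machine communication.

```typescript
type EntityEvent = { type: string; subject: string };
type Send = (to: string, event: EntityEvent) => void;
type Handler = (event: EntityEvent, send: Send) => void;

// Routes events to named entities. Entities never touch each other's data;
// they only send events.
class EntityBus {
  private handlers = new Map<string, Handler>();
  private queue: { to: string; event: EntityEvent }[] = [];

  register(name: string, handler: Handler): void {
    this.handlers.set(name, handler);
  }

  send(to: string, event: EntityEvent): void {
    this.queue.push({ to, event });
  }

  // Deliver queued events one at a time, including any events that handlers
  // send while processing, until the queue is empty.
  drain(): void {
    let next = this.queue.shift();
    while (next !== undefined) {
      const handler = this.handlers.get(next.to);
      if (handler) handler(next.event, (to, event) => this.send(to, event));
      next = this.queue.shift();
    }
  }
}

// Each entity owns a small amount of data.
const orgMembers: string[] = [];
const roles: Record<string, string> = {};

const bus = new EntityBus();
bus.register("org:acme", (event, send) => {
  if (event.type === "USER_JOINED") {
    orgMembers.push(event.subject);
    send("permissions:acme", { type: "GRANT_MEMBER", subject: event.subject });
  }
});
bus.register("permissions:acme", (event) => {
  if (event.type === "GRANT_MEMBER") roles[event.subject] = "member";
});

bus.send("org:acme", { type: "USER_JOINED", subject: "user:1" });
bus.drain();
```

After `drain` returns, the organization entity has recorded the new member and the permissions entity has granted a role, with each entity having changed only its own data.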

[0:42:44] LA: In a traditional MVC, model-view-controller, stack mechanism, really, a state machine is the model. It's not just the data. It's the model. It's the logic flow that goes with it, et cetera. And some of the interconnections between state machines are your controller logic. And, of course, your views are the code that you're writing in the front-end. And there's more to it. It's not that simple. But, basically, that's kind of the idea. 

It's very easy to have many different state machines with different scopes of influence. And so, you can have user data that's private, user data that's shared, account data that the user has access to that's shared by other users of the same account, et cetera, et cetera, et cetera. All by having different state machines with different scopes attached to them. 

[0:43:34] AB: That's right. That's right.

[0:43:36] LA: Cool. That was really kind of the last of those questions I had. I was thinking of the dual-purpose case. We have some shared and non-shared data. But that really answers that question. That really, an application isn't a state machine. It's potentially a large number of state machines. 

Let's look at a single state machine and what is a rational size for it. Now I know, saying, what's a rational size for a state machine is kind of like saying what's the rational size for a state in the country, right? Or lots of other things. There's no one right answer to this question. But are the state machines that you deal with typically tens of states, or hundreds of states, or hundreds of thousands of states or distinct steps? 

[0:44:21] AB: We typically see tens. When you get to hundreds, that's quite a big state machine. I think when you get to that point, it probably makes sense to start breaking things up not from a system perspective, but just to be able to understand how this thing is working and operating.

[0:44:39] LA: Yeah. The diagramming approach allows you to have perhaps larger systems than you practically should.

[0:44:48] AB: Yeah.

[0:44:49] LA: But it can get out of whack pretty quickly if you're not careful. And it's just like anything else. 

[0:44:53] AB: You could have a 30,000 line code file. 

[0:44:56] LA: We're going to have to have another discussion about the effects this sort of mindset shift has on the overall software architecture process in a company. But, unfortunately, I think we're going to have to save that for the next time we talk, which hopefully isn't going to be long. Because I've really enjoyed this conversation. And I'm really intrigued by what you guys are doing. And I'd love to find out more as you go along and see where you stand and how you improve and change over time. 

But for now, Adam Berger is the founder of StateBacked, a scalable state-driven back-end for modern applications. And he has been my guest today. Adam, thank you very much for joining me on Software Engineering Daily.

[0:45:36] AB: Thank you so much, Lee. This was just a great conversation. I really, really enjoyed it. It was great speaking with you.

[END]