EPISODE 1814 [INTRODUCTION] [0:00:01] ANNOUNCER: A major challenge with creating distributed applications is achieving resilience, reliability, and fault tolerance. It can take considerable engineering time to address non-functional concerns like retries, state synchronization, and distributed coordination. Event-driven models aim to simplify these issues, but often introduce new difficulties in debugging and operations. Stephan Ewen is the Founder at Restate, which aims to simplify modern distributed applications. He is also the Co-Creator of Apache Flink, which is an open-source framework for unified stream processing and batch processing. Stephan joins the show with Sean Falconer to talk about distributed applications and his work with Restate. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:03] SF: Stephan, welcome to the show. [0:01:05] SE: Thanks for having me. Hi, Sean. [0:01:07] SF: Yeah, absolutely. Thanks for doing this. I'm excited to get into it. I just wanted to start off with a bit of your background. What was your journey and experience from working on Flink to now being the CEO and Founder of Restate? [0:01:21] SE: Yeah. Most of my professional life so far has been Apache Flink. I was part of the team that started it in 2014, and in a way, I'm probably responsible for a lot of the early architecture of Apache Flink, around the way that the data plane, the coordination, the snapshots, and all of that worked. The journey actually started even earlier. In a way, it started when I was still at university in grad school, and we were working in this intersection between Hadoop and databases, and some of the very, very early steps of stream processing had just come up. Like, Storm was a new thing back then.
After that, we actually took the project that we worked on at university and turned it into an open-source project with a mix of a pipelined batch processing system and maybe some early steps of a streaming system. As part of the open-source journey, we found our sweet spot with users and stream processing; turned it into a stream processor, and then I kept riding that wave of Kafka, Flink, the advent of real-time stream processing, stateful stream processing, unified batch and stream processing, and so on. I left that space roughly in 2021 to focus on something new, and a project, or a set of problems, that caught my eye back then was similar to the problems that we were trying to address with Flink. Flink being an analytical system for robust analytical real-time pipelines, we were more and more being asked, how do you build more transactional, event-driven applications? Not the applications that aggregate events and join events and so on to feed dashboards, feed recommenders, and so on, but the type of pipelines that in the end actually process payments, do invoicing, orchestrate orders and shipments, and so on. These are the types of applications that folks were stitching together manually with databases and queues and lots of custom logic. It felt like folks were looking for a solution. They even turned to systems like Flink to implement that. It's not a great match. You don't use an analytical system for transactional processing, as a general rule of thumb. I guess this question came up more and more, and we thought, okay, we should probably start looking into that space and building something, and that is when we started working on Restate. We've been doing this for round about three years now, and there we are. [0:03:34] SF: How do you describe Restate? Do you term it, or do you bucket it into this class of durable execution frameworks? [0:03:42] SE: Yeah. It definitely puts durable execution as one of the main ingredients on its list. You're right.
There's this big bucket of durable execution engines. It almost seems like there's a Cambrian explosion of those right now. There's a new one every few months. Restate is definitely a good candidate if you're looking for a durable execution engine. It's a little more than that, though. It's really, I would say, a more holistic platform for building distributed, resilient applications. It doesn't just include durable execution, as in being able to journal different steps in your process and being able to reliably recover them, which is this notion of workflow-style logic, but implemented in general-purpose code. Restate goes quite a bit beyond that. Restate tackles the more holistic problem of, what if we try to apply this idea of durable execution not just to a single workflow? What if we incorporate concepts like distributed communication, or state that outlives an individual workflow, or an individual durable execution? How would all those things interact? How do you build a more general platform that applies that level of durability and resilience to distributed services in general, and not just an individual workflow? [0:04:57] SF: Going back to the rise of durable execution frameworks and the idea that there seems like there's a new one every six months or whatever. Why do you think that is the case? This is something that people are investing time into, and it seems like there's growing interest in it. [0:05:14] SE: Yeah. I think this is because the state of the art is just not feasible. More and more developers and companies are reaching the conclusion that the challenges that you're facing today when you're implementing distributed apps are just not something that many software development teams can handle, and the ones that can handle it are not really using their time well, because they're spending most of their time on problems that have nothing to do with the business logic.
They're spending their time on problems like figuring out race conditions, and how to avoid split brains, and how to avoid lost updates if a zombie process appears, all those things. They should be focused on adding features to the application and not creating workarounds for distributed systems problems. I think this has gotten particularly bad with the rise of microservices, and I think it's a big part of why there's a little bit of a backlash even against microservices right now. For all the benefits they give you, I think many people realize how challenging distributed infrastructures with lots of microservices are. Some are just saying, "Okay, let's go back to the monolith. It was just a bad idea in the first place." Then there's a whole group of people that say, "No, no. We actually like a lot of the benefits that microservices give us. We just want a stronger foundation to build them on. We want something that frees us from dealing with many of these problems." I think this is where the whole wave of durable execution systems got started, that movement. I would say, it actually gets more and more necessary to have those systems, because applications get increasingly distributed. It's not only the services you've built yourself, but more and more of the functionality that you access is hidden behind APIs provided by SaaS vendors, and so on. These are all services that you interact with. They become part of your microservice architecture, even though you don't really own them. They add to the complexity of the problem, and that's a trend that's only increasing. I think that's not going back. [0:07:04] SF: Yeah. Even if someone says, "I hate microservices, and down with microservices. I'm going to go back to the monolith."
Even if I build that monolith and I deploy it and I'm able to manage that, I doubt that application exists in isolation; it's going to have interdependencies with third-party services, which then are going to reintroduce, essentially, all these distributed system problems. If I'm connecting up microservices, or even if I'm calling a third-party API, then there's all kinds of things in a distributed system that can go wrong. There can be outages. Without using some framework to help me solve those problems, what are teams typically doing? Are they just making those requests, and then at best they're doing some retry scheme with exponential backoff to see if they can push that through, and they're okay with sometimes that not happening? Or what is it that companies are doing to try to solve these problems now? [0:07:57] SE: Yeah. I think there are a lot of different approaches to that problem. The first observation I would throw in is, there's a lot of companies that actually don't really get it right. Just the fact that there are still so many websites that tell you, don't hit F5 while you're in the middle of a booking or order process, is one indication that they can't really handle these things well, like concurrent requests. You still see lots of artifacts where, if you are a developer, you can understand, okay, something has gone wrong in their backend, and whatnot. I would say, first of all, a lot of the time it's not actually getting solved correctly. I actually heard a quote on another podcast from somebody who works at a food delivery startup who said that for many, many years, their solution was just ignore it and send a voucher, until at some point that just became really expensive. Ignore the problems and send vouchers. I would say, if you want to actually solve the problem, one of the ways to do it typically is to stitch together different systems. The typical ingredients would be: use a queue, a database, and put in your own retry loops with backoff.
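The DIY approach described here, a retry loop with exponential backoff, can be sketched in a few lines. This is a generic illustration (the function names and parameters are made up for this sketch, not from any particular framework), and the comments note exactly what it does not solve:

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry fn() with exponential backoff and jitter.

    What this does NOT solve: if this process dies mid-call, the retry
    state dies with it; a duplicate request may already be in flight;
    and fn() may not return the same result across attempts.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter to avoid thundering herds.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```

As the conversation goes on to explain, this is only the first ingredient: it says nothing about crashed callers, redelivered events, or two workers racing on the same event.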
It's not just as simple as implementing a few retries with backoff, right? You always have to worry about what happens while the retry is actually happening, if you're triggering this as an RPC call. You might be retrying it, but that process might actually go away. Then, as a next step, you probably put a queue in front, so that even if the process goes away, the event gets redelivered somewhere else. Now you actually have to worry about the fact that you might have two processes that work on the same event twice. Are they overwriting each other with their retries? Do you want to throw in a lock? Do you want to introduce versioning and conditional updates, and so on? Then, you might be interacting with APIs. You might call them, get a result, crash afterwards, call them again, and get a different result the second time you call them. The next retry actually follows a different control flow than the first one, and things go completely haywire. It doesn't stop with a retry. Very often you start with a queue and a retry, and then you incrementally add bits to guard against this bug that you discovered and that one, and it incrementally grows really complex, and it makes hidden assumptions that this is exactly how that queue behaves, and that's exactly how that API behaves. Then somebody changes that and everything breaks again, and then you're back to fixing this. I don't think there are really good solutions. If you go to the extreme end of saying, okay, here's an extremely sensitive, high-value process, then sometimes folks throw in workflow engines as one solution, right? Let's say, here's an order process that we really don't want to go wrong, because that can actually cost a lot of money. You might pull in a heavyweight workflow orchestrator, but it's really not something you typically pull in for small microservice logic. First of all, because it's really a complicated component to have in the stack.
Second of all, it's really a foreign citizen. It doesn't interact well with a lot of the other logic. [0:10:31] SF: Yeah, and it's probably a little heavyweight for the majority of the types of calls that you might be making between microservices, or even to external services. [0:10:41] SE: Yeah. This is actually one of the interesting things that durable execution brings in, specifically in an implementation like we're looking at at Restate. If you can actually make durable execution cheap, cheap as in low latency, so that having a durable step introduces a very moderate latency overhead, then what you can actually do is start assuming these workflow-style guarantees for a lot of the code in your application. It's no longer prohibitive to do that, from two sides. It's no longer so slow and expensive that you say, "Oh, I really don't want this here. This is in the synchronous path of the user interactions and is going to make everything really sluggish." It's still going to feel fast despite that. The second thing is, you're still writing code. It still fits in with all your tools, and with all your deployment pipelines and all your versioning, or your schema registries. You can still keep using that. It feels like you can keep doing mostly what you were doing. You're just adding this fine-grained reliability to your functions and getting a lot of the problems out of the door. That's actually the ultimate goal of systems like Restate. [0:11:40] SF: Okay. Going back to 2021 when you started working on this, where did that project start? How did you even begin to try to tackle this problem? [0:11:48] SE: Yeah. We started initially trying to solve this from within Apache Flink. As I mentioned, we were working with a bunch of users on analytical pipelines. Then this question came up of users building transactional, event-driven pipelines on Flink. We really didn't find that a good match, and saw very moderate success.
There were a few that could make it work with specific approaches, but it was generally not a good experience. Then we started a sub-project in Flink. It's still around. It's called Stateful Functions. The idea was, let's take the thing that folks really like, which is reliable communication and transactional state, and encapsulate that into an individual piece, an individual function. Think of it like a Lambda function, but when it's invoked, it has contextual state. It's invoked in the context of a key, attached to a key. When it's invoked, it's hydrated with the state of that key. You can interact with that, modify that. It can produce a set of RPCs, or messages, that go out to other functions. Then this is all transactionally committed: the messages are sent to other functions, the state is committed. It's almost a stateful, disaggregated, serverless actor system. I think of it like that. That in principle raised a lot of interest. There were a lot of folks that did like that as an abstraction and could see this is great for building anything from digital twins to, well, transactional state machines that represent orders, or even invoicing and payments, and so on. Folks did actually build payment processors on that. We learned that only years later. It was quite crazy. There was just one linchpin that this had, and that is that it was built on Flink as an analytical system. Flink is really throughput optimized. It's low latency for an analytical system, but that is mostly if you use the at-least-once semantics and so on. Meaning, you push events right when they come. If you actually want transactional results, you're introducing a huge latency. Namely, Flink is checkpoint-based, and you have to wait until the next checkpoint has happened. That's the latency you introduce if you want to make sure that you don't take a second step before the first step is really durable. That is in the seconds, right?
Imagine using this as a foundation for workflow-style logic. It means every workflow step has a, let's say, 10-second latency. It was completely impossible to do that. It is also something that the Flink architecture could never fully remedy. We thought, if we really want to make that happen, if we want to make that vision happen of durable execution being something that's so low latency that you can use it without worrying about introducing latency overhead, even in latency-critical paths that are synchronous interaction paths, then we'd really have to build a new stack. We'd have to start from the bottom, building on a low-latency log, on a low-latency architecture that emphasizes fast durability and not analytical throughput. That's how we then got started with Restate. [0:14:30] SF: What are the core building blocks of Restate, both from a user perspective? What am I stitching together from the developer experience? Then, what is the architecture behind the scenes that's helping me, essentially, support that in a way that is going to shield me against these outages, or other issues that you might run into in distributed systems? [0:14:50] SE: Yeah. There are different levels from which to look at it. Let's look at it first from the infrastructure side: where is Restate actually sitting in your infrastructure? You can think of it as taking a similar place to a message queue, or a message broker, or a workflow orchestrator. It's a marriage of, let's say, the Kafka-esque event-driven application world and the Temporal-esque durable execution workflow world. It sits where a broker would sit. You're writing your logic as service handlers. We really try to keep microservices as the abstraction. You're writing services almost as if it would be a Spring Boot application, or an Express.js application, or so: handlers grouped into services. Then Restate is the queue through which those services get triggered, right?
If you want to actually trigger a handler, you put an event in that queue that's supposed to invoke the handler, and then Restate invokes that service. In that sense, it's a classical queue in front of the service. The abstraction that we expose, though, is not really that of an event. It's more that Restate looks at the services and their handlers, re-exports them, and becomes a reverse proxy. We're really trying to get away from people thinking in terms of queues and events, and to keep them thinking in terms of synchronous and asynchronous RPC. That's really how you build it. It sits in the infrastructure like Kafka does, and then it exposes itself as a reverse proxy that sits in front of the services. [0:16:12] SF: Is the main advantage, in comparing this to doing something like event-driven architecture with Kafka, just having a level of abstraction from having to think about events and queues and so forth as the person doing the implementation? If I'm doing the implementation, I can just focus on the work that I need to do, and that stuff is abstracted away by Restate? [0:16:34] SE: I guess that's one way to think about it. I think if we go into the details of what the programming model has, maybe we'll see this. But in general, you can think of it as a level up from a Kafka-style event-driven application. You're not thinking in terms of queues and events. You're thinking in terms of durable, stateful, resilient invocations, or functions. Yeah, that sounds like maybe an academic detail, but it actually makes a world of difference, because it means that Restate takes on a lot more responsibilities. It doesn't just take on the responsibility of saying, okay, I'll deliver the event so the function is triggered, and it's reliably redelivered, retriggered on a failure. It also understands, okay, how do I fence retries against earlier executions? How do I log contextual state, attach contextual state?
How do I track progress if I have multiple steps that happen as part of a function invocation? I want to make sure that I record the result of the previous step before I start the next one, just to give ourselves an easier life when it comes to the complex control flow that would otherwise be thrown off if steps return different results during different retries. All those things, if you implement them manually, you're typically not just looking at a queue like Kafka. You're typically looking at combining a queue with a locking service, with a database, with a scheduler, and so on. Restate wraps that all together and says, "We're going from queues and events to durable, stateful, resilient function executions." Then, as I mentioned before, the core programming model is services that are meant to mimic RPC-style service frameworks. The simplest building block you have is really service handlers that get durable execution, and then Restate layers a few things on top of that. One concept is virtual objects, which are stateful handlers that remember state across individual invocations, sharded around keys, and then more high-level workflow constructs, where you can actually add signal handlers and query handlers and so on. All of that is built on top of the general service abstraction. [0:18:34] SF: If I'm implementing one of these handlers and it gets executed, what is the life of that process behind the scenes? [0:18:42] SE: Let's assume you're implementing a payment processing handler, or something like this. That gets invoked. Let's say the logic that you have in there is: first, I have to check the status. Was that already processed? Was it maybe canceled? Was it blocked before? Let's say the payment is identified by an ID. I might want to call the fraud detector. I might want to update the database and send out a message, and so on. The lifecycle of executing this would be the following.
Some external trigger, some external application, says, I want to execute that function. That enters Restate, the Restate server, the broker component, as an event, and Restate will understand, okay, where does that service live? You can think of it as: the service has to be registered at Restate. The service endpoint, where is that deployed? Is that a Lambda? Is that an HTTP/2 server endpoint? A Kubernetes deployment, and so on? You have to register that at the server, and then the server connects and pushes the invocation. If you've worked with, for example, Amazon EventBridge, or things like that, it's a very similar model. Restate will then look up, okay, that lives on that endpoint, and I'm connecting to that. Let's assume it's an endpoint on Kubernetes or so. In this case, it would open a streaming connection, HTTP/2, push the invocation, and then hold on to the connection. That is the lifeline to that single invocation, or execution attempt, which allows the service to stream back things like general progress, state updates, outgoing messages. The function, let's say the payment handler, would also get, when it's invoked, if it's a stateful handler, a virtual object, all the contextual data Restate knows for that individual handler attached to the invocation, so that the handler can directly look up things like, okay, what's the previous status that was committed? Okay, it's still new. Let's start and execute that payment. Then, let's say we call an external fraud detector API. We get the result and we say, okay, this is actually a durable step. Then the handler would put the result of that step into that stream that goes back to Restate.
Restate internally has a consensus log that persists all the things it receives, and it has a bunch of logic around this to understand, okay, is that information that still comes from a valid execution attempt, or does that come from an attempt that has been fenced off in the past? Yeah, it has an elaborate consensus log that supports a conditional append of that operation to the journal. It links that operation, or that entry, the result of calling the fraud detector API, onto the original event. If we say, okay, there's a failure after that point now, and that failure could be just that the connection is ruptured, the process goes away, or there's a timeout, then the Restate server would understand, okay, the execution of that event hasn't been completed. I didn't actually get an acknowledgement back for that yet. It would send the event to another process. It would retry sending it to that endpoint, and it would attach everything it has to that event. That's the contextual state from last time. Now it would also attach things like the journal entries that it already collected: here's the result from the previous step. It wraps that all up and sends it there. That lets the service basically say, "As I'm going through the code again, I can skip over steps that have already completed." This is what the SDK library basically does for you. It understands, okay, that step is already found in the journal; we can skip it. This is a new step; we actually add an action, or an event, for that in the journal. Then it goes on. That applies to pretty much any operation. Recording the result of an API call, updating state, sending out a message. All these things basically become events that are streamed to the Restate server, and the Restate server understands how to process these events.
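The journal-and-replay behavior just described, record each completed step and skip over already-journaled steps on a retry, can be sketched as a toy in a few lines of Python. This is a simplified illustration of the general durable-execution idea, not Restate's actual SDK API; the class and handler names are invented for the sketch:

```python
class Journal:
    """Toy durable-execution journal: step results stored by position."""
    def __init__(self, entries=None):
        self.entries = list(entries or [])  # results of completed steps
        self.cursor = 0                     # position during this attempt

    def run_step(self, fn):
        # On replay, return the recorded result instead of re-executing,
        # so a retry follows the same control flow as the first attempt.
        if self.cursor < len(self.entries):
            result = self.entries[self.cursor]
        else:
            result = fn()                 # first execution of this step
            self.entries.append(result)   # "persist" before moving on
        self.cursor += 1
        return result

def payment_handler(journal, side_effect):
    # Each operation is a journaled step; after a crash, completed steps
    # replay from the journal and only the remaining steps actually run.
    fraud = journal.run_step(lambda: side_effect("call fraud detector"))
    journal.run_step(lambda: side_effect("update database"))
    journal.run_step(lambda: side_effect("send confirmation"))
    return fraud
```

In a real system the journal entries would go through the consensus log before the next step starts; re-running the handler with the journal from a crashed attempt executes only the steps that were never recorded.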
They all get attached to the original invocation, but sometimes they also represent an outgoing event that is then routed to another service, or they represent a state update which is applied to an internal state index, and so on. It's generally an extensible event-driven architecture on the server side that synchronizes over a streaming protocol with the service. [0:22:41] SF: I install this SDK. I set up this client. I'm wrapping my call, essentially, around some of the SDK semantics, or whatever. The Restate server is going to do its magic to make sure that call is able to, essentially, be facilitated in a way that's reliable, durable, and so on. How do you make sure that the call from, essentially, the client to the server is done in such a way that it's reliable? [0:23:09] SE: For the initial event, the initial call that triggers a durable handler, you have a bunch of ways to do this. You can do this through HTTP, through a client library, or you can actually just connect Kafka, and it will just pull these events from Kafka that represent these invocations. There are a few ingredients in there that make this reliable. Number one, the Restate server will not acknowledge anything before it has persisted it in its internal consensus log. Even the original event has to go through the consensus log first, before even an asynchronous submit or so is acknowledged. We already have that durable. The second thing is, you can attach idempotency keys to the invocation, and then the event processor inside the Restate server can use that to deduplicate invocation events. All the goodness of deduplicating steps inside the durable handler doesn't really help you much if you can't deduplicate the invocation. The idempotency key support is there to do that.
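The idempotency-key mechanism described here, deduplicating the invocation itself rather than just the steps inside it, can be illustrated with a small sketch. Again, this is illustrative Python with invented names, not the actual Restate API:

```python
class InvocationBroker:
    """Toy broker: the first submit with a given idempotency key wins;
    later submits with the same key get the original result back."""
    def __init__(self):
        self.results = {}  # idempotency key -> durable result

    def submit(self, idempotency_key, handler, *args):
        if idempotency_key in self.results:
            # Duplicate delivery (a client retry, or a redelivered
            # Kafka event whose offset maps to this key): don't re-run
            # the handler, just return the recorded result.
            return self.results[idempotency_key]
        result = handler(*args)
        # In a real system, the result would be persisted via the
        # consensus log before being acknowledged to the caller.
        self.results[idempotency_key] = result
        return result
```

The point of the sketch is the contract: however many times a caller submits "payment-123", the handler runs once and every caller sees the same result.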
Then, if you integrate this with Kafka, it automatically maps the Kafka offsets to the idempotency mechanisms and basically gives you an end-to-end, exactly-once integration. [0:24:16] SF: If I want to start using something like Restate and have an existing project, do I have to think about re-architecting everything to start with? Or can I do it bit by bit, based on where maybe my most critical workflows are, like a payment system, for example? [0:24:34] SE: Yeah. We've really built it to avoid having to re-architect everything. That shows in many of the core abstractions. You do have to adjust the code to use the SDK to have access to some of the durable execution mechanisms, like execute this code block as a durable step, or access the built-in transactional state, or let Restate deliver that message to another service. You have to use the SDK library to do that. There's some adjustment in the code. The way you're deploying this, the way you're generally packaging this, is meant to be very much in line with what you're doing anyway; hence the idea to give it the shape of microservice service handlers and deploy it the same way. From the outside, you can very often just say, okay, this was a non-Restate service. I'm importing the Restate SDK. I'm starting to use these Restate actions inside my code, connecting this to the Restate server, which becomes the reverse proxy. Now, the services that initially used to call this service directly call the Restate server, which becomes the reverse proxy for the service. It's really meant to allow you to plug it in incrementally, to look at it one service at a time. There are a few things that really become very powerful only once you start attaching a few more services. Between services that are attached to the same Restate server, you get end-to-end, exactly-once RPC and messaging, which is pretty nice. Even in the absence of that, you're still getting a lot of goodies.
Yeah, it's totally meant for incremental adoption. [0:26:11] SF: For someone adopting this, or for teams that are adopting this approach, does it take some work for them in terms of their thought process, and the way they've traditionally developed, to come around to this mode of operating and calling services? [0:26:25] SE: Yeah, I think it does a bit. I would say, mostly, it almost requires unlearning a few things that they have learned in the past. If you're coming from a traditional workflow system, we often have folks asking, okay, I'm writing this, but how do I make something a persistent activity now? They're looking for the concept of a workflow and an activity. The interesting thing is, in Restate, every durable step is like an activity, or if you want to separate it out, then make it a separate service that you call. You don't really need workflows as a special construct anymore. You get similar guarantees just from your regular service abstraction. Even the same observability and telemetry you just get out of that. You maybe have to take a step back from looking for exactly the concepts you might know, and just understand that a lot of the reasons why you were using those concepts, the guarantees you were really looking for, are everywhere now, in almost all the code you write with Restate, so you don't have to go to these special constructs anymore. The second thing is understanding that many of the operations you do are now durable across failures and crashes. For example, if I'm doing something like a sequential RPC call, request-response, to another system, and the caller actually fails and gets recovered onto a different node, that's not something that you usually assume still works, right? Because the network call might be lost, or even if something gets sent back, the code that actually issued the call and is waiting for the response was recovered in a completely different process.
It still works in the case of Restate, because all the building blocks are actually durable, persistent versions of the usual building blocks. The RPC is basically connected to a persistent future that gets recovered in a different process and can be completed there. The entire code that made the call gets recovered, restored to the point where it made the call, and then completed with the result of that call, and it just works, even if it moves around. This is something that a lot of people don't expect to work, and that's why they try to code ways around it, and then they come into the Discord and say, "Okay, I'm not really connecting the dots here." You basically tell them, "No. Just delete that. It just works." That's an interesting experience. [0:28:36] SF: Are there certain kinds of projects that this makes more sense for than others? At what stage does it make sense to go with an approach like this, versus an alternative? [0:28:47] SE: I think there are a few cases where it does not make sense, then there are a few cases where probably lots of durable execution systems could make sense, and then there are some cases where I would say, that's a really good use case in particular. In general, durable execution makes sense for workloads that orchestrate many steps, that update stuff. If you have mostly read-heavy workloads, or read-only workloads, it doesn't really make sense to plug in a system like this. Then, there are workloads where something like durable execution is a nice, convenient piece, because it helps you encapsulate retries. You don't have to do them yourself. It helps you implement asynchronous primitives a bit more easily. But if anything goes wrong and some state gets lost and everything gets retried and recomputed, it's really no big deal. Let's say, a RAG pipeline or so, where you do retrieval-augmented generation.
If you do lose something, you recompute it; worst case is you call your LLM a few more times and it adds half a cent to your bill. Maybe that's not a big deal. Then there are cases where it actually really matters, where you absolutely care about transactional correctness, where you say, "Okay, no matter what funky failure happens, I can never fall back to before a previous step." Or cases where you explicitly need transactional state that outlives individual workflows, that you can rely on, and that other services can integrate with. This is a very good Restate use case, because we've architected it with that level of resilience in mind. Restate really implements its own stack. It doesn't build on a database. It implements its own consensus log, and its own processing on top of that. It's a completely self-contained single binary. You just deploy that one binary, and internally it has an extremely well-thought-through consensus architecture that allows you to make very strong assumptions about its semantics. I think payment processing is a good example. If you want that, that's a good use case for Restate. I would say, specifically also, when you want something that works in the cloud, but also has, I guess, a credible story for self-hosting. The converged single-binary architecture is actually feasible to self-host. It's not just that theoretically you can, because it's open source; it's actually fun to operate. [0:30:57] SF: You mentioned a RAG pipeline there maybe not being the ideal use case, because if it fails, you can run it again, or something like that. [0:31:06] SE: It is a use case, a good use case even. It's very convenient to do that on top of Restate. It's just not a use case where you would rely on strong transactional correctness, I guess. [0:31:15] SF: Right. What about a user-facing application that leverages a foundation model of some sort?
Especially if I'm doing something where I'm making multiple inference calls, or some agentic workflow, I would think it would make a ton of sense there, because you could be calling tools, various data systems, multiple models, and so forth. Is that a use case that you're seeing? [0:31:38] SE: Yeah, I think that's actually a very interesting one. As soon as you come more into the AI agent space, I think it becomes a lot more interesting, for a couple of reasons. Number one, I think agents are a good match for durable execution in general, because they are a bit like dynamic workflows: workflows where the control flow is not known upfront. It's determined by the responses of the LLM. Durable execution has this flexibility that you don't need to define the sequence of steps in the control flow upfront. You can create dynamic control flow, just record it and replay it after a failure. I think durable execution in general matches agents very well. Then the second thing is, agents are usually contextually stateful. They map really well to this concept of virtual objects that we have in Restate. You have this exclusively scoped state that you have access to, that you can use to remember not just previous steps, but also previous context. And it's not something hidden inside the workflow; it's open state that you can probe from other services. You can even interact with it and put additional context in from other services, if that comes up. The whole abstraction just matches really nicely. [0:32:52] SF: What are some of the unexpected use cases that you've seen of people applying Restate? [0:32:56] SE: Yeah. There are some very expected use cases, like classical workflows, sagas, distributed state machines.
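The virtual-object idea described above, state scoped exclusively to a key (such as one agent session), yet openly probeable by other services, can be sketched roughly like this. This is an illustrative sketch, not Restate's actual API; the `VirtualObjects` class and the `call`/`peek` names are invented for the example, and durability and single-writer concurrency are only simulated:

```python
class VirtualObjects:
    """Keyed state, e.g. one entry per agent session (durable in a real system)."""
    def __init__(self):
        self._state = {}

    def call(self, key, handler):
        # Handlers for the same key run one at a time with exclusive access
        # to that key's state (trivially true here, single-threaded).
        state = self._state.setdefault(key, {})
        return handler(state)

    def peek(self, key, field):
        # The state is open: other services can probe it directly,
        # rather than it being hidden inside a workflow instance.
        return self._state.get(key, {}).get(field)

agents = VirtualObjects()

def record_step(step):
    def handler(state):
        # Remember not just previous steps, but accumulated context.
        state.setdefault("context", []).append(step)
        return len(state["context"])
    return handler

agents.call("session-42", record_step("searched docs"))
agents.call("session-42", record_step("called calculator tool"))
agents.call("session-99", record_step("unrelated session"))

print(agents.peek("session-42", "context"))
# ['searched docs', 'called calculator tool']
```

The per-key scoping is what makes this a fit for agents: each session keeps its own context across failures, without interfering with other sessions.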
As for the unexpected ones, it seems there are lots of folks that have fairly complicated distributed queuing setups, where they start with something like Kafka, then also pull in a RabbitMQ, and they have, I don't know, some routers and some actors in between. Often, I think, this is a workaround to build something where you have maybe a common log and then fan it out into more fine-grained entities that you interact with. We have a bunch of users that could basically replace a whole zoo of distributed queue orchestration with a single Restate service. That's something we hadn't quite expected to happen so often. The second one that I've found fascinating is that we've seen folks do, in fact, build a lot of custom workflow engines and custom rule engines. Apparently, that's a thing many companies build for internal processes, internal tools, and so on. It's a quite common use case that we've seen. My favorite one is folks building a custom workflow and rule engine that they ship into factories to evaluate sensor data and trigger actions that control machines. That was not on my list of anticipated early use cases. That was quite fun to see. [0:34:13] SF: What would you say are the biggest challenges that you've faced when designing and implementing Restate? [0:34:19] SE: I mean, there are technical challenges, right? The mission is extremely ambitious: we're building a full stack that starts at the bottom with a consensus log that has low latency, but that you can also deploy in extremely complicated setups, across availability zones, across regions. It tries to make good use of modern cloud architecture, like object storage, while at the same time bridging the gap to low latency. That's a technical challenge we've worked on for quite some time, actually over two years by now, to make happen. Beyond that, I would say the biggest challenge really is education of the space.
Durable execution is becoming more and more known, but it still is not necessarily a mainstream concept. A lot of folks still associate it primarily with workflows. If you're doing durable execution for workflows, maybe you get more and more folks saying, "Okay. Yeah, I know that." But when you say, okay, no, we're actually talking about durable execution in a more general way, one that also includes state and communication, think of it as a microservice paradigm, not a workflow paradigm, then it's, "Okay, well, I need to think about that a bit." I think this education is a big challenge. But I would look at it positively; it's also something that is making progress. Most folks, after they've gone through the initial "Okay, I hadn't expected that, let me think it through a bit," once they actually crack it, usually get quite excited about it. That helps spread the word, so that's good. [0:35:49] SF: Yeah. I mean, I think that's part and parcel of any new category creation; this is not the way that people are used to doing things. Then, it's hard for people to even know that they have a problem, and maybe a better way of doing something, that they're not necessarily actively searching for, until you cross the barrier of this educational awareness, essentially. [0:36:11] SE: Definitely. I'm not sure I would go as far as to say this is a brand-new category that we're creating. Durable execution as a category existed before we started. We're bringing a bit of a new twist into it. Definitely, treating it as more than a workflow paradigm is probably something new. Then adding these low-latency capabilities, which allow you to use it in places where you might previously not have thought it applicable, is maybe something new as well that people need to wrap their heads around. We're also working with other folks that have worked on creating this durable execution category, and are basically leveraging their work, for sure.
[0:36:51] SF: What do you think the overall impact will be on how we design distributed systems in the future, if more and more people adopt this approach of durable execution? [0:37:03] SE: I would venture a guess and say these types of solutions are going to be very, very widely adopted in a couple of years. I think they're going to replace a lot of the workflow, queuing, and other distributed orchestration systems that are out there, just because they're an easier, more approachable way of solving these problems. They just interact better with the rest of your application stack, and they can actually support use cases that you might not have been thinking of before. Vice versa, not using these systems is, as we said before, getting harder and harder. That's one of the drivers. I would actually throw in a second reason why I think this is going to be extremely widely adopted in the future. If you look at the whole AI trend and AI code generation, you can see that these systems are getting increasingly good at doing things like even complicated business logic, assuming you have all the domain context and you really need a bunch of non-trivial steps to happen. But those systems are not the ones that solve distributed race conditions for you, or understand, "Okay, here is a case where, if that process stalls just here, and then a retry happens and forks off a copy here, those are going to interfere in a weird manner." I don't see that happening. Even if you think they can conceptually do that, it's probably a waste of compute power. I think if you just use a foundation like durable execution, it's an incredibly good target, a foundation for AI-generated code, because it has solid semantics.
It takes care of a lot of the problems that you really don't want anything unexplainable and semi-unpredictable to be reasoning about, and then you put the much simpler generated business logic on top of that. It's a nice package. [0:38:55] SF: Yeah, that would be great. What's next for Restate? [0:39:00] SE: At the moment, we're working very hard on releasing the next version, which is our first distributed release. I guess, by the time this comes out, it's probably going to be released already; we're targeting two to three weeks from now. At the moment, if you use Restate, you can think of it as deploying like a single-node database, like a PostgreSQL. You give it a persistent volume and you're good. The next version gives you complete distributed deployment: distributed replication, scale-out, and everything. That's a big thing that we've seen a lot of excitement building up for, and we're pretty anxious to get it out there. That's the biggest immediate step. Then after that, we're at the phase where we're really just excited to work with as many users as possible, learn from them what they're using it for, what they see as good use cases, how they think about the problem, how they would explain it to others, how they would explain this category, the abstraction, and the mental model you'd have to have. Really, share this with the world and work with whoever is excited to work with us. [0:40:02] SF: Awesome. Well, Stephan, thanks so much for being here. [0:40:05] SE: Thank you for having me. Cheers. [END]