EPISODE 1592 [EPISODE] [0:00:00] ANNOUNCER: There are countless real-world scenarios where a workflow or process has multiple steps, and some steps must be completed before others can be started. Think of something as simple as cooking dinner. First, you look up a recipe, then you write down the ingredients you need, you go shopping, and then you cook. These steps must be run in a certain order, and the state of the workflow must be tracked throughout. Workflow management is everywhere in the software world, and today, it's common for teams to engineer custom solutions. This makes sense because creating a general-purpose solution for workflow management is a hard conceptual problem, and perhaps an even harder engineering challenge. Maxim Fateev has a deep background in engineering distributed systems and workflow management services at Google, Amazon, and Microsoft. In 2015, he joined Uber and helped create the open-source project, Cadence, which is an orchestration engine to execute asynchronous long-running business logic. The success of Cadence led Max to co-found Temporal, which is an open-source platform for workflow execution. Max joins the show today to talk about the engineering challenges at Temporal, the concept of durable execution, how he organizes his engineering teams, and more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:32] SF: Max, welcome to the show. [0:01:33] MF: Thanks a lot for having me. [0:01:35] SF: Yes. Thanks so much for being here. I'm really excited to talk about Temporal and some of the stuff that you guys are dealing with over there. But maybe we can start with some basics. Who are you? What do you do? And how did you get to where you are today? [0:01:44] MF: So, I am Maxim Fateev and I've been an engineer all my life.
I worked in the – over the last 20 years, it was Amazon for eight and a half years, a couple of years at Microsoft, four years at Google, and then four years at Uber. Then, we started the company Temporal, and I'm CEO and co-founder of the company. [0:02:01] SF: Amazing. That's quite the work history, a lot of heavy-hitting companies there. I'm sure a lot of that influenced, or inspired, some of the work that you've been doing at Temporal. So, I want to talk a little bit about some of the challenges of systems in the cloud. I think one of the things that happens a lot of times is you end up with all these disparate systems that need to talk to each other to complete some workflow or task. Back in the days when everything kind of ran on a monolith, in a lot of ways it was easier to coordinate work between different parts of the system, because there were just fewer moving parts. But when it comes to these modern systems, I guess, what are some of the challenges organizations run into in terms of attempting to automate tasks across distributed systems that may be running a whole bunch of different microservices that run independently? [0:02:48] MF: I think, when we say microservice, there are really two types of microservices, right? There's request-reply, and then the moment you say, "Oh, I need consistency. I need to make sure that things complete and I cannot just return a 500 to the customer," you practically say, "Okay, now you're in the event-driven world." There is no alternative. Then, you can start composing all these pieces, like queues, durable timers, and all of these other things. The reality is that event-driven systems are awesome at runtime, right? Because they disconnect your services from each other, and you can get much higher availability and flow control.
So, queues – and just to give you some background, at Amazon, for the first five years of my career there, I was working on pub/sub, practically, on the team which owned frameworks and pub/sub, and I was design lead for the Amazon messaging platform, which was practically a broker-based architecture that delivered all the messages within Amazon. And later, Simple Queue Service took it as a backend. So, I know pub/sub pretty well. I'm still an expert in that. But one thing you learn pretty fast is that [inaudible 0:03:52] at runtime is an awesome technology, but at design time, when you think about your system, and when you operate your system, it is extremely hard to use, and it's just the wrong abstraction. So, events and queues are just the wrong abstraction when you think about your systems. The main problem that arises is that you lose consistency across services. Complexity is very, very high. Most of your logic and code is not business logic. That is kind of the main problem that you need to solve. [0:04:18] SF: So, is it that you end up writing essentially a ton of your code just to handle different conditions of these event systems, where they might not respond within a certain timeframe, or maybe they even fail? [0:04:28] MF: It's a lot of things, right? First and most important is that your business logic is scattered across multiple places. So, you cannot go and say, "Okay, this is what my system is doing." You have to run that system to actually see what's going on. Then, you cannot even imagine your full control flow because you've just scattered it. Then also, certain things like compensations – like sagas – or just looking at where your system is, or cancellations, become practically impossible. These types of features become almost impossible, because the state of your system is in messages in these queues, and state in these databases, and in these services.
So, it's extremely hard to make sense of your system, which makes it extremely hard to maintain and operate. [0:05:10] SF: Yes. So, you mentioned at the start of the introduction, you've worked for a lot of companies like AWS, Google, and so forth that have really large, complex distributed systems. A lot of those companies have developed their own tooling to address some of the challenges of actually running these big distributed systems. I know from my time at Google, for example, there's a project you might be familiar with called Manifold for workflow orchestration. And of course, they also have their own version of pub/sub for event-driven systems and so forth. But was your prior experience at some of these companies what inspired Temporal, and how did the company start? [0:05:51] MF: Actually, it started as Simple Workflow Service, which is the AWS SWF service. As I said, our team owned pub/sub for Amazon. So, I was in the middle of every design discussion and every code review – not obviously every code review, but every high-level review for every new service. I was kind of advising teams on how to build these large-scale distributed systems using queues or topics. It became clear that we needed an orchestrator. We started to work on that. We had multiple internal versions, and I was tech lead for the public AWS version, AWS Simple Workflow. There, we found that, as we were building an AWS service, we could think from the beginning about how to make it highly scalable and available. We iterated a lot on developer experience. I don't think we nailed it, because probably most of you still haven't heard about Simple Workflow Service. It's out there. It's still a public service. But most people don't use it, because the developer experience wasn't actually very good. But what happened then is that the co-founder of Temporal, Samar Abbas, who worked with me on Simple Workflow, went to Microsoft.
He went to Microsoft and tried to do something like Simple Workflow within Microsoft Azure. Given that Microsoft had quite a few workflow projects back then, he didn't succeed in making it a high-level service. But he released the Durable Task Framework as open source, and it got so much adoption internally at Microsoft that later, Azure Functions took a dependency on it, and Microsoft has Azure Durable Functions right now, based on that idea. Later, we met at Uber, and our first project at Uber was actually a pub/sub system called Cherami. It was open source. You can still look up an article about it. We realized that for that system, we needed background jobs, and we knew how to do background jobs. So, we started to create something like Simple Workflow, but using a completely different software stack. This project was called Cadence, and it became successful within Uber. Within three years, we got over 100 use cases running on it. Later, we started the company around the same open source – we forked the project into an open-source project, MIT-licensed, called Temporal. This is what we are still building as a company, and we obviously monetize it through our cloud offering. [0:08:05] SF: Right. Yeah. So, at this point, my understanding is there are quite a few parts of Temporal. You started with the open-source project. Now, you also have a managed service version. There's been a heavy investment on the developer experience side with CLIs and SDKs. And you said that you started with simple workflow orchestration. But can you walk me through some of the engineering history? You started with the open-source project, but where did you go from there? Essentially, what was the history of the project to where you are now? [0:08:37] MF: So, the Cadence project, which we started at Uber, we built as open source from the beginning.
We practically did development in the open. And Uber was actually a very cool place to do open source. They allowed us to do a lot of open source, and a lot of interesting projects came out of Uber. So, we are super grateful to Uber that we were able to do that. For the first two years, we got practically zero external adoption. It was out there, it was on GitHub, we got some stars, but no real usage. But we had a lot of use cases internally. So, we were growing that service, bringing in more and more use cases, and just focusing on internal Uber adoption. As I said, we got over a hundred use cases within the first three years of the project. In the meantime, what happened is that when we did the version of Cadence at Uber, we actually were able to iterate on developer experience to the point where it became pretty reasonable and pretty good, compared to Simple Workflow. Three things came together. The first was much better developer experience, and we can talk about that a little bit more. The second was that it was an open-source project. People like open source, especially if it's MIT-licensed, a real open-source license. The third was that because we had been running these systems at scale for a long time at Uber, people trusted our codebase, and we also had experience from Amazon and Microsoft. Good developer experience, plus running these things at scale, enabled external adoption. And when we actually got early external adoption, it wasn't from small companies. It was from very well-known companies. Among the first adopters were Airbnb, HashiCorp, and Coinbase. Those are certainly top-tier companies, and they weren't adopting it for small use cases. They were adopting it for very high-value use cases.
This practically allowed us to start growing that community, and then when we decided to start the company, we realized that staying at Uber, we couldn't really spend so much time on the external community, because we obviously needed to solve internal problems. We would never get such a big team as we have now, and will have in the future. So, that's why we took VC money and started the company. [0:10:39] SF: Right. I think some of the early adopters, like Airbnb and some of the other companies you mentioned, make sense in my mind, because those are going to be companies that are similar in terms of running large distributed systems. So, they're going to be acutely feeling the pain of either trying to do this themselves or not having something in place to manage this. Whereas, you know, maybe a smaller company isn't operating at the scale where this is the absolute P0, top-ten problem for them to solve right now, and they're okay with either handling it through some sort of bespoke method, or, I don't know, taking some other project off the shelf to kind of put a Band-Aid on it for the time being. In terms of your original go-to-market, is that something you saw – that the pain point you were addressing with Temporal resonated with these larger companies more so than maybe the smaller companies? [0:11:32] MF: We actually got quite a few smaller companies as well. I just don't remember the names off the top of my head. But I think there's a huge misconception about Temporal. First, we actually call it Temporal, not Temporal. We decided to standardize on Temporal for sort of interesting reasons. But Temporal helps you to write your code faster, because it eliminates 90% of the code related to reliability. Even if you're a one- or two-person startup in your pre-seed mode, and you build an application for your customers, you cannot just lose data or lose customer trust by dropping things on the floor.
I can guarantee you will write your application faster with Temporal than with an ad hoc solution. I think this is important to understand. The good news is that you don't need to rewrite your application if you become successful, because companies like Snapchat or Airbnb can run on it. Every Snapchat story, for example, goes through our service. So, they have extremely high rates. Our value proposition is you write less code, you write more reliable code, and your life is easier as an engineer. And at the same time, you don't need to rewrite your code if you are successful. Right now, we have a startup program where we give credits to startups, because we are pretty popular with early-stage startups. It's different from a lot of other products. There is a certain class of products which is only needed if you operate at scale. You can use Postgres, but if you outgrow a single Postgres instance and you move to a solution which scales better, that makes sense. But for Temporal, that doesn't apply. You absolutely get value even if it's a small deployment and small scale. We have a lot of users which do, I don't know, 10 transactions per day, or 50 transactions per day, and they still get value out of it. So, it's not about scale. It's about complexity and developer productivity. [0:13:13] SF: Okay, great. Thanks for clarifying. One of the other things that you mentioned when you were talking about the recipe of success, or combination of factors that led to adoption, was improved developer experience. You mentioned earlier that with the Simple Workflow Service that you developed at Amazon, the developer experience wasn't at a level that led to a lot of adoption. What were the challenges that had to be solved from a developer experience standpoint – challenges that you solved with Temporal – that led to the success that you saw? [0:13:44] MF: Okay.
Let me first explain what Temporal is, because when we talk about Temporal, I'm pretty sure not everyone understands what it actually does. Let me give you a technical explanation. If you're an engineer or architect, you probably should be able to get it. The basic idea is extremely simple. We introduced a new concept we call durable execution, and the idea is that we give you a runtime which preserves the full state of your code – practically, the runtime state of your execution. Imagine a function which, let's say, calls three APIs: A, B, and C. You're blocked on API B. And then the process which hosts this function crashes, or a network event happens where you lose all the requests. Temporal will automatically recover the full state of that function on a different machine, and all your local variables, all the stack, everything will be preserved. You will still be blocked on B, and B can return, like, one day later, and you will continue on to the next line of code. Practically, when you write your code, you don't even think about storing state, because state is already preserved inside your local variables, and you don't need to think about process crashes because we will just preserve the full state. It eliminates most of the complexity, because we guarantee eventual completion of any business logic. You write this in a top-level programming language. We support six SDKs right now: Java, Go, TypeScript, PHP, Python, and .NET. When we talk about developer experience, the most important part for us was: how do we make it as seamlessly integrated with the language as possible? If you're a .NET developer, how do you write this code in a way that looks natural to a .NET developer? If you're a Go developer, how do you write this code in a way that is as natural as possible to a Go developer? This is where we spent a lot of time. For example, when we did Python, we used asyncio, which is built into Python.
When we did .NET, we used [inaudible 0:15:26] from .NET, and so on, and so on. So, it's more about being very, very close to your tools, because then you can use your IDE, because you just write Java or Go code. You can use your CI/CD pipeline. You can use your existing unit testing framework. So, you can use JUnit in Java, for example. Practically, by being as close as possible to your natural tools, you just write code, and we provide all this experience for you without requiring you to jump through any additional hoops. [0:15:52] SF: So, it sounds like there's essentially a lot of intention around the language-level support that you're providing with your SDKs. I'm assuming these are handcrafted to be idiomatic to the language, rather than something that's auto-generated from an API spec. [0:16:07] MF: That's why it is an extremely high-effort investment for every new SDK. That's why at Uber, we had two languages, Java and Go, and it took us years to get to all these six languages. .NET, for example, is still in preview mode, like alpha. It's not production-ready yet, because it took us a pretty long time to get there. [0:16:25] SF: Then, you mentioned, as you were describing how Temporal essentially works, you talked about calling a sequence of functions, potentially running on different machines, and if a machine goes down, it'll still essentially preserve the state and be able to return that call at some point. Can we get into the details? How does the product actually prevent data loss in the case of something like a machine dying? [0:16:50] MF: Obviously, I don't have time to go into every low-level technical detail. But conceptually, what happens is that we have a backend cluster. We call it the Temporal Service. And the Temporal Service relies on a database. In the open source, we support MySQL, Postgres, and Cassandra, and technically there is a binding API, so you can add more bindings if you need to.
Then, when you write your application, you take the Temporal library and link it into your application, and then you write your business logic. Then, you deploy your code. We don't run your code, ever. Even if you use our cloud service, you run your code, you own the workers, and you encrypt all the payloads before you send them to the service. But then, what happens is there is this interaction: we record the process state, and what that means is that if a process dies, we can recover it on a separate machine – we automatically detect which machine is available, and the state is in the database. So, it means that even if the cluster goes down, and all your worker processes – we call the processes which contain the business logic worker processes – go down, the state will be in the database. Assuming the database is not corrupted, when everything comes back, eventually, the state will be recovered, and we will continue executing. By the way, even if the database is corrupted, we already have multi-cluster, multi-region support. It means that you can lose a full region, with the database and everything, and you will be able to continue executing those functions in a separate cluster. [0:18:08] SF: Who owns the database in this scenario? Is this the customer, or is this you, as a managed service? [0:18:14] MF: So, when you use our open source – the open source includes both the service and the SDKs – in this case, we call it self-hosted, obviously. The self-hosted version means that you own the database, you own the cluster, and you own the workers. If you're using our cloud product, we run the database, we run the service, and we just give you an endpoint. It's a gRPC endpoint, possibly through PrivateLink. And then you own the workers, which need to connect to this gRPC endpoint. We actually wrote a custom database, which is very highly optimized for this use case, for the cloud service. But it uses the same bindings. So, practically, from the OSS point of view, it's just another database.
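The recovery model Max describes – rebuilding a function's full local state on another machine by replaying a durable journal of completed steps – can be illustrated with a toy sketch. This is plain Python, not the Temporal SDK; the `Replayer` class and `step` method are hypothetical stand-ins for what the real service does with its event history and a database.

```python
class Replayer:
    """Toy event-sourced executor. Completed step results are journaled;
    a recovered run replays the journal, so local variables are rebuilt
    without re-executing any side effects."""

    def __init__(self, history=None):
        self.history = list(history or [])  # stands in for the durable database
        self.position = 0

    def step(self, name, fn):
        if self.position < len(self.history):
            # Replay: return the recorded result instead of re-running the step.
            recorded_name, result = self.history[self.position]
            assert recorded_name == name, "workflow code must be deterministic"
        else:
            # First execution: run the step and journal its result.
            result = fn()
            self.history.append((name, result))
        self.position += 1
        return result

side_effects = []

def transfer(ctx):
    # Local variables like `a` are rebuilt purely from the journal on replay.
    a = ctx.step("call_api_a", lambda: side_effects.append("a") or 100)
    b = ctx.step("call_api_b", lambda: side_effects.append("b") or a - 5)
    return a + b

run1 = Replayer()
assert transfer(run1) == 195
# Simulate a crash and recovery on another machine: replay the journal.
run2 = Replayer(history=run1.history)
assert transfer(run2) == 195
assert side_effects == ["a", "b"]  # nothing was re-executed during replay
```

The design point this illustrates is why Temporal workflow code must be deterministic: replay only reconstructs state correctly if the code takes the same path and requests the same steps in the same order.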
[0:18:52] SF: Then, are there situations where a rollback needs to happen? [0:18:56] MF: Okay. When you say rollback, obviously, there are a lot of layers. But let's talk about application logic, the way people use us. Practically, think about it: you write a function, and we guarantee that this function will complete. Imagine you're doing a money transfer. You say, take money from one account, put money in another account, and then the account ID of the second account is wrong, and you already took the money, so you need to return it. People can [inaudible 0:19:17] they didn't run compensation. With Temporal, it's very easy, because you just kind of put a try-catch block around the second call. And in the catch block, you can go and execute the compensation. Because we guarantee that this code will execute, including the error handling logic, writing a compensation becomes as simple as just coding the basic business flow of that compensation. [0:19:37] SF: I see. Then, what are some of the common use cases that people use Temporal for? [0:19:43] MF: It's actually a question which people get confused about, because we have a very generic technology. We practically allow you to write reliable, resilient distributed systems. So, it applies everywhere you need resiliency. It's not about a specific use case. It's about every time you need to make sure that a business process completes. It can be very fast – hundreds of milliseconds, for example, in some payments – or it can be something which runs for months, or even years; we are a good fit. It's all about guarantees. But as for specific use cases: everything. For example, we have users which need to upgrade [inaudible 0:20:16] in their data center. So, rebooting every machine is a flow, right? You need to implement that, and that is something which – actually, we call these durable execution functions workflows, just for legacy reasons. So, rebooting every machine in the data center is a workflow.
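The try-catch compensation pattern from the money-transfer example above can be sketched in plain Python. The account model and function names here are hypothetical illustrations, not the Temporal SDK; in Temporal the same shape works because the runtime guarantees that the catch block itself runs to completion, even across process crashes.

```python
class TransferError(Exception):
    pass

def withdraw(accounts, account_id, amount):
    # Stand-in for a debit activity against a payments API.
    accounts[account_id] -= amount

def deposit(accounts, account_id, amount):
    if account_id not in accounts:
        raise TransferError(f"unknown account {account_id!r}")
    accounts[account_id] += amount

def transfer(accounts, source, destination, amount):
    withdraw(accounts, source, amount)
    try:
        deposit(accounts, destination, amount)
    except TransferError:
        # Compensation: we already took the money, so put it back.
        deposit(accounts, source, amount)
        raise

accounts = {"alice": 100}
try:
    transfer(accounts, "alice", "bad-account-id", 30)
except TransferError:
    pass
assert accounts["alice"] == 100  # the compensation returned the money
```

Without a durable runtime, the gap between the withdraw and the compensating deposit is exactly where a crash silently loses money; the saga only works if the error-handling path is guaranteed to execute.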
But then, deploying applications to Kubernetes clusters is a workflow. Rolling out new data centers, or rolling out and managing Kubernetes clusters – those are workflows. HashiCorp Cloud uses it to [inaudible 0:20:43] orchestrate the internal processes in their cloud. But then you go up the stack: payment systems, money transfers, instant payments, customer onboarding workflows. Practically, if you take a bank, probably 90% of what they're doing is a bunch of workflows, and we can solve all of that as a single unified solution. We have banks which are standardizing on Temporal as their practical backend engine right now. Then you go up and start doing whatever your business is, right? If you are Airbnb, it's probably the booking process. If you are DoorDash, it's probably delivering orders. And if you're Uber, it's almost everything. So, it applies almost anywhere you need guarantees. [0:21:17] SF: Yes. So, it's essentially giving you a simple way to put guarantees around the execution of some service or function. Or it could even be a third-party service, like a payment system or something. [0:21:29] MF: Absolutely. [0:21:29] SF: What about machine learning? There's obviously a lot going on in the world of AI right now. Is that something you're seeing, more use of Temporal in ML workloads? [0:21:42] MF: Yes. We have a joke about that, because last year, it was a lot of crypto workloads. These days, we're getting a lot of machine learning workloads. Yes, absolutely. Because we are kind of infrastructure, right? There is a gold rush there, but we are providers for that gold rush. But think about it this way. Temporal is a control plane. We are not a big data system. We're not a machine learning system. We don't have any vertical solutions. But a lot of companies and startups build vertical solutions on us.
We are a perfect system to practically build an end-to-end solution for your machine learning training, for example, and deployment and orchestration. Existing solutions are only piecemeal: oh, we can do this for some sequence of steps, and then you need to use cron, and you need something else, and then you need to link these things together. With Temporal, for example, some of these functions – workflows – can be long-running, and they can be always running and just react to events, more like durable actors. So, you can implement the full lifecycle. If you have a model, you can have a lifecycle model which lives for years. You deploy it, you retrain it, you get it back, you get new data. The whole lifecycle can be implemented through Temporal. This is why a lot of companies are starting to use us, because it's pretty amazing. You can have just one backend solution for all your needs. [0:22:51] SF: Yes. That makes sense, especially given that the amount of servers and distributed models that people have to be using to run some of these machine learning workflows or training cycles right now is vast. So, they need some level of guarantee of execution. [0:23:05] MF: Exactly. [0:23:05] SF: How does the actual managed service deployment work? So, if I want to use the managed service, am I running that as, like, a multi-tenant model? A single-tenant model? How am I connecting my codebase to have Temporal available to me inside AWS, or Google Cloud, or wherever I'm running my services? [0:23:28] MF: So, we try to make it a real cloud service. What I mean by that is, it's real serverless. We hide all the underlying complexity. So, you don't see clusters, you don't see any underlying machines. It's the same as, like, [inaudible 0:23:41] or SQS. You don't know what's behind those, right? It's just a bunch of APIs, which we scale. We are trying to do the same thing.
So, for example, when you go to our service, you provision a so-called namespace. This namespace is just a logical bucket, right? The same as a bucket in S3. Then, that namespace is used to implement your application, and right now, we give you a primary DNS address for your namespace. What you do is just specify that address and, obviously, the security certificates, because we require mutual TLS. And then you can just start using it. So, you can run this code from your laptop and just connect to the cloud and it will work. There is no difference between open source and the cloud besides the address. [0:24:23] SF: I see. Then, I guess, as a managed service for companies that are going to be security conscious, are there certain security guarantees using the managed service? [0:24:36] MF: Yes. I think we actually have an extremely awesome security story. It's amazing, because you come to a large company, and they practically say, "We want to run our core business workload on you." We say, "Okay, you use our cloud." And they say, "No, no, no, we don't do cloud." It's a very standard reaction – security reviews, all these things. And then we go and talk to them and explain our model, and we are perfectly able to close almost every large company and pass the security reviews without problems. Why? First, we don't run user code. All these worker processes which contain both workflow and activity logic – all this logic – are running in the user's data center, in their VPC, using their deployment system. We don't even know how they run. Second, before sending any data to the cloud, you can control encryption. We have a pluggable component – it was called Data Converter. I think now it's called Data-something. But the idea is that you can specify your own, with your own keys, with your own encryption algorithm. So, you fully own encryption.
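The pass-through model Max describes can be sketched with a toy codec. This is plain Python, and the class below is a hypothetical illustration, not the actual Temporal converter interface; base64 here is only a placeholder for real encryption under customer-held keys.

```python
import base64

class PayloadCodec:
    """Toy stand-in for a client-side payload codec: the service only
    ever stores and routes opaque bytes. In a real setup, encode/decode
    would encrypt and decrypt with keys the customer controls."""

    def encode(self, payload: bytes) -> bytes:
        # "Encrypt" before the payload ever leaves the worker.
        return base64.b64encode(payload)

    def decode(self, payload: bytes) -> bytes:
        # "Decrypt" after the payload comes back from the service.
        return base64.b64decode(payload)

codec = PayloadCodec()
wire = codec.encode(b'{"order_id": 42}')
assert wire != b'{"order_id": 42}'                # the service sees only opaque bytes
assert codec.decode(wire) == b'{"order_id": 42}'  # the worker round-trips the payload
```

The point of the design is that encode and decode both run inside the customer's own workers, so the managed service never holds either the keys or the plaintext.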
We never need to look into the payloads. It's different from a database. For example, if you have a database in the cloud, you have to put the data there, because the way a database functions requires access to the data. We are more like a pass-through. You encrypt data, you give it to us, we give you back the encrypted payload, and then you can decrypt it any way you want. Even if you put PII there – which we don't recommend anyway – if you control encryption and you trust your encryption, you can trust Temporal. Then also, you don't need to open any ports or anything in your firewall, because all connections to Temporal are outgoing to our cluster. So practically, all your SDK needs is a connection to the Temporal gRPC cluster. And this is it. So, we don't run your code, you encrypt everything, and all you need is to connect to us – and you can also use PrivateLink to connect to us. This is usually enough, practically, even for very security-conscious organizations. Obviously, we also have SOC 2 Type 2 and all these other things, but the most important one is just the security model in the actual product. [0:26:30] SF: Right. Yes. So, you're never ever seeing the code or executing it. You're essentially the data processor. But even in that context, essentially all your – [0:26:36] MF: You can compare us to some providers which provide queue technologies. You encrypt a message, you get an encrypted message back. We are kind of similar in that regard, right? We are certainly much more complex than a queue. But at least from a payload point of view, it is very similar. [0:26:53] SF: Yes. Makes sense. Then, in terms of competition in the space, obviously, Amazon has the project that you mentioned, the Simple Workflow Service, and Microsoft has their version, but my understanding is there's not really a direct competitor to Temporal, other than someone building something themselves.
Especially in the early days, when you were starting, you were defining this category of product. How did you go about navigating that from a go-to-market standpoint? It's always a challenge when you're bringing something new to market that people might not even know to look for – essentially, what you're offering. [0:27:31] MF: It is still a problem. Because durable execution is a new category, most people are still not aware of it – though more and more are, obviously. We just had our conference; we had quite a few users there. Certainly, everybody's super excited about it. But again, we're still not a ubiquitous technology. We absolutely want to be there. So, that is our challenge. The good news is, after you understand durable execution, and you get some experience with it, it's almost impossible to go back. People cannot force themselves to go back to using the old approaches, the event-driven approaches, after they learn about Temporal. So, all our go-to-market is based around open source. What do I mean by that? We don't go around and try to come to a company which has never heard about Temporal and say, "Okay, use Temporal." It's more like, okay, if you're part of our developer community, you can come to us and say, "I don't want to run this backend service and the database. I'd prefer you to do that." Then, we practically just help people by running this backend service for them. All our adoption, and obviously, especially initial adoption, came from open-source users and open-source customers. We still have a lot of pretty large companies. Just today we signed one – I probably shouldn't name them. But it's a very well-known big company, which was using open source for over three years, maybe four years. Now, they've decided to become our cloud customer. So, we have a pretty good value proposition there. But at the same time, it's important that our open source is fully featured.
They were able to run it for four years successfully, because all the features are there. There is a guarantee that if you run against open source, you should be able to run against cloud, and vice versa. In the future, we even want to provide live migration back and forth. That is a very important guarantee that people want. [0:29:10] SF: What is the main reason that people end up switching from the open-source project to the managed service, if that's where they start? [0:29:17] MF: The reality is that if you run business-critical processes – imagine you're doing something even relatively simple, but it's business-critical – you need on-call, you need a team, you need a database to support it. Organizations that are successful with Temporal at a large scale don't have one project; they have dozens of projects. Large-scale use cases require a team that operates this infrastructure, and you absolutely can do that. There are quite a few companies doing it. At the same time, more and more companies learn over time that it's better to just outsource it to us, since we are the experts, we know the technology, and we also have this backend database which can provide much better scalability and performance. Often, it's actually more expensive to run it yourself than to run on our cloud. Because we have this highly optimized engine, if you compare apples to apples, you can end up paying the same, but you get so much more by using our cloud. That's why a lot of companies finally migrate there. But usually, it grows with usage, right? If you have just one or two use cases, open source is fine. But at the same time, you spend a lot of time learning how to run the cluster. So, most companies right now prefer to just go to the cloud directly, because they can always switch to open source if there is a problem with the cloud.
But so far, I don't think we've had such cases. [0:30:28] SF: I see. How is the engineering team structured? You're supporting this open-source project, you have your investments from a developer-experience standpoint, supporting all the various languages across the different SDKs, and then you have everything that goes into building and running a managed service. How does the structure of the team shape out? [0:30:50] MF: Obviously, it changes all the time. But one thing we don't want to do is separate things out. I don't want to make the mistake of having a commercial offering versus a non-commercial offering, because they will diverge, right? We've seen cases in the industry where that led to pretty bad results. So, we want to make sure that the core open-source service stays exactly the same, and that the main features and the APIs are compatible. That's why we have one team building features for open source, and they work closely with the cloud team, mostly around deploying things and making sure that they run successfully. We certainly have a separate cloud team, which deals mostly with the control plane. Because imagine, we already have 12 or 13 available regions. We run on AWS only right now, but we have started work on GCP, so we will have other cloud providers soon. Obviously, Azure will come after that. Running this huge infrastructure very reliably requires full automation. We absolutely have a team that deals with this automation, control plane, permissions, and routing. Then we have an infra team, which deals with the underlying infrastructure, and there are a lot of concerns there. Just one example: as we run a multitenant service, metrics are a problem, because we have so many dimensions. Imagine if you have 100,000 customers in the future, or more. Metric engines don't like that kind of high dimensionality.
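To make the dimensionality problem concrete, here is a back-of-the-envelope sketch. The per-customer namespace and metric counts are invented for illustration; only the customer and region figures echo the numbers Max mentions. Each unique combination of label values becomes a separate time series, so the series count grows multiplicatively.

```python
# Each unique combination of label values (customer, namespace, metric, region)
# becomes its own time series in a typical metrics backend.
def series_count(customers: int, namespaces_per_customer: int,
                 metrics: int, regions: int) -> int:
    return customers * namespaces_per_customer * metrics * regions

# A single mid-sized tenant: easily handled.
small = series_count(customers=100, namespaces_per_customer=5, metrics=50, regions=1)

# A hypothetical future scale: 100,000 customers across 13 regions.
large = series_count(customers=100_000, namespaces_per_customer=5, metrics=50, regions=13)

print(small)   # 25,000 series
print(large)   # 325,000,000 series -- well past what typical metric engines tolerate
```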
How are we going to provide metrics to every customer? And if you have that many customers, there are that many namespaces. These are the types of problems we have to solve at the infrastructure level. Then, we obviously have SDK teams, which deal directly with the open-source SDKs, and they work on customer-facing features all the time. [0:32:24] SF: I feel like this is, behind the scenes, a very complex engineering project with a lot of moving parts, to do this at scale with all these guarantees. What are some of the hardest engineering problems that you've had to overcome? [0:32:39] MF: I think the hardest one is reliability and resiliency, because we cannot lose data. We cannot lose even a single task, because the transaction would get stuck. So, just making sure that, for example, it works with existing databases, and that we can work around their limitations, was extremely hard. Guaranteeing consistency – we are a fully consistent service – guaranteeing uniqueness, and still providing reasonable throughput was extremely hard. For example, one of the offerings we have runs on top of Cassandra. Cassandra is an awesome tool for highly scalable – okay, actually not very transactional – workloads. It has problems with transactions, right? But if you need to scale a lot, Cassandra is very good. But it's a very sharp tool. It's very easy to cut yourself. You don't want developers using it directly. We spent lots and lots of effort just making sure that we could implement all our business logic on top of it very efficiently. It was extremely hard. Then, the other part was certainly around the developer experience. I think the hardest part wasn't even making the system scale, but making sure that people like our APIs, that they are natural, and that the learning curve is not insane. So, I can guarantee you:
If you ask me about any field in our API – and there are thousands of them – I can explain why it's there, because a lot of thought was put into our APIs and execution model. [0:33:56] SF: How does the API design come together? It sounds like a very important part of your success and adoption is having great APIs, a great SDK, a great developer experience. How does that design come together? Where does it begin? Are you following an API-design-first philosophy? What's the product design cycle? [0:34:19] MF: Yes, here, API-first is the only option. There is no other way. Because the API we present to users, versus the API we use internally, or even the SDK API versus the API of the service – they are absolutely different APIs. They're not related in any way. They are very, very far apart. The SDKs are very complex state machines. It took us over a year to implement with a pretty good team. So yes, it was API-first, but it was a lot of iterations. I wrote probably five or six client-side frameworks for that. The first attempts were pretty similar to what we have in the industry. It was more similar to Step Functions, and I quickly learned that these things don't scale in complexity. You cannot write really complex applications using any kind of [inaudible 0:34:59], or any other kind of – even if you instantiate objects in code, like an abstract syntax tree in code, it doesn't work. Our solution was: just write your code, and your code is fault-tolerant, and you don't need to do anything else. It's fault-tolerant out of the box. But it took some time. Then, we wanted to make it as natural as possible. For example, Simple Workflow forces you to write fully asynchronous Java. As we know, Java, especially 10 years ago when we did it, didn't have any good support for asynchronous code. So, we had to invent our own framework to do asynchronous code.
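The "just write your code and it's fault-tolerant" idea can be sketched with a toy replay loop. This is a drastic simplification for illustration, not Temporal's implementation: completed step results are recorded in a durable history, and re-running the same function replays the recorded results instead of re-executing the steps, so the code reads as plain sequential logic yet is resumable.

```python
import random

class History:
    """Stand-in for a durable, server-side event history of completed step results."""
    def __init__(self):
        self.events: list = []

def run_workflow(history: History) -> str:
    """Workflow code reads as plain sequential logic; replay makes it resumable."""
    position = 0

    def activity(fn):
        nonlocal position
        if position < len(history.events):
            result = history.events[position]   # replaying: reuse the recorded result
        else:
            result = fn()                       # first execution: run and record
            history.events.append(result)
        position += 1
        return result

    # Even the random order ID is stable across re-runs, because it is replayed.
    order = activity(lambda: f"order-{random.randint(1000, 9999)}")
    charge = activity(lambda: f"charged:{order}")
    return f"shipped {order} ({charge})"

history = History()
first = run_workflow(history)    # executes both activities, recording their results
second = run_workflow(history)   # pure replay: no activity re-runs, identical output
assert first == second
```

If the process died between the two activities, re-running the function against the same history would replay the first result and execute only the remaining step, which is the essence of durable execution.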
The problem is that Java developers don't understand asynchronous Java code, because it's not a thing. You give it to them and they get confused. Now, we were actually able to implement a Java SDK that provides normal synchronous Java code, blocking API calls and everything. It's natural to people, and that's why we've gotten such good adoption, and people like it. But where we can use async, we do: we use async/await in .NET, and we use async in TypeScript, obviously. So, if we can leverage normal language constructs, we will do that. But if it's not natural for the language, we're not going to force it on people anymore. I think that is what we discovered. But then, it's lots and lots of details. Every API call, we think through very, very carefully. [0:36:08] SF: One of the other things that you mentioned in terms of challenges is the uptime and consistency guarantees. So, what are the uptime and latency guarantees that you're giving to companies that adopt the managed service? [0:36:23] MF: Okay, this is actually one area where there's a lot of confusion. We are very conservative when we give these estimates, because we came from big companies like Amazon, and we measure per-request uptime. I know a lot of services say, “Oh, the service is reachable,” so it's up, right? But given how we position ourselves, I think for requests, we give three nines on a single region. For total service uptime, we give four nines. With the new multi-region capabilities we are building, we will be able to give four nines or even higher in the future. We aim to support highly critical applications, even in the presence of failures. But we are working on that right now. Again, I don't like these nines in general, because people have very different views on them, right? We want to put them out, and we want to explain what they are.
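To ground what those nines mean, a quick back-of-the-envelope calculation of the downtime each availability level permits per year:

```python
def downtime_per_year(nines: int) -> float:
    """Allowed downtime in hours per year for an availability of N nines."""
    availability = 1 - 10 ** -nines          # e.g. 3 nines -> 0.999
    return (1 - availability) * 365.25 * 24  # unavailable fraction of a year, in hours

for n in (3, 4, 5):
    print(f"{n} nines -> {downtime_per_year(n):.2f} hours/year")
# 3 nines -> 8.77 hours/year
# 4 nines -> 0.88 hours/year
# 5 nines -> 0.09 hours/year
```

Note that this only quantifies the budget; as Max points out, whether "down" means unreachable or means any failed request changes what the number is actually promising.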
But usually, we explain to people how we operate and how our system works, we give them more details, and they're comfortable. Even if, for example, I say it's three nines, when people understand what we mean by three nines, they are much more comfortable than with a number just thrown out. I see too many people throwing these numbers out without actually understanding deeply what they mean. [0:37:24] SF: Then, as we start to wrap up, is there anything else you'd like to share? What's next for Temporal? [0:37:31] MF: I think the most important thing for us is getting adoption. Please, if you've never heard about us before, just go to our website, temporal.io. Actually, just last week we got the dot-com site. So, there is now temporal.com, which will probably redirect to temporal.io. But go there and try to understand durable execution, try to understand the model. After you understand the model, I can guarantee you, you will certainly consider it for your next project. I think that is the most important thing for us, just getting people started. Then, our goal is becoming ubiquitous, because I believe that every company should at least understand what's there and use it. We also have a lot of plans in terms of specific features, specific things about availability, more clouds, and obviously growing the business. But at this point, the most important thing for us is just more developers learning about us and getting converted to our way of thinking about backend systems. [0:38:20] SF: Awesome. Well, Max, thanks so much for being here. I think you're solving a really hard problem for people, and it sounds like you're doing it in a way that is easy for the adopter of the technology. So, I think it's an exciting thing to talk about and to learn about. Hopefully, people listening will go and check out the new dot-com site and try it out. [0:38:36] MF: Awesome. Thank you. [0:38:39] SF: Cheers. [END]