EPISODE 1874 [INTRODUCTION] [0:00:00] ANNOUNCER: Modern software systems are composed of many independent microservices spanning frontends, backends, APIs, and AI models. And coordinating and scaling them reliably is a constant challenge. A workflow orchestration platform addresses this by providing a structured framework to define, execute, and monitor complex workflows with resilience and clarity. Orkes is an enterprise-scale agentic orchestration platform that builds on the open-source Conductor project, which was pioneered at Netflix. The platform coordinates AI agents, humans, and APIs with a focus on scalability, compliance, and trust. It further expands on the Conductor core by adding features like security, governance, and long-running workflows. Viren Baraiya is the founder and CTO at Orkes, and he's the creator of Netflix Conductor. Viren joins the show with Gregor Vand to talk about building Conductor at Netflix, the challenge of orchestrating microservices, rule-based versus programmatic workflow orchestration, agentic orchestration, MCP integration, and much more. Gregor Vand is a security-focused technologist and is the founder and CTO of Mailpass. Previously, Gregor was a CTO across cybersecurity, cyber insurance, and general software engineering companies. He has been based in Asia-Pacific for almost a decade and can be found via his profile at vand.hk. [INTERVIEW] [0:01:45] GV: Hello, and welcome to Software Engineering Daily. My guest today is Viren Baraiya from Orkes. So we're going to be talking all about what Orkes does, and especially Orkes Conductor. And some of you may already know the word Conductor from some other companies, and we're going to be talking about that. But yeah, welcome, Viren. [0:02:03] VB: Thanks, Gregor, for having me here. And I'm very excited to be here. [0:02:06] GV: Yeah. As is tradition on this podcast, we just like to kind of get an understanding of your background.
You've definitely worked at quite a few interesting companies that I think our audience will be fairly familiar with. Yeah, just walk us through what's your kind of career path, I guess, to Orkes. [0:02:23] VB: Yeah, I started Orkes close to four years ago now. And prior to Orkes, I spent time in a very different set of industries, right? Before Orkes, I spent a few years at Google, mostly working on developer products, Firebase, and Google Play. This is one place where I got to kind of work with developers as one of the audiences. How do they interact with systems, and how do you build for them? Which was also kind of one of the reasons why - which kind of motivated me to build Orkes. More interestingly, before Google, I spent my time at Netflix, which is where I was part of an infrastructure team that was responsible for building the platform for Netflix's, back then, very ambitious Studio project, right? Basically, the goal was to build the largest production studio in the world. And to enable that, there was, of course, the whole production side. But at the same time, there was the engineering side in terms of building out the products and tools. And my team was responsible for building out the platform. This is where Conductor originated, amongst the many other things that we built. And interestingly, before I moved to Netflix, I was in a very different kind of industry, right? So, I spent almost 6 years at Goldman Sachs working on the investment banking technology side of it. That was a totally different experience altogether. I was on the East Coast, moved to the West Coast, all the way from investment banking to entertainment, and then internet consumer, and then, now, SaaS. [0:03:48] GV: Yeah. Yeah. That's very interesting. I don't think we have many guests who have actually been in investment banking tech and then managed to move into, I guess, sort of Silicon Valley tech. That's super interesting.
And, obviously, Netflix, especially the time that you were there, which years you were there, that was such a pivotal time for that company. And speaking of Netflix, some of the audience may be familiar with Netflix Conductor. And I think we should kind of talk about that to begin with. You were working on that part of the technology. But let's just talk about what was Netflix Conductor, and what was the challenge it was solving, and sort of what was that all about? [0:04:30] VB: Yeah. If you look at Netflix's history, Netflix has historically kind of been the pioneer when it comes to some of the frontier technology, right? They were one of the very first big tech companies to be completely on cloud. They invested very heavily into microservices. And when I joined, one of the things that was very surprising was that it really had embraced microservices, right? There were a lot of microservices. And that enabled teams to kind of move very fast. At the same time, one of the biggest challenges was how do you essentially coordinate the work across different microservices, right? Because by definition, a microservice is not going to implement the entire business flow. It just implements part of it, which means that you have to stitch together multiple microservices. And the very kind of traditional way of doing that is through some sort of eventing system, right? Back in the day, that used to be message buses, service buses, and so forth, right? And given Netflix and its scale, we had our own internal implementation of the service bus, pushing through billions of messages a day. And that worked very well. However, the major issue that you run into with something like that is not when things are working fine, but when things are not working fine. And this is where you start to see the kind of brittleness of the system. What happens when something is not working?
Now, to understand exactly what's going on, you have to kind of either dig through the code, talk to 20 different people, manage hundreds of different queues, and things like that, right? That is where the motivation for Conductor started, to say that we want to continue investing into the same principles of building through microservices. We want and we like this whole loosely coupled aspect of building distributed systems. What we don't like is coordinating directly through code. And let's have an orchestrator do that thing for us. At the same time, we wanted the orchestrator to be working at Netflix scale, built using chaos engineering principles. And that's how we started working on Conductor. And that's what Conductor started to do, right? And the whole idea was let's tame all the microservices, bring order to the chaos without losing and sacrificing what it delivers at the end of the day, right? Yeah. [0:06:45] GV: Yeah. And for those that maybe aren't familiar, chaos engineering is the sort of principle of just switching things off and not telling anyone, and seeing what happens - watching sort of how our system fails and how it's responded to. I think it'll be interesting maybe to get into that a bit more as we go through what Orkes is doing. And with Netflix, am I right in saying it wasn't originally open source, but then it was open sourced at some stage? Is that correct? [0:07:11] VB: Yeah. When we started, of course, we started building it at one point in time, what we realized was that the product was good. There was a lot of adoption inside within the company. And what we saw was that, "Hey, if we open source this, we get the benefit of community contributing to it." We can also leverage it as a way to kind of recruit good people, build kind of collateral for the team itself.
And more importantly, Netflix has always been very active about pushing things back to open source, especially things which are not proprietary, not key to the business, right? And Conductor is a very generic piece of software. It's nothing to do with encoding or streaming or anything. So we decided to kind of make it open source. And since then, it has been open source for quite some time now. Yeah. [0:07:57] GV: Yeah. We'll come back to kind of then the transition from it being kind of associated with Netflix, and then sort of what happened there. And then Orkes coming out as a company on its own, so to speak. But I think we should also just take a step back. You already mentioned orchestration there. Let's just kind of give a baseline of what is workflow orchestration. [0:08:21] VB: I think that's a great question, and the reason is that orchestration is a lot of things. In the end, the moment you start coordinating work across different systems, that's orchestration. But then you could have orchestration of containers if you are talking about infrastructure. It could be messaging, it could be data pipelines, services in our case, for example. And in general, if you look at workflow orchestration and workflow engines, it's not a new concept, right? It's been around for a very, very long time, decades probably, if not longer. And the primary reason for that is that when you look at business processes, everything is a workflow. And one thing that I like to say is that whether you like it or not, whether you know it or not, everybody is building a state machine one way or another. And the hardest part of building a system is not how to implement a particular business logic, but is how to maintain the state in a way that it remains consistent and coherent with the business goals.
If you can offload that entire responsibility to a workflow engine, then how do you develop systems, how do you add resiliency and scale and everything else becomes much easier to kind of deal with. In the end, if you think about it, if you have a serverless lambda or a completely stateless service, if you want to scale up, you can just horizontally scale because it maintains no state. But the moment you add a state, now it becomes challenging. Because if there are failures, you have to recover from the state failures. When you are scaling, you have to ensure that state can be managed in a distributed environment. Scaling stateful systems is much harder. And this is where one of the things that we have also seen is developers end up spending a lot of time. Workflow engines kind of solve that problem, the right ones especially. Because you also don't want workflow engines to be a single point of failure. Because if the workflow engine is down, then everything is down. That itself has to be resilient. [0:10:14] GV: Yeah, absolutely. Yeah, we have this concept of like a workflow engine. And before we then get to pure Conductor, there are kind of two sort of ways you could still go about a workflow engine. Is that right? And we've got like rule-based and Conductor style. So maybe just walk us through what rule-based - I assume the rule-based came before Conductor style. And maybe people are more familiar with that. But let's just talk about that and maybe its limitations, and then we can sort of then go on to Conductor. [0:10:39] VB: Yeah, I think the appeal of rule-based workflow engines lies in the fact that you are able to define a set of rules for how the process should be orchestrated. And then somebody who owns the business, this could be a product manager or a business analyst, they can come and define the rules, which works well if that is how, in practice, things happen.
But it also simplifies things because now you can write rules based on your business-specific needs. Where it starts to kind of break down is that, one, you are constrained by the rules that you can write and the underlying implementation, right? If you were to make systems a bit more complicated, then that starts to become challenging. The second is that for the simplest cases, that is completely okay. But as it gets more complex, it becomes much harder to understand what is going on here. And then you are constantly translating those rules into the code to see how this is going to get executed. That kind of choreography is constantly happening in your mind when you are debugging things, when you are trying to understand what's going on there. That becomes a bit of a challenge. And the third thing that we realized was that oftentimes what happens is that, in theory, it looks great that, as a business owner, you can come and define rules. In practice, they will write up a doc, and a developer is supposed to then go and implement the rules and write the rules. As a developer, you deal with the code, not the rules. Conductor takes a different approach to say that, "No, let's keep that idea of being able to define orchestration as a DAG, a directed acyclic graph. But instead of making it a very business-specific or rule-based system, make it programmatic." Just like you write code, you should be able to write a workflow, and it should follow the same principles and the same kind of semantics as code in terms of being able to run things in parallel, decision cases, loops. As a developer, the way you would think about it, you can just write the workflow as it is. And the fundamental principle and the rule of thumb that we followed was that if you can write code, you should be able to define a workflow. It should be one-to-one. There should be no missing cases there. And it should follow the exact same principles - variables, state, and things like that.
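To illustrate the "if you can write code, you can define a workflow" principle, here is a minimal sketch. Note that this is not the actual Conductor SDK; the `Workflow` class and the task functions are invented for illustration, showing a workflow defined programmatically as an ordered graph of steps with a decision branch:

```python
# Hypothetical workflow-definition sketch. The names here are invented
# for illustration and are NOT the real Conductor SDK API.

class Workflow:
    """A workflow as an ordered list of named steps (a simple DAG)."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # list of (step_name, callable)

    def add_task(self, name, fn):
        self.steps.append((name, fn))
        return self  # allow chaining, like a fluent builder API

    def run(self, payload):
        # Each step receives the previous step's output. This is plain
        # code, but every transition is a point an engine could
        # checkpoint, making the execution durable and resumable.
        state = payload
        for name, fn in self.steps:
            state = fn(state)
        return state


# Define the flow the same way you would write ordinary code.
def validate(order):
    order["valid"] = order["amount"] > 0
    return order

def route(order):
    # A "switch" step: branch on data, just like an if/else in code.
    order["queue"] = "manual_review" if order["amount"] > 1000 else "auto"
    return order

wf = Workflow("order_processing").add_task("validate", validate).add_task("route", route)
result = wf.run({"amount": 1500})
```

The point of the sketch is the one-to-one mapping: the branch in `route` is an ordinary conditional, not a rule written in a separate rules language that a developer must mentally translate back into code.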
Essentially, making your code completely durable. That has worked very well. That has worked very well because, one, as a developer, when you are implementing it, you are not constantly fighting against a different type of system. You are doing exactly how you think about it. When you're debugging, it becomes pretty clear as to what you are debugging. And then to give the business owners the visibility, you can always transpose that into a business-specific set of dashboards. So, it keeps everybody happy. [0:13:11] GV: Yeah. Because, I mean, when you mentioned or when you talked about rule-based, it sounded like the focus was more for like non-technical people ultimately. And Conductor style is where actually the developer has a lot more control over that process. [0:13:26] VB: That is correct. Yeah. [0:13:27] GV: Yeah. Awesome. Let's kind of, I guess, look at how Orkes kind of came to be. I believe there was something around Netflix decided to stop supporting or stop contributing, I guess, to Conductor as an open source project. But I think there was a group of you that weren't actually working at Netflix by that stage but then sort of came back together. What was that story? [0:13:49] VB: Yeah. When we started Orkes, we weren't at Netflix, right? I had left Netflix for about 4 years back then. I was at Google. And when we started Orkes, one of the things that we did was we started working with Netflix. We started contributing back to Conductor. At some point in time, we started to become one of the larger contributors compared to Netflix. And it just made sense that like let's take it out of Netflix umbrella and put it into its own project repository, right? Giving community more control over the project and increasing the velocity. Because in the end, the motivation for maintenance of the open source project between Netflix and open source community and us is going to be very different. 
And we have a much bigger team in terms of being able to support and take this forward. That's where we kind of worked with Netflix to kind of have them archive the repository. We took over the code base, and we did it in such a way that we are not losing all the previous contributions, contributor history, and everything. It remains that way. And that has been kind of the case. And it remains completely compatible because it's the same source code in the end. And we see that as an evolution of the open source project, right? Many of them kind of go through a similar path where they start at a particular company. Kafka is a good example, right? Going from LinkedIn to Apache, and then mostly shepherded by Confluent. And it was a little bit similar with Databricks and Spark. [0:15:05] GV: Yeah, exactly. We've seen this sort of, as you say, across a bunch of open source projects that sort of, yeah, usually come out of a company. And then, yeah, there's just various reasons why it makes sense to - well, either to just kind of fully open source it and say, "Hey, we're just not going to support this anymore. Whoever wants to come in and work on it." That makes a lot of sense. Let's then talk about Orkes. And if you could explain what Orkes is. And maybe let's start with what is it - ultimately, what has it built on top of Conductor and the Conductor project? [0:15:36] VB: Yeah. When we started Orkes, the primary motivation was that we have the open source project. There is a very clear fit between the project and the market because we had seen by then thousands of companies, some of the very well-known companies as well, using them in their production flows. Personally, I was getting a lot of pings on LinkedIn. Sometimes people asking me to review their PR or asking for some help. And in the end, we felt that the market was ready. Typically, what tends to happen is that companies like Netflix, they are on the bleeding edge of the innovation.
The problems that they see and that they solve, the industry starts to see them 3, 4 years later. In a way, the timing was right. That, like, "No, this is the time where companies are going to start looking at it and realizing that as you move your systems to cloud, you need to break down your monoliths." At the same time, cloud, yes, it gives you the elasticity of infrastructure. But at the same time, unlike a data center, you have to start thinking about the resiliency aspect also. And this is where they will start thinking about workflow engines. And we were kind of right in that sense also. That's how we started kind of the Orkes as a company. And in terms of the business model and like how we go about it, by this time around, other open source companies had already paved the way in terms of how should you think about building an open source project to monetize that. How do you kind of differentiate and things like that? That kind of became the foundation for how we think about Orkes as a company and the Conductor as an open source project that we kind of monetize on. And then bringing back to your other question is how does it differentiate, right? What we have been doing is that you have the open source project, we use open source. In the end, Orkes is open core, right? Where I think enterprises wanted us to be able to support them was in terms of adding enterprise features. Because if you want to run a project inside your company, let's say a bank or a healthcare company, you need to have things around security, compliance, governance, those things. Open source, I would like to think about it more like a Linux kernel, right? You can take a kernel and build your own distribution. But as a company, you probably want to get a distribution that is vetted by a vendor and has all the security features and everything. That is exactly kind of the way we did it. The other part of Conductor is that Conductor is a very plug-and-play system.
It supports multiple different backends. And what we do is we take the right ones, we optimize for the performance, and the cost, and everything, and the manageability also on top of it. And that's essentially what we deliver, right? What our customers are essentially paying us for is, in the end, how do we take the project and run it reliably in their environment? Because that's the key challenge that we solve for them. If you were to give them three 9s or four 9s of availability, that's one thing that we can deliver them without them having to worry about it. [0:18:26] GV: Yeah. Yeah, again, kind of walking quite a - well, I think now we have seen it's quite a familiar path where an open source project can just benefit hugely. Actually, if there is a sort of commercial arm around it where, as you say, it's able to provide the security, the stability, the compliance, and especially in the enterprise setting, which is what Orkes really caters to, having all that just kind of taken care of, as you say, there's a huge need often for that on the basis that the original project has got a bunch of people using it. Yeah, I track back to sort of - it was interesting. Sort of in Hacker News at the time, there was a lot of comments. People saying, "Oh, this is awesome. It's so great to see that some people are taking this on as like proper company. We can just kind of buy it from them as opposed to needing to try and run it ourselves now." Yeah. [0:19:13] VB: Yeah. As a matter of fact, our first few customers, the open source adopters, were like, "We're glad that you guys started the company. Can you help us?" Yeah. [0:19:22] GV: And especially, because quite a few of you were very much the core contributors in the first place. That's awesome to see. Let's move on to Orkes. Maybe walk us through. How has Orkes evolved? Because I think we're talking sort of back in 2022ish is kind of when that started what we've just been talking about. 
Maybe just walk us through how has Orkes evolved? We're going to get into agentic orchestration and AI because that's where it can help in a big way in things that are very pertinent now. But maybe just walk us through kind of how's the product kind of evolved. Yeah. [0:19:55] VB: Yeah. I think when we started, our initial focus was that, "Hey, here's open source. How do we make it enterprise-ready? Run it in cloud?" We kind of focused on that one, right? And of course, our customers tend to be mostly in very regulated industries. Quite a few of them, right? Which means that you have to support different modalities in terms of whether it's running fully hosted by Orkes. Or is it bring your own cloud? Or in some cases, running in data centers. That is one area where we spent time, making sure that we can take the software, we can run it at a highly reliable scale for the customers. And then we started to kind of start thinking about, as a company, if you are leveraging something like Conductor, you don't want different tools for different problems. When you think about work orchestration, you want to be able to do any number of things with it, right? We started to kind of add some of the features that we got as clear feedback from our customers. Sometimes, they had kind of built out their own internal versions of it, but they wanted us to kind of support them by adding it as a proper feature inside Conductor. One of the things that we did: workflow engines traditionally do asynchronous orchestration, meaning the workflows can run anywhere from a few minutes, to hours, to days. We added support so that your workflows can run for much longer periods of time as well, like months and months, or even years in some cases. And we do have some use cases like that. And then on the other extreme was if you are orchestrating services. You have HTTP services, you have gRPC services, and you want to orchestrate them, those are not going to run for seconds.
They are going to probably finish the entire flow in tens of milliseconds. How can we run workflows synchronously, right? Through microservices orchestration, but very much synchronous and very low latency. That's another area that we focused on. And I think that's one of our key capabilities that is very unique to Conductor that you typically don't find in other workflow engines. And then as kind of the industry was starting to think about AI and LLMs, that's where we started to kind of invest into how can we let workflow engines orchestrate language models. I mean, today, I think that has become very commonplace, that you need a workflow engine to orchestrate your agents. But back in the day, people were still writing Python code to just call LLMs. And that's where we started to kind of build integration suites and everything, right? Core to our nature, right? In the end we are not a solution, we are a platform, which means we want people to be able to kind of use it in whatever ways and format they want to use it. One area where we focused on is that let's start to integrate and provide support for pretty much every foundational model that is out there. And today, I think we support pretty much every possible model out there. You can switch back and forth. You can run them together in the same workflow and things like that. That's one of the areas where Orkes has evolved into a true LLM orchestration platform where you can have multiple agentic models. And that then allows you to kind of build - if you think about traditional workflows, those are deterministic flows, right? You could have switch cases which could take different paths. But in the end, it is still very deterministic. Given the right input, it is always going to produce the exact same output path. Now if you add language models and LLMs inside that, you start to kind of see the non-determinism aspect of the workflow. Because even for the same input, it could take a different path.
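The deterministic-versus-nondeterministic distinction just described can be sketched in a few lines. The functions below are invented stand-ins: the second one uses a random choice as a stub for a language model call, since from the workflow engine's perspective the key property is simply that the same input may not always yield the same path:

```python
# Sketch: a deterministic switch vs. a model-decided branch.
# choose_path_llm is a stub standing in for an LLM call; the names and
# claim structure are invented for illustration.
import random

def choose_path_deterministic(claim):
    # Same input always yields the same path - a classic switch case.
    return "fast_track" if claim["amount"] < 100 else "full_review"

def choose_path_llm(claim, seed=None):
    # Stand-in for a language model: even identical inputs may yield
    # different paths, which is the non-determinism described above.
    rng = random.Random(seed)
    if claim["amount"] < 100:
        return "fast_track"
    return rng.choice(["full_review", "request_more_info"])
```

An engine orchestrating the second kind of step has to record which path was actually taken at runtime rather than assuming it from the workflow definition.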
And then we started to kind of support those things. And I think that's one area where the general industry also is moving towards. And we are continuing to kind of invest into that area as well. [0:23:34] GV: Yeah. I mean, maybe, again, just to sort of baseline this, I'm sure the majority of the audience are kind of familiar with what an AI agent is or sort of what they maybe think it is. But at the same time, I think it's always helpful to kind of get your definition as well because I think you could probably pull up five different sort of definitions of what an AI agent is, and especially in the orchestration sense. I mean, for example, are we talking - when we say agentic orchestration, are we saying, "Well, these are multiple agents that get orchestrated." Or are we saying that, ultimately, an orchestration could be termed as an AI agent or you - help me out here. [0:24:12] VB: I think, yeah, that's a good question. I think agent is a very confusing term because it's a very general-purpose thing, right? Pretty much anything can be thought of as an agent. But in the end, I think the textbook definition of an agent is that an agent is something which has agency. It has its own autonomy in terms of how it can plan and execute its goals. And now, if you translate that into agentic systems, it means, I think, three different things in my opinion. An agent essentially could be purely an orchestration where you have language models deciding the path. There has to be some sense of autonomy. Otherwise, you just have a very deterministic system. Agents, by definition, have some level of autonomy, and therefore nondeterminism is kind of built into them. Now, you can think about a workflow with a single language model or an LLM that is either running in a loop or in a single execution path. In that case, you have a single agent that is operating inside that workflow. Now, we have heard a lot about like humans-in-the-loop and guardrails, right?
As humans, you can think about humans also as an agent. The moment you put a human inside a workflow with an LLM, you are starting to think about multi-agent systems where now you have two agents and they have very clear responsibilities. Maybe the LLM has a responsibility to come up with a plan, and the human has the responsibility to kind of vet that plan or approve or reject the plan and then continue executing on that one. Similarly, you could add more LLMs and build true multi-agent systems where LLMs are participating. And each one has a pretty well-defined role. I think a very good example I would say is what we see with AI coding tools like Cursor and Windsurf, where you could think about one of the agents taking your instructions and generating the code. A second agent could be actually responsible for compiling. And a third one could be responsible for testing and checking against your input goals, right? And they are all coordinating, running in a loop, until it achieves the goal. That's a true multi-agent system in the end. And as a human, as a developer, you are also an agent who is kind of saying, "Yeah, this looks good. Approve it. Commit the code." That's now a true multi-agent system. But in the end, agents are - if you think about heuristic workflows, the way I would like to think about it is that you don't have a very set defined path, but you have a very high-level definition of what should be done. How you do it depends upon how the LLMs are thinking about doing it. [0:26:36] GV: Yeah, I think that's really helpful. And the code orchestration example is a good one. I think also through the sort of Orkes website, and sort of there's like examples of flows. And I think this is an example that sometimes you pull out around inventory management or like claims management, for example. Maybe that would be quite interesting to sort of understand now how does it differ compared to, say, like a traditional rule-based system.
What are the things that can be done differently and better, I guess, when we're now talking about AI, agentic, orchestrated, and then applied to these kind of quite clear business use cases? [0:27:14] VB: I think, see, the biggest thing that can be done better is this: if you have a non-agentic system, because, by definition, it is a very deterministic system, every time you have a different use case, you have to build a new workflow, a new system around it, which essentially creates an explosion of different use cases. Which is what you see: "Hey, if I were to approve a claim, for example," right? And depending upon different requirements, you have different claim systems or different parts of the claims and things like that. But if you were to add something new, again, you go back to development mode, rebuild or build a new feature. And it takes time and things like that. With an agentic system, I think the biggest change is that instead of writing the entire system end-to-end, you focus on writing tools. A tool can be something that sends an email. A tool can be something that looks at the claim information and pulls up the customer information, or the claimant information, or looks up the policy. Now, if you think about it, you can put those tools in any particular combination. Now we're talking about combinatorial explosion, right? If you were to build deterministic systems, you end up building a large number of different use cases and paths, which is why most software projects take months and months to develop, because you have to cater to all different possibilities and everything. But if you break it down to say that I have got N tools, it can be used in any combination. And an LLM can decide which one to use. Now your thinking changes, right? You're no longer thinking about putting them together by yourself. You are building stateless tools. Very similar to microservices, if you think about it.
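The "build tools, let the model combine them" idea can be sketched roughly as follows. Everything here is invented for illustration: the tools are stubs for the claims example just described, and the `plan` function is a deterministic stand-in for an LLM planner that would choose the tool sequence from the request and its context:

```python
# Sketch of tools + a planner. The planner is a stub for an LLM; the
# tool names and claim fields are invented for illustration.

TOOLS = {}

def tool(fn):
    # Register a function as a tool the planner may choose.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup_claimant(claim):
    claim["claimant"] = {"id": claim["claimant_id"], "standing": "good"}
    return claim

@tool
def lookup_policy(claim):
    claim["policy"] = {"covered": claim["amount"] <= 5000}
    return claim

@tool
def send_email(claim):
    claim["notified"] = True
    return claim

def plan(request):
    # Stand-in "LLM planner": returns tool names to run, in order.
    # Adding a new capability means registering a new tool - not
    # hand-building another end-to-end workflow for that use case.
    return ["lookup_claimant", "lookup_policy", "send_email"]

def execute(request):
    state = dict(request)
    for name in plan(request):
        state = TOOLS[name](state)
    return state

result = execute({"claimant_id": "c-42", "amount": 1200})
```

With a real model in place of `plan`, the same N registered tools can serve many combinations of use cases without a developer wiring each path by hand.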
But instead of you, as a developer, kind of putting them together, the LLM is taking your input and deciding on the fly how it should do this, which means that your development process becomes simplified. You can introduce a new tool without having to change everything and start incorporating them. The way it differs from traditional rule-based systems is that it now allows you to go from 0 to one and one to N very quickly by just incorporating more and more tools. But you are no longer catering to kind of the combinatorial explosion of different use cases, right? It can just do it out of the box. And we are starting to see that, right? That's how, I would say, a lot of new systems are starting to be built out - through agents. Agents can do pretty much anything as long as they have the right tools and context given to them. [0:29:33] GV: Yeah. I mean, I think to sort of use a slightly overused term is sort of this idea of basically setting the first principles of what can be done and then letting the orchestration aspect kind of then deal with how it wants to then go about that. And you touched on it. Obviously, the deterministic or non-deterministic, especially in this case, aspect. And I think that's something probably a lot of the audience is curious about is how does that then work. Because basically kind of the crux of all this is how do we allow the system to take its own decisions? And what sort of constitutes this, say, a first principle that can be laid down, and then the rest is allowed? Yeah, how does Orkes deal with this? And how does somebody using Orkes, I guess - how can they kind of feel confident that the non-deterministic aspect is kind of taken care of, I guess? [0:30:27] VB: I think the analogy that I like to think about it is when you have a car that can do self-driving. There are two aspects of it, right? One is the notion of control, that I can take the steering wheel at any point in time and do whatever, right? Guardrails.
Humans who can be in the loop wherever you need them to be. The second part is, I think, more critical. If you think about it, if you just treat the LLM as a black box and say, "Here are all the tools. Just go and do it and come back with the result," how do you know what the thought process was and what it did? The second part is basically showing me what it sees. Saying, this is what I'm thinking. This is my plan. And this is exactly how the graph of this execution is going to look. As a human, now I can look at it and say, this makes sense: you're going to execute steps one and two. And based on the output of step two, I can take three or three prime and then go and execute step number four. Now this graph is something that I can see and say, this is what you are thinking about doing. This makes sense to me, that you should do it this way. Go and do it. Humans in the loop become critical, along with that entire aspect of being able to visualize the execution graph. I think that's tremendous, because now you start to build confidence that this works. The second part is when to apply guardrails. A good example that I like to give here is, if I'm building a DevOps system and I'm using an agent to manage my Kubernetes clusters, when it decides to execute an operation to get the list of pods and deployments, nothing bad is going to happen. Just do it. Even if you execute that command on a wrong cluster or a production cluster, you are just going to execute a read operation. Nothing bad is going to happen, and that's completely fine. But if you are going to destroy a cluster, you better check with me first. Maybe you send me a Slack message or an email and let me approve it, because you might hallucinate. You might end up making the wrong decision, or a typo might destroy a production cluster. I don't want you to do that. When to apply a guardrail is another aspect.
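The Kubernetes example above can be sketched as a tiny guardrail policy: read-only operations run straight through, destructive ones are gated on an approval callback (a Slack message or email in practice). All names here are illustrative, not Orkes or Kubernetes APIs:

```python
# Sketch: a deterministic guardrail around agent-issued operations.
READ_ONLY = {"get_pods", "list_deployments"}
DESTRUCTIVE = {"delete_cluster", "drain_node"}

def execute(op, approve):
    """Run `op`; gate destructive operations on the `approve` callback."""
    if op in DESTRUCTIVE:
        if not approve(op):
            return f"{op}: blocked by guardrail"
        return f"{op}: executed after approval"
    if op in READ_ONLY:
        return f"{op}: executed"
    # Fail closed on anything unrecognized - safer when the caller
    # might be a hallucinating LLM.
    return f"{op}: blocked (unknown operation)"
```

Note that the guardrail itself is plain deterministic code: the LLM chooses the operation, but it never gets to choose whether the check runs.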
This is where we are spending a lot of time: as the builder of the agent, you should have full control. Instead of saying, here is the LLM, give it the tools and let it execute everything, our approach is fundamentally different, in the sense that we tell the LLM, "Here are the tools that you can use. Tell me what you are going to use." And then, based on the outcome, I can decide and build that inside my workflow. Now the workflow becomes a combination of some set of algorithms and some set of nondeterminism, right? You add determinism when you need it, and otherwise let nondeterminism take care of everything else. [0:33:05] GV: Yeah. And I believe exactly that Orkes is really focused on this trust aspect. Because I think that is what everyone - I say everyone, but especially enterprise - is concerned about. The potential productivity gains from allowing agents to run a bunch of stuff are, in theory, fantastic. It's just that developer example you just gave, of a cluster being destroyed by some typo somewhere. That's, I think, what a lot of - especially, I would say, the maybe non-technical folk in companies - are very concerned about. They're sort of like, "This all sounds great. But there's no way that this could actually do it reliably." Can you maybe walk us through a few mechanisms? Or sort of how does Orkes - I don't expect it's all solved today. But how is Orkes actually approaching this? And what kind of tools and mechanisms are there to help the developer? And what could that then help the developer say to the non-technical person, to help them feel more at ease about all of this? [0:34:09] VB: The way we are approaching it, as I was trying to explain, is in two ways. One is being able to add guardrails, and to add them when you think an operation is going to be something that you want someone to take a look at.
And a guardrail doesn't have to be a human, right? There are a lot of systems for automated guardrails. You can also use an agent as a guardrail. You can delegate it to another agent, which can get tricky, because what if that also hallucinates, and the two of them agree and do something bad? But depending upon the use case, depending upon the contextual need, we allow developers to put in the right guardrails. And adding guardrails is a deterministic step. We are not asking the LLM to decide when to use a guardrail, because that brings back a cyclic dependency and the trust problem. Instead, we let developers say, this is where you will add a guardrail. And that's a very deterministic step: if you have a guardrail set up for a specific tool, it will get executed. That is one part of it. That takes away the whole aspect of the LLM doing something bad without your approval. The second part is understanding what really happened. It's completely possible that the LLM did exactly what it was supposed to do. No hallucinations or anything. But then there are questions about why. And that's important, right? If it's a claim processing system and I approved or denied a claim, and there's a question as to why, the answer cannot be, "Because my AI said so." It has to be, "Hey, this is the thought process. This is how we evaluated the claim. And this is the outcome." It's less about LLMs making decisions. It's more about LLMs defining the flow. But you need to have complete visibility. The other aspect is that we give the full-blown graph of exactly what happened, step by step. Every step. What was the input given? What was the output that came out of it? And exactly what decision was made based on that? Now, as an operations person or a human, you can look at it and explain exactly what happened and why it happened. That takes away the other problem, of I can't explain what happened.
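The step-by-step visibility described here, every step's input and output captured so someone can later answer "why?", can be sketched in a few lines. This is an illustrative stand-in, not how Orkes stores its execution graph; the step names and claim fields are made up:

```python
# Sketch: run steps in order while recording each step's input and output,
# producing an audit trail that explains the decision after the fact.
import json

def traced_run(steps, ctx):
    """Run named steps in order, recording each step's input and output."""
    trail = []
    for name, fn in steps:
        snapshot = json.loads(json.dumps(ctx))  # deep copy for the record
        ctx = fn(ctx)
        trail.append({"step": name, "input": snapshot,
                      "output": json.loads(json.dumps(ctx))})
    return ctx, trail

steps = [
    ("evaluate_claim", lambda c: {**c, "score": 0.92}),
    ("decide", lambda c: {**c, "approved": c["score"] > 0.8}),
]
final, trail = traced_run(steps, {"claim_id": "C-42"})
# `trail` now answers "why was this claim approved?" step by step.
```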
Now we can completely explain what happened, and you can control it. These two things combined, we think, give you enough guardrails. And of course, one other aspect of Conductor and workflow engines in general is that they keep a trail of everything. Every conversation, every execution that happened is captured, stored, and can be kept in storage for whatever your retention policy is, right? If you were to go back and see what was happening, and how, those things can be queried later. The other part is access control. One thing that is built into Orkes is that just because you have a tool, it does not mean anybody can use it. Even to use a tool, you need to have the right access control. A good example I like to give here is that if you are building an agent that can do all things HR for you, you should not be able to ask the agent to give yourself a promotion unless you are an HR admin and it goes through a proper approval process, right? That is also built into Orkes. With the right level of access control, visibility, and the human guardrails, we think that's going to be enough for someone to say, "Hey, we can trust the system." [0:37:16] GV: Yeah. And you've mentioned a good explanation, the graph sort of aspect of it. Does it present the same - and let's just mix this in with the access control for a second. Does it present the same view across all types of person? Or are you able to present different kinds of views that make the most sense of the explanation? An ops person is going to understand what happened in one way, which at times is quite different from how a developer wants to see it. But maybe this has been solved in one pane of glass, again, to use a slightly overused phrase. But, yeah, tell us about that. [0:37:53] VB: Yeah, I think the short answer is yes.
The slightly longer answer is that, as a developer, you are able to see every step. As an ops person, depending upon how you construct the whole thing, you can look at the high-level blocks: step one, step two, step three. Step two could be a lot more complex, which, as an ops person, you may not need to understand and know. And of course, because one thing about Conductor is that it's pretty much an API-driven system, you can then go and build very business-specific views of it, which might make sense for your business users and follow your process flows and definitions and everything around them. It decouples those two aspects. As a developer, you don't have to think about building everything with rules for the business. And as a business user, you don't have to think, "I don't understand this. Somebody come and explain it to me." [0:38:37] GV: Yeah. Bringing this back, I guess, to where the developer sits. One bit that we haven't touched on, before we get on to just getting up and running, so to speak, is actually how MCP and MCP servers come into this. And I believe you have open-sourced your MCP server for Conductor. When I was getting my head around what Orkes and Conductor do in the first place, I instantly started to think about MCP. Because I thought, "Well, isn't this what MCP is sort of for?" Maybe talk to us about where the intersection is and how they work together. [0:39:13] VB: Yeah. MCP focuses primarily on how you expose your tooling and APIs as something that LLMs can understand, right? And that simplifies a great deal in terms of LLMs being able to call the tools. What we do is allow developers to bring their own MCP servers. We are also working to bring in most of the common ones as part of the out-of-the-box capabilities inside our enterprise edition.
And then it can basically use them as tools. If you want to send an email, and the LLM decides, "No, I need to notify the user," and you have an integration through MCP with Outlook or Twilio, it can send the email. That's the primary role for MCP. The Conductor MCP server does a very similar thing, but it acts as a tool to generate the execution graphs. To say, "Hey, I have this goal. Can you give me a Conductor workflow for this that I can then go and execute?" That's a stepping stone for us to build fully autonomous systems. Because if you think about LLMs today, essentially what they do, in programming terminology, is a look-ahead of one. They look at the current context and say, "What's the next set of tools that I'm going to execute?" Where we are going is: I can look at the goal and define the entire execution graph with a look-ahead of N, N being a decently finite number. That improves performance and also cost, because you are making fewer LLM calls. And more importantly, reliability, because that output can be pretty much deterministic, or deterministic enough across multiple iterations. That's the primary goal of the MCP server. [0:40:48] GV: Got it. I mean, it is effectively completely optional. It's not required, in terms of - I mean, MCP, obviously, is a protocol that was ultimately developed by Anthropic. Are you looking to support any of the other competing, shall we say, protocols? Or does MCP make sense as the one to sit with? [0:41:11] VB: I think MCP is a great one for being able to call the tools. We added some features that fill gaps, some of the things that MCP as a protocol definition lacks, things like access control, right? It does now support a notion of authentication, but authz is one of the parts that we have added. Then the other one is A2A.
When you start thinking about multi-agent protocols, I think A2A is emerging as something that people are starting to think about for agent coordination. That's another area where we are going to add support pretty soon. [0:41:44] GV: Nice. Awesome. Let's talk about, I guess, getting up and running. First of all, where does a developer go? And then maybe could you talk us through what a high-impact, say, first 10 minutes of getting started with Orkes looks like? I believe there are some template-type workflows you can run out of the box. What's a high-impact 10 minutes for someone who's never used this? And let's just say, for argument's sake, they've never even used orchestration before. This is the first time they're approaching this. What does that look like? [0:42:16] VB: I would say there are three main categories, right? One is, if you are looking to orchestrate APIs, you can create a workflow. Conductor has a notion of system tasks, things which are pre-built. You don't have to write code for them, and even if you wrote the code, you would be doing the same thing. You can orchestrate multiple HTTP endpoints and see for yourself, right? There are a lot of example API endpoints available on the internet. You can just put them together and then see how it orchestrates them and gives you the visibility. That's the API orchestration use case that you can very quickly test out. The second part is, if you are trying to build an agent, you can try to build out a simple chat complete agent. You can put a loop and chat complete inside it, and it will keep on running until your loop terminates. And you can actually put two LLM agents together. You can take two chat completes, two agents, give them some instructions, and you will see that they start talking to each other in a conversational way, right? That's pretty fun to see and quite interesting sometimes.
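The first of these, orchestrating a couple of HTTP endpoints with system tasks, can be sketched as a workflow definition. The field names follow the open-source Conductor JSON schema for HTTP system tasks as commonly documented; the endpoints and task names are placeholders, and in practice you would register such a definition through the Conductor metadata API or UI:

```python
# Sketch of a Conductor-style workflow definition: two HTTP system tasks
# chained, the second reading the first task's response. No worker code
# is needed - HTTP is a pre-built system task.
workflow = {
    "name": "hello_orchestration",
    "version": 1,
    "tasks": [
        {
            "name": "fetch_user",
            "taskReferenceName": "fetch_user_ref",
            "type": "HTTP",
            "inputParameters": {
                "http_request": {
                    "uri": "https://example.com/api/user/1",
                    "method": "GET",
                }
            },
        },
        {
            "name": "notify",
            "taskReferenceName": "notify_ref",
            "type": "HTTP",
            "inputParameters": {
                "http_request": {
                    "uri": "https://example.com/api/notify",
                    "method": "POST",
                    # Feed the first task's response into the second task.
                    "body": {"user": "${fetch_user_ref.output.response.body}"},
                }
            },
        },
    ],
}
```

Running a definition like this in the Conductor UI is what surfaces the execution graph and per-step inputs and outputs discussed earlier.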
And the third part is, if you are building a workflow, you can take an existing business process that you have, let's say order management or claim processing, right? And we have templates for it. You can try it out, mock up the actual implementation, and see for yourself how easy it is to change, modify, and get visibility into it. Those are some of the things that can be done in 10 minutes. And we have a developer edition. Anybody can go to developer.orkescloud.com and get started pretty quickly without having to worry about how to download and run it locally. Which, if you want to do, you can always do. But nothing beats one click: go to this URL and start working on it. [0:43:55] GV: Yeah, exactly. I think that's where I went. That's developer.orkescloud.com. Orkes is spelled O-R-K-E-S. Yeah, head there. It's pretty foolproof. You can either choose templates or just hit start from scratch, sign up, and then off you go. Awesome. From what you can share, you touched on where you might go, the agent-to-agent things. But just before we wrap up, what does the next, say, 6 months look like for Orkes? And what are you looking to add or develop? [0:44:29] VB: I would say that the industry is slowly moving towards agentic workflows, right? Everyone is thinking about how they can incorporate language models into their business processes and leverage them to accelerate the pace at which they can innovate and get ahead of the curve. And that's one area where we are focusing. And most importantly, as I said, the trust and safety aspect is the most important one, because that's what enterprises care about more than anything else. That's one area where we are spending a lot of effort, seeing how we can simplify those things.
And the other part that is coming up pretty quickly: when you start thinking about agentic systems - traditionally, when you thought about software, as a developer, you would build the end-to-end stack. That role might shift towards developers building the tools, and the agents being built by the business users. Going back to our original discussion about the rule-based workflow engine side, I think they're coming back, and I would say this time with a vengeance, saying, "Hey, we are going to let you do it. But now, no more DSLs, no more quirky rule engines. Rather, just describe what you want to do, and I'll figure it out and do it for you." I think that's going to be a pretty interesting area to watch, and one where we are also investing to see how we can bring business and developers together to accelerate the speed at which they can innovate for companies. [0:45:47] GV: Yeah, awesome. Sounds really powerful. Especially, as you've called out, given that Orkes is super focused on enterprise-grade reliability and trust, if it's going to be possible to do this in an enterprise setting, then this is kind of the place to come to. And obviously, it's been proven at a base level, given it came out of places like Netflix and is used in big settings, big companies. Yeah, very exciting. Well, thanks so much for coming on, Viren. I think we've learned a lot. Again, for anyone who wants to get up and running, that's developer.orkescloud.com. Just head there and give it a try. Yeah, Viren, thank you so much. I hope we get to catch up again in the future. [0:46:33] VB: Yeah, thank you. Thanks for having me here. [END]