EPISODE 1789

[INTRO]

[0:00:00] ANNOUNCER: Railway is a software company that provides a popular platform for deploying and managing applications in the cloud. It automates tasks such as infrastructure provisioning, scaling, and deployment, and is particularly known for having a developer-friendly interface. Jake Cooper is the founder and CEO at Railway. He joins the show to talk about the company and its platform. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[EPISODE]

[0:00:41] SF: Jake, welcome to the show.

[0:00:41] JC: Great to be here. Super excited to chat about a bunch of stuff. I know we've got a couple of things on the docket, so yes.

[0:00:47] SF: Yes, absolutely. We were chatting before we hit record here, but you and I both graduated from the University of Victoria in British Columbia in Canada. So, what ended up sort of pulling you south to the Bay Area?

[0:00:59] JC: Yes, I think there's almost this brain drain kind of pull that we were both talking about right before. I think maybe half of my graduating class was just kind of like, "Yes, I generally want to move down there." So, I think I kind of always knew that I wanted to be in the US. I think I always knew that I wanted to start a company at some point. It was more a matter of when, not if, if that makes sense. I ended up moving to a few different places, working a bit out of Amsterdam and Italy in between graduating, and then moved on to New York and then moved to San Francisco. But yes, I think it's a pretty strong no-brainer in my mind, because you pay the same amount of tax dollars and you get twice the ambition plus twice the sun. So, there's a pretty strong payoff on doing that.

[0:01:38] SF: Yes. As much as I love Canada, I think if you're in tech, there is such a strong pull to the Bay Area, especially when I moved here now like 15 years ago.
I think we will get into it today, but you're running a fully remote company. So, I think there are somewhat fewer constraints on companies and the opportunities people have in tech today all over the world. But that wasn't always the case. But back to what you're doing now around Railway, what was the driving factor behind the creation of it? Is the platform focused on streamlining deployment and management of infrastructure and dependencies?

[0:02:11] JC: Yes. So, I mean, I grew up hacking on random stuff. My first computer science stuff was actually writing aimbots and cheats for video games. There was always a very exploratory, creative thing like that. I ended up writing small programs and anything else like that. Then, every single time I moved to deploy something, you're switching from this world of, "Oh, cool" - this joy, this beautiful, happy kind of thing where you're hacking around, it's nice - and you're like, "How do I move this thing?" Obviously, there were tools like Heroku at the time. So, finding those tools was awesome and magical. But there's a whole class of problems that exists outside of actually getting something deployed that we call the deployment lifecycle. How do you make changes? How do you go and get them reviewed? How do you go and add a database in another environment, and then how do you make sure that you're going to actually have that database when you go in and merge, right? This whole kind of split-universe phenomenon of staging, et cetera, right? So, when you get into things like that, you end up having to wrangle a lot of stuff, right? It goes kind of back to this bit of trough of sorrow, having to go and figure out all of these things, right? In reality, a lot of the workflows that people end up having are very, very similar, right?
If you end up building a lot of those workflows - most people, they want to split into a parallel environment, they want to test their stuff, they want to merge it, and they want it to automatically roll out, right? They just simply don't want to handle that. So, it ends up being that this class of problems is both interesting from a systems perspective - so we're working on some really, really cool networking, storage, et cetera, all of those other things. We've got our own bare metal servers now - but also very, very applicable to a wide swath of people, right? I also think that the compute market is one of those things that will just continue to grow, right? We will need more computers, right? Anything that we can do to streamline people's productivity in there is one of the highest-leverage things that we have in general, right? We talk a lot about leverage. How do you build leverage? How do you build efficiency? How do you make it so that a user's action has an outsized return on what they're putting in, right? Because for us, that's the definition of magic, right? It's like you do a little, you get a lot.

[0:04:09] SF: Do you think the public cloud has increased that pain? In terms of, we have this joy of building something, and you get that aha moment of building something maybe on your local machine and running it, and then you've got to get to a place where now I have to deploy it. Do all the boxes that you have available in the cloud make that even more of a challenge?

[0:04:29] JC: I think you're trading pain, if that makes sense, right? It's obviously very, very painful to go and procure servers, get them up, making sure the plug doesn't fall out of the wall and that somebody doesn't bump the power cable, and all that class of problems. Then you kind of trade those for, okay, how do I manage these machines, right?
And then you kind of have to play with the abstractions that the cloud providers give you. There are benefits and curses to that. In the prior bare metal world, if you want to spin up a server, you're measuring your time on the order of weeks or maybe even months until you get this thing up and running, versus you go to the cloud and it's like, boop, hit a button, and it's just up and running, right? But you do have to take the primitives that the cloud providers give you and work within those walls in general. Those primitives can be, I think, made faster in and of themselves. So, from making deployments instant, moving your builds immediately beside your compute, moving your storage, keeping all of these things together - that's kind of the whole end goal of a lot of the stuff that we're building, because all of these things have almost been verticalized. If you look at the AWS dashboard, it's like every service - famously, Jeff Bezos said everybody's going to interact with these things over an API. So, everything's vertically sliced, right? There's no mechanism to share these things together unless you really want to go and start composing. And then you have to, again, start bumping into the abstractions there in general. I think you're trading pain over time.

[0:05:49] SF: Okay. There have been a number of companies that have tried to simplify this process. You mentioned Heroku; there are other companies in the space like Render or Netlify, these various PaaS platforms. What is Railway's unique approach to this that distinguishes it from some of the other players?

[0:06:07] JC: Yes. So, I'd say there are a couple of things. We talk about wrangling complexity a lot in general. We've built out a really intuitive UI that is kind of layered. It's a canvas.
You essentially just go to it and you kind of just spew out, "Hey, give me Postgres. Hey, give me Redis. Give me a deployment from GitHub." I think we've gone above and beyond on a lot of different things. We've built a system for automatically generating Docker images. So essentially, you give us your code, we will go and statically analyze it. We will go and figure it out and stuff like that. So, you don't have to actually write anything to get started in general. Additionally, we've built our own storage system. So, you can host literally anything on Railway, right? You can build a ClickHouse database beside your Python instance, beside your whatever. For us, it doesn't really matter. We've built those primitives in such a way where they feel very, very quick and they're really, really easy for you to compose together. So, I would say that what separates Railway versus something like Netlify or Render or anything else like that - on the surface, they may look very, very similar. But over time, as you compose these things together, we've tried to almost linearize the complexity of each of these things, versus allowing that complexity to spawn out until, "Oh, you have so many of these microservices and no tools to manage them and stuff like that."

[0:07:15] SF: So, you mentioned Docker there, and being able to abstract away the challenge of putting together a Docker image. What is that challenge that people typically run into?

[0:07:24] JC: Yes. So, I would say that there are a couple of challenges. Oh, and sorry, another thing to mention is that on Railway, you only pay for what you're using, because we've built our own orchestration engine. So, we'll go and place workloads. Normally, if you're on a cloud provider, you pay for a four-gig box. If you don't use the four-gig box, you're still billed for that at the end of the month.
What we do is we basically allow you to run the code and we pack all of these instances together. As instances are scaling, we'll go and move them around in general, right? So, that allows us to get an edge there, in terms of both pricing, as well as us doing our own bare-metal instances, which is both pricing and performance. But in terms of what makes the Docker process a bit more complicated, it's just another abstraction you have to wrangle in general. You have to go and figure out where to place these binaries, what are the permissions, which ordering, all of these other things. A lot of people don't know that Docker is layered. So, if you invalidate the cache at any stage, it'll invalidate the cache all the way down. Even the ordering of your Docker commands has an effect on the build times, the image output, all of these other things. There's a science to it on the surface, but there's almost an art to it where you have to, again, understand the underlying abstraction. And our hope is that we can basically just say, "Yes, you really just want to make sure that you have Node in here, and then you want to make sure that you also have Python in here, and you basically should select almost the packages that you want, if you've ever used something like Ninite." And then you have access to them, right? There's no messing with permissions. There's no messing with anything else like that. And that's what we've built the Nixpacks automated construction engine on.
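Jake's point about layer ordering can be made concrete with a toy model of Docker's build cache. This is an illustrative sketch, not Docker's or Nixpacks' actual implementation: each layer's cache key folds in everything above it, so changing an early instruction forces a rebuild of every layer below.

```python
import hashlib

def layer_keys(instructions):
    """Chained cache key per instruction: each key folds in the previous
    key, so a change near the top changes every key below it."""
    keys, prev = [], ""
    for inst in instructions:
        prev = hashlib.sha256((prev + inst).encode()).hexdigest()
        keys.append(prev)
    return keys

def invalidated(old, new):
    """Index of the first layer whose cache key differs; everything from
    there down must be rebuilt. None means the cache is fully reusable."""
    old_keys, new_keys = layer_keys(old), layer_keys(new)
    for i, (a, b) in enumerate(zip(old_keys, new_keys)):
        if a != b:
            return i
    return None if len(old) == len(new) else min(len(old), len(new))

# Conventional ordering: dependency manifest copied before the source tree.
dockerfile_v1 = ["FROM node:20", "COPY package.json .", "RUN npm install", "COPY . ."]
# Reordered: the source tree lands above npm install.
dockerfile_v2 = ["FROM node:20", "COPY . .", "RUN npm install"]
```

Under this model, a source-only edit to `dockerfile_v1` invalidates just the final `COPY . .` layer, while any source edit under the v2 ordering invalidates the `COPY` layer and forces `npm install` to re-run, which is the art-not-science cost Jake is describing.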
[0:08:53] SF: Some of the challenges I think organizations run into sometimes when they invest in a PaaS platform is that, if they are successful, they reach this graduation problem where they hit sort of the scale limits of that platform, and then they need to essentially migrate off of it, go to AWS, Google Cloud, or whoever directly, and stand up a bunch of the infrastructure and run it themselves. How do you avoid that with Railway?

[0:09:15] JC: So, Heroku famously had this graduation problem. It's one of the main things that we chat with investors about. The interesting thing about Heroku is that the thing that was really, really great for them is also, in my theory, the thing that killed them. Not like killed them - they're doing a billion dollars in revenue and they had a successful year and all of these other things. But I think as Heroku goes, we all know that there is a massive, massive, massive business to be made in there in terms of impact, right? And they were only just scratching the surface, right? Anyways, back to the original point about the graduation problem. I think the main thing that happens is you end up having almost this outsourced state problem, with that marketplace where it's like, "Oh, I need Postgres and I need Redis and I need to be able to deploy my - call it Ruby on Rails - API server, as well as my workers." Heroku is really, really great for the servers. You spin up the stateless things, right? And they'll scale up or down. Excellent. Then you end up going in and integrating with something like an external Postgres provider - or they provided one at one point or anything else like that, right? But it was a very, very bespoke offering. So, they hadn't solved the generalizable storage problem.
I think there are a few more primitives that have come out, namely eBPF, io_uring, a bunch of those other things, that allow us to solve these things at a more generalizable level on the storage stack, which means you can spin up anything. So, instead of bumping into those edges where it's like, "Oh, I want this specific thing, but Heroku doesn't offer it" - so I couldn't do self-hosted ClickHouse on Heroku, because there's no way to do, in the Kubernetes world of things, a persistent volume claim. There's no elastic block storage. There's none of those things. You end up bumping into those limits of that platform, and we've invested here right out of the gate - one of the first things, we really didn't even host code at the start. We were just a database provider, an underlying database where you could click, and it was one database at the start and then it was four databases. That's where we've invested: making sure that people can literally just do anything on the platform. Then we're making it really, really trivial for them to actually go and do that anything.

[0:11:10] SF: Can you explain this idea of infrastructure as Legos?

[0:11:12] JC: Yes. It's interesting. I think if you squint, it kind of already exists in terms of infrastructure as code. So, if you look at a Docker Compose file or a Helm chart or something like that, those are essentially infrastructure as Legos. They have a variety of environment variables that you have to provide. They have a variety of inputs and outputs in terms of endpoints that exist. And then they have a variety of services in there. They also have versioning that exists over time, right?
So, if you just assume that this thing kind of exists as this bucket, and this Docker Compose file has maybe four services - the aforementioned Ruby on Rails service, the worker, the Postgres, et cetera - that's now kind of a Lego that you can use and you can piece together, right? Because that API endpoint actually has an input, and there are environment variables that you can pull from. And there are environment variables that you can provide to. So, if you consider that as kind of a Lego block, then actually you can basically say, "Hey, I want to go in and import that thing and I want to use it as part of my project." We allow people to one-click deploy things like Strapi or Aki or any of the analytics toolkits that are open source. We're big proponents of open source. We have an open source kickback program where, if you build the template and people run the template, you get paid for what people are actually using. That's kind of the infrastructure-as-Legos piece of it, where you basically take that Lego and you drop it inside of your canvas, right? And then you can consume or interact with it, right? And this ends up solving a very, very interesting class of problems. It ends up solving authentication and authorization. It ends up solving sharding. It ends up solving security, because you're not managing this massive multi-tenant thing. It solves API versioning, which is super interesting, right? So, if you go and push changes, you can actually go and roll out those changes. And we have health checks for your services. So, if any of those changes were to actually cause those health checks to fail, the rollout would fail in general, right? You can actually almost split these things up over your canvas and consume them. That's the Lego aspect of it in general.

[0:13:03] SF: Can you explain in a little bit more detail how this helps solve something like auth?
[0:13:07] JC: Then you're kind of not talking with the public Internet, right? We built this IPv6 WireGuard mesh on top of all of our services. Essentially, you're not exposing your instance publicly, so you can't just talk with it externally. And you're also not at risk where somebody says, "Oh, potentially you've leaked your keys and now there's a publicly accessible endpoint." The best level of security that you could possibly have is you just can't get to it without SSO. That's the default level of security that we're trying to provide here, and I think it's inspired from the zero-trust mantra, almost, of saying, "Hey, let's give people the best experience, the best practices, right out of the box, and we'll make sure that your database is within single-digit milliseconds, ideally even hundreds of microseconds, from your instance. We're going to make sure that it's not accessible. We're going to make sure that you have really solid primitives to go and access these things." So, we do automated service discovery based on your name. So, if you have your analytics service that you've deployed internally, it's just analytics.railway.internal. You just make requests to it, and then you're the only one that can actually access that within that environment, right? That's kind of how it solves authentication and authorization, because you don't end up needing it. You don't end up having to put an NGINX server with basic auth or anything else like that in front of all your things, shoveling it into one place and then saying, "Hey, everybody on the company network, go and do these things." Then, invariably, at some point that ends up getting breached, and then you have a security posture problem on those things.

[0:14:33] SF: Right. It's kind of like a security-by-default approach. This is a zero-trust model, essentially.
Let's take the best practices, bake them in, so the guardrails are essentially in place, and that's how people will develop against it.

[0:14:43] JC: Yes, exactly.

[0:14:45] SF: So, can you walk me through, if I'm going to use Railway, what is that process like? Then can you explain what's happening behind the scenes?

[0:14:53] JC: Yes. So, this is funny - it rhymes with the "what happens when you type something into the browser" question, which is a very common type of question. When you go to Railway - so if you go to dev.new, we will drop you on a page that allows you to basically say, give me a Postgres instance, or deploy my GitHub. If you hit a Postgres instance, what we do is we go to a fleet of servers that exists all across the world. We will go and make a claim for that volume, create it, and then go and bind that instance there. Then we'll return to you that running instance, which you can access either over the private network, or you can click generate public URL and it will generate a public URL for you. That's the stateful storage one. If you go and do the GitHub one, what we do is we basically will parse your repository and figure out what applications you might have in there. Maybe you have a Dockerfile, maybe you don't have a Dockerfile. If you have a Dockerfile, we'll obviously just use it. If you don't have a Dockerfile, we go down this tree of decision-making where we say, "Do you have a package.json?" Because if you have a package.json, it's very, very likely that you have a Node application in here. So, let's pull in some of that information. Or, do you have a requirements.txt? Okay, cool. That's obviously Python, and stuff like that. So, we have this tool that we built - again, this is the Nixpacks engine. It's all open source. It's on our GitHub if you want to have a look at it. It's super cool.
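The decision tree Jake sketches - first-match wins over well-known marker files - could look roughly like this. The Dockerfile, package.json, and requirements.txt cases are the ones he names; the other marker files are illustrative additions, and the function is a simplified sketch, not Nixpacks itself.

```python
# Ordered marker-file checks, mirroring the "do you have a package.json?
# ... a requirements.txt?" decision tree. First match wins.
PROVIDERS = [
    ("Dockerfile", "docker"),        # an explicit Dockerfile is used as-is
    ("package.json", "node"),
    ("requirements.txt", "python"),
    ("go.mod", "go"),                # extra markers: illustrative, not from the episode
    ("Cargo.toml", "rust"),
    ("Gemfile", "ruby"),
]

def detect_provider(repo_files):
    """Guess a build provider from the file names at the repo root.

    Returns None when nothing matches, i.e. the platform would have to
    fall back to asking the user for explicit configuration."""
    names = set(repo_files)
    for marker, provider in PROVIDERS:
        if marker in names:
            return provider
    return None
```

The real engine goes further (it also infers build and start commands from the manifest contents), but the ordering matters the same way: an explicit Dockerfile short-circuits all inference.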
It's this Rust engine that we built, and it will go and essentially figure out: what is your build command? What is your start command? What are all of these other things? And you can modify them after. But the whole goal here is to almost compress all that knowledge so that the user, when they go to the platform, they just say, here's my GitHub repository. We say, "Excellent, it's already deployed." And then you say, "Wait, what do you mean?" Because that's not the default experience of going to a cloud provider. You have to fill out reams of forms and select a region, right? In the AWS dashboard, you go, "Oh, the name, that's pretty good." And then you're like, "Oh, which flavor of Linux do I want?" And you're like, "Oh, okay, I guess I have to pick that now." Then you go through these rings of stuff, right? So, our whole goal is to take all of that config and push it till later. Anything that you can do later, you should want to go and do later. Even something like setting a region for a database - we built a system that allows us to move these volumes around. So you spin up something in the region that's probably closest to you, which is a good default. If I'm spinning up from San Francisco, I'm probably in US West. If somebody's spinning up from London, they're probably in the Amsterdam servers, right? So, we pick that, and then if you really want to go and move it, you just move it later. We take all the config and we push it later.

[0:17:12] SF: Then where is this all running? Is this running in my account, or are you sort of running this behind the scenes in Railway's account on a public cloud?

[0:17:23] JC: Yes, so we're running on a few different servers at this point in time. We have some straddling between Google Cloud and AWS, as well as our own bare metal servers that we spun up over the last year, basically. It runs on a variety of the servers that we have.
And ultimately, the only time you should care is if you're trying to potentially pair it with something externally. So, let's say you have, I don't know, a Supabase instance somewhere and you want it as close as possible, because ultimately, the only thing that matters is that your compute is beside your storage, just from a latency perspective, because the database calls are so quick. Then you can potentially go in and select that region and say, "Oh, I actually want to run it on this specific class of instances." Barring any sort of failover, we will go and do that, because we built the orchestration engine to go and manage and drop the instances right beside it. The short answer is, it could be running anywhere. Ultimately, you shouldn't really care, except for those latency reasons, at which point you can add that constraint later.

[0:18:15] SF: How does Railway handle the kind of distributed system dependencies that you can run into? If you have a bunch of services, this can get pretty complex. Errors can occur. How do you manage that aspect?

[0:18:29] JC: Yes. I think in traditional systems, it really depends on the class of error, right? Because there are a variety of different errors that can occur. There's: I pushed bad code and I flunked the instance and it got past all of the health checks. So, we give you a one-click automated rollback - we'll keep the container around for a little bit. You click that and you say, "Hey, listen, my health checks didn't catch that. Something is down now. A user is reporting something." You click that, and immediately you're back. We have automated health checks for going and managing these instances. So, you define a health check - you just say /health and what it should return. Then it's like, "Oh, okay. I updated my Redis library and it no longer is able to communicate with Redis for some reason." Okay, cool. Now that fails.
So, we'll just actually flunk that deploy and then notify you, whether you have emails turned on, or through the in-app inbox, right? At some point in the future, we'll do an app. Basically, get really, really close to notifying you as quickly as possible. And then there are also things that we've done on top of it in terms of solving classes of problems that are kind of interesting and only happen at scale or complexity. Assuming you have a ton of different microservices, let's say that you've modified your gRPC or something like that - I think gRPC is a good one, because it's backwards compatible. But let's say you've modified something and created a breaking change. You've modified a field on a GraphQL endpoint. You go and roll it out, things start flunking. What we'll do is we'll actually do a dependent rollout. If you are saying, "Hey, my front end communicates with my back end," and then my back end starts failing when it rolls out, we're going to flunk the whole class of that deployment, because you've made a change to your front end and your back end. So, we've done a bunch of things in the application that basically say, just consume the criteria or config from various different services, and we'll almost construct for you a dependency graph to go and automate any of these things. You don't have to do the Bazel thing where you're defining your dependencies as this, and then you forget, and then something happens. Or Turbo does this, I think, as well in terms of build stuff. You don't have to do any of that. You just consume the properties of the services, and then we will go and automatically figure that out, including cycle detection, which is super cool.

[0:20:26] SF: What about a canary rollout?

[0:20:28] JC: Yes. So, we can do canary rollouts. You can just do it on the command line if you just do railway up. There's also kind of a CLI piece to it.
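The dependent-rollout idea Jake describes reduces to two graph operations: order deployments so dependencies land first (detecting cycles along the way), and, when a service fails mid-rollout, flunk everything that transitively consumes it. A minimal sketch using Python's standard-library `graphlib` - illustrative only, not Railway's orchestrator:

```python
from graphlib import TopologicalSorter, CycleError

def rollout_order(deps):
    """deps maps service -> set of services it consumes,
    e.g. {"frontend": {"backend"}}. Returns an order that deploys
    dependencies first, or raises ValueError on a dependency cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as e:
        # e.args[1] holds the detected cycle as a list of nodes
        raise ValueError(f"dependency cycle: {e.args[1]}") from None

def flunk_dependents(deps, failed):
    """If `failed` goes down mid-rollout, everything that transitively
    consumes it should have its rollout flunked too."""
    flunked, frontier = set(), {failed}
    while frontier:
        frontier = {s for s, uses in deps.items()
                    if uses & frontier and s not in flunked}
        flunked |= frontier
    return flunked
```

With `deps = {"frontend": {"backend"}, "backend": {"db"}, "db": set()}`, the rollout order puts `db` before `backend` before `frontend`, and a failure in `db` flunks both consumers - which matches the "flunk the whole class of that deployment" behavior described above.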
We have, obviously, the canvas dashboard and anything else like that. But we also have a command line, because people like to interact with services in various different ways. So, the command line is really useful for a bunch of different things, including that.

[0:20:46] SF: Okay. Then, I think one of the challenges companies typically have with whatever sort of cloud resources they're using is they'll sometimes essentially provision too much or potentially too little, which will lead to problems. They're spending too much, or maybe they provision too little and then they run into challenges with throughput or latency or something like that. How do you solve for that?

[0:21:08] JC: I think that's a really important thing to solve for. I think it's also a thing that is a core differentiator for us versus other platforms, in the sense that the aforementioned orchestration engine that we've built allows us to basically only bill for what you're using. So, in a production workload, that's pretty cool, because assume that you have a 2x standard deviation spike, or 10x because you're on the front page of Hacker News or anything else like that. That's fine. We can go and handle that. We can go and scale up the instances. We can scale them down. We can go and do that for you. The part where it becomes, I think, even a little bit more interesting is when you end up with pull request environments that can be, like, "serverless," right? I use the word serverless in quotes - I know we're not on video, so you can't see the air quotes - but in a way that basically allows you to send a request to it, the request will spin up the container, the request will be filled, and then the container will be finished. So, we can actually go in and construct that parallel environment of yours, with a copy of the database volume pretty soon, which is super cool.
We'll roll that out in Q1 of 2025, as well as serverless spin-up of those parallel environments. Instead of spinning up something that's like, "Oh, I need 32 gigs for this service, I need 24 gigs for this other service," and you spin them up and you leave them around over the weekend and you just incinerate money for these things that are idle, we will actually only spin them up for the time that you need and only charge you for the usage that you have. So, ultimately, that means that you're going to avoid those random errant $1,500 bills because somebody forgot to actually terraform apply off of master instead of the staging environment that they were testing. I think that having that posture in place by default means that companies get a lot more cost control and a lot more benefit. But they still get the ability to move extremely quickly by having the ability to create copies of their environment.

[0:22:54] SF: How do monitoring, logging, observability, these types of things work?

[0:22:59] JC: We have templates that people have built that allow you to exfil logs to Datadog. We have a template for spinning up Grafana, or VictoriaMetrics, Prometheus-compatible instances. So, you can do all of those things. We've also built, from the ground up, an observability system inside of Railway. Since we have the edge network, we automatically can add a request ID, so we can give you distributed tracing by default through all of your microservices. We can give you alerting for if things spike or stay high or anything else like that. We can give you information that you wouldn't have in other environments without doing, for lack of a better word, a ton of plumbing. That's the thing that we've built from the ground up internally at Railway. I think that, obviously, Datadog is a massive business and there's tons and tons of stuff in there.
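Edge-injected request IDs work roughly like this: the first hop mints an ID if one isn't present, and every service copies the header onto its outbound calls, so logs from different microservices can be joined on it. A toy sketch - the header name is a common convention, not necessarily what Railway's edge actually uses:

```python
import uuid

HEADER = "X-Request-Id"  # common convention; the real header name is an implementation detail

def ensure_request_id(headers):
    """At the edge: mint a request ID only if the incoming request
    doesn't already carry one (returns a new dict, input untouched)."""
    if HEADER not in headers:
        headers = {**headers, HEADER: uuid.uuid4().hex}
    return headers

def forward_headers(incoming):
    """Inside a service: copy the tracing header onto any outbound call,
    so every hop logs the same ID and traces can be stitched together."""
    return {HEADER: incoming[HEADER]} if HEADER in incoming else {}
```

Doing this at the edge is what makes the tracing "by default": individual services never have to generate IDs themselves, only forward a header, which is exactly the plumbing that otherwise has to be wired into every service by hand.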
So, we're kind of straddling more of the 80/20 of: let's just give people the baseline amount of things that they want. And over time, we're going to go in and ask them, "Hey, what else can we give to you?" But we have people who are just using Railway entirely, in terms of build, deploy, observe, scale, all of those pillars, and are actually extremely happy using just that. I think it's bare bones in terms of where we want to take it right now. But it's super exciting to say, "Hey, listen, we have all the building blocks right here, and we just need to work with our users to scale to the things that they really, really want."

[0:24:17] SF: What would you say are some of the constraints or limits today?

[0:24:20] JC: Constraints or limits. That's an interesting one. I would say that - I mean, it's going to sound weird, but there aren't really any constraints or limits right now, and that's because we've really tried to solve that Heroku problem of, "Oh, I'm going to outgrow this thing." I would say that we're almost limited by trust, if that makes sense. You have AWS, you have GCP, and you have Azure. Those are the big clouds. Barring anything, those are the big clouds. You have Cloudflare that's trying to do things. Cloudflare is a $30 billion organization. They've been around for a decade plus. They've accumulated trust. They're the meme - whenever Cloudflare has an issue, it's like a software engineering snow day, right? They've continued to build that trust over time and they're still working on it, right? For us, I would say the main thing that's the limiting reagent right now is not what the platform can do, but it's almost how much you can trust it, right? Because when it comes to software infrastructure, it's your livelihood. There's not much more that you can entrust to people. It's your data.
It's the fact that if that thing goes down, especially with your data, you're SOL until these people get back to you. So, you're kind of hanging in limbo, et cetera. It pulls more towards the quote of, nobody got fired for buying AWS. Nobody got fired for any of these other things, right? I think that's the main thing that we're consistently working with companies on and saying, "Yes, you're going to get all of these benefits, and we're going to give you a higher-order level of reliability, and we're going to give you better service and SLA turnarounds than the larger clouds." That ends up being a very difficult battle to have with people because they just say, "That sounds like BS," and you have to just show them over time. It's like, "No, we're going to continue to work on that and we will be available for you should anything occur, and we've also designed a system such that there are less and less fault points as you go." [0:26:10] SF: If you're a founder of a business and you're building out a new product, then it could be a lot to bet your product life on another, essentially, startup, going back to the trust challenge that you're talking about. [0:26:24] JC: I think, actually, with startups, really, what we call from our growth master plan right now is we're stretching a market, right? So, startups, like our current ICP, the normal distribution of that go-to-market, is actually 15 to 50-person teams. Those people seem to love us, right? They seem to be able to move a ton of different stuff over. Maybe it's not literally all of their infrastructure footprint, but it's a large swath of things that are no longer legacy. And they're basically saying, "Listen, we want to move really, really quickly on these things."
It ends up being those large organizations that move a little bit slower and want that higher-order trust bit really, really flipped. So, when you start going to those organizations, most of the sales motion at that point ends up being, how do we get past any trust or compliance or whatever objections, which we've gotten really, really good at, and show you the value, to tie it to maybe one of your top velocity initiatives, where we just want engineers to be able to ship faster and get more done. [0:27:23] SF: So, if you are a larger organization, let's say that you're able to establish that trust with them, what is a starting point for them? Obviously, if I'm on a hyperscaler cloud today, I'm not going to go and replace everything. How do I get started, essentially, in a way that doesn't require me to boil the entire ocean? [0:27:40] JC: Yes. The nice thing about Railway is you can incrementally adopt it. If you have services that you want to go in and spin up internally and you want to pair them with services that already exist, you can do that using something like Tailscale, or we have a WireGuard binary that allows you to mesh it into your instances over there. We can also do potentially dedicated instances if you want, after chatting with you. There's a variety of different ways you can get started, but the main point is that people just incrementally adopt it. They'll start with something. Usually, it's an EM messing around on the weekend to basically say, "How do I go and explore a couple of these things that I know are being pitched as these faster alternatives to X? But I need to be able to know that it's good." One, it's going to satisfy that faster-alternative-to-X criterion that I'm looking for. And two, it'll be solid for us to go in and make a case in the future.
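[Editor's note: to make the WireGuard mesh idea concrete, here is a minimal sketch of the kind of peer config that lets a new service talk to existing instances over a tunnel. All keys, addresses, and endpoints are placeholders, and this is generic WireGuard usage, not Railway's actual tooling.]

```python
def wireguard_peer_config(private_key: str, address: str,
                          peer_public_key: str, peer_endpoint: str,
                          allowed_ips: str) -> str:
    """Render a minimal wg-quick-style config for one mesh peer.

    All values here are illustrative placeholders; a real setup would use
    keys from `wg genkey` and the addresses of your actual instances.
    """
    return "\n".join([
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"Address = {address}",
        "",
        "[Peer]",
        f"PublicKey = {peer_public_key}",
        f"Endpoint = {peer_endpoint}",
        f"AllowedIPs = {allowed_ips}",
        "PersistentKeepalive = 25",  # keep NAT mappings alive between peers
    ])

config = wireguard_peer_config(
    private_key="<generated-with-wg-genkey>",
    address="10.8.0.2/24",            # the new service's mesh address
    peer_public_key="<existing-gateway-public-key>",
    peer_endpoint="203.0.113.10:51820",
    allowed_ips="10.8.0.0/24",        # route only mesh traffic through the tunnel
)
```

The key design point for incremental adoption is `AllowedIPs`: by routing only the private mesh range through the tunnel, the new service reaches existing instances without any change to how the rest of your traffic flows.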
So, what we do essentially is we go and we pull telemetry from people as they're signing up, and when they start getting to points where they're activated, we basically just say, "Hey, if you want to go and chat with us, you can chat with us over Slack. You can chat with us over email." We want to be really, really available without being all up in their face. [0:28:44] SF: Then you've open-sourced a significant portion of the Railway production stack. What was the motivation behind that? [0:28:50] JC: I think the motivation for us is that we want that trust. So, if we can give you this ability to introspect the service, maybe even self-host it in the future or anything else like that, then realistically, there's not much more you can trust if you can see the guts and the internals and anything else like that. That's the main motivator, I would say, in general. It's also obviously excellent to have members of the community be able to go and contribute. We have people who are submitting Nixpacks PRs all the time, to go and add new versions or new providers or new languages or new dependencies or anything else like that. So, getting that tailwind of open source (not Tailwind the CSS framework, but the benefit of the open source community being able to go in and help us build this together), that's, I think, a big key. Also, Kubernetes ends up being open source. You kind of have to go and meet people where they are. If they're self-hosting some of these things, at some point you need to be able to allow them to self-host at least the data plane, right? [0:29:48] SF: Does that help also overcome some of the trust issues? [0:29:51] JC: I believe so, yes. Because you can see everything that people are doing. You can see what they're doing on the instances. You can see what code they're running.
You can see all of the other things. So, I think ultimately that really helps with the trust issues, or maybe not issues, but the trust problem there in general, right? But at the end of the day, I think the main thing that you can do to make sure that people trust you is, once the stuff is up, keep it up and keep it running, and just make sure that it's a bulletproof experience. We've been working, especially over the last six months, to make sure we're starting to get multiple nines on the board. Okay, cool, this is a level of reliability, especially with our new bare metal instances. We haven't had any issues so far, so obviously, knock on wood there, but we're building something that you have a higher order of control over so you can get that level of reliability, so people can say, "Actually, I've used it for X. I don't really have a ton of problems with that." Then we've also scaled the cloud version of it. We're doing tens of billions of requests per month on the edge proxy. The orchestration engine in the cloud is managing two-plus million microservices on, I think, four clusters, right? So, that's been built out so that whenever we go to have a conversation with larger companies like Uber or anything else like that, 5,000 microservices, that's a lot of microservices evidently, right? But the cloud version of what we built has scaled to two million. That's a few orders of magnitude more. We can say, "Hey, if we were to go and do a self-hosted version of this for you, we can do that level of scale." [0:31:17] SF: You previously worked at Uber. Did any of that experience inform or motivate you to start Railway? [0:31:23] JC: Yes, definitely. There was this walled garden of platform teams where some people were responsible for getting code deployed and stuff like that. I would go and interact with these things at work, and it would be fine.
Then I would go and interact with them at home, and they'd be varying levels of less than fine. It really seemed that Uber was a high-growth company at the time, still is. It's massive, and they had the ability to retain the best engineers to go in and build this thing. I still could potentially see ways that it could be significantly better just in terms of unlocking developer productivity. I would say that that is definitely a driving function for making the system. [0:32:05] SF: So, you've been working on Railway for what, almost five years? [0:32:08] JC: Yes, about five years. I think four and a half now. [0:32:10] SF: As a first-time founder, you're running 100% remote. Was that a conscious decision? [0:32:17] JC: Yes, it was a conscious decision. So, I've worked remotely since 2015, maybe. I've always really, really strongly enjoyed working remotely. It's definitely not for everybody. I think it requires people to be almost intrinsically motivated about the problem. People have to really, really like the thing that they're working on, and they have to see the vision, where it goes, and everything else like that. Because you don't get a lot of the benefits of being in an office, of being able to breathe down people's necks and say, "We've got to do all of these things." So, I was chatting with one of my friends the other day, and I said, I think maybe one of my most controversial opinions is, despite running a remote company, I think it's a terrible idea for probably about 90% of people, right? Because you have to be extremely deliberate and you have to hire people who are really, really excited about the problem space, who are going to self-manage and go and do all of these things. That's what we've done so far.
We've got 25 people who are remote, spanning all the way from the west coast of Canada. I'm from Vancouver Island, as we previously mentioned, and I'm in San Francisco now. Then it expands all the way out. We've got Thailand, we've got Dubai, we've got Japan. There's a lot of different time zones that we're spanning in general, and so we've had to hire people who are autonomous and who can push these things. That's been excellent for leverage, but it does also mean that there's a specific class of individual who does really, really well at remote companies, and there's a specific class of individual who does well at in-person companies. It's kind of more of a chocolate-or-vanilla thing. But yes, it was definitely a conscious decision, because I do enjoy the benefits that remote has. Not in a you-can-sit-around-and-twirl-your-thumbs way, but you can meet with people at various different times in the day. Their morning can be your evening. So, you can almost have this time-compression handoff of saying, "Oh, we've really got to go and do this." And then you go to bed, and then you wake up and it's done, and you're like, "Damn, I love working with excellent co-workers." It means that you almost get twice as much time in the week if you can get these handoffs 100%. [0:34:18] SF: In terms of hiring people that are extremely motivated by the problem and also are able to self-manage, I feel like that's kind of a requirement for any relatively small-stage startup, because you just can't afford to have to micromanage people in order for them to do their job. You have to hire people who are motivated to be there, believe in the mission, and can also essentially just get things done and understand what needs to be done. [0:34:41] JC: I totally agree, 100%. The corollary with that is that most people don't become good managers at the start, especially first-time founders.
You'll hear this from Emmett from Twitch, who's talked about how he was not a great manager originally. I think Brian or Joe have also talked about it from Airbnb. It almost forces you to develop those skills while you're also trying to assemble the airplane. You don't get the benefit of being able to smear this over time and learn those skills as you go through all of these stages. You really have to compress a lot of those learnings, and you have to do it in there. So, that's the only thing that I generally see. Yes, you'll have to do all things eventually. But we also talk a lot internally about focus. How do we stay focused on just doing the things that we can be number one in the world at, right? Because that's where we're going to have all of the compounded returns. Those are some interesting benefits and drawbacks to remote that I think maybe people don't necessarily consider. They consider the, yes, sure, I can go out and go for a bike ride in the middle of the day and work slightly later and stuff like that, and that's copacetic with my operating cadence. We have people who have families, and they spend the afternoon with their families, and then they'll come back and polish up the stuff once the kids are in bed, right? Personally, I'm a big fan of treating people, and it's going to sound a little bit condescending, but treating people like adults, because everybody is an adult, right? They're going to manage their own time and they're going to go and do all of these things. You shouldn't be hovering around and basically saying, "Oh, I'm going to approve all your PTO requests. I'm going to go and do this." Because again, those kinds of people that you mentioned, they're not going to be successful at startups in general.
So, we've almost erred on the side of, let's just pretty rapidly remove any of the guardrails in terms of onboarding and say, "Hey, this is how we operate. If you really, really like it, here we are. And if not, then we can hopefully find a really, really awesome spot for you to go and land in the future." [0:36:31] SF: How do you test for that? How do you find these people that are going to be a fit? [0:36:36] JC: Yes. So, there's a couple of problems that I like to ask people. There's a couple of open-ended technical problems in terms of design. I don't think you test for this using LeetCode. [0:36:47] SF: There's a lot of problems with using things like LeetCode. [0:36:49] JC: Oh, yes. We prefer to go for maybe the Montessori school of management, of open-ended problems, of saying, "Hey, listen, how would you solve this class of real-world problems?" I think that's one way to go and do it. I think also sitting down and chatting with people about what drives them. Are they passionate about X? Stuff like that. It doesn't need to necessarily be DevTools, right? It could literally just be, "I'm so passionate about networking. It's just networking. I've got a rack of switches in my basement," and all these other things. I think if people have that thing that they're really, really passionate about, you can almost see it. It's going to sound super holistic, but you can almost tell when people really, really care about these things, because the moment you poke them, it kind of expands, and you're like, "Oh my God, that's so much stuff. Where did that all come from?" And where it came from is this deep passion, this life experience, all of these other things. So, that's one way to go and test it in the interview.
And then as part of onboarding, what we do is we rapidly remove the guardrails as you go. We have six weeks of onboarding, which may seem long, but the goal is almost pushing out the funnel of what you can do. Because by the end of the six weeks, we consider onboarding to be, how do we get you from good to doing great work, and then being able to be fully autonomous, right? So, that's the whole goal. We do two weeks of tasks. It's five tickets, you just have them in Linear, and they'll be really straightforward. It's like, "Hey, this thing is actually broken." You go in, you make a few lines of code changes, and that's your first two weeks. And then you move on to the problems, which is, "Hey, our cron experience sucks." Very different class of problems that we've just given you. It's like, "Well, what sucks about it? What are the problems?" All those things. Then people have to go in, and maybe they'll ask their co-workers, maybe they'll go and ask the users, maybe they'll go and put together a document that says, "I think these are the things." And then people will be like, "No, what about - we should probably drop this requirement." And then they'll go and solve those problems. Then the third one is opportunities, which is, you've been here a month, what do you think the company needs? That's the only prompt. You get to that open-ended line of thinking by the end of it, and that pushes people more towards, "Oh, I can actually do pretty much anything here. So, it's just a matter of what." [0:38:58] SF: Yes. Back to what you're talking about, where you're trying to look for what the person is really passionate about or interested in, there's a certain obsession behavior that I think people need to be successful in startups.
It doesn't even need to be historically an obsession that relates to what the company is doing, but you have to be manically obsessed about solving something in your life, because a lot of it is really about knocking down doors and solving problems. The level of abstraction also hits you much faster earlier in your career, I think, at a startup, because there are just fewer people around. To your point, by week six, you're asking questions about what the company could be doing better. If you're at a really large organization, and I've worked for some, you could be in that first stage of, here's a really small problem that you need to solve, and that could be the first two years of your existence there. [0:39:53] JC: Yes, right. So, we aim for trying to find those people. I think even when it comes down to solving problems excellently, right? That last 5% of the problem is where most of the progress is, right? If you're not focused and you're consistently oscillating and bumping between all of these different things, you're probably not going to get to the meat and potatoes of what that problem actually is, right? I'm a very, very big believer in sitting there and running a ton of different revisions, figuring out why these things were either better or worse than your previous revisions, and making sure you have a clear goal that you're iterating towards. I think that doing that and doing that well, that focus and that passion, is almost a necessary precondition. It unlocks pretty much all of those other things, and I don't see how people can do it without it. I've seen it once in a blue moon, but it's also very, very rare. So, we aim for saying, "Okay, well, how do we generalize this class of problem of finding these intrinsically motivated people?"
And saying, "This is in general the archetype of individual that we've seen be quite successful here." [0:41:01] SF: Yes, I mean, there's that quote about, I don't remember the exact quote, but it's like, "We've completed 90%, so now we get to start on the other 90% of the project," essentially. It's really that last 10% where all the hard work, the meat and potatoes, exists. Whether it's a product or whatever it is that you're building, that's where the polish happens. That's where you run into your scale problems and so forth. [0:41:23] JC: It's also important, not just from an individual perspective, but from the perspective of working with other individuals, because that last 10% is the hardest. That last extra mile of going and doing the thing, it's like, "Yes, well, this is good enough. It'll work." If you work with people who have outsized talent or outsized standards or anything else like that, they'll basically say, "No, we can do better. We can push this thing a little bit farther." Then they'll have some tools in their toolkit. They'll have some skill or some emotional acumen or some way to judo your thinking, basically saying, "What if we thought of it like this? I think this is really, really close, and this is the reason why it's really, really good. Can we try this?" Versus, I think, the conventional thinking at larger companies, which is, "Okay, cool. It's done. Let's just move on to the other thing." [0:42:04] SF: Awesome, Jake. Thanks so much for being here. [0:42:07] JC: Cool. Awesome. Thanks so much for having me. This was great. [0:42:10] SF: Cheers. [0:42:10] JC: Cheers. [END]