[0:00:00] CO: Platform engineering is difficult to get right. In the age of DevOps and cloud computing, software developers increasingly serve as platform engineers while they're building their applications. This can be an engineering challenge, because organizations often require their platforms to provide fine-grained control and compliance management. Cory O’Daniel is the CEO and Co-Founder of Massdriver, which he started in 2021 with the goal of helping engineering and operations teams build internal developer platforms. Cory's company was in the 2022 Y Combinator class, and he has been hard at work developing his platform. He joins the show today to talk about how he thinks about platform engineering and the challenge of abstracting away infrastructure. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com, and see all his content at leeatchison.com. [INTERVIEW] [0:01:32] LA: Cory, welcome to Software Engineering Daily. [0:01:34] CO: Hi. Thanks for having me. I really appreciate the time. [0:01:37] LA: Great. Thank you so much. I've been trying to think of a short phrase that describes what your company does. What I've come up with so far, and I'd love your thoughts on it, and have you completely change it into what you think is the right thing, but what I've come up with so far is you want to give your customers a simplified, automated, consistent infrastructure. How does that sound, or what would you change in that? [0:02:03] CO: Yeah. I mean, that's pretty close.
What we want to do is give people the ability to create standardized compute environments with guardrails and enough flexibility, so engineers can self-serve quickly, right? The reality is in platform engineering, there's two users, and they both have to love the thing, right? The developers have to enjoy using the platform and the operations engineers have to be able to support and build that platform. What you see in mid-size organizations is this is hard. DevOps has become very hard to do nowadays with 100, 200 AWS services. It can get inundating to do that work, and so our goal is really to be the malleable layer between your operations and your software engineering team, allowing them both to focus on what they're good at and let the platform take care of some of the more mundane things that people don't want to handle, like registering and tracking metrics and alarms, controlling costs, tagging stuff, naming conventions. Let operations engineers focus on writing Terraform, writing Pulumi, writing OpenTofu, packaging their compliant, secure infrastructure into a platform for their engineers to just grab what they need and get right back to writing software. [0:03:13] LA: You mentioned the two groups, the developers and the ops engineers, and you write about that, but there really is a third group, too, isn't there? That's the management. Because they care about things like costs and compliance. [0:03:25] CO: Yeah, for sure, for sure. I would say that today, that's probably the group of users that we underserve. That's actually something that we're actively working on. We've had some feedback from some operations teams that that is what their bosses are looking for, right? And some of their team is looking for. What's interesting is we have a lot of really interesting data about some of these compute environments, and we can present it in ways that some other platforms can't.
We do have some cost and mean-time-to-resolution functionality that's going to be coming out around re:Invent in the new year. That is absolutely key. There's a lot of people that need to understand what's running, how it's running, who's deploying what, where is it running, right? We've got global customers. That's really the goal of the platform: we want to give something visual that's low fidelity, that anybody can understand. Even some of your management team that may not have recent operations experience. [0:04:19] LA: I guess, you could argue a fourth group, but one you could easily lump into one of the other three, though, is security, right? [0:04:26] CO: Yeah, security is definitely a big one. I think that's one of the things. To me, that's one of the big key differentiators. A lot of people feel like DevOps is being rebranded as platform engineering. I don't think that's the case. I've definitely got some hot opinions on DevOps. I am the "DevOps is bullshit" guy. I do have some opinions there. I don't think that it's a rebranding. I think it is grown-up DevOps, right? It's scaling DevOps for some of these orgs. It's addressing some of the resource constraints that we have. We don't have enough people with operations experience. That's what's actually offensive to me about the idea of DevOps. It makes it sound like anybody can do ops, but the ops job is hard. Ops people are rare. I think that we need to pay some respect to those roles. That's one of the things that really grinds my gears about the term DevOps and what it's become. But when it comes to platform engineering, I think it goes a little beyond that, right? It encompasses orchestration. It encompasses getting things from the cloud. It's not just the CI/CD infinity loop anymore. I've got to get some databases. I've got to get queues. I need KMS keys.
There's so much tangential stuff, and then security and compliance on top of it, that it's very hard to just jam into this portmanteau that we keep adding to. We've got DevOps, we've got DevSecOps, we've got MLOps now, right? How many more words are we going to jam in there? We're shifting a lot of the accountability left, but we're not shifting the experience left. We're just expecting all these things. [0:05:56] LA: That's actually a great point. In fact, I would argue that one of the goals of the early cloud was to try and solve that, right? One of the goals of the cloud was to make it easy to get an infrastructure up and running. In fact, in the early cloud, it was a lot easier to do that than it is now. But the cloud has matured, the cloud has grown, the cloud has become very, very complex. Now, you need a special level of qualification and certification just to understand what the cloud can do, let alone build an infrastructure. It really isn't reasonable to expect developers on their own to do everything it takes to build an infrastructure. They just can't do it anymore. It's just too complex. [0:06:42] CO: Yeah. What's interesting though is they do know what they want, right? They can look at AWS, or Azure, and they can say, “Oh, I need Aurora. Maybe I need a highly available PostgreSQL running in Aurora.” But should they be inundated with what high availability means? What disaster recovery means? There's a lot of operational know-how there. You can say, yes, build it, you run it. Go figure it out, engineer. But now that person's not delivering business value, and that's their job. That's the engineer's real job: to deliver business value. As we pile this DevOps term on them and say, “You've got to figure out more and more and more of the cloud,” we're delivering features less, right? That's the trade-off, and it feels icky to me.
That's why I really see the boon of platform engineering, if we can make it accessible to everybody. Again, the real rub there is we don't have a ton of operations people on the planet. Less than 10% of people in the most recent Stack Overflow poll said that they have cloud operations experience. That's scary. In a world where everybody's moving to the cloud, it turns out, not a lot of people consider themselves experts in it, right? Less than 27% of orgs are even using IaC. That's from the latest survey on continuous delivery. These numbers are a little freaky in a world where literally everything is software now. All of our businesses are slowly becoming software companies and we're all trying to move to the cloud. We're moving there quickly. We're seeing whoever on the team can do it, do it, and yeah, we'll deal with security and compliance when there's a breach, right? We'll deal with security and compliance when an auditor is mad. Can we have that stuff upfront? Can we deliver software safely and securely? That's what we really want to be able to help people do. [0:08:19] LA: I agree with everything you've said there. I think a lot of it still comes back to the complexity, right? It isn't easier. It isn't less complex to build an infrastructure now than it used to be. It is actually more complex. There's more things you need to worry about. Some of it is because IaC is a thing, and that has value, but it also adds complexity. Part of it is because security is more important now, and things like the principle of least privilege are very hard to get right if you don't know what you're doing. A focus on things like that adds substantial complexity to what seems like a simple matter: I want a server, I want a database, connect them. Well, no, that's eight things that need to be connected in order for that to happen. [0:09:07] CO: Yeah. I'm going to share a little secret here, but one of our outbound strategies for finding customers is we actually look at what people are hiring for.
We do a lot of digging through people's job postings, like, what cloud services, etc. It's funny, talking about the principle of least privilege, there's this role that has started to crop up over the last couple of years: an IAM engineer. It jumps out at you. It used to be so easy. The cloud was a couple of VMs, maybe an SQS queue, back in 2008 or whatever. Everything was super easy. DevOps was plausible. Now, we have an IAM engineer. People will shake their head and be like, “Oh, that's ridiculous.” But IAM can be hard, right? [0:09:44] LA: It is hard, period. I mean, to get it right, it's very hard. [0:09:47] CO: It's like, a lot of times we'll do something like, okay, we're tinkering around with it, we're trying to just get this to work. I'm doing this DevOps thing. I've got 40 hours this week to ship a feature. I've played around with this cloud service for 10 hours. I've got it configured and working, right? I set everything to admin on the role, gave it as many permissions as possible. It's working. The feature's got to roll out sometime this week. I'll get to clamping down on that IAM later. Let's throw that in the backlog, right? I feel like that is a very common approach to how we go about – [0:10:16] LA: Just put a star there and I'll start working. [0:10:19] CO: Yeah. Yeah. [0:10:22] LA: Stars make IAM easy. Just use this, an asterisk. [0:10:26] CO: IAM engineers, just throw an asterisk in. [0:10:29] LA: Yeah, exactly. [0:10:30] CO: I'm working on our swag for re:Invent right now. That might become a sticker. [0:10:34] LA: Perfect. Yeah. No, you have to send me a copy if you do that. That's the truth, though, right? I mean, "I'll worry about this later" means I'll just open it up to a much broader number of people who can connect to, let's say, my database, because it's easier to connect to it if I don't have to worry about all these restrictions. I'll worry about putting the restrictions in place later, and they never get around to doing it.
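The "just throw an asterisk in" shortcut being joked about here, versus a least-privilege alternative, can be sketched in Terraform's AWS provider. The queue name and account ID are hypothetical; this is an illustration of the trade-off, not anyone's actual policy:

```hcl
# The "asterisk" shortcut: works immediately, but grants every SQS
# action on every queue in the account.
data "aws_iam_policy_document" "too_broad" {
  statement {
    actions   = ["sqs:*"]
    resources = ["*"]
  }
}

# Least privilege: only the actions this service actually performs,
# scoped to one hypothetical queue.
data "aws_iam_policy_document" "least_privilege" {
  statement {
    actions   = ["sqs:SendMessage", "sqs:GetQueueAttributes"]
    resources = ["arn:aws:sqs:us-east-1:123456789012:orders-queue"]
  }
}
```

The first version is the one that ships when the feature is due Friday; the second is the one that ends up in the backlog.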
Many, many, many security vulnerabilities are created using that basic mindset: I'll get around to dealing with security later. [0:11:09] CO: Yeah. Hard to do it upfront, too. I mean, we talk to companies where they're like, “Security isn't a revenue driver for us.” I've heard that phrase multiple times and I get it. I get it. Security isn't going to make you money if you're a B2C company, right? It's not going to make you money if you're a healthcare company, but morally – [0:11:28] LA: It could lose you a lot of money though. [0:11:30] CO: Yeah. “Well, luckily, we can just get Experian credit protection and we can buy cyber insurance. The insurance companies handle that for us, right?” [0:11:39] LA: Yeah. Yeah. [0:11:41] CO: What a bleak outlook. I'm sorry, everyone. [0:11:45] LA: Your point being is that the platform infrastructure itself should be able to handle that for you. That's what you're trying to accomplish. Now, your solution takes the approach of using a diagram-based interface, because it's easy to describe how things connect and how to get things done. I haven't had an in-person demo of your product yet, but I've watched some videos on your site, and I see how those connections work. It's a really easy-to-understand interface, and it's easy to get going. It is designed to make the job of the engineer easy. I can see that. One of the problems with diagram-based interfaces is they tend to fall apart when you have a lot of detail, right? The more detail you add, the more complex the system becomes, the more attributes attached to each node, the harder it is to deal with a diagram. It becomes much easier just to type it into a YAML or JSON file than it is to click, click, click to get everything just right in the diagram. Where do I have to look for this attribute, and where do I set that attribute?
How do you avoid that complexity with your diagramming approach? [0:12:58] CO: We have a marketplace of infrastructure components, similar to the Terraform registry or Pulumi registry, but IaC-agnostic. Today, we support Terraform, OpenTofu, Helm. Pulumi is in beta and can be requested. Essentially, we have this marketplace and we publish a lot of the bundles that are in that marketplace now, but companies can publish their own private modules as well in any of those IaC tools. What's interesting about the way that we work is there's two other aspects of Massdriver that sit on top of the IaC. They really make that flexibility possible. We have, effectively, a contract system between different Terraform modules, if you will. If you look at Massdriver, you'll see a box and there's a line connecting it. That line is not just decoration for your eyes. That line actually drives a lot. It does dynamic configuration. It'll inject credentials directly into your application’s runtime from your database. It will help you do principle of least privilege with IAM, so you can actually project IAM policies across those lines to your downstream. Now what's happening in your Terraform modules, or your Pulumi modules, is you have a lot fewer inputs for your developers to worry about. As for that interface, when you click on the boxes in Massdriver and you click configure, you can actually completely customize that interface. What we see a lot of times is engineers will come in, they'll design some Terraform, they'll make a much simpler interface where it's like, these are the only three things that my team needs to be concerned with when deploying PostgreSQL. They can set a version, maybe they can restore it from a snapshot, and they can set what they want their availability to be. All the alarms, metrics, everything else is codified in that Terraform and not exposed. You get these really, really simple interfaces by default that you can essentially add to.
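A pared-down module interface like the PostgreSQL example described here might be sketched in Terraform as exposing just three variables, with everything else hard-coded inside the module. The variable names and defaults are hypothetical, not Massdriver's actual schema:

```hcl
# Only these three inputs are exposed to developers; alarms, metrics,
# networking, and naming conventions stay fixed inside the module.
variable "engine_version" {
  type    = string
  default = "15.4"
}

variable "source_snapshot" {
  type        = string
  default     = null
  description = "Optional snapshot identifier to restore from"
}

variable "high_availability" {
  type    = bool
  default = false
}
```

The developer sees three knobs; the operations team owns everything the module does with them.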
We also have this concept of presets. As an operations engineer, you can build out three different variable sets for a Terraform module. This is how we run it in dev. This is how we run it in production. Maybe we have a production serverless version of Aurora that you can run as well. Your operations team can actually give you a lot of presets. Now as the engineer, you're just saying, “Okay, I need Aurora. It requires a VPC. It automatically connects to the nearest VPC. I have a very limited view of what I can control. The rest of it is guardrails that are in place around me to make sure the security and compliance is in place.” The other aspect that we have is this concept of what we call remote references. It allows you to, A, use Massdriver with stuff that's not in Massdriver. You reference it. But B, you can use the same concept to essentially make tiered projects. We'll actually have customers that have a shared infrastructure tier, where their networks and their compute clusters are. Then they have projects called ML team, API team, e-commerce team, whatever. They just borrow, essentially, compute and networking from the shared infrastructure. You get these very thin layers, like views of your world, so you can see what the API team needs. You can just see that you're getting your network, or your compute, from the shared infrastructure. We do a lot to make sure that when you're looking at your diagram, it is very much your world, not the entire business, right? Because that's what you're thinking about. You're thinking about changes. You're thinking about what's happening in my system. You're trying to troubleshoot my application environment. I don't necessarily need to see networks and what the machine learning team is doing. It's just my world. [0:16:19] LA: That works well for the developers that are creating the diagrams, and I love that.
What about the operations teams that have to validate, or verify, that what is coming out is correct, and do all those sorts of things? Certainly, there's a lot more detailed data that they need to make sure that the systems are correct. Is that right? [0:16:40] CO: Yeah. Today, they have the ability to view everything that's in the system. We keep an audit trail of every single change that's happened. You can actually take, let's say, your staging and prod, and visually diff those, or you can take your production database and diff it over the last three hours and see what changes have been introduced. We have a lot of functionality for operations teams to be able to come in and see what's happening, including just a full audit log, where you can see literally every single change that's happened to anything, so you have this centralized worldview of what's going on. We're working on some dashboard stuff now to provide even more insights to your operations team. We do naming and tagging conventions behind the scenes for you. We're starting to bubble up costs. How much do my data planes cost? How much does my production cost across the board? How much does the API team's production system cost? We manage all these tags on everything, whether it's going through Pulumi, Helm, Terraform, etc. We give you this consistent tagging convention, so you can actually pull back and query things with some pretty complex dimensions. We're going to start presenting that on the operations dashboard, which is something we're working on in the run-up to re:Invent. [0:17:47] LA: Yeah. I've seen a lot of different tagging systems that do a lot of that work. One of the problems that all of those systems have to resolve is the idea that everything has to be tagged for the tagging to be useful. Do you enforce tagging when organizations require it? Do you have policies and things like that for how tags are applied?
I guess, I'm trying to lead us into maybe the policy side of the diagramming process that's going on here. [0:18:16] CO: Yeah. Any modules that we build are automatically tagged. We have a series of, I don't know if plugins is the right word, but essentially, integration points for your IaC. When your IaC executes, even if you're not using one of our modules, Terraform modules, etc., there is, essentially, a bag of metadata that you get about your current compute environment. Alarm webhooks, where it could be PagerDuty, or it could be Massdriver’s alarm system. One of those things is your tags. We automate a lot of the tags: organization, project, environment, application, purpose, etc., are automatically on there, so you can just attach it to your Terraform or your Pulumi. We are also adding the ability to do, essentially, cascading tags. You can set organization, project, target, and, essentially, resource-level tags that will also roll out in that metadata bag. What's nice is in Terraform, if you're in AWS, you can just set your default tags and take this bag of metadata that's coming out, and that's got all of your tags in it. For some of the other clouds that don't have that nice way of setting the tags for all of your resources, you have to set the tags everywhere. I think we've got some PRs that we're going to be submitting back to one or two libraries that run Terraform code that allow you to do tagging for Azure and GCP as well. That's one of those things that really sucks. If you're in the AWS ecosystem, that default tags block in Terraform is amazing for getting tags everywhere, but that doesn't exist for Azure, GCP, and some of the other systems. If we got that in place, that would just make it so much easier to make sure that people are doing this consistent tagging. Today, effectively, we provide the metadata in this block and it's up to your team to attach it to the right place.
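The AWS provider's default tags block mentioned here applies one tag set to every resource the provider creates. A minimal sketch, with hypothetical tag values standing in for the metadata a platform might inject:

```hcl
# Every resource created through this provider instance inherits
# these tags without repeating them on each resource block.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      organization = "foocorp"    # hypothetical values; in practice
      project      = "api-team"   # these would come from the
      environment  = "production" # platform's metadata bag
    }
  }
}

# No tags argument needed here; the defaults still apply.
resource "aws_sqs_queue" "orders" {
  name = "orders-queue"
}
```

This is the convenience that, as noted above, has no direct equivalent in the Azure and GCP providers, where tags must be set resource by resource.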
[0:19:56] LA: One of the things you mentioned, and it certainly seems to make sense, is in your diagrams, things like the connections between your blocks, they're not just diagrams. They are contextual information that creates a valid block of information that's passed to both the source and the destination. I think that's one of the ways that you implement the principle of least privilege in the IAM policies that you automatically create for customers. Is that right? Can you talk about that a little bit more? [0:20:26] CO: Yeah. Essentially, in our contract system, we've got this defined way. If you're familiar with Terraform, it would be an output. Effectively, we have these contracts that you can apply to your Terraform outputs that can validate that those are correct and that that is a type that will fit into Massdriver’s type system. I can say, this is what my VPCs look like. We include, I think, 200 types by default, but you can also customize. You can build your own custom types and publish them into our registry, just like you can publish modules. You can define your own concepts there as well. In that, we have effectively an IAM block, where you can essentially pass down named policies. You can say, “Okay, I'm making an SQS queue. It's very use case specific. Maybe I only want to express the ability to write to this queue.” Well, that's the only choice the downstream now has. When you receive this block, there is an SQS policy in the IAM section called write. That's the only policy you can pick from. Now, if you're working on, let's say, another SQS queue and there's read and write, you can put both of those as policies that are expressed in this contract that comes out. Now, the downstream will have the ability to pick read, or write, depending on what it needs to do. Again, it's this framework for helping people do Terraform at scale and write Terraform effectively. Not just send out ARNs and let people attach whatever they want.
Me, as the person authoring this Terraform, I can tell you, these are the two important policies that you should know and use. I'm not going to express an administrative policy to you, so you can't possibly bind to that one. I'm just going to give you a read and a write. You can pick one, the other, or both to bind to. [0:22:02] LA: Where do these policies, or contracts – I'm not sure if those are the same things in your terminology, but both questions are the same. Where do they attach themselves to? Is it something that you create and derive from the diagram? Or is there a lot more contextual information you need to know about an object besides the diagram in order to create those contracts? [0:22:26] CO: You produce it from your IaC tool. We effectively give you the means to integrate it into Massdriver and say, this is an output. Let's say we're talking Terraform specifically, because it's a little weird with Helm. But with Terraform, you can say that this output right here, we're going to call it foo corp's VPC. This is what we define all of the outputs that have VPC information to look like. You can define that. Then right from that Terraform module, you essentially just make that output. Now that bag of information can be transmuted into a Pulumi resource, or a Helm resource, or another Terraform resource. That's the other interesting thing about our contract system: it allows you to do effectively remote state across provisioning tools, which can be useful. There's a lot of times where I've got Terraform and Helm, possibly in the same run, for configuring an application, right? My application is going to run in Kubernetes. I need to run a Helm chart, but I also need to do some IAM to get my role, bind it to all my inbound policies, etc.
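To make the contract idea concrete, here is a rough sketch of a Terraform output expressing named IAM policies to downstream modules. The field names and shape are purely illustrative assumptions, not Massdriver's actual artifact schema:

```hcl
# Hypothetical contract-shaped output for an SQS queue. Downstreams
# can only bind to the policies expressed here: read or write. No
# admin policy is expressed, so none can ever be attached.
output "queue_artifact" {
  value = {
    arn = aws_sqs_queue.events.arn
    iam = {
      read  = { policy_arn = aws_iam_policy.read_events.arn }
      write = { policy_arn = aws_iam_policy.write_events.arn }
    }
  }
}
```

The point of the pattern is that the module author curates the menu of policies, rather than handing out a bare ARN and hoping consumers scope their own access correctly.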
Yeah, it's again, it's one of these things where it's like, this is an integration point that we expose to people, and then just using Terraform, you can say, “Okay, this is the output that I want to be a very specific type. Now Massdriver has the ability to see that output and pass it off to other things.” [0:23:35] LA: Is it fair to say that in a typical organization, it would be the operations team that would make those associations and make those modules that then ultimately end up becoming diagram concepts that the developer can use when they create the specific application-level infrastructure? Is that a fair description of the use cases? [0:23:54] CO: That's what we typically see, operations engineers doing that, but it's up to the maturity of your team, right? The types are something that is extendable and you can publish them in Massdriver. You could technically say, okay, the ML team, they've got all these ideas of how they need to connect all their ML apps together, right? They want to define some of their own types. You could say, okay, the ML team can make a couple of types that they manage that make it really easy for them to maybe pass host names between different ML services, or maybe some information about the model building process and an ETL system, or something. As the operations person, you can, in Massdriver, have a bit of control over who is allowed to publish modules and who is allowed to publish types into our type system. [0:24:39] LA: Cool. I'm going to ask this question about both your company and the industry. There are two basic types of problems in the creating-cloud-infrastructure space that I see. There's the problem of, how do you make the job simpler and understandable? Then there's the problem of, how do you make it conformant and compliant to the rules and regulations that I need in place? Which includes security policies, cost policies, all that stuff.
Those are in many ways conflicting, but they're very different sets of requirements in those two areas. Now, I think your product right now, anyway, is more geared towards the simplify-the-job part of that solution. But you have a little bit of connection into the conformant and compliant infrastructure space. How important do you think that is in the future? Where do you think the industry's needs really lie in that area? [0:25:43] CO: Yeah. I mean, I think being able to reproduce environments is pretty important, right? I mean, I think we've all been on that team where you're sharing a staging environment. It's just like, “Okay, can I have the staging environment for an hour?” “Oh, no. I need the staging environment.” That world sucked, right? We're starting to see this idea of preview environments, which is pretty cool. I can go on Vercel, I can spin up 20 preview environments. Every time I open up a pull request, I'm able to see 20 different versions of my application. Our applications are becoming more and more cloud services. Massdriver itself has 27 AWS services that we use just to orchestrate our core service. [0:26:24] LA: It's part of the problem, right? [0:26:26] CO: Yeah. Right. I mean, it's a lot of – [0:26:27] LA: The complexity. Yeah. [0:26:28] CO: Right. It’s a lot of code that we didn't have to write, but it's a lot of dependencies that we've taken on. When you think about that, we're getting more and more cloud services as part of our applications. Things start to get interesting. Okay, so how are we validating these environments? How are we doing testing? Am I doing mocks for every single AWS service? Am I standing up LocalStack in my CI next to my service? One of the interesting things that you can do with Massdriver is you can spin up preview environments that include infrastructure.
If you have this shared environment, I can have a production environment, and then I can have a shared production environment where my clusters, our production workloads, are, and I can have a shared preview environment where our preview environment clusters are. I can share that out to every single one of my development customers: API team, ML team, etc. Now as they're spinning up the different cloud services that they need, let's say I'm doing a pull request for my app, when it spins that pull request up, it makes a preview environment in Massdriver and it spins up infrastructure alongside it. Now my QA team, who wants to go in and see that this feature works, actually gets a fresh SQS queue, and it stands up a database right alongside it. When that PR merges, it tears down the infrastructure. You can actually start to bring in some of these cloud services alongside your preview environments to get real interactivity, to see how this stuff actually works. Then you can do things like using k6 to exercise your application and see that it's actually working and scaling, and how it's impacting some of these cloud services, so you can get an idea of how it should be configured for production. Instead of, okay, with my local mock for SQS, this app works super-fast. Well, does it when it actually hits SQS? You don't know until you hit prod. I think that preview environments, standardized, repeatable environments, are getting popular. You're seeing it get popular among frontend frameworks. Some of the PaaSes are starting to do it for replicating your applications. We don't really see something that's doing it for infrastructure. I think that's going to be really important as we start to lean more and more into cloud services, especially around a lot of the ML services that are starting to come out in GCP and AWS. That was the first question. What was the second one? Sorry, I should have written those down.
[0:28:36] LA: Well, the first related to the application; the second related to the industry. [0:28:41] CO: Yeah. I mean, I think it's important in the industry. I mean, I think that as we get more and more cloud services, and some of these are – they're cheap. It's fast and it's easy to spin up an SQS queue, right? To be able to have a preview environment with your applications running where you have a real queue, you can actually see like, “Oh, I didn't have it set up for FIFO. I didn't realize that. My mock locally did.” You can start to see some of these things that you might not catch until it hits prod, right? I think it's very important for the industry to get there. I think what we're seeing at the same time, though, that's pretty interesting, is it seems like we're in a place where people are starting to recoil from the complexity of the cloud. We're starting to see 5% repatriation a year, like moving back to data centers. That is very interesting as well. I think there's going to be some of those similar challenges there. We can do things easier in a data center, because there's fewer things, but now we're writing more software. We're not leaning on these cloud services. Now how do you spin up those preview environments? I think, even when you get back into data centers, being able to spin up these reproducible environments for your developers is also still important, whether you're doing it in Kubernetes namespaces, or using something like Nomad. That is a thing that makes it so much easier to QA and launch stuff than the way that we did it previously with staging environments, right? I think the alternative is starting to use things like feature flags, which are great. We actually lean into feature flags pretty heavily, but I feel like that requires a pretty high level of maturity. It is not as easy to coordinate at scale as preview environments. Now, the tradeoff there is preview environments cost money, right? There's the rub.
[0:30:14] LA: A lot of it depends on how much infrastructure it takes for an instance of the application to run. At a small scale, a lot of applications can run with a very, very small footprint, but not all applications. Some applications require a certain size footprint for an instance, even with almost zero traffic. In those cases, preview environments can be very, very costly. [0:30:37] CO: Yeah. What we actually do to allow you to bin pack your costs a little, if that's a phrase I can coin, is we recommend people have a compute environment that all the preview environments run in, right? Or maybe they have one or two of them, depending on the size of their team. You have this one Kubernetes cluster and one VPC, and then all of your preview environments actually get, essentially, connected to that provisioned infrastructure. You have independent compute environments for all of your applications, but they're sharing this same compute resource, so you're not spinning up 1,000 Kubernetes clusters, or 10 different VPCs to test out a team's changes. [0:31:12] LA: If you're not doing performance tests, you can even share database instances and things like that as well. [0:31:18] CO: Yeah. Yeah. [0:31:19] LA: I'm going to mention – I've got a list here of a few different products and technologies, okay? I'd like to hear your thoughts on how each of them fits into your space. Specifically, are they supportive of what you're trying to do? Are they competitive with what you're doing? Are they just unrelated and nothing that you deal with? Those three answers – there might be more answers, but certainly, I'd like to understand why you think that's the case for each of them. [0:31:47] CO: Sure. [0:31:48] LA: First of all, we've talked a lot about Terraform. What about CloudFormation? [0:31:52] CO: CloudFormation is, I would say – it's in our queue. That's something that we are actually in the process of adding to Massdriver as one of our IaC tools. 
We see a lot of customers come to us today with CloudFormation, saying, "We're tired of this," and they want to import their stuff. Today, we have the ability to import Helm and we have the ability to import Terraform, but not CloudFormation. That's something we're actively working on. [0:32:12] LA: Do you see CloudFormation on the same level of sophistication as Helm and Terraform? Or do you see it as either a substantially simpler, or substantially more complex version? [0:32:23] CO: I'm going to say something that might be offensive to a lot of people. I can segue this into a couple of other tools. I can just run this train through a bunch of stuff. But CloudFormation and Google Cloud Config and Azure Bicep are the most phenomenal waste of human time ever. As I understand it, each of the clouds wants to have their own grip on the thing. There's some cool stuff, like the ability to share CloudFormation templates and execute them. That's cool. They're all also investing in Terraform modules. They all have teams that are managing their providers and managing official modules. It's like, why are you spending so much time doing two things? I know they don't want to – [0:33:00] LA: Just do Terraform. [0:33:02] CO: I know they don't want to be beholden to a separate company that also very much wants to be beholden to them. It's just like, every time it's presented to me, somebody's like, "Should I do CloudFormation, or should I do Terraform?" It's like, well, at some point in time, you're going to maybe want to reach for a service in a different cloud, and do you want to learn a different tool? That's the reality. That's the nice thing about something like Terraform or Pulumi, it gives you the ability to start reaching for – this is going to make people freak out, but reaching for hybrid sooner. Like, "Oh, my God. The guy said hybrid cloud to startups." I don't think it's something we should be afraid of. 
We as an industry, we're always saying, we use the right tool for the job. Until we get to the cloud – then we use the tool that is in our cloud, right? We're like, why are you using AWS RDS? "Oh, well. It's the one that's there." Why aren't you using Crunchy Data? I don't think we do that a lot in the cloud. I think that hybrid cloud can be more than just a buzzword. It could be you actually doing right by your business. The AI Platform in Google was pretty great for a while – better than the ML and AI tools in AWS. Why be stuck using the more mediocre tool? When it comes to the IaC tool that you're picking, the one that gives you the flexibility to not have to learn another tool, to go get the thing that might serve your business better – that's more important to me. That's why I tend to pick Terraform, or things that are a bit more cloud agnostic. [0:34:25] LA: Next on the list are some CI/CD tools. Things like CircleCI, Travis CI, Argo CD. Where do those sorts of tools fit into your ecosystem? [0:34:35] CO: Yeah. We integrate with those. We have an actually pretty novel way of how we do most things. What I want to do first, if I can do a product placement: Travis CI, I still have their shirt in my closet. I got a Travis CI shirt at GitHub Universe, I think, like seven or eight years ago. It is by far my favorite piece of swag I've ever gotten. If any of the Travis CI people are listening, your swag person fucking rules. The breaking-the-build shirt, love it. So comfy. Yeah, we integrate with CI/CD tools. We have a CLI that essentially does all of the kickoff and provisioning in Massdriver, and so you can put it in your GitHub Actions. We have official GitHub Actions, but you can put it in CircleCI, GitLab, etc. Yeah, that's a place where we integrate. What's interesting with how we do everything – and this just comes back to security – is where we are today. I feel like, if you were in 2005 and somebody was like, "I'm another company. 
Give me your source code. You want to use my product? Give me your source code," you'd be like, "You're out of your freaking mind." Everything we do nowadays, we're like, "Ah, sign in with GitHub and give it access to all my source code," right? We give our source code to other companies all the time when we're authorizing for stuff, which seems a bit wild. One of the things we do at Massdriver that is extremely painful for us is we never get access to your source code. Even though we're helping you build images and scheduling those images, or provisioning infrastructure, we never actually get access to your source code. We don't directly integrate with GitHub, or ask for access to your source code. We do everything through CI. CI is a very important partner and ecosystem for us to integrate with. [0:36:09] LA: What about cloud platform systems? I'm thinking things like Amazon's Elastic Beanstalk here. Where do you see platform as a service offerings like that compared to what you do? [0:36:22] CO: For a lot of people, they're still a really great option. I mean, technically, we support anything that Helm, Terraform, and Pulumi support. Technically, you can manage your Heroku environments with Massdriver. It gets pretty inception-y inside Massdriver, but it also runs on itself, which is very weird. Yeah, those probably fit more in the competition range with a segment of our customer base. We do get a lot of startups today that just want to run their application, but they do want some control, right? They're competitors, but they're also sometimes things we'll just tell people they should stick to. We've had a couple of customers that come to us like, "We don't change a lot. Heroku feels really expensive. We're thinking about going to the cloud. How can Massdriver help us?" It's like, just stay on Heroku. If your Heroku bill is $60,000 a year, that's fine. Don't move to us. 
Your cloud bill is going to be – maybe it'll be cheaper, but then there's going to be a burden on you. Just stay there. If your business scales, come on back. I think it really depends on the maturity of the business. Now, there are people that do want a little bit more control. They do need queues, or they do need to do some ML stuff. That's where it starts to be more of a customer that's interesting to us, because we don't want them to have to deal with split operations management: if you manage your applications in Heroku, you have to manage your infrastructure in AWS. You can't securely connect them, because you need Heroku Enterprise to get VPC peering. Those are the customers that are more interesting to us. We can help you save some money on your compute and actually get you some of the functionality you can use from the cloud. I think it really just depends on the customer's maturity and what they're doing. [0:38:00] LA: I've got two more companies. The penultimate one is AutoCloud. [0:38:05] CO: Yeah, they're our buddies. We love those guys. We actually work very well together. I mean, one of the things that we don't do is generate Terraform for you. We have Terraform that's in our marketplace, but if you don't have Terraform, you can either pick from a marketplace, or you can write your own. There are plenty of teams – like I said earlier, the State of CD report says that 27% of orgs are actually using IaC today, right? There's a lot of companies that still aren't. There's plenty of people that are in data centers today that haven't done a ton of Terraform. They're fantastic ops engineers that have worked in data centers forever. Now they're getting forced into the cloud. They still understand the concepts. They understand networks. They understand IP tables. They understand maybe even running Kubernetes, but they might not have Terraform experience, or AWS experience. But they need to be able to provision and manage stuff quickly. 
They need to be able to migrate themselves to the cloud, without having to become a junior Terraform engineer first, right? I actually think AutoCloud and Massdriver work very well together. You can go into AWS, click around, build some stuff, generate Terraform with AutoCloud, and then you can push it into Massdriver for your developers to pick and choose what they want. [0:39:09] LA: Cooperative. Very cooperative from that standpoint. Yeah, you're solving different layers of the problem. [0:39:14] CO: Yeah, exactly. [0:39:16] LA: Okay. Datadog. [0:39:18] CO: Datadog is an interesting one. Does anybody from Datadog listen to this? [0:39:23] LA: I'm sure they are. Given I came from the observability space with New Relic, I'm sure there are Datadog people listening in. [0:39:30] CO: Datadog is an interesting one, but I think my opinion of Datadog and the business opinion of Datadog might be a little different. I personally, as an ops engineer, do not find a lot of value in a dashboard. It's something that's really great to put up on a screen to make everybody think that you're fucking smart. Looking at a dashboard, building a dashboard – that's tedious, torturous work to me. It's like, okay, let's build a dashboard for what I think is going to help me at 2 a.m. when something's broken. I want OpenTelemetry. I want traces. I want something with a bit more fidelity than just a bunch of blinking lights coming at me. That being said – [0:40:03] LA: Insert any observability company into that sentence. Instead of Datadog, if you want to choose New Relic, or anyone else. Where do you sit in the observability space? I know you've got integrations. How do you see the correlation? [0:40:20] CO: Yeah. We just bump up against it. We do a little bit in the observability space today. We've got some really interesting work underway that I'm not going to dive into too much, in case there's that Datadog listener. 
I think that, with the way that we run and operate and the number of places where we interact with people's infrastructure, we have a lot more data points than some other companies. Our type system and ability to actually see how things are connected together gives us a lot of insight. While we will probably never be your metrics dashboard, we are starting to bring in some more observability tools to help people lower mean time to resolution and get a better understanding of how their changes impact costs, or how a change may have caused a metric to experience some anomaly. I think that if you're looking for a bunch of blinking lights and a pretty dashboard, that's not Massdriver. We'd say, "Hey, yeah. Go put the Datadog thing in your Massdriver account and your stuff will go out to Datadog, too." We're trying to give people a lot more actionable insights around a specific environment. We're looking at things at the environment level. We're saying, this is your API production account. You've got a database, you're borrowing Kubernetes. We actually know all of this stuff. You're sharing a Kubernetes cluster with four other teams. There's a ton of context that we get from those connections. We're working on some functionality there to make mean time to resolution smaller, and also to help people find insights and problems before they occur. That's the goal there. I'd say, yeah, I don't know. That one's up in the air. I think, maybe we bump into each other pretty closely. [0:42:06] LA: As you grow, there's going to be more connections and associations. [0:42:10] CO: It's funny that with all the tools in the ops space, depending on how you look at it, I feel like with all of them, it depends on how you feel that day. As a founder, you're like, "You know what? They're competitors today." We touch so many things, right? We touch CI/CD. 
We're technically doing your infrastructure provisioning. Does that make us an env0 or a Spacelift competitor? I don't know. Today, probably not, because most of our customers are software engineers that maybe don't know how to write Terraform. They're not going to go buy one of these other products, because they don't even know how to write Terraform and manage cloud, right? Now, as we start to move towards more operations teams, we start to become a competitor with them. That's the really interesting thing about this space. Even when we're doing our VC pitches, it's like, who is your competitor? It's like, anybody we want to be, but also, we can be a partner with anybody we want to be. It's an amoeba. There are real blurry lines. [0:43:03] LA: Absolutely a great description of the DevOps space in general. Definitely, definitely, I agree with that. Let's end with a product maturity question. Where are you in your growth cycle? Are you in beta? Are you in production? Do you carry production workloads right now? [0:43:20] CO: Oh, yeah. Yeah. Last time I checked, I think we're managing upwards of 4,000 databases and clusters. There's – [0:43:30] LA: In production. [0:43:31] CO: Yeah. There's a ton of compute going through us right now. We manage a number of healthcare companies, a lot of startups in the AI space. There's some pretty intense production workloads running there today. Now, I think – [0:43:46] LA: What's your largest workload? By size, not name. [0:43:50] CO: Oh, geez. It depends on how you measure largest. But the most active customer right now is doing 800 deploys every two weeks or so. That is just a ton of deployments running through the system. Then we have some other fairly large Kubernetes clusters. The catch is we don't really have visibility into people's configs per se. The way that we store things, we can't report on that. 
The end user can use the keys that they have access to while they're interacting to decrypt and compare some of their configurations. What we really see is the balance at the end of the month of how much compute they've bought, and effectively, how many deploys they're running through the system. [0:44:34] LA: You really can't tell which are production loads and which ones are test loads and things like that. You don't really have a lot of that visibility. [0:44:42] CO: Yeah. We can see the tagging. People tend to tag things the same, right? Their production environments – your environment actually is a first-class concept in Massdriver. People will name them prod, production, all very similar. We just fuzzily search that and go, okay, these look like prod workloads based on the naming convention that the customer's applying. That's how we look at that. Yeah, as far as the actual configuration, all of that is encrypted and stored in effectively an air-gapped VPC that we maintain. Air-gapped VPC – some people are like, "What the fuck does that mean?" It's very complicated to explain, but it is the closest thing to an air-gapped VPC. Maybe we'll do a write-up on it. Yeah, we keep everything there, because sometimes those configurations have secrets in them. That was something that we didn't want to be on the hook for. We didn't want to present it in the UI or our admin UI, or anything like that. We just didn't want to have the configurations visible to anybody, except for the people that own those configurations. That's the goal. Yeah, so we mostly look at cost and deployments. [0:45:44] LA: Well, thank you. This has been a great conversation. I very much appreciate your time. [0:45:47] CO: Yeah. [0:45:47] LA: My guest today has been Cory O'Daniel, who's the CEO of Massdriver. Cory, thank you so much for joining me on Software Engineering Daily. [0:45:55] CO: Yeah, thanks so much for having me. [END]
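The fuzzy environment-name matching mentioned near the end of the conversation – classifying workloads as production-like purely from naming conventions – could look roughly like this toy sketch. The hint list and similarity threshold are made-up illustrations, not Massdriver's actual implementation:

```python
# A toy version of fuzzy environment-name matching: treat environments
# named "prod", "production", etc. as production-like based purely on
# naming convention. Hints and threshold are illustrative assumptions.

from difflib import SequenceMatcher

PROD_HINTS = ("prod", "production", "prd", "live")


def looks_like_prod(env_name: str, threshold: float = 0.8) -> bool:
    name = env_name.lower()
    # Direct substring hit, e.g. "api-production" or "prod-us-east".
    if any(hint in name for hint in PROD_HINTS):
        return True
    # Otherwise, fall back to a similarity score against each hint,
    # which catches near-misses like typos in the environment name.
    return any(
        SequenceMatcher(None, name, hint).ratio() >= threshold
        for hint in PROD_HINTS
    )


assert looks_like_prod("production")
assert looks_like_prod("api-prod-us-east-1")
assert not looks_like_prod("staging")
```

Because the provider never sees decrypted configurations, a naming-convention heuristic like this is about the only signal available for telling production apart from test workloads.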