EPISODE 1732

[INTRODUCTION]

[0:00:00] ANNOUNCER: DevOps is a powerful model for managing the building and operational aspects of modern applications. Most developers are now familiar with DevOps, and the adoption of DevOps practices is widespread and growing. Adam Jacob was the original author of Chef, a popular early DevOps tool. He's now the CEO of System Initiative, which develops an open-source collaborative tool designed to remove the many pain points from DevOps work. Adam joins the show to talk about the history of DevOps, current strategies in DevOps, System Initiative's collaborative platform, and more.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[INTERVIEW]

[0:01:21] LA: Adam, welcome to Software Engineering Daily.

[0:01:23] AJ: Hi, thank you. Thank you for having me.

[0:01:25] LA: Great. Yeah. I appreciate you coming here. I've been a long-time early user of Chef, so I first used Chef probably back in the 2010 time frame or so. That was probably very early in what was going on. It was a Ruby tool, mostly back then for me, and that's what I was using it for. I'll be honest, I haven't used it much in, I would say, the last couple of decades, but at least the last decade, I should say.

[0:01:49] AJ: Oh, it hurts. It burns.

[0:01:51] LA: I'm sorry.

[0:01:51] AJ: Couple of decades. Yikes.

[0:01:53] LA: Yeah. Well, at least the last decade, I should say.

[0:01:54] AJ: Yeah, yeah. I know I'm old, but -

[0:01:57] LA: Oh, yeah. We're both probably on the older side there. I do want to talk about what's happened with Chef and what's going on with Chef, as well as what are you doing now? Let's get started and say, how has DevOps actually changed from the early days when you first started working on Chef?

[0:02:16] AJ: Yeah. I mean, in the early days, when we first started, there wasn't a DevOps. Chef predated DevOps. There was infrastructure automation, I think, was what it was starting to be called. I think DevOps grew out of infrastructure automation on the one hand. How do we build fully automated infrastructures? How do we think about that piece of the work that we do?

The rise of the cloud, and more specifically than even EC2 or cloud computing, it was the ability to suddenly launch an application on the Internet and have millions of people using it overnight, which wasn't really a thing before Facebook apps. People have forgotten about Facebook apps, but there was a minute where Facebook apps happened and then that was like, suddenly, you could do that and you couldn't before. There was no real way to manufacture that. Suddenly, you could get these humongous spikes of traffic, making things be incredibly popular overnight. That nature of the world, and the Web 2.0 universe at the time, coalesced into this idea that became DevOps. Because what we saw was the companies who could survive and operate at that speed did so by working much closer together than the traditional enterprise worked together.
If you were a systems administrator like I was, or you were a software engineer and you were in the pre-DevOps enterprise world, you worked in separate divisions. It was handed over the wall between, sometimes, thousands of people. The application developers wrote code, then it went to QA, then QA went to operations, then operations deployed it, right? That was the way everything worked. When the rate of speed needed to increase, you couldn't work that way, because you couldn't do it. You couldn't go faster.

[0:04:00] LA: Then scale. The organization couldn't scale, yeah.

[0:04:02] AJ: Yeah, the organization couldn't scale. That's really where DevOps came from, is a recognition that in order to make the organization scale, we also had to make the automation scale, right? Those two things came together. I think in the early days of DevOps, we tended to talk about it primarily as a cultural thing, as an organizational revelation, if you will, as opposed to a technical thing. We're like, well, there's lots of different tools you could use. They're all roughly going to be good enough to solve this problem. We were wrong. The truth is that it was always both. Culture - the tooling that we choose is the ossification of our culture, basically. There is no separation between what we do every day at work and the culture of our work. It is the same.

The movement over time, I think, has grown and changed, primarily as we've both learned more about the tooling, learned more about what works and what doesn't. Then, unfortunately, look at the outcomes, at things like the DORA report and the other surveys of how well we're doing. Most organizations that have adopted these practices, they're certainly better off than they were in 2009, or whatnot. But they land in the mediocre middle. They're like, "We can deploy once a month, or every six months." They have a bunch more automation and tooling. It is better, but they're not living the collaborative dream.

[0:05:26] LA: They didn't take the cultural aspects of DevOps and incorporate them into their process. They took the tooling, but not the culture.

[0:05:33] AJ: I think they took the culture, and then maybe - I think, maybe they took the culture and the tooling just let them down.

[0:05:42] LA: Fair enough. Fair enough.

[0:05:43] AJ: Because if you ask them what they want to do, people know the answer at this point. If you ask people, "Hey, should engineers talk to operations people to get their software deployed quickly?" Most people would say yes, at this point. There's a weird Luddite movement happening now, where they're like, "No, that's not true. We should go back to having siloed divisions, where it's like, developers should never talk to operations people, and we shouldn't have to understand how things get deployed." That will end on its own, because it doesn't work very well. We don't have to find out. We can just let it happen. We know what will happen, if that's what you decide to do. They just don't know, because, whatever, they're young.

[0:06:17] LA: The reason why I said they hadn't taken on the cultural aspect is because one of the things I see a lot with companies who claim to understand DevOps, but really don't, is the language they tend to use. It tends to be words like, "We do DevOps. We have DevOps teams. We're DevOps, because look, we've got three engineers over there. We call them DevOps engineers. Therefore, we know how to do DevOps." Yeah. What they've done is they've done some automation.
They have people responsible for the automation, but they haven't taken on the cultural significance of what it means to own your application running in operations. They haven't taken on that piece. That's what I meant when I said they didn't incorporate the culture of DevOps.

[0:07:03] AJ: I mean, look, I think you're probably right. I've certainly seen that, too. I think, if you ask those people what they want, they'll tell you, they want the cultural outcomes of DevOps. They'll tell you, that's what they want. If you look at why they don't get it, that's really the question, right?

[0:07:19] LA: That's true.

[0:07:20] AJ: If they believe that that's what they want, if they're like, "Yeah. I want this highly collaborative environment, where there's low blame, where people work together to ship really complicated software packages, where it's teams of experts who focus on their expertise, but are aware of what each other are doing contextually, so that we can help each other move this work through these complicated processes." If that's what you say you want, why don't you have it?

[0:07:44] LA: Well, now that's the million-dollar question. Is the answer to that question, we don't know how to implement the culture, or is the answer to that question, we don't have the tooling to make it possible?

[0:07:55] AJ: Well, that's a good question. Because now we're, let's call it 15, I think, years into DevOps at this point, right?

[0:08:00] LA: Yup. Yup.

[0:08:01] AJ: There's been at least two generations. I would argue, probably three generations of tooling. If you go look at some of the antecedents to DevOps. If you look at infrastructures.org, which is still up, which is an incredible website that basically told you how to build fully automated infrastructure, it probably first came online in the late 90s, and that's essentially still the technical roadmap that you have to follow today. If you go watch John Allspaw and Paul Hammond's presentation about 10 deploys a day at Flickr, which was one of the seminal moments in the prehistory of DevOps, we're still teaching people to do essentially exactly what John and Paul did.

Now, what's changed is you don't use Ganglia for monitoring. You're using Honeycomb for observability. You do less configuration management of operating systems, and you do more infrastructure as code to define which services you're going to stitch together. You use Git and not Subversion. The tooling, we've shifted it out, mostly for the better, on each individual layer.

My belief is that it's actually that the workflow that we adopted then, which was pretty good for us, it's too fragile. In order for it to work, all of these pieces have to come together and they have to align perfectly. When they align perfectly, they have to align perfectly with a culture that's willing to respect the boundaries of them. That alignment is very, very hard to do. It's not that it's impossible to get great outcomes with the tooling we have. It is possible to have incredible outcomes with the tooling that we have. Unfortunately, it's very difficult to actually put those systems together and then also put the cultural systems around it, so that they work in harmony. That's because the user experience of doing it is too hard. Because the system is a little broken by design. It's not because the tooling is bad. It's because the shape of the system is wrong for the problem we're trying to solve.
What we didn't know then was that the problem we need to solve is, how do we get teams of experts to collaborate together in different ways in the same environment? That's what we've been trying to do this whole time. That's not what we built. We built automation to be like, "Well, how do I get 10,000 servers online? Because I have a million people who want to use Flickr, and I didn't yesterday. Then I got to update that." Which is not the problem we're trying to solve.

[0:10:16] LA: We built fragile tools that solve part of the problem, but didn't give an underlying core framework for how to do it correctly.

[0:10:26] AJ: Yeah. We didn't even know what the cultural outcome was. Culture is an output of our work. Culture isn't what you say. It's what you do, right? We didn't know that getting people to do this wouldn't work when you got larger, or as things got more complex, or as things got stranger. That's because the shape of the system we asked you to implement, it doesn't support you in making good life choices, right? It doesn't work together to help you understand how to actually do the right thing. It actively works against you. That's why we're getting so many mediocre outcomes, I think.

[0:10:59] LA: That's fair. That's fair. You mentioned that we've now done, let's say, two or three generations of tooling. Let's talk about what those generations are. Tools like Git were generation one, right? Is infrastructure automation tools like Chef, is that generation two, or is that still all part of generation one?

[0:11:20] AJ: That's a great question. I mean, to me, I start counting pre-DevOps. I think about the tooling we use to build fully automated infrastructure in data centers. I ran a consulting company, for example, where what we sold was fully automated infrastructure to startups. You paid us a flat fee and we would then automate your infrastructure. It was everything from provisioning, monitoring and trending, application deployment, configuration management, and, I'm missing things - backups, load balancing, networks, all that stuff. It was all fully automated. This was, well, pushing 20 years ago, right? The fastest we ever went from signed contract to fully automating someone's business was 24 hours.

[0:12:02] LA: Wow.

[0:12:02] AJ: We were good at it.

[0:12:03] LA: Really efficient. Okay.

[0:12:06] AJ: That was a long time ago. That generation of tooling, that's the Chefs and the Puppets and CFEngine, and tools like Ganglia and Munin and Nagios, that generation of tooling. Then I think the cloud, so broadly, AWS sparks the second generation of that tooling, where suddenly, compute was more accessible. You could think differently about the way you built deployments, or the way you thought about systems. That started dragging us into slightly different -

[0:12:39] LA: The automation became more possible.

[0:12:40] AJ: The automation had to adapt. The next big push there was containers. Getting to a spot where you went - oh, actually, for all this configuration management we're doing, what we really want are deployments that are reproducible, because now it's so much easier to build congruent infrastructure than it used to be. You want it to be - that's a weird term of configuration management art. Sorry. It just means that it's the same every time and it follows the same - the changes are applied in the same shape every time. The output of that tends to be like an image that you can boot. It used to be really hard to build golden images that you could boot.
Suddenly, Docker showed up and it made it really easy to build golden images you could boot. Then we had to figure out, oh, that's amazing, and such a better user experience. How do we take that into the world? That then precipitates another round of changes of, well, how do we deploy those containers? Do we need schedulers? Kubernetes shows up. Meanwhile, all the other pieces of the stack, monitoring, trending, observability, they've all evolved as well. It's all concomitant as you move. You go from Subversion - in the beginning, everybody was using Trac and Subversion. Now, everybody's using GitHub. All of that is a piece of that evolution of that automation history.

Yeah. I think, roughly on that same line that we started out on when we built the Internet in the first place, it's an unbroken line to now of what the tooling is, and how it fits in the stack and in the workflow more specifically, right? You still think about tooling roughly the way we thought about it 20 years ago.

[0:14:13] LA: But it's still Lego pieces that plug together, sometimes very well, sometimes not so very well. It's not a cohesive -

[0:14:20] AJ: Then glue it together in whatever shape you want to -

[0:14:22] LA: Exactly. You take two Legos, you glue them together, because they don't fit otherwise, the way you want to put them together.

[0:14:26] AJ: Totally. You figure it out. We culturally tell you, that's fine.

[0:14:31] LA: Add some tape here. Right.

[0:14:32] AJ: Our culture says, "Hey, don't worry about it. Whatever choices you make, it's probably fine." The answer is, they're not fine. Some of them lead to better outcomes. Some of them lead to worse outcomes. Those choices matter.

[0:14:41] LA: But all of it leads to a more fragile outcome.

[0:14:44] AJ: All of it leads to - certainly, yeah, a more fraught path to a great outcome is how I would maybe put it, right? You can get to great outcomes with those tools, but it's a narrow path and there's lots of dragons. That's why people tend to wind up in mediocrity. Not because they want to wind up in mediocrity. It's not like they started out being like, "It would be good enough. Let's just do it the shitty way." Nobody does that. Of course, they don't want to do that, right? They want it to be awesome. Of course, they wanted it to be awesome. They made really rational choices that seemed good at the time based on their environment, based on the constraints that they have, based on the tools they already use, based on what's already been deployed. They mushed it together to try to make it work. It's understandable.

[0:15:26] LA: It's interesting. You mentioned the cloud and then Docker as the growth path for DevOps. I created four steps that I thought were the growth path of DevOps. The first was cloud, the second was Docker. The third one was the growth of SaaS and SaaS tools, and how that's impacted the use of Docker. I'd love to get your perspective on that. The fourth one, which I think we'll save to the very end, is the one that's still coming. That is, how AI fits in, and the new AI technology stacks that we have - how all that fits into not only DevOps enabling those technologies, but those technologies enabling DevOps. I'd love to get your perspective. At this stage in the podcast, let's talk about SaaS and SaaS tools. We'll talk about AI in a little bit.

[0:16:13] AJ: Yeah. I mean, I think the big thing about SaaS and SaaS tooling is that there's always been this dream. But we joked earlier that we're old.
I started my career in 1994. I was a systems administrator. That's who I am. I still feel like one, even though I'm a CEO. Throughout that history, that work of operations work has always been plumbing. At the time, we had systems administrator appreciation days. One way you know that you're not appreciated is that you have an appreciation day. You know what I mean? That dream persists, which is like, I shouldn't have to think about all of this complex plumbing. I shouldn't have to think about all these systems and computers and wires and networks. I should just be able to think about my application. It should just magically do what I want.

I think as we've gone through time, what we've learned is that operating those complex platforms is hard. Doing it yourself both requires a bit of expertise that sometimes people don't have. But also, it's worth paying money sometimes to just not have to think about the operational semantics of whatever the component is that you want to run, because it's not the most critical piece of the story. You get that rise of SaaS with stuff like Salesforce, where a big piece of what Salesforce was promising was that you didn't have to run big, bulky, complex ERP software, which required you to run expensive databases and all this other stuff. Now you didn't have to think about it. You could just go use Salesforce.

The same thing, I think, is true now with like, you could go use Honeycomb. I could run my own Grafana cluster and I could run other tools that are a little less good than Honeycomb anyway, feature to feature. The big advantage of using something like Honeycomb is that I can just shoot my metrics at this magic endpoint and then I can log in and it has all this stuff done and I didn't have to think about deploying it. I just turned it on and it worked. When you think about automation, that's what the end result of the automation would have been. If they told me I could deploy it on prem and I used a Chef recipe, or a Terraform module, or an Ansible playbook, or whatever to deploy it, what I would have hoped for in the end was that I run that playbook, then I get an endpoint that I can send my monitoring to and don't have to think about anymore.

That rise of SaaS, I think, is important, because what it's doing is taking pieces of that puzzle and shifting the control plane, so that we have to think a little less about its operational concerns. It turns out, we still have to care a lot about how it gets built and managed over time, because there is still a boundary between what we need in terms of, I need Honeycomb, or I need a database, or I need this, or I need that, and it's behind the service boundary. I still have to manage the boundary, but I don't have to manage the details, and that's beneficial. I think you're right that that was a sea change. It's one of the primary sea changes that drove the adoption of infrastructure as code, tools like Pulumi and Terraform, which are really a very clear descendant of configuration management tools. They work very similarly to the way configuration management worked. It's just the state they're managing happens to be SaaS-y APIs.

[0:19:17] LA: Another way to think of it is, in the early days of DevOps, we had GitHub. In the later days of DevOps, we had GitLab and the power of combining multiple things into a single package that helps you do DevOps better, but it's still only a partial solution. GitLab by itself doesn't solve everything, but you can use it with other tools to make things better.
But what we don't have yet is this unifying principle behind building the culture of your whole infrastructure environment.

[0:19:50] AJ: Yeah. I believe that that's because we designed the foundations of the system wrong. We treated the system itself as code, and we treated the infrastructure like an application artifact, and then we've been pushing that paradigm this whole time. It turns out that that paradigm sucks. I mean, it doesn't. It's amazing. That it works is incredible. When I say that it sucks, please hear how proud I am of having helped to build it, how proud I am of - It's amazing that it works. It's astounding. People who do that for a living don't love it. If you ask people, how do you feel about the work you did today? You're like, "Oh, it was fraught with peril." If you have a 100,000-line Terraform repository, you do not feel good about changing it. It's scary. That's because it's a little bit of a weird fit.

When you think about the way that we want to interact with it, like, how can I show an application developer what the infrastructure is that runs their application only if they need to know? If they don't need to know it, then they should be able to interact with it just by being like, deploy the application, or whatever. But what if they do need to know? What if the change they're making does have an infrastructure-level impact? How do I navigate between those two things elegantly? Then, when we get to the enterprise, we know that the rules change. It goes from like, "Oh, I have just a Docker image," to, "Oh, I have a Docker image, but I'm only allowed to run the ones that are in my custom registry, because that's where we do security and compliance scanning for all the images we run for the entire enterprise." How can I make sure that no one ever uses an external Docker image anywhere ever? Just the shape of the problems change. The answers to those questions are very difficult to solve with the system as we designed it, because it's all locked up in code. It's all siloed in different places and different tools and different things. That substrate of automation that tries to glue it all together doesn't really exist in a way that allows you to manipulate it in the way we really need to.

[0:21:39] LA: What's the solution, I ask, with a leading question?

[0:21:44] AJ: Well, look, I mean, I think the solution is that we have to step back as an industry and get creative again. I think people have forgotten that we built all of this. That these were decisions we made, because we were in the trenches usually, under fire, trying to just solve the problems that were in front of us. We've just accreted those decisions over time. Now, we've gotten to a place where we're successful enough and we're good enough at it that we can take a breath and we can evaluate the problem from a different point of view, which is now we actually do understand that the problem we're trying to solve in DevOps is this huge enterprise problem. It is this problem that has more complexity than we gave it credit for. It does. The culture and the technology are concomitant. What we need as an industry is to just have people step back for a second, take a breath and get creative again, and not look at what we've done so far and be like, "That's the limit of our ambition." Instead, we just need to get ambitious about asking, well, what if we broke it? What if it wasn't code? What if it was data?
What if it was Winglang, where the supposition is that it should just be a programming language designed for the glue? If we had a programming language that could then provide you this glue-level abstraction, then we could build applications that use the cloud as its native runtime. Wouldn't that be cool? We could build a simulator for it. Yeah, it would be cool. I don't know if I think that it's going to work overall. Do you know what I mean? Obviously, that's not the same bet I made with System Initiative, but stuff like that, that's the answer to the deeper leading question that isn't just the layup of System Initiative. Although I have an angle - I'm doing that very thing right now. I have what I think is an answer to that problem that I'm very convinced is super cool and will make a huge impact. As an industry, the thing we have to do is get interested and excited and creative again, saying, "No, let's break some glass. Let's change it up a little bit." Because the outcomes we're getting right now are decidedly average.

[0:23:44] LA: Is that the point where we're at, where it's like, we have to break it because it's not working? Or are we at the point where we've broken it and now ideas are starting to come through and let's figure out what the good ideas are and what the not good ideas are?

[0:23:59] AJ: Yeah, we're definitely in the bubbles phase. System Initiative, for example, I've been working on with my co-founders and other folks for five years. Because the problem I'm trying to solve is this problem we're talking about right now. Trying to build a better substrate, trying to build a better system for doing this work is an incredibly complex endeavor, because it's pretty good, right? If you start pulling away at some of those foundational levers, what you learn is that there's a ton of stuff you have to think through. There's a ton of implications for what you have to change. I think we're at the stage now where experts and people who really do understand how the space works and how these tools intersect can start to use these new paradigms and start to see like, okay, this could be great. This one has legs. If we push this further, it could actually be transformative in the way that we all hope that it is. We don't know for sure if it's transformative, because you only get to know that when the outcomes are done. You know what I mean? You can bluster all you want about how whatever new technology you have is going to change the world. In the end, it changes the world, or it doesn't. The only way to know is that other people pick it up and love it and carry it off into the universe. Yeah. I think we're at the stage where we need more good ideas.

[0:25:15] LA: We need more good ideas.

[0:25:16] AJ: Maybe bad ideas, too. We just need more ideas.

[0:25:18] LA: We've got some ideas of what to do, but we haven't yet figured out the direction that we want to take those ideas. We haven't figured out which direction the ideas need to go yet.

[0:25:28] AJ: Yeah. I think there's a couple of strains. System Initiative's point of view is that it's a user experience problem. We can solve that problem by basically turning everything into data, so we build digital twins of all of your stuff, digital things. You have an EC2 instance, or a Lambda function, or a VPC. We build a digital twin of that thing, and we let you interact with that twin in a simulation, so we can give you really fast feedback about whether or not what you've done makes sense. Is that VPC misconfigured?
Is that subnet in the wrong CIDR block? That stuff is stuff we can show you in real time, instead of waiting to try to write code and then apply it. Then we make that data reactive on a big hypergraph of functions. Every property, like every literal configuration property of every piece of the simulation, is the result of a function that takes inputs in and then puts out the value. We can use that to then use relationships between components to infer their configuration. For example, if a Docker image exposes a port number, we can use that same data to then configure a load balancer's pool. So that if I change the port number on the Docker image, it automatically reconfigures the load balancer's pool.

Then we can take all of that and we can use that to build new user interfaces. One example is that when you compose this stuff in System Initiative, we can give you a visual composer that looks a lot like building an architecture diagram. We can make that multiplayer. We can hop in together and collaboratively build our infrastructure and see all the different parts and pieces and how they relate, and then manage their state over time, which is incredibly cool. It's faster, it's easier, it's more extensible. It's really, really great.

Then you have folks like Wing, who say, this is a programming language problem. What we need is a new language of the cloud. Then you have a camp of people who say, this is mostly just a platform problem, and that if we could get a better API contract between the operations people and the developer people, then that would help solve these workflow problems. Then, yeah, I think those are the primary bets right now.

[0:27:36] LA: It's interesting that you see that. I'm a big fan of the idea that it's not a programming language problem. I've seen that happen so many times, that yet another language is the way to solve this problem. Instead, we create five others.

[0:27:53] AJ: I think you're probably right. In Elad's defense - he is the inventor of Wing - it is very novel. When you look at what he can do in terms of marrying a simulator, so when you write an application in Wing, he can build a simulator of that application that you can interact with in real time. That's really, really cool. Not a thing that's ever existed for any of these languages ever. I can't run my Erlang app in a simulator. That doesn't really exist.

[0:28:26] LA: Fair enough. If one of the core cultural problems with DevOps is to get dev people to realize that ops is important, and I need to do more ops and therefore, I have to understand more ops, it seems like, creating another language that is necessary to understand in order to understand how ops works doesn't seem like the right answer.

[0:28:50] AJ: I mean, this is why I invented System Initiative and not Wing. This is also why I'm a fan of Wing, because we might be wrong.

[0:28:57] LA: Yeah, fair enough.

[0:28:58] AJ: I don't know that we're right. I agree with you. But what we need is more invention. I don't need to shut down good ideas. In order to get good ideas, I need more crazy ideas. Because the good ones never sound good at first. It always sounds insane. System Initiative has taken five years to build, because we had to rebuild version control. We had to build an entirely new database engine that understands how to run these graphs and do reactive functions. We had to do secure function execution. We had to build a better UI.
We had to integrate authoring, because once you break version control, I can't have all the functions that you write live in Git. They have to live on the graph. On and on and on and on, there's all of this complex stuff. It's insane what I'm proposing. It's awesome. You have to just be willing to let it get crazy for a minute, because you might be right. It's the easiest thing in the world to look at new technology and be like, to be the grumpy guys from the Muppets, which I am a little. I also recognize that as much as I can be the grumpy people from the Muppets, I can also know that that's where good ideas come from. They come from bad ones. If somebody pitched you Airbnb cold and was like, "Would you rather go stay in someone's house that you don't know and you just rent it for a weekend, or do you want to stay in a hotel? And it'll be as big as hotels." You'd be like, "No. That's a dumb idea." That is not a good idea. It's a terrible idea. Most people's houses are awful. Why do we want to stay in them? It doesn't make any sense. Yet, you know? Airbnb.

[0:30:32] LA: It is an industry. Yeah. Let's talk about the System Initiative approach, because we talked a little bit about different ideas, but let's talk specifically about what is your idea and what do you think is the value of that idea?

[0:30:48] AJ: Fundamentally, I think DevOps is a collaboration problem. It's a collaborative sport. When you think about the user interfaces and domains that are complex, think about what are other domains, other things people do that have similar levels of complexity to this infrastructure plus application deployment field. How do those people work? How do they collaborate together? A really good place to look would be video games. If you look at the tooling that people use to build complicated video games, that's a multidisciplinary thing. You've got artists, voiceover artists, visual artists, modeling, software development, all that stuff is together. They have to work highly together in concert on the same thing in order to get that singular asset delivered. Sometimes it's delivered into the cloud with the same level of complexity that the enterprise applications we deliver have.

If you look at how those teams work, the experience of people working in Unity, for example, is pretty dramatically different than the experience of working in DevOps, right? If you're like, "I'm an operations person. Which tools do you use?" "Well, I check out the Terraform repository from Git." You check out the application repository on the other side. We collaborate in code review, but we can't actually see any of those things together. You compare that to somebody building a video game. If you built me visual assets and I was writing low-level application code, we see both of those things together at the same time, roughly in real time. We can change the model in real time. We can visualize it. We can see it. We can build blueprints. There's all this stuff we can do, because they designed a system to do it. Unity is itself a complex system designed to orchestrate the workflow of people building video games. We've never thought about DevOps as a thing that was a workflow we could orchestrate. That's what System Initiative really is. It's saying, well, what if it was a workflow we could orchestrate? If it was a thing that we could collaborate on doing, what's the foundation you have to build in order to make those experiences possible?
Our answer is, well, first off, I need a simulator, because the way we work now, the feedback loops are just too slow. Even if I gave you an interface like Unity.

[0:33:04] LA: Too dangerous. You don't want to make changes -

[0:33:06] AJ: Too dangerous.

[0:33:07] LA: - and not know what's going to happen. Yeah.

[0:33:09] AJ: Yeah. Imagine trying to build Unity over Terraform plan, right? It doesn't make sense. It would not be a fun experience and you wouldn't enjoy it. Then once you decided you wanted to do it and you hit the button, you're just like, what are you doing for the next 45 minutes? That's the first problem: how do you make the user experience better if the underlying substrate is misery-inducing? The answer has to be, we build simulations. That's what we do with race cars. If I can only take a Formula One car out and tweak it once a week, how can I get more testing in? The answer is I build a simulator and I run the simulator all the time. Then I use the simulator to inform what's possible before I ever get to the race. We can do the same thing. We had to figure out how to build a simulator and we did.

Then you know that it has to work for more complex enterprise problems. It's easy to build any automation the way you want to when it's a toy. When you think about the needs of a bank, if you're a global bank, you have a ton of systems that have been deployed over the course of 50 years, 60 years, 80 years. Those are critical systems. They run the global economy. They made a bunch of terrible life choices, and some incredible ones. They made them all for a reason. You have to be able to fit in and support all of that complexity, all of which you can't possibly understand as a tool maker, right? As somebody who builds tools, I can't understand all of that complexity, because it's just bigger than I possibly could do. You have to think, how do I build? How do I make the tool powerful enough that a person could mold it to their needs, regardless of the complexity of their needs? That's how we wound up with this big hypergraph of functions. Because I have to make every single piece of it programmable.

A good example is, eventually, with System Initiative, I have this model of all of the resources you've ever built, right? Think about it as infrastructure as code, which is the first thing that we're going to do.

[0:35:04] LA: More like, infrastructure as data is a better way.

[0:35:06] AJ: It's infrastructure as data, yeah. I've got this model of all your infrastructure. I've got the real-world data about the infrastructure at the same time. Now, let's say that the enterprise problem you want to solve is, I want to be able to say, how much incremental cost will this change add? In System Initiative, computing that is pretty straightforward, because what you would do is write a function that takes the component and the resource as inputs, right? Then it computes the cost based on whatever AWS says the cost is, spits out that number, and then aggregates that over the size of the workspace. Next thing you know, you could then visualize the cost analysis pretty straightforwardly. Every time I change the components, for example, if I change the EC2 instance size, that's the thing that determines the price. I can make that function reactive, so that when the instance size changes, it reruns the cost function. At which point, the cost function spits out the number, and then I can build a UI that shows you, "Hey, here is the incremental cost of this proposed simulation."
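To make that shape concrete, here is a minimal TypeScript sketch of the pattern Adam is describing, where a configuration property is the output of a function over its inputs and an aggregate like workspace cost is just another function over those outputs. The component shape, the pricing table, and the function names are hypothetical illustrations, not System Initiative's actual API.

```typescript
// Sketch: a derived property ("estimated monthly cost") computed by a function
// over a component's inputs, then aggregated across a workspace. Re-running the
// functions after an input changes stands in for the reactive graph described above.

// Illustrative on-demand prices; a real system would pull these from the AWS price list.
const HOURLY_PRICE_USD: Record<string, number> = {
  "t3.micro": 0.0104,
  "m5.large": 0.096,
  "m5.xlarge": 0.192,
};

interface Ec2Component {
  name: string;
  instanceType: string; // the input property the cost function reacts to
}

// The "function behind the property": inputs in, derived value out.
function monthlyCost(component: Ec2Component): number {
  const hourly = HOURLY_PRICE_USD[component.instanceType] ?? 0;
  return hourly * 24 * 30;
}

// Aggregate the derived property over the whole workspace.
function workspaceMonthlyCost(components: Ec2Component[]): number {
  return components.reduce((total, c) => total + monthlyCost(c), 0);
}

const workspace: Ec2Component[] = [
  { name: "web-1", instanceType: "t3.micro" },
  { name: "db-1", instanceType: "m5.large" },
];

const before = workspaceMonthlyCost(workspace);
workspace[0].instanceType = "m5.xlarge"; // a proposed change in the simulation
const after = workspaceMonthlyCost(workspace);

// This delta is what a UI could surface as the incremental cost of the change.
console.log(`incremental monthly cost: $${(after - before).toFixed(2)}`);
```

The same wiring is what the earlier Docker image example implies: the load balancer's pool is a derived property whose function takes the image's exposed port as an input, so changing the port re-runs that function and updates the pool.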
If you hit the apply button, that will change in the real world. If you think about how to build that without System Initiative, it's impossibly hard. You're like, "Well, we could try to grab all the data from AWS. We could try to go backwards. We could try to do it. How do I understand your intent?" But with System Initiative, it really does boil down to, well, you can write a function. What's the data? What feeds it? Then how do we aggregate the data? Then, how do we build an interface? That's so much more straightforward.

[0:36:36] LA: You build a database. You build a simulation engine that simulates what the database represents. Then you build a UI that allows you to update and modify that database.

[0:36:48] AJ: Yup. And the code that runs on the graph.

[0:36:51] LA: Right. Right. Then there's another piece left there, which is, for lack of a better word, I call it the synchronization engine. The thing that says, "Okay, here's the way my virtual world wants the real world to look. Here's the way the real world is really looking. What's the differences, and then make the differences work?" That synchronization engine, that's the other secret sauce that you need in order to make this work.

[0:37:16] AJ: Yeah. But it turns out, and this is mind-blowing and very upsetting to me, because I'd spent most of my life believing that it wasn't this way. It turns out that once you can visualize all of this information, the need for that reconciliation loop - for example, in Terraform, one of the promises people make is that if it's all in code and, let's say, I needed to swap regions. I needed to redeploy from US East to US West. The promise of Terraform is that I could just change the region variable, and then it would figure out what to do. Then it would deploy my things and destroy the old ones. Now, anyone who's ever tried this will never do that, because that is a disaster. It probably won't work. It definitely will destroy things you wish it didn't destroy. The amount of complexity in that motion is crazy high. When it works that way, because Terraform does exactly what you just said, it bakes the reconciliation loop into the model. It's abstracting all this infrastructure work into a thing that behaves in a predictable way. It's predictable the way Terraform thinks predictability should be, which is if it changes, I should destroy you, right? Roughly. Somebody's listening who loves Terraform. He's like, "It doesn't always destroy it," and that's true. Big picture.

In System Initiative, because you can visualize all this stuff, and because you can see it really easily, what you would do is click on the region. You would click on US East. Inside, that is a huge frame on a big infinite canvas. You could just hit control+C. Then you could move to another spot on the canvas, and you could hit control+V. It will copy and paste the entirety of the configuration and unmatch all the resources, but keep all of the basics the same.

[0:39:06] LA: Still just within the virtualized world.

[0:39:08] AJ: Just within the simulation. Then all the things that are no longer configured correctly turn red. They're like, "Oh, this thing's now not configured right. This thing doesn't work, and this thing won't work." Then you fix them, and then it'll give you a big list of all the actions that are necessary in order to reconcile what you have said you wanted with the outside world, and the order it should do them in.
It'll be like, "Oh, I got to create this VPC, and then I got to create the subnets, and then I got to do this, and then I can create the instances, and then I can do that." It can become very imperative, right? The declarative model would tell you, well, you just got to - what do you do? Well, I got to change the declaration that says all these things should be in east, to west. Although, that declaration doesn't actually match to a function that AWS provides, right? AWS doesn't let you do that at all. You got to figure out how to do that. You still have to figure out how to do it. Because we've made it so easy to just visualize what happened, and then you can fix it without having to do it, that failover becomes not so bad. It's actually straightforward. But you can't even really see it, until I give you cut and paste, right?

[0:40:14] LA: Fair enough. Yeah.

[0:40:15] AJ: If I say it just out loud, if I'm like, "Oh, what you should do is drop the declarative model," and instead, it's way better if it's just the imperative activities that you need to do in order to make the transition, you'll be like, "You're insane." That's obviously worse, but it's not. It's better.

[0:40:30] LA: Well, it's only worse because you had the declarative model that you were able to use as a simulation to know that it was correct before you determined what the differences were.

[0:40:40] AJ: Yeah. But in this case, the differences are the relationship between, for example - this resource doesn't exist is a pretty clear one. The action to take is very clear, which is create it, right?

[0:40:52] LA: Right. Not always that clear, though, right? It's like, the security policy needs this flag set for this moment in time, because of some change I made five layers deeper.

[0:41:05] AJ: Yes. In those moments, what System Initiative allows you to do is make the surgical change that's like, "And I should run this action right now. Just do this right now. Do this one thing." Maybe that's update the tags. Maybe it's restart the thing. A great example of this in practice is all of the applications that run System Initiative have dynamic telemetry inside them. There's a signal. If you send them SIGUSR1, they'll change the trace level down. We can make the services go from warning to info to debug to trace. One of our engineers added that functionality into System Initiative by just adding a function at all the different layers you'd want to do it. We can click on an AWS region and be like, change the tracing level of all System Initiative applications that are in this region down by one. It just works. It's very imperative. It's because it knows what to do and you can just write arbitrary functions that understand how to do what you need to do. It also works if you click on the individual application and go, change the trace level of my individual application, and that works just fine.

If you imagine that in a declarative world, it's pretty tough, because I have to figure out, how do I map that? Is that a destructive state transition to declare that that's true? That property doesn't even make sense to declare, because it only really exists for me. How do I even think about modeling it? Whereas, in System Initiative, it's just - yeah, you can just model it. It's a simulator. Then you can use the simulator to change the real world. The reconciliation makes more sense, because I can visualize the delta.
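As a concrete illustration of that dynamic trace-level idea, here is a minimal Node.js/TypeScript sketch in which a Unix signal steps a logger's verbosity one level at a time. It uses SIGUSR2 rather than SIGUSR1, since Node reserves SIGUSR1 for its inspector, and it only illustrates the mechanism, not how System Initiative's services actually implement it.

```typescript
// Sketch: a signal handler that steps the process's logging verbosity one level
// at a time, so an operator (or an imperative action) can turn tracing up or down
// on a running service without redeploying it.

const LEVELS = ["warn", "info", "debug", "trace"] as const;
type Level = (typeof LEVELS)[number];

let current: Level = "warn";

function log(level: Level, message: string): void {
  // Emit a message only if its level is within the currently configured verbosity.
  if (LEVELS.indexOf(level) <= LEVELS.indexOf(current)) {
    console.log(`[${level}] ${message}`);
  }
}

// Node reserves SIGUSR1 for its inspector, so SIGUSR2 is the safer choice here.
process.on("SIGUSR2", () => {
  current = LEVELS[(LEVELS.indexOf(current) + 1) % LEVELS.length];
  console.log(`[signal] verbosity is now "${current}"`);
});

// Keep the process alive so it can receive signals:
//   kill -USR2 <pid>   steps warn -> info -> debug -> trace -> warn ...
setInterval(() => log("debug", "heartbeat"), 5000);
console.log(`pid ${process.pid}: send SIGUSR2 to change the trace level`);
```

Attaching that kind of handler at every layer of a stack is what makes a "turn up tracing for everything in this region" action possible: the action simply delivers the signal to every matching process.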
In tools like Terraform, or Pulumi, or Chef, or Puppet, by contrast, you never visualize the delta, right?

[0:42:38] LA: There's no visualization that makes sense, yeah.

[0:42:41] AJ: It doesn't even make sense that you would, because the workflow is so - it'd be crazy. If you did it, no one would want it. Do you know what I mean?

[0:42:50] LA: Right, right. You act on the model and then you use your - you do imperative changes in the small.

[0:42:58] AJ: Yes.

[0:42:59] LA: That's really the key there: you don't need to know how to take this representation of the world and move it to another region. All you need to know is that, well, to move this security group to a different region, I have to perform these actions and do this. That's all you have to know. As long as you set up the model, you copy paste it and resolve all of the simulation errors, and assuming your simulator is good enough, once you do all of that, you know the structure that the system needs to be in, then the implementation is very, very simple.

[0:43:34] AJ: You got it. Then you add in multiplayer. Now, it works like Notion, it works like Linear, or it works like Figma. Now, when you're doing this, you can have five people log in at the same time. I can see your cursor moving around. When you watch our own folks build System Initiative in System Initiative - they built the SaaS platform with System Initiative - it's an incredible thing to watch. If you get them to share their screens at the same time - Discord can do this, where you can have multiple screens up at once - it's crazy to watch them build things, because they do it in real-time together in different sections. They're like, "Oh, you go build the public one. I'll go build the private subnet. I'll launch this thing." They're working together in this canvas, and they're pulling the building blocks together. It's incredible.

Then the other thing that's interesting is they tend to work now in much smaller batches. We've discovered that with the existing tooling, because it's so expensive to run and takes so long, you tend to batch up more work, right? Because it's going to take half an hour to put a change through. When it's relatively instantaneous, you just start to do small, imperative things. You're like, "Oh, yeah. I don't have to get the whole app running in one metaphorical PR. I can get that one resource working. I can tweak it. Then I can do the next step. Then I can do the next step. Then I can do the next step."

[0:44:54] LA: We know from reliability science that making many small changes is much, much better, and the system as a whole is much more efficient and much more successful, than doing large changes all at once.

[0:45:07] AJ: Yeah. Because in the end, what I wind up with is the same, metaphorically, as if you had written all that code, as if you had done all those things. It's still version controlled. It's still a model I can attest to. It's still a thing I could copy, I could move it around, I can do stuff to it. I can still operate on it the same way I would have other things, but it's built up incrementally. Yeah.

[0:45:26] LA: Let's talk now then about that fourth pillar that I talked about, the up-and-coming pillar, and how it might affect DevOps. Let's apply it specifically to System Initiative. That is, the role of AI. You've got these large data models that describe how systems work.
You've got these imperative actions in the small, and a simulator that makes the whole thing work together. Where does AI fit into that, as far as a tool to assist you in building, deploying, and operating in the DevOps manner?

[0:46:01] AJ: Yeah. We'll see, which maybe feels like a cop out. I think you have to look at AI as technology. You have to look at it as like, what is the technology good for? What does it do today? Then, what do we believe it could do tomorrow? Then, what can it do tomorrow only with some humongous leap forward in the technology, or in the shape of how those things work? Because it's not just one technology, right? There's LLMs, there's machine learning. There's this huge panoply of approaches and techniques that get us to the answer.

[0:46:30] LA: One of the things I think about that AI gives you, that I think applies in your case, is the ability to encompass large quantities of data and assimilate it down into incremental, small answers.

[0:46:45] AJ: Yes. I believe that's one of the places you'll first see AI in System Initiative: I have this huge data set. Can you summarize for me what's happening? Can you tell me what happened over the last 24 hours? Those kinds of questions, I think it'll be lovely to be able to ask a system that in relatively plain English and get an answer. If you think about the impact of that on corporate-level reporting, it could be pretty dramatic - your ability to be like, "Oh, well, what was happening in the entirety of the infrastructure at noon?" Which is a question you can imagine asking as an imperative query. But if you think about summarizing what it meant, the AI would probably do a better job of giving you some meaning behind those things. I think that's an obvious place.

Another is that having that really one-to-one, high-fidelity model of the outside world does enable you to think about letting it make changes to that in a safer way, right? Because I'm applying it only to the hypothetical model, instead of the real thing. Maybe there's some things you could do where you think about letting those systems have a little bit more impact in terms of raw control.

I think one of the things that the current crop of systems doesn't do super well is precision. That's because what it does is fundamentally creative. That's why it's great for creative stuff. That's why it does things that computers couldn't do before. That's why it's so transformative and compelling. I think in the world of application deployment, infrastructure, in that universe, things tend to not work if they're wrong. They're either correctly configured, or they're not. What correctness means is usually different depending on each environment. What's a correct PostgreSQL configuration for my application is different than the correct PostgreSQL configuration for yours. It's because the way your application works is different than mine. The size of the shared buffers you need is different, because of the type of query you make, whatever. It's not that we couldn't use those systems to help us figure out those tunings. We probably could. There's some interesting papers doing that stuff right now, specifically with PostgreSQL. When you think about saying, "Hey, can I ask the AI to just do this for me?", the answer to that is probably not.
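For the summarization case described here, the wiring is mostly a matter of handing the model's event data to a language model with a narrow prompt. Below is a hedged sketch of what that might look like, using the OpenAI Node SDK as an assumed example; the event shape, model name, and prompt are all hypothetical, not a System Initiative feature.

```typescript
import OpenAI from "openai";

// Hypothetical change events pulled from the last 24 hours of a workspace's history.
const events = [
  { at: "2024-05-01T09:12:00Z", who: "adam", what: "changed instance type of web-1 to m5.xlarge" },
  { at: "2024-05-01T11:40:00Z", who: "lee", what: "added subnet private-b to vpc-main" },
  { at: "2024-05-01T17:05:00Z", who: "adam", what: "applied change set: load balancer pool updated" },
];

async function summarizeEvents(): Promise<void> {
  const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed model choice, for illustration only
    messages: [
      {
        role: "system",
        content:
          "Summarize these infrastructure change events for an operations audience. " +
          "Report only what is in the data; do not invent changes.",
      },
      { role: "user", content: JSON.stringify(events) },
    ],
  });

  console.log(response.choices[0].message.content);
}

summarizeEvents().catch(console.error);
```

The precision concern raised in the conversation is why the prompt confines the model to restating what is in the data rather than proposing configuration values.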
Interestingly enough, one of the early experiments that led to System Initiative, and how we wound up with these high-fidelity models, was we were trying to make a system where you could just say, "Deploy my application." How much could I infer from "deploy this application" to then get you a working system? We built these really high-fidelity models that then allowed you to basically say, "Deploy my application," and then we would just basically use constraint theory to tell you what the right shape would be based on the constraints you fed the system. It worked, but it was really annoying to use, because you wound up playing whack-a-mole with constraints in order to get the system to actually say what you wanted. You knew that what you wanted was a thing that had this much memory and this much blah, blah, blah. You knew what the answer was already. Everyone always knew the answer.

[0:49:46] LA: You made it harder to come to the answer, versus making it a better answer. Yeah.

[0:49:50] AJ: Yeah. I think there's an interesting problem in our domain where it's like, okay, I don't know that it's that much better. Most of the time, it won't be better to just say, "Give me infrastructure of X." "Give me a Kubernetes cluster" is not a good enough answer to actually have a working Kubernetes cluster with an application I can deploy. There's a bunch more we have to do. Once we have to put all of those things also into the mix, and they all have to work together and they all have to be correctly configured, inserting the creativity and the wackiness that sometimes comes out of LLMs becomes a big problem for the domain. I think when you look at summarization, it's a layup. I think when you look at data exploration, it's a layup. I think when you look at targeted actions, reboot this thing for me, it's a layup, right? When you start to get to deploy my application for me and figure it out, probably not.

[0:50:45] LA: Yeah, the more creative you get, the more variability you get. Variability is your failure point when it comes to that.

[0:50:53] AJ: It's death in configuration and infrastructure.

[0:50:55] LA: Exactly.

[0:50:57] AJ: There's places where that creativity - for example, creativity in word choice, in summarizing what happened yesterday between two time periods - is super valuable. Creativity of configuration, value setting, not helpful.

[0:51:12] LA: I wish we had a whole other hour here to talk. This has been a great conversation, but we are over time. Unfortunately, I think we need to come to an end. I would love to follow up sometime in the future with you. This has been a great conversation. I want to thank you so much for your time.

[0:51:25] AJ: I loved it. So fun.

[0:51:26] LA: Great. Thank you. Adam Jacob is the creator of Chef and the CEO of System Initiative, a company dedicated to improving and enhancing our overall DevOps experience. Adam, thank you for joining me. I had a great time. Thank you for coming on Software Engineering Daily.

[0:51:44] AJ: Oh, it's so my pleasure. I had a great time, too.

[END]