[00:00:02] ANNOUNCER: This episode is hosted by Jordi Mon Companies. Check out the show notes to follow him on Twitter. [00:00:07] JMC: Hi, Eric. Welcome to Software Engineering Daily. [00:00:10] EB: Great to be here. [00:01:52] JMC: We are at Open Source Summit North America 2023, hosted in a beautiful building with amazing views. And we will talk later about what you discussed in the talk you gave yesterday here at Open Source Summit. But first I want to go through a bit of your career and the interests that have motivated it, professionally and in research, at the university. I would summarize you professionally as an expert in highly scalable distributed systems. I know I'm leaving a lot outside of that definition. But for the purpose of this conversation, there are two insights, two areas of computer science and systems, that you are really renowned for. One is the CAP theorem, and we'll move on to that in a minute. But the other one is clustering machines. What someone like me doesn't really know is: what was so insightful about proposing horizontal scaling, adding more machines and managing the mess that can become, rather than powering the internet, in this case search, since you were in the search business back in the day, with a more powerful mainframe or just one more powerful single machine? What was the insight there? [00:03:19] EB: There's a few things there. There had been a little bit of use of clusters for a few things. But what struck me is that when the internet and, really, I mean, the web arrived, it was clear that services for the web were going to need to be very large scale. If I think back to that era, this is mid to late 90s, the biggest company had maybe 300,000 employees, right? And the biggest system you'd have to worry about is something like that, with a few exceptions, like the airline ticketing system, which is famously giant and complicated and did historically run on mainframes. But also, all those systems, and all the banking systems as well, were turned off every night. Literally, it used to be that you couldn't go to the ATM between something like midnight and 3am or 5am because the banks' computers were being used to do the accounting for the day's activities, right? It's a very different view now, where you expect 24x7 availability. And that's the big factor. The other big factor is just the sheer scale, as I said. Once you get to millions of users, or billions of users now, many applications like Gmail have billions of users, billions with a B. There is no computer big enough to handle that. The only way to approach that scale is with lots of independent pieces. And when you have that many pieces, it's also true that you're going to have a high failure rate. Even one in a million means you have problems every day. And so, the clustering approach was really first about availability: how do you make the service available even if the components are not completely reliable? And second, about scalability. And a third one, which people don't think much about but is important, is incremental scalability. Meaning, if I have this much load today and then I do some marketing and my load goes up the next day, what if my big computer is not big enough, right? How do I add capacity? And the beautiful thing about clusters is you add capacity by adding machines, right? Incremental, on-demand scalability is a huge part of what is now cloud computing. Because cloud computing is the same idea.
You can add capacity, but now you can do it with a GUI and not by actually installing machines in a rack. Virtual machine elasticity is a whole lot nicer than physical machine elasticity. But even physical machine elasticity was an important asset to have for incremental scalability. [00:05:49] JMC: I guess the size of the data centers compared to the ones today was minor, right? They were very small. [00:05:56] EB: Absolutely tiny. Yeah. [00:05:58] JMC: Okay. [00:05:59] EB: Absolutely tiny. [00:05:59] JMC: So then let's move on to the CAP theorem. This is the theoretical part of this, right? I guess this is the more theoretical insight, less hands-on, although it has a lot of implications that we'll talk about in a minute. But it is basically the representation of what you just said, right? That you can optimize, well, I'll let you explain, but you can optimize for either availability or consistency, but not both, in a distributed system, right? Because you need to take for granted that there will be faulty machines or there will be faults in the system. And then those two are trade-offs and you cannot have both consistency and availability. Could you probably explain the theorem much better than I do? [00:06:45] EB: Yeah. I'll start with a little bit of insight, which is that the systems we were building at Berkeley and later at Inktomi were large-scale distributed systems. Both the search engine and the way it worked, and later the kind of CDNs and proxy networks [inaudible 00:07:00] overlay networks. Now they're called CDNs. There were a variety of names for them before that. Both of those are distributed systems in which availability is much more important than consistency. And so, we made choices in favor of availability, forfeiting some consistency. And I was quite clear in my mind that those choices were fundamental. And one day, I was getting ready to teach this material in lecture and I'm like, "I think we can just prove that this is true." And I did a kind of proof in my head and gave it in lecture the next day. It was in 1998. And it was a great lecture. It went over really well. And in fact, that was the first presentation of the CAP theorem. So to summarize the theorem, it's basically that you can only have two of the three properties you want, which are consistency, which means you can write and everyone sees the write; availability, which means that you can continue to do writes, not just reads, all the time; and partition tolerance, which means if your system gets partitioned, you'd like to have those other two properties anyway. And you pretty much prove that you can't have all three if you think about it this way: if I have a partition and I do a write, the write, by definition, goes to one side, which means the other side isn't going to see that update. You can either take the other side down, in which case it'll be consistent but not available. Or you can leave it up, in which case it's inconsistent but available. And at Inktomi in particular, we basically ended up building essentially a database that in some ways is maybe the first NoSQL database. Well, we didn't call it that. But it's not that it didn't have SQL. It's that it was a little bit more optimized for availability than for consistency. And when I talked about the Inktomi database, which I did at SIGMOD, the big database conference, in 1997, a lot of people were a little dismissive of the fact that it didn't have consistency. And that's the C in ACID.
And so, you're on risky ground to say that you don't need consistency. [00:09:18] JMC: Especially at a database conference, right? At that point in time. [00:09:20] EB: At a database conference. And I wasn't actually saying you didn't need it. I was just saying there are systems where availability is more important and you can't have both. But at SIGMOD, I didn't have quite the clean formulation in my head yet. By '98, I did. And certainly, I presented it publicly first in about 2000. And at that point it was mostly to try to get people to understand that you actually do have to make this choice, right? Because so many people told me, "Oh, I have consistency and availability." And they'd say, "Oh, because I'm doing snapshots." And I was like, "Well, if you're doing a snapshot, by definition, it's not consistent, because it's a slightly old version and you're missing some recent updates." And, "Oh, but I can do active-active repair." "That's still not actually consistent." And by the way, you can't do active-active repair if you have a partition. [00:10:13] JMC: You can what? [00:10:13] EB: You can't do that kind of database backup repair if those two sides are partitioned. Anyway, I was right on that. And eventually, I think this led to the NoSQL movement, which I didn't name or have that much to do with. But it was kind of like permission, in some sense, to look at all of these more interesting options around availability. [00:10:34] JMC: Exactly. So then, in general, distributed systems became popular and powered the web, and what we understand as the general availability of all the services that run our lives, basically. And therefore, the CAP theorem became really, really relevant. I mean, it was at the beginning. But it became pervasive in every way, because everything is run through software, and most software nowadays is a distributed system that is highly available and so forth. So then, 12 years down the line, you published this article, a really interesting long-form piece in InfoQ, in which you collected a bit of, well, again, the insights, nuances and feedback from the community, let's say, from 12 years of application of the CAP theorem across the board by many different players. And you had a lot to say about that, about misunderstandings, for one. But also about the whole range of decisions you can take between optimizing for one or for the other, for availability or for consistency. It's not binary at all, right? [00:11:42] EB: Yeah. I wrote that paper for a few reasons. The main reason is, again, as you said, nuance. All the variations of the ways you can get mixed partial availability and partial consistency. How can you trade them off? Trade them off for different users, for different operations. Lots of options there. And frankly, in 2000, people weren't ready to have that discussion, right? They were still discussing whether you have to give up consistency at all. And in fact, honestly, one of the reasons I wrote the paper is because people would tell me that the CAP theorem is too simple, that it doesn't cover all this nuance. Then you go look at the original presentation, and it actually says it's a spectrum and you can make these trade-offs. But no one remembers that, because the focus was on the binary part, right? So 10 years later, the paper is like, "No. It really is a spectrum and you have lots of options," and the strategies you can use to recover from partitions, which also hadn't been discussed much.
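To make the partition argument above concrete, here is a minimal sketch in Go of the choice a replica faces during a partition. It is purely illustrative, not anything Inktomi or any real database shipped, and all the names in it are made up for the example: the replica either refuses writes it cannot replicate (consistent but unavailable) or accepts them locally and diverges (available but inconsistent).

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is the CAP choice a replica makes while partitioned from its peer.
type Policy int

const (
	PreferConsistency  Policy = iota // refuse writes that cannot be replicated
	PreferAvailability               // accept writes locally, reconcile later
)

// Replica is a toy key-value replica; "partitioned" models a network split.
type Replica struct {
	data        map[string]string
	peer        *Replica
	partitioned bool
	policy      Policy
}

// Write applies an update and tries to replicate it to the peer.
func (r *Replica) Write(key, value string) error {
	if r.partitioned {
		if r.policy == PreferConsistency {
			// Consistent but unavailable: the caller sees an error.
			return errors.New("partitioned: refusing write to stay consistent")
		}
		// Available but inconsistent: the peer will not see this update
		// until the partition heals and some repair process runs.
		r.data[key] = value
		return nil
	}
	r.data[key] = value
	r.peer.data[key] = value // naive synchronous replication
	return nil
}

func main() {
	a := &Replica{data: map[string]string{}, policy: PreferAvailability}
	b := &Replica{data: map[string]string{}, policy: PreferAvailability}
	a.peer, b.peer = b, a

	a.partitioned, b.partitioned = true, true // the network splits
	_ = a.Write("user:1", "new-email")        // accepted on a only

	fmt.Println(a.data["user:1"], "vs", b.data["user:1"]) // replicas now diverge
}
```

Flipping the policy to PreferConsistency turns the divergence into a refused write instead, which is the two-out-of-three trade-off the theorem formalizes.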
[00:12:44] JMC: What catches your attention with the CAP theorem is the binary headline, right? It's like, "Oh, it's A or B," right? Well, in this case, C or A. But, yeah, obviously there's a paper, there's research, there's a further explanation that covers all this. That's what you go about explaining in that article. Do you have any updates to it now that it's been 23 or 24 years? When did you say you formalized it? [00:13:06] EB: I presented it in 2000. [00:13:09] JMC: Okay. 23 years after that, do you consider that the InfoQ article is still up to date, and do you stand by it? [00:13:17] EB: I still stand by it as the best paper on the CAP theorem. I think there's still a lot of nuance in there. If you reread it today, it looks pretty good. But there are better techniques available today. Joe Hellerstein has done some nice work showing, basically, that if you build systems that are monotonic, meaning values only go in one direction, then it's easy to make systems that are highly available and still eventually consistent. But that's a constraint on what you build. It's still pretty powerful, though. And there's also CRDTs, which I did talk about 10 years ago. [00:13:58] JMC: What are those? [00:14:01] EB: I have a hard time remembering. They are conflict-free replicated data types. [00:14:09] JMC: Conflict-free replicated data types. Hmm, sounds interesting. Okay. So then you're at Google. By the way, I can see your motivation, and anyone's motivation, to come up with solutions for such complex problems. Because if anything is common to any distributed system, it's that it's incredibly complicated, and there's high entropy and so forth. I can see the motivation of someone like you, and, again, anyone involved in that industry. Because there's, well, genuine interest and professional interest and also an economic one, right? You were the leader of a company that went public and so forth. But I'd like to understand more your motivation as a lecturer. I mean, again, there's curiosity, and universities are there for that, to expand knowledge. But there's not such a big carrot after that. What drives a lecturer in computer science at Berkeley to go after such endeavors? [00:15:14] EB: Well, I love Berkeley and I loved my 20, almost 25 years there. Actually, I'm still there as Emeritus. But I joined pretty young. I was 26 when I started as faculty. In fact, I occasionally had grad students that were older than me. And this was right as the web was starting. And so, I wanted to make a difference. I wanted to have impact. I do think research is one avenue to do that. And I've done a lot of work at Berkeley, which we won't even get to today, on totally different topics. In the moment we were in, seeing the rise of the web, I'm like, "It's going to need giant-scale services. The only way to build them is clusters." And I felt like we were uniquely positioned to go solve that. And the reason for doing a search engine, honestly, was because the search engine, we estimated correctly, would be the first service that really needed large scale. In fact, the other search engines of the time were all built on single large machines. The most famous being AltaVista, where it was big DEC machines, as a way to show off the power of those machines. But Infoseek was on big Sun UltraSPARC 10000s, I believe. Also big machines. [00:16:30] JMC: Wow.
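As a short aside on the CRDTs mentioned a moment ago, here is a minimal sketch of one of the simplest, a grow-only counter, in Go. It is illustrative only and not taken from any particular library: each node increments only its own slot, and replicas merge by taking element-wise maximums, so concurrent updates on both sides of a partition converge without coordination.

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: one monotonically increasing
// slot per node, merged by element-wise max.
type GCounter struct {
	counts map[string]uint64
}

func NewGCounter() *GCounter { return &GCounter{counts: map[string]uint64{}} }

// Increment bumps this node's own slot; nodes never touch each other's slots.
func (g *GCounter) Increment(node string) { g.counts[node]++ }

// Value is the sum of all slots.
func (g *GCounter) Value() uint64 {
	var total uint64
	for _, c := range g.counts {
		total += c
	}
	return total
}

// Merge folds another replica's state in by taking the max of each slot.
// Merge is commutative, associative and idempotent, so replicas converge
// regardless of the order or number of times states are exchanged.
func (g *GCounter) Merge(other *GCounter) {
	for node, c := range other.counts {
		if c > g.counts[node] {
			g.counts[node] = c
		}
	}
}

func main() {
	a, b := NewGCounter(), NewGCounter()
	a.Increment("a")
	a.Increment("a")
	b.Increment("b") // concurrent update on the other side of a partition

	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value()) // 3 3: both replicas agree
}
```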
[00:16:32] EB: So now it's common that all cloud services and internet engines run on clusters. But Inktomi was the only one at the time. [00:16:39] JMC: Nice. [00:18:44] JMC: So then, eventually, you joined Google and you got involved in several initiatives there. Google's famed for many things, but from a software engineering perspective, it's that their tooling, the company's tooling, is really good, and they are particularly secretive about it, for good reasons, when something is good. But you got involved in what eventually becomes Kubernetes, a project that has become, well, probably the most successful in cloud native land, and in distributed systems probably. I need to check that statement, but I would back it. How did you go about it then? Was it another insight that you had, in this sense? Why did you see that exploring the avenue of open sourcing an internal tool of Google, because it's famously known that Kubernetes is inspired by Borg, was the way to go? And did you foresee the success of Kubernetes? [00:19:46] EB: That's a great question. You can never really foresee this kind of success. You can hope. And certainly, I was hopeful. But foresee would be a strong statement. But I can say, again, as with the rise of the web, I could see the time was right. And what I mean by that is we had learned a lot from Borg. You couldn't really open source Borg. It's a beast with lots of legacy baggage that people wouldn't actually want to use, honestly. I mean, it works great for Google. We still use it for a variety of reasons. But the ideas that are in Borg were certainly strong. And we'd been using what I'll loosely call containers for a long time, even by the time of the start of Kubernetes, which for me started around 2013. The first public talk was in 2014. So those things were going on. We were using containers. But we were using containers of the Linux performance isolation kind. Meaning they were not the tar file that is inside a Docker image, which is a packaging mechanism. They were really the constraints you put on a process to make sure it stays within its bounds for performance isolation. [00:20:57] JMC: Isolation. Okay. [00:20:59] EB: Right. Most people were actually using VMs to do that at the time. Google, because it's the same age as VMware, historically hasn't used VMs, because they didn't exist when Google started. And my own work in the 90s on large-scale services also didn't use VMs, because they didn't exist, in the modern sense at least. It was all based on processes. Google was really much more like my work, where we have processes and there are many processes on one machine. And it's okay that they don't have security isolation, because they're all from the same company, right? And we had some other checks in there. We were using Linux containers as a way to make sure that, if you run 200 applications on the same machine, they don't stomp on each other too badly. We were well-versed in how to do that. We had a packaging mechanism, which we still use, which is not Docker. But when I saw Docker rising, I said, "Ah, people are liking this container format. They're using it really to decouple applications from infrastructure, which is very similar to how we were using containers inside." I said, "Well, if we just combine Linux containers with Docker packages, we have a pretty nice combo." I just kind of said, "That's what a container should be. And let's make that happen."
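A rough sketch of the "performance isolation" half Brewer describes, the constraints placed on a running process rather than the packaging, using the Linux cgroup v2 filesystem interface. This is an assumption-laden illustration, not Google's mechanism: it presumes a cgroup v2 host, root privileges and a made-up group name, and the Docker-style image would supply the packaging half he combines it with.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// confine places an already-running process under CPU and memory limits
// via the cgroup v2 filesystem. It says nothing about packaging; that is
// the part a Docker-style image format adds.
func confine(pid int, memBytes int64, cpuQuotaUsec, cpuPeriodUsec int) error {
	// Hypothetical group name, for illustration only.
	group := "/sys/fs/cgroup/demo-isolation"
	if err := os.MkdirAll(group, 0o755); err != nil {
		return err
	}
	// Cap memory: the kernel reclaims or OOM-kills beyond this limit.
	if err := os.WriteFile(filepath.Join(group, "memory.max"),
		[]byte(strconv.FormatInt(memBytes, 10)), 0o644); err != nil {
		return err
	}
	// Cap CPU: at most cpuQuotaUsec of CPU time per cpuPeriodUsec window.
	cpu := fmt.Sprintf("%d %d", cpuQuotaUsec, cpuPeriodUsec)
	if err := os.WriteFile(filepath.Join(group, "cpu.max"), []byte(cpu), 0o644); err != nil {
		return err
	}
	// Move the process into the group so the limits apply to it.
	return os.WriteFile(filepath.Join(group, "cgroup.procs"),
		[]byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	// Confine this process to 256 MiB and half a CPU (50ms per 100ms window).
	if err := confine(os.Getpid(), 256<<20, 50000, 100000); err != nil {
		fmt.Println("cgroup setup failed (needs root and cgroup v2):", err)
	}
}
```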
And then once people like containers, then you have room for all the secondary things like orchestration. [00:22:18] JMC: Yes. Now the problem starts, right? [00:22:20] EB: Right. And so, we were very clear – [00:22:23] JMC: So if I get you right, you really liked the isolation of the sort of Linux containers that Google was using, but it was missing a bit the dependency packaging, the library packaging, the neat packaging that the Dockerfile provides and so forth. And the combination of those was the perfect insight to realize, "Oh, this is going to explode in a way. Or this is going to be at least relevant. And now we need to cater to the orchestration and the scheduling of all this, because it can become a mess otherwise." Right? [00:22:53] EB: Well, Docker had done a better job than Google on managing libraries. Our general rule was, we'll tell you which library you can use and we will all use the same version, which you can kind of do in a very mandated kind of way. But it's not a great solution. Docker's solution, which is to allow multiple versions on the same machine, is a much better solution. And they had a great user experience where they just packaged it well. But really, I was looking at it and saying, "The decoupling is what I'm after." Because I don't really want to see the future of cloud be about having your application, your OS image and even your database combined on one machine as a monolithic thing, right? That's not the future. The future has to be something that's about APIs and services. And you shouldn't care about the OS you're on very much. And you shouldn't care about what else is on the same machine, ideally, right? Then you have a sea of resources that you run your applications on. And really, we had done a little bit of that with App Engine. Because App Engine gives you an abstract infrastructure. But it was too abstract. You couldn't write all the things you wanted to write. VMs are too concrete and annoying. And Kubernetes is trying to be the Goldilocks in the middle, where we'll give you an abstract infrastructure but it can run anything, right? It's equally flexible but much nicer to work on. And we were able to deliver on that promise. [00:24:18] JMC: Exactly. Yeah. You actually added a lot of – not a lot of, I'll take that back. A few constraints, the new idea of a pod and a few new features, right? But before we move on to that, I'm a huge supporter of GitOps for many reasons. Professionally, I've worked at Weaveworks. But I really like it as the operating model of the cloud. That's a really ambitious claim. And it turns out that, right at this conference, Brendan Burns, one of the co-creators of Kubernetes, has turned out to be, for me, well, he has probably been public about it for a while, a really strong supporter of GitOps. He gave a talk about it and he considers it along the same lines that I just described, kind of the operating model of the cloud. I'd just like to pull your thoughts on that. What's your view on GitOps in general? And do you see it as a good combination with Kubernetes? [00:25:21] EB: It's a great combination with Kubernetes. And I'm a big fan of that direction for lots of reasons. The big Googler on this is Brian Grant; he's had a lot to say on this topic. But it's really about a few things. It's a good building block, a much better building block than something based on, for example, a GUI, right? Which is kind of derogatorily called ClickOps, right? [00:25:47] JMC: ClickOps.
Yeah. [00:25:48] EB: Click things to do stuff. That's not bad. I'm always in favor of enabling automation. And GitOps is a clean way to enable automation. [00:25:58] JMC: Because another thing you're a big fan of is declarative systems, right? [00:26:01] EB: I am. [00:26:02] JMC: I guess it's a consequence of managing a lot of distributed systems, right? But for someone like me that comes a bit from the imperative world, and I think this is common across many areas of the software engineering universe, many programmers are more used to an imperative way of talking to a machine. It turns out that when you work with distributed systems, it's way better to actually declare a target state, a desired state, and let the system reproduce it, which is what Kubernetes, among others, excels at. But again, it feels a bit like, I'll mention AI, it's not AI, but it feels like something is running this for me and I'm not sure if I trust it. And the only thing I need to check is the definition of the system in a declared statement. And it feels awkward, again, for someone that comes from an imperative world in which I tell the machine each one of the steps it has to take in order to achieve something I have very clear in my mind. Here, I describe that thing and you take it from there. [00:27:08] EB: Yeah. It's worth talking about that philosophically a bit. It's certainly a core part of Kubernetes. I think one of the primary philosophical points of Kubernetes is you declare your intent and we reconcile the real world to match that intent as best we can. And the reason I think that's the right way to think about it is, in a distributed system, unlike controlling your laptop or desktop, you really have actions going on that you didn't initiate, right? The most obvious one is machines fail, right? And so, actions happen. We don't know if they were supposed to happen or not supposed to happen, right? At some basic level, any kind of automated system has to understand the difference between the intended state and the current state, right? If you're doing stuff imperatively on a system you control that doesn't fail, those two always match. You don't have to think about it, right? But in the presence of failure or other random changes, they can differ. In which case, I really need to know: is there one less machine today because that's the way you want it? Or because something happened and you would really like it to be back the way it was? And without a declarative intent, you can't tell the difference. That's, I think, pretty fundamental to eventual consistency and, in general, to dealing with partial failure. [00:28:31] JMC: Exactly. Yeah. It's the beauty of permanent reconciliation, right? Having a system always striving to match the end state, the production state, whatever it might be, with a declared state. Yeah. So then, let's talk about the main ideas that went into Kubernetes. Because it has its constraints, very purposefully built in for that reason. It has new features and so forth. Could you walk us a bit through this? I think the talk in 2018, at least, was called continuous evolution. And in a way, you go through the build process of applications in Kubernetes. [00:29:20] EB: Kubernetes has definitely evolved over time. But it was always designed around the notion that we need to be able to upgrade services online.
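A minimal sketch of the declare-intent-and-reconcile idea just described: a loop that compares a declared replica count with what is observed to be running and creates or deletes instances to close the gap. This is a toy in Go with invented names, not Kubernetes controller code, but it shows why declared intent lets the system distinguish a failure from an intentional scale-down.

```go
package main

import "fmt"

// DesiredState is what the operator declares; ObservedState is what actually exists.
// The controller's only job is to make the second look like the first.
type DesiredState struct{ Replicas int }
type ObservedState struct{ Running []string }

// reconcile compares intent with reality and acts to close the gap.
// Because it works from declared intent, it can tell "a replica failed and
// should be replaced" apart from "the operator scaled down on purpose".
func reconcile(desired DesiredState, observed *ObservedState) {
	diff := desired.Replicas - len(observed.Running)
	switch {
	case diff > 0:
		for i := 0; i < diff; i++ {
			name := fmt.Sprintf("replica-%d", len(observed.Running))
			observed.Running = append(observed.Running, name) // "start" a replica
			fmt.Println("started", name)
		}
	case diff < 0:
		fmt.Println("stopping", -diff, "extra replica(s)")
		observed.Running = observed.Running[:desired.Replicas]
	default:
		// Reality already matches intent: nothing to do.
	}
}

func main() {
	desired := DesiredState{Replicas: 3}
	observed := &ObservedState{}

	reconcile(desired, observed) // converge from 0 running to 3

	observed.Running = observed.Running[:1] // two replicas "fail" out from under us
	reconcile(desired, observed)            // the same loop repairs the loss

	desired.Replicas = 2         // the operator declares a new intent
	reconcile(desired, observed) // and the same loop scales down
}
```

Kubernetes controllers generalize this level-triggered pattern, repeatedly nudging observed state toward the declared objects rather than executing a one-shot imperative script.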
And the most obvious part of upgrading online is the load balancer. Because the load balancer should be up all the time, ideally. And it's the thing that's going to shift traffic onto the new version versus the old version. It's your control point for which version a particular user is seeing. But we didn't want to solve that in a very prescriptive way. The way it's solved in Kubernetes is kind of indirect, which is using labels. And so, the front end sends traffic to the pods that have a certain label. You can pick that label. And then, conversely, if I add that label to a pod, it will start getting traffic. If I take that label off a pod, it'll stop getting traffic. And those are good building blocks. I'll give an example. You're debugging something, and what you really want to do is not have it get any more traffic. But you don't really want to take the pod down, because you want to look at its state. You just take the label off, and it stops getting new traffic. But it's still alive and you can still interact with it. It's just not part of the live service anymore, right? And that just gives you some freedom for debugging. It also means you can have whatever load-balancing policy you want, which is separate, right? You can have whatever mixes you want for deciding how many pods to have for autoscaling, right? The frequency with which you add newly labeled pods is totally up to you, right? This is just the most basic mechanism of how a load balancer decides who should get traffic. And so, that's the general philosophy: try to make these mechanisms simple. And, again, eventually consistent, right? If you put a label on something, it won't instantly get traffic. It has to be discovered. And likewise, if you take the label off, it might still get traffic for a little bit before it stops. But that's good enough. In fact, it's the best you can do. [00:31:25] JMC: I guess another consequence of being exposed to distributed systems at this scale is something that you and I spoke about just before this conversation started, which is how interdependent, complex and transitive everything in software is. It has always been this way since the 50s. Software consumes other pieces of software, statically, dynamically. But it has just exploded exponentially in the last, I don't know, 20 years, since distributed systems have become, again, pervasive, powering the world and so forth. I mean, it sounds logical to me. But again, you were pioneering this in a way, with some others: that interconnected supply chain and dependency chain. I'd like to make this distinction, and I can't remember who to attribute it to, but the supply chain would be the connection between vendors and clients of software, where there's liability and a contract. What I'm trying to describe, what the person that coined this term meant by the dependency chain, would be the same chain but of open source software, in which such contracts do not exist, right? I consume, at my own responsibility, open source packages that I see out there in the world or managed by institutions and so forth. [00:34:48] JMC: And the risks associated with such a supply chain and dependency chain became evident with SolarWinds and others. You, before that, became really interested in this already, right? How did that come about, and what were your first thoughts on the risks of such an interconnected supply chain and dependency chain? [00:35:12] EB: Yeah, that's interesting history, too.
I would say, first, it was a consequence of Kubernetes, which then led to, by the way, the CNCF. Because we needed a place to put the Kubernetes trademark. So that's where the CNCF came from. That led to hundreds, now more than a thousand, projects that basically are all in this space. And that's actually, I think, exactly what you want. You want a thousand flowers blooming, so to speak, because that's a vibrant ecosystem. And some of those ideas will be terrible. But some will be fantastic. And we'll let people figure out which is which over time. But it does mean that Kubernetes is particularly complicated, right? It's got all the problems you have with a distributed system. But it's also got the problems of many plausible components, many of which are partially done or incomplete or incompatible. And so, working in that space is more complicated than it should be. Now, the good news is the core at least has gotten relatively stable, which we predicted would happen. But it took a while. And now the mess is kind of at the fringes. And the fringe just gets wider over time as more things get sorted out. It's a complicated space. [00:36:19] JMC: By the way, let me just interject. The CNCF events, KubeCon and CloudNativeCon, are increasing in popularity. That's a proxy for the success of Kubernetes, I think. The last data point was KubeCon CloudNativeCon in Amsterdam a month or so ago. It sold out, and there was a waiting list thousands long. But, yeah, everyone agrees that for the last few KubeCons, Kubernetes has become "boring" because it's stable, and the mess and the interesting bits might come from the ecosystem. Yeah, absolutely. [00:36:58] EB: That's good, actually. [00:36:58] JMC: Yes. [00:36:59] EB: Because the early days of Kubernetes were not that pleasant to work on, right? We had core stuff changing every quarter, which means every consumer had to do a lot of work just to stay current, right? The stability is a good thing. We used to worry about how do we get the mundane stuff done. And I'm not sure we ever solved that, other than just getting it right enough that it's stable and works well enough. But around 2018, I was looking at how successful, even then, Kubernetes was. And I started to worry about all the dependencies, which at that time were something like 1,200. I think we, for a while, got it down to 700. I don't know what it's at now. But there were more than a thousand dependencies in 2018 on all kinds of stuff that I don't think is all that trustworthy. Because, again, people just include whatever they want. It gets checked in. It's working. But if you're depending on stuff that's not trustworthy and doesn't have test cases, you're asking to be burnt by that stuff. [00:38:04] JMC: And that stuff might, again, have a transitive dependency on other stuff that you don't know anything about, and so on and so forth. [00:38:13] EB: Yeah. Most of those 1,200 dependencies are indirect dependencies. And so, I was starting to go through how do we sort this out? And we took a few steps, like we did some work to reduce the number of dependencies, at least to try to cull ones that didn't make any sense. But again, most packages on the internet have one maintainer. I'm sorry. Not most. It's kind of a plurality argument. 30% have one maintainer. That's the biggest single group. You're trusting that one person, who you may or may not have the real identity of. They could be a nation state. We have no idea. And so, it was fundamentally risky.
And I started thinking about what are we going to do about this problem. And so, I worked with the Linux Foundation and Microsoft to start the OpenSSF, to say the supply chain problem is going to be serious. We're at risk here. There had been tiny little attacks, mostly substitutions of bad packages, credit card stealing, crypto mining, stuff like that. But I could see the same techniques would work on a larger scale. And so, we started it right before the pandemic hit, which is a terrible time to start anything. So it didn't go all that well the first year. Because even the groups that wanted to participate became risk-averse when the pandemic started, and rightfully so. But then, of course, SolarWinds hit, and also some open source versions of those things, and Colonial Pipeline. And suddenly, people are like, "Oh, even in a pandemic, I still care about this stuff." And that kind of rebooted the OpenSSF with much broader support and funding. And that led, of course, to the Biden cybersecurity order. And we played a fairly big role in that, because we were the only voice that was kind of ready to talk about it, because we'd been thinking about it for several years. And so, now about half my time is spent on open source security. [00:40:11] JMC: Exactly. This leads actually to the last bit of the conversation that I wanted to have. You've been fundamental in the creation and spawning of the cloud native ecosystem through Kubernetes and the CNCF and all that we've just talked about, and, very briefly, of the OpenSSF, which is another thriving project. It's not the same as Kubernetes, because it's not a platform for anything. But it is a platform, if anything, for collaboration between big stakeholders, but also the individual contributor. And OpenSSF Day took place here yesterday, and it was a massive success. But, yeah, I can see how you would start to care not only about the software supply chain but, in general, about collaboration and open source at a high level, right? And for that, I don't know if you coined the term, but the way you're framing your interest in this huge, vast space is curation, right? [00:41:12] EB: It's one of the problems we have to solve. There are many. A lot of the problems are technical. They're like tooling. How do you sign things? How do you generate a software bill of materials, or SBOM? There's a bunch of technical problems, and we'll make progress on those. But the one I talked about at last year's Open Source Summit North America, which was in Austin, is where I introduced the term curation. And what I mean by it is it's a layer of accountability that goes between open source and the kind of top-level mandates that are coming toward software in general. For example, if we're going to use open source in our electrical grid or our telecom systems, which we do, it had better be trustworthy. How do we make it trustworthy? And we can think of that from the government view, not just the US government, all governments really: "Oh, I want to put some rules on this stuff so that it behaves well." But you look at how open source is delivered and it literally says as is. In some sense, you've talked about dependency graphs not having a contract. It's actually worse than that. It has a contract, and the contract says as is. You're on your own. Fix the vulnerabilities yourself, right? That's literally what the contract is today. That doesn't work. Those two modes, top-down mandates and bottom-up as-is, are incompatible.
And I don't want that to be an existential threat to open source, which I think, at least in some countries, is kind of on the table because they don't understand it. [00:42:38] JMC: Correct. That is especially true for the European Union, in my view, with the CRA drafting for now. [00:42:46] EB: In order to prevent some bad legislation, I think it's important to create mechanisms that solve the problem. And curation is really a generalization of an idea we've seen before, of which the classic example is Red Hat. If you buy Red Hat Linux, they will do security patching for the packages that are included in their distribution, right? Those are supported packages. And you actually pay them money for that. I would call those curated packages. Red Hat does a great job. The problem is they stop at the Linux distro layer. And there are all these things you get from PyPI or Maven or npm that are not supported that way and that are actually probably more risky in general, for lots of reasons, because they have fewer eyeballs on them. And those things need to be curated, too. [00:43:41] JMC: Yeah. [00:43:41] EB: How do we deliver software where someone who doesn't have to be the maintainer is willing to say, "I will offer to patch this stuff for money"? It's not a free service. And the money is there to pay for it, by the way. That's not a problem either, right? It's already clear that people are paying vendors to give them supported software. This is not a new thing. But the point is we need to make it easier for curators to interoperate with open source and with maintainers in a mutually productive way, right? Ideally, a curator, which could be the maintainer by the way, it's a role, so anyone could play this role, but the curator's role is accountability for some packages: promising to fix them quickly, understanding what their dependencies are, and being on the lookout for when the dependencies might be at risk. The maintainer's role in this, if they're not the curator, is really to work with the curator to accept patches upstream. Then the question is how to make that as easy as possible. My talk yesterday was really about the need for automation, for building and testing. If we had more of that, then it'd be easier for third parties even just to pay for tests and builds. Right now, even if they did some testing on their own, the maintainer would typically have to redo that testing before they understand if this is a good patch or not. And I'd like to take that work and that cost off of the maintainers, right? There's no reason maintainers should be paying for testing and builds for critical software. I'm hoping to create mechanisms where other people can pay for it, Google included. We already pay for fuzzing Kubernetes. [00:45:15] JMC: Fuzzing for Kubernetes, right? [00:45:16] EB: We fuzz all kinds of things. [00:45:17] JMC: Yeah. True. And you provide the machines for that to be free for the maintainers of each of the projects, right? [00:45:23] EB: Yeah. Now, we only fuzz for free what we consider critical open source. But it is already above a thousand packages. And fuzzing is a very expensive kind of thing. We don't publish the numbers on that, but it's a non-trivial cost. But I think that can be generalized. If you told us how to run your unit tests and integration tests, I think we'd certainly be willing to pay for them sometimes. But more importantly, we could make it such that other people who care could pay for it.
If you're a vendor for a government that wants some accountability, then you should be running those tests on a regular basis, right? And the maintainers should not be paying for those tests, right? That's the curator's problem. [00:46:05] JMC: We can finish with this speculation, because it was news, I think, this morning or yesterday, which you might not even be aware of. I mean, I'm sure you are. But I think Google announced, probably today, that they're launching a code-completion, code-suggestion AI-powered tool, for those of you familiar with Copilot or Tabnine or others. And whether that product or others, and this is again speculation, might reduce the cost of AI, apologies, of testing. If we apply AI to unit test creation, integration test creation and testing, that might actually reduce the cost for the independent maintainer to create a test suite that grants their software a bit more security and quality. I think so. [00:46:56] EB: Yeah, I am certainly hopeful that large language models will help us generate more test cases automatically, and that, combined with more automated ways to run those tests. Again, the state of the art is a readme file. A human can read it and figure it out. And maybe we'd have the LLM read the readme file and generate a VM that'll run the tests. That's actually a good thing to try. But the right answer is something more declarative. If you give me a declarative statement of how to build and test your package, then we can automate that. And by the way, it's not just Google. I think other clouds would be happy to run these services, too, for all the same reasons. [00:47:33] JMC: Yeah, true. And I wish, anyway, that curation becomes a thing and that these curators become institutionalized, in the sense that they become a reality, properly funded, and then we can see them sprouting everywhere. Because, especially, again, the dependency chain, the one that connects the users of open source software and maintainers, is the one that is more brittle and more exposed to these risks coming from top-down regulation that is not so aware of how this chain works and the humans and communities powering it underneath. Well, with that I think we can conclude. Did we miss anything that you wanted to touch upon? [00:48:20] EB: No. It's nice to be in Vancouver. And I do hope curation takes off. I feel in some sense it's inevitable, because we have to bridge this gap. The top-down mandates are not going to go away. The as-is nature of open source is not going to go away. And if we can navigate that smoothly, it will actually be a boon for open source, both in use but also, frankly, I think in sustainability and funding. [00:48:44] JMC: Agreed. Well, thanks for being on the show, Eric. [00:48:46] EB: My pleasure. Thank you. [END]