EPISODE 1699

[INTRO]

[0:00:00] ANNOUNCER: Large datasets require large computational resources to process. Increasingly, where you process that data geographically can be just as important as how you process it. Expanso provides job execution infrastructure that runs jobs where data resides, to help reduce latency and improve security and data governance. David Aronchick is the CEO of Expanso. He previously worked at Google on the Kubernetes team, which influenced his decision to start Expanso. David joins the show to talk about his company.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[EPISODE]

[0:01:21] LA: David, welcome to Software Engineering Daily.

[0:01:23] DA: Thank you so much. It's wonderful to be here.

[0:01:25] LA: Great. Thank you. So, if I were to describe Expanso, just in the simplest of terms, you provide an execution engine for large datasets. Is that a good, albeit simplistic, kind of description of what you do?

[0:01:39] DA: Yes. I think you're right on. A lot of it comes from my own background, right? I was the first non-founding PM for Kubernetes, and I was really inspired when I was at Google working on it by the way that you gave people, developers, and software engineers, and SREs, and so on, the ability to declare a job, and to roll out that job to a cluster. Before that, you had to think about where to place it, and how to do it, and those kinds of things. Kubernetes is amazing. But Kubernetes is built for a single data center only. Really, not just that, right? A single zone. If you want to start crossing zones or crossing regions, you now have to think about, "Well, do I piece this together? How do I do this in a globally friendly way, where there can be network partitions and things like that?"

The short answer of what we provide is something very similar to Kubernetes, but the other half of the equation. When you're not in a single zone. When you're not in a single cluster. When you're not on a single cloud. Or you're on-prem, or whatever it might be. You now need the same kind of declarative job description, but you need to be able to run it all over the place. You say, "Okay, here's my job. Instead of running it over these 10 machines that are sitting right here, I now want to run it across the entire world, or I want to run this job in Brazil, and this job in South Korea, and so on, and so forth." So, we give you that same kind of framework as a very nice complement to Kubernetes.

[0:03:10] LA: Let's talk about a specific example of a problem that a customer might be facing where this sort of technology would be useful. What's a classic use case?

[0:03:21] DA: Yes. It's really funny, because what do they say, you're doomed to repeat the same problems your entire life? That's very much what I signed up for. I remember in the early days of Kubernetes, I would go in and say, "All right. Hey, there's this great new way to run jobs. Here you go." People are like, "Great, I have a Kubernetes cluster. What do I do then?"
I was like, "I don't know. Do whatever you want. It's Kubernetes. Just give us a container and off you go." In a lot of ways, I'm in the exact same boat. There are so many applications for Bacalhau. I should explain. Bacalhau is the open source platform that anyone can use as an open-source project. The binaries are commercial, but the code is open source, and then Expanso so is the commercial company that backs it. If you hear me like bright bridge between Bacalhau and Expanso, I just wanted to set that. A classic scenario that we would describe is around log processing. Logs are great and there are so many great log processing solutions out there. Log management, application management, things like that. Splunk, Datadog, New Relic, so on and so forth. They're amazing. We're not trying to replace them. But when you look at actually using those platforms today, the most common thing is to take all your data everywhere and just jam it in a lake and that's okay. It's certainly efficient from a, well, I don't have to write a lot of code to just push this over into a lake. But boy, it's really inefficient in the like amount of data you're moving. More often than not, you're moving far more than you need, and you may be risking geographic or regulatory issues or things like that. If you're, for example, recording people's IP addresses, you're actually - according to GDPR, that is a personally identifiable information, piece of information. So, if you push IP address raw over the wire from a GDPR region to a non-GDPR region, congratulations, you're now potentially - I'm not a lawyer, but I would go talk to your own lawyers because you may be at risk. So, what would you like to do? Well, in the dreamworld, if you have machines all over the place, creating logs, you would do some work on those machines before you push the data, right? The log gets written to the local machine or to a local agent or something like that, and then you process it, and then you move it. Well, how do you go about doing that? Well, it turns out that that's harder than you would think. You can write like a binary. You could write a bash script. You can like have an application there. But wouldn't it be nice to use the same kind of declarative job structure for running transformation of that data at the point of log creation. So, we've talked to many customers and we have a classic customer example, where they were pushing all their data into a giant bucket, and then processing over that bucket in the central way. And just that processing was costing them $2 million a year to ingest by doing the initial processing, using containers and Bacalhau at the point of log collection. But then still using the exact same pipeline. It's basically just taking a few steps of the ETL process and moving it out to where the logs are being created, that became an instant way for them to save money, to reduce their security profile, and so on, and so forth. That's kind of a classic example. But you could think of this in a lot of different ways. Same analogy, if I can give you one more, we have some folks who are trying to get into streaming a video that is enormous, especially with these 4k cameras. If you push all this video into a central data center, that might be petabytes a day of total video. If you could instead, run an ML model, at the point of video collection nearby, maybe one GPU machine that's able to process that video at fast speed. But use a platform like Bacalhau to run that ML model on it and just look for objects. 
"Hey, is there a car in this? Is there a slip and fall? Is there a whatever?" And then only push clips when something is actually going on, it just saves an enormous amount of money and that's another proof of concept that we have going right now. These are some of the things. Basically, anywhere, you're moving data today, and you really would like to do some work on that data, because it's too raw before you move it, that's where we can help. [0:07:43] LA: So, the focus is much more on processing data than it is asynchronous jobs. Is that a fair statement? [0:07:51] DA: I think that that's where we're seeing the biggest uptake right now is on the processing data. Because there is so much data being created every day, the speed of data is growing at 45% faster than network growth is growing. So, as data gets bigger and bigger, and networks don't keep up, like congratulations, now you have basic things that are basically isolated. You can certainly run isolated jobs. You don't even need data to run any of these jobs. But I would say that that's probably not as critical for now. Again, we're happy to listen to customers and we have over a thousand people in our Slack channel, and they're always like weighing in on new and interesting ways that they're using it. But I think using data is probably one of the biggest things that we're seeing right now. [0:08:37] LA: Got it. Yes. So, running computation near where data is located for a variety of reasons. I want to go into those specific reasons that go in detail there. But running computation near data is really kind of the core of what you do, not necessarily locating computation that is available, but finding the optimal place to run computation for a given data set. [0:09:00] DA: Yes, exactly right. [0:09:02] LA: Okay. Let's talk about the advantages of this and why you want to do this. Why do you want to run the computation there near the data? I think there's basically three reasons. You've kind of alluded to all three of these, but tell me if there's more than these, but the three that I gleaned from what you've been saying are latency, so performance, security, and the third one is data governance. Which is similar to security, but really a very different thing, for very different reasons. [0:09:32] DA: I think you've certainly hit on them. I generally categorize it into three buckets as well. So, there's the carrot and the stick and then the really big stick. The carrot is latency and performance generally. So, for example, let's take that log processing example. Imagine you have a modest amount, maybe 20 nodes spread all over the world collecting web data, and something starts happening with one of your machines that starts doing a 500 error. How do you detect that? Well, a lot of ways that people do that today is they grab the logs and they push that into a central repository, and then have the rules over the central repository to trigger, "Hey, I'm seeing some 500 errors over here." Well, that is a costly thing, right? There is a latency involved, even if it's a modest amount of data. Think about it. You have to push that data over, you have to spin up an adjuster or have an adjuster there, he has to process it, then you have to do your log analysis. If you're lucky, that can be minutes. If you're unlucky, or you have a lot of this stuff, that could be hours, before you even recognize that this thing is going on. 
So then, they end up building these secondary systems for doing alerting, and checking endpoints, and things like that. You're like, "Well, man, that feels like the wrong way to go about observing what's going on." Imagine instead that you were able to do an initial pass of processing, or health checks, or remediation, or whatever, at the point where that data is being collected, before it even hits the logs. We can do that for you. You can just run your container and we can watch it for you and check the processes and check the logs and all that kind of stuff right there. You really do get a much faster, lower-latency ability to respond, to trigger, to remediate, to do whatever you'd like at that point. That's kind of the carrot. You're getting faster answers. You're getting a distributed platform for running things. And that gives you the ability to do a lot more things.

Now, let's get to some sticks. The first stick is just cost. I really can't overstate what this is costing people. Moving unnecessary data, ingress and egress fees, storage of unnecessary data, storage in very expensive buckets. It's not bad to store all your data, but you shouldn't be storing it in high-throughput storage mechanisms. You should be storing it in archive mechanisms, right? But you don't know, because only 1% of that data is actually meaningful or useful in some way, you don't know which 99% to archive. It all has to go into hot storage to be processed before you move it off. That can be really costly as well. Plus then, there's the computation involved in doing this, right? Your machines probably have spare capacity for doing lightweight log analysis and you're not doing it. You're only doing it on the secondary machine. So, there's a lot of cost savings there. I would say that's the big second one: cost.

Then the third, the really big stick, is the regulatory stick, where the data you're moving around is likely under some form of regulation. It could be PII, it could be GDPR, it could be - we're certainly talking with a lot of regulated industries where they never want to move regulated data, because of firewalls and things like that. But they want to make that data available for people to process over. How do you go about that? Well, you don't really want to hand out SSH keys or access to a database. Wouldn't it be nice to take an arbitrary container, review the container, review the job, say, "Hey, is this good or not? Ready to go?" and then run it over the data in place? Once the job is finished, the results just sit there on the same machine, and you as an organization can say, "I approve this to go out," or, "You know what, this is not approved, I'm going to go talk to this person." All that is built into our platform, and you get an audit log for everything that went on: on the inbound, on the running over the data, and then on what data was allowed to be downloaded. All with isolation between the data movement and the job.

So, that's the really big stick, and those are really the three big things that I see. To reiterate: it's much faster, a much more responsive understanding of what's going on; a significant reduction in costs; and then, obviously, actually obeying the laws and regulations that your industry requires.

[0:14:14] LA: Right. So, I wrote down a few questions that came out of that and I'd love to go back and talk about them.

[0:14:20] DA: Of course.

[0:14:21] LA: These aren't necessarily in a specific order.
But let's start with the statement you made about computation, and that is that you know what computation was done. So, part of what you provide is, not only are you processing the data, you're logging what data a particular task used, and what it's doing with that data. Can you talk a little bit more about what you mean by what it's doing with it, and how that is handled? What type of information do you get?

[0:14:49] DA: Well, to some degree, we're not going to crack open your container or your Wasm or whatever. But we do have a, you know, write-once record of everything that went through the system, right? So, you say, "Okay, I want you to run this container, or I want you to run this binary, or whatever. These are the parameters, and these are the inputs to that thing." That could be an arbitrary path on that local machine, or a share, or, we are integrated with things like IPFS, which provide full content-addressed IDs, which allow you to know, explicitly, down to the token, exactly what went into a dataset and an input. Now, to be clear, look, if you're doing something inside that container that we don't know about, and you enabled, whatever, egress or something like that, we're not going to track that, and that's ultimately on you. But that write-once audit log will say explicitly, "This is what we ran on this date, at these times, and this is where the outputs went." Again, I've been a Linux user for many years and I love it. But I will tell you, I don't know, unless I am very explicit with my cron logging, or my bash history, or whatever, exactly what ran when. So, I find value in this just as a job runner with a great audit log for me. I can go back and look at it.

[0:16:17] LA: Right. Yes. I imagine also, for security purposes, that could be quite useful as well. Not just governance, but security. Did a bad actor get in? What did they do while they were in? What did they access, et cetera?

[0:16:29] DA: Exactly.

[0:16:30] LA: Neat. I like that. That's great. That's actually a part of this I had not thought about that could be valuable. That's great.

[0:16:37] DA: It's funny. People ask me, "Are you competing with Kubernetes?" or whatever. Absolutely not. Kubernetes is amazing and it's a great cluster management tool. Same with, "Do you compete with Spark?" You could say the same for Elastic and Mongo and so on. Absolutely, they are partners, they are complements. Then they're like, "Okay, but you're doing these things that Kubernetes can't do." I was like, "That's true. You can actually do a lot of these things with Kubernetes." Kubernetes does offer a mechanism to look and see what all the jobs are, and things like that, and you can develop your own logging system and things like that around it. It just requires a fair bit of work. You can absolutely do that. But we give you this for free, because I saw this happen so many times, where people are like, "Hey, exactly what job ran when, and what were the inputs, and what were the outputs?" and so on. That's what we're designed to do.

I mean, I'll tell you God's honest truth. When I was starting this project, I had unlimited ambition. What I really wanted to do was help solve the reproducibility problem in science, because you see so often, scientists are doing their work as much as they possibly can, but it's very hard for them to articulate and track exactly what went on where.

[0:17:52] LA: Reproducibility is critical for examining many different types of research.
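As a concrete illustration of the write-once audit trail described above, here is a small, hypothetical Python sketch that appends one record per job run: the image or binary, its parameters, content-addressed input IDs, where the outputs went, and a digest of the record itself. The record shape and file name are assumptions made for illustration; they are not Expanso's actual schema.

```python
"""Minimal append-only audit log for job runs (illustrative sketch only)."""
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List

AUDIT_LOG = "audit.jsonl"   # hypothetical append-only file; one JSON record per line

@dataclass
class JobRecord:
    image: str               # container or binary that was run
    args: List[str]          # parameters passed to it
    input_cids: List[str]    # content-addressed inputs (e.g. IPFS CIDs)
    output_path: str         # where the results were written
    started_at: float = field(default_factory=time.time)

    def digest(self) -> str:
        # Deterministic hash of the record so it can be referenced later.
        body = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(body).hexdigest()

def append_record(record: JobRecord) -> str:
    entry = asdict(record)
    entry["record_digest"] = record.digest()
    # Append only; nothing in this path ever rewrites earlier lines.
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry["record_digest"]

if __name__ == "__main__":
    rec = JobRecord(
        image="ghcr.io/example/log-filter:1.2",        # hypothetical image
        args=["--min-status", "500"],
        input_cids=["bafy-example-placeholder-cid"],   # placeholder CID
        output_path="outputs/errors.jsonl",
    )
    print("recorded:", append_record(rec))
```

A log like this is what lets you answer "exactly what ran, when, with which inputs" after the fact, which is the property contrasted above with plain cron logs and bash history.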
[0:17:58] DA: In partnership with the Decentralized Science Foundation, with whom I'm doing some great work in rethinking how journals work, and so on, and so forth, I wanted to get to a point where you could have a single tag, a single content-addressed ID for a job that says explicitly, "Here's a hash of this job that ran," and you know cryptographically that this is exactly the job that went in, and these were the datasets that came out. Now, we're not quite there yet. But we're certainly on a path to get there. If you just use Bacalhau out of the box today, you're going to get far more reproducibility than you have with your standard bash scripts and so on.

[0:18:41] LA: This is great. I was initially thinking that the cost aspects, and perhaps the general data governance issues associated with "don't move data out of the country," all that sort of stuff, were the main motivators here. But there is a lot more to governance than just keeping data out of places it shouldn't be. It's also making sure that the processing you're doing with data is well-tracked and well-understood.

[0:19:06] DA: Absolutely. I really want to stress this. One of the things that I think Kubernetes got right, and again, I have to credit the community, I can't take much credit for this, is that they really understood, this is the box we sit in, right? We're not going to go up, we're not going to go down, we're not going to go left, we're not going to go right, we're just going to focus on this box. They focused on being an orchestrator. If you wanted to add a network, or add storage, or build a complicated workflow engine, or have your own VM system, or whatever, they're like, "Great. Go to it. Here are the interfaces that you're going to build around."

I will say that I'm taking a very similar philosophy with Bacalhau and Expanso. Our community is kind of leaning in as well. We're saying, "Look, you can give us whatever data engine you would like. You can call us from an external orchestrator. Or you can give us a container that we can run on your behalf. But we're not going to try and be smarter than you about the specifics of how you want to transform the data, or what kind of next-gen tools you use. 'Oh, there's a brand-new version of this model,' or whatever." We're going to try our best to make it very easy for you to run it in any place you want. Even if that is far away, and you maybe don't know where the data is, or whatever. You could just tell us, "Hey, go find this hash somewhere on my network and then go run it against that."

But like I said, we're going to really try and focus on being that reliable job executor engine, but one, just like Kubernetes, that focuses on doing what we do well, which is dealing with all of the network and other challenges that people have, that a lot of people just don't pay attention to. I joke about this with my hyperscale friends. I was lucky enough to work with all three hyperscale clouds. I will tell you, when I talk to them, I say, don't think of this as just an edge play, or a cross-cloud play. This is just, "Oh, I have two zones that I want to deploy to." Zones, here's the secret, they disconnect. They partition all the time. It's not a bad thing for a zone to partition. It just happens. That's what happens with networking. If you want to build a resilient system, you have to figure out a way for things to queue up on either side of that partition, and then resolve once that partition resolves.
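The "queue up on either side of the partition, then resolve" idea can be sketched in a few lines. This hypothetical Python snippet spools results to local disk first and only deletes them after the far side acknowledges receipt, so a partition simply means the spool grows until connectivity returns. The endpoint, file layout, and retry interval are illustrative assumptions, not how Bacalhau implements it.

```python
"""Toy store-and-forward spool: survive partitions by buffering locally."""
import json
import os
import time
import urllib.request

SPOOL_DIR = "spool"                                # local durable buffer
CENTRAL_URL = "https://collector.example/ingest"   # hypothetical endpoint

def enqueue(result: dict) -> str:
    """Write the result locally first; this succeeds whether or not we are partitioned."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{time.time_ns()}.json")
    with open(path, "w") as f:
        json.dump(result, f)
    return path

def try_send(path: str) -> bool:
    """Attempt delivery; only delete the spooled file after a 2xx response."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(CENTRAL_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            if 200 <= resp.status < 300:
                os.remove(path)
                return True
    except OSError:
        pass          # partitioned or central side down; keep the file for later
    return False

def flush_forever(interval_s: float = 30.0) -> None:
    """Periodically retry everything still sitting in the spool."""
    while True:
        if os.path.isdir(SPOOL_DIR):
            for name in sorted(os.listdir(SPOOL_DIR)):
                try_send(os.path.join(SPOOL_DIR, name))
        time.sleep(interval_s)
```

The same shape works in the other direction for job submissions: accept and buffer locally, then reconcile when the link comes back, which is the availability-plus-partition-tolerance trade-off David goes on to describe.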
That, it turns out, is a really hard problem, and we want to solve for that. Even within a single hyperscale cloud, we think we can provide a lot of value.

[0:21:44] LA: Great. So, there are some reliability aspects, as well as security, data governance, and some ease in the development of a distributed system in general. That actually kind of leads me to, what are some of the downsides of this approach? One of the obvious downsides is it's easier to process data when it's centralized. That's a very generic statement. I get that.

[0:22:07] DA: No. You're absolutely right.

[0:22:09] LA: But there's some value in a statement like that. The fact that you're distributing this work means you're distributing agents, agents that have to be upgraded, have to be maintained. For a working production system, there's a lot of maintenance that has to go on when you now have a hundred nodes, a hundred different areas where you're processing data, versus one area. What types of capabilities do you provide to help manage that scenario?

[0:22:35] DA: I think that's a wonderful point. You don't get all this for free. I like to joke, and again, I'm going to steal this from a friend who mentioned it to me. He's like, there's this thing in software engineering that I'm sure your listeners know about called the CAP theorem, which means you can only pick two: consistency, availability, and tolerance of network partitions. Like I said, you only really get to pick two. Most clustered systems today focus on the C and the A, right? "We're going to be strongly consistent and we're going to support availability." We focus on the opposite, which is we're going to support availability, and we're going to support partitioning, meaning your system could be out of consistency for a period of time. That necessarily means we are going to be slower to schedule jobs than your centralized system would be. So, we are saying, "Look, if you need really high throughput or whatever, we recommend another system." You can figure out ways to go around this. But that's one of the things that we've really taken on.

One of the things that we have done to make this very easy, or as easy as we can, is our agent, our binary, is very small. It doesn't require much and it is trivial to set up. It really is one command. You type bacalhau serve, and that will create the core node. If you then add a flag and point it at the core node, it will bootstrap itself into existence. That's it. You've created a network.

Now, I want to be very clear. Remember what I said earlier about staying in our box? We expect you to have a network where the nodes can see each other. It does not need to be a public network. But the nodes do need to be able to see each other, because otherwise they wouldn't know how to talk, so we make that assumption. We've had some folks in the industry, or excuse me, in our ecosystem, talk about, "Oh, what about bundling in something like WireGuard, or Tailscale, or something along those lines?" We've thought about that. We're not opposed to that either, as potentially an add-on. But the core assumption is going to be that.

Because it's so easy to set up, upgrading really is one command too. You just download the new binary, replace the old one, and off you go. It reuses the same state store, the same everything. Once it joins the network, it achieves consensus on its own. There are no special magic things.

[0:24:58] LA: Cool. Okay. So, it's easy to set up, which means it's easy to maintain. It's kind of the theory there.
That's the theory. Where does that fall apart is what I'm trying to think through.

[0:25:11] DA: Well, the truth of where it falls apart is distributed systems are hard. Distributed infrastructure is hard. Where I've seen people fall down the most is around that networking component. "Oh, wait, I thought that this thing had access to that port," and so on and so forth. We don't require a lot of ports and so on. But that is certainly something that's an issue. I should also stress, we have flexibility for a number of what we call executors. We support Docker, we support Wasm, we support arbitrary binaries. And keeping those things up to date is out of scope for us. For example, let's say you had a container, and it required the latest version of a binary, or the latest version of Docker in order to run. That is a second upgrade that you would have to provide in order to get that thing up and running. So, thinking through what those resources look like, thinking through how to join those resources together and keep them up to date, that's still on you. We are a component of an overall solution and we want to make it easy to do that. But if it were easy, everyone would be doing it.

I should also say one other thing that you mentioned, so I don't miss out on it. The way to think about our platform, by default, is everything is embarrassingly parallel, meaning no internode communication for the jobs themselves. That means, if I'm going to take a job and I want to run it across a hundred nodes, each of those nodes needs to be able to act as though it's the only one running on that component of the job. Meaning, you, before you issue the jobs, need to shard the job into, in this particular case, a hundred independent jobs. I like to joke, we handle the map, you handle the reduce. You're going to take it, you're going to split up the job into a hundred shards, and run all those hundred shards everywhere. Then, you get the results back and you can decide what to do from there. You can get around that, you can do things out of band and things like that. But I do strongly recommend thinking through your platforms and your overall pipelines to support a platform like this, because you'll just find that it's far more resilient. So, that often means some costly rethinking about architecture, but most stuff doesn't require rewriting at all.

LA: So, you mentioned the typical way to run this is you execute the binary to run a server, and you execute a binary pointed at the server for the agents. You just set this up. Now, is it safe to say, or am I making an invalid assumption here, that in the normal use case that executable would probably be running in its own container, and that in turn is running in a Kubernetes cluster in the region, or in the availability zone, or whatever zone you want to run the job in? You already assume you have a Kubernetes cluster set up in that zone? Is that correct?

[0:28:07] DA: No, not necessarily. Don't get me wrong. We have folks using that as a way to distribute Kubernetes jobs. Again, it's a very common scenario. A lot of folks in retail and so on have small Kubernetes clusters spread all over the place. They're like, "Hey, I'd love to issue a job to just this cluster," or issue a job to all of these clusters. You could do that. You can basically put kubectl into a container, then say, "Okay, here you go. Go run this in all these places," and off you go.
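A quick sketch may help make the "we handle the map, you handle the reduce" pattern described above concrete: the caller splits one logical job into independent shards, submits each shard as its own job, and combines the results itself. The submit_job function below is a placeholder standing in for whatever submission client you actually use; the names and job shape are assumptions for illustration, not Bacalhau's client API.

```python
"""Illustrative client-side sharding: one logical job -> N independent jobs."""
from typing import Callable, Dict, List

def make_shards(object_keys: List[str], shard_count: int) -> List[List[str]]:
    """Deterministically split the input keys into shard_count buckets."""
    buckets: List[List[str]] = [[] for _ in range(shard_count)]
    for i, key in enumerate(object_keys):
        buckets[i % shard_count].append(key)
    return [b for b in buckets if b]

def submit_job(image: str, inputs: List[str]) -> Dict:
    """Placeholder for a real job-submission client; here it just pretends
    each shard produced a count of error lines."""
    return {"image": image, "inputs": inputs, "error_lines": len(inputs) * 3}

def run_sharded(image: str, object_keys: List[str], shard_count: int,
                reduce_fn: Callable[[List[Dict]], Dict]) -> Dict:
    # The platform runs each shard wherever it likes ("the map");
    # combining the per-shard outputs is up to the caller ("the reduce").
    results = [submit_job(image, shard)
               for shard in make_shards(object_keys, shard_count)]
    return reduce_fn(results)

if __name__ == "__main__":
    keys = [f"logs/2024-05-{d:02d}.gz" for d in range(1, 31)]
    total = run_sharded(
        image="ghcr.io/example/log-filter:1.2",   # hypothetical image
        object_keys=keys,
        shard_count=10,
        reduce_fn=lambda rs: {"error_lines": sum(r["error_lines"] for r in rs)},
    )
    print(total)
```

In practice, each shard would be handed to the scheduler and the results collected asynchronously, but the division of labor stays the same: the caller decides how to shard and how to combine.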
But the most common thing is actually just to use Docker on a VM. So, you hand that container to Docker, or whatever container runner you'd like. Then, on that VM itself, it'll run. It could be a long-running job, and we'll make sure it stays up and running, and if it crashes, we'll restart it, and so on, and so forth. Or it could be a batch job. So, on that VM, or on whatever VM you point it to, that agent will start that job and run it. That's the most common way to do it. You don't even need a cluster.

[0:29:10] LA: Okay. So, even though you're tied in closely with Kubernetes, Kubernetes is not a requirement to use your product in any way, shape, or form.

[0:29:16] DA: Sorry. Yes. I should restate. We are not tied to Kubernetes at all. We've just very much inherited lots of the thinking around deployment of jobs, structures of jobs, level-based job scheduling, and so on.

[0:29:30] LA: Okay. Let's take Kubernetes out of the picture then. But stay in the cloud for a moment. Where do things like serverless computing, like Lambda, fit into your use cases? Are you enabled for those sorts of environments too? Or is that not quite where you are yet?

[0:29:45] DA: So, we could absolutely do that, and we've had folks who have already been experimenting with those kinds of things. They'll have a large cluster, or they'll have spare machines, or whatever. They're like, "Oh, you know what? I'm just going to put this on at low priority and just make it a generalized compute cluster." So, you can absolutely do that. Actually, we have an experimental feature right now, which just landed in 1.3, which means you don't even have to containerize. You can just hand us a raw Python script and we'll run that for you. We'd love for people to come in and try it out and give us feedback on whether it works.

Now, that said, I like to credit all the hyperscale cloud folks, whether it's Lambda, Cloud Run, Cloud Functions, Azure Functions, and so on. The essence of what makes those things magical is I really don't have to think about infrastructure all the way down. Just magically, this thing is provisioned. At the end of the day, a Bacalhau cluster does need to be sitting on a VM, or on a compute target of some kind. So, it's not quite as magical as that. On the other hand, it's also not external to your organization. If you have a VPC and you have some VMs, and they're up and running, and since you're already paying for the hour, you'd like to reuse some of that compute, we're a great way to do that. We run the job in this very isolated way. So, you can easily issue Lambda-style commands against that. Again, the platform is super flexible, and I've really been excited to see how people are using it, and this is a perfect example of an area that I'd love for folks to explore more.

[0:31:25] LA: Cool, cool. The reverse of Lambda, where you're essentially building a Lambda infrastructure similar to what AWS has, but using resources you've already got available. That's an extremely intriguing option, because one of the problems people have in general with AWS Lambda, that particular mechanism, is the vendor lock-in and the pricing that's associated with it, and all that sort of stuff. But the philosophy of how you build applications as functions that run independently, and work independently, and scale independently, is very intriguing.
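As a small illustration of the "hand us a raw Python script" idea mentioned above, a task written for that kind of executor generally has to be self-contained: read its input from an argument or a well-known path, write its result to stdout or an output directory, and exit. The paths and argument names below are made up for illustration.

```python
"""Self-contained task of the kind you might hand to a remote script executor."""
import argparse
import json
import sys
from pathlib import Path

def main() -> int:
    parser = argparse.ArgumentParser(description="Count status codes in a log file")
    parser.add_argument("--input", default="/inputs/access.log",
                        help="hypothetical mounted input path")
    args = parser.parse_args()

    counts: dict = {}
    path = Path(args.input)
    if not path.exists():
        print(f"input not found: {path}", file=sys.stderr)
        return 1

    for line in path.read_text().splitlines():
        parts = line.split()
        # In combined log format, the status code is the ninth whitespace field.
        if len(parts) >= 9 and parts[8].isdigit():
            counts[parts[8]] = counts.get(parts[8], 0) + 1

    # Results go to stdout so the job runner can capture them as the output.
    json.dump(counts, sys.stdout, indent=2)
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```

Because everything the task needs is in the script itself, the same file can be run locally with plain python or handed to a remote runner unchanged.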
You're essentially providing that mechanism in a generic way that is not tied to AWS infrastructure, for instance, but could run literally anywhere, to provide essentially the same experience.

[0:32:12] DA: Absolutely. Again, as someone who's been there, who has talked about many, many, many serverless stacks and things like that, I don't want to understate the amount of work it takes to build a completely rich serverless stack. But that said, if you have very lightweight tasks, and you already have compute infrastructure out there and running, we can be a great solution for you.

[0:32:36] LA: Oh, that's cool. That's cool. Before we change topics too much, I want to talk a little bit about the performance, or the latency and cost aspect, a little bit. One of the things that we've been hearing recently, especially with the recent AWS announcement, and other cloud providers have been following suit, is the change in egress and ingress fees, specifically egress fees. I've been involved with companies that have had huge costs associated with egress. When I worked at New Relic, every one of our customers that was pulling data, analytic data, out was paying for that data, and it was a huge cost. Now, that's been improving because of recent announcements. How have those announcements impacted what you're able to provide to your customers? I imagine it's a help, but I want to figure out how it's a help, and if I'm missing something from a hurt standpoint.

[0:33:34] DA: No. I mean, there's no question. Anytime you can reduce costs for customers, they have a very open ear. Because the amount of stuff that you're moving is reduced, you're going to save money there. But I will stress that egress is not remotely the top thing where people are like, "Oh, this is the reason why I'm adopting it." It's just kind of a cherry on top. The truth is, the biggest cost savings and the biggest benefits are, one, like I said, being able to be much more responsive. I don't care if egress costs go to zero. The cost you're paying is in the time that it takes to move this stuff, and the speed of light isn't getting any faster. So, there's no amount of discounting that is going to make it faster to move the data from point A to point B in a reliable way. Being able to process over that data in place is just a huge winner no matter what.

Then, on top of that, there's the cost in time for ingestion, right? Again, the fact that you have to take this thing and move it to this new place, and you spin up machines, or have machines running, that then process that data to do aggregation, or standardization, or whatever it might be, is also super costly. That's a big cost savings as well. Again, I'm not saying that the egress thing isn't a winner for us. It certainly helps. And it means that now you can have clusters that span multiple clouds and share data between them in a real way, because we can do that for you. But at the same time, we're a winner, like I said, even if the cost of egress goes to zero.

[0:35:18] LA: Cool, cool. Great. So, we're running a little bit late on time, but I don't want to miss out on this last general category, and it might be a couple of questions here. That's about the difference between the open-source versus commercial offerings and how they work together. You have an open-source version, and I know you've been going back and forth between the two in your discussion.
You have an open-source version of the product that's readily available, but you also have a commercial offering that has capabilities on top of it. It's a purchased service. Can you tell me what the differences are between the open-source and the commercial version?

[0:35:53] DA: So, we very much borrow from the Red Hat model, and I was deeply inspired by the excellent Adam Jacob and System Initiative. Basically, the way to think about it is our source code is open source. It is yours to do with what you like. The trademark is not open source, and the binaries that we build, which include an SBOM and other things like that, are not open source either. So, if you want to take our source code, and you want to go and compile it yourself, and run it on your own infrastructure, knock yourself out. All the power to you. If you want to use our binaries, even the first one, you need to form a commercial relationship. We have a very generous trial tier and so on. But that's where it is.

Now, that said, we are planning on offering a number of hosted services that are incremental to the commercial binaries. But that's basically how we see the world, and we're big fans of that. At the end of the day, you're paying for our expertise in doing vendor management and doing security verification and all that good stuff, and standing behind the product with support and other things like that. That model, with the code being open source but the binaries being commercial, feels very, very natural to us.

[0:37:09] LA: Yes. That makes sense. But I think the other thing I read in what you're saying is you want to go to a model where you're providing implementations, SaaS services, or IaaS services, whatever you want to call them, of your offering. But right now, it is still customer-hosted. Here's our binary, run it on your own infrastructure. Is that correct?

[0:37:29] DA: Exactly. We will be offering that. There's no question about that. We're going down that road. But one of the reasons that we had to choose a model like this is because we just listened to our customers. At the end of the day, the customers we have, the US Navy, Lockheed Martin, and commercial customers all around the world, care about security and isolation and things like that. They're never going to run a SaaS solution or an IaaS solution, because they have so many requirements around their infrastructure. They need to be able to run this in an entirely detached, air-gapped, isolated way, whatever it might be. So, we needed a licensing model that would work for them. That's why we chose this particular one.

[0:38:13] LA: So, even air-gapped, you don't call back to license servers and things like that. None of that is required. It's completely standalone, isolated on your own network. The nodes talk to each other, but you don't have to talk to anything else.

[0:38:25] DA: Exactly, right.

[0:38:26] LA: Great. Well, thank you very much. This has been a great conversation. I very much appreciate your time here. My guest today has been David Aronchick, who's the CEO of Expanso. David, thank you for joining me today on Software Engineering Daily.

[0:38:41] DA: Thank you so much.

[END]