EPISODE 1741 [INTRODUCTION] [00:00:00] ANNOUNCER: Data is at the center of many business decisions and advances today, including AI-driven capabilities. This requires companies to have well-governed data that is easy for users to find, use and understand. In moving to the cloud, Capital One modernized its data ecosystem and adopted a "You Build, Your Data" model to equip its data stakeholders with self-service capabilities to use and build data applications. Jim Lebonitte is a senior distinguished engineer at Capital One leading technical architecture and strategy for enterprise data platforms. He has over 15 years of experience building platforms focused on data and software delivery experiences. Jim joins the podcast to talk about how to empower data users at scale while keeping data well-governed, building data pipelines and applications, and much more. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His best-selling book Architecting for Scale is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com. And see all his content at leeatchison.com. [INTERVIEW] [00:01:37] LA: Jim, welcome to Software Engineering Daily. [00:01:39] JL: Hey, Lee. Thank you. Happy to be here. [00:01:42] LA: Before we get into the background or anything like that, tell me what, specifically, You Build, Your Data is. [00:01:48] JL: Sure. You Build, Your Data - and we also say you build it, you own it here at Capital One - is really kind of a model that we've come up with to enable the velocity of our teams to deliver as fast as possible.
But, also, being in a regulated environment, we have to be very careful about a lot of things. We've kind of made that work at Capital One while doing all that. But this all started back when we first migrated to the cloud. We were one of the first banks fully on cloud. And the only way we would have gotten there was taking this mantra. And it really empowered our engineers to rapidly move to the cloud while also keeping the right guardrails from a regulatory perspective. [00:02:31] LA: Jim, when did Capital One first move into the cloud? [00:02:37] JL: Yeah. Capital One completed its move to the cloud in 2020, when we fully exited our data center footprint. [00:02:44] LA: Got it. Thank you. You actually did this move to this model simultaneously with your move to the cloud? Or was it done separately and afterwards, when you realized that you needed a change? [00:02:57] JL: We started this with cloud, because we knew that we wanted to move fast. And if we had tried to have a big central team do everything, we knew it would have just taken forever. And we never would have gotten fully on cloud and exited the data center. That's kind of where this started for Capital One. And then with data, especially at Capital One - I mean, we are a very data-driven and data-hungry business. Once we got to cloud, we also started with a central data team, kind of the way you mentioned at the beginning of this. And a lot of companies do it that way. But we saw very quickly that it was just moving too slow. We went ahead and federated a lot of the data processes back to the lines of business to enable them to build data applications, data pipelines and all those types of things, while still maintaining a lot of the controls and regulatory things that are required for a big bank. [00:03:48] LA: Right. When you moved to the cloud, was your application suited to the cloud very well back then? Or did you have to do a lot of refactoring?
And what I mean basically is a lot of large, older companies end up with a lot of large monoliths. Those large monoliths are very hard to move to the cloud efficiently. And certainly, hard to move to a distributed ownership model very efficiently. Were you in a large monolith? And did you change that architecture? Or what did you do to change into the localized ownership models? [00:04:22] JL: Yeah. One of the first things we did was really adopt a microservices-based architecture and domain-driven design type of stuff to really enable that federation. And the other big part of that too is taking a big API-first-based approach. We were heavy adopters of REST APIs and building microservices for specific domains that then enable teams to move pretty fast and not be super prescriptive about what type of tech stack you're using. Are you using a specific database? That type of thing. We left a lot of those types of decisions up to the teams. But, again, we were able to keep controls through a few key concepts and governance things that really translated over to the data world that we're using at Capital One as well now. [00:05:03] LA: Yeah. That makes sense. So how did you subdivide the monoliths into services? What I'm trying to get to is, what size of services do you have? Do you tend to have half a dozen services that make up your application, and they're all very large services? Or do you have, on the other extreme, 100,000 very tiny, subdivided services? And where are you in that spectrum? [00:05:27] JL: Kind of in the middle. We have a lot of different services. I mean, when you go to our websites and stuff, there's all kinds of things happening behind the scenes. I mean, there's ML models. There's all kinds of stuff happening. We've got quite a few.
But one of the keys that enabled that to work was, in order for those services to talk to each other, we came up with standards and we forced people, for example, to register API contracts with each other. We did that part centrally, right? While they could use whatever tech stack they wanted - they could use a serverless function, they could use a Docker container - once you wanted to talk across those services, we had contracts in place and governance processes in place that would not allow you to bypass that. [00:06:07] LA: Got it. You required formalized contracts between services, and you required those to follow specific guidelines for what they had to contain and things like that. Can you talk about what some of those guidelines are? I want to go a little bit deeper in this area because I find a lot of companies that break into service architectures tend to not think about that. And so, what they end up with is a thousand services with a thousand undocumented APIs. And it's a huge mess. And so, can you talk a little bit more about how you regulated that? What types of processes did you use to make sure that that didn't happen? [00:06:46] JL: Yeah. And this whole API path I'm about to tell you translates directly into the data world as well. We followed very similar processes. But for the API path, we have a kind of central catalog and what we call a design-time type of experience where, before you deploy that service, you have to go register your API contract. And we won't allow the services to talk to each other unless you register that. And then that endpoint is deployed to some of our central services. Fundamentally, that's where a lot of this whole pattern - how it works at Capital One - comes from. The things that we want to make sure are done in a standardized way, we build tooling that allows teams to do in a self-service fashion. We don't have a team that goes and registers them all.
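The design-time registration gate Jim describes here could be sketched very roughly as follows. All names and the contract shape are hypothetical illustrations, not Capital One's actual tooling; the point is only the mechanic: teams self-register contracts in a central catalog, and cross-service calls are refused until both sides have registered.

```python
# Illustrative sketch only: hypothetical names, not Capital One's real platform.
# Models a central "design-time" catalog that gates cross-service traffic on
# registered API contracts.

class ContractCatalog:
    """Central registry of API contracts, keyed by service name."""

    def __init__(self):
        self._contracts = {}

    def register(self, service, contract):
        # Self-service: the owning team registers its own contract,
        # but the catalog enforces a minimal standard centrally.
        if "endpoints" not in contract:
            raise ValueError("contract must declare its endpoints")
        self._contracts[service] = contract

    def allow_call(self, caller, callee):
        # Governance gate: traffic is only permitted between services
        # that have both registered contracts.
        return caller in self._contracts and callee in self._contracts


catalog = ContractCatalog()
catalog.register("payments", {"endpoints": ["/charge"]})
print(catalog.allow_call("payments", "ledger"))   # False: ledger not registered yet
catalog.register("ledger", {"endpoints": ["/entries"]})
print(catalog.allow_call("payments", "ledger"))   # True
```

In a real system the catalog would validate a full contract (for example an OpenAPI document) and the gate would be enforced by deployment automation or a gateway, but the register-before-you-talk flow is the same.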
We built tooling and automation that said, "Hey, these teams can go register those contracts." And then it goes and deploys it for them and allows them to start communicating. [00:07:39] LA: Got it. And you extended that to the data model as well once you started federating data. Is that correct? [00:07:46] JL: Exactly. And with data, like you said, we started with a central model and a big central team trying to make a big single data model for Capital One. And we quickly saw that the velocity just wasn't fast enough for the business at all. We really took a very similar approach to the API world and said, "Look, we're going to make a catalog type of place where you can register your data sets and do all those things." And producers could do that by themselves. And then if you wanted access to that data, consumers could use that same platform to go access it. But, again, the stack for how you build a data pipeline is pretty diverse, which enables teams to move pretty fast. [00:08:27] LA: Let's talk about what the contract for a data interface would look like. And I imagine it's still a REST API of some sort, right? You don't share database connections, and tables, and things like that across services. You have formal APIs for talking. Is that still a fair statement? [00:08:46] JL: Correct. And in the data world, we actually formalize tables as well. There's an API endpoint for real-time streaming. And then there's also table contracts if you're building data pipelines in a data lake. [00:08:57] LA: Okay. Got it. Yeah. For data lake purposes. But let me take a step back and see if I can describe what I'm trying to get to here. If I'm building a service, that service requires some data. I own that data. I'm responsible for that data. Yes, I store it into a data warehouse or other archival mechanisms for other purposes later. But while I'm actively using that data, that data is in my service. But other people need that same data.
Do they go through the service to get to the data? Or do they have access to the data directly? [00:09:26] JL: No. They go through the service. [00:09:28] LA: Got it. There's an API contract that's required to the service. And it's a contract that ends up returning data, or updating data, or processing data. Whatever needs to happen. But it's still an API-level contract. You're merging them at that level. You're not separating the data out and treating it separately, even though it's distributed; you're treating data as part of the service that owns it. [00:09:53] JL: Correct. [00:09:54] LA: Cool. I certainly have found that model tends to work the best. One of the mistakes I think some people make when they move to a service-oriented architecture, when they don't think the data through all the way, is they'll do things like, "Well, I'll do multiple services. But I'll have them share the same database." They'll use different tables. They won't use the same tables most of the time. Okay, maybe sometimes. And, "Okay. Well, if I need this piece of data, I'll just go get it. It's right over there. It's easy to do." You see those sorts of mistakes happening, where someone will borrow an interface to another person's data because it tends to be well-defined and pretty static in a real-world situation. You don't tend to change your data contract schemas very often, compared to other things. But you tend to siphon off the data, if you will, and access to the data, away from the owner of the data. And so, therefore, there's access to that data that's outside of the owner's responsibility. And that can cause problems. They don't know what's happening. They don't know when peak usage time is. All those sorts of things. What are your thoughts on companies who try that model first? [00:11:04] JL: Yeah. I mean, not adopting a, let's just say, distributed architecture on public cloud.
I mean, I think the first thing that comes to mind is it can get very expensive, right? If you have these big databases that just house a huge amount of data in, like, one instance, those types of things, that's just not really, in my opinion, how cloud's meant to be used. It's meant to be used as smaller chunks of things, with a little bit more flexibility for teams to get to the NFRs and stuff that they need. For a lot of use cases now, I might want a Dynamo database because I want to look up via key values. Or I might use DocumentDB for something else. And that to me is one of the key benefits of the cloud. If you want to make a central database that has a whole bunch of tables that's shared, you might be better off staying on-prem. There's a model for that. But to me, that's not kind of where you want to be in the cloud. [00:11:54] LA: Not conducive to the cloud. Yeah. I teach a course to interns just starting in the industry about the cloud. And one of the things I teach them is how to think about the cloud. It works best with lots of pebbles. Not with large boulders. And that's the same thing as what you're trying to say: the more finely you can divide something, the more efficiently you can run it on the cloud, basically. In simplified form, basically. [00:12:18] JL: That's a great way to say it actually. [00:12:20] LA: Yeah. I used the jar filled with big rocks versus filled with sand and how the space is shared. It's a good example anyway. One of the things that comes to mind with all of this, too - and this is an area I see a lot of companies underinvesting in as well - is they'll move to this model and they'll move to a service-based architecture. Even doing data the right way, etc. But they don't focus on standardized tooling. And the result is that things don't work quite right. Or there won't be consistent best practices. And people won't share how they do things. Not to say they don't want to share.
But they don't know what other people are needing, and what the right things to share are, etc. And you end up with a lot of repeated code. Repeated things, anywhere from CI/CD pipelines all the way to what cloud you're running in and what technologies you're using. Obviously, there's a balance between requiring specific structure for how things are done and letting teams do whatever they want to - whatever makes sense for them. And there are extremes on either side of that argument. And there's a right way to find the middle ground. But how do you ensure that you have the proper level of consistency, and tooling, and things like that within and across your teams? [00:13:40] JL: That's a great question. And I personally am a big fan of autonomy for engineers. I cringe a little bit when we try to restrict engineers from doing what they need to do. But at the same time, it is super important to be thoughtful about where you're allowing that and where you're not. We definitely pick certain things as part of our stack to standardize and/or even centralize, where we know that we don't want to allow autonomy or different choices. Because it's just going to become a challenge from an interoperability perspective. Or, in Capital One's case, definitely regulatory. We definitely have a lot of rules we have to follow in that respect, right? And so, for that, again, our catalog approach really helps us there. Because we can abstract a lot behind those registrations and stuff. And I think the other key is to automate stuff behind that. You don't want to just abstract and not still provide automation to let people do what they want to do. And then for other things, maybe we let a little bit more flexibility happen. That's kind of how we think about it. But there's a lot of things where we say, "Look, we can't allow you to pick differently." Because we have 12,000 technologists who work here, and we have to build cohesive systems. [00:14:52] LA: Shared tooling has to be part of the strategy.
And so, how much do you invest in shared tooling to make that happen? [00:14:59] JL: There's a few aspects of it. We have a lot of central tooling that is shared across all of our data pipelines for stuff from a data management perspective, like lineage collection, catalog registration, interfaces to things, the API gateway that I talked about, or our lake interfaces, stuff like that. On the other side, though, we have invested a large amount into our CI/CD processes to standardize that. And within there, we have a lot of standard patterns that provide guardrails around making sure you don't deploy something bad to the cloud, but also help developers get started quickly. I'd say we have a blend between tooling that developers like to use - we've released SDKs that have a lot of standard Capital One features that help them save time and provide governance - and then some things that we've just centralized, where we said, "Look, we can't afford to let this - " [00:15:55] LA: I imagine things like the CI/CD pipeline itself are something that's centralized. You have one standard deployment mechanism and things like that. [00:16:04] JL: Yes. [00:16:04] LA: But tooling for creating your APIs probably is tooling that's available. And as you say, much of it is required. But I imagine there are also parts of it that are optional but valuable to use. And so, whether you use them or not is up to the individual teams. Can you talk about some examples that fit into each of those categories? [00:16:24] JL: Yeah. We have some things - for APIs, we have automation that allows people to deploy Fargate-based things or Lambda-based things. And you would pick between those for the NFRs and whatever your use case needed. But we do - for example, there's a limit on those choices for APIs. It's not like there's 20 of them. There's five or six of them. There's enough choice. But at the same time, we do provide standards that don't let it get too out of control.
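The "limited menu" idea Jim describes, picking between a small set of approved deployment patterns based on NFRs, could be sketched like this. The pattern names, capability fields, and limits are illustrative assumptions (the 900-second cap mirrors AWS Lambda's documented maximum timeout), not Capital One's real pattern catalog.

```python
# Hypothetical sketch: a small, governed menu of API deployment patterns.
# Teams state their NFRs and get the first approved pattern that fits.

# Most-constrained pattern listed first, so it is preferred when it fits.
APPROVED_PATTERNS = [
    ("lambda",  {"long_running": False, "max_timeout_s": 900}),
    ("fargate", {"long_running": True,  "max_timeout_s": None}),
]

def choose_pattern(needs_long_running, timeout_s):
    """Return the first approved pattern that satisfies the stated NFRs."""
    for name, caps in APPROVED_PATTERNS:
        if needs_long_running and not caps["long_running"]:
            continue  # pattern can't host long-running workloads
        if caps["max_timeout_s"] is not None and timeout_s > caps["max_timeout_s"]:
            continue  # request exceeds this pattern's timeout cap
        return name
    raise ValueError("no approved pattern satisfies these requirements")

print(choose_pattern(needs_long_running=False, timeout_s=30))     # lambda
print(choose_pattern(needs_long_running=True,  timeout_s=3600))   # fargate
```

The design point is the one Jim makes: choice is real but bounded, so teams self-serve within guardrails rather than picking from an unbounded set.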
[00:16:59] LA: Right. You've talked several times about governance. And I know governance is really big for financial institutions in the industry that you're in. How much of this change was driven by governance? And how much of it was really process improvements and things to make the business more efficient? Is that a split? Or what really was the driving effort? Which side of that equation? [00:17:29] JL: I mean, I think, first of all, the business - and especially with data, our business is a data-hungry business. And they want to move fast. Figuring out the right balance between that and being well-managed is what we're always trying to figure out. We have to be well-managed. There's nothing that we can do about that. But how do we move as fast as possible while being well-managed? That is the key thing that we're always trying to do at Capital One. [00:18:08] LA: Maybe I should restate it a different way then. Did the move to the data ownership model help data governance? Or did it make data governance harder? [00:18:20] JL: That is a good question. I think it both helped and hurt a little bit. We use data everywhere at Capital One, first of all. We use data for some things that other companies wouldn't even think about using it for - taking the time to do the analytics to figure out the right reason to do something. We use a lot of data. It just wasn't practical for us to slow that down. It just wasn't going to get slowed down. It did create some data proliferation, we'll say. But at the same time, with the cataloging strategy and stuff that we have, we were able to contain it. Because if you do have a catalog with all the data that's sitting around your environment, you can then analyze it and be like, "Hey, you know what? This looks a little bit duplicative." It was made because two different teams were working on two different use cases and they had to do what they needed to do to solve the business problem.
But we have visibility into it, right? And I think a lot of the - again, that design-time aspect is so important. And the automation that turns that design time into the runtime interfaces is really what enables us to move fast and be well-managed. That's kind of the key. [00:19:33] LA: Got it. The governance was more of - it was something that was there. You weren't doing this for governance. But you had to always have governance in mind as you did this. And in some ways, it helped governance a little bit. But in other ways, it made it harder, because you are still distributing data and having to deal with all the issues. And there is more governance involved in doing that. Is that a fair way to summarize - [00:19:56] JL: It did make it harder. But I will say that automation also made it easier for people as well. I mean, if you try to do a bunch of governance without automation - let me say it this way - it can make it really hard. [00:20:09] LA: So, automation really helps with governance is a good summary in general, right? [00:20:14] JL: Yeah. And we can now do things pretty easily, like, "Hey, a software application can turn on a data pipeline and make data available in the lake without a central team." They use our central tools. But that's an example of the well-managed aspect of that. And when that data lands in our lake, it's registered, it's cataloged. And then somebody can go request access to it, all in a self-service fashion, right? [00:20:41] LA: That actually kind of brings up an interesting question, though, too. The ability to quickly register new data sets without the centralized team is valuable for a fast-moving company. But it also has a disadvantage, and that is it will cause the amount of data that you collect to grow much, much faster than if you have more control over what new data gets created. How much did you run into that? [00:21:06] JL: Ran into that a lot, especially - we still do.
But the good thing, again, about having that design-time experience is that now, when people, for example, register data that we think is duplicative, we have visibility into it. And we can actually redirect and say, "Hey, there's already a data set that has all this information," right? And so, it is a catch-22. Like you said, enabling all the self-service does cause a lot of data to get created really fast. And making it super easy is a good thing. But that is a downside. But we are able, because of how we've invested in automation, to actually kind of control some of the madness. [00:21:50] LA: When someone registers new data, is there a review process that goes on? Talk through that a little bit. I think that's one of the ways to maintain, in kind of a centralized way, a lid on unfettered growth. Right? And so, how do you do that? What's involved in your process for that? [00:22:09] JL: Yeah, exactly. As a central team, given that we don't really own any of the data, we assign ownership of different domains of data, so that if I go register - let's just say I want to go register the clickstream data from our website. There is a data set owner of the domain for, let's just say, digital data. They do sign off and approve, like, "Hey, this person's creating a data feed for that domain." Actually, within that self-service process, it's not that you can just go create whatever data you want. We've also embedded in there who owns the different domains. And then they actually will approve or deny the fact that that can get created. [00:22:53] LA: Got it. It's not like a central filtering of what gets created. You have ownership of domains. And whoever is the owner of that domain is the one who owns the decision as to whether that data is created or not. [00:23:07] JL: Correct. [00:23:08] LA: That's not necessarily the same as the producer of the data. Correct? [00:23:12] JL: Exactly. [00:23:13] LA: Yeah.
A producer of data who wants to store data in your data lake - or in your data system, but I'll assume it's just in your data lake - has to register it. Has to get permission from the domain owner of that type of data. Also, presumably, some level of permission from you in order to actually create the data in the first place. And once it's there, then they're responsible for generating the data. But the data itself is owned by the domain owner. [00:23:44] JL: Correct. And then on top of that as well, when you want access to it, there's a top-level domain owner and then individual, more granular owners of the data who will say, "Hey, you're allowed to access this data and use it for said purpose." That whole life cycle is done in a self-service fashion by the data set owners, the producers, the consumers. And our central data team, we facilitate that and we build automation for that. But we don't actually own any of the pipelines. [00:24:17] LA: That was going to be my next question. And I assumed the answer was what you said it was going to be, which is, how do you generate that approval process? And the fact is you have tooling that does that, and it is self-service. [00:24:29] JL: Yup. [00:24:30] LA: This investment in tooling, which is a big investment on your part, has to have brought improvements to the organization as a whole. Are you able to quantify those improvements? What type of improvements do you see? And how valuable has this been to the organization as a whole - not just in the efficiency of how you've been able to do the governance, but the improvements overall to the organization? [00:24:55] JL: Yeah. I think one of the biggest things that I see is that our data users can work pretty much in a self-sufficient fashion and do what they need to do. I mean, at any point in time, we have a very large number of use cases going on in parallel. I don't even know the number. We don't really hear that anybody's blocked from doing stuff, right?
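The self-service registration-and-approval lifecycle described above, where producers request a data set in a domain and the domain's owner (not a central team) approves or denies creation, could be sketched roughly like this. All class, method, and user names are hypothetical, invented for illustration.

```python
# Hypothetical sketch of domain-owned, self-service data set registration.
# A central platform facilitates the workflow but owns no data itself.

class DataCatalog:
    def __init__(self):
        self.domain_owners = {}   # domain -> owning user
        self.datasets = {}        # (domain, dataset) -> approved producer
        self.pending = {}         # (domain, dataset) -> requesting producer

    def add_domain(self, domain, owner):
        self.domain_owners[domain] = owner

    def request_registration(self, domain, dataset, producer):
        # Producer self-serves the request; nothing exists until sign-off.
        self.pending[(domain, dataset)] = producer

    def decide(self, approver, domain, dataset, approve):
        # Only the domain owner may approve or deny creation.
        if approver != self.domain_owners.get(domain):
            raise PermissionError("only the domain owner can decide")
        producer = self.pending.pop((domain, dataset))
        if approve:
            self.datasets[(domain, dataset)] = producer


catalog = DataCatalog()
catalog.add_domain("digital", owner="digital-data-owner")
catalog.request_registration("digital", "clickstream", producer="web-team")
catalog.decide("digital-data-owner", "digital", "clickstream", approve=True)
print(("digital", "clickstream") in catalog.datasets)   # True
```

Access requests would flow through the same shape of workflow, with the more granular data owners Jim mentions approving use for a stated purpose.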
They may need more capabilities. And there may be more automation to do it faster. But we really just enabled the whole business to make data-driven decisions across every single organization. Whether it be from cloud operations to credit decisioning. I mean, they're able to move as fast as they can while also not getting in trouble and staying well-managed. And not having to worry about stuff like scanning data for human PII or whatever it is. We take care of all that stuff for them. And they can just kind of go, which is the biggest benefit. [00:25:56] LA: Nice. Nice. What type of advice would you give? A lot of people who are listening to this are thinking about whether they want to move to this sort of model. Some of them are looking at monoliths and trying to figure out what to do with the monoliths. Others are looking at, "We've distributed code, but we still have centralized data. And we're looking to - we see the disadvantages of that." What type of advice would you give to either of those groups for, one, whether they should move to a distributed data model? And, two, what are the things they should look out for in making that model work? [00:26:34] JL: One of the key things that has helped us: you have to define standards, and there are some standards you define that you have to be willing to truly enforce. And some of those things, they will be hard, especially to start. But you need to have that support from leadership that you are going to set these standards to do certain things in a standard way. And, to me, the most important thing with that is around the contracts, and how you were asking me, do people talk directly to databases and stuff like that? I think getting the contract and the model stood up for that is probably the most important thing, in my opinion. [00:27:17] LA: Can you do that without investing heavily in tooling to start? Or do they go hand-in-hand? [00:27:25] JL: You could try.
But I think if you don't also invest in automation, you're quickly going to find that whatever systems you have to enforce those contracts - if you have people manually going in there and updating stuff and there's a bottleneck there - will probably start to fall apart. Because at the end of the day, a business needs to be able to move. And engineers need to be able to move and build stuff. If you put standards in place without thinking about how you're going to automate and enable people to still get their job done, you're probably going to run into some challenges. [00:28:02] LA: Got it. Yeah. And automation means consistency, right? I mean, the more consistent you can make the process, the better off you are long term. [00:28:12] JL: Yep. [00:28:13] LA: Great. Well, thank you, Jim. This has been a great conversation. Jim Lebonitte is a senior distinguished engineer with Capital One. And he's been my guest today. Jim, it's really been great talking to you. I'm glad you've done this model. This is certainly a model I've been promoting a lot in recent years. I'm glad to see someone who can come on and say it worked, and that you recommend doing it. And it sounds like you would do it again. But thank you very much for your time. And thank you for coming on Software Engineering Daily. [00:28:44] JL: Awesome. Thank you, Lee. [END]