EPISODE 1726 [INTRO] [0:00:00] ANNOUNCER: A major challenge in applied AI is out-of-distribution detection, or OOD, which is the task of detecting instances that do not belong to the distribution the classifier has been trained on. OOD data is often referred to as unseen data, as the model has not encountered it during training. Bayan Bruss is the VP of AI Foundations at Capital One, and in this role he works with academic researchers to translate the latest research to address fundamental problems in financial services. Bayan joins the show with Sean Falconer to talk about OOD, the importance of bringing AI research to real-world applications and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [EPISODE] [0:00:53] SF: Bayan, welcome to the show. [0:00:55] BB: Thanks. It's great to be here. [0:00:57] SF: Yes. Thanks so much for doing this. So, you're the VP of AI Research at Capital One. You've been working there in like the ML/AI space for I think over seven years. To start, what was that journey and what's your day-to-day look like as an AI researcher at one of the largest banks in the world? [0:01:16] BB: Yes, happy to tell you about this journey. It's been a really exciting and fun process for me. I've been able to witness firsthand Capital One's commitment to research, how it started many years ago and how it's grown over the years. As you mentioned, I'm in Applied AI Research. I'm part of a larger organization we call AI Foundations, which is really looking at how can we take all of this amazing AI research that's coming out of academia, coming out of other industry labs, and solve some of the really core fundamental problems of financial services. That goes back really to the beginning of Capital One, though I haven't been at Capital One since the beginning. But this question of how do we use data, how do we use models to meet the needs of consumers from a financial services perspective is really where we began as a company and we've only invested from there. In the last seven years since I've been at Capital One, I've gotten to witness us towards the tail end of our large-scale overhaul of our data ecosystem, our migration to the cloud, which really set us up for the next wave of what we were doing as a company, which was employing machine learning in almost every application, every service that we provide to our customers. So, we've been largely successful at doing that, everything from fraud to marketing, to anticipating customer service needs. All of that is running on a very modern technology stack at Capital One, and all of it is largely using machine learning today. So, it's been really impressive to see the company go on this journey since I've been there. This latest wave of AI technologies is just a continuation of that from our perspective. It's looking at the same broad sets of processes. We've got all of this data. We've got all of these customers around the country, around the globe, and we're trying to understand how can we use it better, use this technology in a new way to meet their needs. So, my day-to-day, it really spans the gamut. Part of my job is working with universities, working with researchers at some of the top labs, everywhere from NYU and Carnegie Mellon and Columbia and USC. There, we're framing up research problems where myself and my team are in the lab.
We're working with them on some of the cutting-edge research that's being published at top conferences like NeurIPS and ICML and ICLR. Then, that's kind of the farthest upstream of the work that we do. Then the work that we do spans all the way down to, okay, we've got a very specific business problem. We need to detect this kind of fraud. How can a specific type of AI methodology solve that piece of fraud? We really get to do this whole thing end to end, which is really exciting because, you can hypothesize about how something's going to work in a laboratory setting and you might get something to work on an academic benchmark. And then you try to get it to work on real-world data and a real-world application. Suddenly, it doesn't look as great as it did. There's innovation that's required, not just in getting it to work on the benchmark, but oftentimes even more in getting it to work in the real-world environment. So, our team kind of shepherds that process from start to finish. [0:04:31] SF: So, what's that kind of look like to go from essentially, maybe it starts in academic research, to essentially some of the stuff that you're looking at in applied AI, to actually being integrated into existing services, so that you do create this situation at Capital One where ML is part of essentially every part of the application stack? [0:04:51] BB: Yes. The process kind of moves in two directions. On the one hand, we have to really understand the context of the business. We have to really understand the context of the application. You can have a very simple understanding of the nature of fraud or the nature of customer servicing, and you can build a solution for that simple understanding. Then what ends up happening is that it doesn't work. So, you really have to have a deep understanding of that context. You also have to have a really deep understanding of where the technology is and what it's good at and what it's not good at. What that process looks like is kind of an iterative matching process. I kind of think of it as like a spaceship docking. It's kind of rotating and moving as it gets closer and closer, and you try to figure out what piece of technology fits with the business problem. So, it is largely iterative. You kind of come up with a hypothesis. "Okay, we think that we're seeing this kind of technology work maybe in an academic setting. Could this solve this particular problem inside the company?" We go and talk to our partners who work on that team. We understand their context. We test it out. Sometimes it doesn't work. What we learn when it doesn't work is some constraints about how the model was specified or how the problem was designed. Then we go back to the research side and we say, "Okay, well, these were the assumptions we had made when we were thinking about this problem. These assumptions turned out not to be true. So, let's change those assumptions and rebuild how we do this and then go back to the problem again." Sometimes that cycle takes many, many iterations. Then, that's just solving a single specific problem. Once you've solved a specific problem or kind of gotten a model out into the world, you also can step back and say, "Okay, looking at this big picture of what that process looked like, what are some common themes or some like broader blind spots in the research that would have allowed us to solve this problem a little bit better?"
That's when we go back and we actually like really fund and work on kind of broader sets of questions that need to be solved even before we can kind of do the next phase. To give you a couple of like concrete examples of that, one thing we've observed over the years is as we've kind of worked on large-scale, deep-learning architectures for some of the core applications that we have within the company, this question of explainability comes up again and again. How do we understand what the model is doing? How do we build confidence that the model is making decisions for the right reasons? All of these pieces around trustworthiness of the models that we're using. So, we've funded research, we've done our own research in how to do this effectively over the years. We continue to invest in that research because that's still an open question. But that came out of understanding that this is a fundamental constraint as we try to use this technology in what we're building. Another example is a lot of deep learning methodologies or a lot of AI research in the last two decades has been primarily focused on how do we use AI and deep learning for computer vision or natural language processing tasks. We observed that a lot of the data and a lot of the problems we have are kind of tabular in nature. So, we were one of the early companies to work with a variety of academic labs to study how can you use these methodologies for tabular data or like relational data. That's, of course, like a much bigger field now. There's all these papers around how do you get LLMs to understand tabular data and to reason over large tables. But four or five years ago when we started looking at it, very few people thought that this was an area worth investigating. But we had observed that it was really critical for us to make progress on some of our core problems. [0:08:29] SF: You mentioned there some of the challenges around like explainable systems, explainable AI, and if we look at everything that's happening in Generative AI and around large language models, there's a number of limitations or unsolved problems. One is explainability. There's also challenges around reliability. And given that in something like banking, probably the bar for correctness is very, very high. Where do you think, based on your experience working in the space, what is sort of the entry point for starting to leverage some of these technologies in the world of banking where maybe they don't have to be perfect, but they can still be like a high-value contribution to solving various problems in banking services? [0:09:12] BB: I mean, I think in general, many industries, not just banking, are starting with assistive technologies. So, rather than going to fully autonomous systems where we're relying on Generative AI to kind of make decisions without any oversight, we have humans in the loop. We have humans reviewing. We acknowledge that that's a constraint, obviously. That's not the full potential of what this technology promises, but it is a natural stepping stone that helps us build confidence, helps us understand what are the real failure modes when you use these technologies in large-scale systems and build the kind of guardrails against those failure modes. Sometimes the best way to learn is through that iterative process.
I think many - you just look at AI research in general and the way AI is penetrating society, even in autonomous vehicles, for instance, like they started with diagnostic systems where you might get an alert, then moved to kind of assistive technologies where you might be guided within a lane, and we're getting closer and closer to fully self-driving vehicles. I think you're going to see that same pattern emerge in other high-stakes environments where initially you might use these tools to help identify or spot issues, and then you might use them to help humans do what they're already doing. Then once you've built up that confidence, once you build up that data, once you build up the systems of safety around it, then you kind of start to take the constraints off of it and let the models do what they do. [0:10:39] SF: Yes. I mean, I think that's a really good example, like going back to autonomous vehicles, like even something around getting warnings around collisions or overtaking, braking, and stuff like that. It's tremendously valuable from like a safety perspective. But that's a long way away from I'm just going to like sit in the back seat while my car drives me around to like whatever my destination is. But those are the stepping stones that lay the groundwork for getting to that sort of world of full autonomy. [0:11:04] BB: That's right. We take for granted often the progress that's been made because it is incremental obviously and it's incremental for good reason because these are complex systems and they're potentially quite dangerous. But if you think about learning to drive 10, 15 years ago, 20 years ago before even simple things like lane change alerts and blind spot warnings were in place, all of these things are now completely standard for all drivers. Let alone the kind of almost autonomous driving that you can do on the freeway now, it feels now like, "Oh, that doesn't seem like we're anywhere closer to autonomous driving." But if you actually compare it to where we were 20, 30 years ago, we've made quite a lot of progress, and there's a lot of core investments beyond just the AI that have gone into that, everything from instrumentation to how the vehicle can be controlled through various technologies. A similar path is going to be the case in a lot of industries, which is there's a lot of investment that goes into instrumentation. Because once you move into something that's fully autonomous, you need to have all these systems that are measuring and understanding the dynamics of the environment and whatever the agent is that's moving in the environment. For us, we're investing just as much in the core AI research as we are in understanding how do we monitor, instrument, and understand the environment around the various models that we're using. [0:12:24] SF: Yes, absolutely. I mean, I think that we tend to not only sort of take for granted certain technology innovation that would have seemed magical at one point, but we also become like sort of disgruntled with it. I commented on this recently that if you had shown something like ChatGPT or GitHub Copilot to people five years ago, their minds would have been completely blown. But now, everybody's reached a point where, and we're not even that far into the life cycle, people are constantly complaining about how dumb these systems are and how they can't do all the things that we want them to do.
I think that's just like a natural progression of humans interacting with various technology. [0:13:01] BB: Absolutely. [0:13:02] SF: So, one of the things I wanted to get into today is this concept of out of distribution detection or OOD, which is about identifying inputs to a model that are outside the distribution of the training data. So, before we get into too much detail here, I want to start with what are some applications, essentially, of OOD? What am I using that for, essentially, when I'm applying it? [0:13:23] BB: Yes. It's a really interesting area. It's probably worth starting with the big picture of what we're talking about here. It is very closely related to what we were talking about with explainability and robustness and safety. The fundamental idea and problem that we're describing, OOD is kind of a narrower subcomponent of this, but the broader picture of what we're describing is this problem where the data that you have, when you're training a model, when you're building a system, is inherently incomplete. We can't, at any point in time, have all the information about the world or about the universe, right? We're finite. We can only collect so much. What that means is that by the time you use the model, and while it's out in the world, it's going to be interacting with data that is in some way different than the data you had when you built the model and you deployed the system. There's a number of different ways that the data could be different. For instance, you might train a classifier to identify the difference between two types of images. This is a common framing within the OOD literature. So, you say when you train a classifier at the time you build it, you have 5,000 samples of cats and 5,000 samples of dogs and you're trying to train a model to predict the difference. The model not only learns the properties of cats and dogs, but it inherently starts to think, at least because of the way the data is presented to it, that 50% of the world is cats and 50% of the world is dogs. Then, maybe it gets out into the world and discovers that there's a lot more dogs than cats. Say, 70% of the world is dogs and 30% of the world is cats. That's an example where the data that you trained on fundamentally looks different than the data in the world. You can have a similar problem, but slightly different: instead of changing the ratio that you're seeing between when you trained and when you deployed, you might have the same ratio, but the kinds of dogs and the kinds of cats that you had when you trained the model are different than the kinds of dogs and kinds of cats you're seeing when you're deploying the model. So, maybe you trained on bulldogs and suddenly you're seeing a bunch of French poodles. They're still dogs, but they look different. They have kind of different features and different aspects that make them dogs versus cats. Another common way that this might appear, where you've got data that you're training on that's different than the data that you're using, is that everything is exactly the same, but now there's just a lot more noise. This is actually a very common one where when we're building models, we clean up the data a lot. You might remove outliers. You might normalize the data in specific ways. So, the data itself becomes pretty clean looking, especially if you train models on kind of academic datasets. They've been really, really cleaned over the years. Then all of a sudden, you get out into the world and the data is very noisy.
It's corrupted. It's got kind of a blur to it, in the case of images. Then another way that it might be different is that you have all the same data, maybe it's not corrupted, but certain aspects of the data have changed. So, for instance, when you trained the model to classify between dogs and cats, all the dogs were in the yard on grass, and then suddenly you start getting pictures of the dogs on dog beds. That's just a fundamental change. You've never seen pictures of dogs on dog beds. Then finally, you might get a situation where a whole new thing starts to show up. So, instead of just cats and dogs, suddenly a parakeet is showing up. There's all these different ways that the data that you have when you're building the model can be different than the data that comes in when you're using a model. Ideally, you want to have models that are robust to these changes. That's the goal. You would like models that can, if not adapt, at least not fail in the presence of new types of data or data that's slightly different than the data that it was trained on. Adaptation would be like the best thing, not failing would be the next best thing, and then kind of the third tier down from that is, well, I know it's going to fail, so can you detect it? Can you tell if one of these potential issues is coming up? Probably the one that's most studied in the literature is the last example I gave where you have a model that's trained to do a specific set of classifications over a set of possible classes and a new class shows up. This is often referred to as open set recognition. This is probably studied the most because identifying new things that are outside of your training data set, or at least kind of being able to estimate how confident you are when you're outside of your training data set, is actually one of the critical pieces of the computer vision problems in self-driving vehicles. So, you can imagine a self-driving vehicle has a set of possible things that it's been trained to detect, and depending on what it thinks it's seeing, it might make certain decisions. What you would like is that if it sees something that it's never seen before in its training data set, that it maybe reverts to some more cautious policy or, if there's a human in the car, kind of allows them to take over. So, that's probably why that's been studied the most. But it's not just a self-driving vehicle problem. In healthcare, you might have this problem where you might start seeing a new symptom or a new illness emerging. Think of COVID-19, where we started seeing all these symptoms, and we would potentially want a system that could detect that a new virus is starting to spread in the community. Maybe you had a model that was trained on predicting things like flu and RSV. But now you have this new emerging virus, so it wasn't in your training dataset. So, knowing if you're in this new regime would be an example in healthcare. In financial services, there are some examples where you might have a new customer intent emerging, primarily in kind of our chat-based systems or even in our call centers, where maybe we roll out a new product or a new policy and customers start calling us with specific questions about that. We have systems that can detect what customers are trying to solve based on what they're saying to us. Maybe this is a new thing that the customers have never called before. So, we want to have the ability to detect that this is a new intent that the customer might be reaching out to us about.
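As a toy illustration of the first kind of mismatch described above - the cat/dog ratio drifting between training time and deployment - here is a minimal sketch of a class-prior shift check. It assumes SciPy is available, and the counts and significance threshold are made up for illustration; it is not meant as a production monitoring setup.

```python
# Toy prior-shift check: compare the class mix the model saw in training
# (say 50% cats / 50% dogs) against the mix it is now predicting in
# production, using a chi-square goodness-of-fit test.
import numpy as np
from scipy.stats import chisquare

train_fractions = np.array([0.5, 0.5])    # cats, dogs at training time
deployed_counts = np.array([300, 700])    # predicted cats, dogs in production (made up)

expected_counts = train_fractions * deployed_counts.sum()
statistic, p_value = chisquare(f_obs=deployed_counts, f_exp=expected_counts)

if p_value < 0.01:  # illustrative threshold
    print("Production class mix looks different from training: possible prior shift.")
```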
Certainly, there's a big difference between a one-off where you're just seeing something for the first time and you just want to kind of maybe have a human step in to take a look at it and kind of an emerging, entirely new intent or, in the healthcare example, an entirely new illness, where you want to be able to identify when that kind of entire regime change has occurred and how you react to it holistically. [0:20:08] SF: Okay. Essentially, when we're developing models, ideally, we're building a model that's robust and can essentially handle unseen data to some extent. But we also want to be able to detect those situations where we know the model is probably not going to be able to handle this. So, going back to the autonomous vehicle example, maybe we've only trained an autonomous vehicle using urban street data and then suddenly we're in a rural area and it looks different. There's cow crossings and we're on an unpaved road. Then we might want to detect that because there's a high chance of failure. Let's have a human take over or proceed with caution or something like that. Then the other way is we might want to be able to detect something that's completely new so that we can also direct someone's attention to it because this is like a new disease that we need to look at or something like that. [0:20:57] BB: That's right. Your reaction to each of those things might be different, as you said, like how you fix the problem changes. So, if it's like an entirely new pattern or new class of things, then you want to start collecting a certain amount of data and you actually want to retrain the model, right? You want to bring that in and allow the model to be able to predict that in the future. If it's an outlier or something that you only have seen one time, then you might want to revert to some expert system or some manual override. But knowing which setting you're in is really, really critical to kind of the overall safety of the system. [0:21:30] SF: So, how do you define like this boundary between what's in distribution and what's out of distribution? [0:21:36] BB: That is really, really hard. I would say that's the crux of a lot of research over the last few years, and it gets down to the different approaches. So, there's actually a number of different approaches to OOD and OOD-related methodologies. They go by a lot of different names, and oftentimes, the researchers in these different fields don't necessarily talk to each other. It might be called outlier detection, anomaly detection, which has a history that goes back many decades. I mentioned open set recognition, novelty detection, one-class classification, distribution shift detection. Each of these has a different kind of set of possible metrics and methodologies that they use to ultimately decide what is in distribution and what is out of distribution. Sometimes they share that metric or share that methodology. But they don't always and there's a lot of debate on which methodologies work the best and when. I would say it's an open question as to, I don't think there is one at the moment that works better than any of the rest. Now, the ones that I've looked at the most are kind of in the space of open set recognition and distribution shift detection. In the distribution shift detection literature, you can use a variety of two-sample tests. So, you can ask, okay, I've got a sample at time T. I've got a sample at time T plus one.
Is there a difference between these two samples? That's one way you can identify: is there a fundamental distribution shift? In open set recognition, so this is where you've got a set of classes of things that you've trained on, and you have a potential new class that you've never seen before. Again, say you have cats and dogs that you've trained on, and this time you have a parakeet that shows up. There's some common ways to do this. Probably the most common way is something called the maximum softmax probability. So, in any classifier, the thing that the classifier outputs is a probability distribution over all the classes. In cats and dogs, it's just two classes. If you had 10,000 possible classes of different animals and houses and anything that you could see in an image, then you'd get a probability distribution over all of those. A very simple thing that you can do is look at the maximum probability in that distribution. So, say you have 10,000 probabilities. All of those have to sum to one. When the model sees an image of a class that it was trained on, it usually will have a maximum probability that's pretty high, meaning the model is confident. Whereas if it's a new class that it's never seen before, that maximum probability will be a little bit lower. Now, this doesn't work 100% of the time, and this is why it's still a very active area of research. It often fails and fails in ways that are surprising to people. A lot of deep learning models can be quite confident in areas where they shouldn't be confident. In particular, in open-set recognition, they might be confident on things that are clearly not in their training distribution, and there's a number of reasons for why that is. So, there's another common way that you can identify out-of-distribution data, which is what would be referred to as feature-based. So, all deep learning algorithms go through this process of learning the features of the data that they've been given. One thing you can do is you take a classifier that's been trained to make certain predictions. Then, you look at the features that it learns about that data, which is essentially like you have a deep learning model and you just take the top off. You lop off the head of it. Now, you have kind of these representations of the data. You can fit a density model over this space, over the different classes, and you can use that density model to estimate the probability or the likelihood of a sample under the density model, and that can tell you whether or not a new data point is expected or not. [0:25:40] SF: If you're dealing with this, essentially, open set problem, so you have a bunch of classes defined, presumably you've had training data that's gone into the definition of those classes. Could you compare essentially the distance between the new input's vector representation and the centroid of those classes? Then, if it's larger than some reasonable threshold, it's something that's net new? [0:26:02] BB: You could. You could. That's essentially what this density-based method does, except it does it in a more probabilistic manner. So, you can actually view it as a likelihood of the new data point. But yes, you could also just look at it in pure distance terms. It's interesting though, because these models don't necessarily use this space that they're given, these high-dimensional features that they learn about the data, in ways that are intuitive.
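To make the two scores discussed above a bit more concrete, here is a minimal sketch, assuming you already have a trained classifier's logits and its penultimate-layer features as NumPy arrays. Every name is illustrative rather than any particular system's API: a maximum softmax probability score, and a crude distance-to-nearest-class-centroid score standing in for the feature-based, density-style approach.

```python
import numpy as np

def max_softmax_probability(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability (MSP) per sample; lower values suggest
    the input may be far from anything seen in training."""
    shifted = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def distance_to_nearest_centroid(train_features: np.ndarray,
                                 train_labels: np.ndarray,
                                 new_features: np.ndarray) -> np.ndarray:
    """Euclidean distance from each new sample's feature vector to the
    closest class centroid computed on the training data."""
    centroids = np.stack([train_features[train_labels == c].mean(axis=0)
                          for c in np.unique(train_labels)])
    # Pairwise distances, shape (n_new, n_classes)
    dists = np.linalg.norm(new_features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

# Usage idea: flag inputs whose MSP is unusually low, or whose features sit
# unusually far from every training-class centroid (thresholds are up to you):
# msp = max_softmax_probability(new_logits)
# dist = distance_to_nearest_centroid(train_features, train_labels, new_features)
# flagged = (msp < msp_threshold) | (dist > distance_threshold)
```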
So, it's not like you get all of your data points with all of your cats in one place, and all of your dogs in another place. Then, a new data point just kind of plops somewhere far away from them. Sometimes it's actually quite close or it's closer to one than the other in ways that we wouldn't expect. So, the learning dynamics of these models is still not entirely intuitive. It gets back to this problem of explainability, which is we don't always know why certain features or certain aspects of the data are learned in the ways that they are. It makes this problem of open set recognition particularly challenging. [0:27:07] SF: I would think some of it's also context-dependent on the use case because there could be certain circumstances where you're going back to the dog example, maybe for the problem I'm solving, I only care about like really specific types of dogs. Then, like a net new dog is actually a new class, but it doesn't fit into like whatever problem I'm actually trying to solve and maybe it's less important or something like that. [0:27:31] BB: That's right. That's right. This is a hard problem for that exact reason. It's also hard because these neural networks are particularly susceptible - I don't know if neural networks are more susceptible than other models, but we've studied it a lot in the case of neural networks because of how they learn - to what's referred to as spurious correlations. So, we tend to think that a model has learned a specific aspect of the data. I'm interested in classifying cars between trucks and sedans, for instance, might be something that I care about. I think, okay, well, it's learned that trucks are big. They have this flatbed space. Like, it's learned all these aspects of the truck that I would use to make a prediction. But what it ends up learning, it turns out, is that in all the pictures where the trucks were, it was a picture of the truck on a ranch or like on a mountainside. And all the pictures of sedans were in a city. So, what the model was really learning was to tell the difference between cities and mountains, not between trucks and sedans. These are referred to as spurious correlations. There's a famous dataset in AI research called the Waterbirds dataset, where they have all these - you can train a model to predict whether a bird is a water bird or a land bird. So, you have all these ducks and they're all sitting on water. Not surprisingly, when you put a bunch of ducks on land, the model thinks they're all land birds. Was it classifying water birds or was it classifying land versus water? This happens quite a lot in deep learning and computer vision. Part of the reason why is that neural networks, it's been shown, at least in the literature, are what's called shortcut learners. They're kind of lazy. They tend to employ superficial strategies in how they learn the data. They take the easiest path to get good at the task. A lot of the work then goes into how do you train them in a way so that they don't kind of use these superficial strategies. So, part of that is, okay, well, make sure that your data set is actually representative of the data that you want. How do you regularize the model away from certain common shortcuts that it might be able to take? So, there's a ton of research on how to make these models more robust in the learning process. But there's also a fundamental problem with the way we do modeling, if you think about it, right? Which is that these models don't have an inherent sense of knowledge.
They don't understand that a bird is a thing that exists and that certain types of birds live on water, but they can also go on land and that other types of birds live exclusively on land, right? That's not something that a model can inherently learn. It can learn the patterns of the data that kind of map onto that knowledge, but that knowledge is something that we as humans have built up over thousands of years of our experience and our scientific process and all of that. So, fundamentally, you can't really ever expect them to develop that sense of knowledge, of truly knowing something. They can get really good at approximating that knowledge, but they're never going to cross that threshold. This isn't necessarily a deep learning problem inherently, right? All machine learning suffers from this to an extent. They're not designed to obey structured logic. They're learning from patterns. I think as we move and as we advance the field, we'll obviously have to deal with this and part of that will be figuring out strategies to accommodate for that limitation, and part of it will actually be developing research that moves past that limitation and seeing if you can actually get models to have that inherent sense of knowledge. [0:31:02] SF: Yes. Do you think then, based on what you're saying there, that it requires essentially new innovation in order to break through that barrier that we have right now where models don't have essentially this inherent knowledge or they can't essentially do the pattern recognition at a level that can mimic as if they have that knowledge? [0:31:18] BB: I think so. I mean, I think if you look at, like, general language models, that kind of knowledge enhancement is coming from external systems. It's not coming from the model itself, right? We're getting much better at finding different strategies for empowering those models with factual information, storing that factual information outside of the model and providing it in a way where we have more control over how it gets to the model. Different strategies for how those models process and rely on the data, I think, make a big difference. I think that even that has a limitation of externalizing the knowledge, which is that we have to set up those systems. It still requires a lot of humans to think about how does this get to models and how do they take it in and spit it back out. That threshold of when they can actually know something inherently and can kind of use that information to deduce other things and show some kind of like inductive logic - all of those things are things that we inherently do that I don't think models are able to do given their current setup. I wouldn't speculate on exactly the path to get there. But it's an area of active debate. [0:32:25] SF: Yes. Well, I mean, I think predicting anything at this point is pretty challenging with how fast things are moving. So, going back to the out-of-distribution detection, given that there's a bunch of different varieties of approaches and they fail for different reasons, do people look at combining multiple approaches to sort of fill in the gaps? Let's use some sort of distance measurement, but also let's use the softmax approach and maybe some other approaches. [0:32:52] BB: Yes. It's a great question. To be perfectly honest, I haven't seen a lot of research in this area of ensembling different techniques. What you will often see, at least in production systems, is for different types of failure modes or different types of distribution shifts.
We will use different methodologies, but we'll usually pick one methodology for each. Say for open set recognition, we might pick one methodology. For distribution shift or change point detection, we might pick a different methodology. We might pick a different methodology for outlier detection. But that being said, I think in a number of fields, when we've found that no one model is particularly good, ensembling or combining kind of the wisdom of the crowd, so to speak, is quite effective. I would expect that that would be true here too. There's probably a lot of work that needs to go into how exactly to ensemble these different methods and how to combine their ability to spot different aspects of the data and when you can bring them together to make it more effective. I think it's an interesting direction that could be quite successful. Obviously, you incur a much higher cost if you have to run six different algorithms instead of a single algorithm. That's always a trade-off in safety systems, which is that instrumentation and guardrails are not free. The more you put on there, the more maybe latency goes up or just general computational costs go up. So, you have to be mindful of that. Some of these can be quite expensive. [0:34:21] SF: In systems that have to be, like in an autonomous vehicle where reaction time matters, then you might not be able to handle the additional latency in those types of systems. [0:34:30] BB: That's right. That's right. Those are the kinds of constraints that are really interesting about actually deploying these technologies is you're like, "Oh, okay, well, yes, in a perfect world, I could do 10 different systems, and they could all add up to get it exactly right, and I can feel very confident," but then you actually look at the constraints that you have when you deploy it, and you're like, "Okay, well, I have to pick two, and how do I pick the right ones, and how do I set up the system?" So, that's the fun of applied research. [0:34:53] SF: Yes. I mean, that's like basically the separation between the academic and the applied is how do we do this in reality. How are these systems tested? How do you actually determine how good an OOD detection model is? [0:35:06] BB: In an offline setting, you can simulate a setup that would look like an OOD scenario. So, you could basically say, "Okay, fix the number of classes that my model has been trained on and then sample from a set of classes that my model wasn't trained on." For each one, in the case of MSP or maximum softmax probability, you would basically say, "Okay, for in-sample examples, so flowers and trees, calculate the maximum softmax probability for each one of those. Then, for airplanes, calculate the maximum softmax probability for a bunch of airplanes that the model wasn't trained on." Then you essentially can use that maximum softmax probability as a classifier unto itself, right? So, that gives you the ability to look at the area under the ROC curve for that model's ability to tell the difference or that methodology's ability to tell the difference between in-distribution and out of distribution. That's when you're testing the model in a kind of simulated pristine environment. When you actually use the methodology, it gets a little trickier, because if you are out of distribution, you usually don't know it unless you have one of these methodologies.
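As a rough sketch of the offline evaluation just described - scoring held-out examples from classes the model was trained on and examples from unseen classes with the same detector, then treating that score as a binary classifier and measuring the area under the ROC curve - the computation might look like this. It assumes scikit-learn is available and reuses an MSP-style scoring function like the illustrative one sketched earlier; it is not any particular team's evaluation harness.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(in_dist_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """AUROC for separating in-distribution from out-of-distribution samples.
    Higher MSP means 'looks in-distribution', so the score is negated and
    OOD samples are treated as the positive class."""
    scores = np.concatenate([in_dist_scores, ood_scores])
    labels = np.concatenate([np.zeros(len(in_dist_scores)),   # 0 = in-distribution (flowers, trees)
                             np.ones(len(ood_scores))])       # 1 = unseen class (airplanes)
    return roc_auc_score(labels, -scores)

# e.g. auroc = ood_auroc(max_softmax_probability(logits_in_distribution),
#                        max_softmax_probability(logits_unseen_classes))
```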
But if one of these methodologies isn't working, then you really have a blind spot, because it's the thing that's supposed to tell you if you're out of distribution. So, you have other systems that can help. Oftentimes, you can wait for some downstream system to alert you to a failure, or even if there is a human in the loop, sometimes they might spot it and tell you that something's changed. [0:36:42] SF: I've heard you talk about how the whole premise of using supervised models, training on in-distribution data for out-of-distribution detection, is like a fundamentally flawed idea. Why is that approach flawed? [0:36:55] BB: I wouldn't expect it to work. If you actually step back and think about the problem, you have this model, it's got a set of data that it's been trained on, and it's got a set of classes that it's trying to map that data into. From the vantage point of the world that that model knows, that model knows the data it's seen and that model knows the classes. Inside of that world, the model has no conception that another class of things might exist in the world, right? Ontologically, its entire universe is the set of things that it's seen. I wouldn't necessarily expect it to be good at saying, "Oh, this is something I've never seen before." Because that's not what supervised learning does or what it's designed to do. So, you might say, okay, well, add another dummy class that says, "Okay, I have two things that I'm interested in, and if I don't know what it is, classify it as this third thing." That might just give it this ability to tell you that it hasn't seen that before. It turns out that it doesn't do a good job at doing that classification either. I don't think we really know how to specify the problem correctly at the moment, but we're seeing, in all the different ways that we tackle this problem, that there's a fundamental or ontological limitation to the models that we can't get the models to overcome inherently. I think it's one of these interesting gaps that we're seeing as we think about, whether you want to call it AGI or something else, how we move beyond kind of the set of models that we have. Humans know there are limits to the things that we've observed in the past. We know that we haven't been able to experience everything that is in the world and every day we encounter new things and we can reason about what those things might be and how those new objects or new experiences might fit into a framework of thinking and experiences that we've had in the past. That's just not how supervised models work, and it's not even how generative models work. This is - it's one of those barriers that we haven't crossed yet between how we as a species think about the world and how these models think about the world. [0:39:11] SF: Yes. I mean, there seems to be a fundamental difference between whatever's going on, essentially, in our head versus what's going on in terms of a model. Even if you just look at the amount of input that you need to train on, like a human is sort of exposed to a relatively small amount of input compared to something like a large language model, which is basically trained on every piece of digitized textual information that's available, way more information than any person has probably ever seen before, just to be able to generate sentences and paragraphs in a way that is like understandable.
There seems to be a fundamental difference, essentially, between what we can do with a model executing on a Turing machine versus whatever's going on in our brains in a biological sense. [0:39:57] BB: Yes. I mean, I have two little kids and I think maybe the best thing for any AI researcher to do is to have little kids running around, because it really shows you how we learn and what we learn. One thing I've observed in the last few years, as they've gone from babies to toddlers to little kids, is that they build up this ontology of the world. They build up this structured abstraction. Not only do they do it kind of completely on their own, but they're constantly checking it with you, with others, with their peers, with experts, and they're refining it. Early on, they're kind of developing this complex understanding of what different things are and how they relate to each other. Then another interesting stage, which as a parent is like both fun to watch but also can get maddening, is they ask you why like a gazillion times. [0:40:49] SF: Yes. My kids are in that phase right now. [0:40:51] BB: Yes. Why? Why? Why? Part of that is because, okay, they've got an ontology of the world. They've understood that there are dogs and cats and trees and horses and cows. They know that these objects exist, and now they're starting to understand causal relationships in the world. They're essentially asking you to impart your prior knowledge about those causal relationships onto them, and they're refining that causal model in their heads. It's not causality in the like statistical sense that is often used in the literature. It's a much softer kind of causality. But humans all have this way of understanding the world that's somewhat structured, and, most importantly, we're constantly refining it based on our experiences, and we're doing that in a very efficient way, which I think is the most important part about it. You don't need to experience everything in the universe to have a pretty good understanding of how things work. [0:41:41] SF: Yes. I mean, we're basically inherently kind of bad at math and probability, but great at pattern recognition and detection from a very limited amount of information, and very, very efficient at that. So, it's like, it's a completely opposing view to what computers are good at. But I feel like we could easily spend an hour or more just on this topic. Maybe we'll have to have you back down the road to dive into this. I would love to hear more of your thoughts on this. But as we start to wrap up, is there anything else you'd like to share? [0:42:11] BB: Yes. I think there's a lot of interesting areas in this broader robustness domain that we are going to make significant progress on in the next couple of years, primarily because the success of these broader systems, these broader AI systems, has taken the technology out of the lab and into everybody's daily lives, and it's there that we're going to actually figure out how to solve some of these more theoretical problems. Where are we really struggling with OOD detection? Where are we really struggling with distribution shift and model robustness? The flip side of that is that it's now easier to test solutions, right? When it's strictly an academic exercise, you test solutions on the set of benchmarks that are available to you. I think we saw a few years ago, and in some domains this is still the case, that the benchmarks kind of got saturated. We achieved the best performance we could on the data that was provided to us in an academic setting.
Getting more data, getting real-world data, getting environments that can really be tested in, really shows where the opportunity lies for the way ahead. I think that's what I'm really excited about. [0:43:25] SF: Yes. Absolutely. I mean, I think as more and more people are basically building on AI and building new types of systems that are interacting with real people, the learning cycle is just going to speed up compared to, if we sort of look back 30 years ago, where most of the stuff was existing in academic labs in a university setting, there's just a limit to the amount of input and feedback that you can gather in those settings. [0:43:48] BB: That's right. [0:43:48] SF: Well, Bayan, thank you so much for being here. I really enjoyed this and have a great day. [0:43:52] BB: Thanks, Sean. It was great talking to you. [END]