EPISODE 1916 [INTRODUCTION] [0:00:00] ANNOUNCER: Predictive modeling is a core element in modern systems and powers capabilities such as fraud detection, loan approvals, and recommendation systems. These systems typically operate on structured relational data stored in enterprise databases with rows, columns, and interlinked tables. While computer vision and natural language processing have undergone a neural network revolution, the tabular data layer underpinning predictive modeling still largely relies on manual feature engineering and task-specific models. Relational deep learning proposes a new approach. It treats databases as graphs and applies transformer-style attention mechanisms directly over structured relational data. Researchers are now building foundation models for tabular data that aim to generalize across predictive tasks without painstaking feature engineering. Jure Leskovec is a Professor of Computer Science at Stanford University and he previously served as Chief Scientist at Pinterest and was an investigator at the Chan Zuckerberg BioHub. Most recently, he co-founded the machine learning startup, Kumo.AI. In this episode, Jure joined Sean Falconer to discuss the limitations of traditional predictive modeling, why structured enterprise data requires its own modality-specific neural architectures, how graph transformers generalize attention to relational databases, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:46] SF: Jure, welcome to the show. [0:01:47] JL: Thanks for having me. Great to be here. [0:01:49] SF: I'm really excited to get into this. I think there's a variety of topics we can dive into today, but maybe before we get there, just ground the audience a little bit. Who are you? What do you do? What's your background? [0:02:00] JL: Yeah. I am primarily a professor in the computer science department, AI lab at Stanford. Been there 15 years. 
My research focuses on AI, especially AI over structured, tabular, relational type data. I work a lot with graph data. In my career, the models we've developed are today used at Facebook, at Meta, at YouTube, and at a number of different places. In my career, I was also a chief scientist at Pinterest for six years, as Pinterest grew from 150 employees to post-IPO, and was basically building large-scale AI machine learning platforms there. [0:02:39] SF: What drew you back to academics after having experience at a place like Pinterest? [0:02:43] JL: I think Stanford is the most amazing place. It's where the future happens. What is also amazing about Stanford is that it allows us to flow between industry and academia and really understand what are the biggest problems out there that are worth solving. We go to the industry, we come back with the ideas. The students at Stanford are amazing. I would really say, the future happens at Stanford, and that's the most exciting part to me. [0:03:08] SF: Yeah. I spent some time at Stanford myself as a student and then was drawn out to industry and never found my way back. I guess, now that there's so much going on in the space of artificial intelligence, especially with companies. You have your OpenAIs, your Anthropics of the world and every large cloud provider doing amazing work in the space. How do you think in terms of where research is going to make significant contributions, versus where maybe private sector companies are going to make significant contributions? [0:03:39] JL: That's a great question. I would say, research in academia is different, right? We cannot compete on scale. We cannot compete on pushing products to customers and things like that. That's why we have startups. That's why we have industry. Whenever we make some new research breakthrough and we want the world to see that, we spin off a company, we spin off a startup and scale it up there. 
Then at the same time, there is huge value in academia and research and education, because the risk profile for us is very different. We can truly explore. We can truly fail. We don't have performance reviews, so we don't have all that that is in the industry. We can always ask about what are the paths not walked yet? What are the interesting new directions that maybe the industry is too conservative to take? Can we show the path there? There's been examples of this throughout history, where basically, academia found a new path, or researchers found a new path where the industry was just plowing forward full scale. Right now, it's similar in, I would say, in the field of AI. Of course, the frontier labs are making humongous progress, scaling up these models, and so on, but there is so much unexplored, there is so much more to do and that's what we are focusing in on our research. [0:04:50] SF: Yeah. I mean, I think a big part of that would be, there's no commercial obligation in academics. You could chase a problem that may or may not ever have some sort of commercial application just because it's an important problem, or at least to the individual to explore and maybe mean something in the long run to how we think about ourselves, our own intelligence, some other type of scientific endeavor. [0:05:12] JL: Exactly. Even with that, I would say that we are very careful what kind of questions we ask. We already ask ourselves, if we solve this problem, who's going to care about it? Who can benefit from the solution? Being connected to the real world is a very important part of the way we think of the research we are doing. [0:05:29] SF: I want to get into this a little bit and first talk a little bit about predictive modeling, which I think is something that has a long, rich history in machine learning, artificial intelligence. There's fairly simplistic ways of doing some form of predictive modeling and then there's very sophisticated approaches as well. 
Can you give a little bit of background in terms of the history? What are we talking about when we say predictive modeling? Why does that matter and then what's the history of the discipline? [0:05:57] JL: Yeah, that's a very interesting question, and I think as we go through, it will become clear why this is interesting, right? Predictive modeling has been around forever and it's all about forecasting risk estimation. It's about filling in some information that doesn't exist yet. Where does this matter? This truly matters for, let's call it, quantitative decision-making. If I go ask for a loan, there is a predictive model that estimates the probability that I'm going to pay back that loan. If I'm in a hospital, the hospital wants to estimate what's the risk that if I get discharged that I get re-admitted? If I am, let's say, dealing with customers, I want to estimate what is the lifetime value of the customer. I want to estimate how likely this customer is going to churn. I want to estimate what next product, or what next show item to recommend to the customer. If I'm a, let's say, financial institution, I want to estimate what's the likelihood that this particular transaction is fraudulent? When a, let's say, user logs in, I need to estimate how likely is this a stolen identity? Somebody else is logging in and things like that. These are all predictive type problems, where based on the historical patterns based on the data, we want to estimate something that we don't know yet. We want to forecast something, and so on. This has been around for a very long time and people have been building machine learning, statistics, data science, have been building these predictive models for the last 20, 30 years. The point is that every percentage improvement in accuracy of these models means humongous business impact. Even 1%, 2% improvement can have humongous business impact. 
[0:07:34] SF: How would I think about this in terms of, with forecasting, say it's like financial forecasting, I'm taking a bunch of history and I figure out what is the function that describes that history, and then I'm projecting that out to see like, where could this, I don't know, trend in the stock market go, or something like that? How is it when you think of something like classification? I want to go and take a bunch of history as my training set and I'm going to use that to train some sort of classifier to figure out whether an email is spam or not. Is that predictive modeling, or is that a different type of classification of how we would think about that AI model? [0:08:10] JL: This is all what I would say falls under predictive modeling. Both examples, time series forecasting, any kind of classification, churn modeling as you gave the example, where the idea is that based on some historic patterns, you are trying to forecast, is this person going to cancel subscription in the next month? Because we don't have the information what is happening next month, we have to forecast. [0:08:33] SF: If we take a specific example, like recommendation systems, I think Amazon's been pretty famous for using recommendation systems for products for a long time. How do those systems typically work? [0:08:45] JL: Yeah. The reason we started opening this is because this field has remained practically unchanged for the last 20, 30 years, right? It's all based on this idea that you bring in, let's say, a data point, a unit of something that is described with a set of characteristics, or a set of features and then you are making that prediction. For example, for a recommender system, the idea would be that you have a description of the user, which would be maybe how long ago did the user register? When was the last time the user logged in? What were the last seven products the user visited? What categories are these products from? And so on. You have this profile of the user. 
Then you would say, okay, I also have to now build a profile of the product. Now, I need to learn some function that takes the profile of the user, profile of the product, and tells me how likely the user is, let's say, to purchase that product. Then if I find the top 10 products that the user is most likely to purchase, I show those to the user and my sales go up 20%, 30%, 50%. That's how this is generally done. Now, the majority of the work goes into building that scoring function that takes the user profile and the item profile and gives me the prediction. If you say, how accurate is this prediction going to be, you have two aspects to it. One is how accurately do I build the user and the product profile? Let's call it this way. Then, how powerful that predictive model on top is. The point is that this is super painful, super slow and super manual to do, right? You need to hire a team of data scientists. They need to build these profiles. This is called feature engineering. They come up with some historical summaries of the user activity, put those into the user profile. They do something similar with the product. Then they create these training data sets to build the models on top. The models on top can be these decision tree, XGBoost, CatBoost type things that work well in practice and are completely respectable, as well as more sophisticated neural network approaches, and so on. The bottom line is that you need about two full-time people to build a single model, right? If you say, how expensive this is, it's like, I need two employees to support one model. If I now want to have 10 models in production that are making these decisions on the fly, I need that number times two people to support that. [0:11:07] SF: If I build my e-commerce recommendation system, I also need fraud detection. I can't just pick up my recommendation system model and apply it to fraud detection. I got to go train, essentially, a new model. 
I got to do feature engineering just for the fraud detection. Probably use maybe even a different type of model to train and test against and then probably operationalize that model with a different set of people. [0:11:28] JL: Exactly, exactly. I would say, now, what is the exciting thing, right? Why are we talking about the past? I think the exciting thing here is that this entire area hasn't seen real progress in the last 20, 30 years. If you think about what has happened in the broader AI ecosystem, is that we went fully neural network, right? What I mean by that is on the, let's say in computer vision, we used to do some edge detection, some feature detection and then build a classifier on top to say what's in the image. We don't do that anymore, right? Today, a neural network just learns directly from the pixels of the image. The same thing I would say happened on the, let's call it natural language processing area, right? Where we used to do all kinds of parsing and feature extraction from sentences to try to say something about what the text is saying. Today, the attention mechanism just attends over the tokens of the text and the AI, the reasoning is born. I think the exciting thing here is that machine learning hasn't gone through this neural network revolution. That's the exciting new thing here is that there is the neural network revolution for machine learning ready to happen, that completely changes how we are building these models, how accurate they are, how much feature engineering it takes and things like that. [0:12:43] SF: Why is this hard in practice and why couldn't we just take - we've had this revolution around things like large language models that understand text very well and now they understand images and audio files and even video. Can we not take those and just apply them to this problem? [0:13:01] JL: I think the argument is the following. 
I would say, if you ask what kind of data are we using when we are solving these predictive modeling problems, it's structured relational data, right? This is data that is stored in tables that are interlinked with primary-foreign key relations. This is usually stored in a database in some structured form, right? This is the most useful data that enterprises have, because it's the ground truth of the enterprise. All the events, all the activity, it's all stored there. Now, what I'm saying is, to process images, we have special neural networks. To process text, we have special neural networks. To process this tabular data, we need special neural networks. It's a different data modality. Image has its own set of networks that are different architectures trained in certain ways. Text has a set of networks trained in specific ways and so on, and the tabular data also needs its own set of neural network architectures and its own way to train them, so that they can directly attend over this structured tabular data. I think it's a different data modality, that's why it needs a different approach. [0:14:07] SF: Right. I mean, I think with something like text, where there's probably billions, if not trillions of examples of how sentences come together, and I can take a document from one place and a document from another place and it can - learning one of those documents can probably help me infer something about the other document. I think with the way I think about this problem around tables, rows, columns in a database is, is there that much I can, from a pattern standpoint, learn from one table versus another table? Those patterns seem like they could be fundamentally different in terms of how the person has modeled the data. It makes a lot of sense that this is a different class of problem, but how do you go about, I guess, attacking that problem? 
Does this make sense in terms of the patterns from one table, not necessarily allowing you to infer something about the pattern from another table? [0:14:57] JL: That's a great question. I think the way to think of this is as you have this private data, you have patterns, properties in it that are unique to you. I think another point to make is also, you cannot just textify a table and give it to a large language model. Large language models are amazing at what I would say, qualitative human-like reasoning, but they are not really good with numbers, if you want to say it very simply. If you think about what are we storing in tables, we store quantitative data. We need to do quantitative reasoning, not qualitative reasoning over huge amounts of data. Another point, I think, that is important here is to say that no enterprise has data in a single table. You have data spread across multiple tables. Usually, you would have your customer catalog, you would have your product catalog, you would have a set of transaction records, you would have your website browsing, click data, you would have your supplier data, you would have your returns data, and all these tables are interconnected. The only way to learn over this is to learn over this collection of tables as they are interconnected with each other. Maybe to say more, in the past, the way we dealt with this, we would say, oh, let's take the user table, let's take the transaction table, let's join them, and then somehow summarize the number of purchases, the number of transactions you had in the last time period. There is an infinite number of ways of creating these summaries. I can count, I can sum over some time period, over shorter time period, in the mornings, in the evenings, I can add the prices, I can look at product categories. The number of ways you can summarize the data blows up. The problem with machine learning is that we predetermine how to summarize the data before we start building the model. 
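[Editorial sketch] The manual summarization described above can be illustrated in a few lines of Python. This is only a toy under stated assumptions: the `users` and `transactions` tables, their column names, the three window lengths, and the `hand_built_features` helper are all hypothetical and do not come from the episode. It shows how each choice of window and aggregate adds another hand-picked column to the user profile before any model is trained.

```python
from datetime import datetime, timedelta

# Hypothetical toy tables; names and columns are illustrative only.
users = [{"user_id": 1}, {"user_id": 2}]
transactions = [
    {"user_id": 1, "amount": 20.0, "ts": datetime(2024, 1, 3, 9, 0)},
    {"user_id": 1, "amount": 35.0, "ts": datetime(2024, 1, 20, 19, 30)},
    {"user_id": 1, "amount": 10.0, "ts": datetime(2024, 1, 28, 8, 15)},
    {"user_id": 2, "amount": 99.0, "ts": datetime(2023, 12, 1, 14, 0)},
]


def hand_built_features(user_id, as_of, window_days):
    """Join one user to their transactions and summarise a fixed time window."""
    start = as_of - timedelta(days=window_days)
    rows = [
        t for t in transactions
        if t["user_id"] == user_id and start <= t["ts"] < as_of
    ]
    return {
        f"count_{window_days}d": len(rows),
        f"sum_{window_days}d": sum(t["amount"] for t in rows),
        f"morning_count_{window_days}d": sum(1 for t in rows if t["ts"].hour < 12),
    }


# Every (window, aggregate) pair is a decision made before modeling starts.
as_of = datetime(2024, 2, 1)
profiles = {}
for u in users:
    feats = {}
    for window in (7, 30, 90):
        feats.update(hand_built_features(u["user_id"], as_of, window))
    profiles[u["user_id"]] = feats
```

Three windows times three aggregates already gives nine columns per user; adding evenings, product categories, or price bands multiplies that again, which is the blow-up being described.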
The way to make these things better is to generalize the attention mechanism to attend over the row events in the database and learn how to summarize them to give you that prediction. That's the key differentiator. You don't need to be joining tables anymore, but let the attention mechanism do it, very similarly to a language model, where it attends over the previous words to say, okay, what's truly the meaning of this word? Here we are saying, if we are making a prediction, if we are filling in, let's say, some cell in a table, let's attend over the other rows in the same table, other columns, other tables far out and figure out how to bring all that information, so that we can make the accurate prediction. It's really about generalizing the notion of attention mechanism to this structured multi-tabular data. [0:17:31] SF: If we have that, what does it unlock from an application standpoint? Today, I feel like a lot of people are trying to apply large language models to databases for the purposes of being able to do intelligent natural language to SQL conversion. Does this yield a better version of that? [0:17:48] JL: That's a great point, right? Text to SQL is amazingly useful. If you think about SQL, SQL is summarizing what has happened last month, what has happened last week. You're aggregating the past, and maybe creating a dashboard to understand historical trends. Then maybe you can use those historical trends to do some qualitative decision-making about what to do tomorrow. If you think about predicting transaction fraud, deciding which customers to send an offer to, and so on, for that you need predictive modeling. Text to SQL won't get you anywhere with that. You may use SQL to generate historical patterns and then build a model on top. But as we talked about, that's super brittle, manual, and takes a lot of time. 
The approach we invented in my research group at Stanford, and then we founded a startup around it called Kumo.ai, is this notion of relational deep learning, where we basically take the transformer architecture and generalize it, so that it can attend over this structured relational enterprise data. The key to this approach is to think of the data as a graph, to think of your enterprise data as a set of connections between the entities in your database, to think of the tables and how they are interconnected as a graph. Then generalize the attention mechanism to be able to attend over this relational structured information. [0:19:16] SF: I mean, thinking of a database as a graph, I think people in the data modeling world have been using those concepts for a long time. How is this different? If you think in designing schemas and things, people used to build entity relationship diagrams, where essentially, you have your tables as nodes, and then you have relationships defined in terms of edges across foreign keys, and so forth. Is this something fundamentally different, or is it using a similar concept as essentially, the basis for doing this deep learning? [0:19:46] JL: Yeah. It's a great point. Any database is a graph, as you've said. These concepts have been around for a very long time. But I feel like nobody has put one plus one together in a sense, right? We've been working on this graph-based machine learning, graph transformers and things like that for a long time, but it was mostly applied to social networks. People in that community didn't realize that, actually, any database is a graph. I think the database community was so stuck in feature engineering and running historical SQL queries that they did not think about, "Oh, how can we take these AI tools and apply them to the database?" What I'm saying in some sense is super natural. We knew for a very long time the database is a graph, nothing changes. 
But what in some sense changes is that now, the field of graph machine learning is not this obscure, it only applies to social network, type thing, but it applies to any database. The benefit that comes with it is that is the same neural network revolution that we have, and computer vision that we have in text, natural language understanding, videos, and so on, now to another data modality, which is the structured tabular data modality. With the same set of benefits and the same set of amazing outcomes that we have already seen play out in other data modalities. [0:21:05] SF: Does this only work against a traditional database, or could you potentially generalize this to other tabular forms of data, like spreadsheets? [0:21:15] JL: Yeah. I mean, where the data sits, it doesn't matter that much. It can be in a database, it can be in Salesforce, it can be in Databricks, can be in Snowflake, can be in Spreadsheets, can be in JSON. As long as it's structured, semi-structured, may include images and it may include text and include columns and categorical values and all kinds of geographic information, all kinds works nicely in this framework. [0:21:41] SF: How does training work and how is it different than training with traditional transformer models? [0:21:45] JL: Yeah, that's a great point. So far, we've said two things, right? The first thing we said was any set of enterprise data is structured, semi-structured is usually stored in these relational tables. These relational tables are a graph. Now, what we can do is we can take, in the old days, we would take a graph neural network and apply it to a graph. But in today's age, we take transformers, in particular graph transformers that have the generalized notion of attention and positional encodings that can attend over this structure to give you the prediction. Now that, okay, we have these two steps, now the first step is how do we train? We have, I would say, two options here. 
One option is to actually build a pre-trained foundation model that is database schema and predictive task agnostic. What this means is that now you have a single pre-trained model that can connect to any structured data, because it just represents and thinks of it as a graph. Then you get, basically, the specification of the predictive task on the fly and the model is able to make you that prediction. This means that you don't even need to be building task-specific models on the fly, but you basically have the same type of ChatGPT type experience, but now for predictive type of questions. For predictive modeling type questions, for churn, fraud, readmission prediction, all kinds of advertising use cases, customer 360 use cases, marketing, sales, and so on. The way this is trained, as you ask, is it's trained, basically, to teach the model how to do in-context learning. Basically, how to look at subsets of your database, use those examples to then generalize and be able to make the prediction. When we say subsets of the database, these are small local subgraphs around the entity that you are making the prediction about. [0:23:36] SF: If you're doing in-context learning, is there learning, though, that goes on in order to first force this attention mechanism? When I think of training traditional large language models against text, the input is all these massive amounts of text that then become tokens. Those are essentially the inputs of the model that help adjust the weights. Is something similar going on? I think I'm not following before the context one. [0:24:00] JL: What you'd have to do here is you also have to amass a humongous amount of pre-training data. This pre-training data now needs to be structured. You need to amass a number of different tabular databases, things like that. Then you also need to define a number of different predictive tasks on top of this data. 
What is interesting is that we have just shown, we just published a paper about a method called Plurel, where we can basically synthetically generate a lot of data and then train these models on top of that. You are exactly right. The same way as in, let's say, training large language models, you need large amounts of text. Here you need large amounts of structured tabular relational data. [0:24:40] SF: In terms of, say, a user interacting with this, they want to be able to do predictions against their own databases, or tabular data, what is the input there? Because again, with going back to the large language model, my input is essentially text that then becomes tokens as the input, which is the same thing that the model started as a training corpus. Here, it's a bunch of database data, graph structure of these databases. How do I actually ask questions about churn prediction? [0:25:10] JL: Yeah, great question. At the setup time, you need to connect the model to your database and say, here's my database. These are my tables. Then to prompt the model, you need to specify the task. You can specify the task in natural language, or you can specify it in a domain-specific structured language that we call predictive query. To say, I want to predict churn, you need to specify, what does that truly mean. You could say, I want to predict whether the count of transactions, or count of purchases of this particular user is going to be zero over the next 30 days. That's the proper definition of churn. If you define it this way, then what the model is going to do? It's going to go into the database, take a set of subsamples from the database, and then send them through the neural network to give you the prediction for that specific, let's say, user, or customer that you want to say, what's the likelihood they will have zero transactions in the next 30 days. 
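[Editorial sketch] The churn definition just given (zero transactions over the next 30 days) can be written down precisely in plain Python. This is not the predictive query language mentioned in the episode, whose syntax is not shown here; the `churn_label` function, the `history` data, and the anchor date are hypothetical and only illustrate the definition.

```python
from datetime import datetime, timedelta


def churn_label(transaction_times, anchor, horizon_days=30):
    """True if there are zero transactions in [anchor, anchor + horizon_days)."""
    end = anchor + timedelta(days=horizon_days)
    count = sum(1 for ts in transaction_times if anchor <= ts < end)
    return count == 0


# Hypothetical transaction timestamps per user.
history = {
    "alice": [datetime(2024, 1, 5), datetime(2024, 2, 10)],
    "bob": [datetime(2024, 1, 12)],
}

# For an anchor in the past the label can be read off the data.
anchor = datetime(2024, 2, 1)
labels = {user: churn_label(times, anchor) for user, times in history.items()}
```

For a past anchor date the label is simply computed, as here. For an anchor at "now" the 30-day window has not happened yet, so the same quantity has to be predicted rather than counted.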
The way this operates is that the model basically goes, fetches data from the database, and then this data from the database is sent through a frozen model to get an output in the end. The point is, there is no task-specific training needed here. There is no need to train the model for your specific database. It's very similar to large-language models where the model is pre-trained. You give it text, it understands text, you ask it question, you get the answer. Here, it's similar. The model goes to your data, understands the data, and gives you this forward-looking predictive answer. Of course, just to say, right? Now, if you write this kind of ad hoc querying, we see it's very useful for, let's say, humans using these type of models, or agents using these type of models where you don't know the question ahead of time. If you are a large bank and you need to predict 1 million times per second whether a given transaction is fraudulent or not, you would, of course, go and maybe fine-tune a smaller version of the model to make this faster and more cost-effective. You can use the big pre-trained model, or you can fine-tune it for a specific task to get more speed. Again, similar to what we see in large language models. [0:27:18] SF: Is the mathematics behind inference for these relational deep learning neural network similar to that with large language models? [0:27:26] JL: You mean in terms of the model sizes and things like that? [0:27:29] SF: Yeah, the model size, and I guess, the major computation that you're going to be doing behind the scenes. Like, the bottom line is what is the cost of inference and how does that compare to the other models that now have become familiar to people at large? [0:27:43] JL: That's great. These models are transformer attention-based models. You need GPUs to run them effectively. They are smaller than large language models. The amount of compute that is needed is way smaller. 
What this also means is that predictions and output are faster and cheaper. That's the way I would describe this. What's the difference? The difference is that you need to have a very strong data backend, where the data is represented as a graph. You need a special, let's say, graph engine that is tuned for these types of AI workloads that allows you to make these computations scalably and quickly. On the other side, you need a GPU for the results to come out. Since the models are smaller than large language ones, they're, let's say, sub-billion parameter models, or something like that, you can run them quite efficiently and quite quickly. [0:28:39] SF: What's involved with getting that graph structure for your own database? [0:28:42] JL: Yeah. What we have built at our startup, Kumo.ai, is exactly the infrastructure that allows you to basically take in any database. The engine will internally optimize that representation into this graphical form, so that then these graph transformers can efficiently be run on top. That's the hard part, right? The linear structures are easier, because you can just feed them through. Graphs are hard, because they are just these interconnected sets of objects. You cannot chop them. You cannot linearize them. You really need to be able to do these quick breadth-first searches, or subsampling of sub-graphs over this database. That's the hard part. It's basically building the optimized infrastructure that allows you to do this at scale of tens, hundreds of billions of nodes and edges. [0:29:29] SF: Is that the equivalent to, I guess, with text generating the embedding? [0:29:34] JL: That's very interesting. Yeah, exactly. What the model does internally, it generates the embedding, right? Now the embedding in the text captures the, let's say, semantic meaning of the word if you think of it this way, here in this database graph view of the world. 
Now you have the embeddings of your entities that capture the semantic meaning of those entities that is predictive of the downstream activity, whatever is the task that you are trying to solve. [0:30:02] SF: The concept of a graph neural network, how long has that concept been around for? [0:30:07] JL: I think the graph neural networks were invented maybe 10 years ago, something like that. Even a bit less. That's when the field started. I would say, a graph neural network, it's this idea that when if you have the data organized in a graph, when a node wants to compute something about itself, or make a prediction, maybe this is a user is a node, then not only we are using the information about that node, but you also use the information from neighbors and neighbors of neighbors, and so on. The idea is that basically, now neighbors are passing information messages to their neighbors, and this way to the node in the center that is of interest. That has worked really well for small scale task specific models, but now the field has moved forward to basically, this generalized graph transformer type architectures, where we are not passing information across the edges, but it's actually the attention mechanism that attends over the center node, its neighbors, neighbors of neighbors, and so on. You can see how this nicely applies to the database world where if you're making a prediction about an entity, maybe a user, maybe a product, we are then attending over the nearby tables and one hop in tables, two hop tables and bring that information all in the same place. [0:31:20] SF: Is the value there of using the attention mechanism to apply, distribute these weights over the graph, versus using the traditional message passing that's happening in a graph neural network, essentially, it's more dynamic. I don't have to have more of a purpose-built model versus a generalized model that I can apply to any, essentially, database problem in this particular context. 
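[Editorial sketch] The contrast between message passing and attention over a database graph can be shown on a toy example. Everything here is an assumption made for illustration: the node names, the two-dimensional feature vectors, the foreign-key list, and the use of raw dot products as attention scores. Real graph transformers use learned projections and positional encodings, which this sketch leaves out.

```python
import math

# Hypothetical rows: each row is a node, each foreign key is an edge.
features = {
    "user:1": [1.0, 0.0],
    "txn:10": [0.0, 2.0],
    "txn:11": [0.0, 4.0],
    "product:7": [3.0, 3.0],
}
foreign_keys = [("txn:10", "user:1"), ("txn:11", "user:1"), ("txn:11", "product:7")]

# Build an undirected adjacency list from the foreign-key links.
neighbors = {node: [] for node in features}
for child, parent in foreign_keys:
    neighbors[child].append(parent)
    neighbors[parent].append(child)


def mean_message_passing(node):
    """One message-passing round: average the neighbors' feature vectors."""
    msgs = [features[n] for n in neighbors[node]]
    return [sum(col) / len(msgs) for col in zip(*msgs)]


def attention_aggregate(node):
    """Softmax-weighted sum over the node and its neighbors (dot-product scores)."""
    query = features[node]
    context = [node] + neighbors[node]
    scores = [sum(q * k for q, k in zip(query, features[c])) for c in context]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    mixed = [
        sum(w * features[c][i] for w, c in zip(weights, context))
        for i in range(len(query))
    ]
    return mixed, dict(zip(context, weights))


averaged = mean_message_passing("user:1")
mixed, weights = attention_aggregate("user:1")
```

In the mean version every neighbor counts equally regardless of the node asking. In the attention version the weights depend on the querying node's own features, which is the context dependence the conversation turns to next.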
[0:31:43] JL: That's exactly the right intuition, right? The attention mechanism is so flexible, and so contextualized and context dependent that it really allows for much better generalization and for very effective pre-training. This then means that these types of models can really be applied to a very large, diverse set of use cases and can generalize to them very effectively. [0:32:08] SF: How does the performance of these more general models compare to, if I was to go and take more of a traditional approach, where I take my two data scientists and I go and I build more of a purpose-built model using my domain-specific data? Obviously, it's more rigid, but do I get - is there a quality difference? [0:32:26] JL: There is a quality difference. We shouldn't be surprised that there is a quality difference. The way it works is the same as it worked in computer vision, and in LLMs as well, where we are basically going to this superhuman level of performance. You could say, "Hey, I'm so good at recognizing any -" Let's say, in computer vision, you could say, "I want to build a model that detects whether there's an elephant in the image or not." You could be saying, "Hey, I studied zoology. I know elephants inside out. I'll build a perfect elephant detector." If somebody would go and say, "I know how to detect elephants. I'll build these amazing features to detect elephants," everyone would be laughing at them. Now, when we go to tabular data, I think the outcome is similar, right? It's very hard to engineer perfect features that give you that accurate prediction. The same way as it's very hard to engineer perfect features that detect whether something is an elephant or not.
If you train, in a purely data-driven way, a neural network that starts attending from raw pixels, which in our case would be the rows, columns, and cells of the database, and it learns how to aggregate all that information into this representation, this embedding, that then tells me, is this an elephant, or not? Or, will this user churn or not? It's the same effect. The point is that with these types of models, we can get to the superhuman level of performance. What I'm trying to say, we shouldn't be surprised by it, because we've seen it before, just as I was now giving this example with computer vision, that today nobody is surprised about. For the machine learning, database, structured data world, the same revolution is out there. It's not surprising. It's amazingly natural. That's what I'm trying to say. [0:34:13] SF: Just taking a step back and looking at the larger trend that's happened around the field of AI, especially in probably the last decade or so, where we've gone full neural networks. It's a lot more just bottom-up learning, where we're learning the patterns, versus, if you look at the early days of rule-based systems, where we were trying to encode all the rules that we could imagine into some tree structure that we then navigate. Why do you think that this approach has ultimately been so much more successful than trying to take our knowledge of that system and then encode it into a set of rules that a computer can execute? [0:34:54] JL: That's a great question. Exactly what you just said also applies, let's say, to this predictive modeling world, where we cannot anticipate all the rules, all the hard-coded signals that give us that prediction. I think the reason is the following. The world is so unique, so diverse, there are so many inputs that it's very hard to preconceive all of them and write them down as rules, or signals, or whatever we are talking about. Data that captures the world is the king.
Let the neural network learn directly from the data, directly from those raw signals, as noisy as they are, how to combine them into the final signal. I think the reason it works so well is because when we are designing these rules as humans, we don't understand, let's say, the noise. We don't really understand the richness. Also, our visual and our perception neural networks are already learning how to turn these raw, noisy inputs that our bodies are sensing into some higher-level representation. It's very hard for us to reconstruct. Like, the way we think is through the neural network. Rather than having one neural network build deterministic rules, we should just train our artificial neural network on the same set of inputs and it will be very good. [0:36:16] SF: Yeah. I mean, in a lot of ways, it seems to, I guess, mimic the behavior of humans, or maybe even animals to some degree in terms of how they learn. It's not like if you have a baby and it's trying to learn how to navigate the world, the parent is giving them like, "Here's your set of rules that you need to execute in order to crawl across the floor." They're doing that more through, I would say, play, experimentation, and the - it's almost like reinforcement learning, or something like that. [0:36:44] JL: Exactly. Then, of course, I think when you talk about rules, I would say, those are also in some sense important, but the way I think of those is more like explanations, right? When the model, when the neural network makes a decision, makes a recommendation, you can always ask why. At that time, it's important for it to give you some general rule, some general reasoning, what led it to make this type of prediction, recommendation. Also, what is important, I think in this machine learning, or this predictive modeling side of things is having very strong accuracy estimates. You cannot have hallucinations. It really needs to be data-driven and rooted in the patterns that are in the data.
With these models, you can achieve that as well. [0:37:28] SF: Now, this idea of relational deep learning, you're applying this to databases, helping understand tabular structure, but there's lots and lots of other domains that can be described in terms of graphs. Can this approach be applied to other domains? Even if I think about object-oriented programming, I can describe the class structure and the relationship between certain objects as a graph. I'm sure if you take this to the world of biology and biological structures, a lot of those things can be described as graphs. Is there a way to apply this same approach to other types of domains of problems? [0:38:06] JL: Oh, definitely. I think a lot of this is very interesting, right? If you think about biological structures, molecules, proteins and things like that, those are essentially graph structures, right? Even models like AlphaFold, and so on, they are really in the end, learning to reason over this graph of amino acids and this graph of proximities as the protein starts to fold. The same happens at the scale of, let's say, small molecules for drug development, and so on. I would say, the graph view of the world becomes very important whenever you basically have a set of entities, objects, these can be atoms, these can be amino acids, can be users, products, whatever those entities are, that interact with each other. Now, what are you predicting? You can be predicting, or estimating the toxicity of the molecule, you can be outputting the 3D structure, the 3D coordinates of the protein, or you can be doing fraud detection over this interconnected graph of entities to understand whether a transaction is fraudulent, or whether the account has been taken over, or something like that. In the end, mathematically, it's all the same. And that's what makes this so beautiful.
[0:39:16] SF: With foundation models today, more and more of these are multimodal, where they can handle images, audio files, text, of course, video even. Are we eventually going to move to a world where also, essentially, tabular data, or semi-structured data is one of those modalities? [0:39:37] JL: I think so. I think that's where this is going. You cannot textify an image. You cannot textify a table. It just makes it super hard to learn over that. In principle, you could. You could just write the image as a sequence of RGB values in ASCII and say, "Tell me, is this a cat or not?" But the point is, we figured out that we need special domain-specific, modality-specific encoders. I think where this is going is now that, basically, we have these modality-specific encoders on top of the reasoning-based large language models for images, for videos, and of course, now also for this structured tabular data. [0:40:18] SF: Yeah. I mean, even in audio, we saw a huge performance increase when we stopped textifying audio and we actually trained models directly against the audio, because there's so much nuance in audio versus what might be just available in the text. When you textify it, I think, you run into problems where the model doesn't really understand certain pauses and stuff like that. [0:40:40] JL: I think it's a beautiful example. Also, it just shows that the two of us right now are just talking. Everything we say should be just captured in words, but it's actually not, right? You see, when you work on the raw signal, on the raw speech signal, there is more information there, and the performance goes up. [0:40:57] SF: Yeah, absolutely. Well, Jure, is there anything else you'd like to share? What are some of the - from your perspective, the big problems that need to be solved in the space? [0:41:08] JL: I think the exciting part is that I feel like the structured data world has been a bit left behind.
People know how to run SQL. Then for everything else, they feel like they need to build manual machine learning models super painfully and super slowly. I think what we discussed today, the exciting thing is that we now have foundation models for structured relational data. Kumo RFM is one such example. We have this approach of relational deep learning. We have the transformer architectures that can now learn directly over this structured tabular data. Actually, the structured data world, which has humongous business impact, is ready for the AI revolution to actually take place there as well. Today, we shed some light on it, and that's what I'm very excited about. [0:41:55] SF: Yeah, absolutely. Well, thank you so much for being here. This was really, really interesting. [0:41:59] JL: Yeah, amazing. Thank you for having me. [0:42:01] SF: Yeah, cheers. [END] SED 1916 Transcript (c) 2026 Software Engineering Daily