EPISODE 1817

[INTRODUCTION]

[0:00:01] ANNOUNCER: NVIDIA RAPIDS is an open-source suite of GPU-accelerated data science and AI libraries. It leverages CUDA and significantly enhances the performance of core Python frameworks, including Polars, Pandas, Scikit-Learn, and NetworkX. Chris Deotte is a Senior Data Scientist at NVIDIA, and Jean-Francois Puget is a director and a distinguished engineer at NVIDIA. Chris and Jean-Francois are also Kaggle Grandmasters, which is the highest rank a data scientist or a machine learning practitioner can achieve on Kaggle, a competitive platform for data science challenges. In this episode, they join the podcast with Sean Falconer to talk about Kaggle, GPU acceleration for data science applications, where they've achieved the biggest performance gains, the unexpected challenges with tabular data, and much more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:07] SF: JFP and Chris, welcome to the show.

[0:01:10] CD: Thanks for inviting us.

[0:01:11] JFP: Thank you.

[0:01:12] SF: Yes, absolutely. I'm excited to get into this. I think we have a lot to cover. But I wanted to start off by talking about Kaggle and being a grandmaster, which is a distinction that I believe both of you have. For those that are unfamiliar with this concept, can we start there? What does it mean to be a grandmaster?

[0:01:29] CD: Yeah. Kaggle, for me, means many years of entertainment. I've been participating for six years. It's an online community for data science, and there are currently over 20 million users. On this platform, you can engage in conversations, you have access to Jupyter Notebooks, you can share code, host datasets, and you can also compete in competitions. On the website, you can earn achievements and you can gain titles. Yeah, you've heard people say, Kaggle Grandmaster. What is that? That's one of the titles you can gain. It's the best title you can acquire. You can actually become a grandmaster in four categories: discussions, notebooks, competitions, and datasets. The most desired one is the competitions grandmaster. To achieve that, you need to actually win five gold medals in five separate competitions, and one of them has to be a solo gold, one that you won by yourself. The competition on the website is incredibly difficult. People are competing from around the world. Typical competitions have thousands of people. It's very hard to obtain. There's only, I think, a couple hundred competition Kaggle Grandmasters in the world. It's an amazing thing. Then I'll mention, you can also, as I said, get the other grandmasters. You might have heard the expression, a double grandmaster, a triple grandmaster. That's someone who's also received enough awards for posting discussions, or notebooks, and has acquired another grandmaster title. Then you can stack them up when you cite them.

[0:02:57] JFP: To add to that, competition grandmaster is based on your merit, the quality of the models you build. The other ones are based on community votes, so it's a bit different. I would also define Kaggle as a legal drug. Adrenaline really flows when you compete. It's really addictive. When you start, you can't stop.

[0:03:21] SF: Can you talk a little bit more about the competitions? What does a competition consist of? Is this something that's happening live?
Or is it more like a problem goes up, and then people are asynchronously putting time into that and you have to, essentially, try to solve it within that timeframe and come back with a solution?

[0:03:35] JFP: A typical competition is like a short data science, or machine learning project. You're given some data sets, or data - Kaggle creates a data set. Kaggle also curates a question. For instance, you want to predict next month's sales for a retail chain. The data is past sales, by product, by store, by what have you. Or it could be some image classification, medical image diagnosis. Is it a cancer or not? You have images to train on. Duration is typically three months. Basically, you have to submit some code that will be run on a hidden test set. Then a number is computed from your predictions, and that's the score. You see a leaderboard of the scores obtained on part of the test data. After the competition, they will compute the final score on the rest of the test data. The reason they do this is to avoid what is known as overfitting. Making sure they select good models and not lucky ones. There is another form of competition they call analytics, which is a bit different. It's also a form of data science. You're given some data and you have to find an interesting story in it. Then it's judged by a human jury.

[0:05:03] SF: Chris, who actually creates the competition? What sort of expertise do they need to be able to create these competitions, for presumably some of the people that are the best in the world that [inaudible 0:05:13]?

[0:05:14] CD: They're sponsored by actual businesses. A business approaches Kaggle with a certain challenge they need solved. Maybe a university wants a model that can read student essays and assign scores to them. They approach Kaggle and say, "I would like you to host a competition." They put up money. They put up cash prizes. Then Kaggle will help curate the data and do all the infrastructure and logistics. It is interesting to note that it actually starts from a business need. The competitions are real problems, and when the company gives out the prizes, they receive the code of the top solutions. Oftentimes, they'll immediately implement the code. It's nice to know that when you're competing, you're helping a good cause and that your code can actually be used to solve a real-world problem afterwards.

[0:06:00] SF: Then, what do you personally get out of participating in these, beyond just the satisfaction of a job well done?

[0:06:07] JFP: I'd say, the main thing is you learn. You learn both from the problem, and reading relevant papers, or blogs, or code bases, and from the community. If you want to know the state-of-the-art models for a given topic, the best thing is to enter a Kaggle competition on that topic. You learn from the top teams, so the winners, the prize winners; usually they have to disclose what they did to get the prize. You can learn as well from that.

[0:06:40] SF: Then, how did both of you get involved with this? For those that are maybe interested in dabbling, how would you get started?

[0:06:47] CD: A friend recommended it to me six years ago, but I would say that the purpose it played in my life was the learning process. My formal training is a PhD in mathematics with a specialization in computation and simulation. Then I started learning data science on my own. After learning all the ideas, I wanted a way to practice, to test it out, to build some models and to talk with people.
Then someone said, "Hey, do you know about Kaggle?" For me, it was wonderful. Immediately, I met people to talk with. There was problems to solve. There was competitions, playground competitions. Really, I went to it for the learning process. Then as JFP said earlier, it's highly addictive. I mean, once you're there, you get involved in a comp, or just talk with people, it's tons of fun. Then, you're checking it all the time and you're participating in more and more. It's really helped my learning tremendously. [0:07:38] JFP: Yeah. For me, it's a bit different. I did a PhD in machine learning, but in the previous millennium, so it's irrelevant now. Then I went, during my professional life, working on something else on mathematical optimization. Then at my previous employer, people saw I had some machine learning background and they said, "Oh, why don't you go back at this while developing tools?" I said, "Where can I find an update on what's the current state of the art, machine learning practice?" I found Kaggle. Watched a little bit. Then I jumped in the water and got hooked. It's great. Again, it's funny when developing machine learning and data science tools. That's a wonderful place to see, what are the needs for today. [0:08:29] SF: Yeah, I totally get the addiction component of this. I was never involved in these type of competitions, but I did compete in things like Topcoder and the ACMICPC programming solving competitions through university. I became completely addicted to that experience competing in these. I think one of the things that, even though they're not necessarily business-driven problems that I got out of participating in is that it just made me a lot more comfortable with software engineering, because I was putting so much time into just the act of practicing. So much of coding, it really sharpened my skills and made me way more employable than I was before, even if it wasn't necessarily directly the types of problems that you'd be solving day-to-day at work. I'm curious, how is this experience of competing in these types of competitions translated into your day job and how you're leveraging some of those modeling techniques and other things that you've learned in your day-to-day job? [0:09:26] CD: Participating on Kaggle has just taught me so much. It made me a better data scientist. Yeah. Again, I just learned tons of new techniques, how to do things correctly. I'll mention that I had read a lot of books, but there's a lot of techniques that you learn on Kaggle that are not really in textbooks yet. Also, oftentimes, a lot of new things, I think even gradient boosted trees was developed on Kaggle. I mean, it is on the fringe of research. It's the latest ideas. You're learning the best techniques. You're also got a chance to work problems in all different domains, from computer vision, natural language processing, tabular data. Just all of that exposure. Then, I guess, one thing I'll add, too, the way they set up a competition with a hidden test set, you really have to make your model generalized to unseen data, which is one of the most important things in the field, building models in the field of data science. Yeah, all the skills you learn, plus repeatedly learning to make models that truly generalize, then immediately, when I'm inside NVIDIA, building models, working on projects, all that knowledge just comes, and it all benefits what I do. [0:10:31] JFP: With that, we both got our job at NVIDIA, because we were a Kaggle competition grandmasters. 
That's also a nice outcome of all the learning we got. We have 15 or 16 Kaggle grandmasters at NVIDIA now.

[0:10:50] SF: Yeah, I think that's something that I always think about in recommending, even going back to my own experience in things like Topcoder and ACM ICPC competitions, is that top-tier companies are paying attention to these types of competitions. If you're interested, or just starting a career in the space, competitions like this are a good way to not only learn and build up your own skill set, but sometimes you might have a company that just comes up to you because you participated in something like this and did well. Even beyond just being ranked in the top 10 in the world - not everyone's necessarily going to achieve that - it just shows that you have a passion for the space, and that you're pushing yourself and learning and working at it. That is also really attractive to companies. I wanted to talk a little bit about NVIDIA and the RAPIDS platform. This is an open-source suite of GPU-accelerated data science and AI libraries. First of all, what problem is this helping data scientists with? How does that set of libraries actually work? Maybe Chris, let's start with you.

[0:11:48] CD: Okay, so it's a whole suite of libraries, and it helps with a whole variety of tasks. Its main goal is to speed up all sorts of things. The two libraries that I work with the most are cuDF and cuML. cuDF helps with all your data frame needs. It's got an API similar to Pandas. It has all the same functionality. With that, you can speed up all your data frame needs. I should probably take a step back and say, in my opinion, what role does it play? Today, all companies are getting more and more data, and it's getting harder and harder to process all the data. Even things like computing statistics, doing data frame work, we need to run that faster. That's where cuDF comes in. Basically, it does all the computations on GPU, and it can be 100 times faster than using other libraries. Then as we move forward with more data, it's going to be getting faster and faster. That's great. Then I also use cuML a lot, which has similar functionality to scikit-learn. It does machine learning models. Once again, it'll train all these models on GPU and do them much faster. If you're doing tasks requiring support vector machines, KNN, and other models like this, it can train the models hundreds of times faster. Basically, I would say that, yeah, it helps with things that maybe we've been doing all along. But because it now moves the processing to GPU, it's incredibly faster. If you're working on experimentation, or you're iterating and trying to make more accurate models, or just trying to get your work done quickly, I would say it's becoming a necessity with how data and everything is growing.

[0:13:19] SF: In terms of the GPU acceleration that's happening, what needed to happen in order to make it so that you could do things like KNN, for example, on GPUs, rather than a traditional CPU?

[0:13:30] JFP: As Chris said, RAPIDS is more than that. It's also the GPU-accelerated version of Pandas, Polars, and scikit-learn. There is more: there is graph, there is signal processing. Recently, over the last year, we made the move from CPU-based to GPU-based seamless. If you have nice Pandas code, in a notebook, you just have to load an extension at the start. Then all your code will be GPU-accelerated seamlessly. You don't need to change any line.
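A minimal sketch of the zero-code-change acceleration JFP describes, assuming a Jupyter notebook with RAPIDS installed; the file and column names here are made-up placeholders, not anything from the episode.

```python
# Load the cudf.pandas extension *before* importing pandas. Supported pandas
# operations then run on the GPU via cuDF; anything unsupported quietly falls
# back to the CPU, so existing code keeps working unchanged.
%load_ext cudf.pandas

import pandas as pd

df = pd.read_parquet("transactions.parquet")   # hypothetical dataset
stats = df.groupby("store_id")["sales"].agg(["mean", "sum", "count"])
print(stats.head())
```

For plain Python scripts, the same effect can be had with `python -m cudf.pandas script.py`; Polars gets an analogous GPU engine, which JFP turns to next.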
More recently, we did the same with Polars. This is a way for people to just experiment with what they gain from moving to GPU very, very easily.

[0:14:14] SF: Is there, I guess, a cost associated with that, that you have to take into account, given that this is running on GPUs?

[0:14:20] CD: I guess the cost would just be that, obviously, you need a GPU to run on a GPU, but a lot of modern-day systems have both a CPU and a GPU. I think for most people, it's a matter of just flipping the flag, and you'll immediately get a speedup; it'll just use your machine's GPU.

[0:14:38] JFP: Yeah, I agree. When we say cost, my mind was on the data scientist's time cost. We reduced this to the bare minimum, but there is still a compute infrastructure cost that remains.

[0:14:52] SF: Besides getting 100X better performance, does the fact that you can do these things so much faster, so that you're shortening the learning cycle, also change things in terms of how you think about building models, or how it might even impact your existing work?

[0:15:10] JFP: Yeah. I see data science and machine learning as an experimental science, just like physics. Ideally, to build a good model, you have a baseline, you want to improve it, you say, "Oh, I have an idea, maybe more data, maybe different parameters," what have you. You design an experiment to test if the change is really improving things. Then you run it and you look at the results. Depending on the result, it becomes your new baseline or not. If you can do this faster, you will try more ideas and that will lead to a better model, just because you can experiment more; you can perform way more experiments in a given time.

[0:15:53] SF: Then, does that also change, from an experimental standpoint, the types of models that you might be able to try on a given data set, because now you're less worried about how long it's going to take to train something and you can move much faster?

[0:16:09] CD: Yeah, it absolutely does allow you to do new things. Yeah. I guess there's two things it enables you to do. JFP pointed out that you can do experiments faster. You can do what you were previously doing, but you can do it better, because you can try out more things. It's actually doing a second thing, which is allowing people to do things that we were not previously able to do. For example, you could try to use KNN, or actually, one thing you could do is take tabular data and push it through UMAP to create features, and then put that into an image model and do these weird pipelines. Back in the days of running this on CPU, you really couldn't use some of these models; they were way too slow. We recently saw a colleague win a competition where he actually used a combination of deep learning and machine learning. Deep learning has a backbone which generates features. The head will then do the regression. Because cuML has accelerated machine learning models so much, he was actually able to just take the features out of the deep learning model and then train a support vector regression. He was able to do this cycle over and over so fast, because of the new speed, that in the end his model won first place, and it was actually a hybrid. It was a combination of a deep learning model fused together with a support vector regression head.
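As a rough sketch of the kind of hybrid Chris describes (not the winning solution's actual code), the idea is to treat a trained network's penultimate-layer activations as a feature table and fit a GPU support vector regressor from cuML on top; the shapes and random features below are placeholders.

```python
import numpy as np
from cuml.svm import SVR  # GPU-accelerated support vector regression from RAPIDS cuML

# Placeholder for features extracted from a deep learning backbone, e.g. the
# penultimate layer's activations for each training row.
rng = np.random.default_rng(0)
backbone_features = rng.standard_normal((10_000, 256)).astype(np.float32)
target = rng.standard_normal(10_000).astype(np.float32)

# Fit the SVR "head" on the extracted features; training runs on the GPU.
svr = SVR(kernel="rbf", C=10.0)
svr.fit(backbone_features, target)
predictions = svr.predict(backbone_features)
```

Because the SVR head retrains in seconds rather than hours, the backbone-plus-head combination can be re-fit on every experiment, which is what made the fast iteration loop Chris mentions practical.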
Beyond these hybrid models, there have been other advances in feature engineering. There's a lot of new techniques that I'm seeing that are a direct result of having this speed, and we can do some new model designs and some new techniques.

[0:17:44] JFP: Another thing we can do is also to just run deep learning models on tabular data. There are a lot of papers claiming it's the best. That's not usually what we find. Gradient boosted trees like XGBoost, LightGBM, CatBoost, all GPU-accelerated, still outperform. When you ensemble, so you take one of these and you take some transformer, or some other deep learning model, and blend the predictions together, you will improve over a single model. On Kaggle, it's used a lot. Of course, running deep learning on CPU - no, it's not great.

[0:18:24] SF: Yeah, definitely. You mentioned this hybrid approach. With a lot of data science work, or traditional data science work, we think about predictive ML. Now, there's a lot of focus on generative AI and generative deep learning techniques, stuff like that. Do you think that because so many people are excited about what's happening in generative AI and there's so much hype around it, that sometimes we lose sight of the fact that predictive ML can still do a lot of useful things? We try to throw maybe too big a model at something that we can actually solve with a simpler bespoke-trained predictive model?

[0:18:56] JFP: I would say, it depends on what you want to predict. Generative AI, as the name indicates, will generate something - text, typically, or images with diffusion models. If that's what you need, of course, that's what you should do. If you want to forecast your sales for next quarter, you need to predict numbers. That being said, classical machine learning, so regression models, or classification models, deep learning or not, is still the way to go in that space. We do find that if your input is text and what you want to do is text classification, spam detection, or classifying into a few categories, you can use a generative model, an LLM, but only take one token; just ask it to output one of a few options. This is a great classifier. That's quite interesting, because you benefit from all the investment and progress in those LLMs.

[0:19:52] SF: You're talking about, essentially, instructing the generative model to produce an output that's within a specific range. If I only want a value between zero and one, based on the probability, to indicate that this thing is part of a particular category?

[0:20:06] JFP: This is beating the encoder-only models like DeBERTa, RoBERTa. The LLMs are much larger, so no surprise, but they improve on them.
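A minimal sketch of the one-token trick JFP describes: instead of letting the LLM generate freely, score only the next-token logits of a small set of candidate answers and pick the best one. The model name, prompt, and labels below are arbitrary assumptions for illustration, not anything specific from the episode.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # assumption: any small instruction-tuned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

labels = ["spam", "ham"]
prompt = "Classify the email as spam or ham.\nEmail: You have won a free prize!\nAnswer:"

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits for the very next token only

# Score each label by the logit of its first token and take the most likely one.
label_token_ids = [tok(" " + label, add_special_tokens=False).input_ids[0] for label in labels]
prediction = labels[int(torch.argmax(next_token_logits[label_token_ids]))]
print(prediction)
```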
[0:20:17] SF: Chris, did you have any thoughts on this?

[0:20:19] CD: Yeah. Your question about people throwing too big a model at it - you are right. Now, with LLMs and the way they're getting better and better, it's more tempting to do that. But this has been an age-old problem. I always see this. A lot of people just throw the biggest model they can, but I think it's always been the case that we should try the simple models. I love doing it, because it's really fun. There are definitely times when the simple model can outperform. That's exciting. A lot of times, the big model can do as well as a little model, but then it's inefficient. You don't want to use more compute than you need to. When I'm given a problem in the early stages, I actually like to try a whole range of models, simple and even complex. Even lately, I have been throwing LLMs at every problem I can, just to say, can it do this? Can it do this, right? You do the whole range. Then in the end, I generally try to go with the smaller models, the simpler models.

[0:21:08] SF: Yeah. I like the hybrid approach, too, where you could use a model, or a particular model on the backbone of some generative AI model, to check, essentially, the answer and then do those iterative steps. I want to talk a little bit about tabular data prediction. Can you talk a little bit about why this is such a challenging problem? Why have people been focused on this and interested in it for such a long time?

[0:21:30] JFP: Yeah. It's really a good question, because we see deep learning becoming the way to go for all sorts of data modalities, except for tabular data, where the jury is still out. There are many reasons, but it depends on which tabular data. If the data is measured from a physical process, like you have weather data, what have you, a deep learning model is likely to be better. My hypothesis - it's not science here - is that if the data is sampled from the physical world, the physical world is smooth, and deep learning will work well. If it's sampled from human decisions, like people's behavior of sorts, say, forecasting, or what have you, it's much more discrete. I would say, chaotic in the scientific sense. So, hard to predict. Then smooth models, like deep learning models, are not as good as, say, gradient boosted trees, which can handle discontinuity very naturally. That's just one angle. Chris, you may have another one.

[0:22:42] CD: I've been actually fascinated by this particular question for a very long time. A lot of researchers have been wondering, because we saw a transformation in computer vision and natural language processing about a decade ago, right? Before that, in computer vision, humans would actually engineer the features. They would actually take images, process them, extract features, and then put that through a machine learning model, like a support vector machine, and they did similar things with text. Then, we invented deep learning, and deep learning, totally on its own, does the feature engineering and the prediction. That revolutionized computer vision and natural language processing. You can download pre-trained models and fine-tune them. But that's yet to happen in tabular data. I would say that the best tabular data models still involve human handcrafted, engineered features, where we make new columns. I am particularly looking forward to and curious whether the day will come when there'll be some deep learning model that can digest a variety of different tabular data frames and, essentially, engineer features on its own. I think the reason it's challenging is that the data has much more variety, right? Images all share the fundamental building blocks of lines and shapes, and text has the fundamental building blocks of words. What is the fundamental building block of tabular data? Here's statistics from a finance company. Here's medical data. The data is so different. It's going to take something that's going to actually have to see how it's all similar. What's the common theme? Maybe the common theme is some cause and effect, or logic, or reasoning. Some model has to understand it and find all the similarity.
Maybe then, it can engineer on its own and it can use past learnings to help with future problems.

[0:24:42] SF: Because the data sets are so varied and different, could you end up with a situation where, if you had a ton of tabular data to train on, the model might not actually tune itself to recognize the patterns? Essentially, the pattern recognition has less to do with the data and more with the structure, the fact that it's organized in rows and columns. You essentially end up biasing the model in what it's trying to predict, and essentially leading to a place where you're overfitting the model against the wrong pattern?

[0:25:12] JFP: The latter is not happening. When you have an image, you have pixels arranged in 2D. If you have videos, 3D. If you have text, you have numbers in one dimension, same for audio. It's a very regular organization of the data. You can train a model once and it works with, hopefully, a lot of the instances. Tabular data, sure, it's 2D, but sometimes the columns are independent, so you can shuffle them. Sometimes they are not, like time series. Sometimes there is huge correlation between products. Some columns are not there. As Chris said, the format is not specific at this point. Maybe someday, people will have trained a model on every tabular data set available online and claim it's a foundation model. Actually, some people do claim they have foundation models for tabular data. On Kaggle, we don't find they are the best models yet.

[0:26:11] SF: How do boosted trees work on this type of problem? I'm not familiar with that. Is that a variation of the decision tree?

[0:26:17] CD: Yeah. It's actually an ensemble. It's a linear combination of multiple decision trees. Yeah, boosted trees just repeatedly make decision trees, and each new decision tree trains on the previous cumulative error and tries to reduce it. It keeps adding a new tree, and the purpose of the new tree is to reduce the error a little bit more. Then in the end, you just combine all the - you basically take an ensemble of all the trees, and that's what it is.

[0:26:47] JFP: You could see it in terms of deep learning; it's a gradient descent. At each update, you don't update existing weights in a model, be it a linear regression or deep learning. You add a tree that implements the gradient update. People think it's a recent technology, because the first useful implementation is XGBoost. It's only 10 years old, so it's quite recent. But the theory was published 25 years ago, more or less.

[0:27:20] SF: Are there libraries within RAPIDS that help with doing some feature engineering?

[0:27:25] CD: Yeah, absolutely. I think that's another advantage of the speed of RAPIDS. We just discussed how, yeah, cuDF is one, and then newly the cuDF Pandas and cuDF Polars. Specifically, to improve model accuracy, what you often do is, given a data frame, you'll make new columns, and they'll be transformations, or combinations, of old columns. That's what feature engineering is, and it's done manually. With the speed of cuDF, which operates on data frames, you can systematically go through a whole set of transformations. Let's randomly pick pairs of existing columns, combine them together, and then target encode it, and we'll make a new column. Then we'll see if that improves the model, right? You could basically build these for-loops, where you just systematically go through typical things that humans would try. Then you can train a model and see if it improves.
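A rough sketch of the search loop Chris outlines, under assumptions not from the episode (made-up file and column names, numeric base columns, count encoding as the combination feature): randomly pick column pairs, build an encoded combination feature, and keep it only if cross-validated error improves. Installing the cudf.pandas accelerator first, as discussed earlier, is what pushes the data frame work to the GPU.

```python
import random
from itertools import combinations

import cudf.pandas
cudf.pandas.install()        # route the pandas calls below to the GPU (CPU fallback otherwise)
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")          # hypothetical training data with numeric columns
target = df.pop("premium")             # hypothetical target column
base_cols = list(df.columns)

def cv_rmse(features: pd.DataFrame) -> float:
    # Quick cross-validated score used to decide whether a candidate feature helps.
    model = XGBRegressor(n_estimators=200, device="cuda")
    scores = cross_val_score(model, features.to_numpy(), target.to_numpy(),
                             cv=5, scoring="neg_root_mean_squared_error")
    return -scores.mean()

features = df.copy()
best = cv_rmse(features)

# Randomly try pairwise combinations; keep a new feature only when it lowers the error.
n_pairs = len(base_cols) * (len(base_cols) - 1) // 2
for a, b in random.sample(list(combinations(base_cols, 2)), k=min(50, n_pairs)):
    combo = df[a].astype(str) + "_" + df[b].astype(str)
    candidate = features.copy()
    candidate[f"{a}_x_{b}_count"] = combo.map(combo.value_counts())   # count encoding
    score = cv_rmse(candidate)
    if score < best:
        best, features = score, candidate
```

In Chris's actual run the combinations went up to six columns and were target encoded rather than count encoded; target encoding is exactly what JFP explains next.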
Actually, I recently won a competition doing just that. It was a Kaggle playground competition. You had to predict insurance premiums, maybe a car insurance, the annual premium. Basically, I just set my computer running overnight. It actually tried tens of thousands of combinations. The original data set had 23 existing columns. I randomly picked groups of two, three, four, five, or six. I combined them, target encoded them using cuDF. Then I trained the model to see if it improved the validation score. It just keeps going. In total, there may have been something like 150,000 combinations to try. It just randomly tries them. It found hundreds of ones that worked successfully. Then I added them to my final model and it boosted the score tremendously, so much so that there was even a gap with second place. This was only made possible by the speed. If I had tried to do the data frame operations with a CPU library, literally the search would have taken months. It would never have finished, right? This search just happened overnight. Absolutely, this speed is allowing us to actually do some automated feature engineering.

[0:29:31] JFP: I would expand on one of the things Chris mentioned. He built on top of RAPIDS, and he used a built-in component called target encoding. In tabular data, you have basically two types of data. One is just categories: you have a color. The other is numbers: the weight of someone, or whatever. Categories are very hard to manage for algorithms like linear regression or support vector machines. Basically, you have to create additional data. It's called one-hot encoding: one column for each possible value. Then they can be combined linearly. It's a pain, because it expands your data tremendously. You have to use a fast implementation. It exists in cuDF in RAPIDS. There is another way, which is smarter, which is to say, basically, say you have a category with five values, and you want to predict some numerical value; you just average, for each value, the target you want to predict. This gives you an indication of how good this value is. This can be done automatically. It's called target encoding. But if you do it the naive way, you overfit, because you include the target. It's tricky. You need to use what we call out-of-fold prediction to avoid that. You never use the target of a row to compute the value for that row. There are ways to do target encoding for one row using other rows. This is built into RAPIDS. Then what Chris used was to apply this target encoding on column combinations. That's very useful.
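Here is a minimal pandas sketch of the out-of-fold idea JFP just described (column names are illustrative; RAPIDS, as he notes, has this built in): each row's encoding is computed only from the other folds, so a row's own target never leaks into its feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df: pd.DataFrame, cat_col: str, target_col: str, n_splits: int = 5) -> pd.Series:
    """Encode a categorical column by the target mean, computed out of fold."""
    encoded = np.full(len(df), np.nan)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kfold.split(df):
        # Means are computed on the training folds only...
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        # ...and applied to the held-out fold, so no row ever sees its own target.
        encoded[valid_idx] = df.iloc[valid_idx][cat_col].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global mean.
    return pd.Series(encoded, index=df.index).fillna(df[target_col].mean())

# Tiny illustrative example: encode "color" against a numeric target "premium".
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "red"],
                   "premium": [10.0, 20.0, 12.0, 30.0, 22.0, 11.0]})
df["color_te"] = oof_target_encode(df, "color", "premium", n_splits=3)
print(df)
```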
[0:31:28] CD: At the upcoming NVIDIA GTC conference, we're actually giving a workshop where we're teaching this exact technique: how to target encode, how to use cuDF to do that, and also some other encodings, like count encoding. I hope, yeah - I'll advertise that. You all should check it out. It's going to be a great time. We're going to make the features, and then we also train some models and show how it improves the models. I would say, for tabular data, it's probably the most effective and powerful technique to improve your models. Time and time again, it's been the key component to winning these Kaggle competitions. It's actually a hands-on workshop, where if you're there, you can work along with us and follow the code. It's going to be great. I suggest that everyone checks it out. It'll be taught by some KGMONs and some other NVIDIANs.

[0:32:12] SF: Yeah, awesome. I'm hoping to be there, too. Hopefully, I can participate in that. Even in competitions, how do you determine where you need to focus on making improvements to your output? I mean, it could be part of the feature selection process. It could also be part of the model that you're using. There's a lot of things that could go wrong and you only have limited time, so how do you figure out where to actually spend that time?

[0:32:38] JFP: That's a great question, because basically, you have some available budget, so the time until the end of the competition, where you can work on it. You may also have a compute budget. You need to allocate the resources most wisely. What I do is I make sure first that I have a good way to test, so I can really evaluate my model. Typically, I create a cross-validation setup and try different baseline models, and submit to the competition to see if my cross-validation correlates with the score. If it does, then I don't need to submit too much; I can work with my local setup. Then you need to choose between, oh, feature engineering, trying different models, implementing a more complex workflow. It's a combination of having some feeling based on past experience and going for the low-hanging fruits; you estimate the time it takes to code it and run it. There is a part of luck: if you investigate the right thing first, you do better. Hopefully, after years of doing this every week, we get some feeling for what might work first.

[0:33:57] SF: Yeah. You start to build an instinct, essentially, so when you see something that maybe feels like it's underperforming, you can understand where that problem might be.

[0:34:05] CD: Yeah, absolutely. I've been in 80 competitions in the last six years. I have such strong intuitions. I'll train a model. I'll look at its output. It'll always be getting something wrong. I oftentimes know exactly where to look. You really start to get a sense of how you should alter the model architecture, or the training procedure, or how you should augment the data, or this and that. It's amazing. I actually make an analogy. Before NVIDIA, I was a teacher at a university. I'm always making the analogy that, for me, training a model is actually teaching a student. You get better with time as a teacher, right? You learn how to listen to your students. You teach a student, and then you have them do a problem and you watch them. Then you see how they do it. Then they get the wrong answer. You look at their work and you see, "Ah, I see. They just forgot to divide by two here." You start to learn what the common mistakes are. Then when you teach them again, you have to emphasize the divide by two. It's the same way with the models. I'll see models make common errors. I'll know how to address it. I know how to change things in a -

[0:35:16] JFP: I would just add that I also don't rely on automated optimizing tools. I see people using Optuna to tune parameters. I always do it by hand. Because that way, I get some intuition. I learn from my experience. If I rely on a black box optimizer, maybe I will learn how to use the optimizer better, but I will have no understanding of what works under the hood.

[0:35:45] SF: In terms of where data science tooling is going, if you can make one prediction, I guess, where are things moving or progressing? Where do you think the next big breakthrough is going to come from?

[0:35:56] CD: I would say, one thing, and we're already seeing it, is how large language models are going to completely change the workflow.
They're basically going to be - we're going to be working together with them. Already, we see them helping write our code. We see copilots; people basically ask them questions. They can suggest ideas. Already. Take a project from start to finish. A company comes to you with a certain task: here's our data, and we want you to be able to predict this. Then at the finish: here's the finished model and here's what it does. There are all these different steps and different roles of people involved in the process. More and more, we're going to see LLMs get involved in all the different steps of the process, from the beginning EDA, writing code at various points, giving suggestions for this, maybe even taking charge of an experimentation cycle, running experiments on their own and changing things. It'll be really exciting to see how they'll be utilized more, and humans will be working, I think, together with language models in the whole process of building a final model.

[0:36:57] SF: Even around generating test data, it's massively useful.

[0:37:01] JFP: Yeah, test data generation, definitely - I will come back to this. I would add to what Chris said: with Kaggle, we focus on the modeling part of data science and machine learning. Getting the data, and then using the model we create in production, requires coding, which is something data scientists may not be good at. Transferring it to software developers that are not good at machine learning, you also lose something. Maybe, as we see LLMs used as coding assistants getting traction, they could be assistants for data scientists, to write the code they don't want to write, to connect what comes before and after. Back to generating test data, we already do it for text and images. There is a lot. For tabular data, there are people - I think generating tabular data, to me, is not mature enough, except if you model some physical phenomena. On Kaggle, there were a number of competitions using synthetic data for astrophysics, for particle physics. There, the simulator, the data generator, was great, because it was based on physics principles. The playground competitions - and Chris is doing more of these - I always have a bit of a fear that modeling means reverse engineering the data generator. But Chris would disagree here. I don't know.

[0:38:32] CD: Yeah, what he's referring to is, Kaggle has increased their frequency. Every month, they're offering a new playground competition. It's very hard to offer competitions that often, because the most difficult thing is getting data sets. Recently, they've been using synthetic data sets, where, yeah, they're generated by LLMs. LLMs essentially make the data. The risk has always been that when a data set is synthetic, you can actually reverse engineer it, because somehow it's making new data with a target. If you can think how it thinks, how it assigned targets, and how it made the new data, then you don't have to actually forecast the insurance price. You just have to figure out how the data was made. You do see this from time to time; people do figure this out and they win comps, because they've reverse engineered some process. Yeah, it's something we have to be careful of. I think as time goes on, the synthetic data is getting to a higher quality, but there are still artifacts that you can take advantage of a little bit. That's always a risk when using synthetic data.

[0:39:33] SF: Yeah, absolutely. Well, we're coming up on time. JFP, Chris, I want to thank you so much for being here. I thought this was really, really interesting.
Hopefully, we'll see each other at the workshop at NVIDIA. [0:39:42] JFP: Thank you for inviting us. [0:39:45] CD: Yeah. Look forward to meeting you in person, Sean, at the conference. [END]