[0:00:00] JB: Companies have high hopes for machine learning and AI to support real-time product offerings, prevent fraud, and drive innovation in general. But there's always a catch. Training these models requires labeled data that machines can digest. As data volumes increase, the opportunity to get great ML results rises, but so does the problem of labeling all the data to get that excellent result. That's where Snorkel AI comes in, with their focus on programmatic data labeling and ML platforms like Snorkel Flow. Today, we are interviewing Alex Ratner, one of the founders of Snorkel AI. Alex is a born teacher who always has enthusiasm for this topic. Today, he'll share the newest evolutions of the product at Snorkel. He'll shed some light on why doing ML well requires programmatic approaches to data labeling, and we'll also talk about foundation models in actual enterprise settings.

[INTERVIEW]

[0:01:04] JB: Hey everybody, thanks for listening to Software Engineering Daily. We're excited to have Alex Ratner here from Snorkel AI. Alex has had a long career as an academic, researcher, and thought leader, and is now a founder of Snorkel AI, where they're trying to make it much easier to train and deploy models into production across the board. Welcome, Alex. It's great to have you here.

[0:01:27] AR: Jocelyn, thanks so much for having me. I'm excited to be here.

[0:01:31] JB: Maybe we could start with you giving a little longer introduction about yourself and a little bit about Snorkel.

[0:01:38] AR: Yeah. Longer is a dangerous adjective to give for that, because I can go on and on about the Snorkel project. It's been about eight and a half years between academia and the company, so please cut me off; I'll try not to go on at full length. For myself, first of all, I'm Alex. I'm one of the co-founders and CEO at Snorkel, as you captured. I'm also on the faculty at the University of Washington, where I get to work with some great students. The broader umbrella project between Snorkel, UW, and Stanford, where one of my co-founders is a professor and where the project started, is this idea that we call data-centric AI. The concept there is that the operations around manipulating, curating, labeling, and cleaning data, and there's a lot more there, are often treated as second-class citizens in the machine learning workflow. Go back to the intro to ML or intro to data science course you might have taken in school, or online, and look at how much depth is actually covered on those critical data operations, versus the algorithm- and model-centric ones. Over the last eight and a half years, our focus has been on making those data-centric operations first-class and more programmatic versus manual.

Zooming in a little bit, a lot of what we do at the company today is trying to do that for one of the most critical data operations in machine learning, which is labeling data to train models. Whether you're training them from scratch, or, you may take a wild guess that we might be talking about large language models, in that case you're labeling data to fine-tune them, or to adapt them for high accuracy on specific problems. What we've been working on at Snorkel over the last eight and a half years, the last three of them at the company, is trying to make that labeling look more like software development. We call it programmatic labeling, and I can go into more depth about that.
Rather than what it often looks like, which is sending over a bunch of documents or images to some relevant experts who have the knowledge to label them, and asking them to label 10,000, whatever it is, data points before you can even start your machine learning journey.

I'll give a quick example just to make that all concrete, and then we can circle back to it later. One of our customers just released some details about what we've done together. This is Wayfair. They had been working with a vendor that used manual labeling approaches, and it took about four weeks a turn of manual labeling for every model that they were training. This is over images for their website. Together with them, we were able to get it to under two hours a turn, with 7% to 22% accuracy improvements. Basically, not by trying to automatically make the data labeling go away, but by making it look more like software development: writing these higher-level things we call labeling functions about what attributes and features they're looking for, and using that to do the data labeling and supervision of the models.

Taking a step back again, the big-picture view between Stanford, UW, and the company is that data, and all the development operations around data, from what mix you select, to how you clean it, to how you label it, is one of the most critical interfaces for machine learning these days. Our goal has been to push it more into the limelight and make it more first-class, well supported by systems and abstractions that are programmatic rather than manual, so it can be efficient. Per your intro, Jocelyn, that is one of the key determinants of whether stuff actually gets shipped on time, or even tackled at all, with AI. I'll pause there, but hopefully that gives a little bit of a high-level and a zoomed-in view.

[0:05:25] JB: Yeah, absolutely. A couple of quick things. I certainly have seen this myself. We had high hopes for machine learning in industry, especially around things like automated underwriting, real-time decisioning, and real-time product offerings, right? We had high hopes, but we really stubbed our toe on training data. I have seen this myself in several companies, where it could take a long time for a model to get retrained when you have new findings. The classic here is fraud, where yesterday my fraud model worked, but somehow the fraudsters got smarter overnight, which is a real thing.

[0:06:02] AR: Oh, yeah.

[0:06:04] JB: You have to respond very quickly. The Wayfair example is great, too, because customers are fickle, and they expect very personalized recommendations and experiences. This is such a big problem in industry. Getting that training data right is something we'll be talking about. I have two follow-up questions. One is, when you think about training data and getting a model into production, I was doing some research in this area on the business side, and someone said to me, what people really don't realize is that production models are always being replaced. Have you found that to be true? And I should ask, what portion of the activity is training?

[0:06:52] AR: Yeah, it's a great question. There are a couple of things in there, and maybe let me start by zooming in on the replacement and the updating of production models. What you said earlier about fraud is a great example of a use case type that changes very frequently. The type of data that's coming in, the type of attacks or scams that you're dealing with, they change all the time.
They change in an adversarial way, actually, to your point, because once you get your model to pick up on something, the fraudsters will try a different tactic. When you say it out loud, as I will in a second, it sounds very intuitive, although I think it's still very underappreciated: the most important thing for your model is, generally, the data it's trained on. When you need to change your model, you generally need to go back and change your data. As models get more automated and powerful, which is the positive way of saying it, but also more black box in terms of interface points, increasingly, data really is the only way that you actually update the model.

I think data labeling leads to these stubbed toes both in a point-in-time way, and if you look at our papers in the academic literature and the case studies with customers, it's often what we highlight, and I'll share some examples of that, but the bigger problem is the over-time maintenance. I trained a model, and maybe spent a bunch of time, money, and subject matter expert effort to label all that data to train it. Now, either the input data has changed, adversarially or just through shift over time, or the output objectives have changed, or both, right? The number of data science projects I've been on where the spec of what you want the model to output was right on try number one is zero. It's both the input data shifting and the output objectives. Every time that happens, you have to update your model; you have to replace it. Guess what that involves doing, almost every single time: updating the training data that the model learns everything from. That's where these manual, legacy approaches to doing labeling really start to cause pain, because it's one thing to spend –

[0:09:07] JB: Yeah, it's so interesting. Oh, sorry. In your materials, and I was reviewing them, the word 'practical' comes up a lot. I think that is a perfect way to think about it. Some of the solutions you have offered involve definitely very complex mathematical and system structures that you've applied to this whole area. The reality, or what I've learned, is that practically, it's not some beautiful citadel of training data available to me when I need to redo or rework my model. In fact, what I've seen in business is that often, you need new data to come into that training data. You've got to talk to the people who own it, get it all labeled up and ready to go, and run a bunch of tests. The practical concerns really start adding up, so you don't get the fancy output that you're looking for.

[0:10:00] AR: Oh, yeah. I mean, I think there are two elements to pull out there: practicality, or pragmatism, and just messiness. The first is just how we found this problem in the first place, since we were pitching all these fancy algorithms with nice theoretical guarantees. This was back in 2014, when we were starting the project, or when we were about to start the project, and we were going to all these users; at the time it was scientists and collaborators at the healthcare system, and other academics we were pitching our fancy algorithms to. This is when deep learning was starting to take off. People were moving to these more black-box, powerful, push-button models, but they were very data hungry. The response we often got was, "Okay, that's great. But I can't even get the model trained in the first place, because I'm stuck on labeling a whole bunch of documents or images before I get started.
Can you help me with that?" Of course, we ignored that the first couple of times, because like most of data science, we thought about data as something that's upstream of us; it's janitorial work. I literally remember getting advice around that time in my PhD, not from my advisor, who, along with the rest of us, was pushing on the data side, but from someone else who was well-meaning, and they said, "I don't think it's a good idea to work on data, because you're not going to publish in any machine learning conference." I get it. Why does that happen? I think it's because data is a lot messier than a vector or a nice mathematical model. We shy away from it, because it's all the messiness of the real world. It also involves going across silos and talking to the other teams, the subject matter experts who know how to label the data, where it comes from, what it means, and how to curate it. I think that's part of why, just culturally, it's gotten overlooked in data science for a long time. I think now people are realizing that none of this stuff works with bad data, and that getting data not just from a bad to a good state, but curated in the right way for the specific problems you need to work really well, for each new specific problem, at each new time point as the problem and the data evolve, is often the most critical piece in the whole puzzle.

[0:12:23] JB: Interesting. Yeah. I want to talk a little bit about the labeling functions and why what you're doing is different, and then we'll talk about Flow. I want to look at the discrete thing, the capability, and then we'll talk about how it fits into Flow, if that's okay. In the olden days, you had a bunch of data and you sat next to your subject matter expert and typed in the labels. Now, I think most people have a semi-automated approach in which you're like, here's a collection of things that you can label as a group. Then I think it's got to be programmatic, and I want you to fill in the blank from yesterday to how it should work.

[0:13:08] AR: Yeah. Yeah. Let me take a step back here, actually, and this will get into the broader workflow that we support in our platform, Snorkel Flow. Even independent of that, even if I were just teaching a course on data development 101, I think there are two basic questions. One is, where do I label? Really, where do I label so that I deliver the best delta on my model's performance? Then, how do I label? How do I get this information that's in my head, or in a subject matter expert colleague's head, into the picture? A lot of the work we do, we work with five of the [inaudible 0:13:43] US banks, we work with healthcare systems, governments, insurance, it's often a doctor, a lawyer, an underwriter, a government SME: people with rich, technical knowledge. How do you get that into labels the model can learn from?

The where-to-label part is often referred to as error analysis, or guided analysis. Traditionally, it's often called active learning; that's the classical subfield. We include a lot of tooling to help with that guidance of, okay, where do I go next? What slice or subset of the data that my model is messing up on, or confused about, is going to be highest impact for me to label to teach the model?
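To make that where-to-label idea concrete, here is a minimal uncertainty-sampling sketch in Python. This is the generic active-learning notion Alex refers to, not Snorkel Flow's actual implementation, and the texts, labels, and model are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A tiny labeled seed set and a pool of unlabeled examples (all invented).
labeled_texts = ["refund processed as requested", "wire this to my personal account quietly"]
labels = [0, 1]  # 0 = okay, 1 = flag for review
unlabeled_pool = [
    "please confirm the invoice amount",
    "move it before the audit and keep it off the books",
    "thanks, talk tomorrow",
]

# Fit a simple model on what has been labeled so far.
vectorizer = TfidfVectorizer().fit(labeled_texts + unlabeled_pool)
model = LogisticRegression().fit(vectorizer.transform(labeled_texts), labels)

# Surface the unlabeled examples the model is least confident about;
# those are the highest-impact candidates to send to a subject matter expert.
probs = model.predict_proba(vectorizer.transform(unlabeled_pool))[:, 1]
uncertainty = 1.0 - np.abs(probs - 0.5) * 2.0  # 1.0 = maximally unsure
for idx in np.argsort(-uncertainty):
    print(f"{uncertainty[idx]:.2f}  {unlabeled_pool[idx]}")
```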
Then begins the labeling, and the question becomes, how do I most efficiently label the data? The standard legacy approach, which still plays an important part in certain pieces, is just: I go and manually annotate one data point at a time. Let's say I'm trying to label this email. We recently did a trade surveillance use case at a large bank: is this email an okay or not-okay type of discussion between traders?

[0:15:00] JB: This is something you see a lot in industry. There's specific language or lingo particular to a business operation in a type of business. Manufacturing is one. They have their own things.

[0:15:12] AR: Oh, exactly. In every single domain, there's all this jargon: there's medical jargon, there's insurance jargon, there's manufacturing jargon. Let's take this setting, for example. The legacy approach is, kind of, okay, I'm going to go look at one email or chat at a time and I'm going to label it. I'll just keep it really simple: this is okay, this is not okay, this is okay, this is not okay. One of the ideas, really simple, but really powerful in terms of what we've been able to help folks accomplish with it, is just to try to raise the abstraction level and go from getting one label at a time, to writing what we call labeling functions, which means writing down some function that takes in a data point and either labels it or abstains, as in "I don't know." In terms of the semantics, we deliberately kept it simple, and I'll go back to that. Actually, early on we were thinking about making it a much more complex, syntactically rich language. Then we realized, look, this was back in 2014, when a lot of people were getting into Python and notebooks, even in fields outside of computer science, and we said, "Look, we just want to make it really simple, so people can jam in whatever they have."

The labeling function construct is just the really simple semantics of anything I can express as a function. It could be a heuristic: look for this pattern, look for these words, and if you see them, label it as an unacceptable conversation. It could be pulling in external resources: cross-reference this dictionary of blacklisted phrases that traders are not supposed to say, specific to this industry, and if you see a match, then label it as unacceptable. Or, apply a sentiment model, and if it says the message is very positive, assume it's okay. Whatever heuristic or model-driven signal you have, and even now, more on this later, we're using large language models and prompts as labeling functions. It actually provides a way to unify a lot of the prompt engineering stuff using the theory and algorithms that we've developed over the years for how to combine these things. The point is, you just take some bit of knowledge that you have and you express it as a function that can now label tens, hundreds, thousands of data points. It's also interpretable and modifiable and adaptable, because it's code. You use that to label, rather than just one manual label at a time. That's the basic idea in a nutshell.

[0:17:41] JB: Okay. So when I know that this type of email is okay, as opposed to everything else, I can create a labeling function and apply it programmatically. But let's talk about what programmatically really means, because it's going to sweep through the data, and yes, what happens when it's wrong? How much noise is in the system? Can we just laser in on exactly what happens programmatically?

[0:18:09] AR: Yeah. Really simple semantics.
It's a little program that sweeps over, as you said, all the data in your – let's say, usually, you're starting with an unlabeled data set and you want to label it so your model can be trained on it. We can come back to this later, but this applies whether you're training a model from scratch, or whether you're fine-tuning a large language, or foundation, model for a specific task. We can talk more about that. Basically, you write a little function that says, and I'm going to give a really dumb example, if I see the phrase 'insider information' in this trade chat, I'm going to label it as not okay, or flag it for review. It's a really simple heuristic; it's just a pattern match. It is using some domain knowledge about the problem setting, though. In actuality, I have no domain knowledge here, so I'm making up a dumb example, but there are usually much richer, more interesting ones.

This labeling function now gets applied to the whole training set. It's basically saying, if I see this phrase, label the data point as unacceptable, otherwise abstain. Like most heuristics, and this is why heuristic or rule-based systems don't work well enough, and why we're trying to train models in the first place, it's not going to be perfectly accurate. What if someone's saying, "I have no insider information on this," or, "I don't want to use insider information"? There's richer context that could make any of these labels incorrect, right? It's not perfectly accurate; it's noisy, like you had mentioned. It's also pretty brittle. It might only trigger on 3%, 4%, 5% of the data.

This is why, in the academic literature, where we've done a lot of the work around the underlying theory and algorithms, we call it weak supervision. The idea is, it's not as good as taking, let's say you have 10,000 of these emails, the traditional way, where you ask an expert, and usually with these difficult problems you have to label in triplicate, to label every single one of those 10,000 data points. That can take months, or even years. That would be normal supervision. Here, we're getting this much more efficient, richer, more maintainable programmatic supervision, but it's also messier. A lot of the theory and the algorithmic work that we did was about how you take a set of these labeling functions, each of which can be very brittle, meaning it only labels a little bit of the data, and very noisy, meaning it makes lots of mistakes. These labeling functions can conflict with each other. They can be correlated. They can be biased in terms of where they perform better or worse. We use theoretically grounded techniques to estimate their qualities and how they relate, and to combine them. The result is, you write, let's say, a couple dozen of these labeling functions, and maybe they only label a subset of the data, but our technology takes them and basically figures out how to orchestrate, denoise, and combine these labeling functions into a clean training set. That can then be used to train a machine learning model that generalizes to all of the data from that signal.
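To make the labeling-function and label-model idea concrete, here is a minimal sketch in the style of the open-source snorkel library. The labeling functions, blacklist, and chat messages are invented, and a real project would use many more labeling functions over a much larger unlabeled set.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OK, NOT_OK = -1, 0, 1

# Hypothetical blacklist of phrases traders are not supposed to use.
BLACKLIST = ["insider information", "keep this between us"]

@labeling_function()
def lf_blacklist(x):
    # Heuristic: a blacklisted phrase suggests an unacceptable conversation.
    return NOT_OK if any(p in x.text.lower() for p in BLACKLIST) else ABSTAIN

@labeling_function()
def lf_compliance_language(x):
    # Heuristic: explicit compliance language suggests the chat is okay.
    return OK if "per compliance policy" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_smalltalk(x):
    # Heuristic: very short messages are usually benign small talk.
    return OK if len(x.text.split()) < 4 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "thanks, talk soon",
    "I might have insider information on the deal",
    "sharing the deck per compliance policy",
    "let's review the quarterly numbers tomorrow",
]})

# Each labeling function votes (or abstains) on every unlabeled example.
applier = PandasLFApplier(lfs=[lf_blacklist, lf_compliance_language, lf_short_smalltalk])
L_train = applier.apply(df_train)  # matrix of votes, -1 where an LF abstains

# The label model estimates how accurate the labeling functions are and combines
# their noisy, conflicting votes into probabilistic labels a classifier can train on.
label_model = LabelModel(cardinality=2)
label_model.fit(L_train, n_epochs=200, seed=42)
probabilistic_labels = label_model.predict_proba(L_train)
```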
[0:21:28] JB: That's a great explanation. To me, just to boil it down to non-technical terms: when you're doing a heuristic, or a straight filter, that's like, do what I say. Do what I say.

[0:21:38] AR: Yup.

[0:21:40] JB: What you're saying is, do what I meant to say.

[0:21:42] AR: Exactly. That's a great way of saying it. Yeah. Because these models, especially as we head into a reality where most people are training models that have been pre-trained in some way, which has actually been the case for a long time, but now everyone's very excited about it. Whether you call them large language models, we like the term foundation model. I can get into that in a bit.

[0:22:04] JB: Let's get into that.

[0:22:05] AR: Oh, sure. Let's get into it. Yeah.

[0:22:10] JB: Let me just set this up. I think everybody who listens surely knows, but you have these foundation models. They're trained and they're really good at something, for instance answering questions. Now the debate in Silicon Valley, techno land, is okay, how do we get this into the enterprise, or into specific situations? That's what you're going to talk about, I think.

[0:22:34] AR: Yup. There's a ton to say there. I'll start with the term foundation model, because it encapsulates a lot of how we view these things. There's a long, very thoughtful, detailed post from the Stanford Center on Foundation Models, which my co-founder, Chris, is a part of.

[0:22:56] JB: We can put that in the show notes, too.

[0:22:58] AR: Yeah. It goes into the details of the term. I'll just give why I like it; three reasons. Number one, these models are often referred to as large language models, but really, the same self-supervision, or auto-regressive, techniques work over anything with structure, not just text. For text, to put it very simply, the way you train these models is: here's a bunch of text, try to predict the next couple of words. From that, at just incredible scale, with modern deep learning architectures, usually a transformer architecture, and tons of compute, you actually get incredible results from that really simple learning objective, which, by the way, doesn't require labeled data. It is unlabeled self-supervision. And again, you can do it over not just text. You can do it over images. You can do it over databases. You can do it over graphs. That's why we like the term foundation model: it's a superset of just large language models.

Number two, a lot of people are excited about the generative use cases, generate an image, generate a summary, but these models are also extremely useful for what we often call predictive use cases, like classify this email as good or bad, or classify this as fraud or not. Again, we like the more general term.

The real reason I like using the term foundation model is that I think it sets expectations in a little bit of a better way, which is that these models truly are a step change; one of the biggest practical advances in AI in the last decade. But for most complex, high-value, enterprise use cases, at least, they're foundations; they don't build the house for you. You need to do some adaptation beyond that to get them to be accurate. Probably a lot of you have heard the phrase hallucination. I don't often use that term, because hallucination makes it sound like some crazy, emergent property that they would make up facts, when they were only ever trained to produce statistically plausible babble, right? They were never trained to be truthful, or accurate with respect to a specific task. Generally, if you really want high accuracy, meaning no errors, no hallucinations, no biases, which is a whole other subject, you need to actually do some further instruction of the model for the specific domain, data type, and task type that you're looking at.
One way that people talk about doing this is via prompting: you can engineer the instruction that you give to these models. But generally, if you want to get high performance, you need to actually train the model a bit further, which is called fine-tuning. Fine-tuning requires labeled data. So again, we come back to this need for labeled data. I'll also get into all the other data-centric operations that go into building these models, because we're actually announcing some stuff next week to support that. But just to stick with labeling for a second more, I'll give an example.

We were working with a large, top 10 pharma company. They applied GPT-4 to a use case that involved, at a high level, classifying and extracting information from clinical trial documents. They applied GPT-4, actually in our platform, Snorkel Flow, to these documents, and they were getting about 66% accuracy just by trying to fiddle with and engineer the prompts for the model. When they then went and labeled some data using this rapid programmatic labeling process in Snorkel Flow, this is just a couple of hours of work, they got a model up to the low nineties. Actually, the little bit of detail there is that I was being sloppy when I said "this model": GPT-4 actually doesn't support fine-tuning yet. They actually trained significantly smaller models, first GPT-3, and then just a basic logistic regression model over some embeddings. So they had a model that's hundreds of thousands of times smaller that's now getting in the nineties, versus this large, out-of-the-box, prompted model that was getting 66.

There are lots of settings, especially where the data is a little bit closer to the web data these models are trained on, and where maybe the problem is a little bit simpler, where these models really do work out of the box. It's miraculous. But there are many, many settings, and they're often some of the highest-value and most complex ones, and they're often in the enterprise, where there's private data and private jargon that doesn't look like generic web data, where these models don't just magically work out of the box or with a little bit of prompt engineering. Generally, you need to fine-tune them with some labeled data. This not only allows you to make them more accurate, but you can also shrink them down, or distill them, into smaller models. Think about it as going from a generalist, a jack of all trades that's a little bit good at everything but not perfect, to a specialist. There's actually a paper about ChatGPT with "jack of all trades" in the title. It's a big benchmark survey, and it found that ChatGPT, on average, with some prompting, was beaten by specialist models by about 25% across a wide array of tasks.

I think what we're all learning, and this is the term foundation model again, is that these models are incredible starting points. Some tasks, they really will just solve entirely out of the box. But when you have a complex, domain-specific problem that has to achieve high accuracy before you can ship it to production, let's say it's about a really critically important problem, like catching potential insider trading, or catching fraud, you generally need to fine-tune, or specialize, these models. That all comes down to labeling and curating data.
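As a rough sketch of that distill-into-a-specialist pattern, and not the pharma team's actual pipeline: embed each document with a small pre-trained encoder and fit a lightweight classifier on labels that, in practice, would come from programmatic labeling rather than being typed in by hand. The encoder name, documents, and labels here are placeholders.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder documents and labels; in practice the labels would come from
# programmatic labeling (for example, the label model sketched earlier).
train_texts = [
    "Phase II trial enrolled 120 patients and the primary endpoint was met.",
    "Adverse events were mild and resolved without intervention.",
    "The sponsor terminated the study early for administrative reasons.",
    "Endpoint analysis pending; no efficacy data reported yet.",
]
train_labels = [1, 0, 0, 0]  # e.g., 1 = reports a met primary endpoint

# A small, general-purpose sentence encoder stands in for "some embeddings".
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_train = encoder.encode(train_texts)

# The "specialist" is a tiny model trained on top of the embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

new_docs = ["Primary endpoint was achieved in the interim analysis."]
print(clf.predict(encoder.encode(new_docs)))
```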
[0:29:13] JB: Amazing. Thank you for explaining that, because I'm a little behind on my reading on this, to be honest with you.

[0:29:20] AR: We all are. There are, like, 25 papers a second these days in AI. It's a fun, but sometimes overwhelming, environment.

[0:29:30] JB: There's a debate right now, when you think about foundation models, of whether it's going to be open-source or more proprietary models. Do you have some thoughts, or general opinions, on that that you could share?

[0:29:42] AR: Oh, for sure. This is a fun topic. I'll note, first of all, that from the standpoint of the main platform that we build, Snorkel Flow, we're pointedly neutral here; it's bring your own foundation model. You basically start with whatever you want, whether it's an open-source model or a closed API. You plug it into the platform, and we help identify error modes and then correct them via programmatic labeling, or other forms of feedback, and use that to fine-tune these models and/or distill them down to smaller deployment models. There, it's whatever you want to start with. But I obviously have opinions here, with both the Snorkel hat and the academic hat on.

I think it's useful to frame this – well, first, I think it's important to acknowledge that there's a lot I wouldn't claim to know. Like many of us, I didn't call what a big inflection point we'd go through from scaling up these models. I thought that was a great –

[0:30:46] JB: Right?

[0:30:46] AR: Yeah, it's crazy. It's crazy. I mean, yeah, we were very excited. Actually, Chris and I had published some papers around this thing called multitask learning. It's an old field. We had written about these, we were calling them massively multitask models, that would be super generalist. But we wrote off self-supervision as a trick. We thought that multitask learning and other fancier techniques would be what really got them there. It turns out, self-supervision, just scaled up incredibly, was miraculous. Here, I'm going on a brief tangent, but actually, there's a really quick story.

[0:31:28] JB: No, it's been up and coming since 2001.

[0:31:31] AR: Oh, even earlier. I mean, so –

[0:31:33] JB: For me, I guess, that's when I engaged with it, but I was just like, it's always right around the corner every year.

[0:31:39] AR: Yeah. We never expected that you'd get these kinds of inflection points and capabilities from scaling up. The one example I was going to give, and this is a tangent, but I guess it's evidence that we should have been able to predict it, is a paper back in 2007. I love this one because of the names they chose. It was a bunch of Google researchers, including Jeff Dean. They were studying a technology at the time called large language models. It did exist, even back in the 90s, and was powering the predict-the-next-word feature when you're texting, and things like that. There were algorithms at the time with fancy names, like Laplace smoothing and Kneser-Ney smoothing and things like this. They showed that they had an algorithm they called stupid backoff, and it blew those out of the water. What was the secret? They trained on a hundred times the amount of data. So we've always known, it's not a new thing, that data is important. I'll get into this a little bit later with some of the stuff we're doing here. But the exact degree to which this scaling would play out, I think, by definition, most of us in academia didn't guess. I'll be humble in all these predictions, but I think –

[0:32:59] JB: I love that you're saying that. I mean, just as a data person. Sorry to interrupt.
Now this is a tangent, but I'm just going to say, I feel like everyone just keeps showing up to the party saying, you know what's important? The data. I'm like, yes. Everywhere, application security, cyber, privacy, everyone's like, "You know, the real thing is the data." I'm like, yeah. That's the real problem, and it's the hardest to solve.

[0:33:19] AR: Time is a flat circle. Everything's a pendulum. This always comes up. I mean, in machine learning, you have these waves where people start with a new technique, whether it's deep learning, or now foundation models. For a while, the limiting reagent is getting that technique, the model, the algorithm, correct. But then once you solve it, and this is especially true in machine learning, where everything is treated as a vector and it's all very domain agnostic, that becomes widely available. Then, surprise, surprise, the limiting reagent, the critical factor, goes right back to the data. Because guess what? When people say AI, they really mostly mean machine learning, and machine learning is, by definition, just about fitting to and learning from the data.

[0:34:00] JB: So sorry, back to open source versus closed API.

[0:34:02] AR: Back to open source versus closed.

[0:34:03] JB: I'm interested to know your opinion, because I'm an open-source hippie from way back, and so reflexively, I'm always like, it's open source, but now I'm doubting myself. I'm interested to hear your opinion.

[0:34:15] AR: I'm very bullish on open source. A bunch of the Snorkel team, and the extended team in academia, has contributed to a bunch of the open-source models that are out there, open-source foundation models. If I were forced to make predictions, I think of it on this spectrum of generalist versus specialist. I think where closed models may continue to dominate, and I think it's a winner-takes-all environment because of flywheel effects with feedback, so I think it'll be one or two of them, is in these very generalist, probably consumer use cases, where you want an all-powerful GPT-4, 5, 6, 7-style chatbot interface, where you can just ask it anything and get good performance over these consumer web data settings, where you do actually have a ton of data to train on. Some people think we're tapping out in terms of how much juice can be squeezed out of the open internet; some people don't think that. I think there's more room to go: the combination of just continuing to scale these models up, continuing to improve how the data that goes into them is curated, which is a fantastic challenge when you think about the scale of the web data out there, and then also the feedback loops. That's why I think it's winner takes all. Someone starts using ChatGPT, or ChatGPT takes off, and now OpenAI has this data flywheel of feedback: labeled training data from users. In this setting, I think you really can continue to get benefits out of the scale.

However, most enterprise use cases, and really, I just use 'enterprise' as a proxy for real-world use cases, are more of a specialist setting. You're not necessarily looking for a chatbot that you can ask to compose sonnets and help you with your math homework. You want a model that can do a task, or a set of tasks, in a certain domain and setting, with reliable, repeatable, robust accuracy. In those kinds of settings, I think we've already more than crossed the threshold
where starting with an open-source model, and then further customizing or fine-tuning it for that setting, is more than good enough compared to a closed model. My prediction is that there will be these super generalist, probably more consumer, use cases where, and I think open source is giving closed source a run for its money, there's at least a decent first-principles chance that the closed models, like OpenAI's, it's always a fun phrase to say, continue to dominate. But in the enterprise, and anywhere you have specific objectives, I think open-source models that are then customized, or tuned with enterprise-specific data and feedback, will take the day. We're already seeing this, and I think it's really for two reasons; I'll give these anecdotally.

One is going back to the example I shared earlier of the pharma company starting with GPT-4 at 66, and then labeling and correcting error modes in the data to improve it. How much do you really care if, let's say, GPT-4 gets you 66 and an open-source model gets you 62? If this is just a starting point, how much do you actually care about that? Especially if you can then own the results, own the model, when you start in the 62% setting. If you still have to tune it for these specialist settings, how much do you actually care about that boost? That's one anecdotal reason.

The other one is that people are increasingly realizing, if you look at the space, and this is per the comment about pendulum swings and everything being new again, this happens every time in machine learning, that the models are standardizing and commoditizing: everyone's using the same models. The algorithms are commoditizing and standardizing: everyone's using the same algorithms. Even the data people are training on, the instruction-tuning data sets that got GPT-3 to ChatGPT, for example, is getting really commoditized out there; people are reproducing it in the open source. So what is the one asset that probably has a very durable moat? It's the private enterprise data and knowledge that is needed to specialize these models for specific settings and use cases. Enterprises are saying, "Hey, if I really have the one valuable asset, do I want to give that back to an API model provider, or do I want to own the results of all that specialization?" I think for those two reasons, and just basic empirical data, we're going to see open-source models that enterprises can customize and build on their own really take over.

That gives me a shameless segue into sharing a little bit of the stuff I mentioned that we're announcing next week. Our main platform for data labeling we call Snorkel Flow, but we're also announcing a broader foundation model data platform, and two pieces of that we call Snorkel Foundry and Snorkel GenFlow. These are all about the broader set of data operations that help you curate your data and build it up to pre-train your own models, whether from scratch or off of an open-source base, and also to align and instruction-tune them to be good at generative use cases, like chat, summarization, Q&A, etc. I'll give one example that's an open-source research artifact to motivate this, and then I'll give a little bit of detail about what these new platforms support. Quick example: this is an awesome project that was done by students at UW and a whole bunch of other collaborators. It was a big consortium with Google and Apple and Stability and LAION and a bunch of academic places, called DataComp. You can check it out at datacomp.ai.
It's actually set up as a benchmark. The result that I like from it is this: in DataComp, we set it up as a contest to start, but we basically fixed all of the model architecture, the model selection, the algorithm, and all of the training code for a large foundation model; in this case it was a multi-modal one, CLIP. We only allowed people to modify how they selected the right mix of data, filtered it, cleaned it, all those data operations, even before labeling. Just by messing around with those data-centric operations, you get a new state-of-the-art score, at compute parity, that beats OpenAI. It beats everything else out there. This just speaks to how, and when you say it out loud, it actually sounds completely intuitive, the mixture of data that you pour into these models has a huge effect on how they either work or don't work. We're building lots of tooling and solutions to support that. It's not just about labeling anymore. It's about supporting operations like sampling the right mixture of different data sources to pour in and filtering out high- versus low-quality data points, as well as annotating the data when you then get to more of the fine-tuning. I'm really excited to share more of that next week.
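As a very rough sketch of those pre-training data operations, choosing a source mixture and filtering out low-quality examples, with invented sources, a toy quality heuristic, and made-up mixture weights (real pipelines such as DataComp's use far more sophisticated filters):

```python
import random

# Hypothetical document pools from different sources.
sources = {
    "web_crawl": ["buy cheap pills now!!!", "An overview of transformer architectures."],
    "wiki":      ["Seattle is a city in Washington state.", "Snorkeling is a form of diving."],
    "code_docs": ["def apply(df): ...  # applies labeling functions to a DataFrame"],
}

def passes_quality_filter(doc: str) -> bool:
    # Toy quality heuristic: drop very short docs and obvious spam.
    return len(doc.split()) >= 4 and "!!!" not in doc

# Target mixture: how much of each source to pour into the training set.
mixture_weights = {"web_crawl": 0.5, "wiki": 0.3, "code_docs": 0.2}

def sample_training_mix(n_docs: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    filtered = {name: [d for d in docs if passes_quality_filter(d)]
                for name, docs in sources.items()}
    names = list(mixture_weights)
    weights = [mixture_weights[n] for n in names]
    # Pick a source according to the mixture weights, then a document from it.
    return [rng.choice(filtered[rng.choices(names, weights)[0]]) for _ in range(n_docs)]

print(sample_training_mix(5))
```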
The TL;DR is basically that at every stage of these models, from pre-training them, whether from scratch or off an open-source base, which, if you have the same view as I do, you think most enterprises are going to move to so they can better own and utilize their unique data; to instruction-tuning these models so that they can be properly aligned, which is the RLHF-style stuff that got GPT-3 to ChatGPT; to fine-tuning them on very specific tasks where you need high accuracy. At all of these steps, the most critical operation, especially these days when so much is available in the open source and so much is standardized in the models and the algorithms, etc., is really how you curate the data. A lot of that is still labeling it, but a lot of it is also not just labeling; it's all these other data-centric operations: sampling, filtering, cleaning, curating. We're excited both to support that with our tooling and solutions, but also just to keep pushing on this broader notion that, hey, these data-centric operations are not second-class citizens. They're not upstream janitorial work. They really are, and they're going to increasingly be this way as models get bigger and more black box, the heart of what modern data scientists and data science teams need to actually do.

[0:43:01] JB: They need to care about.

[0:43:02] AR: Hopefully, we can get that word out there, because it's a big shift in what AI development really looks like.

[0:43:08] JB: This is Foundry. What was the other one?

[0:43:11] AR: Snorkel GenFlow.

[0:43:13] JB: Snorkel Foundry and Snorkel GenFlow. All right. Well, that's interesting. I'm just going to repeat back what I think I hear you saying. Open-source foundation models are going to win the day, potentially, and in corporate settings, certainly. Prompts are okay for getting better answers, but fine-tuning is where you want to be, where you do this very targeted training. Foundry and GenFlow are going to help you create a workbench to get this correct mixture of training data, in addition to helping you label it. That's really where you guys have landed with the whole offering: much more of a platform, for ML ops all the way through, as opposed to a point solution. You really got your arms around that whole data-centric set of activities, not just one of the key ones.

[0:44:05] AR: Exactly. I like your summary better than mine. Exactly. I think these major closed models from providers like OpenAI really have shown the pathway forward and have changed the field. I do think they're going to continue to be extremely relevant. But especially in specialized enterprise settings, where there's unique data and unique objectives, and where you need to get really high accuracy on specific tasks, you need a specialist, not a generalist. We're going to see a lot being done with open-source models that enterprises can then tune and adapt, using their data and knowledge, for high accuracy on their objectives. A lot of this is going to be accomplished via labeling, but there's also a richer set of operations all around manipulating the data. Our objective is to support that, and to make it more efficient, more first-class, more programmatic.

[0:44:57] JB: Programmatic, I think, right? It's definitely more of an art than a science today in the companies I've worked at; there's the data whisperer out there who somehow magically does the data science and the data work, right? But, I think, yeah, having a workbench that creates a programmatic approach to creating the right mixture, and lets you turn the knobs as needed, I can definitely see a lot of people really being interested in that. I know we're coming to the end, but let me ask you a couple of quick things about the business. How is it going? You told me a little bit about where you're going. How's Snorkel doing?

[0:45:34] AR: Oh, I mean, things are very exciting these days, obviously, for AI companies. Yeah, we're really privileged to work with a ton of great customers. We work with five of the top 10 US banks, a number of large pharmaceutical companies, insurance, telecom, healthcare and life sciences, federal. It's very exciting to see this top-down, as well as bottom-up, interest around AI. I think it's also an interesting time. There's a lot of head spinning, a lot of riding the roller coaster of the hype cycle. There are always pros and cons to this kind of a moment. Obviously, we're trying to take advantage of this upsurge in interest in AI, while also trying to be very pragmatic and grounded in what the right use cases are, and what is doable versus not, as we work with our customers or prospective customers. We build a horizontal platform that can tackle lots of things, but we usually land around proving out value on a use case. I think a lot of it is separating the hype from the real value, and we still try to be careful about that. Yeah, it's a very exciting time in AI. We're very lucky to get to work with an awesome set of customers who teach us a lot about AI, and AI in their domains as well, which is always how we like to work.

[0:47:01] JB: All right. It was great to talk with you. Thanks, Alex.

[0:47:03] AR: Jocelyn, thanks so much for having me.

[END]