Snorkel: Training Dataset Management with Braden Hancock

Machine learning models require the use of training data, and that data needs to be labeled. Today, we have high quality data infrastructure tools such as TensorFlow, but we don’t have large high quality data sets. For many applications, the state of the art is to manually label training examples and feed them into the training process.

Snorkel is a system for scaling the creation of labeled training data. In Snorkel, human subject matter experts create labeling functions, and these functions are applied to large quantities of data in order to label it. 

For example, if I want to generate training data about spam emails, I don’t have to hire 1000 email experts to look at emails and determine if they are spam or not. I can hire just a few email experts, and have them define labeling functions that can indicate whether an email is spam. If that doesn’t make sense, don’t worry. We discuss it in more detail in this episode.

Braden Hancock works on Snorkel, and he joins the show to talk about the labeling problems in machine learning, and how Snorkel helps alleviate those problems. We have done many shows on machine learning in the past, which you can find on SoftwareDaily.com. Also, if you are interested in writing about machine learning, we have a new writing feature that you can check out by going to SoftwareDaily.com/write.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Sponsors

From their recent report on serverless adoption and trends, Datadog found half of their customer base using EC2s have now adopted AWS Lambda. You can easily monitor all your serverless functions in one place and generate serverless metrics straight from Datadog. Check it out yourself by signing up for a free 14-day trial and get a free t-shirt at softwareengineeringdaily.com/datadog

G2i is a hiring platform run by engineers that matches you with React, React Native, GraphQL, and mobile engineers who you can trust. Whether you are a new company building your first product or an established company that wants additional engineering help, G2i has the talent you need to accomplish your goals. Go to softwareengineeringdaily.com/g2i

It’s hard to get engineering resources to build back-office apps, and even harder to get engineers excited about maintaining them. The idea is that all internal tools kinda look the same – they’re made of tables, dropdowns, buttons, text inputs, etc. Retool gives you a drag and drop interface so engineers can build these internal UIs in hours, not days, and spend more time building features customers will see. Visit retool.com/sedaily to learn more.

MongoDB is an intuitive, flexible document database that lets you get to building.  MongoDB’s document model is a natural way to represent data, so you can focus on what matters. You can get started free with MongoDB Atlas at mongodb.com/atlas.

Software Weekly

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.