HoloClean: Data Quality Management with Theodoros Rekatsinas

Many data sources produce new data points at a very high rate. With so much data, the issue of data quality emerges. Low quality data can degrade the accuracy of machine learning models that are built around those data sources. Ideally, we would have completely clean data sources, but that’s not very realistic. One alternative is a data cleaning system, which can allow us to clean up the data after it has already been generated.

HoloClean is a statistical inference engine that can impute, clean, and enrich data. HoloClean is centered around “The Probabilistic Unclean Database Model”, which allows for two systems–an “intension” and a “realizer” to work together to fill in missing fields and fix erroneous fields in data.

HoloClean was created by Theo Rekatsinas, and he joins the show to talk about the problem of fast, unclean data, and his work with HoloClean. We also talk about other problems in machine learning and the engineering workflows around data.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com


Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


SAP Data Intelligence connects and transforms data to extract value from the distributed data landscape. SAP Data Intelligence brings together data orchestration, metadata management, and powerful data pipelines with advanced machine learning, enabling close collaboration between data scientists and IT. To learn more about SAP Data Intelligence, visit sap.com/sedaily

With Triplebyte, you do one online interview, and then you get to go straight to final interviews at hundreds of companies (from tech giants like Dropbox to exciting startups). It’s like the Common App for software engineers. No resume needed. Apply now at triplebyte.com/sedaily. If you take a job through Triplebyte, you’ll get a $1000 signing bonus.

CockroachDB is a distributed SQL database that makes it simple to build resilient, scalable applications quickly. CockroachDB is Postgres compatible, giving the same familiar SQL interface database developers have used for years. Host it on prem, run it in a hybrid cloud, and even deploy it across multiple clouds. Sign up for a free 30-day trial and get a free t-shirt at cockroachlabs.com/sedaily.

G2i is a hiring platform run by engineers that matches you with React, React Native, GraphQL, and mobile engineers who you can trust. Whether you are a new company building your first product or an established company that wants additional engineering help, G2i has the talent you need to accomplish your goals. Go to softwareengineeringdaily.com/g2i