HoloClean: Data Quality Management with Theodoros Rekatsinas
Many data sources produce new data points at a very high rate. With so much data, the issue of data quality emerges. Low quality data can degrade the accuracy of machine learning models that are built around those data sources. Ideally, we would have completely clean data sources, but that’s not very realistic. One alternative is a data cleaning system, which can allow us to clean up the data after it has already been generated.
HoloClean is a statistical inference engine that can impute, clean, and enrich data. HoloClean is centered around “The Probabilistic Unclean Database Model”, which allows for two systems–an “intension” and a “realizer” to work together to fill in missing fields and fix erroneous fields in data.
HoloClean was created by Theo Rekatsinas, and he joins the show to talk about the problem of fast, unclean data, and his work with HoloClean. We also talk about other problems in machine learning and the engineering workflows around data.
Sponsorship inquiries: firstname.lastname@example.org
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.