Great Expectations: Data Pipeline Testing with Abe Gong

A data pipeline is a series of steps that takes large data sets and creates usable results from them. At the beginning of a data pipeline, a data set might be pulled from a database, a distributed file system, or a Kafka topic. Throughout a data pipeline, different data sets are joined, filtered, and statistically analyzed.

At the end of a data pipeline, data might be put into a data warehouse or Apache Spark for ad-hoc analysis and data science. At this point, the end-user of the data set expects that data to be clean and accurate. But how do we have any guarantees about the correctness?

Abe Gong is the creator of Great Expectations, a system for data pipeline testing. In Great Expectations, the developer creates tests called “expectations”, which verify certain characteristics of the data set at different phases in a data pipeline. This helps ensure that the end result of a multi-stage data pipeline is correct.

Abe joins the show to discuss the architecture of a data pipeline and the use cases of Great Expectations.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Sponsors

This episode is sponsored by Datadog, which unifies metrics, logs, traces, and other advanced features in a single, scalable platform. View your SLOs in real time; search, group, and filter SLOs by facets like tags, teams, services, and more. Sign up for a free 14-day trial, and Datadog will send you one of their famously cozy t-shirts. softwareengineeringdaily.com/datadog

DigitalOcean makes infrastructure simple. And for an application that needs to scale, DigitalOcean has CPU-Optimized Droplets, Memory-Optimized Droplets, Managed Databases, Managed Kubernetes, and much more. Visit do.co/sedaily and receive $100 in credit over 60 days.

DataStax provides DataStax Enterprise, a powerful distribution of Cassandra, created by the team that has contributed the most to Cassandra. DataStax Enterprise enables teams to develop faster, scale further, achieve operational simplicity, ensure enterprise security, and run mixed workloads that work with latest Graph, Search, and Analytics technology—all running across the hybrid and multi-cloud. To learn more about Apache Cassandra and DataStax Enterprise, go to datastax.com/sedaily

With Triplebyte, you do one online interview, and then you get to go straight to final interviews at hundreds of companies (from tech giants like Dropbox to exciting startups). It’s like the Common App for software engineers. No resume needed. Apply now at triplebyte.com/sedaily. If you take a job through Triplebyte, you’ll get a $1000 signing bonus.

Software Weekly

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.