Data Warehouse ETL with Matthew Scullion

A data warehouse provides low-latency access to large volumes of data.

A data warehouse is a crucial piece of infrastructure for a large company, because it can be used to answer complex questions involving a large number of data points. But a data warehouse usually cannot hold all of a company’s data at any given time. Users need to move a subset of the data into the data warehouse, typically by reading large files from a data lake on disk and loading their contents.

The process of moving data from one place into another is broken down into three sequential steps, often called “ETL” (extract, transform, load) or “ELT” (extract, load, transform). In ETL, the data is extracted from a source such as a data lake, transformed into a schema that is customized for the data warehouse application, and then loaded into the data warehouse. In ELT, the last two steps are reversed, because modern systems can often leave the necessary schema transformation until after the data has been loaded into the data warehouse.
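The difference in ordering between ETL and ELT can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Matillion or any particular warehouse works: an in-memory SQLite database stands in for the data warehouse, and the sample rows, table names, and the `transform`, `run_etl`, and `run_elt` functions are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for rows read from a data lake file.
RAW_ROWS = [
    ("2020-01-03", "  Widget ", "19.99"),
    ("2020-01-04", "gadget", "5.00"),
]


def transform(row):
    """Normalize one raw row into the target schema (day, product, cents)."""
    day, product, price = row
    return (day, product.strip().lower(), round(float(price) * 100))


def run_etl(conn, rows):
    """ETL: transform in application code first, then load the cleaned rows."""
    conn.execute("CREATE TABLE sales_etl (day TEXT, product TEXT, cents INTEGER)")
    conn.executemany(
        "INSERT INTO sales_etl VALUES (?, ?, ?)",
        [transform(r) for r in rows],
    )


def run_elt(conn, rows):
    """ELT: load the raw rows unchanged, then transform with SQL in the warehouse."""
    conn.execute("CREATE TABLE sales_raw (day TEXT, product TEXT, price TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT day, lower(trim(product)) AS product, "
        "CAST(round(CAST(price AS REAL) * 100) AS INTEGER) AS cents "
        "FROM sales_raw"
    )


# SQLite in memory is only a stand-in for a real data warehouse.
conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_ROWS)
run_elt(conn, RAW_ROWS)

etl_rows = conn.execute("SELECT * FROM sales_etl ORDER BY day").fetchall()
elt_rows = conn.execute("SELECT * FROM sales_elt ORDER BY day").fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned table. What changes is where the transformation runs: in the pipeline's own code before loading (ETL), or as a query inside the warehouse after the raw data has landed (ELT).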

Matthew Scullion is the CEO of Matillion, a company that specializes in building tools for data transformations. Matthew joins the show to talk about the problem of data transformation, and how that problem has evolved over the nine years since he started Matillion.

If you enjoy the show, you can find all of our past episodes about data infrastructure by going to SoftwareDaily.com and searching for the technologies or companies mentioned. And if there is a subject that you want to hear covered, feel free to leave a comment on the episode, or send us a tweet @software_daily.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Sponsors

With over 9 million apps created on Heroku, over 2 million managed data services, and over 26 million requests served per day, Heroku has earned the trust of developers, and it is as easy to start today as it always has been. Try Heroku for free today. Visit softwareengineeringdaily.com/heroku to get started.

StrongDM is a system for managing and monitoring access to servers, databases, and Kubernetes clusters. You already treat infrastructure as code; strongDM lets you do the same with access. Start your free 14-day trial of strongDM at: softwareengineeringdaily.com/strongdm

With Triplebyte, you do one online interview, and then you get to go straight to final interviews at hundreds of companies (from tech giants like Dropbox to exciting startups). It’s like the Common App for software engineers. No resume needed. Apply now at triplebyte.com/sedaily. If you take a job through Triplebyte, you’ll get a $1000 signing bonus.

VictorOps is a collaborative incident response tool. VictorOps brings your monitoring data and your collaboration tools into one place, so that you can fix issues more quickly and reduce the pain of on-call. If you want to hear about how VictorOps works, you can listen to our episode with Chris Riley. Learn more and get a free t-shirt when you check it out at victorops.com/sedaily.

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.