Data Lakehouse with Michael Armbrust

A data warehouse is a system for performing fast queries on large amounts of data. A data lake is a system for storing high volumes of data in a format that is slow to access. A typical workflow for a data engineer is to pull data sets from this slow data lake storage into the data warehouse for faster querying.

Apache Spark is a system for fast processing of data across distributed datasets. Spark is not thought of as a data warehouse technology, but it can be used to fulfill some of the responsibilities. Delta is an open source system for a storage layer on top of a data lake. Delta integrates closely with Spark, creating a system that Databricks refers to as a “data lakehouse.”

Michael Armbrust is an engineer with Databricks. He joins the show to talk about his experience building the company, and his perspective on data engineering, as well as his work on Delta, the storage system built for the Spark ecosystem.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Sponsors

From their recent report on serverless adoption and trends, Datadog found half of their customer base using EC2s have now adopted AWS Lambda. You can easily monitor all your serverless functions in one place and generate serverless metrics straight from Datadog. Check it out yourself by signing up for a free 14-day trial and get a free t-shirt at softwareengineeringdaily.com/datadog

The Intrazone is a bi-weekly conversation and interview podcast about SharePoint, OneDrive and related technology within Microsoft 365. Get highlights on usage, adoption, and how SharePoint works for you. Subscribe to The Intrazone today. It’s all about how SharePoint fits into your everyday work life – the goal being to more easily share and manage content, knowledge and applications, and to empower teamwork throughout your organization with the technology you already have. Subscribe to The Intrazone wherever you get your podcasts.

strongDM lets you manage and audit access to servers, databases, and Kubernetes clusters, no matter where your employees are. With StrongDM, you can easily extend your identity provider to manage infrastructure access. You can automate onboarding, offboarding, and moving people within roles. strongDM. Manage and audit remote access to infrastructure. Start your free 14 day trial today at: strongdm.com/SEDaily

G2i is a hiring platform run by engineers that matches you with React, React Native, GraphQL, and mobile engineers who you can trust. Whether you are a new company building your first product or an established company that wants additional engineering help, G2i has the talent you need to accomplish your goals. Go to softwareengineeringdaily.com/g2i