Spark and Streaming with Matei Zaharia

Apache Spark is a system for processing large data sets in parallel. The core abstraction of Spark is the resilient distributed dataset (RDD), a working set of data that sits in memory for fast, iterative processing.
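
To make the abstraction concrete, here is a minimal PySpark sketch (the input file name is hypothetical): an RDD is loaded once, cached in memory, and then reused across multiple computations without re-reading the data from disk.

```python
# Minimal RDD sketch in PySpark; "numbers.txt" is a hypothetical input
# file with one integer per line.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Build an RDD and pin the working set in memory with cache().
numbers = sc.textFile("numbers.txt").map(int).cache()

# Each action below reuses the cached data rather than re-reading from
# disk, which is what makes iterative processing fast.
total = numbers.reduce(lambda a, b: a + b)
count = numbers.count()
print("mean:", total / count)

sc.stop()
```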

Matei Zaharia created Spark with two goals: to provide a composable, high-level set of APIs for performing distributed processing; and to provide a unified engine for running complete apps.

High-level APIs like Spark SQL and MLlib enable developers to build ambitious applications quickly. A developer using Spark SQL can work interactively with a huge dataset, which is a significant improvement over batch Hive jobs running on Hadoop. A developer training a machine learning model can put the model through multiple steps in the training process without checkpointing the data to disk.
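
As a rough sketch of both points, assuming a hypothetical Parquet dataset with prepared `features` and `label` columns, the same cached DataFrame can serve interactive SQL queries and an iterative MLlib training job, with no intermediate writes to disk:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sql-mllib-demo").getOrCreate()

# Hypothetical dataset; cache() keeps it in memory across queries and
# training iterations.
events = spark.read.parquet("s3://my-bucket/events/").cache()
events.createOrReplaceTempView("events")

# Interactive exploration: results come back in seconds, not as a
# batch Hive job.
spark.sql("SELECT action, COUNT(*) AS n FROM events GROUP BY action").show()

# Iterative training: each of the 10 iterations rescans the cached data
# in memory instead of checkpointing between steps.
training = events.select("features", "label")
model = LogisticRegression(maxIter=10).fit(training)
```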

The second goal of Spark, a “unified engine for running complete apps,” was the focus of my conversation with today’s guest, Matei Zaharia.

Matei is the CTO of Databricks, a company that was started to implement his vision for Spark and to build highly usable products on top of the technology. Databricks Delta is a project that combines a data warehouse, a data lake, and a streaming system, all sitting on top of Amazon S3 and using Spark for processing.

In our recent episodes about streaming, we explored some common streaming architectures. A large volume of data comes into the system and is stored in Apache Kafka. Backend microservices and distributed streaming frameworks read that data and store it in databases and data lakes. A data warehouse allows for fast access to those large volumes of data, so that machine learning systems and business analysts can work with data sets interactively.
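
As a minimal sketch of that kind of pipeline, assuming hypothetical broker addresses, topic names, and S3 paths (and the spark-sql-kafka connector package on the classpath), a Structured Streaming job can read from Kafka and land the data in a data lake:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-lake").getOrCreate()

# Read the incoming stream of events from Kafka.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load())

# Land the raw events in the data lake as Parquet files on S3.
query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/lake/events/")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
         .start())

query.awaitTermination()
```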

The goal of Databricks Delta is to condense the streaming system, the data lake, and the data warehouse into a single system that is easy to use. If you listened to the previous episodes, you will have an idea of the level of complexity that is involved in managing these different systems.
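
As a hedged sketch of the idea, assuming a Delta-enabled Spark runtime and hypothetical S3 paths, a single Delta table can act as the streaming sink, the data lake storage, and the warehouse-style query target at once:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "s3://my-bucket/delta/events"

# A streaming or batch job appends new records to the Delta table...
new_rows = spark.createDataFrame([("click", 1), ("view", 3)], ["action", "n"])
new_rows.write.format("delta").mode("append").save(path)

# ...while analysts and ML systems run warehouse-style queries against
# consistent snapshots of the same S3-backed files.
spark.read.format("delta").load(path).groupBy("action").count().show()
```

The point is that there is one storage system to operate, rather than a separate streaming pipeline, lake, and warehouse to keep in sync.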

For some companies, it makes complete sense to manage a Kafka cluster, a Spark cluster, a set of S3 buckets, and a data warehouse like Amazon Redshift. But we probably don’t want all of that management overhead to be the barrier to entry. Delta will hopefully lower that barrier and make it easier for enterprises to set up large systems for processing data.

A few notes before we get started. We just launched the Software Daily job board. To check it out, go to softwaredaily.com/jobs. You can post jobs, you can apply for jobs, and it’s all free. If you are looking to hire, or looking for a job, I recommend checking it out. And if you are looking for an internship, you can use the job board to apply for an internship at Software Engineering Daily.

Also, Meetups for Software Engineering Daily are being planned! Go to softwareengineeringdaily.com/meetup if you want to register for an upcoming Meetup. In March, I’ll be visiting Datadog in New York and HubSpot in Boston, and in April I’ll be at Telesign in LA.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


There’s a new open source project called Dremio that is designed to simplify analytics. It’s also designed to handle some of the hard work, like scaling the performance of analytical jobs. Dremio is the team behind Apache Arrow, a new standard for in-memory columnar data analytics. Arrow has been adopted across dozens of projects, like Pandas, to improve the performance of analytical workloads on CPUs and GPUs. It’s free and open source, designed for everyone, from your laptop to clusters of over 1,000 nodes. At dremio.com/sedaily you can find all the necessary resources to get started with Dremio for free. If you like it, be sure to tweet @dremiohq and let them know you heard about it from Software Engineering Daily. Thanks again to Dremio, and check out dremio.com/sedaily to learn more.



I listen to a lot of technical podcasts, and the Google Cloud Platform podcast is one of my favorites. The GCP Podcast covers the technologies that Google Cloud is building, through interviews with the people building them. These are often unique Google Cloud services, like BigQuery, AutoML, and Firebase. I am a big Firebase user, so I try to learn how it works under the hood, and I want to hear about new features as they are released. I also listen to the GCP Podcast to prepare for episodes of Software Engineering Daily, because when I do shows about Google Cloud technologies, I research them, and I find that the GCP Podcast often covers topics before I do. So if you want to stay on the leading edge of what is being released at Google, and how these new technologies are built, check out gcppodcast.com. I’ve been a listener for a few years now, and the content is consistently good. A few of my favorite recent episodes are the interview with Vint Cerf and the show about BigQuery. You can find those episodes and more at gcppodcast.com.



QCon.ai is a software conference for full-stack developers looking to uncover the real-world patterns, practices, and use cases for applying artificial intelligence and machine learning in engineering. Come to QCon.ai in San Francisco, April 9th through 11th, 2018, and see talks from companies like Instacart, Uber, Coinbase, and Stripe. These companies have built and deployed state-of-the-art machine learning models, and they come to QCon to share their developments. The keynote speaker at QCon.ai is Matt Ranney, a Sr. Staff Engineer at Uber ATG (the autonomous driving unit at Uber). He is an amazing speaker; he was on SE Daily in the past, if you want a preview of what he is like. I have been to QCon three times and it is a fantastic conference. What I love about QCon is the high bar for quality: quality in terms of speakers, content, and peer sharing, as well as the food and general atmosphere. QCon is one of my favorite conferences, and if you haven’t been to a QCon before, make QCon.ai your first. Register for QCon.ai and use promo code SEDAILY for $100 off your ticket.



Today’s podcast is sponsored by Datadog, a cloud-scale monitoring platform for infrastructure and applications. In Datadog’s new container orchestration report, Kubernetes holds a 41-percent share of Docker environments, a number that’s rising fast. As more companies adopt containers, and turn to Kubernetes to manage their containers, they need a comprehensive monitoring platform that’s built for dynamic, modern infrastructure. Datadog integrates seamlessly with more than 200 technologies, including Kubernetes and Docker, so you can monitor your entire container infrastructure in one place. And with Datadog’s new Live Container view, you can see every container’s health, resource consumption, and running processes in real time. See for yourself by starting a free trial at softwareengineeringdaily.com/datadog, and get a free Datadog T-shirt!