Spark and Streaming with Matei Zaharia

Apache Spark is a system for processing large data sets in parallel. The core abstraction of Spark is the resilient distributed dataset (RDD), a working set of data that sits in memory for fast, iterative processing.
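Conceptually, an RDD records the lineage of transformations that produce it, computes lazily, and can be cached in memory for fast, iterative reuse. Here is a rough single-process sketch of that idea – illustrative Python, not Spark's actual API (`ToyRDD` is a hypothetical class invented for this example):

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy, single-process sketch of the RDD idea: transformations
    record lineage lazily; cache() keeps a computed result in memory."""

    def __init__(self, compute: Callable[[], List]):
        self._compute = compute          # lineage: how to (re)build the data
        self._use_cache = False
        self._cached: Optional[List] = None

    @classmethod
    def parallelize(cls, data: Iterable) -> "ToyRDD":
        items = list(data)
        return cls(lambda: items)

    def map(self, f: Callable) -> "ToyRDD":
        # Lazy: nothing runs until an action like collect() is called.
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self) -> "ToyRDD":
        self._use_cache = True           # keep the result in memory once computed
        return self

    def collect(self) -> List:
        if self._cached is not None:
            return self._cached          # served from memory, no recomputation
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


# Iterative reuse: the filtered set is computed once, then read from memory.
evens = ToyRDD.parallelize(range(10)).filter(lambda x: x % 2 == 0).cache()
squares = evens.map(lambda x: x * x).collect()   # first pass computes and caches
total = sum(evens.collect())                     # second pass hits the cache
```

In real Spark, the recorded lineage is also what makes the dataset "resilient": if a partition is lost, the engine can recompute it from the chain of transformations rather than restoring from disk.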

Matei Zaharia created Spark with two goals: to provide a composable, high-level set of APIs for performing distributed processing; and to provide a unified engine for running complete apps.

High-level APIs like SparkSQL and MLlib enable developers to build ambitious applications quickly. A developer using SparkSQL can work interactively with a huge dataset, which is a significant improvement on batch Hive jobs running on Hadoop. A developer training a machine learning model can put the model through multiple steps in the training process without checkpointing the data to disk.

The second goal of Spark–a “unified engine for running complete apps”–was the focus of my conversation with today’s guest, Matei Zaharia.

Matei is the CTO of Databricks, a company that was started to implement his vision for Spark and to build highly usable products on top of the technology. Databricks Delta is a project that combines a data warehouse, data lake, and streaming system–all sitting on top of Amazon S3 and using Spark for processing.

In our recent episodes about streaming, we explored some common streaming architectures. A large volume of data comes into the system and is stored in Apache Kafka. Backend microservices and distributed streaming frameworks read that data and store it in databases and data lakes. A data warehouse allows for fast access to the large volumes of data–so that machine learning systems and business analysts can work with data sets interactively.

The goal of Databricks Delta is to condense the streaming system, the data lake, and the data warehouse into a single system that is easy to use. If you listened to the previous episodes, you will have an idea of the level of complexity involved in managing these different systems.

For some companies, it makes complete sense to manage a Kafka cluster, a Spark cluster, a set of S3 buckets, and a data warehouse like Amazon Redshift. But all of that management should not be the barrier to entry. Delta will hopefully lower that barrier and make it easier for enterprises to set up large systems for processing data.

A few notes before we get started. We just launched the Software Daily job board. You can post jobs, you can apply for jobs, and it’s all free. If you are looking to hire, or looking for a job, I recommend checking it out. And if you are looking for an internship, you can use the job board to apply for an internship at Software Engineering Daily.

Also, Meetups for Software Engineering Daily are being planned! You can register for an upcoming Meetup. In March, I’ll be visiting Datadog in New York and Hubspot in Boston, and in April I’ll be at Telesign in LA.


Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


There’s a new open source project called Dremio that is designed to simplify analytics. It’s also designed to handle some of the hard work, like scaling performance of analytical jobs. Dremio is the team behind Apache Arrow, a new standard for in-memory columnar data analytics. Arrow has been adopted across dozens of projects – like Pandas – to improve the performance of analytical workloads on CPUs and GPUs. It’s free and open source, and designed for everyone, from your laptop to clusters of over 1,000 nodes. You can find all the necessary resources to get started with Dremio for free. If you like it, be sure to tweet @dremiohq and let them know you heard about it from Software Engineering Daily. Thanks again to Dremio for sponsoring the show.

I listen to a lot of podcasts about technical content, and the Google Cloud Platform podcast is one of my favorites. The GCP Podcast covers the technologies that Google Cloud is building–through interviews with the people building them. And these are often unique Google Cloud services–like BigQuery, AutoML, and Firebase. I am a big Firebase user, so I try to learn about how it works under the hood, and I want to hear about new features they are releasing. I also listen to the GCP Podcast to prepare for episodes of Software Engineering Daily–because when I do shows about Google Cloud technologies, I am doing research around them, and I find that the GCP Podcast covers topics before I do. So if you want to stay on the leading edge of what is being released at Google, and how these new technologies are built, check out the GCP Podcast. I’ve been a listener for a few years now, and the content is consistently good–a few of my favorite recent episodes are the interview with Vint Cerf and the show about BigQuery.

QCon is a software conference for full-stack developers looking to uncover the real-world patterns, practices, and use cases for applying artificial intelligence and machine learning in engineering. Come to QCon in San Francisco, from April 9th – 11th, 2018, and see talks from companies like Instacart, Uber, Coinbase, and Stripe. These companies have built and deployed state of the art machine learning models–and they come to QCon to share their developments. One of the keynote speakers is Matt Ranney, a Sr. Staff Engineer at Uber ATG (the autonomous driving unit at Uber)–and he’s an amazing speaker–he was on SE Daily in the past, if you want a preview of what he is like. I have been to QCon three times and it is a fantastic conference. What I love about QCon is the high bar for quality–quality in terms of speakers, content, and peer sharing, as well as the food and general atmosphere.
QCon is one of my favorite conferences, and if you haven’t been to a QCon before, make this your first. Register and use promo code SEDAILY for $100 off your ticket.

Today’s podcast is sponsored by Datadog, a cloud-scale monitoring platform for infrastructure and applications. In Datadog’s new container orchestration report, Kubernetes holds a 41-percent share of Docker environments, a number that’s rising fast. As more companies adopt containers, and turn to Kubernetes to manage their containers, they need a comprehensive monitoring platform that’s built for dynamic, modern infrastructure. Datadog integrates seamlessly with more than 200 technologies, including Kubernetes and Docker, so you can monitor your entire container infrastructure in one place. And with Datadog’s new Live Container view, you can see every container’s health, resource consumption, and running processes in real time. See for yourself by starting a free trial and get a free Datadog T-shirt!

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.