Flink and Beam Stream Processing with Maximilian Michels

Distributed stream processing systems are used to read large volumes of data and perform operations across those data streams. 

These stream processing systems often build on the MapReduce model for collecting and aggregating large volumes of data, but instead of running a calculation over a single large batch of data, they process data on an ongoing basis. There are many different stream processing systems for this same use case: Storm, Spark, Flink, Heron, and many others.

Why is that? When there seems to be much more consolidation around the Hadoop MapReduce batch processing technology, why are there so many stream processing systems?

One explanation is that aggregating the results of a continuous stream of data is a process that very much depends on time. At any given point in time, you can take a snapshot of the stream of data, and any calculation based on that data is going to be out of date by the time that your calculation is finished. There is a latency between when you start calculating something, and when you finish calculating it.
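To make the time dependence concrete, here is a minimal sketch in plain Python. It is not code from Flink, Beam, or any other system mentioned here, and the event names ("ride", "cancel") and function names are made up for illustration. A batch job reads a bounded input and produces one answer, while a streaming job keeps running state, so any snapshot of that state describes only what had arrived when it was taken.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: Iterable[str]) -> Counter:
    """Batch style: read the whole bounded input, then produce one result."""
    return Counter(events)


def streaming_count(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: update running state and emit a result per event."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield counts.copy()  # a snapshot of the state at this moment


# Hypothetical event stream, used only for this illustration.
events = ["ride", "ride", "cancel", "ride"]

snapshots = list(streaming_count(events))
early_snapshot = snapshots[1]        # taken after only two events had arrived
final_batch = batch_count(events)    # computed once, over everything

# The early snapshot was correct when taken, but is stale by the end.
print(early_snapshot, final_batch)
```

The early snapshot and the final result disagree even though neither is wrong. In a real stream there is no "final" input, so every result is a snapshot of this kind.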

There are other design decisions for a distributed stream processing system. What data do you keep in memory? What do you keep on disk? How often do you snapshot your data to disk? What is the method for fault tolerance? What are the APIs for consuming and processing this data?
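One of these questions, how often to snapshot state to disk, can be sketched in a few lines. The example below is a single-process toy under stated assumptions: the input can be replayed from a known offset, the state is a simple running sum, and the names `process`, `checkpoint_path`, and `checkpoint_every` are hypothetical. Real systems use their own distributed snapshot mechanisms, which this does not attempt to reproduce.

```python
import json
import os
import tempfile


def process(events, checkpoint_path, checkpoint_every=2):
    """Sum integers, persisting (offset, total) to disk every few events."""
    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"offset": 0, "total": 0}

    for i in range(state["offset"], len(events)):
        state["total"] += events[i]
        state["offset"] = i + 1
        if state["offset"] % checkpoint_every == 0:
            # Write to a temp file, then rename, so a crash mid-write
            # never leaves a half-written checkpoint behind.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "checkpoint.json")
    events = [1, 2, 3, 4, 5]
    # Simulate a crash after three events: only the checkpoint taken
    # at offset 2 survives, so the third event's work is lost.
    partial = process(events[:3], path)
    # On restart, processing resumes from offset 2 and replays event 3.
    recovered = process(events, path)
```

Checkpointing more often means less work to replay after a failure but more time spent writing to disk. That is one of the tradeoffs on which stream processing systems differ.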

Maximilian Michels has worked on the Apache Flink and Apache Beam stream processing systems, and currently works on data infrastructure at Lyft. Max joins the show to discuss the tradeoffs of different stream processing systems and his experiences in the world of data processing.

You can find all of our past episodes about data infrastructure by going to SoftwareDaily.com and searching for the technologies or companies mentioned. And if there is a subject that you want to hear covered, feel free to leave a comment on the episode, or send us a tweet @software_daily.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Sponsors

It’s hard to get engineering resources to build back-office apps, and even harder to get engineers excited about maintaining them. The idea is that all internal tools kinda look the same – they’re made of tables, dropdowns, buttons, text inputs, etc. Retool gives you a drag and drop interface so engineers can build these internal UIs in hours, not days, and spend more time building features customers will see. Visit retool.com/sedaily to learn more.

Seen by Indeed is a tech-focused matching platform. Every Seen candidate also gets free access to technical career coaching, resume reviews, mock interviews, and even salary negotiation tips to seal the deal. If you are ready for a new job, you are ready for Seen by Indeed. Join today and get a free resume review when you go to beseen.com/dailypodcast.

NetSuite is a complete business management software platform that handles sales, financing, accounting, orders, and HR. NetSuite gives you visibility into your business, helping you to control and grow it. NetSuite is offering a free guide, “seven key strategies to grow your profits,” at netsuite.com/sedaily.

With Triplebyte, you do one online interview, and then you get to go straight to final interviews at hundreds of companies (from tech giants like Dropbox to exciting startups). It’s like the Common App for software engineers. No resume needed. Apply now at triplebyte.com/sedaily. If you take a job through Triplebyte, you’ll get a $1000 signing bonus.