Streaming Architecture with Tugdual Grall

At a big enough scale, every software product produces lots of data. Whether you are building an advertising technology company, a social network, or a system for IoT devices, you have thousands of events coming in at a fast pace that you want to aggregate, study and act upon.

For the last decade, engineers have been learning to store and process these vast quantities of data.

The first common technique was to store all your data to HDFS–the Hadoop Distributed File System–and run nightly Hadoop MapReduce jobs across that data. HDFS is cheap (stored on disk), effective (Hadoop had a revolutionary effect on business analysis), and easy to understand (“every night we take all the data from the previous day, analyze it with Hadoop, and send an email report to our analysts”).

The second common technique was the “Lambda Architecture.” The Lambda Architecture used a stream processing system like Apache Storm to process all incoming events as soon as they were created, so that software products could react quickly to the changes occurring in a large scale system. But events would sometimes be processed out of order, or they would get lost due to node failures. To fix those errors, the nightly Hadoop MapReduce jobs would still run, and would reconcile all the problems that might have occurred when the events were processed in the streaming system.

The Lambda Architecture worked pretty well–systems were becoming “real time”, and products like Twitter were starting to feel alive as they were able to rapidly process the massive volume of events on the fly. But managing a system with a Lambda Architecture was painful–you had to manage both a Hadoop cluster and a Storm cluster. You had to make sure that your Hadoop processing did not interfere with your Storm processing.

Today, a newer technique for ingesting and reacting to data has become more common, and is referred to as “streaming analytics.” Streaming analytics is a strategy for performing fast analysis of data coming into a system.

In streaming analytics systems, events are sent to a scalable, durable pubsub system such as Kafka. You can think of Kafka as a huge array of events that have occurred–such as users liking tweets or clicking on ads. Stream processing systems like Apache Flink or Apache Spark read the data from Kafka as if they were reading an array that was being continually appended to.

The sequence of events that get written to Kafka are called “streams”. This can be confusing–with a stream, you imagine this constantly moving, transient sequence of data. That’s partially true, but data will stay in Kafka as long as you want it to. You can set a retention policy for 2 weeks, 2 months, or 2 years. As long as that data is still retained in Kafka, your stream processing system can start reading from any place in the stream.

The stream processing systems like Flink or Spark that read from Kafka are still grabbing batches of data, and processing them in batches. They are reading from the event stream buffer in Kafka, which you can think of as an array. (This is something that confused me for a long time, so if you are still confused, don’t worry, we explain it more in this episode.)

Tugdual Grall is an engineer with MapR. In today’s episode, we explore use cases and architectural patterns for streaming analytics. Full disclosure: MapR is a sponsor of Software Engineering Daily.

In past shows, we have covered data engineering in detail–we’ve looked at Uber’s streaming architecture, talked to Matei Zaharia about the basics of Apache Spark, and explored the history of Hadoop. To find all of our episodes about data engineering , download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format–we have recommendations, categories, related links and discussions around the episodes. It’s all free and also open source–if you are interested in getting involved in our open source community, we have lots of people working on the project and we do our best to be friendly and inviting to new people coming in looking for their first open source project. You can find that project at Github.com/softwareengineeringdaily.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


There’s a new open source project called Dremio that is designed to simplify analytics. It’s also designed to handle some of the hard work, like scaling performance of analytical jobs. Dremio is the team behind Apache Arrow, a new standard for in-memory columnar data analytics. Arrow has been adopted across dozens of projects – like Pandas – to improve the performance of analytical workloads on CPUs and GPUs. It’s free and open source, designed for everyone, from your laptop, to clusters of over 1,000 nodes. At dremio.com/sedaily you can find all the necessary resources to get started with Dremio for free. If you like it, be sure to tweet @dremiohq and let them know you heard about it from Software Engineering Daily. Thanks again to Dremio, and check out dremio.com/sedaily to learn more.



Today’s podcast is sponsored by Datadog, a cloud-scale monitoring platform for infrastructure and applications. In Datadog’s new container orchestration report, Kubernetes holds a 41-percent share of Docker environments, a number that’s rising fast. As more companies adopt containers, and turn to Kubernetes to manage their containers, they need a comprehensive monitoring platform that’s built for dynamic, modern infrastructure. Datadog integrates seamlessly with more than 200 technologies, including Kubernetes and Docker, so you can monitor your entire container infrastructure in one place. And with Datadog’s new Live Container view, you can see every container’s health, resource consumption, and running processes in real time. See for yourself by starting a free trial and get a free Datadog T-shirt! softwareengineeringdaily.com/datadog


Amazon Redshift powers the analytics of your business–and Intermix.io powers the analytics of your Redshift. Intermix.io gives you the tools you need to analyze your Amazon Redshift performance and improve the toolchain of everyone downstream from your data warehouse. The team at Intermix has seen so many Redshift clusters, they are confident they can solve whatever performance issues you are having. Go to intermix.io/sedaily to get a free 30-day trial. Intermix collects all your Redshift logs and makes it easy to figure out what’s wrong so you can take action. All in a nice, intuitive dashboard. Go to intermix.io/sedaily to start your free 30-day trial.