At a big enough scale, every software product produces lots of data. Whether you are building an advertising technology company, a social network, or a system for IoT devices, you have thousands of events coming in at a fast pace that you want to aggregate, study and act upon.
For the last decade, engineers have been learning to store and process these vast quantities of data.
The first common technique was to store all your data in HDFS–the Hadoop Distributed File System–and run nightly Hadoop MapReduce jobs across that data. HDFS is cheap (data sits on inexpensive commodity disks), effective (Hadoop had a revolutionary effect on business analysis), and easy to understand (“every night we take all the data from the previous day, analyze it with Hadoop, and send an email report to our analysts”).
The second common technique was the “Lambda Architecture.” The Lambda Architecture used a stream processing system like Apache Storm to process all incoming events as soon as they were created, so that software products could react quickly to the changes occurring in a large-scale system. But events would sometimes be processed out of order, or they would get lost due to node failures. To fix those errors, the nightly Hadoop MapReduce jobs would still run, and would reconcile any problems that might have occurred when the events were processed in the streaming system.
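The two-layer structure described above can be sketched in a few lines of plain Python. This is an illustrative toy, not Hadoop or Storm code: the event lists, function names, and the merge step are all hypothetical stand-ins for the batch layer, the speed layer, and the serving layer that combines their views.

```python
# Toy model of the Lambda Architecture: a slow-but-exact batch layer,
# a fast-but-approximate speed layer, and a serving layer that merges
# the two views. All names and data here are illustrative.

from collections import Counter

# Everything the batch layer sees (think: files in HDFS).
all_events = ["like", "click", "like", "click", "like"]

# Events since the last nightly run (the speed layer's view). The
# duplicate "click" models a stream processor re-seeing an event
# after a node failure.
recent_events = ["like", "click", "click"]

def batch_view(events):
    """Nightly MapReduce-style job: slow, but exact."""
    return Counter(events)

def speed_view(events):
    """Streaming job: fast, but may double-count or drop events."""
    return Counter(events)

def serving_layer(batch, speed):
    """Merge both views; the next nightly run replaces the speed
    layer's approximate counts with exact ones."""
    return batch + speed

merged = serving_layer(batch_view(all_events), speed_view(recent_events))
print(merged["like"])  # batch count plus the (possibly inexact) recent count
```

The pain the episode describes comes from running `batch_view` and `speed_view` as two entirely separate clusters with two codebases, even though they compute the same thing.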
The Lambda Architecture worked pretty well–systems were becoming “real time”, and products like Twitter were starting to feel alive as they were able to rapidly process the massive volume of events on the fly. But managing a system with a Lambda Architecture was painful–you had to manage both a Hadoop cluster and a Storm cluster. You had to make sure that your Hadoop processing did not interfere with your Storm processing.
Today, a newer technique for ingesting and reacting to data has become more common, and is referred to as “streaming analytics.” Streaming analytics is a strategy for performing fast analysis of data coming into a system.
In streaming analytics systems, events are sent to a scalable, durable pubsub system such as Kafka. You can think of Kafka as a huge array of events that have occurred–such as users liking tweets or clicking on ads. Stream processing systems like Apache Flink or Apache Spark read the data from Kafka as if they were reading an array that was being continually appended to.
Each sequence of events written to Kafka is called a “stream”. This can be confusing–the word “stream” suggests a constantly moving, transient sequence of data. That’s partially true, but data stays in Kafka as long as you want it to: you can set a retention policy of 2 weeks, 2 months, or 2 years. As long as the data is still retained in Kafka, your stream processing system can start reading from any place in the stream.
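The “huge array of events” picture from the last two paragraphs can be made concrete with a toy log. This is a sketch of the concept only–the class and method names are invented, not Kafka’s real API: an append-only list, consumers that read by offset, and a retention policy that trims old entries.

```python
# Toy model of a Kafka-style log: an append-only array of events that
# consumers read by offset, plus a retention policy that drops old data.
# Illustrative only; not the real Kafka client API.

class Log:
    def __init__(self):
        self.events = []        # the "array" of events
        self.start_offset = 0   # first offset still retained

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        """Any consumer can start reading at any retained offset."""
        if offset < self.start_offset:
            raise ValueError("offset expired under the retention policy")
        return self.events[offset - self.start_offset:]

    def apply_retention(self, keep_last_n):
        """Keep only the newest keep_last_n events (e.g. "2 weeks")."""
        dropped = max(0, len(self.events) - keep_last_n)
        self.events = self.events[dropped:]
        self.start_offset += dropped

log = Log()
for e in ["like:1", "click:2", "like:3", "click:4"]:
    log.append(e)

print(log.read_from(2))            # ['like:3', 'click:4']
log.apply_retention(keep_last_n=2)
print(log.read_from(2))            # offset 2 is still retained, so this works
```

A second consumer could call `read_from(0)` and replay history from the beginning, as long as retention hasn’t dropped those offsets–which is exactly why a Kafka “stream” is less transient than the name suggests.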
Stream processing systems like Flink or Spark that read from Kafka are still grabbing batches of data and processing them in batches: they read from the event stream buffer in Kafka, which you can think of as an array. (This is something that confused me for a long time, so if you are still confused, don’t worry–we explain it more in this episode.)
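In other words, “reading an array that is continually appended to” boils down to polling for the next batch of records past the last offset you processed. Here is a minimal sketch of that loop–the `poll` function and batch size are hypothetical, not a real consumer client.

```python
# Sketch: a stream processor reads a Kafka-like log as repeated batch
# fetches from its current offset. Names are illustrative.

def poll(log, offset, max_batch=2):
    """Return (batch, next_offset): up to max_batch events at `offset`."""
    batch = log[offset:offset + max_batch]
    return batch, offset + len(batch)

log = ["e1", "e2", "e3", "e4", "e5"]  # the topic, modeled as a plain list
offset = 0
while True:
    batch, offset = poll(log, offset)
    if not batch:
        break  # caught up; a real consumer would wait for new events
    print("processing batch:", batch)
```

So even a “streaming” job is a loop over small batches; the difference from the nightly Hadoop job is that the batches are tiny and arrive continuously rather than once a day.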
Tugdual Grall is an engineer with MapR. In today’s episode, we explore use cases and architectural patterns for streaming analytics. Full disclosure: MapR is a sponsor of Software Engineering Daily.
In past shows, we have covered data engineering in detail–we’ve looked at Uber’s streaming architecture, talked to Matei Zaharia about the basics of Apache Spark, and explored the history of Hadoop. To find all of our episodes about data engineering, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format–we have recommendations, categories, related links and discussions around the episodes. It’s all free and also open source–if you are interested in getting involved in our open source community, we have lots of people working on the project and we do our best to be friendly and inviting to new people coming in looking for their first open source project. You can find that project at Github.com/softwareengineeringdaily.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.