Streaming Analytics with Scott Kidder

When you go to a website where a video is playing and the video lags, how does the website know that you are having a bad experience?

Problems with video are often not complete failures: maybe part of the video loads and plays just fine, and then the rest of the video buffers. You have probably experienced sitting in front of a video, waiting for it to load while the loading wheel mysteriously spins.

Since problems with video are often not complete failures, troubleshooting a problem with a user’s video playback is not as straightforward as logging whenever a crash occurs. You need to continuously monitor video playback on every client device and aggregate those measurements in a centralized system for analysis.
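
To make "continuously monitor" concrete, here is a rough sketch of what a periodic playback heartbeat event might contain. The field names are illustrative assumptions, not Mux's actual event schema.

```java
// Hypothetical playback heartbeat, emitted by the player every few seconds.
// Field names are assumptions for illustration, not Mux's actual schema.
public record PlaybackHeartbeat(
        String viewerId,     // anonymized identifier for the viewing session
        String videoId,      // which video asset is being played
        long positionMs,     // current playback position, in milliseconds
        boolean rebuffering, // true if the player is currently stalled
        long bufferedMs,     // media buffered ahead of the playhead, in milliseconds
        long timestampMs     // client clock time when the sample was taken
) {}
```

Each client would periodically ship batches of samples like this to a collection endpoint, which is what feeds the centralized system described above.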

The centralized logging system will allow you to separate problems with a specific user from problems with the video service itself. A single user could have bad Wi-Fi, or have 50 tabs open with different videos. To identify problems that are caused by the video player rather than by the user, you need to capture playback data from every video and every user.

Scott Kidder works at Mux, where he builds a streaming analytics system for video monitoring. In this episode, Scott explains how events make it from a video player to the backend analytics system running on Kinesis and Apache Flink.

Events from the browser are constantly added to Kinesis (which is much like Kafka). Apache Flink reads those events off of Kinesis and aggregates them, MapReduce-style, to discover anomalies. For example, if 100 users watch a 20-minute cat video and the video stops playing at minute 12 for all 100 users, there is probably some data corruption in that video. You would only be able to discover that by looking across all users.
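
As a rough illustration of that pipeline, here is a minimal Flink-style sketch that reads playback events from a Kinesis stream, counts how many viewers stall at each minute of each video, and flags positions where many viewers stall at once. The stream name, event format, and threshold are assumptions made for this example, not details from the episode.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class StallDetectionJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read raw stall events from Kinesis; "playback-events" is a placeholder stream name.
        Properties kinesisProps = new Properties();
        kinesisProps.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
        DataStream<String> raw = env.addSource(
                new FlinkKinesisConsumer<>("playback-events", new SimpleStringSchema(), kinesisProps));

        raw
            // Map: each record is assumed to be "videoId,stallMinute"; emit (videoId, stallMinute, 1).
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple3.of(parts[0], Integer.parseInt(parts[1]), 1);
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT, Types.INT))
            // Group stalls by video and by the minute at which playback stopped.
            .keyBy(event -> event.f0 + "@" + event.f1)
            // Reduce: count stalls per (video, minute) over one-minute windows.
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .sum(2)
            // Many viewers stalling at the same point suggests a problem with the asset itself.
            .filter(counted -> counted.f2 >= 100)
            .print();

        env.execute("stall-detection-sketch");
    }
}
```

In the cat-video example, all 100 viewers stalling at minute 12 would land in the same (videoId, minute) group, so the count would cross the threshold and the video would be flagged rather than any individual viewer.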

Scott and I discussed the streaming infrastructure that he works on at Mux, as well as other streaming systems like Spark, Apache Beam, and Kafka.

This episode is the first in a short series about streaming data infrastructure. I wanted to do some shows in preparation for the Strata Data Conference in March in San Jose, which I will be attending thanks to a complimentary ticket from O’Reilly. O’Reilly has been kind enough to give me free tickets ever since Software Engineering Daily started, back when the show did not have the money to attend any conferences. If you want to attend Strata, you can use promo code PCSED to get 20% off.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


There’s a new open source project called Dremio that is designed to simplify analytics. It’s also designed to handle some of the hard work, like scaling performance of analytical jobs. Dremio is the team behind Apache Arrow, a new standard for in-memory columnar data analytics. Arrow has been adopted across dozens of projects, like Pandas, to improve the performance of analytical workloads on CPUs and GPUs. It’s free and open source, designed for everyone, from your laptop to clusters of over 1,000 nodes. At dremio.com/sedaily you can find all the necessary resources to get started with Dremio for free. If you like it, be sure to tweet @dremiohq and let them know you heard about it from Software Engineering Daily. Thanks again to Dremio, and check out dremio.com/sedaily to learn more.



A thank you to our sponsor, Datadog, a cloud monitoring platform bringing full visibility to dynamic infrastructure and applications. Create beautiful dashboards, set powerful, machine learning–based alerts, and collaborate with your team to resolve performance issues. Datadog integrates seamlessly with more than 200 technologies, including Google Cloud Platform, AWS, Docker, PagerDuty, and Slack. With fast installation and setup, plus APIs and open source libraries for custom instrumentation, Datadog makes it easy for teams to monitor every layer of their stack in one place. But don’t take our word for it—start a free trial today & Datadog will send you a free T-shirt! Visit softwareengineeringdaily.com/datadog to get started.  


Amazon Redshift powers the analytics of your business, and Intermix.io powers the analytics of your Redshift. Intermix.io gives you the tools you need to analyze your Amazon Redshift performance and improve the toolchain of everyone downstream from your data warehouse. The team at Intermix has seen so many Redshift clusters, they are confident they can solve whatever performance issues you are having. Go to intermix.io/sedaily to get a free 30-day trial. Intermix collects all your Redshift logs and makes it easy to figure out what’s wrong so you can take action. All in a nice, intuitive dashboard. Go to intermix.io/sedaily to start your free 30-day trial.