High Volume Event Processing with John-Daniel Trask

Podcast Thursday, November 16 2017


A popular software application serves billions of user requests, and those requests could be for many different things. Each one needs to be routed to the correct destination, load balanced across different instances of a service, and queued for processing. Processing a request might require generating a detailed response to the user, writing to a database, or creating a new file on a file system.

As a software product grows in popularity, it will need to scale these different parts of its infrastructure at different rates. You may not need to grow your database cluster at the same pace that you grow the number of load balancers at the front of your infrastructure. Your users might start making 70% of their requests to one specific part of your application, and you might need to scale up the services that power that portion of the infrastructure.

Today’s episode is a case study of a high-volume application: a monitoring platform called Raygun.

Raygun’s software runs on client applications and delivers monitoring data and crash reports back to Raygun’s servers. Suppose I have a podcast player application on my iPhone that runs the Raygun software, and that application crashes. Raygun takes a snapshot of the system state and reports it along with the exception that triggered the crash, so that the developer of the podcast player can see the full picture of what was going on in the user’s device.

Throughout the day, applications all around the world are crashing and sending requests to Raygun’s servers. Even when crashes are not occurring, Raygun is receiving monitoring and health data from those applications. Raygun’s infrastructure routes those different types of requests to different services, queues them up, and writes the data to multiple storage layers: ElasticSearch, a relational SQL database, and a custom file server built on top of S3.
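The path described above (route by request type, queue, then write to several storage layers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Raygun’s actual implementation: the `Router` and `Worker` classes, the `"crash"` and `"health"` type names, and the request fields are all hypothetical, and plain in-memory lists stand in for the real storage layers.

```python
from collections import defaultdict, deque

# Hypothetical request kinds; the episode only says "different types of requests".
CRASH = "crash"
HEALTH = "health"


class Router:
    """Places each incoming request on an in-memory queue for its type."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def route(self, request):
        self.queues[request["type"]].append(request)


class Worker:
    """Drains one queue and writes each request to every configured sink."""

    def __init__(self, queue, sinks):
        self.queue = queue
        self.sinks = sinks

    def drain(self):
        processed = 0
        while self.queue:
            request = self.queue.popleft()
            for sink in self.sinks:
                sink.append(request)  # stand-in for a real storage write
            processed += 1
        return processed


router = Router()
router.route({"type": CRASH, "app": "podcast-player", "error": "NullReference"})
router.route({"type": HEALTH, "app": "podcast-player", "cpu": 0.4})

# Lists standing in for a search index, a SQL database, and a file store.
search_index, sql_rows, file_store = [], [], []

# Assumption: crash reports fan out to all three stores.
crash_worker = Worker(router.queues[CRASH], [search_index, sql_rows, file_store])
crash_worker.drain()
```

Keeping one queue per request type is what lets each worker pool be scaled independently: if crash reports spike, only the crash workers need more instances, while the health-data path is left alone.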

John-Daniel Trask is the CEO of Raygun, and he joins the show to describe the end-to-end architecture of Raygun’s request processing and storage system. We also explore specific refactoring changes that were made to save costs at the worker layer of the architecture. This is a useful memory management strategy for anyone working in a garbage-collected language. If you would like to see diagrams that explain the architecture and other technical decisions, the show notes have a video that explains what we talk about in this show. Full disclosure: Raygun is a sponsor of Software Engineering Daily.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.