Observability Engineering with James Burns
Twilio is a communications infrastructure company with thousands of internal services and thousands of requests per second. Each request generates logs, metrics, and distributed traces, which can be used to troubleshoot failures and improve latency.
Because Twilio is used for two-factor authentication and text message relaying, it is critical infrastructure for most applications that implement it. The service must remain highly available even during peak application traffic or an outage at a particular cloud provider.
While at Twilio, James Burns worked on platform infrastructure and observability. James was at Twilio from 2014 to 2017, a period in which the company scaled rapidly. His work encompassed site reliability, monitoring, cost management, and incident response. He also led chaos engineering exercises called “game days”, in which the company deliberately caused infrastructure to fail in order to verify the reliability of failover systems and to discover problematic dependencies.
James joins the show to talk about his time at Twilio and his perspectives on how to instrument and observe complex applications. Full disclosure: James now works at LightStep, which is a sponsor of Software Engineering Daily.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.