Chaos Engineering with Kolton Andrus

Podcast Friday, February 2, 2018

Applications can fail in numerous ways. Disks fail all the time. Servers overheat. Network connections get flaky. You assume that you are prepared for such scenarios because you have replicated your servers. You have the database backed up. Your core application is spread across multiple availability zones.

But are you really sure that your system is resilient? The only way to prove that your system is resilient to failure is to experience failure and to make a swift response to failure an integral part of your software.

Chaos engineering is the practice of routinely testing your system’s resilience by inducing controlled failures. Netflix was the first company to discuss chaos engineering widely, but more and more companies are starting to work it into their systems, and finding it tremendously useful. By inducing failures in your system, you can discover unknown dependencies, single points of failure, and problematic state conditions that can cause data corruption.
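To make the idea of a controlled failure concrete, here is a minimal sketch in Python. It is not Gremlin's or Netflix's tooling; the names `with_chaos`, `InjectedFailure`, `call_with_retries`, and `fetch_profile` are hypothetical and exist only for illustration. The sketch wraps a function so that a configurable fraction of calls fail on purpose, then checks whether the caller's retry logic copes.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a dependency outage (hypothetical)."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of calls raise InjectedFailure.

    `rng` is a random.Random instance; passing a seeded one keeps an
    experiment repeatable.
    """
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 0.0 never fails
        # and a rate of 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times and return the first success."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as err:
            last_error = err
    # Every attempt failed, so surface the last injected failure.
    raise last_error


def fetch_profile():
    """Stand-in for a call to a real dependency (assumption for this sketch)."""
    return {"user": "example"}


if __name__ == "__main__":
    flaky_fetch = with_chaos(fetch_profile, failure_rate=0.3, rng=random.Random(42))
    print(call_with_retries(flaky_fetch, attempts=5))
```

In a real experiment the failure would be induced at the infrastructure level (a killed process, a dropped network link) rather than inside application code, and the failure rate would be kept small and limited to a controlled portion of traffic.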

Kolton Andrus worked on chaos engineering at Netflix and Amazon, where he designed systems that test resiliency through routine failures. He has since founded Gremlin, a company that provides chaos engineering as a service. In a previous episode, Kolton and I discussed why chaos engineering is useful, and he told some awesome war stories about working at Amazon and Netflix. In this show, we explore how to build a chaos engineering service, which involves standing up Gremlin containers that institute controlled failures.

To find the previous episode I recorded with Kolton, as well as other supplementary materials described in this show, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format, with recommendations, categories, related links, and discussions around the episodes. It's all free and also open source. If you are interested in getting involved in our open source community, we have lots of people working on the project, and we do our best to be friendly and inviting to new people looking for their first open source project. You can find that project at Github.com/softwareengineeringdaily.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.