Chaos Engineering with Kolton Andrus

Applications can fail in numerous ways. Disks fail all the time. Servers overheat. Network connections get flaky. You assume that you are prepared for these scenarios because you have replicated your servers. You have the database backed up. Your core application is spread across multiple availability zones.

But are you really sure that your system is resilient? The only way to prove that your system is resilient to failure is to experience failure, and to make a swift response to failure an integral part of your software.

Chaos engineering is the practice of routinely testing your system’s resilience by inducing controlled failures. Netflix was the first company to discuss chaos engineering widely, but more and more companies are starting to work it into their systems, and finding it tremendously useful. By inducing failures in your system, you can discover unknown dependencies, single points of failure, and problematic state conditions that can cause data corruption.
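To make the idea of a controlled failure concrete, here is a minimal Python sketch. It is not how Netflix or Gremlin implement fault injection; every name in it (`with_chaos`, `fetch_recommendations`, `recommendations_with_fallback`, `run_experiment`) is hypothetical and exists only for illustration. It wraps a dependency call so that a chosen fraction of invocations fail on purpose, then counts how often the caller degrades gracefully instead of crashing.

```python
import random


class InjectedFailure(Exception):
    """Raised on purpose to simulate a failing dependency."""


def with_chaos(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("injected failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    # Hypothetical stand-in for a call to a downstream service.
    return ["episode-1", "episode-2"]


def recommendations_with_fallback(fetch, user_id):
    # The behavior under test: return an empty list instead of crashing.
    try:
        return fetch(user_id)
    except InjectedFailure:
        return []


def run_experiment(trials=1000, failure_rate=0.2, seed=7):
    """Return how many of `trials` requests were served in degraded mode."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_chaos(fetch_recommendations, failure_rate, rng)
    degraded = 0
    for _ in range(trials):
        if recommendations_with_fallback(flaky_fetch, "user-42") == []:
            degraded += 1
    return degraded


if __name__ == "__main__":
    print("degraded responses:", run_experiment())
```

If the fallback were missing, the injected exception would propagate out of the caller, which is exactly the kind of hidden dependency such an experiment is meant to surface. A real experiment would inject failures at the infrastructure level rather than inside application code.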

Kolton Andrus worked on chaos engineering at Netflix and Amazon, where he designed systems that tested resiliency through routine failures. He has since founded Gremlin, a company that provides chaos engineering as a service. In a previous episode, Kolton and I discussed why chaos engineering is useful, and he told some awesome war stories about working at Amazon and Netflix. In this show, we explore how to build a chaos engineering service, which involves standing up Gremlin containers that institute controlled failures.

To find the previous episode I recorded with Kolton, as well as other supplementary materials described in this show, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format, along with recommendations, categories, related links, and discussions around the episodes.


