Chaos Engineering with Kolton Andrus

The number of ways that applications can fail are numerous. Disks fail all the time. Servers overheat. Network connections get flaky. You assume that you are prepared for such a scenario, because you have replicated your servers. You have the database backed up. Your core application is spread across multiple availability zones.

But are you really sure that your system is resilient? The only way to prove that your system is resilient to failure is to experience failure, and to make swift responsiveness to failure an integral part of your software.

Chaos engineering is the practice of routinely testing your system’s resilience by inducing controlled failures. Netflix was the first company to discuss chaos engineering widely, but more and more companies are starting to work it into their systems, and finding it tremendously useful. By inducing failures in your system, you can discover unknown dependencies, single points of failure, and problematic state conditions that can cause data corruption.

Kolton Andrus worked on chaos engineering at Netflix and Amazon, where he designed systems that would test system resiliency through routine failures. Since then, he founded Gremlin, a company that provides chaos engineering as a service. In a previous episode, Kolton and I discussed why chaos engineering is useful, and he told some awesome war stories about working at Amazon and Netflix. In this show, we explore how to build a chaos engineering service–which involves standing up Gremlin containers that institute controlled failures.

To find the previous episode I recorded with Kolton, as well as other supplementary materials described in this show, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format–we have recommendations, categories, related links and discussions around the episodes. It’s all free and also open source–if you are interested in getting involved in our open source community, we have lots of people working on the project and we do our best to be friendly and inviting to new people coming in looking for their first open source project. You can find that project at Github.com/softwareengineeringdaily.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


Digital Ocean is a reliable, easy-to-use cloud provider. More and more people are finding out about Digital Ocean, and realizing that Digital Ocean is perfect for their application workloads. This year, Digital Ocean is making that even easier, with new node types–a $15 flexible droplet that can mix and match different configurations of CPU and RAM, to get the perfect amount of resources for your application. There are also CPU optimized droplets–perfect for highly active frontend servers, or CI/CD workloads. And running on the cloud can get expensive, which is why Digital Ocean makes it easy to choose the right size instance. And the prices on standard instances have gone down too–you can check out all their new deals by going to do.co/sedaily. And as a bonus to our listeners you will get $100 in credit over 60 days. Use the credit for hosting or infrastructure–that includes load balancers, object storage, and computation. Get your free $100 credit at do.co/sedaily. Thanks to Digital Ocean for being a sponsor of Software Engineering Daily.


The octopus: a sea creature known for its intelligence and flexibility. Octopus Deploy: a friendly deployment automation tool for deploying applications like .NET apps, Java apps and more. Ask any developer and they’ll tell you it’s never fun pushing code at 5pm on a Friday then crossing your fingers hoping for the best. That’s where Octopus Deploy comes into the picture. Octopus Deploy is a friendly deployment automation tool, taking over where your build/CI server ends. Use Octopus to promote releases on-prem or to the cloud. Octopus integrates with your existing build pipeline–TFS and VSTS, Bamboo, TeamCity, and Jenkins. It integrates with AWS, Azure, and on-prem environments. Reliably and repeatedly deploy your .NET and Java apps and more. If you can package it, Octopus can deploy it! It’s quick and easy to install. Go to Octopus.com to trial Octopus free for 45 days. That’s Octopus.com


Your company needs to build a new app, but you don’t have the spare engineering resources. There are some technical people in your company who have time to build apps–but they are not engineers. OutSystems is a platform for building low-code apps. As an enterprise grows, it needs more and more apps to support different types of customers and internal employee use cases. OutSystems has everything that you need to build, release, and update your apps without needing an expert engineer. And if you are an engineer, you will be massively productive with OutSystems. Find out how to get started with low-code apps today–at OutSystems.com/sedaily. There are videos showing how to use the OutSystems development platform, and testimonials from enterprises like FICO, Mercedes Benz, and SafeWay. OutSystems enables you to quickly build web and mobile applications–whether you are an engineer or not. Check out how to build low-code apps by going to OutSystems.com/sedaily.


Azure Container Service simplifies the deployment, management and operations of Kubernetes. Eliminate the complicated planning and deployment of fully orchestrated containerized applications with Kubernetes. You can quickly provision clusters to be up and running in no time, while simplifying your monitoring and cluster management through auto upgrades and a built-in operations console. Avoid being locked into any one vendor or resource. You can continue to work with the tools you already know, such as Helm, and move applications to any Kubernetes deployment. Integrate with your choice of container registry, including Azure Container Registry. Also, quickly and efficiently scale to maximize your resource utilization without having to take your applications offline. Isolate your application from infrastructure failures and transparently scale the underlying infrastructure to meet growing demands—all while increasing the security, reliability, and availability of critical business workloads with Azure. Check out the Azure Container Service at aka.ms/sedaily.

Software Weekly

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.