Incident Reproduction with Tammy Butow

Databases go offline. Services fail to scale up. Deployment errors can cause an application backend to get DDoS’d.

When an event happens that prevents your company from operating as expected, it is known as an incident. Software teams respond to an incident by issuing a fix. Sometimes that fix returns the software to its ideal state. Other times the software remains in a degraded state, and it takes more fixing to return the software to the place it should be.

One way that a software team can learn from an incident is through incident reproduction. When an incident is turned into a reproducible system, it becomes a predictable training exercise rather than a surprising and painful outage.

Tammy Butow is an engineer with Gremlin, a company that makes chaos engineering software. Chaos engineering is the process of creating controlled experiments that simulate outages. Tammy joins the show to discuss common incident types, and how those can be made reproducible for training exercises.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Check out our active projects:

  • We are hiring a head of growth. If you like Software Engineering Daily and consider yourself competent in sales, marketing, and strategy, send me an email: jeff@softwareengineeringdaily.com
  • FindCollabs is a place to build open source software.
  • The SEDaily app for iOS and Android includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. Subscribe for ad-free episodes.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Software Daily

Software Daily

 
Subscribe to Software Daily, a curated newsletter featuring the best and newest from the software engineering community.