Incident Reproduction with Tammy Butow

Databases go offline. Services fail to scale up. Deployment errors can cause an application backend to get DDoS’d.

An event that prevents your company from operating as expected is known as an incident. Software teams respond to an incident by issuing a fix. Sometimes that fix returns the software to its ideal state. Other times the software remains in a degraded state, and further fixes are needed to restore it fully.

One way that a software team can learn from an incident is through incident reproduction. When an incident is turned into a reproducible system, it becomes a predictable training exercise rather than a surprising and painful outage.

Tammy Butow is an engineer with Gremlin, a company that makes chaos engineering software. Chaos engineering is the process of creating controlled experiments that simulate outages. Tammy joins the show to discuss common incident types and how they can be made reproducible for training exercises.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Check out our active projects:

  • We are hiring a head of growth. If you like Software Engineering Daily and consider yourself competent in sales, marketing, and strategy, send me an email: jeff@softwareengineeringdaily.com
  • FindCollabs is a place to build open source software.
  • The SEDaily app for iOS and Android includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. Subscribe for ad-free episodes.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.


Sponsors

Datadog unites metrics, traces, and logs in one platform so you can get full visibility into your infrastructure and applications. Check out new features like Trace Search & Analytics for rapid insights into high-cardinality data, and Watchdog, an auto-detection engine that alerts you to performance anomalies across your applications. Datadog makes it easy for teams to monitor every layer of their stack in one place, but don’t take our word for it—start a free trial today & Datadog will send you a T-shirt! softwareengineeringdaily.com/datadog

If you’re a SaaS or software vendor looking to modernize your application distribution to gain more enterprise adoption, check out Replicated.com. Replicated provides tools to deliver your Kubernetes-based application to enterprise customers as a modern on-prem, private instance.

Cruise is a San Francisco-based company building a fully electric self-driving car service. Cruise is a place where you can build on your existing skills while developing new skills and experiences that are pioneering the future of industry. There are opportunities for backend engineers, frontend developers, machine learning programmers, and many more positions. At Cruise you will be surrounded by talented, driven engineers, all while helping make cities safer and cleaner. Apply to work at Cruise by going to getcruise.com/careers.

MongoDB is the most popular document-based database built for modern application developers and the cloud era. Try MongoDB today with Atlas, the global cloud database service that runs on AWS, Azure, and Google Cloud. Configure, deploy, and connect to your database in just a few minutes. Check it out at mongodb.com/atlas.

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.