On January 31, 2017, GitLab suffered a major outage of its online repository hosting service. The primary database server lost data due to a combination of malicious spam attacks and engineering mistakes made while responding to those attacks.
GitLab responded to the event transparently. The company put up a postmortem describing the event in detail. In subsequent posts, GitLab expressed sympathy for the employee who made engineering mistakes that led to the deletion of data. The employee was not judged or disciplined for an understandable error.
The response from the developer community was very positive. Engineers know that building cloud services is hard. Engineering is as much about avoiding errors as it is about appropriately responding to the inevitable mistakes.
GitLab is a developer platform that combines repository hosting with several other features: issue tracking, code review, and CD. Today’s guest is Pablo Carranza, who works on infrastructure at GitLab. In this episode, he walks us through GitLab’s product, the engineering stack, and a postmortem of the outage. We also discuss working at Amazon, and the importance of postmortems, which I first encountered at Amazon.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.