Site Reliability Management with Mike Hiraga

Software engineers have worked with operations teams for as long as software has been written. In the 1990s, most operations teams worked with physical infrastructure. They made sure that servers were provisioned correctly and installed with the proper software. When software engineers shipped bad code that took down a software company's systems, the operations teams had to help recover them, which often meant dealing with the physical servers.

During the 90s and early 2000s, these operations engineers were often called “sysadmins,” “database admins” (if they worked on databases), or “infrastructure engineers.” Over the last decade, virtualization has led to many more logical servers across a company. Cloud computing has made infrastructure remote and programmable.

This progression in infrastructure changed how operations engineers work. Because infrastructure can now be managed through code, operations engineers write far more code than they used to.

The “DevOps” movement can be seen through this lens. Operations teams were now writing software, and this meant that software engineers could now work on operations. Both software engineers and operators could create deployment pipelines, monitor application health, and improve system scalability, all through code.

Site reliability engineering (or SRE) is a newer point along the evolutionary timeline of operations. Web applications can be unstable, and SRE is focused on making a site work more reliably. This is especially important for a company that makes business applications that other companies rely on.

Mike Hiraga is the head of site reliability engineering at Atlassian. Atlassian makes several products that many businesses rely on, such as JIRA, Confluence, HipChat, and Bitbucket. Because Atlassian's infrastructure operates at massive scale, Mike has broad experience from managing SRE there.

One particularly interesting topic is Atlassian’s migration to the cloud. Atlassian was started in 2002, before the cloud was widely used, and they have more recently made a push to move applications into the cloud. Full disclosure: Atlassian is a sponsor of Software Engineering Daily—and they are hiring, so if you are looking for a job, check out Atlassian jobs, or send me an email directly and I’m happy to introduce you to the team.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


Sumo Logic is a cloud-native, machine data analytics service that helps you Run and Secure your Modern Application. If you are feeling the pain of managing your own log, event, and performance metrics data, check out sumologic.com/sedaily. Even if you have tools already, it’s worth checking out Sumo Logic and seeing if you can leverage your data even more effectively, with real-time dashboards, monitoring, and improved observability, to improve the uptime of your application and keep your day-to-day runtime more secure. Check out sumologic.com/sedaily for a free 30-day trial of Sumo Logic, to find out how Sumo Logic can improve your productivity and your application observability, wherever you run your applications. That’s sumologic.com/sedaily.


Speaking of reliability, do you find yourself worrying about system downtime or missing an alert while you’re on-call? If so, VictorOps is THE incident management tool you need. VictorOps integrates with a large number of the monitoring, alerting, and messaging tools you already have in place to help your DevOps teams communicate better, diagnose incidents, and resolve any problems that come up. All in one place, on both your smartphone and your computer, you can view highly contextual, detailed alerts that will help your on-call engineers to understand and respond to incidents more quickly and effectively. Head to victorops.com/sedaily to see how VictorOps can help you. Be victorious with VictorOps!


Failure is unpredictable. You don’t know when your system will break, but you know it will happen. Gremlin prepares you for these outages. We provide resilience as a service, using chaos engineering techniques pioneered at Netflix and Amazon. Prepare your team for disaster by proactively testing failure scenarios. Max out CPU, blackhole or slow down network traffic to a dependency, terminate processes and hosts. Each of these shows you how your system reacts, allowing you to harden things before a production incident. Check out Gremlin and get a free demo by going to gremlin.com/sedaily.


GoCD is a continuous delivery tool created by ThoughtWorks. GoCD agents use Kubernetes to scale as needed. Check out gocd.org/sedaily and learn about how you can get started. GoCD was built with the learnings of the ThoughtWorks engineering team, who have talked about building the product in previous episodes of Software Engineering Daily. It’s great to see the continued progress on GoCD with the new Kubernetes integrations, and you can check it out for yourself at gocd.org/sedaily.