Google Cluster Evolution with Brian Grant

Podcast Friday, April 27 2018

Subscribe: RSS

Google’s central system for managing to compute resources is called Borg. On Borg, millions of Linux containers process a wide variety of workloads. When a new application is spun up, Borg provides that application with the resources it needs.

Workloads at Google usually fall into one of two distinct categories: long-running application workloads (such as Gmail) and batch workloads (such as a MapReduce job). In the early days of Google, the long-lived workloads were scheduled onto a system called “BabySitter” and the batch workloads were scheduled onto a system called “Global Work Queue.”

Borg was the first cluster manager at Google designed to service both long-running and batch workloads from a single system. The second cluster manager at Google was Omega, a project that was created to improve the engineering behind Borg. The innovations of Omega improved the efficiency and architecture of Borg.

More recently, Kubernetes was created as an open source implementation of the ideas pioneered in Borg and Omega. Google has also built a Kubernetes as a service offering that companies use to run their infrastructure in the same way that Google does.

Brian Grant is an engineer at Google who has seen the iteration of all three cluster management systems that have come out of Google. He joins the show to discuss how the workloads at Google have changed over time, and how his perspective on how to build and architect distributed systems has evolved. Full disclosure: Google is a sponsor of Software Engineering Daily.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.