LinkedIn Resilience with Bhaskaran Devaraj and Xiao Li

Podcast Monday, February 5 2018


How do you build resilient, failure-tested systems? Redundancy, backups, and testing are all important. But there is also a growing trend toward chaos engineering: the technique of inducing controlled failures in order to prove that a system is fault tolerant in the way that you expect.

In last week’s episode with Kolton Andrus, we discussed one way to make chaos engineering a routine part of testing a distributed system. Kolton discussed his company Gremlin, which injects failures by spinning up a Gremlin container and having that container induce network failures, memory errors, and full disks. In this episode, we explore another insertion point for testing controlled failures, this time from the point of view of LinkedIn.

LinkedIn is a social network for working professionals. As LinkedIn has grown, the increased number of services has led to more interdependency between those services. The more dependencies a given service has, the more partial failure cases there are. That’s not to say there is anything wrong with having a lot of service dependencies; this is just the way we build modern applications. But it does suggest that we should try to test the failures that can emerge from so many dependencies.

Bhaskaran Devaraj and Xiao Li are engineers at LinkedIn who are working on a project called Waterbear, with the goal of making the infrastructure more resilient.

LinkedIn’s backend system consists of a large distributed application with thousands of microservices communicating with each other. Most of those services communicate over Rest.li, a proxy for standardizing interactions between services. Rest.li can assist with routing, A/B testing, circuit breaking, and other aspects of service-to-service communication. This proxy can also be used for executing controlled failures. Because services communicate with each other through it, creating a controlled failure can be as simple as telling your proxy not to send traffic to downstream services.
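To make the idea concrete, here is a minimal sketch of proxy-level failure injection in Python. It is not Rest.li's API or Waterbear's implementation; the class, the function, and the service names ("profile", "recommendations") are all hypothetical, chosen only to show how a proxy that sits between services can cut off a downstream dependency so that the caller's fallback path gets exercised.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


class InjectedFailure(Exception):
    """Raised when the proxy deliberately refuses to forward a request."""


@dataclass
class FailureInjectingProxy:
    """Hypothetical proxy that forwards calls to named downstream services."""

    # Maps a downstream service name to the callable that serves it.
    downstreams: Dict[str, Callable[[str], str]]
    # Downstream services that an experiment has currently cut off.
    blocked: Set[str] = field(default_factory=set)

    def block(self, service: str) -> None:
        """Start a controlled failure: stop sending traffic to `service`."""
        self.blocked.add(service)

    def unblock(self, service: str) -> None:
        """End the controlled failure and resume normal forwarding."""
        self.blocked.discard(service)

    def call(self, service: str, request: str) -> str:
        # The failure is injected here, before any traffic leaves the proxy.
        if service in self.blocked:
            raise InjectedFailure(f"experiment: traffic to {service} is disabled")
        return self.downstreams[service](request)


def render_profile(proxy: FailureInjectingProxy, member_id: str) -> str:
    """Hypothetical caller with one critical and one optional dependency."""
    profile = proxy.call("profile", member_id)
    try:
        recs = proxy.call("recommendations", member_id)
    except InjectedFailure:
        # Graceful degradation: the page still renders without this section.
        recs = "no recommendations available"
    return f"{profile} | {recs}"
```

Blocking "recommendations" and calling render_profile again shows whether the caller degrades gracefully or fails outright, which is the question a controlled failure is meant to answer. A real system would also need to scope the experiment to a limited set of requests and to distinguish injected failures from genuine ones.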

If that sounds confusing, don’t worry, we will explain it in more detail.

In this episode, Bhaskaran and Xiao describe their approach to resilience engineering at LinkedIn, including the engineering projects and the cultural changes that are required to build a resilient software architecture.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.