Linkedin Resilience with Bhaskaran Devaraj and Xiao Li

How do you build resilient, failure tested systems? Redundancy, backups, and testing are all important. But there is also an increasing trend towards chaos engineering–the technique of inducing controlled failures in order to prove that a system is fault tolerant in the way that you expect.

In last week’s episode with Kolton Andrus, we discussed one way to build chaos engineering as a routine part of testing a distributed system. Kolton discussed his company Gremlin, which injects failures by spinning up a Gremlin container and having that container induce network failures, memory errors, and filled up disks. In this episode, we explore another insertion point for testing controlled failures, this time from the point of view of Linkedin.

Linkedin is a social network for working professionals. As Linkedin has grown, the increased number of services has led to more interdependency between those services. The more dependencies a given service has, the more partial failure cases there are. That’s not to say there is anything wrong with having a lot of service dependencies–this is just the way we build modern applications. But it does suggest that we should try to test the failures that can emerge from so many dependencies.

Bhaskaran Devaraj and Xiao Li are engineers at Linkedin, and are working on a project called Waterbear, with the goal of making the infrastructure more resilient.

Linkedin’s backend system consists of a large distributed application with thousands of microservices communicating between each other. Most of those services communicate over Rest.li, a proxy for standardizing interactions between services. Rest.li can assist with routing, AB testing, circuit breaking, and other aspects of service-to-service communication. This proxy can also be used for executing controlled failures. As services are communicating with each other, creating a controlled failure can be as simple as telling your proxy not to send traffic to downstream services.

If that sounds confusing, don’t worry, we will explain it in more detail.

In this episode, Bhaskaran and Xiao describe their approach to resilience engineering at Linkedin–including the engineering projects and the cultural changes that are required to build a resilient software architecture.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


Your company needs to build a new app, but you don’t have the spare engineering resources. There are some technical people in your company who have time to build apps–but they are not engineers. OutSystems is a platform for building low-code apps. As an enterprise grows, it needs more and more apps to support different types of customers and internal employee use cases. OutSystems has everything that you need to build, release, and update your apps without needing an expert engineer. And if you are an engineer, you will be massively productive with OutSystems. Find out how to get started with low-code apps today–at OutSystems.com/sedaily. There are videos showing how to use the OutSystems development platform, and testimonials from enterprises like FICO, Mercedes Benz, and SafeWay. OutSystems enables you to quickly build web and mobile applications–whether you are an engineer or not. Check out how to build low-code apps by going to OutSystems.com/sedaily.


Failure is unpredictable. You don’t know when your system will break, but you know it will happen. Gremlin prepares for these outages. We provide resilience as a service, using chaos engineering techniques pioneered at Netflix and Amazon. Prepare your team for disaster by proactively testing failure scenarios. Max out CPU, blackhole or slow down network traffic to a dependency, terminate processes and hosts. Each of these show you how your system reacts, allowing you to harden things before a production incident. Check out Gremlin and get a free demo by going to gremlin.com/sedaily.


Azure Container Service simplifies the deployment, management and operations of Kubernetes. Eliminate the complicated planning and deployment of fully orchestrated containerized applications with Kubernetes. You can quickly provision clusters to be up and running in no time, while simplifying your monitoring and cluster management through auto upgrades and a built-in operations console. Avoid being locked into any one vendor or resource. You can continue to work with the tools you already know, such as Helm, and move applications to any Kubernetes deployment. Integrate with your choice of container registry, including Azure Container Registry. Also, quickly and efficiently scale to maximize your resource utilization without having to take your applications offline. Isolate your application from infrastructure failures and transparently scale the underlying infrastructure to meet growing demands—all while increasing the security, reliability, and availability of critical business workloads with Azure. Check out the Azure Container Service at aka.ms/acs.