Service Mesh with William Morgan

Containers make it easier for engineers to deploy software. Orchestration systems like Kubernetes make it easier to manage and scale the containers that run those services. This Kubernetes-powered container infrastructure is often called “cloud native.”

On Software Engineering Daily, we have been exploring cloud native software to get a complete picture of the problems in the space, and the projects that are being worked on as solutions.

One area of interest: how should services communicate with each other? What should be standardized? How can you make it easy to identify problems and avoid cascading failures? One solution is the service mesh, a tool that allows services to communicate with each other more safely and effectively.
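That failure-containment idea can be sketched independently of any particular mesh. A proxy sitting beside each service can stop calling a dependency that keeps failing, so one broken service does not drag down every caller. Below is a toy circuit breaker in Python; the `CircuitBreaker` class, its thresholds, and its timings are hypothetical illustrations of the pattern, not Linkerd’s API:

```python
# Toy circuit breaker: after too many consecutive failures, stop
# calling the downstream service for a cool-off period instead of
# letting every caller pile on (a cascading failure).
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast without touching the struggling service.
                raise RuntimeError("circuit open: failing fast")
            # Cool-off elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result
```

After `max_failures` consecutive errors the breaker “opens” and fails fast for `reset_after` seconds, shedding load from the struggling service instead of amplifying the outage.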

William Morgan is an engineer who helped scale Twitter in its early days, when the company was dealing with frequent outages. He was on the show previously to discuss scaling Twitter, and in today’s episode we dig into the company he now runs, Buoyant, where he is building a service mesh called Linkerd.

Software Engineering Daily is looking for sponsors for Q3. If your company has a product or service, or if you are hiring, Software Engineering Daily reaches 23,000 developers who listen daily. Send me an email:

Show Notes

Scaling Twitter

What’s a Service Mesh and Why Do I Need One?

Buoyant is hiring: email



Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Deep learning promises to dramatically improve how our world works. To make deep learning easier and faster, we need new kinds of hardware and software, which is why Intel acquired Nervana Systems, a platform for deep learning. Intel Nervana is hiring engineers to help develop a full stack for AI, from chip design to software frameworks. Go to to apply for a job at Intel Nervana. If you don’t know much about the company, check out the interviews I have conducted with its engineers. You can find these at

Oracle Dyn provides DNS that is as dynamic and intelligent as your applications. Dyn DNS gets your users to the right cloud service, CDN, or data center, using intelligent response to steer traffic based on business policies, as well as real-time internet conditions, like the security and performance of the network path. Get started with a free 30-day trial for your application by going to  After the free trial, Dyn’s developer plans start at just $7 a month for world-class DNS. Rethink DNS. Go to to learn more and get your free trial of Dyn DNS.

Datadog brings you visibility into every part of your infrastructure, plus APM for monitoring your application’s performance. Dashboarding, collaboration tools, and alerts let you develop your own workflow for observability and incident response. Datadog integrates seamlessly with all of your apps and systems, from Slack to Amazon Web Services, so you can get visibility in minutes. Go to to get started with Datadog and get a free t-shirt.

  • Matt B

    This was a great episode, thank you both!

    I have to weigh in on the Google “cosmic ray” controversy. William’s take on that story cracked me up. But while I’ve heard some lame excuses in my life, I’ve never met a developer brazen enough to blame a bug on cosmic rays. Based on the Google story in a previous episode, I agree the “cosmic ray” theory is bogus, but it’s the chip maker that’s blowing smoke.

    First, some software entomology:
    SOFT BUG: A one-time failure. Soft bugs are wonderful because you can just blame the user and move on.
    HARD BUG: A repeatable failure. Hard bugs are OK because you can always track them down.
    FLAKY BUG: An intermittent failure. Flaky bugs suck.

    Google had a service that would run most of the time but occasionally crashed. They tracked it down to a single-bit error in one particular CPU core. Every time the service was assigned to that core, it would crash. That is a flaky bug.

    Flaky bugs are repeatable, so they are hard, not soft. Cosmic rays only cause soft bugs. The worst thing a cosmic ray can do is flip a bit in a register. The whole system could crash, but on reboot there will be no trace of the problem.
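The soft/hard distinction above can be made concrete with a small sketch (Python, illustrative only; the function names and bit positions are arbitrary): a cosmic-ray upset corrupts a value once and vanishes when the value is recomputed, while a defective circuit produces the same wrong answer on every pass.

```python
import random

def cosmic_ray(value, width=64):
    """Soft, one-time upset: XOR a random single bit.
    The corruption lives in the data, not the hardware,
    so recomputing the value leaves no trace."""
    return value ^ (1 << random.randrange(width))

def defective_core(value, stuck=17):
    """Hard fault: the same bit is forced high on every pass
    through the bad circuit, so any work scheduled onto this
    core gets the same wrong answer, every time."""
    return value | (1 << stuck)

x = 0xDEADBEEF
assert defective_core(x) == defective_core(x)  # repeatable: a hard (flaky-looking) bug
assert bin(cosmic_ray(x) ^ x).count("1") == 1  # exactly one bit differs, once
```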

    Google had a defective chip. The manufacturer helped them diagnose it, but it sounds like they also concocted the “cosmic ray” story. Given that the majority of chips never work at all, it’s hard to blame a manufacturer for shipping a bad one every now and then. But the cosmic ray story shifts the blame, it sounds cool, and if the customer buys it, why not?

  • Mike Wilcox

    Great job on the mesh topic. I am not sure about the Amazon vs. Google discussion, but my interest was piqued for further investigation, particularly since parts of IBM are going the way of Kube and Istio/Amalgam8.

    IMHO, successful instrumentation of an application as it flows around the mesh will drive adoption in the enterprise-scale world. We need to focus on ‘over-instrumentation’ of the mesh; otherwise, as mentioned in the podcast, capacity planning and debugging will be a challenge and the mesh will not be deemed ‘reliable’ for serious workloads.

    Another aspect is repeatability. Can we repeat the problem/scenario, or are we lacking enough instrumentation and ‘stack knowledge’ to do so? Note that to be successful in many industries today, particularly healthcare, you must be able to show and repeat the path to the outcome. Think about a health record being changed due to a change in the LPR (longitudinal patient record), which drives a new diagnosis, which drives a new treatment. If your app runs around the mesh and gives a different answer each time it is run, that is a problem. Worst of all, healthcare will require the application to be run at a point in time in the past, e.g., the clinician or auditor may say, “Run the treatment plan as it was run 5 days ago.” Maybe there are tools being built on top of the mesh to capture all the information about the ‘stack’ and data at a specific time to meet this requirement?