Service Mesh with William Morgan

Containers make it easier for engineers to deploy software. Orchestration systems like Kubernetes make it easier to manage and scale the different containers that contain services. The popular container infrastructure powered by Kubernetes is often called “cloud native.”

On Software Engineering Daily, we have been exploring cloud native software to get a complete picture of the problems in the space, and the projects that are being worked on as solutions.

One area of interest: how should services communicate with each other? What should be standardized? How can you make it easy to identify problems and avoid cascading failures? One solution is the service mesh, a tool that allows services to communicate with each other more safely and effectively.

William Morgan was an engineer who helped scale Twitter in the early days when the company was dealing with lots of outages. He was on the show previously to discuss scaling Twitter, and in today’s episode we discuss Buoyant, the company he runs, where he is building a service mesh called Linkerd.

Software Engineering Daily is looking for sponsors for Q3. If your company has a product or service, or if you are hiring, Software Engineering Daily reaches 23,000 developers listening daily. Send me an email: jeff@softwareengineeringdaily.com

Show Notes

Scaling Twitter

What’s a Service Mesh and Why Do I Need One?

Buoyant is hiring: email william@buoyant.io

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


Deep learning promises to dramatically improve how our world works. To make deep learning easier and faster, we need new kinds of hardware and software–which is why Intel acquired Nervana Systems, a platform for deep learning. Intel Nervana is hiring engineers to help develop a full stack for AI, from chip design to software frameworks. Go to softwareengineeringdaily.com/intel to apply for a job at Intel Nervana. If you don’t know much about the company, check out the interviews I have conducted with engineers from the company. You can find these at softwareengineeringdaily.com/intel.


Oracle Dyn provides DNS that is as dynamic and intelligent as your applications. Dyn DNS gets your users to the right cloud service, CDN, or data center, using intelligent response to steer traffic based on business policies, as well as real-time internet conditions, like the security and performance of the network path. Get started with a free 30-day trial for your application by going to dyn.com/sedaily. After the free trial, Dyn’s developer plans start at just $7 a month for world-class DNS. Rethink DNS. Go to dyn.com/sedaily to learn more and get your free trial of Dyn DNS.


Datadog brings you visibility into every part of your infrastructure, plus APM for monitoring your application’s performance. Dashboarding, collaboration tools, and alerts let you develop your own workflow for observability and incident response. Datadog integrates seamlessly with all of your apps and systems, from Slack to Amazon Web Services, so you can get visibility in minutes. Go to softwareengineeringdaily.com/datadog to get started with Datadog and get a free t-shirt.

  • Matt B

    This was a great episode, thank you both!

    I have to weigh in on the Google “cosmic ray” controversy. William’s take on that story cracked me up. But while I’ve heard some lame excuses in my life, I’ve never met a developer brazen enough to blame a bug on cosmic rays. Based on the Google story in a previous episode I agree the “cosmic ray” theory is bogus, but it’s the chip maker that’s blowing smoke.

    First, some software entomology:
    SOFT BUG: A one-time failure. Soft bugs are wonderful because you can just blame the user and move on.
    HARD BUG: A repeatable failure. Hard bugs are OK because you can always track them down.
    FLAKY BUG: An intermittent failure. Flaky bugs suck.

    Google had a service that would run most of the time but occasionally crashed. They tracked it down to a single-bit error in one particular CPU core. Every time the service was assigned to that core, it would crash. That is a flaky bug.

    Flaky bugs are repeatable so they are hard, not soft. Cosmic rays only cause soft bugs. The worst thing a cosmic ray can do is flip a bit in a register. The whole system could crash, but on reboot there will be no trace of the problem. https://en.wikipedia.org/wiki/Soft_error

    Google had a defective chip. The manufacturer helped them diagnose it, but it sounds like they also concocted the “cosmic ray” story. Given that the majority of chips never work at all it’s hard to blame a manufacturer for shipping a bad one every now and then. But the cosmic ray story shifts the blame, it sounds cool, and if the customer buys it, why not?