Containers make it easier for engineers to deploy software. Orchestration systems like Kubernetes make it easier to manage and scale the different containers that contain services. The popular container infrastructure powered by Kubernetes is often called “cloud native.”
On Software Engineering Daily, we have been exploring cloud native software to get a complete picture of the problems in the space, and the projects that are being worked on as solutions.
One area of interest: how should services communicate with each other? What should be standardized? How can you make it easy to identify problems and avoid cascading failures? One solution is the service mesh, a tool that allows services to communicate with each other more safely and effectively.
William Morgan was an engineer who helped scale Twitter in the early days when the company was dealing with lots of outages. He was on the show previously to discuss scaling Twitter, and in today’s episode we go into the company that he is running, Buoyant, where he works on building a service mesh called Linkerd.
Software Engineering Daily is looking for sponsors for Q3. If your company has a product or service, or if you are hiring, Software Engineering Daily reaches 23,000 developers listening daily. Send me an email: firstname.lastname@example.org
What’s a Service Mesh and Why Do I Need One?
Buoyant is hiring: email email@example.com
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.
This was a great episode, thank you both!
I have to weigh in on the Google “cosmic ray” controversy. William’s take on that story cracked me up. But while I’ve heard some lame excuses in my life, I’ve never met a developer brazen enough to blame a bug on cosmic rays. Based on the Google story in a previous episode, I agree the “cosmic ray” theory is bogus, but it’s the chip maker that’s blowing smoke.
First, some software entomology:
SOFT BUG: A one-time failure. Soft bugs are wonderful because you can just blame the user and move on.
HARD BUG: A repeatable failure. Hard bugs are OK because you can always track them down.
FLAKY BUG: An intermittent failure. Flaky bugs suck.
Google had a service that would run most of the time but occasionally crashed. They tracked it down to a single-bit error in one particular CPU core. Every time the service was assigned to that core, it would crash. That is a flaky bug.
This flaky bug was repeatable on that one core, so underneath it was a hard bug, not a soft one. Cosmic rays only cause soft bugs. The worst thing a cosmic ray can do is flip a bit in a register. The whole system could crash, but on reboot there will be no trace of the problem. https://en.wikipedia.org/wiki/Soft_error
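To make the distinction concrete, here is a minimal sketch of how you might sort crashes into "hard" and "soft" suspects. It assumes a hypothetical crash log that records which core each crash happened on, plus a count of how many times the service ran on each core. The function name and data layout are made up for illustration and have nothing to do with Google's actual tooling.

```python
from collections import Counter


def classify_crashes(crash_cores, runs_per_core):
    """Label each core that saw a crash as a 'hard' or 'soft' fault suspect.

    crash_cores: list of core ids, one entry per observed crash (hypothetical log).
    runs_per_core: dict mapping core id -> number of times the service ran there.
    """
    crashes = Counter(crash_cores)
    verdicts = {}
    for core, n_crashes in crashes.items():
        n_runs = runs_per_core[core]
        # A defective core fails every time it is used. A transient bit flip
        # shows up once and does not recur on later runs of the same core.
        if n_runs > 1 and n_crashes == n_runs:
            verdicts[core] = "hard"
        else:
            verdicts[core] = "soft"
    return verdicts


# Illustrative data only: core 2 crashed on all 5 of its runs, core 1 crashed once in 50.
runs = {0: 50, 1: 50, 2: 5, 3: 50}
crash_log = [2, 2, 2, 2, 2, 1]
print(classify_crashes(crash_log, runs))
```

Seen from the whole fleet the service only crashes occasionally, which is why it looks flaky. Grouped by core, the failure on core 2 is perfectly repeatable, and that is the signature of bad hardware rather than a stray particle.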
Google had a defective chip. The manufacturer helped them diagnose it, but it sounds like they also concocted the “cosmic ray” story. Given that the majority of chips never work at all, it’s hard to blame a manufacturer for shipping a bad one every now and then. But the cosmic ray story shifts the blame, it sounds cool, and if the customer buys it, why not?