Containers make it easier for engineers to deploy software. Orchestration systems like Kubernetes make it easier to manage and scale the many containers that services run in. Container infrastructure powered by Kubernetes is often called “cloud native.”
On Software Engineering Daily, we have been exploring cloud native software to get a complete picture of the problems in the space, and the projects that are being worked on as solutions.
One area of interest: how should services communicate with each other? What should be standardized? How can you make it easy to identify problems and avoid cascading failures? One solution is the service mesh, a tool that allows services to communicate with each other more safely and effectively.
William Morgan was an engineer who helped scale Twitter in the early days, when the company was dealing with frequent outages. He was on the show previously to discuss scaling Twitter, and in today’s episode we dig into Buoyant, the company he runs, where he is building a service mesh called Linkerd.
Software Engineering Daily is looking for sponsors for Q3. If your company has a product or service, or if you are hiring, Software Engineering Daily reaches 23,000 developers listening daily. Send me an email: firstname.lastname@example.org
What’s a Service Mesh and Why Do I Need One?
Buoyant is hiring: email email@example.com
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.
This was a great episode, thank you both!
I have to weigh in on the Google “cosmic ray” controversy. William’s take on that story cracked me up. While I’ve heard some lame excuses in my life, I’ve never met a developer brazen enough to blame a bug on cosmic rays. Based on the Google story from a previous episode, I agree the “cosmic ray” theory is bogus, but I think it’s the chip maker, not the developers, that’s blowing smoke.
First, some software entomology:
SOFT BUG: A one-time failure. Soft bugs are wonderful because you can just blame the user and move on.
HARD BUG: A repeatable failure. Hard bugs are OK because you can always track them down.
FLAKY BUG: An intermittent failure. Flaky bugs suck.
Google had a service that would run most of the time but occasionally crashed. They tracked it down to a single-bit error in one particular CPU core. Every time the service was assigned to that core, it would crash. That is a flaky bug.
A flaky bug like this is still repeatable given the right conditions, so it is hard, not soft. Cosmic rays only cause soft bugs. The worst thing a cosmic ray can do is flip a bit in a register. The whole system could crash, but on reboot there will be no trace of the problem. https://en.wikipedia.org/wiki/Soft_error
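A toy sketch of what a single-event upset amounts to (the function name is mine, purely for illustration): a cosmic ray flips exactly one bit of a value, which is the same as XOR-ing with a one-bit mask. Recomputing from the original source wipes out all trace of it, which is why such errors are soft.

```python
def flip_bit(value: int, bit: int) -> int:
    """Simulate a single-event upset: XOR with a one-bit mask flips exactly one bit."""
    return value ^ (1 << bit)

# 42 is 0b101010; flipping bit 3 silently turns it into 0b100010 = 34.
corrupted = flip_bit(42, 3)
assert corrupted == 34
# The corruption is transient: nothing in the hardware is broken,
# so the same computation redone from scratch gives the right answer again.
assert flip_bit(corrupted, 3) == 42
```

Google’s stuck bit was the opposite case: the same wrong result every time the service landed on that core, which is exactly what a transient upset cannot produce.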
Google had a defective chip. The manufacturer helped them diagnose it, but it sounds like they also concocted the “cosmic ray” story. Given that the majority of dies never work at all, it’s hard to blame a manufacturer for shipping a bad chip every now and then. But the cosmic ray story shifts the blame, it sounds cool, and if the customer buys it, why not?
Great job on the mesh topic. I am not sure about the Amazon vs. Google discussion, but my interest was piqued for further investigation, particularly since parts of IBM are going the way of Kube and Istio/Amalgam8.
IMHO, successful instrumentation of an application as it flows around the mesh will drive adoption in the enterprise-scale world. We need to focus on ‘over-instrumentation’ of the mesh; otherwise, as mentioned in the podcast, capacity planning and debugging will be a challenge and the mesh will not be deemed ‘reliable’ for serious workloads.
Another aspect is repeatability. Can we repeat the problem/scenario, or are we lacking enough instrumentation and ‘stack knowledge’ to do so? Note that to be successful in many industries today, particularly healthcare, you must be able to show and repeat the path to the outcome. Think about a health record being changed due to a change in the LPR (longitudinal patient record), which drives a new diagnosis, which drives a new treatment. If your app runs around the mesh and gives a different answer each time it is run, that is a problem. And worst of all, healthcare will require the application to be run at a point in time in the past, e.g. the clinician or auditor may say, run the treatment plan as it was run 5 days ago. Maybe there are tools being built on top of the mesh to capture all the information about the ‘stack’ and data at a specific time to meet this requirement?