High Volume Distributed Tracing with Ben Sigelman

You are requesting a car from a ridesharing service such as Lyft.

Your request hits the Lyft servers, which begin trying to find you a car. They take your geolocation and pass it to a service that finds nearby cars and puts them into a list. That list is sent to another service, which sorts the cars by how close they are to you and how high their star ratings are. Finally, your car is selected and sent back to your phone in a response from the server.

In a “microservices” environment, multiple services often work together to accomplish a user task. In the example I just gave, one service took geolocation and turned it into a list, another service took a list and sorted it, and another service sent the actual response back to the user.

This is a common pattern: Service A calls service B, which calls service C, and so on. When one of those services fails along the way, how do you identify which one it was? When one of those services is slow to respond, how do you know where your latency is coming from?

The solution is distributed tracing. To implement distributed tracing, each user-level request gets a unique request identifier associated with it. When service A calls service B, it also hands off the request ID, so that the overall request can be traced as it passes through the distributed system (and if that doesn’t make sense, don’t worry: we explain it again during the show).
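
To make the hand-off concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the sample cars, and the in-memory LOG list are all hypothetical, and the "services" are plain functions in one process. In a real system each would be a separate process, and the request ID would typically travel in a request header.

```python
import uuid

# Hypothetical stand-in for each service's log output:
# a list of (request_id, service_name) records.
LOG = []

def find_nearby_cars(request_id, location):
    """Pretend 'service B': returns a hard-coded list of nearby cars."""
    LOG.append((request_id, "find_nearby_cars"))
    return [
        {"car": "A", "distance_km": 2.0, "stars": 4.9},
        {"car": "B", "distance_km": 0.5, "stars": 4.7},
    ]

def rank_cars(request_id, cars):
    """Pretend 'service C': closest car first, higher rating breaks ties."""
    LOG.append((request_id, "rank_cars"))
    return sorted(cars, key=lambda c: (c["distance_km"], -c["stars"]))

def handle_ride_request(location):
    """Pretend 'service A': creates the request ID and hands it to every call."""
    request_id = uuid.uuid4().hex
    LOG.append((request_id, "handle_ride_request"))
    cars = find_nearby_cars(request_id, location)
    best = rank_cars(request_id, cars)[0]
    return request_id, best
```

Because every record in LOG carries the same request ID, filtering on that ID reconstructs the path a single user request took through all three functions.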

Ben Sigelman began working on distributed tracing when he was at Google, where he co-authored the “Dapper” paper. Dapper was implemented at Google to help debug some of the distributed systems problems faced by the engineers who work on Google infrastructure.

A request that moves through several different services spends time being processed in each of them. A distributed tracing system measures the time spent in each service, and each of those timed segments is called a span. A single request that has to hit 20 different services will have at least 20 spans associated with it. Those spans get collected into a trace. A trace can be evaluated to look at the latencies of each of those services.
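
The span and trace vocabulary can be sketched in a few lines of Python. This is an assumption-laden toy, not how Dapper or Lightstep store their data: the Span and Trace classes, their field names, and the slowest() helper are invented here for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    """Time spent in one service while handling one request (hypothetical shape)."""
    service: str
    start: float
    end: float

    @property
    def duration(self):
        return self.end - self.start

@dataclass
class Trace:
    """All spans recorded for a single request ID."""
    request_id: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, service):
        # Record a span around whatever work runs inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append(Span(service, start, time.perf_counter()))

    def slowest(self):
        # The span with the largest duration is the first place to look for latency.
        return max(self.spans, key=lambda s: s.duration)

if __name__ == "__main__":
    trace = Trace("req-1")
    with trace.span("find_nearby_cars"):
        time.sleep(0.01)  # simulated work
    with trace.span("rank_cars"):
        time.sleep(0.05)  # simulated slower work
    print(trace.slowest().service, round(trace.slowest().duration, 3))
```

Asking a trace for its slowest span is the simplest version of "where is my latency coming from?"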

If you are trying to improve the speed of a distributed systems infrastructure, distributed tracing can be very helpful for choosing where to focus your attention.

The Google papers published ten years ago often turn out to be the basis for the companies of today. Some examples include MapReduce (which inspired Hadoop, the basis of Cloudera), Spanner (which inspired CockroachDB), and Dremel (which inspired Dremio).

Today, a decade after he started thinking about distributed tracing, Ben Sigelman is the CEO of Lightstep, a company that provides distributed tracing and other monitoring technologies.

Lightstep’s distributed tracing model still resembles the techniques described in the Dapper paper, so I was eager to learn the differences between open source versions of distributed tracing (such as OpenZipkin) and enterprise providers such as Lightstep.

The key feature of Lightstep that we discussed: garbage collection.

If you are using a distributed tracing system, you could be collecting a lot of traces. You could collect a trace for every single user request. Not all of these traces are useful, but some of them are very useful. Maybe you only want to keep traces with exceptionally high latency. Maybe you want to keep every trace from the last 5 days and delete older ones over time. So the question of how to manage the storage footprint of those traces was as interesting as the discussion of distributed tracing itself.
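
As a rough sketch of what such a retention rule could look like, the Python below keeps a trace if it is recent or if it was unusually slow. This is not Lightstep's actual policy: the StoredTrace class, the 1000 ms "slow" threshold, and the function names are assumptions made up for this example; only the 5-day window comes from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class StoredTrace:
    """Hypothetical record of a finished trace."""
    request_id: str
    finished_at: datetime
    latency_ms: float

def should_keep(trace, now, max_age=timedelta(days=5), slow_ms=1000.0):
    """Keep a trace if it is recent OR exceptionally slow (assumed thresholds)."""
    recent = now - trace.finished_at <= max_age
    slow = trace.latency_ms >= slow_ms
    return recent or slow

def collect_garbage(traces, now):
    """Return only the traces worth keeping; everything else is dropped."""
    return [t for t in traces if should_keep(t, now)]

if __name__ == "__main__":
    now = datetime(2018, 1, 10)
    traces = [
        StoredTrace("recent-fast", datetime(2018, 1, 8), 50.0),
        StoredTrace("old-fast", datetime(2018, 1, 1), 50.0),
        StoredTrace("old-slow", datetime(2018, 1, 1), 5000.0),
    ]
    print([t.request_id for t in collect_garbage(traces, now)])
```

The design choice here is that age and latency are independent reasons to keep a trace, so an old but very slow trace survives while an old, ordinary one is discarded.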

Beyond the distributed tracing features of his product, Ben has a vision for how his company can provide other observability tools over time. I spoke to Ben at KubeCon, and although this conversation does not cover Kubernetes specifically, the topic is undoubtedly interesting to people who are building Kubernetes technologies.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

