High Volume Distributed Tracing with Ben Sigelman

You are requesting a car from a ridesharing service such as Lyft.

Your request hits the Lyft servers and begins trying to get you a car. It takes your geolocation, and passes the geolocation to a service that finds cars that are nearby, and puts all those cars into a list. The list of nearby cars is sent to another service, which sorts the list of cars by how close they are to you, and how high their star rating is. Finally, your car is selected, and sent back to your phone in a response from the server.

In a “microservices” environment, multiple services often work together to accomplish a user task. In the example I just gave, one service took a geolocation and turned it into a list, another service took a list and sorted it, and another service sent the actual response back to the user.

This is a common pattern: Service A calls service B, which calls service C, and so on.  When one of those services fails along the way, how do you identify which one it was? When one of those services fails to deliver a response quickly, how do you know where your latency is coming from?

The solution is distributed tracing. To implement distributed tracing, each user level request gets a request identifier associated with it. When service A calls service B, it also hands off the unique request ID, so that the overall request can be traced as it passes through the distributed system (and if that doesn’t make sense–don’t worry, we explain it again during the show).

Ben Sigelman began working on distributed tracing when he was at Google and authored the “Dapper” paper. Dapper was implemented at Google to help debug some of the distributed systems problems faced by the engineers who work on Google infrastructure.

A request that moves through several different services spends time processing on each of those services. A distributed tracing system measures the time spent in each of those services–that time spent is called a span. A single request that has to hit 20 different services will have 20 spans associated with it. Those spans get collected into a trace. A trace can be evaluated to look at the latencies of each of those services.

If you are trying to improve the speed of a distributed systems infrastructure, distributed tracing can be very helpful for choosing where to focus your attention.

The published Google papers of ten years ago often turn out to be the companies of today. Some examples include MapReduce (which formed the basis of Cloudera), Spanner (which formed the basis of CockroachDB), and Dremel (which formed the basis of Dremio).

Today, a decade after he started thinking about distributed tracing, Ben Sigelman is the CEO of Lightstep, a company that provides distributed tracing and other monitoring technologies.

Lightstep’s distributed tracing model still bears a resemblance to the same techniques described in the paper–so I was eager to learn the differences between open source versions of distributed tracing (such as OpenZipkin) and enterprise providers such as Lightstep.

The key feature of Lightstep that we discussed: garbage collection.

If you are using a distributed tracing system, you could be collecting a lot of traces. You could collect a trace for every single user request. Not all of these traces are useful–but some of them are very useful. Maybe you only want to keep track of traces that take an exceptionally long latency. Maybe you want to keep every trace in the last 5 days, and destroy them over time. So, the question of how to manage the storage footprint of those traces was as interesting as the discussion of distributed tracing itself.

Beyond the distributed tracing features of his product, Ben has a vision for how his company can provide other observability tools over time. I spoke to Ben at Kubecon–and although this conversation does not talk about Kubernetes specifically, this topic is undoubtedly interesting to people who are building Kubernetes technologies.


Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Today’s episode of Software Engineering Daily is sponsored by Datadog. With infrastructure monitoring, distributed tracing, and now logging, Datadog provides end-to-end visibility into the health and performance of modern applications. Datadog’s distributed tracing and APM generates detailed flame graphs from real requests, enabling you to visualize how requests propagate through your distributed infrastructure. See which services or calls are generating errors or contributing to overall latency, so you can troubleshoot faster or identify opportunities for performance optimization. Start monitoring your applications with a free trial and Datadog will send you a free T-shirt! softwareengineeringdaily.com/datadog.

Azure Container Service simplifies the deployment, management and operations of Kubernetes. Eliminate the complicated planning and deployment of fully orchestrated containerized applications with Kubernetes. You can quickly provision clusters to be up and running in no time, while simplifying your monitoring and cluster management through auto upgrades and a built-in operations console. Avoid being locked into any one vendor or resource. You can continue to work with the tools you already know, such as Helm, and move applications to any Kubernetes deployment. Integrate with your choice of container registry, including Azure Container Registry. Also, quickly and efficiently scale to maximize your resource utilization without having to take your applications offline. Isolate your application from infrastructure failures and transparently scale the underlying infrastructure to meet growing demands—all while increasing the security, reliability, and availability of critical business workloads with Azure. Check out the Azure Container Service at aka.ms/sedaily.

Simplify continuous delivery with GoCD, the on-premise, open source, continuous delivery tool by ThoughtWorks. With GoCD, you can easily model complex deployment workflows using pipelines and visualize them end-to-end with the Value Stream Map. You get complete visibility into and control of your company’s deployments. At gocd.org/sedaily, find out how to bring continuous delivery to your teams. Say goodbye to deployment panic and hello to consistent, predictable deliveries. Visit gocd.org/sedaily to learn more about GoCD. Commercial support and enterprise add-ons, including disaster recovery, are available.

Software Weekly

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.