Why Distributed Tracing is Essential for Performance and Reliability
Daniel “Spoons” Spoonhower, CTO and Co-founder, Lightstep
Author of Distributed Tracing in Practice (O’Reilly Media, 2020)
New software architectures like microservices — alongside agile practices like DevOps — were supposed to accelerate application development. But in many software engineering organizations, they’ve failed to deliver on that promise, leaving teams to sink more and more time into trying to understand and address degrading performance and reliability.
Microservices and DevOps can enable teams to make decisions more independently, but of course their services still depend on one another, often in dynamic ways that no one expects. So while teams can make their own choices about which features to build and when to deploy, they still need a shared perspective on what’s happening in the application as a whole.
Distributed tracing can provide that perspective. By showing how each request passes through the application, tracing provides the context necessary for developers to understand the connections between cause and effect. And by leveraging that understanding, developers can quickly validate releases, choose the right optimizations, and find the root causes of production incidents.
But what is distributed tracing? For many, it sounds like yet another tool to add to their production tool belt. While in some cases it is just another tool, fundamentally tracing is about adding a new data type to the way you observe your application. In some ways, traces are like metrics and logs. But traces are different because they encode causal relationships between disparate parts of your application: they enable analysis across services. And distributed tracing is more than just collecting traces: the real value of distributed tracing lies in analyzing those traces and deriving a holistic understanding of what’s happening across the application.
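To make the idea of a trace as a data type more concrete, here is a minimal sketch of how causal relationships might be encoded. It assumes the common model in which a trace is a set of spans and each span records the ID of its parent. The `Span` class, its field names, and the example services are hypothetical and are not taken from any particular tracing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """One unit of work in one service (hypothetical, simplified shape)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: int
    end_ms: int


def causal_chain(spans: List[Span], span_id: str) -> List[str]:
    """Walk parent links from a span up to the root, returning service names."""
    by_id = {s.span_id: s for s in spans}
    chain = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.service)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return chain


# Illustrative trace: a web request calls an API, which calls payments.
trace = [
    Span("t1", "a", None, "web", "GET /checkout", 0, 120),
    Span("t1", "b", "a", "api", "create_order", 5, 110),
    Span("t1", "c", "b", "payments", "charge", 20, 100),
]

print(causal_chain(trace, "c"))
```

The parent links are what separate this from a pile of per-service logs: starting from the payments span, you can recover which upstream requests caused it.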
Accelerating developer velocity
While developer velocity is often cited as the reason for adopting microservices, organizations can only reap that benefit if they roll out distributed tracing as part of that architectural change. That’s because nearly any action that one team takes will affect other teams across the organization. Without tracing, teams will spend more and more time tracking down those effects.
For example, when a team deploys a new version of a service, verifying the success of that deployment means understanding how it has affected other services. Teams can take a wait-and-see approach, but it’s much more efficient to use tracing to immediately isolate those effects: that’s less time staring at dashboards and fewer interruptions for affected teams.
Understanding cross-team impact is also important during incident response, and tracing plays a key role in root cause analysis as well. One example of this occurred at Lightstep recently when developers noticed a small uptick in errors from our API service. They used tracing to quickly analyze logs from that service as well as all of its dependencies and to compare requests that returned errors with those that did not. That analysis quickly surfaced logs that showed that an upstream service was having issues decoding certain requests and that those issues correlated with user-visible errors.
The alternative (without tracing) would have been grepping through logs and looking for anomalies, not only for the API service but for all of its (many) dependencies. This style of manual analysis is slow because it requires humans to enumerate every potential cause — every hypothesis — and to check each one in turn. Tracing can immediately identify which service and which logs are relevant.
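The comparison described above, between requests that returned errors and those that did not, can be sketched roughly as follows. This is an illustration only, not Lightstep's implementation. It assumes each trace has already been reduced to a `failed` flag and a set of `events` (for example, log messages keyed by service); both the shape and the event names are hypothetical.

```python
def rank_events(traces):
    """Rank events by how much more often they appear in failed traces."""
    failed = [t for t in traces if t["failed"]]
    ok = [t for t in traces if not t["failed"]]
    events = set().union(*(t["events"] for t in traces))

    def rate(group, event):
        # Fraction of traces in the group that contain the event.
        if not group:
            return 0.0
        return sum(event in t["events"] for t in group) / len(group)

    scores = {e: rate(failed, e) - rate(ok, e) for e in events}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative data: decode failures appear only in failed requests.
traces = [
    {"failed": True, "events": {"api:request", "decoder:decode_failure"}},
    {"failed": True, "events": {"api:request", "decoder:decode_failure", "cache:miss"}},
    {"failed": False, "events": {"api:request", "cache:miss"}},
    {"failed": False, "events": {"api:request"}},
]

top_event, top_score = rank_events(traces)[0]
print(top_event, top_score)
```

Events that occur equally often in both groups score near zero, so the analysis surfaces the one signal that actually separates failures from successes instead of asking a human to check each hypothesis in turn.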
Improving software performance
Improving software performance is also a frequent use case for distributed tracing and was, in fact, one of the initial use cases for tracing when it was developed and deployed at Google. Google used an early version of microservices as part of implementing its search functionality: each service was responsible for finding different kinds of results, including web results, news items, images, and products. Aggregate analysis of traces showed which service or services were having the biggest impact on user-perceived latency, and this analysis was then used by individual teams to drive optimization work.
Tracing can also be used to solve problems on smaller scales. For example, even within a team, tracing can be used to understand how requests are being fanned out across service instances. If this fanout is not well balanced (that is, if some instances are taking longer than others), then fanning out will do little to improve performance, because users will still be waiting for the slowest instance.
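The arithmetic behind the fanout point is simple enough to sketch. Assuming a request waits for every instance it fans out to, its latency is the maximum of the per-instance latencies, which a trace would expose as sibling span durations. The function names and the numbers below are made up for illustration.

```python
import statistics


def fanout_latency_ms(instance_latencies_ms):
    """A request that waits on every instance is as slow as the slowest one."""
    return max(instance_latencies_ms)


def imbalance(instance_latencies_ms):
    """Ratio of the slowest instance to the median; near 1.0 means balanced."""
    return max(instance_latencies_ms) / statistics.median(instance_latencies_ms)


balanced = [100, 105, 98, 102]  # work spread evenly across four instances
skewed = [40, 45, 42, 400]      # three fast instances, one straggler

print(fanout_latency_ms(balanced), fanout_latency_ms(skewed))
print(imbalance(balanced), imbalance(skewed))
```

In the skewed case, three of the four instances finish quickly, yet the user still waits 400 ms. Speeding up the fast instances further would change nothing, which is the kind of wasted optimization that per-request traces help avoid.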
In all cases, without tracing, it’s difficult or impossible to know whether or not an optimization will actually affect what end users are experiencing, which means that, without tracing, all that work might be for nothing.
Managing costs
Finally, distributed tracing can help manage costs within your organization. Of course, many of the use cases above can directly or indirectly impact costs. For example, improving developer velocity can obviate the need for an additional hire. Improving root cause analysis can reduce downtime and therefore save on lost revenue. And improving performance can often both improve user experience and reduce infrastructure costs. However, tracing can also help manage costs associated with observability itself.
Take log aggregation. From a developer’s perspective, more logs are always better. But most application logs are never looked at, and especially as services proliferate, the costs of storing and managing all of these logs can skyrocket.
Fortunately, distributed tracing can help. By associating logs with traces, we can use the sampling mechanisms already present in most tracing tools to also sample logs. But because tracing captures entire end-to-end requests (and biases toward slow or problematic ones), developers can be confident that they will still have the logs that they need. This means that logs can still be used to ensure application reliability (including the sorts of root cause analysis described above) but at a fraction of the cost of a typical log aggregation solution.
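As a rough sketch of this idea, the snippet below keeps a log line only if the trace it belongs to was sampled, and it always samples traces that were slow or contained an error. The threshold, the base sampling rate, and the record shapes are assumptions chosen for illustration; real tracing tools make this decision in their own ways.

```python
import random


def keep_trace(duration_ms, has_error, slow_threshold_ms=500,
               base_rate=0.01, rng=random.random):
    """Always keep slow or failed traces; keep a small share of the rest."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    return rng() < base_rate


def sample_logs(logs, traces, **sampling_options):
    """Keep only the logs attached to traces that were sampled."""
    kept = {
        trace_id
        for trace_id, info in traces.items()
        if keep_trace(info["duration_ms"], info["has_error"], **sampling_options)
    }
    return [log for log in logs if log["trace_id"] in kept]


# Hypothetical per-trace summaries and the logs associated with them.
traces = {
    "t1": {"duration_ms": 40, "has_error": False},
    "t2": {"duration_ms": 900, "has_error": False},
    "t3": {"duration_ms": 30, "has_error": True},
}
logs = [
    {"trace_id": "t1", "msg": "ok"},
    {"trace_id": "t2", "msg": "slow query"},
    {"trace_id": "t3", "msg": "decode failed"},
    {"trace_id": "t1", "msg": "done"},
]

# Passing rng=lambda: 1.0 disables random sampling so the result is repeatable.
print([log["msg"] for log in sample_logs(logs, traces, rng=lambda: 1.0)])
```

Because the decision is made per trace and not per log line, the logs that survive always form complete request histories, which is what keeps them useful for root cause analysis.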
Getting started with tracing
Even if you are convinced of the benefits of distributed tracing, getting started can still seem overwhelming at times. While I often hear talk of a tracing “migration,” I find it more useful to think of the process as a journey: an incremental process where at each step you will learn more about your software and your organization. Each of these lessons can, and should, be used to inform subsequent steps.
Like any new tool or process introduced into a DevOps organization, tracing must integrate with the current workflows and, above all, provide value to individual application development teams. If it doesn’t, adoption will prove to be a slow and arduous process. But if it does provide value, application developers will be your biggest distributed tracing champions!
Ultimately, application developers are most interested in delivering new and better experiences to their users. Understanding how distributed tracing can do that — through use cases like validating new releases, accelerating root cause analysis, guiding optimizations, and controlling costs — is key to any successful tracing implementation.