How To Remove the Observability Silos Between Frontend and Backend Engineers

Article Thursday, April 29 2021

For an application engineer facing a critical user-experience bug in a web browser or a mobile app, a Real User Monitoring (RUM) tool is often a key part of the resolution process. By collecting telemetry about the application from the perspective of the end user, RUM helps the engineer understand the major steps of the user journey before the issue occurred, especially for frontend-specific issues (JavaScript errors, image size, etc.). In many cases, however, their ability to diagnose issues at the interface between frontend and backend is limited, because information about the root cause resides in a separate Application Performance Monitoring (APM) product.

The steps needed to find that information (toggling between different products, trying to identify the relevant team, asking that team for further information) add up. The time it takes to connect RUM telemetry with APM telemetry is time that prolongs degraded customer experiences, leaves errors unfixed, and costs revenue. There’s a better way to handle these kinds of incidents, but it requires truly connecting the data from APM and RUM and breaking down the silos between the teams who use them.

Real User Monitoring Provides Insight Into the User’s Experience

RUM gives teams real-time insight into how users are experiencing browser and mobile applications. This is especially valuable to engineers working on the client side, who need to know when aspects of the customer’s experience aren’t working as intended. Key details include how long the user waited for a page to load, the user’s location, device type, browser, OS, application version, and URL, as well as details about errors and crashes. RUM goes beyond collecting and processing this data: it also provides a dedicated UI that makes the data easy to access and explicit enough to guide exploration. This powerful tool is limited, however, to the data that can be collected on the client side. Because applications query backend services through various publicly exposed APIs, an important part of the potential solution is missing if the information RUM surfaces isn’t correlated with backend data.

APM Tracks Request Propagation, Latency, and Errors Across Microservices

APM, and specifically distributed tracing, provides visibility into the lifespan of individual requests as they travel through your applications. By collecting backend application traces and visualizing them as a flame graph, APM helps engineers understand what’s happening as an application interacts with the infrastructure that supports it. In particular, an APM suite can visualize how code executes across the stack while highlighting key performance metrics, including request throughput, latency, and error rates, for every service in the stack. In a troubleshooting scenario, the root causes of user experience issues identified in a RUM product can often be found in the traces surfaced by an APM product.
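To make the per-service metrics concrete, here is a minimal sketch of how throughput, latency, and error rate could be derived from a batch of collected spans. The `Span` shape, the `summarizeByService` helper, and the `windowSeconds` parameter are hypothetical names chosen for illustration; real APM products define their own schemas and aggregation pipelines.

```typescript
// Hypothetical span shape, for illustration only.
interface Span {
  traceId: string;
  service: string;
  durationMs: number;
  isError: boolean;
}

interface ServiceStats {
  requests: number;
  throughputPerSec: number; // spans observed per second of the window
  errorRate: number;        // fraction of spans flagged as errors (0 to 1)
  avgLatencyMs: number;
  p95LatencyMs: number;
}

// Nearest-rank percentile over an array sorted in ascending order.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Group spans by service, then compute the summary metrics for each group.
function summarizeByService(
  spans: Span[],
  windowSeconds: number
): Record<string, ServiceStats> {
  const grouped: Record<string, Span[]> = {};
  for (const span of spans) {
    if (!grouped[span.service]) grouped[span.service] = [];
    grouped[span.service].push(span);
  }

  const stats: Record<string, ServiceStats> = {};
  for (const service of Object.keys(grouped)) {
    const group = grouped[service];
    const durations = group.map((s) => s.durationMs).sort((a, b) => a - b);
    const total = durations.reduce((sum, d) => sum + d, 0);
    const errors = group.filter((s) => s.isError).length;
    stats[service] = {
      requests: group.length,
      throughputPerSec: group.length / windowSeconds,
      errorRate: errors / group.length,
      avgLatencyMs: total / group.length,
      p95LatencyMs: percentile(durations, 95),
    };
  }
  return stats;
}
```

Calling `summarizeByService(spans, 60)` on one minute of spans would return one entry per service, which is roughly the kind of summary an APM service overview presents.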

Correlating APM and RUM Allows for Full Stack Troubleshooting

For many teams, though, it isn’t this simple. Most monitoring tools keep APM and RUM separate, requiring cumbersome toggling and confusing context-switching to use them together. Frontend and backend teams are often structured with similar barriers, so the troubleshooting process is siloed with respect to both the data and the teams.

A more fruitful approach requires stitching the data from the two products together and making it available without having to toggle between them. If a frontend engineer can use their RUM tool not only to identify the user experience issue, but also to pull up information about latency and error spikes across the underlying microservices, they can alert their backend counterparts and share this information, saving time and offering greater context. For full-stack engineering teams, having the correlated data at hand makes resolving the issue faster and easier for the engineers responsible. When the products are connected, collaboration and resolution are both faster and more productive.
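One common way to stitch the two data sets together is to have the browser attach a trace identifier to each outgoing request, record the same identifier on the RUM event, and have backend services tag their spans with it. The sketch below assumes the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`); the article does not say which mechanism a given product uses, and the `RumRequestEvent` and `BackendSpan` shapes and all helper names are hypothetical.

```typescript
// Generate a lowercase hex string. Math.random keeps the sketch portable;
// a production implementation would use a cryptographically strong source.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Build a version-00 traceparent value: 32 hex chars of trace ID,
// 16 hex chars of parent (span) ID, and a 2-hex-char flags field.
function buildTraceparent(
  traceId: string,
  parentId: string,
  sampled: boolean
): string {
  return `00-${traceId}-${parentId}-${sampled ? "01" : "00"}`;
}

// Extract the trace ID from a traceparent value, or null if malformed.
function parseTraceId(traceparent: string): string | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    traceparent
  );
  return match ? match[1] : null;
}

// Hypothetical shapes for a RUM network event and a backend span.
interface RumRequestEvent {
  view: string;
  url: string;
  durationMs: number;
  traceId: string;
}

interface BackendSpan {
  traceId: string;
  service: string;
  durationMs: number;
  isError: boolean;
}

// Given one RUM event, return the backend spans that served that request.
function spansForEvent(
  event: RumRequestEvent,
  spans: BackendSpan[]
): BackendSpan[] {
  return spans.filter((span) => span.traceId === event.traceId);
}
```

On the client, the value from `buildTraceparent(randomHex(32), randomHex(16), true)` would be sent as the `traceparent` request header and its trace ID stored on the RUM event. A slow request seen in RUM can then be matched to its backend spans with `spansForEvent` instead of by timestamp guesswork.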

APM and RUM, Supporting Machine Learning and Business Use Cases

The benefits of connecting APM and RUM extend beyond improved collaboration and more effective issue resolution. Done right, this correlated data makes it possible to apply machine learning and to detect the business impact of backend incidents. With comprehensive frontend and backend data correlated within one system, an ML-based alerting framework has full-stack data to learn from, and can evolve over time to pinpoint the root causes of issues across an entire environment. When that data is used alongside business-level data, the financial impact of an outage can be tied directly to a backend incident. With these use cases, APM and RUM data together have powerful implications both for understanding the connections between the business and its infrastructure and for improving root-cause identification with ML.

Greater Resiliency, More Innovation, Better Business Outcomes

That frontend engineer and their backend counterpart can save themselves a lot of time and improve the user’s experience, but fully correlated APM and RUM also pay dividends across the business. For one, an engineering organization with fewer errors, and errors that are resolved more quickly, is more resilient overall. In addition, the time saved by troubleshooting faster is time that can be put towards building new things rather than fixing existing ones: a new customer-facing signup flow, for example, or a new service architecture that reduces the latency of that signup flow. What ultimately drives a business forward is innovation and new products, but that work won’t succeed if a standard of resilience isn’t being met.