Crash Reporting: Improving the Customer and Dev Experience

When things go badly, everyone needs a plan. 

When things break in a production system, a solid strategy is necessary to recover and minimize impact. When an alert goes off, a plan keeps the engineers calm, composed, and focused on how to fix the problem.

With expansive investment in the continuous deployment landscape, deployment rollbacks are widely available. When something breaks, tools detect the failure, but a lot of time is still spent figuring out what actually went wrong.

Crash Reporting provides tooling that detects problems automatically and lets developers start solving them right away. A tool like Raygun does this by capturing software errors in production and beyond.

Continuous delivery tool landscape. Source: http://www.jamesbowman.me/post/continuous-delivery-tool-landscape/

A Series of Unfortunate Deployments to Production

For example, say someone pushes a small one-line change to production. It flows nicely through the integration and deployment pipeline. A few minutes after the deployment, alerts start firing: that one-line change must have had unforeseen side effects. The on-call engineer intervenes and rolls back the deployment. Analytics look normal again.

The team had employed industry-standard tooling for continuous integration and continuous delivery. The engineer had followed good practice too, writing tests and running them before pushing the code.

Still, the alerts went off. Thankfully, the team uses good monitoring and alerting tools like Prometheus and Grafana to aggregate metrics and trigger alarms through PagerDuty. They also have Elasticsearch and Kibana for log analysis.

Not Quite The Right Tooling

Classic monitoring and alerting tools don't tell you exactly what broke.

The engineer might spend a lot of time trying to figure out what broke. Latency was up, but the spikes were inconsistent, going up and down randomly. There was no correlation between the number of requests and the latency spikes, and there were no other fault or error alerts. Staring at dashboards, scrolling through logs, and reading code takes hours.

Eventually, the engineer identifies the problem: Safari on mobile was triggering unexpected behavior in the web app. The fix takes five minutes to write.

Pushing to production is scary. A failed deployment can be a traumatic experience, and it can lead to a culture of fear, with fewer releases.

The Hidden Cost of Lengthy Debugging

Every engineer wants to write quality code and push it to production with ease. We might have tooling to write and push, but what about the times when things don't go as planned?

The mistakes have a huge, sometimes unforeseen cost:

  • Software problems have a financial cost. You lose customers who wanted to buy stuff on your website. You lose customers who are no longer satisfied with your service. You lose customers who uninstalled your app because it failed. Getting customers is expensive and losing them is costly.
  • Bad customer experience includes crashes as well as high latencies/response times. If someone is using your product for the first time, they are unlikely to try it again if it fails or it is slow.
  • High Time to Recovery costs you customers. Sometimes it’s easy to roll back a bad deployment and suffer no negative impact (with a good canary deployment and retry strategy). Sometimes the rollback doesn’t fix the problem (a bad database change, or a failure in a dependency). The time it takes to figure out what is causing the problem and to resolve it can be very long. Today, anything over five minutes is a huge outage, and the biggest factor in time to recovery is usually the time to discover the root cause.
  • The Time to Root Cause is costly too. Rolling back a deployment mitigates the problem in production, but we still have to spend a lot of time figuring out exactly what caused it and in what part of the code it happened.
  • Scarred engineers might become afraid of deploying code. They may grow less innovative and slower, because they will optimize for making sure this never happens again, and they do that by slowing down how they write and ship code.

The good news is there is a tool that can tell you exactly what broke in your system, and we will explore it with examples.

Observability & Crash Reporting

Crash Reporting is a way to automatically collect, analyze, and triage stack traces of the software you write. You want your tool to be simple to use and reusable for any language and on any platform: front-end or back-end, mobile, web, or even VR! It should integrate directly into your development and debugging workflows.
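To make "collect" concrete, here is a minimal sketch of the data a crash reporter gathers when an exception is thrown. The `buildCrashReport` function and its field names are illustrative, not any real SDK's API:

```javascript
// Minimal sketch of the data a crash reporter collects per error.
// buildCrashReport and its fields are illustrative, not a real SDK API.
function buildCrashReport(err, context = {}) {
  return {
    type: err.name,                 // e.g. "TypeError"
    message: err.message,           // human-readable description
    stack: err.stack,               // full stack trace for debugging
    timestamp: new Date().toISOString(),
    // Environment context lets the backend slice failures by platform.
    platform: context.platform || process.platform,
    appVersion: context.appVersion || 'unknown',
  };
}

// A real SDK would POST this report to a collection endpoint;
// here we just build it from a caught exception.
try {
  JSON.parse('{not valid json');
} catch (err) {
  const report = buildCrashReport(err, { appVersion: '1.4.2' });
  console.log(report.type, '-', report.message.slice(0, 40));
}
```

The point is that the report carries the stack trace plus enough context (platform, version) for the backend to group and slice failures, which is what separates crash reporting from plain logging.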

You can’t test everything; there are always edge cases you miss. So instead of trying to discover every way our system behaves by testing it, we add metrics, monitoring, dashboards, alarms, log analysis, and analytics to see how it is actually behaving. Monitoring lets us see how the APIs are doing. But monitoring is dumb, in the sense that it’s just data with no context and no value by itself. You are just spewing out numbers to look at.

You add alarms to give the numbers some meaning. But the alarm thresholds you set only warn you when something is wrong; you still don’t usually know exactly what’s causing the problem. Observability is the practice of adding a layer of understanding on top, so you can tell how your system is behaving and how it interacts with everything around it.

Monitoring helps you see how your system is behaving, but you want more. You want to understand interactions and links between data. You want to figure out that service A is crashing because service B is not behaving as expected. You want to be able to tell that latency in your API comes from a single superuser, say, a couple with 20 million items on their wedding registry. You want to be able to tell that Internet Explorer 9 is the only browser causing all these failures in your system.
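The "it's all one browser" insight is, mechanically, just a group-by over error events. A simplified sketch (the event shape here is hypothetical; real tools tag each report with user-agent data automatically):

```javascript
// Count failures per browser to find the one responsible for most errors.
// The event objects are hypothetical; real crash reporters attach
// browser/device metadata to every report.
function failuresByBrowser(events) {
  const counts = {};
  for (const e of events) {
    counts[e.browser] = (counts[e.browser] || 0) + 1;
  }
  // Sort descending so the worst offender comes first.
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

const events = [
  { browser: 'IE9', error: 'TypeError' },
  { browser: 'IE9', error: 'TypeError' },
  { browser: 'Chrome', error: 'NetworkError' },
  { browser: 'IE9', error: 'TypeError' },
];
console.log(failuresByBrowser(events)[0]); // worst offender first
```

A raw latency or error-rate graph cannot answer this question; it takes per-event context (browser, user, device) joined to each failure.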

Figure Out What Went Wrong in Seconds, Not Minutes or Hours

To build observability, we need tooling that is smarter than monitoring (we still need monitoring, by the way; it is foundational, but not enough by itself). We need Crash Reporting, Performance Monitoring, and insights about user interactions.

When your app crashes, or you see failures or faults, you want to figure out what went wrong in literally seconds, in real time. Crash Reporting is the tool that does just that.

Crash Reporting: The Hero Your Engineers and Customers Deserve

When your systems break, you start looking at metrics, then log messages. The log you are always hunting for is the one with the error message and the crash data. Why can’t we surface that data automatically and systematically?

It doesn’t matter if you are deploying an iOS app, an AngularJS web application, a backend .NET web application, or a serverless Node.js Lambda. You can usually solve the problem if you can just get your hands on the stack trace.
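Getting your hands on the stack trace means parsing it. As a sketch, here is how frames can be pulled out of a V8-style trace (the format Node.js and Chrome produce); other runtimes format stacks differently, which is why real SDKs ship per-platform parsers:

```javascript
// Sketch: pull individual frames out of a V8-style stack trace
// (Node.js / Chrome format: "    at fnName (file:line:col)").
// Other runtimes differ, so real SDKs ship per-platform parsers.
function parseStackFrames(stack) {
  return stack
    .split('\n')
    .filter((line) => line.trim().startsWith('at '))
    .map((line) => line.trim().slice(3)); // drop the "at " prefix
}

function topFrame(err) {
  const frames = parseStackFrames(err.stack || '');
  return frames[0] || '(no stack available)';
}

try {
  null.someMethod(); // force a TypeError with a real stack trace
} catch (err) {
  console.log('Crashed at:', topFrame(err));
}
```

The top frame usually points straight at the failing line, which is exactly the piece of information hours of log-scrolling is spent hunting for.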

The Benefits of Crash Reporting

With a great Crash Reporting tool, you can spend your time coding and actually fixing the bugs, rather than going through millions of lines of logs trying to figure out what happened.

Save your excellent debugging skills for the genuinely hard problems that need them.

Design for Crash Reporting & Observability

As you start using crash reporting in your system, think about how to improve your observability. How can you surface the problems you haven’t anticipated? You can do several things to improve your experience.

More information on Raygun and their Crash Reporting tooling is available on Raygun’s site.

For more information on Crash Reporting, check out High Volume Event Processing with John-Daniel Trask.

Using Raygun For Crash Reporting

Raygun can help uncover failures in many different scenarios. Let’s explore a few of them with examples.

Someone hits the front page of Bookl.y, the Netflix of books. The website returns an error. They refresh, and this time the website works. A deployment triggered a problem; the monitoring and alerting system detects it in less than two minutes and rolls back the deployment. Your users are not too annoyed.

Before the deployment even starts rolling back, you get a notification in your team’s Slack room, along with an email containing the full stack trace and links to view more details about the failure. You can probably tell in two seconds why and how the application failed.

No more NullPointerExceptions in production! Ever! Again! (Seriously, those should all go away. There should be Null of them! Crash Reporting can do exactly that.) We are using Raygun, one of the best Crash Reporting tools out there.

In computing, if it failed once, it WILL fail again. If I click through from the email, I can see how many times the failure happened and which devices/machines/servers/hosts were affected.

Raygun is very smart about aggregating failures, even when the code changed and lines or functions moved around. It is built to filter out non-errors and only surface actual problems.

I know exactly when the problem started happening, and how often it happens.

Reliable dependencies are good until they are not. Thousands of software projects broke when 11 lines of code changed in left-pad. By now, every time you get paged, the first thing you do is log in to look at the Raygun dashboard. You immediately see the exception rates going up.

Raygun’s crash reporting tools should be part of your production tooling so you can assess problems in production in seconds, as they happen. Want to decide whether to roll back or continue a deployment? Want to know whether a given fix actually resolves all the crashes in your system? Use a Crash Reporting tool.

When you first turn on Crash Reporting for existing or legacy software, you will be bombarded with tens to thousands of failures in your system. These are the silent failures: failures that happen at very low rates, on just a few devices or browsers. They are failing code paths that have been lurking in your system, not frequent enough to trigger an alarm, but enough that a customer walks away from your product every day. What if your website always crashes on Opera, but you never test on it? What if your software fails only on old versions of Android?

Not only can you see the exact device, OS, and hardware your software was running on, but you can also pinpoint exactly what is causing problems.

If you decide that customer acquisition is important to you and that you want to retain every customer, you need to make sure everyone is having a good experience, on every device, on every platform. Or at least acknowledge where your experience is not up to par.

Sometimes it’s code you don’t control, or maybe a certain failure is acceptable in your line of service. It’s very easy to filter out the error types you don’t care about. You can filter by device, IP address, and version of the software. You can also easily mark errors as Resolved, Ignored, or Permanently Ignored so they are stored but no longer trigger notifications.

Having control over your error-fixing workflow matters, and Raygun integrates with PagerDuty, Slack, GitHub, Trello, and many others. You want your Crash Reporting tool to plug directly into the systems and software you already use: a notification when a problem happens, and a page or a Slack message if you want one.

Because you have full control and can filter things out, you can ensure you are only engaged for things that matter. Nothing makes a tool useless like noisy data. By having control of everything, you can ensure there is no noise: you are only seeing real failures and addressing them as they come up.
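A notification filter like the one described above boils down to a rule check before anyone gets paged. An illustrative sketch (the rule and error shapes are hypothetical; in Raygun this kind of filtering is configured in the product, not written by hand):

```javascript
// Decide whether an error should notify anyone, based on team-defined rules.
// The rule shape here is illustrative, not a real tool's configuration format.
function shouldNotify(error, rules) {
  if (rules.ignoredTypes.includes(error.type)) return false;   // known non-errors
  if (rules.resolvedIds.includes(error.groupId)) return false; // marked Resolved/Ignored
  if (rules.ignoredIps.includes(error.clientIp)) return false; // e.g. load-test traffic
  return true;
}

const rules = {
  ignoredTypes: ['ScriptTimeoutError'], // acceptable in our line of service
  resolvedIds: ['grp-123'],             // already fixed, awaiting deploy
  ignoredIps: ['10.0.0.99'],            // internal load-testing box
};

const pageWorthy = { type: 'NullReferenceException', groupId: 'grp-777', clientIp: '203.0.113.7' };
console.log(shouldNotify(pageWorthy, rules)); // true: a real, unresolved failure
```

The design goal is simple: every notification that survives the rules is one a human should actually act on.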

Beyond Crash Reporting, you might also want to track what causes your code to respond slowly. Since a high response time is effectively a failure, a good tool like Raygun provides just that: you can analyze how your applications and APIs are doing, and drill into specific user interactions that might be causing problems. With User Tracking, you can follow user sessions to see the interactions leading up to an error. You can also tell exactly how many customers were impacted: did one customer hit the same error 100 times, or did 100 customers hit it once each? Raygun is also GDPR compliant, so you don’t have to worry about how data is stored and handled.
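The "one customer 100 times vs. 100 customers once" question is an aggregation over error events. Schematically (the event fields are hypothetical; real tools attach user identifiers to each report):

```javascript
// For each error group, count total occurrences and distinct affected users,
// to distinguish one noisy user from a widespread failure.
// The event shape is hypothetical.
function impactByError(events) {
  const impact = {};
  for (const { errorId, userId } of events) {
    if (!impact[errorId]) impact[errorId] = { occurrences: 0, users: new Set() };
    impact[errorId].occurrences += 1;
    impact[errorId].users.add(userId);
  }
  return Object.fromEntries(
    Object.entries(impact).map(([id, v]) => [
      id,
      { occurrences: v.occurrences, distinctUsers: v.users.size },
    ])
  );
}

// One user retrying vs. many users each hitting the bug once:
const impactEvents = [
  { errorId: 'E1', userId: 'u1' }, { errorId: 'E1', userId: 'u1' },
  { errorId: 'E1', userId: 'u1' },
  { errorId: 'E2', userId: 'u1' }, { errorId: 'E2', userId: 'u2' },
  { errorId: 'E2', userId: 'u3' },
];
console.log(impactByError(impactEvents));
```

Here E1 and E2 both occurred three times, but E2 touched three distinct users, which makes it the more urgent bug even though the raw counts match.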


Raygun is a sponsor of Software Engineering Daily.

If you are interested in reading about specific topics about software development, technology, continuous integration and deployment, observability or monitoring, email me at abdallah@softwareengineeringdaily.com. All feedback is welcome.

Views presented in this article are Abdallah’s and don’t represent the views of his employer or Software Engineering Daily.


 

Abdallah Abu-Ghazaleh

Toronto

A software developer who is passionate about building things, Abdallah is interested in tackling the problems of scale for software, people, and ideas. Abdallah has experience working on embedded and distributed systems. Abdallah enjoys talking about technology, software development, tech ethics, and human development.

Software Weekly

Subscribe to Software Weekly, a curated weekly newsletter featuring the best and newest from the software engineering community.