High Volume Logging with Steve Newman

Podcast Friday, December 15 2017


Google Docs is used by millions of people to collaborate on documents together. With today’s technology, you could spend a weekend coding and build a basic version of a collaborative text editor. But in 2004 it was not so easy.

In 2004 Steve Newman built a product called Writely, which allowed users to collaborate on documents together. Initially, Writely was hosted on a single server that Steve managed himself. All of the reads and writes to the documents went through that single server. Writely rapidly grew in popularity, and Steve went through a crash course in distributed systems as he tried to keep up with the user base.

In 2006, Writely was acquired by Google, and Steve spent the next four years turning Writely into Google Docs. Eventually he moved on to other projects within Google, including “Cosmo” and “Megastore Replication.” When Steve left the company in 2010, he took with him the lessons of logging and monitoring that keep Google’s infrastructure observable.

Large organizations have terabytes of log data to manage. This data streams off the servers that are running our applications. That log data gets processed in a “metrics pipeline” and turned into monitoring data. Monitoring data aggregates log data in a more presentable format.

Most of the log messages that get created will never be seen by human eyes. These logs get aggregated into metrics, then compressed, and (in many cases) eventually thrown away. Different companies have different sensitivities around their logs, so some companies may not garbage collect any of their logs!
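To make the idea of a metrics pipeline concrete, here is a minimal sketch in Python that turns raw log lines into per-minute counts by level. The log format, the sample lines, and the function name `aggregate_by_minute` are all assumptions made for illustration; this is not how Scalyr or Google does it, just the smallest version of "aggregate logs into metrics."

```python
from collections import Counter
from datetime import datetime

# Hypothetical log format (an assumption): "<ISO timestamp> <LEVEL> <message>"
SAMPLE_LOGS = [
    "2017-12-15T10:00:03 INFO request served path=/docs",
    "2017-12-15T10:00:41 ERROR upstream timeout path=/docs",
    "2017-12-15T10:01:07 INFO request served path=/edit",
    "2017-12-15T10:01:30 ERROR upstream timeout path=/edit",
    "2017-12-15T10:01:59 ERROR disk full",
]


def aggregate_by_minute(lines):
    """Count log lines per (minute, level) bucket."""
    counts = Counter()
    for line in lines:
        # Split into timestamp, level, and the rest of the message.
        timestamp, level, _message = line.split(" ", 2)
        # Truncate the timestamp to minute resolution.
        minute = datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")
        counts[(minute, level)] += 1
    return counts


metrics = aggregate_by_minute(SAMPLE_LOGS)
for (minute, level), count in sorted(metrics.items()):
    print(minute, level, count)
```

Five log lines collapse into four small counters. Once the counters exist, the raw lines can be compressed or discarded, which is the trade-off described above.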

When a problem occurs in our infrastructure, we need to be able to dig into our terabytes of log data and quickly find the root cause of a problem. If our log data is compressed and stored on disk, it will take longer to access it. But if we keep all of our logs in memory, it could get expensive.
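The memory-versus-disk trade-off can be sketched as a two-tier store: the batch currently being written stays uncompressed in memory, and each full batch is compressed into a "cold" blob. The class name `TieredLogStore`, the `hot_capacity` parameter, and the batching policy are hypothetical choices for this sketch, not a description of any real product.

```python
import zlib


class TieredLogStore:
    """Hold the current batch of lines in memory; compress each full batch."""

    def __init__(self, hot_capacity=3):
        self.hot_capacity = hot_capacity  # lines per batch (assumed tuning knob)
        self.hot = []    # uncompressed lines, cheap to search
        self.cold = []   # zlib-compressed batches, oldest first

    def append(self, line):
        self.hot.append(line)
        if len(self.hot) >= self.hot_capacity:
            # The batch is full: compress it and move it to the cold tier.
            blob = zlib.compress("\n".join(self.hot).encode("utf-8"))
            self.cold.append(blob)
            self.hot = []

    def search(self, needle):
        """Return matching lines in the order they were written."""
        results = []
        for blob in self.cold:
            # Cold batches must be decompressed before they can be scanned.
            text = zlib.decompress(blob).decode("utf-8")
            results.extend(l for l in text.split("\n") if needle in l)
        results.extend(l for l in self.hot if needle in l)
        return results


store = TieredLogStore(hot_capacity=2)
for i in range(5):
    level = "ERROR" if i % 2 == 0 else "INFO"
    store.append(f"line {i} {level}")

matches = store.search("ERROR")
print(matches)
```

Searching the hot tier is a plain scan, while every cold batch pays a decompression cost first. A real system would add indexes and push cold batches to disk, but the cost asymmetry is the same one the paragraph above describes.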

To review: if I want to build a logging system from scratch today, I need to build a metrics pipeline for converting log data into monitoring data; a complicated caching system; a way to store and compress logs; a query engine that knows how to ask questions of the log storage system; a user interface so I don’t have to inspect these logs via the command line…

The list of requirements goes on and on—which is why there is a huge industry around log management. And logging keeps evolving! One example we covered recently is distributed tracing, which is used to diagnose requests that travel through multiple endpoints.

After Steve Newman left Google, he started Scalyr, a product that allows developers to consume, store, and query log messages. I was looking forward to talking to Steve about data engineering, and the query engine that Scalyr has architected, but we actually spent most of our conversation talking about the early days of Writely, and his time at Google—particularly the operational challenges of Google’s infrastructure. Full disclosure: Scalyr is a sponsor of Software Engineering Daily.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.