The New York Times is a newspaper that evolved into a digital publication. Across its 166-year history, The Times has been known for the quality of its longform journalism as well as its ability to turn out news stories quickly. Some of the content on The New York Times is old but timeless: “evergreen” content.
Readers of The New York Times website are not only looking for the most recent news. They want to know what the headlines were the day after Pearl Harbor. They want to read editorials about Martin Luther King. Over the last 30 years, The New York Times has moved itself online, bringing old material with it.
Since the 1990s, journalists within The Times have used several different content management systems (CMS). Each of these content sources stores its data in a different format.
This is a data management problem. Users want to search over the entire history of articles published by The Times, which means that The Times needs to unify those articles in a single index. That history includes articles from the 1920s that were digitized using OCR, articles from 1998 that were written on a legacy CMS, and articles from 2017 that use the latest CMS.
Boerge Svingen is the director of engineering at NYT, and he wrote about this problem and its solution on Medium. This story illustrates the flexibility of Kafka. Kafka is often used as a place to buffer high volumes of data; The New York Times instead uses it as a place to unify data, so that other, more specific materialized views can be built on top of it.
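The idea of a unified log with materialized views built on top of it can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not The Times' implementation and not the Kafka API: `ArticleLog` is an in-memory stand-in for a log, and the record layouts, field names, and helper functions (`from_ocr_scan`, `from_legacy_cms`, `build_search_index`, `build_year_view`) are hypothetical, as is the sample data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Article:
    """Common shape every source is normalized into (hypothetical fields)."""
    article_id: str
    year: int
    headline: str
    body: str


def from_ocr_scan(record: dict) -> Article:
    # Hypothetical OCR export: the first line of the scanned text is the headline.
    headline, _, body = record["text"].partition("\n")
    return Article(record["scan_id"], record["year"], headline, body)


def from_legacy_cms(record: dict) -> Article:
    # Hypothetical legacy CMS export with its own field names.
    return Article(record["id"], record["pub_year"], record["title"], record["content"])


class ArticleLog:
    """In-memory stand-in for an append-only, ordered log (not a Kafka client)."""

    def __init__(self) -> None:
        self._entries: list[Article] = []

    def append(self, article: Article) -> int:
        self._entries.append(article)
        return len(self._entries) - 1  # offset of the newly appended entry

    def replay(self, from_offset: int = 0) -> Iterator[Article]:
        # Consumers read the log in order, starting from any offset.
        return iter(self._entries[from_offset:])


def build_search_index(log: ArticleLog) -> dict[str, set[str]]:
    """One materialized view: lowercase word -> ids of articles containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for article in log.replay():
        for word in f"{article.headline} {article.body}".lower().split():
            index[word.strip('.,;:!?"')].add(article.article_id)
    return index


def build_year_view(log: ArticleLog) -> dict[int, list[str]]:
    """A second view over the same log: publication year -> headlines."""
    by_year: dict[int, list[str]] = defaultdict(list)
    for article in log.replay():
        by_year[article.year].append(article.headline)
    return by_year


# Made-up sample records in two different source formats.
log = ArticleLog()
log.append(from_ocr_scan({"scan_id": "ocr-1", "year": 1941,
                          "text": "War declared\nCongress votes."}))
log.append(from_legacy_cms({"id": "cms-7", "pub_year": 1998,
                            "title": "Markets rally",
                            "content": "Stocks rose as war fears eased."}))

search_index = build_search_index(log)
year_view = build_year_view(log)
```

In this sketch, each source format is converted once at the edge, and every view is derived by replaying the same log from the beginning. A new view, or a rebuilt one, needs no access to the original content systems, which is the property the paragraph above attributes to using the log as the unifying layer.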
We have covered Kafka in the past in interviews with some of its creators, including Jay Kreps and Neha Narkhede. To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. With these apps, we are building a new way to consume content about software engineering. They are open source at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love your help.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.