The New York Times is a newspaper that evolved into a digital publication. Across its 166-year history, The Times has been known for the quality of its longform journalism, in addition to its ability to quickly churn out news stories. Some content on The New York Times is old but timeless “evergreen” content.
Readers of The New York Times website are not only looking for the most recent news. They want to know what the headlines were the day after Pearl Harbor. They want to read editorials about Martin Luther King. Over the last 30 years, The New York Times has moved online, bringing old material with it.
Since the 90s, several different content management systems (CMS) have been used by journalists within The Times. These different sources of content store data in different formats.
This is a data management problem. Users want to search over the entire history of articles published by The Times, which means that The Times needs to unify those articles in a single index. That index has to cover articles from the 1920s that were digitized using OCR, articles from 1998 that were written on a legacy CMS, and articles from 2017 that use the latest CMS.
Boerge Svingen is the director of engineering at NYT, and he wrote about this problem and its solution on Medium. This story shows the flexibility of Kafka: rather than using Kafka as a buffer for high volumes of data, The New York Times uses it as a place to unify data, a single log on top of which other, more specific materialized views can be built.
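The idea of a unified log with materialized views built on top of it can be sketched in a few lines of Python. This is a minimal illustration, not The Times's actual implementation: the `Log` class is an in-memory stand-in for a Kafka topic (no Kafka client is used), and `ArticleEvent`, its fields, and the two view builders are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleEvent:
    """One published article in a common schema (hypothetical fields)."""
    article_id: str
    year: int
    source: str      # e.g. "ocr", "legacy_cms", "current_cms"
    headline: str


class Log:
    """Append-only, ordered log; an in-memory stand-in for a Kafka topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def build_latest_view(log):
    """Materialized view: the most recent version of each article, by id."""
    view = {}
    for event in log.read_from(0):
        view[event.article_id] = event  # later events overwrite earlier ones
    return view


def build_search_view(log):
    """Materialized view: lowercase headline word -> set of article ids."""
    index = defaultdict(set)
    for event in build_latest_view(log).values():
        for word in event.headline.lower().split():
            index[word].add(event.article_id)
    return dict(index)
```

Because every view is derived by replaying the log from offset 0, a new view (or a view with a changed schema) can be built at any time without touching the systems that originally produced the articles.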
We have covered Kafka in the past with interviews of some of its creators, including Jay Kreps and Neha Narkhede. To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. With these apps, we are building a new way to consume content about software engineering. They are open-sourced at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love to get your help.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.
Very good episode. I’m very interested in this topic.
For my project, I’ve started to move to this kind of architecture (I heard people calling it event sourcing). It is much harder than just a plain API… But I really like the benefits (being able to change DB schemas, making load balancing and high availability easier, …). I feel that there is a parallel between event sourcing and microservices. Microservices have a lot of advantages compared to monolithic applications, but they come with a development cost that people shouldn’t underestimate… I feel like event sourcing is the equivalent for databases.
I’ve heard that some people are using CouchDB to store the event log. I’ve started to use it for a smaller new project. I don’t have enough experience to recommend it or not, but so far I find that it works fine.
Thanks again for this episode.
Thanks for listening!