Kafka at NY Times with Boerge Svingen

The New York Times is a newspaper that evolved into a digital publication. Across its 166-year history, The Times has been known for the quality of its longform journalism as well as its ability to turn out news stories quickly. Some content on The New York Times is old but timeless: so-called “evergreen” content.

Readers of the New York Times website are not only looking for the most recent news; they also want to know what the headlines were the day after Pearl Harbor. They want to read editorials about Martin Luther King. Over the last 30 years, The New York Times has moved online, bringing its old material with it.

Since the 1990s, journalists at The Times have used several different content management systems (CMS). These systems store their content in different formats.

This is a data management problem. Users want to search over the entire history of articles published by The Times, which means that The Times needs to unify those articles in a single index. These are articles from the 1920s that were digitized using OCR, articles from 1998 that were written on a legacy CMS, and articles from 2017 that use the latest CMS.
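As a rough illustration of that unification step, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Article` schema, the three adapter functions, the field names of each source format, and the sample records are all invented, and none of it reflects the actual systems at The Times. It only shows the general idea of mapping differently shaped records onto one schema and then searching them through a single index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical unified schema (not the schema used by The Times)."""
    article_id: str
    headline: str
    published: str  # ISO 8601 date, e.g. "1998-06-15"
    body: str


# Each adapter maps one invented source format onto the unified schema.
def from_ocr_scan(rec: dict) -> Article:
    return Article(f"ocr-{rec['scan_id']}", rec["title_text"],
                   rec["issue_date"], rec["page_text"])


def from_legacy_cms(rec: dict) -> Article:
    return Article(f"legacy-{rec['id']}", rec["hed"],
                   rec["pubdate"], rec["content"])


def from_current_cms(rec: dict) -> Article:
    # This invented format stores a full timestamp; keep only the date part.
    return Article(f"cms-{rec['uuid']}", rec["headline"],
                   rec["published_at"][:10], rec["body"])


def build_index(articles):
    """Build a toy inverted index: lowercase word -> set of article ids."""
    index = defaultdict(set)
    for article in articles:
        text = f"{article.headline} {article.body}".lower()
        for word in text.split():
            index[word.strip(".,;:!?\"'")].add(article.article_id)
    return index


# Invented sample records, one per source format.
articles = [
    from_ocr_scan({"scan_id": 7, "title_text": "Markets Rally",
                   "issue_date": "1925-03-02",
                   "page_text": "Stocks rose sharply."}),
    from_legacy_cms({"id": 42, "hed": "Web Markets Grow",
                     "pubdate": "1998-06-15",
                     "content": "Online stocks gained."}),
    from_current_cms({"uuid": "a1", "headline": "Election Results",
                      "published_at": "2017-11-08T09:00:00Z",
                      "body": "Voters turned out."}),
]

index = build_index(articles)
print(sorted(index["stocks"]))  # ids of every article mentioning "stocks"
```

Once every source passes through an adapter, the search code no longer needs to know which system an article originally came from.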

Boerge Svingen is the director of engineering at NYT, and he wrote about this problem and its solution on Medium. The story illustrates the flexibility of Kafka: Kafka is often used as a buffer for high volumes of data, but The New York Times uses it as a place to unify data, so that specific materialized views can be built on top of it.
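The idea of building materialized views on top of a unified log can be sketched in a few lines of Python. This is a toy, in-memory stand-in and not Kafka itself: the `Log` class, the two view functions, and the sample records are all hypothetical names and data invented for illustration. The point is only that each view is derived by replaying the same ordered log from the beginning, so a view can be thrown away and rebuilt, or a new one added, without touching the source data.

```python
from collections import defaultdict


class Log:
    """Append-only, ordered list of records (a stand-in for one log partition)."""

    def __init__(self):
        self._records = []

    def append(self, record: dict) -> int:
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int = 0):
        """Iterate over records in order, starting at the given offset."""
        return iter(self._records[offset:])


def latest_by_id(log: Log) -> dict:
    """View 1: the most recent version of each article, keyed by id."""
    view = {}
    for record in log.read_from(0):
        view[record["id"]] = record  # later records overwrite earlier ones
    return view


def ids_by_year(log: Log) -> dict:
    """View 2: article ids grouped by publication year."""
    view = defaultdict(set)
    for record in latest_by_id(log).values():
        view[record["published"][:4]].add(record["id"])
    return view


# Invented sample data: one old article and one article that is later revised.
log = Log()
log.append({"id": "a1", "published": "1941-12-08", "headline": "Headline A"})
log.append({"id": "a2", "published": "2017-10-30", "headline": "Draft headline"})
log.append({"id": "a2", "published": "2017-10-30", "headline": "Final headline"})

print(latest_by_id(log)["a2"]["headline"])
print(sorted(ids_by_year(log)["2017"]))
```

Both views read the same log but shape it differently; in a real deployment each consumer would typically keep its view in whatever store suits its queries.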

We have covered Kafka in the past in interviews with some of its creators, including Jay Kreps and Neha Narkhede. To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. With these apps, we are building a new way to consume content about software engineering. They are open source at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love your help.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

  • Didier Hoarau

    Very good episode. I’m very interested in this topic.

    For my project, I’ve started to move to this kind of architecture (I heard people calling it event sourcing). It is much harder than just a plain API… But I really like the benefits (being able to change DB schemas, easier load balancing and high availability, …). I feel that there is a parallel between event sourcing and microservices. Microservices have a lot of advantages compared to monolithic applications, but they come with a development cost that people shouldn’t underestimate… I feel like event sourcing is the equivalent for databases.

    I’ve heard that some people are using CouchDB to store the event log. I’ve started to use it for a smaller new project. I don’t have enough experience to recommend it or not but so far I find that it works fine.

    Thanks again for this episode.

    • Jeff Meyerson

      Thanks for listening!
