LinkedIn Data Infrastructure

Article Tuesday, February 18 2020

LinkedIn has become a staple for the modern professional, whether it’s used for searching for a new job, reading industry news, or keeping up with professional connections.

LinkedIn is a rapidly growing platform that serves more than 675 million users today, giving it one of the largest user bases in the world. How these users interact with the site and react to recommendations aggregates into a massive dataset. Few companies operate at this scale, and the volume of data brings interesting engineering problems and opens up opportunities for innovation in areas like data infrastructure and tooling.

Even though LinkedIn is a 16-year-old company, its data infrastructure journey is far from over. That journey spans a wide range of practices: from running approximately 20 servers in a small data center in 2008, to building smarter data centers around the world, to, as of July 2019, beginning a multi-year migration to the public cloud with Azure. Throughout this journey, LinkedIn engineers have faced a variety of challenges and documented their solutions as lessons learned along the way. They have also built and open-sourced invaluable tools like Kafka and Voldemort, used by millions of other engineers.

In the early years of LinkedIn, the data infrastructure relied on a single data center, hosted with a retail data center provider. With all data served from one location, the priority in those days was availability: keeping the site up for users.

As the number of users grew and new features were released, adding data center capacity through a retail provider became less cost-effective. This is when LinkedIn started its own data center, gradually fanning out to rely on multiple data centers. By not only expanding their data centers in number, but also designing the fabric of the data centers in a smart way, LinkedIn grew into its modern infrastructure, able to handle millions of users.

LinkedIn has approached growth on several fronts. The most prominent strategies have been expanding the number and capacity of its data centers, building smarter data centers, and creating tooling around massive data so that it can be integrated into workflows faster and propel innovation.

Data Sources at LinkedIn

LinkedIn has three main sources of data, as Kapil Surlaker explains in our episode on the company’s data infrastructure.

The first source is transactional data from the users: every action taken by a user, from status updates to post “likes” and job views, must be stored. The second source is telemetry data, which comes from monitoring applications to gain insight into how the different components of the platform are performing. The third source, one without an upper bound according to Surlaker, is derived data, which developers generate for numerous purposes, such as data sets for analysis and for building machine learning models.

These types of data are common for web applications with user interactions. Things get complicated when data has to be consolidated in a standard format to enable a unified experience for the developers in a company.

The data sources can differ widely: historical data usually comes from RDBMSs designed for OLAP, current transactional data comes from NoSQL databases and streams, and logs can be delivered in a variety of formats. The paradigm in which the data arrives also matters, because ingesting streaming data and processing batch data may have different requirements.

LinkedIn’s main answer to handling these diverse sets of data has been through tooling. Luckily for the general developer community, many of these tools have been open-sourced over the years.

Open Source Tools

One of LinkedIn’s strategies for dealing with the massive amounts of data that are being constantly generated is to empower engineers by developing tools to deal with different aspects of the data, from ingestion to storage.

LinkedIn has built and open-sourced a variety of tools over the years. One of these tools, Kafka, built by LinkedIn and donated to the Apache Software Foundation, forms the backbone of data operations at LinkedIn alongside Hadoop. Kafka, a distributed streaming platform, acts as a low-latency data collection system for the real-time data generated by LinkedIn’s user base.
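
The collection role described above rests on a simple abstraction: producers append events to a log, and each consumer reads forward from its own position. The Python sketch below is a toy, single-process illustration of that idea only. It does not use or imitate the Kafka client API, and the class name EventLog and its methods are hypothetical.

```python
from typing import Any, Dict, List


class EventLog:
    """Toy append-only log with one read position per consumer (hypothetical)."""

    def __init__(self) -> None:
        self._events: List[Any] = []
        self._offsets: Dict[str, int] = {}

    def append(self, event: Any) -> int:
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str, max_events: int = 10) -> List[Any]:
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Because each consumer keeps its own offset, a slow analytics job and a fast monitoring job can read the same events independently without affecting each other.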

Complementing Kafka is a tool called Gobblin, a distributed data integration framework. Gobblin is used to ease and unify the integration of data between different sources and sinks, providing scalability, fault tolerance, and quality assurance in one tool. Developed initially to serve as an “uber-ingestion framework” for Hadoop at LinkedIn, Gobblin was open-sourced and donated to Apache where it has taken on new integrations and a diverse community of committers.

Project InVersion

In the fast-moving world of startups, technical debt is often overlooked. It refers to an accumulation of deficiencies that make it harder to add new features to the system. The most common way of accumulating technical debt is by releasing features quickly without thinking of the future sustainability of the overall system, a practice that is prominent for startups that are looking to attract users and investors with shiny new features.

Technical debt occurs in many ways, and it’s not always easy to prevent it. Developers at LinkedIn faced their technical debt the hard way.

In 2011, after the company’s initial public offering, LinkedIn’s technical debt hit a critical point. Infrastructure practices that had been in use for years, and problems that compounded as new features were added on top of them, could no longer be contained. LinkedIn went for a risky infrastructure overhaul, now referred to as Project InVersion.

For two months in 2011, LinkedIn stopped rolling out new features as developers focused on improving and modernizing their infrastructure – a full team effort to get rid of the technical debt of the last eight years. This overhaul included developing new tools that automated testing, accelerated the process of rolling out features and updating the platform, and in the end, completely transformed LinkedIn’s backbone.

Challenges with ML

LinkedIn offers a personalized experience to each of its users. The way that posts in their feed are sorted, the job recommendations they see, and other recommendations need to be specific to everyone on the platform. The main power behind these operations is machine learning.

An example from recommendations on LinkedIn, powered by AI.

LinkedIn has many teams for each ML application, from Feeds to Communities. Each of these areas poses unique challenges in defining the right objectives, applying the correct modeling technique, and successfully serving complex models with low latency at scale. Each model must be tightly integrated within the serving stack specific to its problem space. At the same time, there must be a single unified framework that provides a battery of tools to solve the myriad challenges that come with dealing with complex models that operate on a very large set of data.

LinkedIn’s solution is Pro-ML.

The goal of Pro-ML is to double the effectiveness of machine learning engineers while simultaneously opening the tools for AI and modeling to engineers from across the LinkedIn stack.

The Pro-ML approach divides ML practices into layers as part of the machine learning development lifecycle.

Each of these layers is a step towards building machine learning models for production. LinkedIn finds it helpful to standardize these steps so that engineers across teams can share innovations by simply swapping components with one another. Pro-ML also provides automation and additional hints to help users find mistakes in their models faster.

In machine learning parlance, a “feature” is a piece of the data that the model uses to make a prediction. An example might be how many connections in common a user has with someone who posted an item in his or her feed. Features used in various machine learning models are collected into the Feature Marketplace in a searchable format. These features are available when making predictions when the user visits the site, but must be simulated when testing out an idea during model training. LinkedIn has had many challenges in the past with ensuring features are computed the same way during model training and prediction. Pro-ML offers a tool called Frame that unifies feature access and computation in all of the environments.
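
The training/prediction consistency problem described above can be illustrated with a minimal Python sketch. It is not Frame's actual API: the names common_connections, FEATURES, and compute_features are hypothetical. The only idea shown is that a feature is defined once, and the same definition is called from both the offline (training) path and the online (prediction) path.

```python
from typing import Callable, Dict, List, Set, Tuple

MemberIds = Set[str]


def common_connections(viewer: MemberIds, poster: MemberIds) -> int:
    """Hypothetical feature: how many connections two members share."""
    return len(viewer & poster)


# Hypothetical registry standing in for a searchable feature catalog.
FEATURES: Dict[str, Callable[[MemberIds, MemberIds], int]] = {
    "common_connections": common_connections,
}


def compute_features(viewer: MemberIds, poster: MemberIds) -> Dict[str, int]:
    """Single code path that evaluates every registered feature."""
    return {name: fn(viewer, poster) for name, fn in FEATURES.items()}


def build_training_rows(
    logged_pairs: List[Tuple[MemberIds, MemberIds]],
) -> List[Dict[str, int]]:
    """Offline path: recompute features from logged data for model training."""
    return [compute_features(viewer, poster) for viewer, poster in logged_pairs]


def features_for_request(viewer: MemberIds, poster: MemberIds) -> Dict[str, int]:
    """Online path: compute features when a member visits the site."""
    return compute_features(viewer, poster)
```

Routing both paths through one function means a change to a feature's definition cannot silently apply to training but not to prediction, or the reverse.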

LinkedIn also has several open-source tools to integrate machine learning workflows into their infrastructure needs, such as TonY and Photon ML.

TonY, originally an acronym for TensorFlow on YARN, was developed out of a need to run distributed deep learning training jobs on large Hadoop clusters. Other options such as TensorFlow on Spark fell short of LinkedIn’s specific needs, for example by lacking GPU scheduling, so an internal tool was created and later open-sourced. TonY currently supports not only TensorFlow, but also PyTorch and MXNet.

Photon ML was built out of similar needs as a machine learning library on Spark. Rather than deep learning, Photon ML focuses on Generalized Linear Models and Generalized Linear Mixed Models (GLMix). These models built by Photon ML power features where response prediction is useful, namely for recommendation components such as job recommendation, feed ranking, and “People You May Know.”
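
To make the GLMix idea more concrete, here is a minimal Python sketch of how a model of this family might score one user-item pair with a logistic link: a global linear term shared by everyone, plus a per-user term and a per-item term. This illustrates the model family only and is not Photon ML's API. The function names, feature vectors, and coefficients are made up for the example.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def glmix_score(
    global_features: Sequence[float],
    global_coef: Sequence[float],
    item_features: Sequence[float],
    user_coef: Sequence[float],
    user_features: Sequence[float],
    item_coef: Sequence[float],
) -> float:
    """Predicted response probability for one (user, item) pair.

    logit = global term (coefficients shared by all users and items)
          + per-user term (this user's coefficients on item features)
          + per-item term (this item's coefficients on user features)
    """
    logit = (
        dot(global_features, global_coef)
        + dot(item_features, user_coef)
        + dot(user_features, item_coef)
    )
    return 1.0 / (1.0 + math.exp(-logit))
```

The per-user and per-item coefficients let the prediction deviate from the global model for a specific member or a specific job, which is what makes this structure a natural fit for personalized recommendations.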

Journey to Cloud

LinkedIn has been using Azure for some of its operations, such as Microsoft’s Content Moderator APIs, part of Cognitive Services, for detecting inappropriate content, and Text Analytics APIs for machine translation. The choice to use Azure’s Cognitive Services is an important point: LinkedIn has proven over the years, through numerous projects built and open-sourced by its engineers, that the company is not averse to tackling a problem from the root and developing the necessary solution. There is a trade-off between the developer effort that engineers at LinkedIn would put in and the cost of using a service from a provider. Beyond this trade-off, however, comes the question of reliability and scale, especially for a company like LinkedIn, unique in the amount of data and number of users its platform serves.

Recently, LinkedIn’s Senior VP of Engineering, Mohak Shroff, announced that the company will be making the switch to the public cloud under the umbrella of Azure. This is a critical move, and a deliberate one, according to Shroff. After periodically weighing the pros and cons of the public cloud from multiple angles, ranging from applicability to the bare economics, the company recently decided that it would be a worthy next step.

These considerations are significant. The decisions to use Azure services show the company’s trust in Azure to handle some of the data operations on the scale of LinkedIn.

To learn more about what the engineers over at LinkedIn are building to connect the world’s professionals, check out the company’s blog.

Gokhan Simsek

Eindhoven, The Netherlands

Gokhan is a computer science graduate, currently pursuing an MSc degree in Data Science at Eindhoven University of Technology. He’s interested in big data, NLP, and machine learning.