What’s Behind Lyft’s Choices in Big Data Tech

Article Thursday, May 30 2019

This post was originally written by Alex Woodie at Datanami. Reposted with permission.

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in terms of architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-prem system.

Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing information about their computer infrastructure. They both share an ethos of openness with regard to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source.

So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connecting drivers and riders through space and time being a big one – it was natural for them to take a peek at how their equals at Uber tackled the problem.

The first notable difference between the firms is Lyft’s decision to park its data in cloud object stores, whereas Uber invested in Apache Hadoop infrastructure. This was one of the topics that Lyft Data Engineer Li Gao discussed in a recent interview with Software Engineering Daily’s Jeff Meyerson, who published the talk in a podcast last month.

“If you went to any of the Uber talks in the past, they talk about a lot of the pain points when they’re dealing with large HDFS deployment,” Gao tells Meyerson. “So we bypassed those issues to the cloud storage.”

Lyft stores much of its data – including raw data and normalized data – in AWS S3, and it uses AWS EC2 to process the data. Its AWS bills are quite large, as you might expect from a company as large as Lyft, which operates in 600 cities and had revenues of $2.1 billion last year. Lyft, which went public earlier this year, reported in its S-1 filing that it pays AWS a minimum of $8 million per month. In fact, its contract with AWS calls for it to spend at least $300 million through 2021.

Despite being tied to AWS, Lyft doesn’t use AWS services exclusively. And while it started out using a lot of higher-level services in AWS, including the Kinesis message bus and Redshift data warehouse, it has since migrated much of its data processing and analytics infrastructure to other components (all still running within AWS and subject to AWS billing, of course).

Lyft was a big Redshift user in the early days, but when it started to run into scalability issues due to the tight coupling of compute and storage in the 2016 timeframe, it migrated from Redshift to Apache Hive, Gao says. AWS has addressed the tight coupling between storage and processing, but in 2016 it was still an issue, Gao says.

Today, Lyft uses Hive for many of the big ETL workloads that serve data for business intelligence dashboards for executives and business analysts. In 2018, Lyft upgraded to a faster version of Hive that increased the size and number of ETL jobs that it could run, Gao says. It also introduced Presto (which was created as a successor to Hive at Facebook, which created both) to provide a more powerful query engine. Presto’s advantage is how easily it powers ad hoc analytics that involve joining a lot of different data, Gao says.

Lyft has also made big bets on Apache Spark, which serves multiple use cases at Lyft, including ETL batch processing and also training machine learning models. The company also makes use of Druid, a column-oriented in-memory OLAP data store that excels at performing drill-downs and roll-ups over a large set of high dimensional data. You will also find in Lyft’s big data kit copies of Apache Superset, an open source BI Web application that features a SQL editor with interactive querying.
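
To make the terms concrete, here is a rough sketch in plain Python of what a roll-up and a drill-down mean. It does not use Druid or its query API. The ride records, the dimension names, and the aggregate helper are all hypothetical, chosen only to show that a roll-up sums over fewer dimensions while a drill-down adds a dimension back.

```python
from collections import defaultdict

# Hypothetical ride events: (city, hour, ride_type, fare_usd)
RIDES = [
    ("SF", 8, "shared", 7.5),
    ("SF", 8, "standard", 12.0),
    ("SF", 9, "standard", 15.0),
    ("NYC", 8, "standard", 20.0),
]
DIMENSIONS = ("city", "hour", "ride_type")


def aggregate(rows, group_by):
    """Sum fares grouped by the chosen dimensions."""
    # Translate dimension names into column positions.
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the fare is the last column
    return dict(totals)
```

Calling aggregate(RIDES, ["city"]) rolls the data up to one total per city, while aggregate(RIDES, ["city", "hour"]) drills down into each city by hour.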

“That’s where we’ve been,” Gao says in the podcast. “We have batch processing running mostly on Hive and Spark and interactive system running on Presto serve those interactive datasets. We do also have very fast metrics data storage using Druid, and we…also enriched our user interface tools using Superset mostly to start many of our internal dashboards and metrics virtualization as well.”

Lyft engineers build many data pipelines that route incoming data to many of these data marts and processing engines. Gao says Lyft is currently building a connection between Druid and Presto that will enable developers to utilize the relative advantages of both of those query engines and surface insights within Superset. There’s also a smattering of relational databases, including Postgres and MySQL, that serve other needs.

On the data science front, Lyft is a big user of Jupyter, a popular notebook-style interface for working with data and machine learning algorithms. It’s primarily a Python shop, and so it avails itself of the PySpark library.

Much of the data engineering and data science work is connected via Apache Airflow, a data workflow tool that was originally developed at Airbnb, another disruptive tech firm headquartered in the Bay Area that has enmeshed itself in the open source big data software ecosystem. Airflow creates repeatable data engineering or data science workflows that can be executed atop Kubernetes, the container orchestration tool that emanated from Google, and which Lyft is also using in the AWS cloud.

Gao says Airflow provides a great abstraction layer for Lyft’s data engineers and data scientists to bring all of these various components – including AWS Lambda functions – together in a reusable manner. “Through Airflow, you can orchestrate different unit work through a repeatable DAG,” he says. “It can define an SLA and a dependency even between different DAG rounds, which is extremely helpful for us to have a predictable result.”
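
To show what orchestrating units of work through a repeatable DAG looks like, here is a minimal sketch in plain Python rather than Airflow's own API. The task names, the PIPELINE mapping, and the run_dag and make_tasks helpers are hypothetical. The point is only that each task runs after the tasks it depends on, and that the same graph can be executed again and again.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_rides": set(),
    "extract_drivers": set(),
    "normalize": {"extract_rides", "extract_drivers"},
    "train_model": {"normalize"},
    "publish_dashboard": {"normalize"},
}


def run_dag(dag, tasks):
    """Run every task after its dependencies; return the execution order."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()  # execute the unit of work
        order.append(name)
    return order


def make_tasks(log):
    """Build stand-in tasks that only record their own name when run."""
    return {name: (lambda n=name: log.append(n)) for name in PIPELINE}
```

A real scheduler adds retries, SLAs, and cross-run dependencies on top of this ordering, none of which is modeled here.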

Lyft has also made big investments in real-time stream data processing. Since the Lyft service is time-dependent, it’s not surprising to learn that the company uses a mixture of Apache Kafka, Apache Flink, and Spark to build streaming services. It has also played around a bit with Apache Beam.

Lyft originally built its real-time data infrastructure atop Amazon Kinesis, but it migrated to Kafka when it discovered scalability limitations in Kinesis, Gao says. “The Kinesis behavior is lagging or inconsistent from what I heard from our streaming team,” he says. “So Kafka has more predictable performance when you have a large number of consumers.”

The company today gets its enterprise Kafka service via a contract with Confluent, which is the company that helped to commercialize Kafka as it emerged from LinkedIn. Today, Confluent provides Kafka-as-a-service on the AWS cloud, as well as other clouds. “Kafka is a critical piece for us to move data, the real-time metrics data or an event data around,” Gao says.

Kafka is used extensively at Lyft. In addition to streaming data that will eventually be normalized and presented for analysis in a data mart or data warehouse, Kafka has a hand in touching data that will be used for machine learning. But instead of using Kafka Streams, the real-time data processing element that is also open source, Lyft has favored Flink to program and power much of its real-time data processing needs.

Lyft chose Flink over Kafka Streams because it offers more powerful transformations, Gao says. “We do have a lot of geo-based model, session-based models,” he says. “We have many different machine learning-trained models that are rolling inside those Flink pipelines. That is much easier to maintain, enhance, and enrich through Flink than through Kafka Streams.”

For some real-time jobs at Lyft, however, Spark is the better solution, Gao says. One reason is that many of the ETL pipelines are evolving to include some machine learning capabilities. And since Lyft prefers Python for machine learning over Java, and Python is better supported within Spark than Flink, the company is basing a lot of those workloads on Spark, Gao says.

What’s more, Spark is closer to TensorFlow, which the company is also exploring for deep learning capabilities. “Spark also has a very good integration, especially the newer version, 2.3, 2.4 versions, with TensorFlow,” he says. “We do have a few use cases [where] they need to even embed certain deep learning models inside this ETL pipeline to work with a dataset, enhance the datasets or classify datasets before they save the result into another…Hive dataset in the data lake. So using Spark can unify those processes in a single platform.”

Lyft uses a mix of different tools to orchestrate all of these data services, including YARN and Kubernetes. Gao says the company has nearly 100 small Spark clusters running on AWS that rely on YARN as the scheduler. The company is also looking at getting Spark to run on Kubernetes. It has made some progress in that regard, but it’s an ongoing development project.

“Spark on Kubernetes is still in this very early stage,” Gao says. “We still see there’s a huge gap of [how] we would like to see how Spark perform and what the Spark 2 actually is doing.”

YARN, which originated in the Apache Hadoop project, actually does some things better than Kubernetes, Gao says. “Spark in Kubernetes does not have equivalent of the scheduler – we call it customizable scheduler – that YARN provides,” he says. “YARN has this different schedule called fair scheduler, or capacity scheduler, or a combination of both, to serve the different use cases to maximize both the job performance and the cluster utilization. In Kubernetes, we don’t have any of those. So we have to build our own.”
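
To illustrate what a fair scheduler is aiming for, the sketch below computes a max-min fair split of a fixed number of cores among competing jobs. It is a simplified stand-in, not YARN's actual Fair Scheduler or anything Lyft has built. The fair_share function, the job names, and the numbers are hypothetical.

```python
def fair_share(capacity, demands):
    """Max-min fair allocation of integer capacity units across job demands."""
    allocation = {job: 0 for job in demands}
    remaining = capacity
    unsatisfied = {job for job, need in demands.items() if need > 0}
    while remaining > 0 and unsatisfied:
        # Split what is left evenly among jobs that still want more.
        share = max(remaining // len(unsatisfied), 1)
        for job in sorted(unsatisfied):
            # Never grant more than the job asked for or than is left.
            grant = min(share, demands[job] - allocation[job], remaining)
            allocation[job] += grant
            remaining -= grant
            if remaining == 0:
                break
        # Jobs whose demand is met drop out; their unused share is recycled.
        unsatisfied = {j for j in unsatisfied if allocation[j] < demands[j]}
    return allocation
```

With 100 cores and demands of 20, 70, and 50, the small job gets everything it asked for and the two larger jobs split the rest evenly. A capacity scheduler would instead reserve a guaranteed slice per queue, which this sketch does not model.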

Software Engineering Daily’s Meyerson pointed out to Gao that Uber has standardized much of its on-premise infrastructure on Mesos, and that it has developed its own scheduler that is general enough that it can be used with Kubernetes. “So you may have your scheduling problems solved by the Uber team,” he says.

Gao says he’s familiar with the Uber product, called Peloton, but doubts that it will work in Lyft’s multi-tenant environment. Instead, Lyft is looking forward to the day when the best of all three schedulers – Mesos, Kubernetes, and YARN – can be used at the same time.

And while YARN has some advantages, it’s clear Kubernetes has the momentum, including over Mesos, which is another differentiator between Lyft and Uber (besides the stylish mustache on Lyft cars). It’s another reason why you don’t want to be the first one at the big data party.

“When we started looking at container orchestration, it was already 2018,” Gao says. “So when we started looking at Kubernetes, it’s already the dominant…container orchestration engine in the open source world.”

To hear a recording of Software Engineering Daily’s interview with Li Gao, head on over to softwareengineeringdaily.com/2019/04/29/lyfts-data-platform-with-li-gao/.