Lyft’s Data Platform with Li Gao
FindCollabs Hackathon #1 has ended! Congrats to ARhythm, Kitspace, and Rivaly for winning 1st, 2nd, and 3rd place ($4,000, $1000, and a set of SE Daily hoodies, respectively). The most valuable feedback award and the most helpful community member award both go to Vynce Montgomery, who will receive both the SE Daily Towel and the SE Daily Old School Bucket Hat
Lyft generates petabytes of data. Driver and rider behavior, pricing information, the movement of cars through space; all of this data is received by Lyft’s backend services, buffered into Kafka queues, and processed by various stream processing systems.
Lyft moves the high volumes of data into a data lake for different users throughout the company to use offline. Machine learning jobs, batch jobs, streaming jobs and materialized databases can be created on top of that data lake. Druid and Superset are used for operational analytics and dashboarding.
Li Gao is a data engineer at Lyft. He joins the show to explore the different aspects of Lyft’s data platform. We also talk about the tradeoffs of streaming frameworks, and how to manage machine learning infrastructure. This episode is a great companion to our show about Uber’s data platform, and illustrates some fundamental differences in how the two ridesharing companies operate.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.