Uber’s Data Platform with Zhenxiao Luo

When a user takes a ride on Uber, the app on the user’s phone communicates with Uber’s backend infrastructure, which writes to a database that maintains the state of that user’s activity. This database is known as a transactional database, or an “OLTP” (online transaction processing) database. Every active rider, driver, and UberEATS restaurant writes data to the transactional data store.

Periodically, that data is copied from the transactional data system to a different data storage system, where it can be queried for large-scale data analysis. For example, if a data scientist at Uber wants to get the average number of miles that a given user rode in February, that data scientist would issue a query to the analytical data cluster.
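To make the example query concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the analytical cluster. The `trips` table, its column names, the sample rows, and the reading of "average miles" as a per-trip average are all assumptions made for illustration; none of this reflects Uber's actual schema.

```python
import sqlite3

# Hypothetical sample data: (user_id, trip_date, miles).
SAMPLE_TRIPS = [
    ("u1", "2018-02-03", 4.0),
    ("u1", "2018-02-17", 6.0),
    ("u1", "2018-03-01", 10.0),  # March trip, excluded by the filter
    ("u2", "2018-02-05", 3.0),   # different user, excluded by the filter
]

def average_miles_in_february(rows, user_id):
    """Return the average miles per trip for one user across February trips."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (user_id TEXT, trip_date TEXT, miles REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    (average,) = conn.execute(
        "SELECT AVG(miles) FROM trips "
        "WHERE user_id = ? AND strftime('%m', trip_date) = '02'",
        (user_id,),
    ).fetchone()
    conn.close()
    return average  # None when the user has no February trips

print(average_miles_in_february(SAMPLE_TRIPS, "u1"))
```

The same SQL shape would apply on an analytical cluster; only the engine that executes it changes.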

Uber uses the Hadoop Distributed File System (HDFS) to store analytical data. On this file system, Uber keeps a version history of all of the company’s useful historical data: trip history, rider activity, driver activity, and every other data point that is in the transactional database, but in a file format that is easier to query for large-scale processing. This file format is known as Parquet.
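Parquet is a columnar format, which is a large part of why it suits large-scale queries: an aggregate over one field only needs to read that field's values. The sketch below illustrates the idea in plain Python by pivoting row-oriented records into columns. It is a simplified picture, not Parquet's on-disk layout (which adds row groups, encodings, and compression), and the record fields are hypothetical.

```python
# Hypothetical row-oriented records, as a transactional store might hold them.
ROWS = [
    {"user_id": "u1", "city": "SF", "miles": 4.0},
    {"user_id": "u1", "city": "SF", "miles": 6.0},
    {"user_id": "u2", "city": "NYC", "miles": 3.0},
]

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

COLUMNS = to_columns(ROWS)

# An aggregate over one field touches only that column's values.
total_miles = sum(COLUMNS["miles"])
print(total_miles)
```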

Data scientists, machine learning engineers, and real-time application developers all depend on the massive quantities of data that are stored in these Parquet files on Uber’s HDFS cluster. To simplify the access of that data by many different clients, Uber uses Presto, an analytical query engine originally built at Facebook.

Presto translates SQL queries into whatever query language is necessary to access the underlying storage medium, whether that storage system is an Elasticsearch cluster, a set of Parquet files, or a relational database. Presto is useful because it simplifies the relationship between data engineers and the application developers who are building on top of the data engineering infrastructure.
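The idea of one query interface over several storage systems can be sketched with a toy connector abstraction. This is not Presto's actual connector interface (Presto is written in Java and its real interface is far richer); the class names, the `scan` method, the `select_where` helper, and the simplified `catalog.table` naming are all hypothetical and chosen only to illustrate the dispatch pattern.

```python
import sqlite3

class ListConnector:
    """Hypothetical connector over in-memory rows (a stand-in for files)."""
    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        return iter(self.tables[table])

class SqliteConnector:
    """Hypothetical connector over a relational database."""
    def __init__(self, conn):
        self.conn = conn
        self.conn.row_factory = sqlite3.Row

    def scan(self, table):
        # The table name is interpolated for brevity; unsafe for untrusted input.
        cursor = self.conn.execute(f'SELECT * FROM "{table}"')
        return (dict(row) for row in cursor)

def select_where(catalogs, qualified_name, predicate):
    """Route a 'catalog.table' name to its connector and filter the rows."""
    catalog, table = qualified_name.split(".", 1)
    return [row for row in catalogs[catalog].scan(table) if predicate(row)]

# Demo setup with made-up data in two different backends.
_conn = sqlite3.connect(":memory:")
_conn.execute("CREATE TABLE riders (id INTEGER, city TEXT)")
_conn.executemany("INSERT INTO riders VALUES (?, ?)", [(1, "SF"), (2, "NYC")])

CATALOGS = {
    "files": ListConnector({"trips": [
        {"user_id": "u1", "miles": 4.0},
        {"user_id": "u1", "miles": 6.0},
    ]}),
    "db": SqliteConnector(_conn),
}

long_trips = select_where(CATALOGS, "files.trips", lambda r: r["miles"] > 5)
sf_riders = select_where(CATALOGS, "db.riders", lambda r: r["city"] == "SF")
```

The caller uses the same function for both backends; only the connector knows how to reach the underlying storage, which is the property the paragraph above attributes to Presto.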

In today’s show, Zhenxiao Luo joins to give an end-to-end description of Uber’s data infrastructure–from the ingest point of the OLTP database to the OLAP data storage system on HDFS, to the wide range of data systems and applications that run on top of that OLAP data.


Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.


Today’s sponsor is Datadog, a monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog provides seamless integrations with more than 200 technologies, including AWS, Postgres, MySQL, and Docker, so you can start collecting and visualizing performance metrics quickly. Distributed tracing and APM provide end-to-end visibility into requests wherever they go, across hosts, containers, and service boundaries. With rich dashboards, algorithmic alerts, and collaboration tools, Datadog provides your team with the tools they need to quickly troubleshoot and optimize modern applications. See for yourself – start a 14-day free trial today and Datadog will send you a free T-shirt! softwareengineeringdaily.com/datadog 

Segment allows us to gather customer data from anywhere and send that data to any analytics tool. Segment is the customer data infrastructure that has saved us from writing duplicate code across all of the different platforms that we want to analyze. And if you’re using cloud apps such as Mailchimp, Marketo, Intercom, AppNexus, or Zendesk, you can integrate with all of these different tools and centralize your customer data in one place with Segment. To get a free 90-day trial, sign up for Segment at segment.com and enter SEDaily in the “How did you hear about us?” box during signup.

Drastically cut the time it takes you to debug. Rookout Rapid Production Debugging allows developers to track down issues in production without any additional coding, re-deployment, or restarting the app. Rookout is modern debugging. Insert Rookout ‘non-breaking breakpoints’ to immediately collect any piece of data from your live code and pipeline it anywhere, even if you never thought about it before or didn’t create instrumentation to collect it. Go to rookout.com/sedaily to start a free trial and see how much debugging time you can save.