Facebook Data Infrastructure with Dhruba Borthakur
Facebook generates high volumes of data at a rapid pace.
Dhruba Borthakur joined Facebook in 2008 to work on data infrastructure. His early projects at Facebook were around Hadoop, the distributed file system and MapReduce computation platform that laid the foundation for the “big data” movement.
At the time, Facebook was generating as much data as any other startup, and the company needed to stay at the leading edge of scalability techniques for its Hadoop Distributed File System (HDFS) cluster.
Traditionally, Hadoop managed its file system by coordinating the data nodes through a single master node. At Facebook, the HDFS cluster grew to thousands of data nodes, which was too much for a single master node to handle. Dhruba helped implement redundancy at the master node to create a more resilient system.
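To make the idea of master redundancy concrete, here is a minimal sketch of an active/standby pair. It is a toy model, not HDFS code and not a description of what Facebook built: the class names (`MasterNode`, `Cluster`), the `block_map` field, and the node names are all hypothetical, and the failover check is deliberately simplistic.

```python
from dataclasses import dataclass, field


@dataclass
class MasterNode:
    """Hypothetical master that maps file paths to the data nodes holding them."""
    name: str
    alive: bool = True
    block_map: dict = field(default_factory=dict)


class Cluster:
    """Toy active/standby pair: metadata is mirrored, lookups use the active master."""

    def __init__(self, active: MasterNode, standby: MasterNode):
        self.active = active
        self.standby = standby

    def record_file(self, path: str, data_nodes: list) -> None:
        # Mirror every metadata change so the standby is ready to take over.
        self.active.block_map[path] = list(data_nodes)
        self.standby.block_map[path] = list(data_nodes)

    def locate(self, path: str) -> list:
        # Assumption: a simple liveness flag stands in for real failure detection.
        if not self.active.alive:
            self.active, self.standby = self.standby, self.active
        return self.active.block_map[path]


cluster = Cluster(MasterNode("master-a"), MasterNode("master-b"))
cluster.record_file("/logs/day1", ["dn-17", "dn-42", "dn-99"])
cluster.active.alive = False  # simulate losing the active master
locations = cluster.locate("/logs/day1")
```

With a single master, the simulated failure would make the file unreachable. Here the lookup still succeeds because the standby holds a mirrored copy of the metadata.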
The early days of the big data movement were focused on batch processing. A company like Facebook would gather large amounts of data into databases and HDFS, and run offline analytics workloads to produce reports on an hourly, daily, or weekly basis.
Over time, data infrastructure has moved closer to a “real time” processing model. It no longer supports only offline batch reporting; it also supports machine learning jobs that need to run more frequently. These jobs need results with lower latency, and have driven the adoption of in-memory stream processing systems like Spark and Flink.
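The difference between the two models can be sketched in a few lines. This is an illustration only, using made-up events and hypothetical names (`batch_counts`, `StreamingCounts`); it does not use Spark or Flink and is not how those systems are implemented.

```python
from collections import Counter

# Made-up (user, action) events for illustration.
events = [("alice", "like"), ("bob", "share"), ("alice", "share"), ("carol", "like")]


def batch_counts(all_events):
    """Batch style: scan the full stored data set, then emit one report."""
    return Counter(action for _, action in all_events)


class StreamingCounts:
    """Streaming style: keep state in memory and update it as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, action = event
        self.counts[action] += 1
        # The result is up to date after every single event.
        return dict(self.counts)


report = batch_counts(events)
stream = StreamingCounts()
snapshots = [stream.on_event(e) for e in events]
```

Both approaches arrive at the same final counts. The batch job produces its answer only after it has read everything, while the streaming version has a usable answer after each event, which is what lower-latency jobs need.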
Dhruba joins the show to discuss his time at Facebook building data infrastructure. He takes us through the major projects he worked on, including the early Hadoop infrastructure, the refactoring of online user workloads to be more “pull”-based than “push”-based, and RocksDB, a storage engine he helped create at Facebook.
Today, Dhruba is the CTO and co-founder of Rockset, a company that builds data infrastructure and database APIs on top of RocksDB. Rockset is building infrastructure for modern technology companies–many of which are facing problems that bear significant resemblance to the ones Facebook encountered as it scaled.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.