Large-scale data analysis was pioneered by Google, with the MapReduce paper. Since then, Google’s approach to analytics has evolved rapidly, marked by papers such as Dataflow and Dremel.
Dremel combined a column-oriented, distributed file system with a novel way of processing queries. A single Dremel query is distributed across a tree of servers: it starts at the root server, fans out to the intermediate servers, and ends at the leaf servers, which talk to the file system. Once the leaves pull the data from the file system, the results propagate back up to the root server and are shuffled along the way so that the root server receives a sorted response.
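To make the tree of servers concrete, here is a minimal Python sketch of a query that counts rows per country. It is an illustration under stated assumptions, not Dremel's actual implementation: the `PARTITIONS` data, the `country` column, the function names, and the `fanout` parameter are all hypothetical, and the shuffle step is simplified to merging partial counts at each level, with the final sort done at the root.

```python
from collections import Counter

# Hypothetical column-oriented table, split into partitions that the
# leaf servers read from storage. Only the one needed column is scanned.
PARTITIONS = [
    {"country": ["US", "DE", "US"]},
    {"country": ["FR", "US"]},
    {"country": ["DE", "DE"]},
    {"country": ["FR"]},
]


def leaf(partition):
    """Leaf server: scan one column of one partition, return a partial count."""
    return Counter(partition["country"])


def intermediate(partitions):
    """Intermediate server: fan out to leaves and merge their partial counts."""
    merged = Counter()
    for partition in partitions:
        merged += leaf(partition)
    return merged


def root(partitions, fanout=2):
    """Root server: split work across intermediates, merge, then sort by key."""
    merged = Counter()
    for start in range(0, len(partitions), fanout):
        merged += intermediate(partitions[start:start + fanout])
    return sorted(merged.items())


result = root(PARTITIONS)
print(result)
```

In a real system the leaves and intermediate servers would run in parallel on separate machines; the sequential loops here only show how partial results flow back up the tree to the root.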
When Google started turning its internal services into customer-facing cloud products, the effort to productize Dremel began, and BigQuery was born. Jordan Tigani is an engineering lead who works on BigQuery, and he joins the show to discuss the evolution of the data warehouse.
Large-scale distributed queries can still take a long time, but they get faster every year. Results that required a nightly Hadoop job 10 years ago can now be viewed in a frequently updated, user-facing dashboard. Power users of BigQuery cite its speed and its query interface as two of its most valuable differentiating features. As the job of a large-scale data analyst becomes less technically intensive, tools like BigQuery will continue to rise in popularity.
We have done some great shows about Google papers like Spanner, Dremel, and Dataflow. To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. In other podcast players, you can only access the most recent 100 episodes. With these apps, we are building a new way to consume content about software engineering. They are open-sourced at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love to get your help.
Shout out to today’s featured contributor Shreyans Sheth. Shreyans has worked on the Software Engineering Daily search API, and has also helped us understand open source best practices, which we are still learning. Thanks again Shreyans for your work.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.