Dremio with Tomer Shiran

The MapReduce paper was published by Google in 2004. MapReduce is a programming model that describes how to do large-scale data processing on large clusters of commodity hardware.
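As a rough illustration of the map and reduce steps, here is a single-machine word count sketch in Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example and do not come from the paper or from Hadoop. A real cluster would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "big clusters"])` counts each word across both inputs. Each `map_phase` call is independent of the others, and so is each `reduce_phase` call, which is what lets the work be spread over many servers.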

The MapReduce paper marked the beginning of the “big data” movement. The Hadoop project is an open source implementation of the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce, as long as they had significant server operations knowledge and a rack of commodity servers.

Hadoop was first deployed at companies whose internal engineering teams could recognize its importance and implement it: companies like Yahoo and Microsoft. Word quickly spread about the leverage Hadoop could provide.

Around this time, every large company was waking up to the fact that they had tons of data and didn’t know how to take advantage of it. Billion-dollar corporations in areas like banking, insurance, manufacturing, and agriculture all wanted this amazing new way of looking at their data. But these companies did not have the engineering expertise to deploy Hadoop clusters.

Three big companies were formed to help bring Hadoop to large enterprises: Cloudera, Hortonworks, and MapR. Each of these companies worked with hundreds of large enterprise clients to build out their Hadoop clusters and help them access their data. Tomer Shiran spent five years at MapR, seeing the data problems of these large enterprises and observing how much value could be created by solving these data problems.

By 2015, eleven years had passed since the MapReduce paper was first published, and companies were still having data problems. Tomer started working on Dremio, a company that stayed in stealth for another two years. I interviewed Tomer two years ago, when he still could not say much about what Dremio was doing. We talked about Apache Drill, an open-source project related to what Dremio eventually built.

Earlier this year, two of Tomer’s colleagues, Jacques Nadeau and Julien Le Dem, came on to discuss columnar data storage and interoperability. What I took away from that conversation was that today, data within an average enterprise is accessible, but the different formats are a problem. Some data is in MySQL, some is in Amazon S3, some is in Elasticsearch, and some is on HDFS stored in Parquet files. Different teams will set up different BI tools and charts that read from a specific silo of data.

At the lowest level, the different data formats are incompatible–you have to transform MySQL data in order to merge it with S3 data. On top of that, engineers doing data science work are using Spark, Pandas, and other tools that pull lots of data into memory–if the in-memory formats are not compatible, the data teams can’t get the most out of their work. On top of THAT, at the highest level, data analysts are working with different data analysis tools, so there is even more siloing.
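To make the lowest-level problem concrete, here is a small sketch of the kind of hand-written transformation needed to merge relational rows with records from an object store. Everything in it is an assumption for illustration: an in-memory SQLite table stands in for MySQL, a JSON-lines string stands in for a file in S3, and the table, field, and function names are hypothetical. It is not how Dremio does it.

```python
import json
import sqlite3


def load_relational_rows():
    """Read rows from a relational table (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
    )
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    conn.close()
    # Transform tuples into dicts so they share a shape with the JSON records.
    return [{"customer_id": row[0], "name": row[1]} for row in rows]


def load_object_records():
    """Parse JSON-lines text (standing in for an object fetched from S3)."""
    raw = '{"customer_id": 1, "clicks": 40}\n{"customer_id": 2, "clicks": 7}\n'
    return [json.loads(line) for line in raw.splitlines() if line]


def merge_on_customer(relational, events):
    """Join the two sources on customer_id, defaulting clicks to 0."""
    by_id = {event["customer_id"]: event for event in events}
    return [
        {**row, "clicks": by_id.get(row["customer_id"], {}).get("clicks", 0)}
        for row in relational
    ]


merged = merge_on_customer(load_relational_rows(), load_object_records())
```

Even in this toy case, each source needs its own loading code and its own conversion into a shared shape before the join can happen. That per-source glue is the cost the paragraph above describes.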

Now I understand why Dremio took two years to bring to market.

They are trying to solve data interoperability by making it easy to transform data sets between different formats. They are trying to solve data access speed by creating a sophisticated caching system. And they are trying to improve the effectiveness of the data analysts by providing the right abstractions for someone who is not a software engineer to study the different data sets across an organization.

Dremio is an exciting project because it is rare to see a pure software company put so many years into up-front stealth product development. After talking to Tomer in this conversation, I’m looking forward to seeing Dremio come to market. It was fascinating to hear him talk about how data engineering has evolved to today.

Some of the best episodes of Software Engineering Daily cover the history of data engineering, including an interview with Mike Cafarella, the co-creator of Hadoop, and another episode called “The History of Hadoop” in which we explored how Hadoop made it from a Google research paper into a multibillion-dollar industry.

To find these old episodes, you can download the Software Engineering Daily app for iOS and for Android. With these apps, we are building a new way to consume content about software engineering. They are open-sourced at github.com/softwareengineeringdaily. If you are looking for an open source project to get involved with, we would love to get your help.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.

Sponsors


You are programming a new service for your users. Or, you are hacking on a side project. Whatever you are building, you need to send email. For sending email, developers use SendGrid. SendGrid is the API for email, trusted by developers. Send transactional emails through the SendGrid API. Build marketing campaigns with a beautiful interface for crafting the perfect email. SendGrid is used by Uber, Airbnb, and Spotify–but anybody can start for free and get 100 emails per day. Just go to SendGrid.com/sedaily to get started. Your email is important–make sure it gets delivered properly, with SendGrid, the most reliable email delivery service. Get started with 100 emails per day at SendGrid.com/sedaily.


Digital Ocean Spaces gives you simple object storage with a beautiful user interface. You need an easy way to host objects like images and videos. Your users need to upload objects like PDFs and music files. To try Digital Ocean Spaces, go to do.co/sedaily and get 2 months of Spaces plus a $10 credit to use on any other Digital Ocean products, and you get this credit even if you have been with Digital Ocean for a while. It’s a nice added bonus just for trying out Spaces. If you become a customer, the pricing is simple: $5 per month, which includes 250GB of storage and 1TB of outbound bandwidth. There are no costs per request, and additional storage is priced at the lowest rate available: $0.01 per GB transferred and $0.02 per GB stored. There won’t be any surprises on your bill. Digital Ocean simplifies the cloud. They look for every opportunity to remove friction from a developer’s experience. I love it, and I think you will too. Check it out at do.co/sedaily.


Incapsula can protect your API servers and microservices from responding to unwanted requests. To try Incapsula for yourself, go to incapsula.com/2017podcasts and get a free enterprise trial of Incapsula. Incapsula’s API gives you control over the security and performance of your application, whether you have a complex microservices architecture or a WordPress site like Software Engineering Daily. Incapsula has a global network of over 30 data centers that optimize routing and cache your content. The same network of data centers that filters your content for attackers also operates as a CDN, speeding up your application. To try Incapsula today, go to incapsula.com/2017podcasts and check it out. Thanks again, Incapsula.


Simplify continuous delivery with GoCD, the on-premise, open source, continuous delivery tool by ThoughtWorks. With GoCD, you can easily model complex deployment workflows using pipelines and visualize them end-to-end with the Value Stream Map. You get complete visibility into and control of your company’s deployments. At gocd.org/sedaily, find out how to bring continuous delivery to your teams. Say goodbye to deployment panic and hello to consistent, predictable deliveries. Visit gocd.org/sedaily to learn more about GoCD. Commercial support and enterprise add-ons, including disaster recovery, are available.