http://traffic.libsyn.com/sedaily/Dremio.mp3

The MapReduce paper was published by Google in 2004. MapReduce is a programming model for doing large-scale data processing on clusters of commodity hardware. The MapReduce paper marked the beginning of the “big data” movement. The Hadoop project is an open source implementation of the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce,
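The programming model described above can be sketched in a few lines. This is a hypothetical single-machine illustration of the map, shuffle, and reduce phases applied to word counting (the canonical example from the paper), not the distributed implementation Hadoop provides:

```python
from collections import defaultdict
from itertools import chain

def map_words(document):
    """Map phase: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: combine the values for one key (here, sum the counts)."""
    return (key, sum(values))

def word_count(documents):
    pairs = chain.from_iterable(map_words(d) for d in documents)
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(word_count(docs))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```

In a real MapReduce cluster the map and reduce phases run in parallel across many machines, and the shuffle moves data over the network; the phases themselves are the same.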
http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3

“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’” Memory is king. The cost of memory and the cost of disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio
From Eric Sammer’s answer via Quora: At Cloudera we regularly work on open source code right alongside our competitors. I tend to joke that the engineers at our competitors are effectively our coworkers. Since the question specifically asks about how one deals with code (rather than the working relationship), I’ll focus on that. Honestly, though, that’s probably the less interesting part. In (almost) all ways, our engineers shed their
http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3

“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2006. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed, fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,
http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3

“One big transformation we’re seeing right now is the slow, agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where
“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”
“It’s a great world to be a developer in. If you’re a developer today, you have so many tools and so many options.”
“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”
“I normally try to sit together or very close to a product team or engineering team. And by doing so, I get very close to the source of all kinds of challenging problems.”
“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”
“Sometimes there’s a misconception that Genie is a job scheduling platform… Genie really represents our abstraction layer, from what our computational resources are, to our end user jobs.”
Genie is an open-source tool that provides job and resource management for the Hadoop ecosystem in the cloud.
“There are a lot more people who have the problem that Hadoop solves than there are people using Hadoop.”
Pachyderm is a containerized data analytics platform that seeks to replace Hadoop.
Data science is saving and improving lives by leveraging sensor data and machine learning. Pivotal makes software platforms and database products to enable enterprises to make use of their data.
Sarah Aerni is principal data scientist at Pivotal.
Databases Week began with a set of fundamental questions. What is a database? Every interviewee during Databases Week has given a different answer to the question of "What is a database?" — SE Daily (@software_daily) August 21, 2015

One definition: “an application component for storing and retrieving data”. All of the different database companies have this functionality, but the similarities end there. RethinkDB pushes data to the application. MemSQL is a faster, proprietary version
http://traffic.libsyn.com/sedaily/neo4j_ryan.mp3

Graph databases use graph structures for semantic queries. Ryan Boyd is a developer advocate for Neo4j, an open-source graph database.

Questions
- Why does Monsanto use graph databases?
- In a social network graph, how would you query for “people you may know”?
- What CAP tradeoffs does Neo4j make? Why isn’t BASE good enough?

Links
- Hadoop and Graph Databases for Bioinformatics
- Neo4j availability discussion (explores ZooKeeper option)
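The “people you may know” question above is the classic friends-of-friends traversal that graph databases are built to make fast. As a rough sketch of the logic (a plain-Python illustration over an adjacency list, not Neo4j’s actual Cypher query or internals), candidates are friends-of-friends who are not already direct friends, ranked by mutual-friend count:

```python
from collections import Counter

def people_you_may_know(graph, person):
    """Rank friends-of-friends of `person` by number of mutual friends."""
    friends = graph.get(person, set())
    counts = Counter()
    for friend in friends:
        for fof in graph.get(friend, set()):
            # Skip the person themselves and anyone already a friend.
            if fof != person and fof not in friends:
                counts[fof] += 1  # one more mutual friend in common
    return [name for name, _ in counts.most_common()]

# Hypothetical social graph as an adjacency list.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}
print(people_you_may_know(graph, "alice"))  # ['dave', 'erin']
```

A graph database expresses this as a two-hop pattern match and can traverse relationships without the join overhead a relational schema would incur.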
Fundamental questions as big as data itself loomed at the beginning of Big Data Week. Some answers: How do customers of multiple managed big data companies deal with the heterogeneity? Confluent provides Kafka, Rocana provides ops, Databricks gives you data science, Cloudera and Hortonworks give you everything else. Each company has a proprietary layer meshed with open-source software. Generally, the more proprietary software you are running, the more you will need
http://traffic.libsyn.com/sedaily/presto_chris.mp3

Presto is a low-latency, distributed SQL query engine built for interactive analysis. Christopher Berner works on Presto at Facebook.

Questions
- Is Presto for data scientists, developers, or everyone?
- What are the problems with Hive?
- How does Hive break a query into MapReduce jobs?
- How do the clients, coordinators, and workers interact?
- Is Presto both fast and cheap?
- How does Presto tune Java to get speed
http://traffic.libsyn.com/sedaily/venkatesth_hortonworks.mp3

Hortonworks Data Platform is a managed Hadoop architecture for enterprises. Venkatesh Seetharam is a software engineer at Hortonworks. He has worked on several Apache projects, including Hadoop, Falcon, and Atlas.

Questions include:
- Will Hadoop ever be so big we will have to start over from scratch?
- What is the YARN data operating system?
- How are customers of Hortonworks dealing with numerous managed Big Data
http://traffic.libsyn.com/sedaily/fpj_zookeeper.mp3

Apache ZooKeeper enables highly reliable distributed coordination. Flavio Junqueira is a committer and PMC member of Apache ZooKeeper, and former VP of ZooKeeper.

Questions include:
- Why is master election so important in Hadoop?
- How does a new user begin working with ZooKeeper?
- How do nodes “watch” each other?
- Should ZooKeeper be used as a message queue or notification system?
- What is ZooKeeper’s place in a data center
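The master-election question above hinges on ZooKeeper’s ephemeral sequential znodes: each participant creates one, and the participant holding the lowest sequence number is the leader; when the leader’s session dies its znode vanishes and the next-lowest takes over. The following is a hypothetical in-process simulation of that recipe (the class name and methods are illustrative, not ZooKeeper’s real API):

```python
import itertools

class ElectionSim:
    """Toy simulation of ZooKeeper-style leader election."""

    def __init__(self):
        self._seq = itertools.count()
        self._znodes = {}  # sequence number -> participant name

    def join(self, name):
        """Create an 'ephemeral sequential znode'; return its sequence number."""
        seq = next(self._seq)
        self._znodes[seq] = name
        return seq

    def leave(self, seq):
        """Simulate a session expiring: the ephemeral znode is deleted."""
        self._znodes.pop(seq, None)

    def leader(self):
        """The participant with the lowest surviving sequence number leads."""
        if not self._znodes:
            return None
        return self._znodes[min(self._znodes)]

sim = ElectionSim()
a = sim.join("node-a")
b = sim.join("node-b")
c = sim.join("node-c")
print(sim.leader())   # node-a
sim.leave(a)          # node-a's session expires
print(sim.leader())   # node-b
```

In real ZooKeeper, each participant watches only the znode immediately before its own, so a leader failure notifies exactly one successor rather than stampeding the whole group.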