Tag Hadoop

Dremio with Tomer Shiran

The MapReduce paper was published by Google in 2004. MapReduce is an algorithm that describes how to do large-scale data processing on large clusters of commodity hardware. The MapReduce paper marked the beginning of the “big data” movement. The Hadoop project is an open source implementation of the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce,

Continue reading…

Alluxio and Memory-centric Distributed Storage with Haoyuan Li

“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’” Memory is king. The cost of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio

Continue reading…

Competition in the Open Source Ecosystem

From Eric Sammer’s answer via Quora: At Cloudera (company) we regularly work on open source code right alongside our competitors. I tend to joke that the engineers at our competitors are effectively our coworkers. Since the question specifically asks about how one deals with code (rather than the working relationship) I’ll focus on that. Honestly though, that’s probably the less interesting part. In (almost) all ways, our engineers shed their

Continue reading…

Hadoop: Past, Present and Future with Mike Cafarella

“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,

Continue reading…

Data Engineering at Airbnb with Maxime Beauchemin

“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where

Continue reading…

Spark in Practice with Holden Karau

“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”

Continue reading…

Apache Drill with Tomer Shiran

“It’s a great world to be a developer in. If you’re a developer today, you have so many tools and so many options.”

Continue reading…

Demystifying Stream Processing with Neha Narkhede

“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”

Continue reading…

Data Science at Spotify with Boxun Zhang

“I normally try to sit together or very close to a product team or engineering team. And by doing so, I get very close to the source of all kinds of challenging problems.”

Continue reading…

Data Engineering with David Drummond and Austin Ouyang

“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”

Continue reading…

Kudu with Todd Lipcon

“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”

Continue reading…

Netflix Genie with Tom Gianos

“Sometimes there’s a misconception that Genie is a job scheduling platform… Genie really represents our extraction layer, from what our computational resources are, to our end user jobs.”

Genie is an open-source tool that provides job and resource management for the Hadoop ecosystem in the cloud.

Continue reading…

Replacing Hadoop with Joe Doliner

“There are a lot more people who have the problem that Hadoop solves than there are people using Hadoop.”

Pachyderm is a containerized data analytics platform that seeks to replace Hadoop.

Continue reading…

Data Science at Pivotal with Sarah Aerni

Data science is saving and improving lives by leveraging sensor data and machine learning. Pivotal makes software platforms and database products to enable enterprises to make use of their data.

Sarah Aerni is principal data scientist at Pivotal.

Continue reading…

Databases: Fundamental Answers

Database Week began with a set of fundamental questions. What is a database? As SE Daily (@software_daily) tweeted on August 21, 2015: “Every interviewee during Database Week has given a different answer to the question of ‘What is a database?’” One definition: “an application component for storing and retrieving data”. All of the different database companies have this functionality. But the similarities end there. RethinkDB pushes data to the application. MemSQL is a faster, proprietary version

Continue reading…

Graph Databases with Ryan Boyd of Neo4j

Graph databases use graph structures for semantic queries. Ryan Boyd is a developer advocate for Neo4j, an open-source graph database. Questions: Why does Monsanto use graph databases? In a social network graph, how would you query for “people you may know”? What CAP tradeoffs does Neo4j make? Why isn’t BASE good enough? Links: Hadoop and Graph Databases for Bioinformatics; Neo4j availability discussion (explores ZooKeeper option)

Continue reading…

Big Data: Fundamental Answers

Fundamental questions as big as data itself loomed at the beginning of Big Data Week. Some answers: How do customers of multiple managed big data companies deal with the heterogeneity? Confluent provides Kafka, Rocana provides ops, Databricks gives you data science, Cloudera and Hortonworks give you everything else. Each company has a proprietary layer meshed with open-source software. Generally, the more proprietary software you are running, the more you will need

Continue reading…

Facebook Presto with Christopher Berner

Presto is a low-latency distributed SQL query engine built for interactive analysis. Christopher Berner works on Presto at Facebook. Questions: Is Presto for data scientists, developers, or everyone? What are the problems with Hive? How does Hive break a query into MapReduce jobs? How do the clients, coordinators, and workers interact? Is Presto both fast and cheap? How does Presto tune Java to get speed

Continue reading…

Hortonworks Data Platform with Venkatesh Seetharam

Hortonworks Data Platform is a managed Hadoop architecture for enterprises. Venkatesh Seetharam is a software engineer at Hortonworks. He has worked on several Apache projects, including Hadoop, Falcon, and Atlas. Questions include: Will Hadoop ever be so big we will have to start over from scratch? What is the YARN data operating system? How are customers of Hortonworks dealing with numerous managed Big Data

Continue reading…

Apache ZooKeeper with Flavio Junqueira

Apache ZooKeeper enables highly reliable distributed coordination. Flavio Junqueira is a committer and PMC member of Apache ZooKeeper, and former VP of ZooKeeper. Questions include: Why is master election so important in Hadoop? How does a new user begin working with ZooKeeper? How do nodes “watch” each other? Should ZooKeeper be used as a message queue or notification system? What is ZooKeeper’s place in a data center

Continue reading…
