http://traffic.libsyn.com/sedaily/Apache_Beam__Edited.mp3
Unbounded data streams create difficult challenges for our application architectures. The data never stops coming, and we are forced to assume that we will never know if or when we have seen all of our data. Some streaming systems give us the tools to deal partially with unbounded data streams, but we have to complement those streaming systems with batch processing, in a
http://traffic.libsyn.com/sedaily/database_crisis_edited_fixed.mp3
Breakthroughs in modern data research tend to come from companies like Google, Facebook, and Amazon, with projects like MapReduce, Cassandra, and Dynamo. Twenty years ago, these types of breakthroughs would have happened in academia, which leads today’s guest Peter Bailis to ask: is the academic data community having an identity crisis? Peter is an assistant professor at Stanford University, where he
http://traffic.libsyn.com/sedaily/arrow_edited_fixed.mp3
In a typical data analytics system, there are a variety of technologies interacting. HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in Python – each of these technologies has a different format for how data is represented. Serialization and deserialization between these formats cause significant latency across the overall system. Apache Arrow is a tool for improving
http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3
“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of spatiotemporal data, constantly being collected from the devices of its users. At any given time, the engineers and data scientists at Uber need to be able to query the system and understand what is going on with drivers and riders. The unique real-time engineering
http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3
“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’ ” Memory is king. The costs of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio
http://traffic.libsyn.com/sedaily/Filodb_Edited.mp3
“The world is becoming more and more interactive, and people want answers right away, so you’re seeing the rise of stream processing and real-time.” Big data is yesterday; fast data is now. FiloDB is a reactive columnar OLAP database that is built on Cassandra and Spark. Today’s guest is Evan Chan, creator of FiloDB. In our discussion today, we talk about the use cases
http://traffic.libsyn.com/sedaily/Neuroscience_Edited.mp3
“You want to take a scientist who knows a little bit of MATLAB programming and try to teach them MapReduce, and write a MapReduce program in Java to do image processing? It’s a disaster!” Apache Spark is replacing MATLAB in the domain of computational neuroscience. The constraints of running MATLAB on a single machine can’t support the demands of neuroscience, which has huge
“Benchmarks are all crap, but there are some benchmarks that are better than others.”
“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”
http://traffic.libsyn.com/sedaily/Mesos_Edited_2.mp3
Apache Mesos is an open-source cluster manager that enables resource sharing in a fine-grained manner, improving cluster utilization. Michael Hausenblas is a developer and cloud advocate with Mesosphere, which builds the Datacenter Operating System (DCOS), a distributed OS that uses Apache Mesos as its kernel. Questions: Can you give the historical context for cluster computing? How are the distributed systems needs of different
“We still need to see in the long run how much of community and industry adoption is there. Because at the end of the day, these are the single two most important things which define and determine the success of any platform.”
“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”
“If you have an architecture where you’re trying to periodically dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”
There is a need for more data scientists to make sense of the vast amounts of data we produce and store. Dataquest is an in-browser platform for learning data science that is tackling this problem.
Vik Paruchuri is the founder of Dataquest. He was previously a machine learning engineer at edX and, before that, a U.S. diplomat.
Data science is saving and improving lives by leveraging sensor data and machine learning. Pivotal makes software platforms and database products to enable enterprises to make use of their data.
Sarah Aerni is principal data scientist at Pivotal.
Fundamental questions as big as data itself loomed at the beginning of Big Data Week. Some answers: How do customers of multiple managed big data companies deal with the heterogeneity? Confluent provides Kafka, Rocana provides ops, Databricks gives you data science, Cloudera and Hortonworks give you everything else. Each company has a proprietary layer meshed with open-source software. Generally, the more proprietary software you are running, the more you will need
http://traffic.libsyn.com/sedaily/venkatesth_hortonworks.mp3
Hortonworks Data Platform is a managed Hadoop architecture for enterprises. Venkatesh Seetharam is a software engineer at Hortonworks. He has worked on several Apache projects, including Hadoop, Falcon, and Atlas. Questions include: Will Hadoop ever be so big we will have to start over from scratch? What is the YARN data operating system? How are customers of Hortonworks dealing with numerous managed Big Data
http://traffic.libsyn.com/sedaily/guozhang_kafka.mp3
Apache Kafka is a publish-subscribe messaging system rethought as a distributed commit log. Kafka serves as the central repository for data streams in a distributed system. Guozhang Wang is an engineer at Confluent, which offers a stream data platform built using Kafka. Questions include: What is a central repository for data streams? How does Kafka improve transportation between systems? How does Kafka allow for richer
http://traffic.libsyn.com/sedaily/rocana_esammer.mp3
Rocana applies big data, advanced analytics, and visualizations to dev ops in order to guide users to the root causes of problems. Eric Sammer is the co-founder and CTO of Rocana. At Cloudera, he served as an Engineering Manager responsible for tools and partner integrations. Within that role, he developed many of Cloudera’s best practices for developing large, distributed, data processing infrastructure. Questions include: Does
Sean Owen, Director, Data Science @ Cloudera, via Quora
Although people use the word in different ways, Hadoop refers to an ecosystem of projects, most of which are not processing systems at all. It contains MapReduce, which is a very batch-oriented data processing paradigm. Spark is also part of the Hadoop ecosystem, I’d say, although it can be used separately from things we would call Hadoop. Spark is a batch