“We still need to see in the long run how much of community and industry adoption is there. Because at the end of the day, these are the single two most important things which define and determine the success of any platform.”
“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”
“It’s a great time to be an engineer.”
Information retrieval and search engineering are becoming more intertwined with machine learning and natural language processing, leading to a wealth of work to be done in the field.
“Everybody that sees SQL thinks it’s ugly and dirty and they want to try and rewrite it to be better. There’s a bazillion attempts to do this – I’ve tried it several times myself. But somehow, everybody always comes back to SQL.”
“My bet is that there is going to be a big shift towards streaming technologies in the future.”
Apache Flink is an open-source framework for distributed stream and batch data processing.
“There’s not enough data scientists out there, and every company wants them to do everything. So, you really have to focus on ‘How can I be most impactful with the limited time and resources I have?’ ”
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”
“There are a lot more people who have the problem that Hadoop solves than there are people using Hadoop.”
Pachyderm is a containerized data analytics platform that seeks to replace Hadoop.
“A lot of data science teams – if you ask them what their ten most important questions are… a lot of people can’t even come up with those.”
Many companies find themselves drowning in data. The quantity of data matters far less than the right questions in the pursuit of actionable insights.
Data science is saving and improving lives by leveraging sensor data and machine learning. Pivotal makes software platforms and database products to enable enterprises to make use of their data.
Sarah Aerni is principal data scientist at Pivotal.
http://traffic.libsyn.com/sedaily/voltdb_rbetts.mp3
Streaming pipelines and in-memory analytics are difficult to support with old database systems. VoltDB provides streaming analytics with transactions.
Questions
How does VoltDB exemplify Michael Stonebraker’s thesis that one size does not fit all?
What is the difference between OLTP and Streaming?
How does VoltDB serve the common ZooKeeper-Kafka-Storm-Cassandra stack?
What trends and requirements among OLTP and OLAP systems are changing most
http://traffic.libsyn.com/sedaily/neo4j_ryan.mp3
Graph databases use graph structures for semantic queries. Ryan Boyd is a developer advocate for Neo4j, an open-source graph database.
Questions
Why does Monsanto use graph databases?
In a social network graph, how would you query for “people you may know”?
What CAP tradeoffs does Neo4j make?
Why isn’t BASE good enough?
Links
Hadoop and Graph Databases for Bioinformatics
Neo4j availability discussion (explores ZooKeeper option)
http://traffic.libsyn.com/sedaily/influxdb_pauldix.mp3
InfluxDB is an open-source time-series database. Time-series data can be used for metrics and analytics. Paul Dix is the CEO of InfluxDB.
Questions
What differentiates InfluxDB from a regular database with a timestamp on every entry?
What is the full-stack architecture of a typical user of InfluxDB?
Why are distributed time series databases so hard?
What CAP tradeoffs does InfluxDB make?
Does Go’s
http://traffic.libsyn.com/sedaily/pipelinedb_derek.mp3
PipelineDB is a streaming SQL database. Derek Nelson is the CEO of PipelineDB.
Questions
What are continuous views?
Why is PipelineDB a good fit for the Kafka+Storm+HBase-type architecture?
How does PipelineDB affect the application tier or the browser tier?
What are the latency guarantees for how long it takes raw data streams to be converted into the refined queries provided by a continuous view?
http://traffic.libsyn.com/sedaily/rethinkdb_slava.mp3
RethinkDB is an open-source database for the realtime web. RethinkDB pushes changes to the application rather than waiting for a request. Slava Akhmechet is the CEO of RethinkDB.
Questions
RethinkDB supports a “push” model rather than request handling. Why?
What are some use cases for pushing data?
What does the full-stack architecture look like when the database has push?
What did you learn from the
http://traffic.libsyn.com/sedaily/memsql_nikita_2.mp3
MemSQL is a high-performance, in-memory database that combines the horizontal scalability of distributed systems with the familiarity of SQL. Nikita Shamgunov is co-founder and CTO of MemSQL.
Questions
What types of data does a user want to keep on disk versus on an in-memory database?
How does MemSQL compare to MySQL?
How do MemSQL users leverage Apache Spark?
How does a user onboard with
http://traffic.libsyn.com/sedaily/presto_chris.mp3
Presto is a low-latency SQL query engine built for interactive analysis. Christopher Berner works on Presto at Facebook.
Questions:
Is Presto for data scientists, developers, or everyone?
What are the problems with Hive?
How does Hive break a query into MapReduce jobs?
How do the clients, coordinators, and workers interact?
Is Presto both fast and cheap?
How does Presto tune Java to get speed
http://traffic.libsyn.com/sedaily/venkatesth_hortonworks.mp3
Hortonworks Data Platform is a managed Hadoop architecture for enterprises. Venkatesh Seetharam is a software engineer at Hortonworks. He has worked on several Apache projects, including Hadoop, Falcon, and Atlas.
Questions include:
Will Hadoop ever be so big we will have to start over from scratch?
What is the YARN data operating system?
How are customers of Hortonworks dealing with numerous managed Big Data
http://traffic.libsyn.com/sedaily/guozhang_kafka.mp3
Apache Kafka is a publish-subscribe messaging system rethought as a distributed commit log. Kafka serves as the central repository for data streams in a distributed system. Guozhang Wang is an engineer at Confluent, which offers a stream data platform built using Kafka.
Questions include:
What is a central repository for data streams?
How does Kafka improve transportation between systems?
How does Kafka allow for richer
http://traffic.libsyn.com/sedaily/matei_spark.mp3
Apache Spark is a fast and general engine for big data processing. Matei Zaharia created Spark, and is the co-founder of Databricks, a company using Spark to power data science.
Questions:
What was the motivation behind creating Spark?
How much faster is a Spark job than a Hadoop job?
What is the relationship between streaming and batch processing?
Is Spark’s core advantage over Storm