Category: Data

Stream Processing with Satish Mittal

“We still need to see in the long run how much of community and industry adoption is there. Because at the end of the day, these are the single two most important things which define and determine the success of any platform.”

Continue reading…

Data Engineering with David Drummond and Austin Ouyang

“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”

Continue reading…

Taming Text with Grant Ingersoll

“It’s a great time to be an engineer.”

Information retrieval and search engineering are becoming more intertwined with machine learning and natural language processing, leading to a wealth of work to be done in the field.

Continue reading…

SQLite with D. Richard Hipp

“Everybody that sees SQL thinks it’s ugly and dirty and they want to try and rewrite it to be better. There’s a bazillion attempts to do this – I’ve tried it several times myself. But somehow, everybody always comes back to SQL.”

Continue reading…

Apache Flink with Stephan Ewen

“My bet is that there is going to be a big shift towards streaming technologies in the future.”

Apache Flink is an open-source framework for distributed stream and batch data processing.

Continue reading…

Galvanize Data Science with Jonathan Dinu and Ryan Orban

“There’s not enough data scientists out there, and every company wants them to do everything. So, you really have to focus on ‘How can I be most impactful with the limited time and resources I have?’ ”

Continue reading…

Kudu with Todd Lipcon

“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”

Continue reading…

Replacing Hadoop with Joe Doliner

“There are a lot more people who have the problem that Hadoop solves than there are people using Hadoop.”

Pachyderm is a containerized data analytics platform that seeks to replace Hadoop.

Continue reading…

Applied Data Science with Edwin Chen

“A lot of data science teams – if you ask them what their ten most important questions are… a lot of people can’t even come up with those.”

Many companies find themselves drowning in data. The quantity of data matters far less than the right questions in the pursuit of actionable insights.

Continue reading…

Data Science at Pivotal with Sarah Aerni

Data science is saving and improving lives by leveraging sensor data and machine learning. Pivotal makes software platforms and database products to enable enterprises to make use of their data.

Sarah Aerni is principal data scientist at Pivotal.

Continue reading…

Transactions and Analytics with VoltDB’s Ryan Betts

http://traffic.libsyn.com/sedaily/voltdb_rbetts.mp3

Streaming pipelines and in-memory analytics are difficult to support with old database systems. VoltDB provides streaming analytics with transactions.

Questions:
- How does VoltDB exemplify Michael Stonebraker’s thesis that one size does not fit all?
- What is the difference between OLTP and Streaming?
- How does VoltDB serve the common Zookeeper-Kafka-Storm-Cassandra stack?
- What trends and requirements among OLTP and OLAP systems are changing most…

Continue reading…

Graph Databases with Ryan Boyd of Neo4j

http://traffic.libsyn.com/sedaily/neo4j_ryan.mp3

Graph databases use graph structures for semantic queries. Ryan Boyd is a developer advocate for Neo4j, an open-source graph database.

Questions:
- Why does Monsanto use graph databases?
- In a social network graph, how would you query for “people you may know”?
- What CAP tradeoffs does Neo4j make?
- Why isn’t BASE good enough?

Links:
- Hadoop and Graph Databases for Bioinformatics
- Neo4j availability discussion (explores ZooKeeper option)

Continue reading…

Time-Series Database with InfluxDB CEO Paul Dix

http://traffic.libsyn.com/sedaily/influxdb_pauldix.mp3

InfluxDB is an open-source time-series database. Time-series data can be used for metrics and analytics. Paul Dix is the CEO of InfluxDB.

Questions:
- What differentiates InfluxDB from a regular database with a timestamp on every entry?
- What is the full-stack architecture of a typical user of InfluxDB?
- Why are distributed time-series databases so hard?
- What CAP tradeoffs does InfluxDB make?
- Does Go’s…

Continue reading…

Streaming SQL with PipelineDB CEO Derek Nelson

http://traffic.libsyn.com/sedaily/pipelinedb_derek.mp3

PipelineDB is a streaming SQL database. Derek Nelson is the CEO of PipelineDB.

Questions:
- What are continuous views?
- Why is PipelineDB a good fit for the Kafka+Storm+HBase-type architecture?
- How does PipelineDB affect the application tier or the browser tier?
- What are the latency guarantees for how long it takes raw data streams to be converted into the refined queries provided by a continuous view?

Continue reading…

Push Databases with RethinkDB CEO Slava Akhmechet

http://traffic.libsyn.com/sedaily/rethinkdb_slava.mp3

RethinkDB is an open-source database for the realtime web. RethinkDB pushes changes to the application rather than waiting for a request. Slava Akhmechet is the CEO of RethinkDB.

Questions:
- RethinkDB supports a “push” model rather than request handling. Why?
- What are some use cases for pushing data?
- What does the full-stack architecture look like when the database has push?
- What did you learn from the…

Continue reading…

MemSQL with Nikita Shamgunov

http://traffic.libsyn.com/sedaily/memsql_nikita_2.mp3

MemSQL is a high-performance, in-memory database that combines the horizontal scalability of distributed systems with the familiarity of SQL. Nikita Shamgunov is co-founder and CTO of MemSQL.

Questions:
- What types of data does a user want to keep on disk versus in an in-memory database?
- How does MemSQL compare to MySQL?
- How do MemSQL users leverage Apache Spark?
- How does a user onboard with…

Continue reading…

Facebook Presto with Christopher Berner

http://traffic.libsyn.com/sedaily/presto_chris.mp3

Presto is a low-latency SQL query engine built for interactive analysis. Christopher Berner works on Presto at Facebook.

Questions:
- Is Presto for data scientists, developers, or everyone?
- What are the problems with Hive?
- How does Hive break a query into MapReduce jobs?
- How do the clients, coordinators, and workers interact?
- Is Presto both fast and cheap?
- How does Presto tune Java to get speed…

Continue reading…

Hortonworks Data Platform with Venkatesh Seetharam

http://traffic.libsyn.com/sedaily/venkatesth_hortonworks.mp3

Hortonworks Data Platform is a managed Hadoop architecture for enterprises. Venkatesh Seetharam is a software engineer at Hortonworks. He has worked on several Apache projects, including Hadoop, Falcon, and Atlas.

Questions include:
- Will Hadoop ever be so big we will have to start over from scratch?
- What is the YARN data operating system?
- How are customers of Hortonworks dealing with numerous managed Big Data…

Continue reading…

Apache Kafka with Guozhang Wang

http://traffic.libsyn.com/sedaily/guozhang_kafka.mp3

Apache Kafka is a publish-subscribe messaging system rethought as a distributed commit log. Kafka serves as the central repository for data streams in a distributed system. Guozhang Wang is an engineer at Confluent, which offers a stream data platform built using Kafka.

Questions include:
- What is a central repository for data streams?
- How does Kafka improve transportation between systems?
- How does Kafka allow for richer…

Continue reading…

Apache Spark Creator Matei Zaharia Interview

http://traffic.libsyn.com/sedaily/matei_spark.mp3

Apache Spark is a fast and general engine for big data processing. Matei Zaharia created Spark, and is the co-founder of Databricks, a company using Spark to power data science.

Questions:
- What was the motivation behind creating Spark?
- How much faster is a Spark job than a Hadoop job?
- What is the relationship between streaming and batch processing?
- Is Spark’s core advantage over Storm…

Continue reading…