http://traffic.libsyn.com/sedaily/data-warehousing_edited.mp3
In the mid 90s, data warehousing might have meant “using an Oracle database.” Today, it means a wide variety of things. You could be stitching together a big data pipeline using Kafka, Hadoop, and Spark. You could be using managed tools like BigQuery from Google. How did we get from the simple days of Oracle databases to the wealth of options available today?
http://traffic.libsyn.com/sedaily/Troy_Hunt_Edited_2.mp3
When you hear about massive data breaches like the recent ones from LinkedIn, MySpace, or Ashley Madison, how can you find out whether your own data was compromised? Troy Hunt created the website HaveIBeenPwned.com to answer this question. When a major data breach occurs, Troy acquires a copy of the stolen data and provides a safe way for individuals to check if
http://traffic.libsyn.com/sedaily/arrow_edited_fixed.mp3
In a typical data analytics system, there are a variety of technologies interacting. HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in Python–each of these different technologies has a different format for how data is represented. Serialization and deserialization between these different formats causes significant latency across the overall system. Apache Arrow is a tool for improving
http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3
“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of spatiotemporal data, constantly being collected from the devices of its users. At any given time, the engineers and data scientists at Uber need to be able to query the system and understand what is going on with drivers and riders. The unique real-time engineering
http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3
“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’ ” Memory is king. The costs of memory and disk capacity are both decreasing every year–but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio
http://traffic.libsyn.com/sedaily/Mapping_Edited.mp3
“I’m always worried that if you teach too much magic, people don’t learn the basics – they don’t know why something is working, they just know the documentation said it should work that way.” On Software Engineering Daily, we often discuss big data in terms of data engineering and data science. Data engineering is the infrastructure and pipelines that handle massive amounts of
http://traffic.libsyn.com/sedaily/Filodb_Edited.mp3
“The world is becoming more and more interactive, and people want answers right away, so you’re seeing the rise of stream processing and real-time.” Big data is yesterday–fast data is now. FiloDB is a reactive columnar OLAP database that is built on Cassandra and Spark. Today’s guest is Evan Chan, creator of FiloDB. In our discussion today, we talk about the use cases
http://traffic.libsyn.com/sedaily/Cassandra_Edited.mp3
“There isn’t any central node in Cassandra. Every node is a peer, there is no master – there is no single point of failure.” Apache Cassandra can serve as both the real-time data store for online transactional applications and the read-intensive database for data warehousing operations. In order to combine these two use cases into a single database, Apache Cassandra required
From Eric Sammer’s answer via Quora: At Cloudera we regularly work on open source code right alongside our competitors. I tend to joke that the engineers at our competitors are effectively our coworkers. Since the question specifically asks about how one deals with code (rather than the working relationship), I’ll focus on that. Honestly though, that’s probably the less interesting part. In (almost) all ways, our engineers shed their
http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3
“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,
http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3
“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where
From Chris Schrader’s answer via Quora: Someone could write a 5,000-page book on this subject, but I’ll do my best at a high level.
SQL Databases
I break these down into three basic groups (traditional, MPP, and columnar), plus an emerging technology called NewSQL.
Traditional
These are the usual databases that we’ve seen for years. Some vendors might include MySQL, PostgreSQL, SQL Server, Sybase, Oracle Database, etc. They comply with
“Benchmarks are all crap, but there are some benchmarks that are better than others.”
“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”
“It’s a great world to be a developer in. If you’re a developer today, you have so many tools and so many options.”
“I really think that data science is like design in the sense that it’s a way of thinking.”
“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”
“My bet is that there is going to be a big shift towards streaming technologies in the future.”
Apache Flink is an open-source framework for distributed stream and batch data processing.
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”
“Sometimes there’s a misconception that Genie is a job scheduling platform… Genie really represents our extraction layer, from what our computational resources are, to our end user jobs.”
Genie is an open-source tool that provides job and resource management for the Hadoop ecosystem in the cloud.