http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3
“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of spatiotemporal data, constantly collected from its users’ devices. At any given time, the engineers and data scientists at Uber need to be able to query the system and understand what is going on with drivers and riders. The unique real-time engineering
http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3
“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’” Memory is king. The costs of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio
http://traffic.libsyn.com/sedaily/Mapping_Edited.mp3
“I’m always worried that if you teach too much magic, people don’t learn the basics – they don’t know why something is working, they just know the documentation said it should work that way.” On Software Engineering Daily, we often discuss big data in terms of data engineering and data science. Data engineering is the infrastructure and pipelines that handle massive amounts of
http://traffic.libsyn.com/sedaily/Filodb_Edited.mp3
“The world is becoming more and more interactive, and people want answers right away, so you’re seeing the rise of stream processing and real-time.” Big data is yesterday; fast data is now. FiloDB is a reactive columnar OLAP database that is built on Cassandra and Spark. Today’s guest is Evan Chan, creator of FiloDB. In our discussion today, we talk about the use cases
http://traffic.libsyn.com/sedaily/Cassandra_Edited.mp3
“There isn’t any central node in Cassandra. Every node is a peer, there is no master – there is no single point of failure.” Apache Cassandra can serve as both the real-time data store for online transactional applications and the read-intensive database for data warehousing operations. In order to combine these two use cases into a single database, Apache Cassandra required
From Eric Sammer’s answer via Quora: At Cloudera we regularly work on open source code right alongside our competitors. I tend to joke that the engineers at our competitors are effectively our coworkers. Since the question specifically asks about how one deals with code (rather than the working relationship), I’ll focus on that. Honestly though, that’s probably the less interesting part. In (almost) all ways, our engineers shed their
http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3
“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,
http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3
“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where
From Chris Schrader’s answer via Quora: Someone could write a 5000-page book on this subject, but I’ll do my best at a high level.
SQL Databases
I break these down into three basic groups (traditional, MPP, and columnar), plus an emerging technology called NewSQL.
Traditional
These are the usual databases that we’ve seen for years. Some vendors might include MySQL, PostgreSQL, SQL Server, Sybase, Oracle Database, etc. They comply with
“Benchmarks are all crap, but there are some benchmarks that are better than others.”
“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”
“It’s a great world to be a developer in. If you’re a developer today, you have so many tools and so many options.”
“I really think that data science is like design in the sense that it’s a way of thinking.”
“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”
“My bet is that there is going to be a big shift towards streaming technologies in the future.”
Apache Flink is an open-source framework for distributed stream and batch data processing.
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.”
“Sometimes there’s a misconception that Genie is a job scheduling platform… Genie really represents our extraction layer, from what our computational resources are, to our end user jobs.”
Genie is an open-source tool that provides job and resource management for the Hadoop ecosystem in the cloud.
Data science competitions are an effective way to crowdsource the best solutions for challenging datasets. Kaggle is a platform for data scientists to collaborate and compete on machine learning problems with the opportunity to win money from the competitions’ sponsors.
“A lot of data science teams – if you ask them what their ten most important questions are… a lot of people can’t even come up with those.”
Many companies find themselves drowning in data. The quantity of data matters far less than the right questions in the pursuit of actionable insights.
Data science is a broad topic with numerous subfields such as data engineering and machine learning. Yad Faeq returns to the podcast to discuss data science at a high level, and rescue Software Engineering Daily from the threat of the hype vortex.