Tag: Data Engineering

Kafka Streams with Jay Kreps

http://traffic.libsyn.com/sedaily/kafka_streams_edited.mp3

Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics. In a time when there are numerous streaming frameworks already out there, why do we need yet another? To quote today’s guest, Jay Kreps: “the gap we see Kafka Streams filling is less the analytics-focused domain these frameworks focus on and more building core applications

Continue reading…

Cloud Dataflow with Eric Anderson

http://traffic.libsyn.com/sedaily/Google_Cloud_Edited.mp3

Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large-volume data processing have become more sophisticated as the industry and open source communities have iterated on them. Dataflow and Apache Beam are projects that present a unified batch and stream processing system. A previous episode with Frances

Continue reading…

Apache Beam with Frances Perry

http://traffic.libsyn.com/sedaily/Apache_Beam__Edited.mp3

Unbounded data streams create difficult challenges for our application architectures. The data never stops coming, and we are forced to assume that we will never know if or when we have seen all of our data. Some streaming systems give us the tools to deal partially with unbounded data streams, but we have to complement those streaming systems with batch processing, in a

Continue reading…

Stream Processing at Uber with Danny Yuan

http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3

“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of temporal and spatial data, constantly being collected from the devices of its users. At any given time, the engineers and data scientists at Uber need to be able to query the system and understand what is going on with drivers and riders. The unique real-time engineering

Continue reading…

Alluxio and Memory-centric Distributed Storage with Haoyuan Li

http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3

“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’” Memory is king. The cost of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio

Continue reading…

Building Software for Millennials with Anthony Sessa

http://traffic.libsyn.com/sedaily/Mic_Edited_2.mp3

“Millennials deeply care about software, in the sense where if something doesn’t work as it should, it’s forgotten immediately – if you build an app and there are bugs, you’re done.” Mic.com is a media company focused on news for millennials. Anthony Sessa is the VP of product at Mic.com, and he joins us to talk about the engineering of a modern news

Continue reading…

Hadoop: Past, Present and Future with Mike Cafarella

http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3

“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,

Continue reading…

Data Engineering at Airbnb with Maxime Beauchemin

http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3

“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where

Continue reading…

Apache Kafka’s Uses and Target Market

From Nicolae Marasoiu’s answer via Quora: Kafka is a high-performance messaging system which provides an immutable, linearizable, sharded log of messages. Throughput and storage capacity scale linearly with nodes. Kafka can push astonishingly high volume through each node, often saturating disk, network, or both, while keeping CPU utilization low. You would use Kafka in scenarios of asynchronous communication and processing pipelines, predominantly in distributed systems, cloud & big data,

Continue reading…

Deep Learning and Keras with François Chollet

“I definitely think we can try to abstract away the first principles of intelligence and then try to go from these principles to an intelligent machine that might look nothing like the brain.”

Continue reading…

Mesosphere and Tech Journalism with Derrick Harris

“The business of technology and the technology of technology are kind of converging if you ask me. And there is definitely a space for some publications that don’t have decades of technical debt in the software space.”

Continue reading…

Spark in Practice with Holden Karau

“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”

Continue reading…

Machine Learning for Businesses with Joshua Bloom

“You’ve got software engineers who are interested in machine learning, and think what they need to do is just bring in another module and then that will solve their problem. It’s particularly important for those people to understand that this is a different type of beast.”

Continue reading…

Data Science with Srini Kadamati

“I really think that data science is like design in the sense that it’s a way of thinking.”

Continue reading…

Architecting Distributed Databases with Fangjin Yang

“The more you’re comfortable with this idea that everything is going to fail, the more you realize that it’s a natural process of distributed systems, and it helps you write and architect better code.”

Continue reading…

Distributed Systems with Alvaro Videla

“Every vendor will advertise that their system is better – that’s nice, I understand you need to sell your thing, but what am I gaining as a user and what am I sacrificing as a user by choosing your product?”

Continue reading…

Mesos and Docker in Practice with Michael Hausenblas

http://traffic.libsyn.com/sedaily/Mesos_Edited_2.mp3

Apache Mesos is an open-source cluster manager that enables resource sharing in a fine-grained manner, improving cluster utilization. Michael Hausenblas is a developer and cloud advocate with Mesosphere, which builds the Datacenter Operating System (DCOS), a distributed OS that uses Apache Mesos as its kernel.

Questions

Can you give the historical context for cluster computing?

How are the distributed systems needs of different

Continue reading…

Demystifying Stream Processing with Neha Narkhede

“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”

Continue reading…

TensorFlow with Greg Corrado

“You don’t mind if failures slow things down, but it’s very important that failures do not stop forward progress.”

Continue reading…