Tag: Data Engineering

Alerting and Metrics with Clement Pang

http://traffic.libsyn.com/sedaily/ClementPang.mp3

An alert is a signal of problematic application behavior. When something unusual happens to your application, an alert can bring that anomaly to your attention. In order to detect unusual events, you need to define the norm. In order to define both normal and problematic behavior, you need metrics. Metrics are measurements of the behavior in your application. Metrics get created from logs

Continue reading…
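
The idea in the excerpt above (metrics define the norm, and an alert fires when a measurement departs from it) can be sketched in a few lines. This is only an illustrative sketch, not the approach discussed in the episode: the function name `is_anomalous`, the sample latencies, and the three-standard-deviation threshold are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Return True when `value` is more than `threshold` standard
    deviations away from the mean of `history` (the assumed "norm")."""
    if len(history) < 2:
        # Not enough data to define a norm, so never alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical request latencies in milliseconds.
latencies_ms = [102, 98, 101, 99, 100, 103, 97]

print(is_anomalous(latencies_ms, 100))  # a typical measurement
print(is_anomalous(latencies_ms, 250))  # a large spike
```

A fixed threshold on a simple mean is the crudest possible definition of "normal"; real alerting systems usually account for seasonality and trends as well.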

Antifraud Architecture with Josh Yudaken

http://traffic.libsyn.com/sedaily/antifraud_architecture_edited.mp3

Online marketplaces and social networks often have a trust and safety team. The trust and safety team helps protect the platform from scams, fraud, and malicious actors. Detecting these bad actors at scale requires building a system that classifies every transaction on the platform as safe or potentially malicious. Since every social platform has to build something like this, Smyte decided to

Continue reading…

Data Engineering with Pete Soderling

http://traffic.libsyn.com/sedaily/hakkalabs_edited.mp3

In the last five years, companies started hiring data engineers. A data engineer creates the systems that manage and access the huge volumes of data that are accumulating on cheap cloud servers. As the saying goes, “it’s more expensive to throw out the data than to store it.” Pete Soderling joins the show to discuss the rise of the data engineer, and how

Continue reading…

PANCAKE STACK Data Engineering with Chris Fregly

http://traffic.libsyn.com/sedaily/pancakestack_edited_fixed.mp3

Data engineering is the software engineering that enables data scientists to work effectively. In today’s episode, we explore the different sides of data engineering: the data science algorithms that need to be processed and the implementation of software architectures that enable those algorithms to run smoothly. The PANCAKE STACK is a 12-letter acronym that Chris Fregly gave to a collection of data engineering technologies

Continue reading…

Stream Processing at Uber with Danny Yuan

http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3

“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of spatiotemporal data, constantly being collected from the devices of its users. At any given time, the engineers and data scientists at Uber need to be able to query the system and understand what is going on with drivers and riders. The unique real-time engineering

Continue reading…

Alluxio and Memory-centric Distributed Storage with Haoyuan Li

http://traffic.libsyn.com/sedaily/Alluxio_Edited.mp3

“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’” Memory is king. The costs of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio

Continue reading…

Building Software for Millennials with Anthony Sessa

http://traffic.libsyn.com/sedaily/Mic_Edited_2.mp3

“Millennials deeply care about software, in the sense where if something doesn’t work as it should, it’s forgotten immediately – if you build an app and there are bugs, you’re done.” Mic.com is a media company focused on news for millennials. Anthony Sessa is the VP of product at Mic.com, and he joins us to talk about the engineering of a modern news

Continue reading…

Hadoop: Past, Present and Future with Mike Cafarella

http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3

“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,

Continue reading…

Data Engineering at Airbnb with Maxime Beauchemin

http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3

“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where

Continue reading…

Apache Kafka’s Uses and Target Market

From Nicolae Marasoiu’s answer via Quora: Kafka is a high-performance messaging system which provides an immutable, linearizable, sharded log of messages. Throughput and storage capacity scale linearly with nodes. Kafka can push astonishingly high volume through each node, often saturating disk, network, or both, while keeping CPU utilization low. You would use Kafka in scenarios of asynchronous communication and processing pipelines, predominantly in distributed systems, cloud & big data,

Continue reading…
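
To make the phrase "immutable, sharded log of messages" from the excerpt above concrete, here is a toy in-memory sketch. It is not Kafka's implementation or its client API: the class `ShardedLog`, its methods, and the use of CRC32 for key hashing are all hypothetical stand-ins chosen for illustration. It shows only the two properties named in the excerpt: messages are appended and never modified, and a key determines which shard (partition) a message lands in.

```python
import zlib


class ShardedLog:
    """Toy append-only log split into a fixed number of partitions."""

    def __init__(self, num_partitions):
        self._partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        """Append `message` to the partition chosen by hashing `key`.

        Returns (partition, offset). Existing entries are never changed,
        so an offset always refers to the same message.
        """
        # CRC32 is an assumed stand-in for a real partitioner's hash.
        partition = zlib.crc32(key.encode("utf-8")) % len(self._partitions)
        log = self._partitions[partition]
        log.append(message)
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return a copy of all messages in `partition` from `offset` on."""
        return self._partitions[partition][offset:]


log = ShardedLog(num_partitions=4)
partition, first_offset = log.append("user-1", "signed_up")
log.append("user-1", "logged_in")

# Messages with the same key share a partition and keep their order.
print(log.read(partition, first_offset))
```

Because each partition is an independent list, adding partitions (and, in a real system, nodes to host them) is what lets throughput and storage grow with the size of the cluster.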

Deep Learning and Keras with François Chollet

“I definitely think we can try to abstract away the first principles of intelligence and then try to go from these principles to an intelligent machine that might look nothing like the brain.”

Continue reading…

Mesosphere and Tech Journalism with Derrick Harris

“The business of technology and the technology of technology are kind of converging if you ask me. And there is definitely a space for some publications that don’t have decades of technical debt in the software space.”

Continue reading…

Spark in Practice with Holden Karau

“I found Spark and I was really excited because I’m a functional programming nerd, and it was written in Scala.”

Continue reading…

Machine Learning for Businesses with Joshua Bloom

“You’ve got software engineers who are interested in machine learning, and think what they need to do is just bring in another module and then that will solve their problem. It’s particularly important for those people to understand that this is a different type of beast.”

Continue reading…

Data Science with Srini Kadamati

“I really think that data science is like design in the sense that it’s a way of thinking.”

Continue reading…

Architecting Distributed Databases with Fangjin Yang

“The more you’re comfortable with this idea that everything is going to fail, the more you realize that it’s a natural process of distributed systems, and it helps you write and architect better code.”

Continue reading…

Distributed Systems with Alvaro Videla

“Every vendor will advertise that their system is better – that’s nice, I understand you need to sell your thing, but what am I gaining as a user and what am I sacrificing as a user by choosing your product?”

Continue reading…

Mesos and Docker in Practice with Michael Hausenblas

http://traffic.libsyn.com/sedaily/Mesos_Edited_2.mp3

Apache Mesos is an open-source cluster manager that enables resource sharing in a fine-grained manner, improving cluster utilization. Michael Hausenblas is a developer and cloud advocate with Mesosphere, which builds the Datacenter Operating System (DCOS), a distributed OS that uses Apache Mesos as its kernel.

Questions: Can you give the historical context for cluster computing? How are the distributed systems needs of different

Continue reading…

Demystifying Stream Processing with Neha Narkhede

“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”

Continue reading…
