Tag: Data Engineering

Machine Learning and Technical Debt with D. Sculley (Holiday Repeat)

http://traffic.libsyn.com/sedaily/ml_techdebt_ad_free.mp3

Originally published November 17, 2015

“Changing anything changes everything.” Technical debt, referring to the compounding cost of changes to software architecture, can be especially challenging in machine learning systems. D. Sculley is a software engineer at Google, focusing on machine learning, data mining, and information retrieval. He recently co-authored the paper “Machine Learning: The High Interest Credit Card of Technical Debt.” Questions How do

Continue reading…

Thumbtack Infrastructure with Nate Kupp

http://traffic.libsyn.com/sedaily/ThumbtackInfrastructure.mp3

Thumbtack is a marketplace for real-world services. On Thumbtack, people get their house painted, their dog walked, and their furniture assembled. With 40,000 daily marketplace transactions, the company handles significant traffic. On yesterday’s episode, we explored how one aspect of Thumbtack’s marketplace recently changed, going from asynchronous matching to synchronous “instant” matching. In this episode, we zoom out to the larger architecture of

Continue reading…

High Volume Event Processing with John-Daniel Trask

http://traffic.libsyn.com/sedaily/HighVolumeEventProcessing.mp3

A popular software application serves billions of user requests. These requests could be for many different things. These requests need to be routed to the correct destination, load balanced across different instances of a service, and queued for processing. Processing a request might require generating a detailed response to the user, or making a write to a database, or the creation of a

Continue reading…

Fiverr Engineering with Gil Sheinfeld

http://traffic.libsyn.com/sedaily/FiverrEngineering.mp3

As the gig economy grows, that growth necessitates innovations in the online infrastructure powering these new labor markets. In our previous episodes about Uber, we explored the systems that balance server load and gather geospatial data. In our coverage of Lyft, we studied Envoy, the service proxy that standardizes communications and load balancing among services. In shows about Airbnb, we talked about the

Continue reading…

BigQuery with Jordan Tigani

http://traffic.libsyn.com/sedaily/BigQuery.mp3

Large-scale data analysis was pioneered by Google, with the MapReduce paper. Since then, Google’s approach to analytics has evolved rapidly, marked by papers such as Dataflow and Dremel. Dremel combined a column-oriented, distributed file system with a novel way of processing queries. A single Dremel query is distributed into a tree of servers, starting with the root server, splitting into the intermediate servers,

Continue reading…

Kafka at NY Times with Boerge Svingen

http://traffic.libsyn.com/sedaily/KafkaatNYT.mp3

The New York Times is a newspaper that evolved into a digital publication. Across its 166-year history, The Times has been known for longform journalistic quality, in addition to its ability to quickly churn out news stories. Some content on the New York Times is old but timeless “evergreen” content. Readers of the New York Times website are not only looking for the

Continue reading…

Dremio with Tomer Shiran

http://traffic.libsyn.com/sedaily/Dremio.mp3

The MapReduce paper was published by Google in 2004. MapReduce is an algorithm that describes how to do large-scale data processing on large clusters of commodity hardware. The MapReduce paper marked the beginning of the “big data” movement. The Hadoop project is an open source implementation of the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce,

Continue reading…

Internet Monitoring with Matt Kraning

http://traffic.libsyn.com/sedaily/InternetMonitoring.mp3

How would you build a system for indexing and monitoring the entire Internet? Start by breaking the Internet up into IP address ranges. Give each of those address ranges to servers distributed around the world. On each of those servers, iterate through your list of IP addresses, sending packets to them. Depending on what sorts of packets those IP addresses respond to, and

Continue reading…

Alerting and Metrics with Clement Pang

http://traffic.libsyn.com/sedaily/ClementPang.mp3

An alert is a signal of problematic application behavior. When something unusual happens to your application, an alert can bring that anomaly to your attention. In order to detect unusual events, you need to define the norm. In order to define both normal and problematic behavior, you need metrics. Metrics are measurements of the behavior in your application. Metrics get created from logs

Continue reading…

Dashboarding and Query Latency with Tom O’Neill

http://traffic.libsyn.com/sedaily/PeriscopeData.mp3

A dashboard is a data visualization that aggregates metrics in a way that we can quickly understand. In a modern software company, everyone uses dashboards, from salespeople to DevOps to HR. Each dashboard represents a query that must be updated frequently, so that anyone looking at it is getting up-to-date information. The data set being queried might be getting updated quickly in the case

Continue reading…

Tinder Growth Engineering with Alex Ross

http://traffic.libsyn.com/sedaily/TinderGrowthEngineering.mp3

Tinder is a popular dating app where each user swipes through a sequence of other users in order to find a match. Swiping left means you are not interested. Swiping right means you would like to connect with the person. The simple premise of Tinder has led to massive growth, and the app is now also used to discover new friends and create

Continue reading…

Spotify Event Delivery with Igor Maravic

http://traffic.libsyn.com/sedaily/SpotifyEventDelivery.mp3

Spotify is a streaming music company with more than 50 million users. Whenever a user listens to a song, Spotify records that event and uses it as input to learn more about the user’s preferences. Listening to a song is one type of event; there are hundreds of others. Opening the Spotify app, skipping a song, sharing a playlist with a friend: all of these

Continue reading…

Data Intensive Applications with Martin Kleppmann

http://traffic.libsyn.com/sedaily/dataintensive_edited_fixed.mp3

A new programmer learns to build applications using data structures like a queue, a cache, or a database. Modern cloud applications are built using more sophisticated tools like Redis, Kafka, or Amazon S3. These tools do multiple things well, and often have overlapping functionality. Application architecture becomes less straightforward. The applications we are building today are data-intensive rather than compute-intensive. Netflix needs to

Continue reading…

Data Warehousing with Mark Rittman

http://traffic.libsyn.com/sedaily/data-warehousing_edited.mp3

In the mid-90s, data warehousing might have meant “using an Oracle database.” Today, it means a wide variety of things. You could be stitching together a big data pipeline using Kafka, Hadoop, and Spark. You could be using managed tools like BigQuery from Google. How did we get from the simple days of Oracle databases to the wealth of options available today?

Continue reading…

Giphy Engineering with Anthony Johnson

http://traffic.libsyn.com/sedaily/giphy_edited.mp3

Giphy is a search engine for GIFs, the short animated graphics that we see around the Internet. Giphy is also a creative platform where people create new GIFs. Every search engine requires the construction of a search index, which is a data structure that responds to search queries efficiently. Since Giphy is a search engine for graphics, there is almost no text inherently

Continue reading…

Columnar Data: Apache Arrow and Parquet with Julien Le Dem and Jacques Nadeau

http://traffic.libsyn.com/sedaily/columnardata_edited_fixed.mp3

Column-oriented data storage allows us to access all of the entries in a database column quickly and efficiently. Columnar storage formats are mostly relevant today for performing large analytics jobs. For example, if you are a bank, and you want to get the sum of all of the financial transactions that took place on your system in the last week, you don’t want

Continue reading…

Antifraud Architecture with Josh Yudaken

http://traffic.libsyn.com/sedaily/antifraud_architecture_edited.mp3

Online marketplaces and social networks often have a trust and safety team. The trust and safety team helps protect the platform from scams, fraud, and malicious actors. Detecting these bad actors at scale requires building a system that classifies every transaction on the platform as safe or potentially malicious. Since every social platform has to build something like this, Smyte decided to

Continue reading…

Data Engineering with Pete Soderling

http://traffic.libsyn.com/sedaily/hakkalabs_edited.mp3

In the last five years, companies started hiring data engineers. A data engineer creates the systems that manage and access the huge volumes of data that are accumulating on cheap cloud servers. As the saying goes, “it’s more expensive to throw out the data than to store it.” Pete Soderling joins the show to discuss the rise of the data engineer, and how

Continue reading…

Fraud Prevention with Pete Hunt

http://traffic.libsyn.com/sedaily/antifraud_edited.mp3

When Facebook acquired Instagram, one of the first systems Instagram plugged into was Facebook’s internal spam and fraud prevention system. Pete Hunt was the first Facebook engineer to join the Instagram team. When he joined, the big problems at Instagram were around fake accounts, harassment, and large volumes of spammy comments. After seeing the internal Facebook spam prevention tools clean up Instagram, Pete

Continue reading…

PANCAKE STACK Data Engineering with Chris Fregly

http://traffic.libsyn.com/sedaily/pancakestack_edited_fixed.mp3

Data engineering is the software engineering that enables data scientists to work effectively. In today’s episode, we explore the different sides of data engineering: the data science algorithms that need to be processed and the implementation of software architectures that enable those algorithms to run smoothly. The PANCAKE STACK is a 12-letter acronym that Chris Fregly gave to a collection of data engineering technologies

Continue reading…