Category: Data

SafeGraph with Auren Hoffman

http://traffic.libsyn.com/sedaily/2018_04_17_MLDatawithAurenHoffman.mp3

Machine learning tools are rapidly maturing. TensorFlow gave developers an open source version of Google’s internal machine learning framework. Cloud computing provides a cost-effective, accessible way of training models. Edge computing allows for low-latency deployments of models. But even if you are a kid with a laptop who has learned all the machine learning algorithms, read all of the deep learning

Continue reading…

Streamr: Data Streaming Marketplace with Henri Pihkala

http://traffic.libsyn.com/sedaily/2018_03_22_Streamr.mp3

Data streams about the weather can be used to predict how soybean futures are going to change in price. Satellite data streams can take pictures of the number of cars on the road, and judge how traffic patterns are changing. Search engines can aggregate data from different queries and determine what people are most interested in. Data streams define how the world is

Continue reading…

Smart Agriculture with Mike Prorock

http://traffic.libsyn.com/sedaily/2018_03_05_SmartAgriculture.mp3

Farms have lots of data. A corn farmer needs to monitor the chemical composition of soil. A soybean farmer needs to track crop yield. A chicken farmer needs to count the number of eggs produced. If this data is captured, it can be acted upon: for example, a dry farm can automatically turn up its irrigation system. Or the data can simply be studied.

Continue reading…

Spark and Streaming with Matei Zaharia

http://traffic.libsyn.com/sedaily/2018_02_26_SparkDelta.mp3

Apache Spark is a system for processing large data sets in parallel. The core abstraction of Spark is the resilient distributed dataset (RDD), a working set of data that sits in memory for fast, iterative processing. Matei Zaharia created Spark with two goals: to provide a composable, high-level set of APIs for performing distributed processing; and to provide a unified engine for running

Continue reading…

Kafka Design Patterns with Gwen Shapira

http://traffic.libsyn.com/sedaily/2018_02_20_GwenShapiro.mp3

Kafka is at the center of modern streaming systems. Kafka serves as a database, a pubsub system, a buffer, and a data recovery tool. It’s an extremely flexible tool, and that flexibility has led to its use as a platform for a wide variety of data-intensive applications. Today’s guest is Gwen Shapira, a product manager at Confluent. Confluent is a company that

Continue reading…

Streaming Architecture with Ted Dunning

http://traffic.libsyn.com/sedaily/2018_02_19_TedDunning.mp3

Streaming architecture defines how large volumes of data make their way through an organization. Data is created at a user’s smartphone, or on a sensor inside of a conveyor belt at a factory. That data is sent to a set of backend services that aggregate the data, organizing it and making it available to business analysts, application developers, and machine learning algorithms. The

Continue reading…

Streaming Analytics with Scott Kidder

http://traffic.libsyn.com/sedaily/2018_02_16_FlinkandVideo.mp3

When you go to a website where a video is playing, and your video lags, how does the website know that you are having a bad experience? Problems with video are often not complete failures–maybe part of the video loads, and plays just fine, and then the rest of the video is buffering. You have probably experienced sitting in front of a video,

Continue reading…

Streaming Architecture with Tugdual Grall

http://traffic.libsyn.com/sedaily/2018_02_15_TugdualGraal.mp3

At a big enough scale, every software product produces lots of data. Whether you are building an advertising technology company, a social network, or a system for IoT devices, you have thousands of events coming in at a fast pace that you want to aggregate, study and act upon. For the last decade, engineers have been learning to store and process these vast

Continue reading…

Spring Data with John Blum

http://traffic.libsyn.com/sedaily/SpringData.mp3

In the 1980s and the 1990s, most applications used only a relational database for their data management. In the early 2000s, software projects started to use an ever-increasing number of data sources. MongoDB popularized the document database, which allows storage of objects that do not have a consistent schema. The Hadoop distributed file system enabled the redundant storage and efficient querying of

Continue reading…

Protocol Buffers with Kenton Varda

http://traffic.libsyn.com/sedaily/ProtocolBuffers.mp3

When engineers are writing code, they are manipulating objects. You might have a user object represented on your computer, and that user object has several different fields: a name, a gender, and an age. When you want to send that object across the network to a different computer, the object needs to be turned into a sequence of 1s and 0s that will travel

Continue reading…
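The Protocol Buffers excerpt above describes turning a user object (name, gender, age) into bytes for the network. As a rough illustration only, here is a hand-rolled encoding using Python's standard `struct` module. It is not the Protocol Buffers wire format; the function names `encode_user` and `decode_user` and the field layout (16-bit length prefixes, a one-byte age) are assumptions made for this sketch.

```python
import struct


def encode_user(name: str, gender: str, age: int) -> bytes:
    """Pack a user into bytes: two length-prefixed UTF-8 strings, then a 1-byte age."""
    name_b = name.encode("utf-8")
    gender_b = gender.encode("utf-8")
    # ">H" is a big-endian unsigned 16-bit length; ">B" is an unsigned 8-bit age (0-255).
    return (
        struct.pack(">H", len(name_b)) + name_b
        + struct.pack(">H", len(gender_b)) + gender_b
        + struct.pack(">B", age)
    )


def decode_user(data: bytes):
    """Reverse encode_user: read each length prefix, then that many bytes."""
    offset = 0
    fields = []
    for _ in range(2):  # name, then gender
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2
        fields.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    (age,) = struct.unpack_from(">B", data, offset)
    return fields[0], fields[1], age
```

Both sides must agree on the field order and sizes in advance, which is the kind of contract a schema language is meant to make explicit.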

Data Science Mindset with Zacharias Voulgaris

http://traffic.libsyn.com/sedaily/DataScienceMindset.mp3

A company’s approach to data can make or break the business. In the past, data was static. There was not much data, it sat in Excel, and it was interacted with on a nightly or monthly basis. Now, data is dynamic, real-time and huge. To tap into available data, many industries have oriented themselves to becoming data-intensive. With many new industry

Continue reading…

BigQuery with Jordan Tigani

http://traffic.libsyn.com/sedaily/BigQuery.mp3

Large-scale data analysis was pioneered by Google, with the MapReduce paper. Since then, Google’s approach to analytics has evolved rapidly, marked by papers such as Dataflow and Dremel. Dremel combined a column-oriented, distributed file system with a novel way of processing queries. A single Dremel query is distributed into a tree of servers, starting with the root server, splitting into the intermediate servers,

Continue reading…

Kafka at NY Times with Boerge Svingen

http://traffic.libsyn.com/sedaily/KafkaatNYT.mp3

The New York Times is a newspaper that evolved into a digital publication. Across its 166-year history, The Times has been known for longform journalistic quality, in addition to its ability to quickly churn out news stories. Some content on the New York Times is old but timeless “evergreen” content. Readers of the New York Times website are not only looking for the

Continue reading…

Dremio with Tomer Shiran

http://traffic.libsyn.com/sedaily/Dremio.mp3

The MapReduce paper was published by Google in 2004. MapReduce is an algorithm that describes how to do large-scale data processing on large clusters of commodity hardware. The MapReduce paper marked the beginning of the “big data” movement. The Hadoop project is an open source implementation of the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce,

Continue reading…

Alerting and Metrics with Clement Pang

http://traffic.libsyn.com/sedaily/ClementPang.mp3

An alert is a signal of problematic application behavior. When something unusual happens to your application, an alert can bring that anomaly to your attention. In order to detect unusual events, you need to define the norm. In order to define both normal and problematic behavior, you need metrics. Metrics are measurements of the behavior in your application. Metrics get created from logs

Continue reading…
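The alerting excerpt above says that detecting unusual events requires defining the norm from metrics. One simple way to sketch that idea is a z-score check against recent history. The function name `is_anomalous`, the threshold of 3 standard deviations, and the sample latency values are all assumptions for illustration, not something the episode prescribes.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Return True if value is more than `threshold` sample standard
    deviations away from the mean of `history` (needs >= 2 points)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical request latencies in milliseconds.
recent_latencies = [100, 102, 98, 101, 99]
should_alert = is_anomalous(recent_latencies, 150)
```

Here the history plays the role of "the norm", and the alert fires only when a new measurement falls far outside it.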

Dashboarding and Query Latency with Tom O’Neill

http://traffic.libsyn.com/sedaily/PeriscopeData.mp3

A dashboard is a data visualization that aggregates metrics in a way that we can quickly understand. In a modern software company, everyone uses dashboards–from salespeople to DevOps to HR. Each dashboard represents a query that must be updated frequently, so that anyone looking at it is getting up-to-date information. The data set being queried might be getting updated quickly in the case

Continue reading…

Sales Software with Jean-Baptiste Escoyez

http://traffic.libsyn.com/sedaily/SalesSoftware.mp3

Most products do not sell themselves. Salespeople bridge the gap between a product’s creation and a customer who purchases it. People can make a good living on the internet selling niche products–if they can find their customers. The process of taking a large group of potential customers and narrowing it down to only the subset of those customers who will buy your product

Continue reading…

CosmosDB with Andrew Hoh

http://traffic.libsyn.com/sedaily/cosmosdb_edited.mp3

Different databases have different access patterns. Key-value, document, graph, and columnar databases are useful under different circumstances. For example, if you are a bank, and you have a database of customers and the transactions they have performed, the ideal access pattern for aggregating the total amount of all transactions might be a columnar store. If the transaction amounts are all in one column,

Continue reading…
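The CosmosDB excerpt above uses a bank's transactions to motivate a columnar store. A minimal sketch of the difference, using plain Python lists and dicts rather than any real database: the sample data and the helper names `to_columns`, `total_row_store`, and `total_column_store` are hypothetical.

```python
# Row-oriented layout: each transaction is stored as one record.
rows = [
    {"customer": "alice", "amount": 120.0},
    {"customer": "bob", "amount": 35.5},
    {"customer": "alice", "amount": 44.5},
]


def to_columns(records):
    """Pivot a list of records into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def total_row_store(records):
    # Must visit every record and pick out one field from each.
    return sum(record["amount"] for record in records)


def total_column_store(columns):
    # Only the "amount" column is read; other fields are never touched.
    return sum(columns["amount"])


columns = to_columns(rows)
```

Both functions return the same total; the columnar version simply reads one contiguous list, which is the access pattern the excerpt points to for this kind of aggregate.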

Data Skepticism with Kyle Polich

http://traffic.libsyn.com/sedaily/dataskeptic_edited.mp3

With a fast-growing field like data science, it is important to keep some amount of skepticism. Tools can be overhyped, buzzwords can be overemphasized, and people can forget the fundamentals. If you have bad data, you will get bad results in your experimentation. If you don’t know what statistical approach you want to take to your data, it doesn’t matter how well you

Continue reading…

Data Intensive Applications with Martin Kleppmann

http://traffic.libsyn.com/sedaily/dataintensive_edited_fixed.mp3

A new programmer learns to build applications using data structures like a queue, a cache, or a database. Modern cloud applications are built using more sophisticated tools like Redis, Kafka, or Amazon S3. These tools do multiple things well, and often have overlapping functionality. Application architecture becomes less straightforward. The applications we are building today are data-intensive rather than compute-intensive. Netflix needs to

Continue reading…
