Category Data

Relational Databases with Craig Kerstiens

http://traffic.libsyn.com/sedaily/RelationalDBs.mp3

Relational databases are used by most applications. MySQL, Postgres, Microsoft SQL Server, and other products implement the core features of a relational database in different ways. A developer who has never studied this space in detail may not know the differences between these databases, and in this episode we describe some tradeoffs that relational databases can make. Craig Kerstiens is an engineer at

Continue reading…

Peter Bailis on the Data Community’s Identity Crisis

http://traffic.libsyn.com/sedaily/database_crisis_edited_fixed.mp3

Breakthroughs in modern data research tend to come from companies like Google, Facebook, and Amazon, with projects like MapReduce, Cassandra, and Dynamo. Twenty years ago, these types of breakthroughs would have happened in academia, which leads today’s guest Peter Bailis to ask: is the academic data community having an identity crisis? Peter is an assistant professor at Stanford University, where he

Continue reading…

Apache Arrow with Uwe Korn

http://traffic.libsyn.com/sedaily/arrow_edited_fixed.mp3

In a typical data analytics system, a variety of technologies interact: HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in Python. Each of these technologies represents data in a different format. Serialization and deserialization between these formats cause significant latency across the overall system. Apache Arrow is a tool for improving

Continue reading…

Cassandra Data Modeling with Jon Haddad

http://traffic.libsyn.com/sedaily/Cassandra_jon_haddad_Edited.mp3

Data modeling is the process of creating relationships and rules about objects, so that we can decide how to store them in a database. Data modeling defines how we store and query our database systems. Today’s episode features a discussion of data modeling in Cassandra with Jon Haddad, an evangelist at DataStax. The distributed nature of Cassandra creates some unique rules around

Continue reading…

Cassandra Compliant ScyllaDB with Dor Laor

http://traffic.libsyn.com/sedaily/ScyllaDB_Edited.mp3

Apache Cassandra is a distributed database that can handle large amounts of data with no single point of failure. Since 2008, Cassandra has been widely adopted and the software and the community around it have grown steadily. A software developer interacting with Cassandra uses CQL, the Cassandra Query Language. ScyllaDB is another open-source database that has been created to be totally compatible with

Continue reading…

Scaling PostgreSQL with Citus Data’s Ozgun Erdogan

http://traffic.libsyn.com/sedaily/Citus_Data.mp3

Ten years ago, databases were much simpler. Most companies would only have one or two types of databases in production. Today, the age of one-size-fits-all is over. Companies have multiple databases to deal with different types of use cases, and databases have become distributed across multiple nodes in order to be scalable. Ozgun Erdogan of Citus Data joins the show to give us

Continue reading…

Kafka, Storm, and Cassandra: Keen IO’s Analytics Architecture with Dan Kador

http://traffic.libsyn.com/sedaily/Keen_Edited_3.mp3

The process of building a software project requires us to make so many architectural decisions. Which programming languages should be used? Which cloud service provider? Which database? A newer type of building block is the analytics platform. Companies need to track events, aggregate metrics, and change the user’s experience based on aggregated data. Dan Kador is a co-founder of Keen IO, and he

Continue reading…

Netflix’s Data Pipeline with Steven Wu

http://traffic.libsyn.com/sedaily/Netflix.mp3

At Netflix, 500 billion events and 1.3 petabytes of data are ingested by the system per day. This includes video viewing activities, error logs, and performance events. On today’s episode, we dive deep into the data pipeline of Netflix, and how it evolved from their 1.0 version to the modern 2.0 version. Before listening to this episode, check out the blog post that

Continue reading…

Crate.io and Distributed SQL with Jodok Batlogg

http://traffic.libsyn.com/sedaily/Crate.mp3

Distributed databases are difficult to operate, and Crate.io wants to change that. Crate is a fast, scalable, easy-to-use SQL database that is built to run in containerized environments. An average software company runs several databases: MySQL as a relational store, MongoDB as a document database, HDFS for blob storage and data warehousing, and Elasticsearch for search. On today’s show, Jodok Batlogg from Crate discusses the

Continue reading…

Azure Stream Analytics with Santosh Balasubramanian

http://traffic.libsyn.com/sedaily/azure_stream_final.mp3

Microsoft has built a suite of technologies on top of its Azure infrastructure as a service. Today, we discuss Azure Stream Analytics, a real-time event processing engine developed at Microsoft. Azure Stream Analytics allows for constant querying of incoming data streams, and my guest Santosh Balasubramanian discusses Azure and the movement from batch processing to stream processing.

Continue reading…

Spark and Cassandra with Tim Berglund

http://traffic.libsyn.com/sedaily/Cassandra_Spark_Edited_2.mp3

Apache Spark is a framework for fast, distributed, in-memory analysis. Apache Cassandra is a distributed database management system that provides high availability and fast throughput. Today, we are collecting fast, big data streams from user behavior, smartphones, and sensors, and the disk checkpointing and query language of Hadoop MapReduce are no longer adequate. Tim Berglund from DataStax came on Software

Continue reading…

Azure Event Hubs and Kafka with Dan Rosanova

http://traffic.libsyn.com/sedaily/Eventhubs_Edited.mp3

Apache Kafka has become the most popular open-source solution for persistent replicated messaging in the Hadoop ecosystem. But some software engineers who are working with “big data” don’t want to deal with the configuration and setup of Kafka. One way to sidestep this problem is to go with a managed solution, like Microsoft Azure Event Hubs. Dan Rosanova is today’s guest.

Continue reading…

CockroachDB with Ben Darnell

http://traffic.libsyn.com/sedaily/cockroachdb_Edited.mp3

“Eventual consistency is really kind of a marketing term from some of these NoSQL systems – it’s not really consistent in any strong sense of the term.” Google has published papers on distributed systems such as BigTable, Chubby, and the Google File System. During this episode, we focus on a product that takes inspiration from Google’s Spanner project, a database that is built

Continue reading…

Stream Processing at Uber with Danny Yuan

http://traffic.libsyn.com/sedaily/uber_danny_edited.mp3

“Be aggressive in vision, but conservative in operation.” Uber is a transportation company with a high volume of temporal and spatial data, constantly being collected from the devices of its users. At any given time, the engineers and data scientists at Uber need to be able to query the system, and understand what is going on with drivers and riders. The unique real-time engineering

Continue reading…

Data Visualization and Mapping with Aurelia Moser

http://traffic.libsyn.com/sedaily/Mapping_Edited.mp3

“I’m always worried that if you teach too much magic, people don’t learn the basics – they don’t know why something is working, they just know the documentation said it should work that way.” On Software Engineering Daily, we often discuss big data in terms of data engineering and data science. Data engineering is the infrastructure and pipelines that handle massive amounts of

Continue reading…

FiloDB with Evan Chan

http://traffic.libsyn.com/sedaily/Filodb_Edited.mp3

“The world is becoming more and more interactive, and people want answers right away, so you’re seeing the rise of stream processing and real-time.” Big data is yesterday; fast data is now. FiloDB is a reactive columnar OLAP database that is built on Cassandra and Spark. Today’s guest is Evan Chan, creator of FiloDB. In our discussion today, we talk about the use cases

Continue reading…

Cassandra with Tim Berglund

http://traffic.libsyn.com/sedaily/Cassandra_Edited.mp3

“There isn’t any central node in Cassandra. Every node is a peer, there is no master – there is no single point of failure.” Apache Cassandra can serve as both the real-time data store for online transactional applications and the read-intensive database for data warehousing operations. In order to combine these two use cases into a single database, Apache Cassandra required

Continue reading…

Hadoop: Past, Present and Future with Mike Cafarella

http://traffic.libsyn.com/sedaily/Hadoop_2_Edited.mp3

“HDFS is going to be a cockroach – I don’t think it’s ever going away.” Hadoop was created in 2003. In the early years, Hadoop provided large-scale data processing with MapReduce, and distributed fault-tolerant storage with the Hadoop Distributed File System. Over the last decade, Hadoop has evolved rapidly, with the support of a big open-source community. Today’s guest is Mike Cafarella,

Continue reading…

Data Engineering at Airbnb with Maxime Beauchemin

http://traffic.libsyn.com/sedaily/Airbnb_Edited.mp3

“One big transformation we’re seeing right now is the slow agonizing death of MapReduce.” When a company gets big enough, there is so much data to be processed that an entire data engineering team becomes responsible for managing this data and making it available to other teams. Airbnb is one such company. Max Beauchemin works on the data engineering team at Airbnb, where

Continue reading…

Computational Neuroscience with Jeremy Freeman

http://traffic.libsyn.com/sedaily/Neuroscience_Edited.mp3

“You want to take a scientist who knows a little bit of MATLAB programming and try to teach them MapReduce, and write a MapReduce program in Java to do image processing? It’s a disaster!” Apache Spark is replacing MATLAB in the domain of computational neuroscience. The constraints of running MATLAB on a single machine can’t support the demands of neuroscience, which has huge

Continue reading…