Data Warehouse with Christian Kleinerman

A data warehouse provides fast access to large data sets for analytics, data science, and dashboards. It differs from a transactional database in that you rarely need to update specific records. Because access is mostly read-only and queries scan high volumes of data, the design of a data warehouse is very different from that of a transactional database.

With a transactional database (such as MySQL or MongoDB), it is important to have consistency guarantees. For example, consider a transactional database that serves as the backend for banking applications. If multiple frontend servers are hitting that transactional database to withdraw money, you need the records to be quickly updated. You need to avoid race conditions, so that two servers cannot withdraw the entire bank account balance simultaneously from different locations.
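The race condition described above can be sketched in a few lines. This is a minimal illustration, not how any particular bank or database does it: it uses Python's built-in sqlite3 module as a stand-in for a transactional database, and the `accounts` table and `withdraw` function are hypothetical names chosen for the example. The idea is to put the balance check and the debit into one atomic statement, so a second withdrawal cannot act on a stale balance.

```python
import sqlite3

def withdraw(conn, account_id, amount):
    """Debit `amount` only if the balance covers it; return True on success."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            # The check and the debit happen in a single statement, so there is
            # no window between reading the balance and writing the new value.
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        return cur.rowcount == 1

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

first = withdraw(conn, 1, 100)   # succeeds and empties the account
second = withdraw(conn, 1, 100)  # rejected: the balance is already 0
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(first, second, balance)
```

A production system would rely on the database's own isolation levels and locking across many concurrent connections; the single-connection example only shows why the check and the update must not be separated.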

In contrast to transactional databases, a data warehouse is often used to process a query that encompasses a big data set. For example, Netflix might want to answer the question: “how many users that watched House of Cards also watched Black Mirror?” Netflix has a lot of users, so it needs to store those user records in a way that lets a query scan through them quickly.
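To make the shape of such an analytical query concrete, here is a small sketch. It again uses Python's sqlite3 module purely as a stand-in (a real warehouse would run a similar query over far more rows), and the `views` table, its columns, and the sample rows are assumptions invented for this example rather than anything Netflix or Snowflake has described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical viewing log: one row per (user, title) viewing event.
conn.execute("CREATE TABLE views (user_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO views VALUES (?, ?)",
    [
        (1, "House of Cards"), (1, "Black Mirror"),
        (2, "House of Cards"),
        (3, "Black Mirror"),
        (4, "House of Cards"), (4, "Black Mirror"), (4, "Black Mirror"),
    ],
)

# Count the users who appear in both sets of viewers. INTERSECT keeps each
# user_id once, so repeat viewings do not inflate the count.
query = """
SELECT COUNT(*) FROM (
    SELECT user_id FROM views WHERE title = 'House of Cards'
    INTERSECT
    SELECT user_id FROM views WHERE title = 'Black Mirror'
)
"""
both = conn.execute(query).fetchone()[0]
print(both)  # users 1 and 4 watched both titles
```

Note that the query never updates a row: it reads two columns across the whole table, which is the access pattern a data warehouse is designed around.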

Christian Kleinerman is the VP of product at Snowflake Computing. Snowflake’s main product is a cloud data warehouse. In today’s show, we talk about the difference between a data warehouse, a data lake, and a transactional database, and the process of moving data sets between them, often known as ETL.

This show continues our series on data engineering and data platforms. As companies accumulate more and more data, the complexity of managing that data and taking full advantage of it is escalating. Christian gives his perspective on these changing trends, and describes the plans for Snowflake to evolve as a business.

 

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.


Sponsors

QCon San Francisco 2018 features 18 editorial tracks with 140+ speakers from places like Uber, Google, Dropbox, Slack, Twitter, and more. At QCon, we create a platform for senior software engineers, team leads, architects, and leaders working at innovator and early adopter companies to share their stories. It goes to the heart of who we are. We simply prefer practitioners over evangelists in the speakers we bring to the conference. SED listeners can save $100 off the price of a ticket using the promo code SED100.

OpenShift is a Kubernetes platform from Red Hat. OpenShift takes the Kubernetes container orchestration system and adds features that let you build software more quickly. OpenShift includes service discovery, CI/CD, built-in monitoring and health management, and scalability. With OpenShift, you avoid getting locked into any particular cloud provider. Check out OpenShift from Red Hat by going to softwareengineeringdaily.com/redhat.

Transifex is a SaaS-based localization and translation platform that easily integrates with your agile development process. Your software, websites, games, apps, video subtitles, and more can all be translated with Transifex. Use Transifex with in-house translation teams, language service providers, or even crowdsource your translations. If you’re a developer who is ready to reach a global audience, check out Transifex by visiting transifex.com/sedaily and sign up for a free 15-day trial.

Datadog is a cloud-scale monitoring platform for infrastructure and applications. And with Datadog’s new Live Container view, you can see every container’s health, resource consumption, and running processes in real time. See for yourself by starting a free trial and get a free Datadog T-shirt! softwareengineeringdaily.com/datadog.