Data Lineage: Understanding Data Lineage at Scale with Julien Le Dem
Big Data has exploded over the past decade as cloud computing and more efficient hardware removed most practical limits on scaling. Products like Uber revolve entirely around analyzing data to provide rides. An EMC/IDC study estimated that there would be approximately 5.2 TB of data for every person in 2020. That estimate was made before the transition to remote work, so the actual figure is likely much higher.
The term “data lineage” refers to the collection, origin, storage, transfer, and use of data over time. Given the size of the Big Data industry and related industries, maintaining a thorough data lineage can be very difficult even within small companies, and it becomes especially challenging at scale. What innovative tools make understanding all this information possible? Can data really continue growing at this rate?
In this episode, we talk with Julien Le Dem, CTO and Co-Founder at Datakin. We discuss the challenges, the available tools, and the future of big data and data lineage.