Pachyderm: Data Pipelines with Joe Doliner

Data infrastructure is advancing beyond the days of Hadoop MapReduce, single-node databases, and nightly reporting.

Companies are adopting modern data warehouses, streaming data systems, and cloud-specific data tools like BigQuery. Every company with a large amount of data wants to aggregate that data into a data lake and make it available to developers. All of this data can power machine learning models, which can potentially improve any area of a company that has historical data.

“Data pipeline” is a term used to describe the process of preparing data, building machine learning models, deploying those models, and tracking the results of those models.

Pachyderm is a company and open source project focused on the deployment, management, and scalability of data pipelines. Pachyderm allows developers to version data, track the state of data sets, backtest machine learning models, and collaborate on data. It also tackles the very hard problem of machine learning auditability.

Joe Doliner is the CEO of Pachyderm and joins the show to discuss his experience building Pachyderm over the last five years. Data infrastructure has changed a lot in five years, and the world has moved in a direction that has benefitted Pachyderm, with more infrastructure moving to containers and more data teams advancing beyond a world of just Hadoop MapReduce.

In today’s show, Joe talks about modern infrastructure, data provenance, and the long-term vision of Pachyderm.

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.


Sponsors

Stellares is a job recommendation engine for software engineers. Stellares uses machine learning to factor in the subtle aspects of the job search, from salary to work-life balance to team fit and personal learning goals, so that you find your perfect job. To find out more about Stellares, go to Stellares.ai/sedaily.

MongoDB is the most popular nonrelational database. MongoDB Stitch is a serverless platform from MongoDB that allows you to build rich interactions with your database. To try it out yourself today, experiment with $10 in free credit by going to mongodb.com/sedaily.

Deploy infrastructure faster; simplify life cycle maintenance for your servers; give IT the ability to deliver infrastructure to developers as a service like the public cloud. Go to softwareengineeringdaily.com/hpe and learn about how HPE OneView can improve your infrastructure operations.

Digital Ocean is the easiest cloud platform to run and scale your application. It is a complete cloud platform that helps developers and teams save time when running and scaling their applications. Try it out today and get a free $100 credit by going to do.co/sedaily.

Software Daily

Subscribe to Software Daily, a curated newsletter featuring the best and newest from the software engineering community.