Modin: Pandas Scalability with Devin Petersohn

Pandas is a Python data analysis library and an essential tool in data science. It lets users load large quantities of data into a data structure called a dataframe, on which they can call mathematical operations. This works well when the data fits entirely in memory, but sometimes there is too much data for a single machine.

The Modin project scales Pandas workflows to multiple machines by running on Dask or Ray, which are distributed computing frameworks for Python programs. Modin builds an execution plan for operations on large dataframes, including operations that combine dataframes with one another, which makes data science considerably easier on large data sets.
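Modin's documented entry point is to replace "import pandas as pd" with "import modin.pandas as pd" and leave the rest of a workflow unchanged. The sketch below is not Modin's code. It is a minimal standard-library illustration, under stated assumptions, of the general idea of splitting a table into partitions, computing on each partition in parallel, and combining the partial results. The helper names (split_rows, partial_sum_count, parallel_mean) and the sample "fare" column are hypothetical, and a thread pool stands in for the Ray or Dask workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_partitions):
    """Split a list of rows into contiguous chunks (at most n_partitions)."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum_count(partition, column):
    """Per-partition work: the sum and the count of one column."""
    values = [row[column] for row in partition]
    return sum(values), len(values)


def parallel_mean(rows, column, n_partitions=4):
    """Compute a column mean by combining per-partition partial results."""
    if not rows:
        raise ValueError("cannot take the mean of an empty table")
    partitions = split_rows(rows, n_partitions)
    # Assumption: threads stand in for distributed workers here. A real
    # engine would ship each partition to a separate process or machine.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(
            pool.map(lambda part: partial_sum_count(part, column), partitions)
        )
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical sample data: 100 rows with fares 1.0 through 100.0.
rows = [{"trip_id": i, "fare": float(i)} for i in range(1, 101)]
print(parallel_mean(rows, "fare"))  # 50.5
```

The mean is computed from per-partition sums and counts rather than per-partition means, so partitions of unequal size still combine into the correct overall result.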

Devin Petersohn started the Modin project, and he joins the show to talk about data science with Python and his work in the Berkeley RISELab.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Transcript

Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily.


Sponsors

SAP Data Intelligence connects and transforms data to extract value from the distributed data landscape. SAP Data Intelligence brings together data orchestration, metadata management, and powerful data pipelines with advanced machine learning, enabling close collaboration between data scientists and IT. To learn more about SAP Data Intelligence, visit sap.com/sedaily

CockroachDB is a distributed SQL database that makes it simple to build resilient, scalable applications quickly. CockroachDB is Postgres compatible, giving the same familiar SQL interface database developers have used for years. Host it on prem, run it in a hybrid cloud, and even deploy it across multiple clouds. Sign up for a free 30-day trial and get a free t-shirt at cockroachlabs.com/sedaily.

In their recent report on serverless adoption and trends, Datadog found that half of their customer base using EC2 has now adopted AWS Lambda. You can easily monitor all your serverless functions in one place and generate serverless metrics straight from Datadog. Check it out yourself by signing up for a free 14-day trial and get a free t-shirt at softwareengineeringdaily.com/datadog

Join us on August 26, 2020 for GitLab Virtual Commit! It is an immersive 24-hour day of practical DevOps strategies shared by developers, ops pros, engineers, managers and leaders. Attendees will hear from the U.S. Air Force and Army, GNOME Foundation, State Farm, Northwestern Mutual, Google, and more about problems solved, cultures changed, and release times halved. Come and be part of a community of people just as passionate as you are about DevOps. Register today!