Podcast: Play in new window | Download
In a typical data analytics system, there are a variety of technologies interacting. HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in Python–each of these different technologies has a different format for how data is represented.
Serialization and deserialization between these different formats causes significant latency across the overall system. Apache Arrow is a tool for improving performance of in-memory analytics systems, and today’s guest Uwe Korn explains how Arrow enables these systems with interoperability.