Tag: Apache Arrow

Columnar Data: Apache Arrow and Parquet with Julien Le Dem and Jacques Nadeau

http://traffic.libsyn.com/sedaily/columnardata_edited_fixed.mp3

Column-oriented data storage allows us to access all of the entries in a database column quickly and efficiently. Today, columnar storage formats are mostly relevant for large analytics jobs. For example, if you are a bank and you want to sum all of the financial transactions that took place on your system in the last week, you don’t want

Continue reading…

Apache Arrow with Uwe Korn

http://traffic.libsyn.com/sedaily/arrow_edited_fixed.mp3

In a typical data analytics system, a variety of technologies interact: HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in Python. Each of these technologies represents data in its own format. Serialization and deserialization between these formats add significant latency across the overall system. Apache Arrow is a tool for improving

Continue reading…