Alluxio and Memory-centric Distributed Storage with Haoyuan Li
“It’s not really about removing disk from the picture per se – it’s more like saying, ‘how do we leverage more and more resources from DRAM?’ ”
Memory is king. The costs of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. This trend is creating opportunities in big data processing.
Alluxio, formerly known as Tachyon, is an open source, memory-centric, distributed, and reliable storage system that enables data sharing across clusters at memory speed. Its creator, Haoyuan Li, was a member of the Berkeley AMPLab, the same research lab where Apache Mesos and Apache Spark were born. In this episode, we discuss Alluxio, Spark, Hadoop, and the evolution of data center software architecture.
Questions
- Why is the growing throughput of memory so important to the big data stack?
- How has memory hierarchy evolved over time?
- Should we start migrating all of the functionality of disk to RAM?
- What are the problems with needing to replicate to disk?
- What is underFS?
- How often do nodes fail in a typical cluster?
- What is lineage-based storage?
- How does the workflow of a data scientist or data engineer change with the addition of Alluxio?
Links
- Alluxio
- Baidu
- underFS
- Scale-out architecture
- Data lineage
- Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds
- HY on Twitter