Spark
Dask: Scalable Python with Matthew Rocklin
Python is the most widely used language for data science, and Python data scientists commonly rely on several libraries, including NumPy, pandas, and scikit-learn.
Building a Big Data Pipeline With Airflow, Spark and Zeppelin
Featured image: “black tunnel interior with white lights” by Jared Arango on Unsplash. This article was originally written by Mahdi Karabiben on Medium. Reposted with permission.
Apache Beam with Frances Perry
Unbounded data streams create difficult challenges for our application architectures. The data never stops coming, and we are forced to assume that we will never know if or when we have
Peter Bailis on the Data Community’s Identity Crisis
Breakthroughs in modern data research tend to come from companies like Google, Facebook, and Amazon, with projects like MapReduce, Cassandra, and Dynamo. Twenty years ago, this
Apache Arrow with Uwe Korn
In a typical data analytics system, there are a variety of technologies interacting. HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in