Spark

Dask: Scalable Python with Matthew Rocklin

Python is the most widely used language for data science, and Python data scientists commonly rely on several libraries, including NumPy, Pandas, and scikit-learn.

Building a Big Data Pipeline With Airflow, Spark and Zeppelin

Featured Image: “black tunnel interior with white lights” by Jared Arango on Unsplash. This article was originally written by Mahdi Karabiben on Medium. Reposted with permission.

Apache Beam with Frances Perry

Unbounded data streams create difficult challenges for our application architectures. The data never stops coming, and we are forced to assume that we will never know if or when we have

Peter Bailis on the Data Community’s Identity Crisis

Breakthroughs in modern data research tend to come from companies like Google, Facebook, and Amazon, with projects like MapReduce, Cassandra, and Dynamo. Twenty years ago, this

Apache Arrow with Uwe Korn

In a typical data analytics system, there are a variety of technologies interacting. HDFS for storing files, Spark for distributed machine learning, pandas for data analysis in