Streaming vs Batch: The Differences

Monday, August 3, 2015

Sean Owen, Director, Data Science @ Cloudera via Quora

Although people use the word in different ways, Hadoop refers to an ecosystem of projects, most of which are not processing systems at all. It contains MapReduce, which is a very batch-oriented data processing paradigm.

Spark is also part of the Hadoop ecosystem, I’d say, although it can be used separately from things we would call Hadoop. Spark is a batch processing system at heart too. Spark Streaming is a stream processing system.

To me a stream processing system:

Computes a function of one data element, or a smallish window of recent data
Computes something relatively simple
Needs to complete each computation in near-real-time — probably seconds at most
Performs computations that are generally independent of one another
Is asynchronous: the source of data doesn’t interact with the stream processing directly, for example by waiting for an answer
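To make those properties concrete, here is a minimal sketch in plain Python rather than in Spark Streaming. The function name `windowed_mean` and the sample readings are assumptions made up for illustration. It computes something simple over a small window of recent data, one element at a time, without ever holding the full data set.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_mean(events: Iterable[float], window_size: int = 3) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` values as each event arrives."""
    # A bounded deque drops the oldest value automatically, so memory use
    # stays constant no matter how long the stream runs.
    window = deque(maxlen=window_size)
    for value in events:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical sensor readings standing in for an unbounded stream.
readings = [10.0, 20.0, 30.0, 40.0]
results = list(windowed_mean(readings, window_size=2))
print(results)  # [10.0, 15.0, 25.0, 35.0]
```

Because the function is a generator, each result is available as soon as its input arrives, which is the property that keeps per-element latency low.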

A batch processing system to me is just the general case, rather than a special type of processing, but I suppose you could say that a batch processing system:

Has access to all data
Might compute something big and complex
Is generally more concerned with throughput than latency of individual components of the computation
Has latency measured in minutes or more
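For contrast, a batch computation can be sketched the same way. This is an illustration in plain Python, not MapReduce or Spark code, and `batch_summary` is a hypothetical name. The point is that the job is handed the whole data set at once and computes statistics, such as a median, that need all of it.

```python
from statistics import median


def batch_summary(dataset: list[float]) -> dict[str, float]:
    """Summarize a complete, non-empty data set in one pass over all of it."""
    # Sorting requires every record up front, which is why this kind of
    # work is judged on overall throughput rather than per-record latency.
    ordered = sorted(dataset)
    return {
        "count": len(ordered),
        "median": median(ordered),
        "max": ordered[-1],
    }


summary = batch_summary([10.0, 20.0, 30.0, 40.0])
print(summary)  # {'count': 4, 'median': 25.0, 'max': 40.0}
```

Nothing is produced until the entire input has been read, so the answer arrives once, at the end, rather than per element.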

I sometimes hear streaming used as a sort of synonym for real-time. Real-time work usually takes the form of needing to respond to an event in milliseconds, as in a synchronous API. This isn’t streaming to me.
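The synchronous versus asynchronous distinction can be sketched with Python's standard `queue` and `threading` modules. All the names here (`handle_request`, `stream_worker`, the `None` end-of-stream sentinel) are assumptions for illustration only. In the first style the caller blocks on an answer; in the second the source hands events off and never sees a reply.

```python
import queue
import threading


def handle_request(x: int) -> int:
    """Synchronous style: the caller waits for, and receives, the answer."""
    return x * 2


def stream_worker(inbox: queue.Queue, sink: list) -> None:
    """Streaming style: consume events as they arrive and write results elsewhere."""
    while True:
        event = inbox.get()
        if event is None:  # hypothetical end-of-stream marker
            break
        sink.append(event * 2)


# Synchronous API: the caller uses the return value directly.
answer = handle_request(21)

# Asynchronous stream: the source only enqueues events.
inbox: queue.Queue = queue.Queue()
processed: list = []
worker = threading.Thread(target=stream_worker, args=(inbox, processed))
worker.start()
for event in (1, 2, 3):
    inbox.put(event)  # fire and forget; nothing is returned to the source
inbox.put(None)
worker.join()

print(answer, processed)  # 42 [2, 4, 6]
```

The source loop finishes enqueuing without regard to how fast the worker runs, whereas the synchronous caller cannot continue until `handle_request` returns.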
