Streaming vs Batch: The Differences

Sean Owen, Director, Data Science @ Cloudera via Quora

Although people use the word in different ways, Hadoop refers to an ecosystem of projects, most of which are not processing systems at all. It contains MapReduce, which is a very batch-oriented data processing paradigm.

Spark is also part of the Hadoop ecosystem, I’d say, although it can be used separately from things we would call Hadoop. Spark is a batch processing system at heart too. Spark Streaming is a stream processing system.

To me a stream processing system:

  • Computes a function of one data element, or a smallish window of recent data
  • Computes something relatively simple
  • Needs to complete each computation in near-real-time — probably seconds at most
  • Computations are generally independent
  • Is asynchronous – the source of data doesn’t interact with the stream processor directly, for example by waiting for an answer
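
As a rough illustration of the list above, here is a minimal Python sketch of a stream-style computation: a simple function over a small window of recent data, evaluated once per arriving element. The function name `windowed_mean` and the window size are assumptions made for this example. They are not part of Spark Streaming or any other real system.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_mean(events: Iterable[float], window: int = 3) -> Iterator[float]:
    """Yield the mean of the most recent `window` values as each event arrives."""
    # Hypothetical helper for illustration only; not a library API.
    recent = deque(maxlen=window)  # values older than the window drop out automatically
    for value in events:
        recent.append(value)
        # Each result depends only on the small window, never on the full history.
        yield sum(recent) / len(recent)


if __name__ == "__main__":
    # A plain list stands in for an unbounded source such as a message queue.
    for mean in windowed_mean([10.0, 20.0, 30.0, 40.0]):
        print(mean)
```

The producer of the events never waits on these results, and the memory used stays bounded by the window size no matter how long the stream runs.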

A batch processing system to me is just the general case, rather than a special type of processing, but I suppose you could say that a batch processing system:

  • Has access to all data
  • Might compute something big and complex
  • Is generally more concerned with throughput than latency of individual components of the computation
  • Has latency measured in minutes or more
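
By contrast, a batch-style computation might look like the following minimal Python sketch. It assumes the whole dataset fits in a list of `(key, value)` pairs with a non-zero total; the function name `batch_summary` and the record layout are invented for this example. The point is only that neither result can be produced until all of the data has been seen.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def batch_summary(records: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Compute a per-key median and each key's share of the grand total."""
    # Hypothetical helper for illustration only; assumes a non-empty dataset
    # whose values do not sum to zero.
    by_key = defaultdict(list)
    for key, value in records:  # full pass over all data
        by_key[key].append(value)

    # The grand total is only known once every record has been read.
    grand_total = sum(sum(values) for values in by_key.values())

    return {
        key: {"median": median(values), "share": sum(values) / grand_total}
        for key, values in by_key.items()
    }
```

A median and a share of the total both need the complete input, which is why a job like this is judged by how much data it gets through rather than by how quickly any single record is handled.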

I sometimes hear streaming used as a sort of synonym for real-time. Real-time stuff usually takes the form of needing to respond to an event in milliseconds, as in a synchronous API. This isn’t streaming to me.
