Apache Kafka’s Uses and Target Market

From Nicolae Marasoiu’s answer via Quora:

Kafka is a high performance messaging system which provides an immutable, linearizable, sharded log of messages. Throughput and storage capacity scale linearly with nodes. Kafka can push astonishingly high volume through each node; often saturating disk, network, or both, while keeping a low cpu utilization.


You would use Kafka in scenarios of asynchronous communication and processing pipelines, predominantly in distributed systems, cloud & big data, including the following cases:

  • messaging via publish-subscribe or queue paradigms
  • persistent buffer at systems entry point that can absorb traffic spikes, allowing inner system to react and process in their own optimal pace
  • stream processing & soft real time analytics, paired with a stream processor like Apache Flink, Spark streaming or Storm and with a database like HBase
  • log transport & aggregator: a way to consolidate event streams in their way to backup, analytics
The beauty of this is that just sending your logs, or event streams to Kafka, gives you all those advantages at once: events consolidation, backup, on-the-fly analytics, pressure absorbing, processing by multiple systems of the same events, and re-processing them at any time while they live in Kafka (a few weeks, configurable). This is why some name Kafka the “new enterprise bus” for data intensive companies.

Kafka is offering quite a few guarantees, including:

 

  • at-least once, at most once, exactly once (tunable, configurable)
  • messages total-ordering at shard level:
    • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
    • A consumer instance sees messages in the order they are stored in the log.
    • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.
    • By setting a shard policy per topic (partition count + partition function), admins trade off between load balancing and stateful processing

Here are the more detailed use cases, straight out of Kafka documentation:

 

Messaging

Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.

 

In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.

 

Website Activity Tracking

 

The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.

 

Activity tracking is often very high volume as many activity messages are generated for each user page view.

 

Metrics

 

Kafka is often used for operation monitoring data pipelines. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

 

Log Aggregation

 

Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

 

Stream Processing

 

Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption. For example a processing flow for article recommendation might crawl article content from RSS feeds and publish it to an “articles” topic; further processing might help normalize or deduplicate this content to a topic of cleaned article content; a final stage might attempt to match this content to users. This creates a graph of real-time data flow out of the individual topics. The Storm framework is one popular way for implementing some of these transformations. Recently, Apache Flink is a more efficient exactly-once stream processing solution. Spark Streaming is also suitable in many scenarios, but does not have a truly streaming semantic.

Alternatives to Apache Kafka include:

Comments