Meet Apache Kafka
Kafka has become a central tool for data at many large organizations.
At data-intensive companies like Fiverr and Netflix, Kafka is used simultaneously as:
- a database
- a queue for ordered processing
- a tool for sharing data between different teams
- the backup tool that can fix any other system if a catastrophic failure occurs
- a platform to decouple different teams that are dependent on each other
MEET APACHE KAFKA: WHAT IT IS, ORIGINS
Kafka: a streaming platform — a central hub for real-time streams of data.
Apache Kafka is an open-source distributed streaming platform.
It is a publish-subscribe messaging system rethought as a distributed commit log so producers and consumers can publish messages to each other.
Kafka serves as the central repository for data streams in a distributed system.
Kafka was originally developed at LinkedIn, and the creators of the project eventually left LinkedIn and started Confluent, a company that builds enterprise products around Kafka.
“Systems are giving up correctness for latency, and I’m arguing that stream processing systems have to be designed to allow the user to pick the tradeoffs that the application needs.”
– Neha Narkhede
Event sourcing is an architectural pattern that allows changes to our application model to be represented as events.
Each event is published to an event queue, and is pulled off of the queue by each of the various services that need to consume that event.
Event sourcing and the related architectural pattern CQRS allow for a flow of information through an application that is easy to reason about, and has several other desirable properties.
Kafka can be used for event sourcing. Related software patterns are improving the architectures of companies like Netflix, Uber, eBay, and Yelp.
The major terms of Kafka’s architecture are topics, records, and brokers. Topics consist of stream of records holding different information. On the other hand, Brokers are responsible for replicating the messages.
There are four major APIs in Kafka:
- Producer API – Permits the applications to publish streams of records.
- Consumer API – Permits the application to subscribe to the topics and processes the stream of records.
- Streams API – This API converts the input streams to output and produces the result.
- Connector API – Executes the reusable producer and consumer APIs that can link the topics to the existing applications.
WHEN TO USE KAFKA
- messaging via publish-subscribe or queue paradigms
- persistent buffer at systems entry point that can absorb traffic spikes, allowing inner system to react and process in their own optimal pace
- stream processing and soft real time analytics, paired with a stream processor like Apache Flink, Spark streaming or Storm and with a database like HBase
- log transport & aggregator: a way to consolidate event streams in their way to backup, analytics
The beauty of this is that just sending your logs, or event streams to Kafka, gives you all those advantages at once: events consolidation, backup, on-the-fly analytics, pressure absorbing, processing by multiple systems of the same events, and re-processing them at any time while they live in Kafka (a few weeks, configurable). This is why some name Kafka the “new enterprise bus” for data intensive companies.
Kafka offers quite a few guarantees, including:
- at-least once, at most once, exactly once (tunable, configurable)
- messages total-ordering at shard level:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees messages in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.
- By setting a shard policy per topic (partition count + partition function), admins trade off between load balancing and stateful processing
DETAILED USE CASES FROM THE KAFKA DOCS
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.
Website Activity Tracking
The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
Activity tracking is often very high volume as many activity messages are generated for each user page view.
Kafka is often used for operation monitoring data pipelines. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption. For example a processing flow for article recommendation might crawl article content from RSS feeds and publish it to an “articles” topic; further processing might help normalize or deduplicate this content to a topic of cleaned article content; a final stage might attempt to match this content to users. This creates a graph of real-time data flow out of the individual topics. The Storm framework is one popular way for implementing some of these transformations. Recently, Apache Flink is a more efficient exactly-once stream processing solution. Spark Streaming is also suitable in many scenarios, but does not have a truly streaming semantic.
Kafka Streams is a Java stream-processing library for building streaming applications that transform input Kafka topics into output Kafka topics. In a time when there are numerous streaming frameworks already out there, why do we need yet another? To quote guest Jay Kreps, “the gap we see Kafka Streams filling is less the analytics-focused domain these frameworks focus on and more building core applications and microservices that process data streams.”
Jay is the CEO of Confluent, a company that is building Kafka technology, and he is one of the original authors of Kafka. Kafka evolved to be the message broker of choice for so many data engineering stacks.
Kafka deployments can be a complex to manage. Kafka is very popular, but is not easy to deploy and operationalize. That is why Confluent has built a Kafka-as-a-service product, so that managing Kafka is not the job of an on-call DevOps engineer. There are many complexities to building this system.
Apache Kafka has become the most popular open-source solution for persistent replicated messaging in the Hadoop ecosystem. But some software engineers who are working with big data don’t want to deal with the configuration and setup of Kafka. One way to sidestep this problem is to go with a managed solution, like Microsoft Azure Event Hubs. Or recently, Heroku developed the Heroku Kafka product, which is another managed version of Apache Kafka.
FUTURE OF KAFKA
Monitoring Kafka performance at scale is more important than ever, and there are several open source and paid services that do this. Stay tuned to follow what happens next with Kafka.