A stream is an unbounded flow of data (call it messages, events, or logs).

The Streams API of Apache Kafka®, available as a Java library, can be used to build highly scalable, elastic, fault-tolerant, distributed applications and microservices. First and foremost, the Kafka Streams API allows you to create real-time applications that power your core business. It is the easiest, yet most powerful, technology for processing data stored in Kafka.

Whether your team uses Kafka as a message broker, an event-sourcing system, a change log, or a commit log, you will have producers and consumers (typically built with the Kafka Producer and Consumer APIs). Besides using the Kafka Consumer API to process messages/events, the Kafka Streams API is another option. Let’s discuss this approach in detail.

There is a wealth of interesting work happening in the stream processing area, ranging from open source frameworks like Apache Spark, Apache Storm, Apache Flink, and Apache Samza to proprietary services such as Google Cloud Dataflow and AWS Lambda, so it is worth outlining how Kafka Streams is similar to, and different from, these systems.

On top of what other stream processing frameworks offer, Kafka Streams directly addresses a lot of the hard problems in stream processing:

  • Event-at-a-time processing (not microbatch) with millisecond latency
  • Stateful processing including distributed joins and aggregations
  • A convenient DSL
  • Windowing with out-of-order data using a DataFlow-like model
  • Distributed processing and fault-tolerance with fast failover
  • Reprocessing capabilities so you can recalculate output when your code changes
  • No-downtime rolling deployments
  • And all of this with a very simple architecture, something the other open source streaming frameworks cannot claim 😂

Framework-Free Stream Processing

Existing streaming frameworks come with a heavy and complex deployment process. You need to stand up a cluster of, say, Apache Storm, and then deploy your application code into that cluster, which then copies it across all the nodes.

A Kafka Streams application is written with a simple library API, without any framework. You can run it as a single instance, and if you start another instance, Kafka will redistribute the load evenly across the instances.
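
For illustration, here is a minimal sketch of such an application. The topic names (orders, orders-uppercased) and the broker address are assumptions; the point is that it is an ordinary Java main method with only the kafka-streams library on the classpath.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderEventsApp {

    public static void main(String[] args) {
        // Standard Kafka Streams configuration; the application.id doubles as the
        // consumer group id, so every instance started with the same id shares the work.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-events-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Build a trivial topology: read from one topic, transform, write to another.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase())
              .to("orders-uppercased");

        // Run it as a plain Java process; no cluster to install, no framework to deploy into.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```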

So What Does Kafka Streams Do Instead?

It does the following:

  1. Balance the processing load as new instances of your app are added or existing ones crash
  2. Maintain local state for tables
  3. Recover from failures

The result is that a Kafka Streams app is just like any other service. It may have some local state on disk, but that is just a cache that can be recreated if it is lost or if that instance of the app is moved elsewhere. You just use the library in your app, and start as many instances of the app as you like, and Kafka will partition up and balance the work over these instances.
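
Two settings are worth knowing in this context. Shown here as additions to the Properties object from the sketch above (the directory path is just an example), they control where that local cache lives and whether a warm standby copy is kept on another instance for fast failover:

```java
// Continues the `props` object from the minimal application above.
// Local state is only a cache of changelog topics stored in Kafka.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
// Keep one warm replica of each state store on another instance for fast failover.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
```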

How to Deploy a Kafka Streams Application

These applications can be packaged, deployed, and monitored like any other Java application – there is no need to install separate processing clusters or similar special-purpose and expensive infrastructure!

You can deploy a Kafka Streams application using any of the following deployment platforms (a minimal lifecycle sketch follows the list):

  • Apache Mesos with a framework like Marathon
  • Kubernetes
  • YARN with something like Slider
  • Swarm from Docker
  • Various hosted container services such as ECS from Amazon
  • Cloud Foundry
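
Whichever platform you choose, the application itself only needs to report its state and shut down cleanly when the platform stops it. A rough sketch of that lifecycle follows; the class and method names are just placeholders.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.streams.KafkaStreams;

public class StreamsRunner {

    // Runs the topology the same way on Kubernetes, Mesos, ECS, or bare metal:
    // start the JVM, report state changes, and close cleanly when the platform stops it.
    public static void run(KafkaStreams streams) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);

        // A state listener is a simple hook for health checks and monitoring.
        streams.setStateListener((newState, oldState) ->
                System.out.println("Streams state changed: " + oldState + " -> " + newState));

        // SIGTERM from the orchestrator triggers a clean close, so partitions and local
        // state are handed over to the remaining instances during rolling deployments.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            streams.close();
            latch.countDown();
        }));

        streams.start();
        latch.await();
    }
}
```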

Streams Meet Tables

A table is nothing but a snapshot of the result of processing a stream of changes.

The current state of a database table is nothing but the result of processing a series of add-data and update-data events. Example: when you log in to the Amazon website, you sometimes see your shopping cart already showing items that you added in the past. If the current state of your cart shows 5 items, it means you added those 5 items one after another: the stream of Item-Added-Into-Cart events was processed, and each time the count was increased by 1. It is also possible that you deleted some items from the cart, in which case another stream of Item-Deleted-From-Cart events was processed and the count was reduced one item at a time. The end result in a database table is nothing but the outcome of processing the inbound change events.
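
As a sketch of that duality, the cart example could be expressed in the Streams DSL roughly like this, assuming a hypothetical cart-events topic keyed by user id, with value +1 for add events and -1 for delete events:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;

class CartTopology {
    // Hypothetical "cart-events" topic keyed by user id, with value +1 for
    // Item-Added-Into-Cart and -1 for Item-Deleted-From-Cart. Summing those deltas
    // per user yields the table: the current number of items in each cart.
    static KTable<String, Integer> cartSize(StreamsBuilder builder) {
        return builder.stream("cart-events", Consumed.with(Serdes.String(), Serdes.Integer()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Integer()))
                .reduce(Integer::sum);
    }
}
```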

Ways to capture change logs

By modeling the table concept in this way, Kafka Streams lets you compute derived values against the table using just the stream of changes. In other words, it lets you process database change streams just as you would process a stream of clicks. One way to capture those database changes into Kafka is Kafka Connect, a framework built for data capture that was added to Apache Kafka in the 0.9 release.
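
For example (the topic name and the value layout are made up), a change stream captured by Connect can be read directly as a table and aggregated; the derived count is updated on every change event:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;

class CustomersByRegion {
    // Hypothetical setup: a Kafka Connect CDC connector writes every row change of a
    // customers database table into the "db.customers" topic (key = customer id,
    // value = region). Reading it as a KTable gives the table's current state, and the
    // re-keyed count is a derived value that stays up to date with every change event.
    static KTable<String, Long> customersPerRegion(StreamsBuilder builder) {
        KTable<String, String> customers =
                builder.table("db.customers", Consumed.with(Serdes.String(), Serdes.String()));

        return customers
                .groupBy((customerId, region) -> KeyValue.pair(region, region),
                         Grouped.with(Serdes.String(), Serdes.String()))
                .count();
    }
}
```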

Joins and Aggregates are Tables Too

Let’s say I have a stream of user clicks coming in and I want to compute the total number of clicks for each user. Kafka Streams lets you compute this aggregation, and the set of counts that is computed is, unsurprisingly, a table of the current number of clicks per user.

In terms of implementation, Kafka Streams stores this derived aggregation in a local embedded key-value store (RocksDB by default, but you can plug in anything else). The output of the job is exactly the changelog of updates to this table. This changelog is used for high availability of the computation, but it is also an output that can be consumed and transformed by other Kafka Streams applications or loaded into another system using Kafka Connect.
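
Here is a sketch of that click-count job with a named state store and the changelog written back to a topic; the topic and store names are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

class ClicksPerUser {
    // Hypothetical "clicks" topic keyed by user id. The count is materialized in a local
    // key-value store (RocksDB by default), and the table's update stream, i.e. its
    // changelog, is written back to Kafka for downstream consumers.
    static void build(StreamsBuilder builder) {
        KTable<String, Long> clicksPerUser = builder
                .stream("clicks", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("clicks-per-user-store"));

        clicksPerUser.toStream()
                .to("clicks-per-user", Produced.with(Serdes.String(), Serdes.Long()));
    }
}
```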

KStreams and KTables share a lot of the same operations and can be converted back and forth, just as the table/stream duality suggests, but, for example, an aggregation on a KTable will automatically handle the fact that it is made up of updates to the underlying values. This matters, because the semantics of computing a sum over a table undergoing updates and over a stream of immutable events are totally different; likewise, the semantics of joining two streams (say clicks and impressions) are totally different from the semantics of joining a stream to a table (say clicks to user accounts). By modeling these two concepts in the DSL, these details fall out automatically.
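
For instance, a stream-table join of clicks against user accounts looks roughly like this in the DSL; the topic names and the joined value are made up:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

class ClickEnrichment {
    // Join each click (a stream of immutable events) against the current state of the
    // user-accounts table; the table side always contributes its latest value for the key.
    static KStream<String, String> enrich(StreamsBuilder builder) {
        KStream<String, String> clicks =
                builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> accounts =
                builder.table("user-accounts", Consumed.with(Serdes.String(), Serdes.String()));

        return clicks.join(accounts,
                (clickUrl, account) -> account + " clicked " + clickUrl,
                Joined.with(Serdes.String(), Serdes.String(), Serdes.String()));
    }
}
```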

Windows and Tables 

Kafka Streams makes handling windowed, out-of-order data really simple: the semantics of a windowed aggregation like a count are that it represents the count “so far” for the window. It is continuously updated as new data arrives, and the downstream receiver decides when it is complete. And yes, this notion of an updatable quantity should seem eerily familiar: it is nothing more than a table where the window being updated is part of the key. Naturally, downstream operations know that this stream represents a table and process these refinements as they come.
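
A windowed count in the DSL looks like the sketch below, assuming a recent Kafka Streams version (3.x; older releases use TimeWindows.of instead of ofSizeWithNoGrace) and the hypothetical clicks topic from earlier:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

class WindowedClicks {
    // Count clicks per user in 5-minute windows. The result is a table whose key is
    // (user, window); every late or out-of-order record simply updates the count "so far".
    static KTable<Windowed<String>, Long> clicksPerWindow(StreamsBuilder builder) {
        return builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count();
    }
}
```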
