Heroku’s Streaming Data Connectors and Best Practices

Article Thursday, September 3 2020

Heroku has three managed data service offerings: Heroku Postgres, Heroku Redis, and Apache Kafka on Heroku. According to this article from Heroku, more developers are choosing to include Kafka in their architecture up front, rather than integrating it later in an application’s lifecycle. This is evidence of a recent, growing trend toward architecting event-driven applications. Event-driven architecture is by no means a novel paradigm; enterprises were building event-driven systems long before Kafka was open-sourced. What Kafka did was lower the barrier to entry into the world of event-driven applications. However, for an application built on Heroku Postgres, there was no straightforward way to integrate Kafka into the application’s architecture. Sure, developers could wire Kafka connectors into their system themselves, but this could be tedious. The point of integration between Kafka and the rest of the system was not an out-of-the-box solution.

Figure 1

As of July 2020, there is no longer a need for this emergent complexity; streaming data connectors are here to save the day (as well as lower the bar to building event-driven applications). Built on Kafka Connect and Debezium, Heroku’s streaming data connectors offer an elegant solution to integrating Kafka into a Heroku application. The aforementioned Heroku article that introduces streaming data connectors notes the features that render them superior to prior solutions: “… fully-managed, [have] a simple and powerful user experience, and [come] with [Heroku’s] operational excellence …”. Heroku’s streaming data connectors allow developers to subscribe to changes in their Heroku Postgres database and push these changes to corresponding Kafka topics, with all of the complexity managed by Heroku.

Figure 2

To contextualize the hype behind the introduction of these streaming data connectors, it’s useful to examine the paradigm of event-driven architecture and how it’s situated in the software ecosystem. Think of an event as a message that provides context about an operation, as well as information about the data involved in this operation. Let’s use a hypothetical involving LinkedIn, the company that created Kafka, to flesh this out. Suppose LinkedIn publishes an event when a user sends a connection request. Now, let’s examine how the defining characteristics of an event (context, operation, and state) may relate to this hypothetical event. The context, in this case, could include information like the time at which the request was initially processed, the ID of the device the connection request came from, and the geographic location of the datacenter it was processed in. The operation tied to this event could include information indicating the kinds of functions it triggered: creations, updates, deletions, etc. Finally, the state associated with this event could indicate that two users were involved in this operation and that each user is identified by a unique ID. Heroku’s streaming data connectors observe changes to Postgres tables, form events from these changes, and send these events to Kafka.
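To make the three characteristics concrete, here is a minimal sketch of how the hypothetical connection-request event could be modeled. Every class and field name below is illustrative, invented for this example; it is not LinkedIn’s schema, nor the format that streaming data connectors emit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConnectionRequestEvent:
    """Hypothetical event for the LinkedIn example; all field names are illustrative."""

    # Context: when, from which device, and in which datacenter it was processed.
    processed_at: datetime
    device_id: str
    datacenter_region: str
    # Operation: the kind of change the event describes.
    operation: str  # e.g. "create", "update", "delete"
    # State: the data involved, here two users identified by unique IDs.
    sender_id: str
    recipient_id: str


event = ConnectionRequestEvent(
    processed_at=datetime(2020, 9, 3, 12, 0, tzinfo=timezone.utc),
    device_id="device-123",
    datacenter_region="us-west",
    operation="create",
    sender_id="user-1",
    recipient_id="user-2",
)
```

Marking the dataclass as frozen reflects the idea that an event records something that already happened and should not be modified afterwards.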

Figure 3

As noted, many enterprises had been using event-driven architecture before the release of Kafka. In other words, the merits of an event-driven architecture do not stem from Kafka. Rather, some of Kafka’s merit stems from its ties to the event-driven paradigm. That said, Big Data keeps on getting bigger (shocker, right?), and the use cases for data are increasing in complexity. Together with the rise of microservices, Kafka is redefining what it means for an architecture to be event-driven. This can be seen in the rise of platform engineering, as enterprises migrate from legacy technologies and adapt their systems’ architecture to integrate modern technologies.

Figure 4

There are some critical best practices to keep in mind when developing a system that uses streaming data connectors. First, since these streaming data connectors are built on top of Debezium and Kafka Connect, it’s helpful to be aware of specifics related to these platforms. According to Debezium’s documentation about Postgres connectors, if the system is being operated “normally or being managed carefully”, then “Debezium provides exactly once delivery of every change event record”. Exactly once delivery means that each change captured by Debezium is delivered one time only, with no duplicates. This guarantee is extremely helpful for building transactional support into a system; without it, consumers must account for repeated events themselves, which increases the complexity of building transactional support.

However, Debezium doesn’t guarantee exactly once delivery when recovering from a fault. In that case, it provides only the guarantee of at least once delivery. For developers, this means that Kafka consumers should be built to handle repeated events. One common approach to this problem, suggested on this Kafka FAQ page, is to “include a primary key (UUID or something) in the message and deduplicate on the consumer.”
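The deduplication approach quoted above can be sketched as follows. This is a minimal, in-memory illustration rather than a production consumer: the `event_id` key, the `process_once` function, and the sample payloads are all assumptions made for this example, and no Kafka client library is involved.

```python
from typing import Callable, Dict, Iterable, List, Set


def process_once(
    events: Iterable[Dict[str, str]],
    handle: Callable[[Dict[str, str]], None],
    seen_ids: Set[str],
) -> int:
    """Call handle for each event whose ID has not been seen; return how many ran."""
    handled = 0
    for event in events:
        event_id = event["event_id"]  # hypothetical unique key carried in each message
        if event_id in seen_ids:
            continue  # duplicate, e.g. redelivered after a fault; skip it
        handle(event)
        seen_ids.add(event_id)  # record the ID only after the handler succeeds
        handled += 1
    return handled


# A batch in which event "a1" is delivered twice.
received: List[Dict[str, str]] = []
batch = [
    {"event_id": "a1", "payload": "connect user-1 -> user-2"},
    {"event_id": "b2", "payload": "connect user-3 -> user-4"},
    {"event_id": "a1", "payload": "connect user-1 -> user-2"},
]
seen: Set[str] = set()
count = process_once(batch, received.append, seen)
```

Here `count` is 2 and `received` holds only the first copy of each event. In a real consumer, the set of seen IDs would have to live in durable storage shared by all consumer instances; an in-memory set is lost on restart, which is exactly when duplicates are most likely.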

Figure 5

All messages received carry context about the “before” data of the rows they describe, but only for a subset of columns. A streaming data connector constructs this part of the message passed on to Kafka by taking the columns of the table that are part of the table’s REPLICA IDENTITY. A table’s REPLICA IDENTITY is “the information which is written to the write-ahead log to identify rows which are updated or deleted”, according to Postgres’s documentation. By default, it includes only the row’s primary key.
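The practical consequence for a consumer is that it can only detect changes in columns whose old values are actually present. The sketch below uses a simplified event shape with `before`, `after`, and `op` keys, loosely following Debezium’s change event envelope; the `known_changes` helper and the `id` and `email` columns are hypothetical, and the example assumes that columns outside the replica identity are simply absent from `before`.

```python
from typing import Any, Dict, Tuple


def known_changes(event: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {column: (old, new)} for columns whose old value appears in "before"."""
    before = event.get("before") or {}  # may be missing or None
    after = event.get("after") or {}  # None for a delete
    return {
        column: (old, after[column])
        for column, old in before.items()
        if column in after and after[column] != old
    }


# Update seen with the default replica identity: only the primary key is in "before".
default_identity = {
    "op": "u",
    "before": {"id": 1},
    "after": {"id": 1, "email": "new@example.com"},
}

# The same update if every column's old value were available.
full_identity = {
    "op": "u",
    "before": {"id": 1, "email": "old@example.com"},
    "after": {"id": 1, "email": "new@example.com"},
}
```

With `default_identity`, `known_changes` returns an empty dictionary even though `email` changed, because the old value was never included in the message. With `full_identity`, it reports the `email` change. Consumers that need old values for non-key columns should take this into account.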

A final best practice concerns how to add, remove, or rename primary keys without risking messages with an inconsistent schema. Debezium retrieves all but one set of schema information from the Postgres logical decoder plug-in: the set of columns that represent the primary key. This set of columns is gathered from JDBC metadata instead. Since Heroku uses Debezium, this is noteworthy: when a primary key changes, the change data in a logical decoding event may be inconsistent with the primary key definition pulled from JDBC metadata.

Figure 6

Granted, this inconsistency is not guaranteed to occur, nor is it an issue that must be resolved by a human; only a small number of messages with inconsistent schemas will be sent to Kafka. Nonetheless, there is a sequence of operations that reduces the likelihood of this situation arising: set Postgres to read-only, allow the Heroku streaming data connectors to finish processing lingering events, stop the streaming data connectors, alter the primary key, set Postgres back to a read-write state, and then restart the streaming data connectors. More information can be found in Debezium’s documentation.
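The ordering of these steps matters more than how each one is carried out, so the sketch below captures only the order. The `change_primary_key` function and its parameter names are hypothetical; each step is passed in as a callable because the actual commands for pausing connectors or switching the database mode depend on your setup and are not shown here.

```python
from typing import Callable, List

Step = Callable[[], None]


def change_primary_key(
    *,
    set_read_only: Step,
    wait_for_connector_drain: Step,
    stop_connectors: Step,
    alter_primary_key: Step,
    set_read_write: Step,
    start_connectors: Step,
) -> None:
    """Run the six steps in order; an exception in any step stops the sequence."""
    set_read_only()
    wait_for_connector_drain()
    stop_connectors()
    alter_primary_key()
    set_read_write()
    start_connectors()


# Stand-in steps that only record their names, to show the call order.
log: List[str] = []
change_primary_key(
    set_read_only=lambda: log.append("read-only"),
    wait_for_connector_drain=lambda: log.append("drain"),
    stop_connectors=lambda: log.append("stop"),
    alter_primary_key=lambda: log.append("alter"),
    set_read_write=lambda: log.append("read-write"),
    start_connectors=lambda: log.append("start"),
)
```

Keyword-only parameters make it hard to pass the steps in the wrong order by accident. Error handling is deliberately left out: if a step fails partway through, deciding whether to return the database to read-write mode or restart the connectors is a judgment call this sketch does not make for you.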

We’ve examined only a few best practices, all of which relate to using streaming data connectors. There are a number of others related to creating and destroying streaming data connectors, as well as to handling failures. For more information, visit this documentation published by Heroku on best practices for working with streaming data connectors.