Kafka consumer lag monitoring

The consumer lag is a key performance indicator for services / Storm topologies that utilize Kafka as a message broker. The lag indicates how far behind a service / topology is in processing topic messages. The lag is calculated simply as the delta between the last produced message and the last committed message (the last processed or read message).

A large or a growing lag indicates that a consumer is not capable to keep up with the volume of topic messages. So the Kilda monitoring solution should focus on the consumer lag as a crucial metric for monitoring and alerting.

What did we need to achieve?

When planning for a lag monitoring solution, the following needs must be took into account:

Flexibility: From a metrics perspective, we want to introduce alerts for different use cases of Kafka in services / topologies. The lag monitoring and alerting should help in understanding a problem and assist in resolving it.
Scalable: The number of Kilda topics continues to grow. Thus, we need a solution that will scale to support hundreds of consumer groups.
Automation: We wouldn't need to change config files every time a new topic / consumer group is added to the Kafka cluster.

Proposed solutions

Burrow

Burrow is a monitoring solution for Kafka that performs consumer status checking. It monitors committed offsets and lags, and calculates the status of consumers. The results are exposed via an HTTP endpoint.

Burrow provides enough flexibility, including custom metrics calculations (e.g. lag time, lag evaluation rules). Burrow also has configurable notifiers for status updates.

We should establish a set of rules that determines the normal and bad consumer behavior. This can be done without the need for defining a threshold for the lag:

If any lag within the window is zero (or close to it with some deviation), the status is considered to be normal.
If lag stays the same or increases between every pair of offsets, the consumer is slow, and is falling behind.

Kafka Lag Exporter

Kafka Lag Exporter is a tool to represent consumer group metrics using Kubernetes, Prometheus, and Grafana.

Consumer Group Time Lag

One of Kafka Lag Exporter’s unique features is its ability to estimate the length of time that a consumer group is behind the last produced value for a particular partition, time lag.

That's why it matters: two consumer groups may have different lag characteristics. Topic A has a producer with low write rate and a consumer with a stable read rate. As the result the lag for topic A has stable linear graph with values close to 0. Meanwhile, topic B has much higher write rate and a consumer which performs batch reading. The lag for topic B has saw-tooth graph with high spikes. Thus, the simple values of the consumer lag are not descriptive enough without understanding the topic characteristics.

But it’s much easier to build monitoring alerts using a time lag measurement than an offset lag measurement because latency is best described in requirements as a unit of time.

Provide feedback

Saved searches