Kafka Rebalancing

No Comments
57

Asif Saleem

July 1, 2024
3 minute read

Kafka Rebalancing

No Comments
57

Kafka rebalancing is a crucial process that ensures each partition in a consumer group is handled by only one consumer at a time. Here’s a closer look at what triggers this process and how it works.

Key Points about Kafka Rebalancing

Definition and Trigger

Rebalancing: This involves mapping each partition to each single consumer in a consumer group
Triggers: Rebalancing kicks in when a consumer joins or leaves a group. For instance, in the image above we have only one consumer in the consumer group. When more consumers join the group, rebalancing kicks in as shown in the consumer block section below. In the Kubernetes environment, this can happen due to updates, new application deployments, crashes, and more.

Consumer Blocking

During rebalancing, consumer processing is generally halted in eager rebalancing, except for incremental rebalancing, which only affects the partitions being transferred.

Configuration Properties

These configuration properties are directly connected with Kafka rebalancing

max.poll.records (default=500): Maximum number of messages a consumer can retrieve in one poll.
max.partition.fetch.bytes (default=1048576): Maximum bytes returned in a poll for a single partition.
max.poll.interval.ms (default=300000): Time allowed for a consumer to process all messages from a poll before it leaves the group.
heartbeat.interval.ms (default=3000): Frequency at which consumers send heartbeats.
session.timeout.ms (default=10000): Time a consumer has to send a heartbeat before being considered inactive.

Rebalancing Reasons

There are multiple reasons, why the kafka needs rebalancing but some important are list below

Exceeding max.poll.interval.ms
Exceeding session.timeout.ms due to missed heartbeats
New consumers joining the group
Consumer shutdown
Pod relocation
Application scale-down and many more

Processing Time Variability

Consumers process records quickly if they are not large enough, While consumers working on larger record sets take longer to process and such processing that exceeds max.poll.interval.ms can trigger rebalancing.

Impact of Configuration Adjustments

Increasing max.poll.interval.ms allows more time for processing but can delay rebalancing if processing takes too long.

Improvements and Strategies

Kafka Static Membership

Static Membership: Retains a consumer’s identity in a group even after restarts, reducing rebalancing frequency.
Implementation in Kubernetes: Use StatefulSets to ensure consistent consumer IDs.

Incremental Cooperative Rebalancing

This method allows incremental rebalancing, minimizing disruption by only transferring necessary partitions, However, it doesn’t resolve issues associated with long processing times for larger tasks, if any consumer is processing larger records let’s say for hours, then rebalancing would also take hours.

Suggestions

Configuration Adjustments: While frequent changes to session.timeout.ms and heartbeat.interval.ms aren’t recommended, adjusting max.poll.interval.ms and max.poll.records can help.
Use Static Membership Protocol: Reduces rebalancing frequency for applications prone to frequent rebalancing.
Initial Rebalance Delay: Configuring group.initial.rebalance.delay.ms can delay the initial rebalance, allowing all members to join before rebalancing starts.

Understanding these triggers and configurations can help manage Kafka rebalancing more effectively, improving performance and reducing downtime. Moreover, as per the KIP:932, the Kafka community has released something called shared consumer, where each topic behaves like a message queue where Kafka design restrictions such as having the same number of consumers as opposed to having the same number of partitions can be skipped and this proposal solves few of the problem associated with the Kafka rebalancing. For larger deployments, we need more scaleable solutions such as proxying topic messages to HTTP and putting consumers after the proxy to consumer HTTP messages.