Kafka rebalancing is a crucial process that ensures each partition in a consumer group is handled by only one consumer at a time. Here’s a closer look at what triggers this process and how it works.
Key Points about Kafka Rebalancing
Definition and Trigger
- Rebalancing: This involves mapping each partition to each single consumer in a consumer group
- Triggers: Rebalancing kicks in when a consumer joins or leaves a group. For instance, in the image above we have only one consumer in the consumer group. When more consumers join the group, rebalancing kicks in as shown in the consumer block section below. In the Kubernetes environment, this can happen due to updates, new application deployments, crashes, and more.
Consumer Blocking
During rebalancing, consumer processing is generally halted in eager rebalancing, except for incremental rebalancing, which only affects the partitions being transferred.
Configuration Properties
These configuration properties are directly connected with Kafka rebalancing
- max.poll.records (default=500): Maximum number of messages a consumer can retrieve in one poll.
- max.partition.fetch.bytes (default=1048576): Maximum bytes returned in a poll for a single partition.
- max.poll.interval.ms (default=300000): Time allowed for a consumer to process all messages from a poll before it leaves the group.
- heartbeat.interval.ms (default=3000): Frequency at which consumers send heartbeats.
- session.timeout.ms (default=10000): Time a consumer has to send a heartbeat before being considered inactive.
Rebalancing Reasons
There are multiple reasons, why the kafka needs rebalancing but some important are list below
- Exceeding max.poll.interval.ms
- Exceeding session.timeout.ms due to missed heartbeats
- New consumers joining the group
- Consumer shutdown
- Pod relocation
- Application scale-down and many more
Processing Time Variability
Consumers process records quickly if they are not large enough, While consumers working on larger record sets take longer to process and such processing that exceeds max.poll.interval.ms can trigger rebalancing.
Impact of Configuration Adjustments
Increasing max.poll.interval.ms allows more time for processing but can delay rebalancing if processing takes too long.
Improvements and Strategies
Kafka Static Membership
- Static Membership: Retains a consumer’s identity in a group even after restarts, reducing rebalancing frequency.
- Implementation in Kubernetes: Use StatefulSets to ensure consistent consumer IDs.
Incremental Cooperative Rebalancing
This method allows incremental rebalancing, minimizing disruption by only transferring necessary partitions, However, it doesn’t resolve issues associated with long processing times for larger tasks, if any consumer is processing larger records let’s say for hours, then rebalancing would also take hours.
Suggestions
- Configuration Adjustments: While frequent changes to session.timeout.ms and heartbeat.interval.ms aren’t recommended, adjusting max.poll.interval.ms and max.poll.records can help.
- Use Static Membership Protocol: Reduces rebalancing frequency for applications prone to frequent rebalancing.
- Initial Rebalance Delay: Configuring group.initial.rebalance.delay.ms can delay the initial rebalance, allowing all members to join before rebalancing starts.
Understanding these triggers and configurations can help manage Kafka rebalancing more effectively, improving performance and reducing downtime. Moreover, as per the KIP:932, the Kafka community has released something called shared consumer, where each topic behaves like a message queue where Kafka design restrictions such as having the same number of consumers as opposed to having the same number of partitions can be skipped and this proposal solves few of the problem associated with the Kafka rebalancing. For larger deployments, we need more scaleable solutions such as proxying topic messages to HTTP and putting consumers after the proxy to consumer HTTP messages.