Controller Broker
Within a Kafka cluster, one broker is elected as the Controller -- the cluster-wide leader responsible for administrative coordination. This is the Leader and Follower pattern applied at the cluster level.
What the Controller does
| Responsibility | Detail |
|---|---|
| Topic management | Creating/deleting topics, adding partitions |
| Partition leader assignment | Assigns a leader broker to each partition |
| Broker health monitoring | Detects broker failures and triggers failover |
| Leader election | When a broker dies, reassigns leadership of its partitions to surviving in-sync replica (ISR) members |
| Metadata propagation | Communicates partition leadership changes to all brokers |
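The leader election row can be illustrated with a minimal sketch. The function name `elect_leaders` and the dictionary layout are hypothetical, chosen for illustration; they are not Kafka's internal API. The sketch assumes the controller picks the first surviving ISR member in replica-assignment order and leaves the partition without a leader when no ISR member survives.

```python
from typing import Dict, Optional


def elect_leaders(partitions: Dict[str, dict], failed_broker: int) -> Dict[str, Optional[int]]:
    """Return the new leader for every partition that was led by failed_broker.

    partitions maps a partition name to
    {"leader": int, "replicas": [int, ...], "isr": [int, ...]}.
    A value of None means no in-sync replica survived (partition offline).
    """
    new_leaders: Dict[str, Optional[int]] = {}
    for name, state in partitions.items():
        if state["leader"] != failed_broker:
            continue  # leadership of this partition is unaffected
        # Only replicas that are still in sync are eligible (assumption of this sketch).
        live_isr = [b for b in state["isr"] if b != failed_broker]
        # Preserve replica-assignment order when choosing among eligible brokers.
        candidates = [b for b in state["replicas"] if b in live_isr]
        new_leaders[name] = candidates[0] if candidates else None
    return new_leaders


# Hypothetical cluster state: broker 1 leads two of the three partitions.
partitions = {
    "orders-0": {"leader": 1, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    "orders-1": {"leader": 2, "replicas": [2, 3, 1], "isr": [2, 3]},
    "orders-2": {"leader": 1, "replicas": [1, 3], "isr": [1]},
}
result = elect_leaders(partitions, failed_broker=1)
```

Here `orders-0` moves to broker 2, `orders-1` is untouched because broker 2 still leads it, and `orders-2` has no surviving ISR member, so it is left without a leader.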
Split-brain: the zombie controller problem
When the controller broker becomes unresponsive (GC pause, network partition), the cluster elects a new controller. But what if the old controller comes back? Now there are two controllers -- a classic split-brain scenario.
Kafka runs on the JVM, whose garbage collector can pause all application threads for seconds at a time (a stop-the-world pause). During this pause, the broker sends no heartbeats and appears dead to the rest of the cluster. If the pause exceeds ZooKeeper's session timeout, the controller's session expires and a new controller is elected -- and when the GC pause ends, the old controller wakes up thinking it's still in charge.
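The timing condition can be sketched as a simple comparison. The function `session_expired` and the millisecond values below are illustrative assumptions, not ZooKeeper's actual implementation or a recommended configuration.

```python
def session_expired(last_heartbeat_ms: int, now_ms: int, session_timeout_ms: int) -> bool:
    """True if no heartbeat has arrived within the session timeout."""
    return now_ms - last_heartbeat_ms > session_timeout_ms


SESSION_TIMEOUT_MS = 18_000  # illustrative value only

last_heartbeat_ms = 0  # the controller's last heartbeat before the GC pause began

# A 5-second pause stays inside the timeout: the controller keeps its role.
short_pause = session_expired(last_heartbeat_ms, now_ms=5_000,
                              session_timeout_ms=SESSION_TIMEOUT_MS)

# A 25-second pause exceeds the timeout: the session expires and a new
# controller is elected while the old one is still frozen.
long_pause = session_expired(last_heartbeat_ms, now_ms=25_000,
                             session_timeout_ms=SESSION_TIMEOUT_MS)

print(short_pause, long_pause)
```

Only the second case produces a zombie: the paused controller never observes the expiry itself, because it was not running when it happened.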
Solution: epoch numbers
Kafka solves split-brain with a generation clock (epoch number):
- Every time a new controller is elected, the epoch number increments (for example, if the old controller had epoch 1, the new one has epoch 2)
- The epoch is included in every request from the controller to other brokers
- Brokers reject requests from any controller with a lower epoch than the highest they've seen
- The epoch is persisted in ZooKeeper, so it survives restarts
When the zombie controller (epoch 1) tries to issue commands, every broker has already seen epoch 2 and ignores it. The zombie eventually discovers it's been superseded and steps down.
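The fencing rule can be sketched in a few lines. The classes `EpochStore` and `Broker`, and the command strings, are hypothetical stand-ins for illustration: `EpochStore` plays the role of the epoch persisted in ZooKeeper, and `Broker` models only the epoch check, not real Kafka request handling.

```python
class EpochStore:
    """Stand-in for the persisted controller epoch (hypothetical)."""

    def __init__(self) -> None:
        self.epoch = 0

    def elect_controller(self) -> int:
        # Each election increments the epoch and hands it to the winner.
        self.epoch += 1
        return self.epoch


class Broker:
    """Models only the epoch check a broker applies to controller requests."""

    def __init__(self, broker_id: int) -> None:
        self.broker_id = broker_id
        self.highest_epoch_seen = 0
        self.applied = []

    def handle_controller_request(self, controller_epoch: int, command: str) -> bool:
        if controller_epoch < self.highest_epoch_seen:
            return False  # stale epoch: the request comes from a zombie controller
        self.highest_epoch_seen = controller_epoch
        self.applied.append(command)
        return True


store = EpochStore()
brokers = [Broker(i) for i in range(3)]

# (1) The first controller is elected with epoch 1 and issues a command.
old_epoch = store.elect_controller()
for b in brokers:
    b.handle_controller_request(old_epoch, "leader(orders-0)=1")

# (2) The old controller pauses; a new one is elected with epoch 2.
new_epoch = store.elect_controller()
for b in brokers:
    b.handle_controller_request(new_epoch, "leader(orders-0)=2")

# (3) The old controller wakes up and (4) its commands are rejected everywhere.
zombie_results = [
    b.handle_controller_request(old_epoch, "leader(orders-0)=3") for b in brokers
]
```

Every entry in `zombie_results` is `False`, and no broker applies the zombie's command. The check needs no coordination between brokers: each one only compares the incoming epoch against the highest value it has already seen.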
Together, the controller broker, split-brain, and epoch numbers form a complete answer to "How does Kafka handle controller failures?" Walk through: (1) the controller dies or pauses, (2) a new controller is elected with a higher epoch, (3) the old controller wakes up, (4) its commands are rejected because brokers have seen the higher epoch. This shows you understand both the problem and the solution.