12 Split Brain
Your distributed database has a leader that coordinates all writes. The leader goes silent -- maybe it crashed, maybe the network is partitioned, maybe it's just paused for garbage collection. The cluster elects a new leader. Thirty seconds later, the old leader wakes up and starts accepting writes again.
You now have two leaders. Both think they're in charge. Both are issuing conflicting commands. This is split-brain, and it can corrupt your data.
Background
In any leader-based system (GFS, HDFS, Kafka, Chubby), the leader is a critical coordination point. When the leader appears to fail, the system must elect a new one quickly to maintain availability.
But here's the problem: you cannot reliably distinguish between "dead" and "temporarily unreachable." A leader that's experiencing a GC pause, a network partition, or a slow disk looks identical to a dead leader from the outside. If you elect a new leader too aggressively, the old leader might come back -- creating a zombie leader that doesn't know it's been replaced.
During garbage collection, some runtimes (like Java's) pause all application threads to reclaim memory. These pauses can last seconds -- long enough for the rest of the cluster to assume the node has died and elect a new leader.
The danger: the zombie leader has stale state and stale authority. If it continues serving requests, clients connected to it will see different data than clients connected to the real leader. Writes can conflict, reads can diverge, and the system can enter an inconsistent state that's extremely difficult to recover from.
Definition
Split-brain occurs when a distributed system has two or more nodes that simultaneously believe they are the leader.
The standard solution is the Generation Clock (also called an epoch number, or a term in Raft) -- a monotonically increasing number that identifies each leader's "generation."
How it works
- Every time a new leader is elected, the generation number is incremented (e.g., old leader was generation 1, new leader is generation 2)
- The generation number is included in every request from the leader to other nodes
- Nodes reject requests from any leader with a generation number lower than the highest they've seen
- The generation number is persisted to disk (often in the write-ahead log) so it survives restarts
This means when the zombie leader (generation 1) tries to issue commands, every node has already seen generation 2 and will ignore it. The zombie leader itself, upon receiving a response with a higher generation number, can also recognize that it's been superseded and step down.
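The rejection rule above can be sketched in a few lines. This is an illustrative model, not any real database's API -- the class and method names are invented for the example:

```python
class Node:
    """A follower that tracks the highest leader generation it has seen."""

    def __init__(self):
        # In a real system this is persisted to disk (e.g., in the WAL)
        # so it survives restarts.
        self.highest_generation = 0

    def handle_request(self, generation, command):
        # Reject any request from a leader older than the newest we've seen.
        # The response echoes our highest generation so a zombie leader
        # can notice it has been superseded and step down.
        if generation < self.highest_generation:
            return {"accepted": False, "generation": self.highest_generation}
        self.highest_generation = generation
        return {"accepted": True, "generation": self.highest_generation}

node = Node()
node.handle_request(2, "write x=1")          # new leader (generation 2): accepted
resp = node.handle_request(1, "write x=2")   # zombie leader (generation 1): rejected
# resp["accepted"] is False, and resp["generation"] == 2 tells the
# zombie it has been replaced
```

Note that a request with an *equal* generation is accepted -- the current leader may retry, and only strictly older generations are stale.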
Why not just use timestamps instead? Timestamps suffer from clock skew -- two nodes might disagree on what time it is. Generation numbers are purely logical: they only increment when a specific event (leader election) occurs, and they're maintained by the consensus protocol itself, not by hardware clocks. They're monotonically increasing by definition, with no ambiguity.
Examples
Kafka
Kafka uses an epoch number for its controller broker. When a new controller is elected, the epoch increments. All requests from the controller include the epoch number. Brokers compare the epoch in incoming requests against the latest epoch they've seen and reject stale commands.
HDFS
HDFS uses ZooKeeper for NameNode leader election. Each transaction ID includes an epoch number that reflects the NameNode's generation. Combined with fencing -- actively blocking the old NameNode from touching shared storage, rather than waiting for it to notice on its own -- this ensures a zombie NameNode cannot corrupt the file system.
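Fencing can be sketched as the same generation check, applied at the shared storage rather than at peer nodes. This is a simplified model, not the HDFS JournalNode protocol -- the class and token values are invented for illustration:

```python
class FencedStorage:
    """Shared storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0  # highest epoch that has written so far
        self.data = {}

    def write(self, token, key, value):
        # A write from an older epoch is fenced off entirely -- the storage
        # refuses it even if the zombie leader never learns it was replaced.
        if token < self.highest_token:
            raise PermissionError(
                f"fenced: token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value

store = FencedStorage()
store.write(2, "edit-log-001", "txn from epoch 2")   # current leader: accepted
try:
    store.write(1, "edit-log-001", "stale txn")      # zombie (epoch 1): fenced
except PermissionError:
    pass  # the stale write never reaches disk
```

The point of fencing is that correctness no longer depends on the zombie's cooperation: even a leader that never processes another response cannot do damage.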
Cassandra
Cassandra stores a generation number that increments on every node restart. This number is included in gossip messages, allowing other nodes to distinguish between a node's pre-restart state and post-restart state. If a gossip message arrives with a higher generation number, the receiving node knows the sender has restarted and updates its state accordingly.
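The gossip comparison can be sketched as a two-level ordering: generation first (did the node restart?), then a per-session version counter. This is a simplified model in the spirit of Cassandra's endpoint state, not its actual implementation -- the field names are illustrative:

```python
def newer(state_a, state_b):
    """Return whichever endpoint state is more recent.
    A higher generation always wins (the node restarted, so everything
    from before the restart is stale); only on a tie does the per-session
    version counter break the tie."""
    if state_a["generation"] != state_b["generation"]:
        return state_a if state_a["generation"] > state_b["generation"] else state_b
    return state_a if state_a["version"] >= state_b["version"] else state_b

before = {"generation": 1, "version": 42}  # pre-restart, many heartbeats in
after  = {"generation": 2, "version": 1}   # post-restart, version reset to 1
winner = newer(before, after)
# winner is `after`: the restart trumps the higher pre-restart version
```

Without the generation number, the post-restart state (version 1) would look older than the pre-restart state (version 42), and gossip would keep resurrecting stale information about the node.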
Split-brain comes up whenever you design a system with leader election. The interviewer wants to hear: (1) you recognize the zombie leader risk, (2) you use generation numbers / epoch numbers to tag leadership, and (3) you combine this with fencing to prevent the old leader from doing damage while it figures out it's been replaced.