12 Split Brain

Your distributed database has a leader that coordinates all writes. The leader goes silent -- maybe it crashed, maybe the network is partitioned, maybe it's just paused for garbage collection. The cluster elects a new leader. Thirty seconds later, the old leader wakes up and starts accepting writes again.

You now have two leaders. Both think they're in charge. Both are issuing conflicting commands. This is split-brain, and it can corrupt your data.

Think first
The leader of your distributed database goes silent -- maybe crashed, maybe just a GC pause. The cluster elects a new leader. 30 seconds later, the old leader wakes up and starts accepting writes again. What damage can this cause, and how would you prevent it?

Background

In any leader-based system (GFS, HDFS, Kafka, Chubby), the leader is a critical coordination point. When the leader appears to fail, the system must elect a new one quickly to maintain availability.

But here's the problem: you cannot reliably distinguish between "dead" and "temporarily unreachable." A leader that's experiencing a GC pause, a network partition, or a slow disk looks identical to a dead leader from the outside. If you elect a new leader too aggressively, the old leader might come back -- creating a zombie leader that doesn't know it's been replaced.

Stop-the-world GC pause

During garbage collection, some runtimes (like Java's) pause all application threads to reclaim memory. These pauses can last seconds -- long enough for the rest of the cluster to assume the node has died and elect a new leader.

The danger: the zombie leader has stale state and stale authority. If it continues serving requests, clients connected to it will see different data than clients connected to the real leader. Writes can conflict, reads can diverge, and the system can enter an inconsistent state that's extremely difficult to recover from.

Definition

Split-brain occurs when a distributed system has two or more nodes that simultaneously believe they are the leader.

The standard solution: Generation Clock (also called an epoch number) -- a monotonically increasing number that identifies each leader's "generation."

How it works

  1. Every time a new leader is elected, the generation number is incremented (e.g., old leader was generation 1, new leader is generation 2)
  2. The generation number is included in every request from the leader to other nodes
  3. Nodes reject requests from any leader with a generation number lower than the highest they've seen
  4. The generation number is persisted to disk (often in the write-ahead log) so it survives restarts

This means when the zombie leader (generation 1) tries to issue commands, every node has already seen generation 2 and will ignore it. The zombie leader itself, upon receiving a response with a higher generation number, can also recognize that it's been superseded and step down.
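The four steps above can be sketched as follows. This is a minimal in-memory illustration with made-up class names (`Node`, `Leader`), not any real system's API; a real implementation would persist the generation to disk, as step 4 requires.

```python
class Node:
    """A follower that remembers the highest generation it has seen."""

    def __init__(self):
        self.highest_generation = 0  # persisted to disk in a real system
        self.log = []

    def handle_write(self, generation, entry):
        # Step 3: reject requests from any leader with a stale generation.
        if generation < self.highest_generation:
            return ("rejected", self.highest_generation)
        self.highest_generation = generation
        self.log.append(entry)
        return ("ok", self.highest_generation)


class Leader:
    """A leader stamped with the generation it was elected under (step 2)."""

    def __init__(self, generation):
        self.generation = generation
        self.stepped_down = False

    def replicate(self, nodes, entry):
        for node in nodes:
            status, seen = node.handle_write(self.generation, entry)
            if status == "rejected" and seen > self.generation:
                # A higher generation exists: we are a zombie. Step down.
                self.stepped_down = True
                return False
        return True


nodes = [Node(), Node(), Node()]

old = Leader(generation=1)
old.replicate(nodes, "x=1")   # succeeds: generation 1 is current

new = Leader(generation=2)    # step 1: election increments the generation
new.replicate(nodes, "x=2")   # every node now remembers generation 2

old.replicate(nodes, "x=3")   # zombie write: rejected, old leader steps down
```

After the last call, every node still holds exactly the writes from legitimate leaders, and the zombie has discovered on its own that it was superseded.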

Why not just use timestamps?

Timestamps suffer from clock skew -- two nodes might disagree on what time it is. Generation numbers are purely logical: they only increment when a specific event (leader election) occurs, and they're maintained by the consensus protocol itself, not by hardware clocks. They're monotonically increasing by definition, with no ambiguity.
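A tiny numeric illustration of the skew hazard, with made-up numbers: node B is elected three seconds *after* node A, but B's clock runs five seconds slow, so timestamp comparison gets the order backwards.

```python
# True election order: A first, then B three seconds later.
leader_a_timestamp = 1_000.0          # A's clock is accurate
leader_b_timestamp = 1_003.0 - 5.0    # B's clock is 5 seconds behind

# Wall-clock comparison is wrong: B looks "older" than A.
assert leader_b_timestamp < leader_a_timestamp

# Logical generation numbers cannot be skewed: they only move
# when an election happens, so later elections always compare higher.
gen_a, gen_b = 1, 2
assert gen_b > gen_a
```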

Examples

Kafka

Kafka uses an epoch number for its controller broker. When a new controller is elected, the epoch increments. All requests from the controller include the epoch number. Brokers compare the epoch in incoming requests against the latest epoch they've seen and reject stale commands.

HDFS

HDFS uses ZooKeeper for NameNode leader election. Each active NameNode writes under an epoch number that reflects its generation, and the shared edit log rejects writes tagged with a stale epoch. Combined with fencing, this ensures a zombie NameNode cannot corrupt the file system.

Cassandra

Cassandra stores a generation number that increments on every node restart. This number is included in gossip messages, allowing other nodes to distinguish between a node's pre-restart state and post-restart state. If a gossip message arrives with a higher generation number, the receiving node knows the sender has restarted and updates its state accordingly.
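A sketch of that merge rule, loosely modeled on the behavior described above; the class and field names are illustrative assumptions, not Cassandra's actual internals.

```python
class PeerState:
    """Gossip state for one peer: a restart generation plus a version
    that increments on every state change within a generation."""

    def __init__(self, generation, version):
        self.generation = generation
        self.version = version


def merge_gossip(local, incoming):
    """Decide which view of a peer to keep after a gossip message."""
    if incoming.generation > local.generation:
        # Higher generation: the peer restarted, so everything we
        # knew about its pre-restart state is obsolete.
        return incoming
    if incoming.generation == local.generation and incoming.version > local.version:
        return incoming  # newer state within the same lifetime
    return local  # stale or duplicate message: ignore it


known = PeerState(generation=5, version=40)

restarted = PeerState(generation=6, version=1)   # peer came back up
kept = merge_gossip(known, restarted)            # restart wins despite version 1

stale = PeerState(generation=5, version=12)      # delayed pre-restart gossip
ignored = merge_gossip(known, stale)             # old state is kept
```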

Interview angle

Split-brain comes up whenever you design a system with leader election. The interviewer wants to hear: (1) you recognize the zombie leader risk, (2) you use generation numbers / epoch numbers to tag leadership, and (3) you combine this with fencing to prevent the old leader from doing damage while it figures out it's been replaced.

Quiz
A system uses generation numbers to prevent split-brain. The old leader (generation 1) wakes up from a GC pause and sends a write to shared storage. The storage system does NOT check generation numbers. What happens?