How 7 Systems Handle Failure
Every distributed system in this course was designed with a specific failure model in mind. The differences in how they detect, tolerate, and recover from failure reveal the deepest architectural choices their designers made. This page compares all seven systems side by side -- not to summarize each one, but to expose the patterns and trade-offs that connect them.
The failure landscape
Not all failures are created equal, and not all systems care about the same ones.
| System | Primary failure concern | Why this matters most |
|---|---|---|
| Dynamo | Node unavailability during peak traffic | A shopping cart that errors out is a lost sale |
| Cassandra | Node failure in multi-datacenter deployments | Must survive entire datacenter outages |
| GFS | Chunkserver crashes and disk corruption | Hardware failure is routine at Google's scale |
| HDFS | NameNode single point of failure | One metadata server controls the entire cluster |
| BigTable | Tablet server crashes mid-compaction | Must recover tablet state without data loss |
| Kafka | Broker failure causing message loss | Financial and operational data cannot be lost |
| Chubby | Master failure disrupting lock service | Thousands of clients depend on lock correctness |
Detection mechanisms: how systems know something is wrong
Detection is the first step in any failure response. The seven systems use four fundamentally different approaches, and the choice of detection mechanism reveals what each system values most.
Heartbeat-based detection
The simplest model: a node periodically sends "I'm alive" messages. If heartbeats stop arriving within a configured timeout, the node is declared dead.
Used by: GFS, HDFS, BigTable, Kafka
The heartbeat approach works well for leader-based systems where a central authority (the master or controller) monitors all workers. The master collects heartbeats, maintains a global view of cluster health, and makes unilateral decisions. This is simple and fast -- but it creates a dependency on the master itself.
See: Heartbeat pattern
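The master-side bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not any system's actual implementation; the worker names and 10-second timeout are arbitrary:

```python
import time

class HeartbeatMonitor:
    """Master-side view: a worker is declared dead after `timeout` seconds of silence."""
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_seen = {}  # worker id -> timestamp of most recent heartbeat

    def heartbeat(self, worker, now=None):
        self.last_seen[worker] = now if now is not None else time.time()

    def dead_workers(self, now=None):
        now = now if now is not None else time.time()
        return [w for w, t in self.last_seen.items() if now - t > self.timeout]

# Workers A and B both heartbeat at t=0; only A heartbeats again.
m = HeartbeatMonitor(timeout=10.0)
m.heartbeat("A", now=0.0)
m.heartbeat("B", now=0.0)
m.heartbeat("A", now=12.0)
print(m.dead_workers(now=12.0))  # B has been silent for 12s > 10s timeout
```

Note the weakness the text mentions: the fixed timeout cannot distinguish a dead worker from a slow one.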
Gossip-based detection
Nodes share failure information with each other through randomized peer-to-peer communication. No single node has a privileged view -- instead, failure knowledge spreads epidemically through the cluster.
Used by: Dynamo, Cassandra
Gossip is the natural choice for decentralized systems. There is no master to collect heartbeats, so nodes must detect failures collectively. The trade-off: convergence is slower (it takes multiple gossip rounds for all nodes to agree), but there is no single point of detection failure.
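The core of a gossip round is a merge of two nodes' views. A common scheme (used in Cassandra's gossiper, though simplified here) has each node carry a heartbeat counter; on exchange, each peer keeps the highest counter it has seen for every node. The node names and counter values below are purely illustrative:

```python
def gossip_merge(local, remote):
    """Merge two nodes' cluster views: keep the freshest (highest)
    heartbeat counter seen for each node."""
    merged = dict(local)
    for node, counter in remote.items():
        merged[node] = max(merged.get(node, 0), counter)
    return merged

# Node X has stale information about B and C; node Y has fresher state.
view_x = {"A": 5, "B": 3, "C": 1}
view_y = {"A": 4, "B": 7, "C": 9}
print(gossip_merge(view_x, view_y))  # {'A': 5, 'B': 7, 'C': 9}
```

After enough randomized exchanges, every node converges on the same view -- which is exactly why detection takes multiple rounds rather than one heartbeat interval.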
Phi Accrual Failure Detection
Rather than a binary alive/dead decision, the Phi Accrual detector outputs a suspicion level -- a continuous value that represents how likely a node is to have failed, based on the statistical distribution of past heartbeat arrival times.
Used by: Cassandra
See: Phi Accrual Failure Detection pattern
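The suspicion value can be computed as the negative log-probability that a heartbeat this late would still arrive, given the observed arrival statistics. The sketch below models inter-arrival times as a normal distribution for simplicity (the original Phi Accrual paper uses a slightly different estimator); the mean and jitter values are illustrative:

```python
import math

def phi(time_since_last, mean, stddev):
    """Suspicion level: -log10 of the probability that a heartbeat
    arrives later than `time_since_last`, assuming inter-arrival
    times follow Normal(mean, stddev)."""
    z = (time_since_last - mean) / stddev
    p_later = 0.5 * math.erfc(z / math.sqrt(2))  # complementary normal CDF
    p_later = max(p_later, 1e-15)                # avoid log(0) for very late nodes
    return -math.log10(p_later)

# Heartbeats normally arrive ~1s apart with 0.1s of jitter.
print(round(phi(1.0, 1.0, 0.1), 2))  # on schedule: low suspicion (~0.3)
print(round(phi(2.0, 1.0, 0.1), 2))  # a full second late: suspicion is very high
```

The caller then compares phi against a threshold (e.g. Cassandra's `phi_convict_threshold`), so the effective timeout adapts to observed network behavior instead of being fixed.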
ZooKeeper session-based detection
Instead of direct monitoring, some systems delegate failure detection to a coordination service. A node maintains an ephemeral session with ZooKeeper; when the session expires (due to missed heartbeats to ZooKeeper), the node is considered dead.
Used by: Kafka, BigTable (via Chubby, which is Google's equivalent)
This approach outsources the hard problem of failure detection to a specialized, consensus-backed system. The benefit: consistent failure decisions across all observers. The cost: an additional infrastructure dependency.
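The session mechanics reduce to a lease that must be renewed before it expires. This toy model stands in for the coordination service itself (a real client would hold an ephemeral znode in ZooKeeper or a session in Chubby); the broker name and 6-second timeout are made up:

```python
class SessionRegistry:
    """Toy coordination service: a node is alive while its session,
    renewed by heartbeats to the coordinator, has not expired."""
    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_renewal = {}  # node -> time of last session renewal

    def renew(self, node, now):
        self.last_renewal[node] = now

    def alive(self, node, now):
        return (node in self.last_renewal
                and now - self.last_renewal[node] <= self.session_timeout)

reg = SessionRegistry(session_timeout=6.0)
reg.renew("broker-1", now=0.0)
print(reg.alive("broker-1", now=4.0))   # True: within the session window
print(reg.alive("broker-1", now=10.0))  # False: session expired, node considered dead
```

Because the registry itself is the single source of truth, every observer sees the same alive/dead decision -- the consistency benefit the text describes.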
Detection comparison
| Mechanism | Latency to detect | False positive risk | Central dependency | Best for |
|---|---|---|---|---|
| Heartbeat | Low (seconds) | Medium -- fixed timeout | Yes (master) | Leader-based systems |
| Gossip | Medium (multiple rounds) | Low -- collective decision | No | Peer-to-peer systems |
| Phi Accrual | Adaptive | Very low -- statistical | No | Variable-latency networks |
| ZooKeeper sessions | Medium (session timeout) | Low -- consensus-backed | Yes (ZooKeeper) | Systems needing consistent detection |
Recovery strategies: what happens after detection
Detection is the easy part. Recovery is where the real engineering happens. Each system must answer: how do we restore the system to a healthy state without losing data or violating correctness?
Re-replication
When a replica is lost, the system creates new copies to restore the target replication factor.
In GFS and HDFS, the master/NameNode tracks the replication count for every chunk/block. When a node dies, it identifies under-replicated data and schedules re-replication from surviving replicas to healthy nodes. Kafka's controller does the same for partition replicas.
The key design question is priority: which data gets re-replicated first? GFS prioritizes chunks with the fewest remaining replicas -- a chunk down to one copy is far more urgent than a chunk down to two.
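That prioritization rule can be sketched as a queue ordered by surviving replica count. The chunk and node names are illustrative, and a real master would also weight factors like recent client activity:

```python
def replication_queue(chunks, target=3):
    """Order under-replicated chunks for repair: fewest surviving
    replicas first, since those are closest to permanent loss."""
    under = [(len(replicas), chunk)
             for chunk, replicas in chunks.items()
             if len(replicas) < target]
    return [chunk for _, chunk in sorted(under)]

chunks = {
    "c1": ["n1", "n2"],        # one replica short
    "c2": ["n3"],              # down to a single copy: most urgent
    "c3": ["n1", "n2", "n4"],  # fully replicated, not queued
}
print(replication_queue(chunks))  # ['c2', 'c1']
```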
Hinted handoff
When the intended recipient of a write is unavailable, another node temporarily accepts the write and "hints" that it needs to be forwarded later.
Used by: Dynamo, Cassandra
This is the AP system's answer to node failure during writes. Instead of failing the write (which would sacrifice availability), the system redirects it. The hint ensures durability and eventual delivery. But hinted handoff is not a guarantee -- if the hinting node also dies before delivering, the data can be lost. This is why it works alongside other anti-entropy mechanisms.
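The mechanism reduces to a per-node stash of deferred writes, replayed once the intended node recovers. This is a minimal sketch (node and key names invented), not Dynamo's or Cassandra's actual hint format:

```python
class HintStore:
    """A node's stash of writes accepted on behalf of a down peer."""
    def __init__(self):
        self.hints = {}  # intended node -> list of (key, value) to forward

    def write_with_hint(self, intended_node, key, value):
        """Accept the write locally instead of failing it."""
        self.hints.setdefault(intended_node, []).append((key, value))

    def deliver(self, node):
        """Replay stored hints once the intended node comes back."""
        return self.hints.pop(node, [])

store = HintStore()
store.write_with_hint("node-3", "cart:42", "widget")  # node-3 is down; accept anyway
print(store.deliver("node-3"))  # [('cart:42', 'widget')] replayed on recovery
print(store.deliver("node-3"))  # [] -- hints are handed off at most once
```

The fragility the text notes is visible here: the hints live only in this node's memory or disk, so losing this node before `deliver` runs loses the writes -- hence the need for anti-entropy as a backstop.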
Anti-entropy with Merkle trees
Replicas can drift out of sync due to hinted handoff delays, missed writes, or partial failures. Merkle trees allow two replicas to efficiently identify exactly which data differs between them, so they can synchronize with minimum data transfer.
See: Merkle Trees pattern, Read Repair pattern
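The efficiency comes from comparing hashes top-down: matching roots prove the replicas are identical without transferring any data. The sketch below builds a small tree over four key-range buckets and, for brevity, compares mismatched leaves directly rather than descending level by level as a real implementation would:

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_levels(leaf_hashes):
    """Build the tree bottom-up; each parent hashes its two children.
    Assumes a power-of-two number of leaves for simplicity."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def differing_buckets(buckets_a, buckets_b):
    """Compare roots first; only on mismatch look for the drifted buckets."""
    a = merkle_levels([h(x) for x in buckets_a])
    b = merkle_levels([h(x) for x in buckets_b])
    if a[-1] == b[-1]:
        return []  # roots match: replicas are in sync, nothing to transfer
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

# Two replicas, four key-range buckets each; bucket 2 has drifted.
r1 = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
r2 = ["k1=v1", "k2=v2", "k3=STALE", "k4=v4"]
print(differing_buckets(r1, r2))  # [2]: only this bucket needs repair traffic
```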
Leader failover
When a leader/master dies, the system must elect or promote a replacement. This is the most complex recovery scenario because it involves coordinating all remaining nodes to agree on the new leader while preventing the old leader from causing damage (split-brain).
Used by: HDFS, Kafka, Chubby, BigTable (via Chubby)
The failover pipeline looks different in each system:
- HDFS: Standby NameNode replays journal entries from shared storage (QJM), then takes over. ZKFC (ZooKeeper Failover Controller) coordinates.
- Kafka: The controller broker detects leader failure for a partition and selects a new leader from the ISR (in-sync replica set).
- Chubby: Paxos-based consensus among replicas elects a new master. Clients observe a brief "grace period" during which their sessions remain valid.
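Kafka's ISR-based selection from the list above is the simplest of the three to sketch: the controller promotes the first surviving member of the in-sync replica set. Broker IDs here are invented; a real controller also consults the `unclean.leader.election.enable` setting before the fallback shown in the comment:

```python
def elect_leader(isr, alive_brokers):
    """Kafka-style clean election: promote the first in-sync replica
    that is still alive. If no ISR member survives, refuse rather than
    silently lose acknowledged writes."""
    for replica in isr:
        if replica in alive_brokers:
            return replica
    return None  # an *unclean* election would pick an out-of-sync replica here

isr = [101, 102, 103]
print(elect_leader(isr, alive_brokers={102, 103, 104}))  # 102: first surviving ISR member
print(elect_leader(isr, alive_brokers={104, 105}))       # None: no safe leader available
```

The `None` branch is the availability/durability trade-off in miniature: the partition goes offline rather than accept a leader that may be missing committed messages.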
Fencing
Fencing prevents the old leader from continuing to act after a new leader has been elected -- the critical defense against split-brain.
Used by: HDFS, Chubby, BigTable (via Chubby)
HDFS uses STONITH ("Shoot The Other Node In The Head") as a last resort -- literally power-cycling the old NameNode's hardware. Chubby uses sequencers (monotonically increasing tokens) so that servers can reject requests from stale lock holders. Both approaches solve the same problem: ensuring that at most one leader can modify shared state at any time.
See: Fencing pattern
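The sequencer idea is a one-line check on the server side: remember the highest token seen, and reject anything older. This sketch is illustrative (Chubby's sequencers carry more context, such as lock names and generation numbers):

```python
class FencedStore:
    """Server-side fencing: reject requests carrying a stale token."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False  # stale leader: fenced off
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(1, "x", "old-leader"))  # True: token 1 accepted
print(store.write(2, "x", "new-leader"))  # True: new leader holds a higher token
print(store.write(1, "x", "zombie"))      # False: old leader's token is rejected
```

Because tokens only increase, a partitioned old leader can never undo the new leader's writes -- the same invariant STONITH enforces by brute force.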
The failure detection to recovery pipeline
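No single code path implements this pipeline in any of the seven systems, but its common shape -- detect, then fence, then recover -- can be sketched with pluggable steps. All names and the wiring below are illustrative:

```python
def handle_failure(node, detector, recovery, fencer=None):
    """The detect -> fence -> recover pipeline shared by all seven systems."""
    if not detector(node):          # 1. detect: is the node actually gone?
        return "healthy"
    if fencer is not None:
        fencer(node)                # 2. fence: stop a zombie from doing damage
    recovery(node)                  # 3. recover: re-replicate or fail over
    return "recovered"

# Illustrative wiring: a detector that reports failure, and steps that just log.
events = []
result = handle_failure(
    "node-7",
    detector=lambda n: True,                            # pretend heartbeats stopped
    recovery=lambda n: events.append(("re-replicate", n)),
    fencer=lambda n: events.append(("fence", n)),
)
print(result, events)  # fencing runs before recovery, never after
```

The ordering is the point: fencing before recovery ensures the old node cannot interfere while the cluster rebuilds state. Peer-to-peer systems simply pass `fencer=None` -- with no leader, there is nothing to fence.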
Master comparison table
| System | Detection | Is leader-based? | Recovery strategy | Anti-entropy | Fencing | Split-brain risk |
|---|---|---|---|---|---|---|
| Dynamo | Gossip | No | Hinted handoff, Merkle trees, read repair | Yes | N/A | N/A (no leader) |
| Cassandra | Gossip + Phi Accrual | No | Hinted handoff, Merkle trees, read repair | Yes | N/A | N/A (no leader) |
| GFS | Heartbeat | Yes (single master) | Re-replication, checksums | Checksums | No (manual failover) | Low (manual) |
| HDFS | Heartbeat | Yes (NameNode) | Re-replication, standby NameNode | Checksums | STONITH | Mitigated by ZKFC |
| BigTable | Heartbeat + Chubby | Yes (master + tablet servers) | Tablet reassignment, WAL replay | Checksums | Chubby locks | Mitigated by Chubby |
| Kafka | Heartbeat + ZooKeeper | Yes (per-partition leaders) | ISR-based leader election, re-replication | High-water mark | Controller epoch | Mitigated by ZK |
| Chubby | Paxos consensus | Yes (master) | Paxos-based master election, grace period | Paxos log replication | Sequencers | Prevented by Paxos |
Key insights for interviews
When asked "how does system X handle failures?", structure your answer as a pipeline: detect (how do we know something failed?) then recover (what do we do about it?) then prevent cascading (how do we avoid making things worse?). This framework works for any distributed system, not just the seven in this course.
Don't assume that detecting failure is instant or binary. In real systems, detection is probabilistic and has latency. A node might be slow (not dead), and declaring it dead prematurely triggers expensive re-replication. This is why Cassandra uses Phi Accrual rather than fixed timeouts -- and why understanding the detection mechanism matters as much as understanding the recovery strategy.
The meta-pattern
Looking across all seven systems, a clear pattern emerges:
- Leader-based systems invest heavily in failover and fencing because leader failure is existential. They use centralized detection (heartbeats to the master), which is fast and simple but creates a detection bottleneck.
- Peer-to-peer systems invest heavily in anti-entropy (hinted handoff, Merkle trees, read repair) because any node can fail without threatening the system's ability to operate. They use decentralized detection (gossip), which is slower but has no single point of failure.
The choice between these two failure-handling philosophies is not about which is "better" -- it is about which failure mode your application can tolerate. If inconsistency is worse than downtime, choose leader-based with strong failover. If downtime is worse than inconsistency, choose peer-to-peer with anti-entropy.