
How 7 Systems Handle Failure

Every distributed system in this course was designed with a specific failure model in mind. The differences in how they detect, tolerate, and recover from failure reveal the deepest architectural choices their designers made. This page compares all seven systems side by side -- not to summarize each one, but to expose the patterns and trade-offs that connect them.

The failure landscape

Not all failures are created equal, and not all systems care about the same ones.

Think first
Before reading the table below, consider: a peer-to-peer system like Dynamo and a leader-based system like GFS face the same hardware failures, but the *consequences* are radically different. Why? Think about what is lost when a random node dies in each system.
| System | Primary failure concern | Why this matters most |
| --- | --- | --- |
| Dynamo | Node unavailability during peak traffic | A shopping cart that errors out is a lost sale |
| Cassandra | Node failure in multi-datacenter deployments | Must survive entire datacenter outages |
| GFS | Chunkserver crashes and disk corruption | Hardware failure is routine at Google's scale |
| HDFS | NameNode single point of failure | One metadata server controls the entire cluster |
| BigTable | Tablet server crashes mid-compaction | Must recover tablet state without data loss |
| Kafka | Broker failure causing message loss | Financial and operational data cannot be lost |
| Chubby | Master failure disrupting lock service | Thousands of clients depend on lock correctness |

Detection mechanisms: how systems know something is wrong

Detection is the first step in any failure response. The seven systems use four fundamentally different approaches, and the choice of detection mechanism reveals what each system values most.

Heartbeat-based detection

The simplest model: a node periodically sends "I'm alive" messages to a central monitor. If enough consecutive heartbeats are missed, the node is declared dead.

Used by: GFS, HDFS, BigTable, Kafka

The heartbeat approach works well for leader-based systems where a central authority (the master or controller) monitors all workers. The master collects heartbeats, maintains a global view of cluster health, and makes unilateral decisions. This is simple and fast -- but it creates a dependency on the master itself.

See: Heartbeat pattern
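
The master-side bookkeeping can be sketched in a few lines. This is an illustrative model only -- the class name, timeout value, and node IDs are invented, not taken from GFS or HDFS:

```python
import time

# Hypothetical sketch of master-side heartbeat tracking. The master records
# the last heartbeat time per node and declares dead anything silent for
# longer than a fixed timeout -- the "fixed timeout" false-positive risk
# noted in the comparison table below.
class HeartbeatMonitor:
    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_seen = {}          # node id -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.monotonic()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.heartbeat("chunkserver-1", now=0.0)
monitor.heartbeat("chunkserver-2", now=5.0)
# At t=12, chunkserver-1 has missed its window; chunkserver-2 has not.
print(monitor.dead_nodes(now=12.0))   # ['chunkserver-1']
```

Note that the decision is unilateral: whatever this one monitor concludes is the cluster's truth, which is exactly the dependency on the master described above.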

Gossip-based detection

Nodes share failure information with each other through randomized peer-to-peer communication. No single node has a privileged view -- instead, failure knowledge spreads epidemically through the cluster.

Used by: Dynamo, Cassandra

Gossip is the natural choice for decentralized systems. There is no master to collect heartbeats, so nodes must detect failures collectively. The trade-off: convergence is slower (it takes multiple gossip rounds for all nodes to agree), but there is no single point of detection failure.

See: Gossip Protocol pattern
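
The epidemic-spread behavior can be simulated directly. The sketch below is a toy model (cluster size, seeding, and one-peer-per-round fanout are all illustrative simplifications of real gossip protocols):

```python
import random

# Toy model of epidemic dissemination: each round, every node that has
# heard "node X is down" forwards the rumor to one random peer. A guard
# caps the loop for safety; the seed makes the run deterministic.
def rounds_until_everyone_knows(nodes, first_knower, seed=0):
    rng = random.Random(seed)
    knows = {first_knower}
    rounds = 0
    while len(knows) < len(nodes) and rounds < 1000:
        for _ in list(knows):
            knows.add(rng.choice(nodes))   # each knower gossips to one peer
        rounds += 1
    return rounds

cluster = [f"node-{i}" for i in range(16)]
# Knowledge at most doubles per round, so reaching 16 nodes takes at
# least 4 rounds -- the convergence delay described above.
print(rounds_until_everyone_knows(cluster, "node-0"))
```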

Phi Accrual Failure Detection

Rather than a binary alive/dead decision, the Phi Accrual detector outputs a suspicion level -- a continuous value that represents how likely a node is to have failed, based on the statistical distribution of past heartbeat arrival times.

Used by: Cassandra

Think first
Why would Cassandra use Phi Accrual detection on top of gossip, rather than simple timeout-based failure detection? Think about what happens in a multi-datacenter deployment where network latencies vary dramatically between racks, datacenters, and continents.

See: Phi Accrual Failure Detection pattern
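
A minimal sketch of the idea, assuming heartbeat inter-arrival times are roughly normally distributed (the real detector from Hayashibara et al. uses a sliding window of samples; the history values here are invented):

```python
import math
from statistics import mean, stdev

# Sketch of the Phi Accrual idea: phi = -log10(P(heartbeat arrives even
# later than it already has)), computed from the observed distribution of
# past inter-arrival times. The normal-tail model is a simplification.
def phi(time_since_last, intervals):
    mu, sigma = mean(intervals), stdev(intervals)
    z = (time_since_last - mu) / sigma
    p_later = 0.5 * math.erfc(z / math.sqrt(2))   # normal upper tail
    p_later = max(p_later, 1e-12)                 # avoid log(0)
    return -math.log10(p_later)

history = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]   # seconds between heartbeats
print(phi(1.0, history))   # low suspicion: heartbeat is right on schedule
print(phi(3.0, history))   # high suspicion: far beyond any observed delay
```

Because suspicion grows out of the observed distribution, a node on a slow cross-datacenter link with naturally long intervals is not falsely declared dead by a timeout tuned for a fast local rack.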

ZooKeeper session-based detection

Instead of direct monitoring, some systems delegate failure detection to a coordination service. A node maintains an ephemeral session with ZooKeeper; when the session expires (due to missed heartbeats to ZooKeeper), the node is considered dead.

Used by: Kafka, BigTable (via Chubby, which is Google's equivalent)

This approach outsources the hard problem of failure detection to a specialized, consensus-backed system. The benefit: consistent failure decisions across all observers. The cost: an additional infrastructure dependency.
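
The core mechanic -- liveness equals "session not expired" -- can be modeled in a few lines. This is a toy stand-in for ZooKeeper's ephemeral sessions, not its API; all names are illustrative:

```python
# Toy model of session-based detection: a node is alive iff its session,
# kept fresh by heartbeats to the coordinator, has not expired. Every
# observer asking the coordinator gets the same answer -- which is the
# point: failure decisions are consistent across the cluster.
class SessionTracker:
    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.sessions = {}               # node -> last renewal time

    def renew(self, node, now):
        self.sessions[node] = now

    def expired(self, now):
        return sorted(n for n, t in self.sessions.items()
                      if now - t > self.session_timeout)

zk = SessionTracker(session_timeout=6.0)
zk.renew("broker-1", now=0.0)
zk.renew("broker-2", now=4.0)
print(zk.expired(now=8.0))   # ['broker-1']
```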

Detection comparison

| Mechanism | Latency to detect | False positive risk | Central dependency | Best for |
| --- | --- | --- | --- | --- |
| Heartbeat | Low (seconds) | Medium -- fixed timeout | Yes (master) | Leader-based systems |
| Gossip | Medium (multiple rounds) | Low -- collective decision | No | Peer-to-peer systems |
| Phi Accrual | Adaptive | Very low -- statistical | No | Variable-latency networks |
| ZooKeeper sessions | Medium (session timeout) | Low -- consensus-backed | Yes (ZooKeeper) | Systems needing consistent detection |

Recovery strategies: what happens after detection

Detection is the easy part. Recovery is where the real engineering happens. Each system must answer: how do we restore the system to a healthy state without losing data or violating correctness?

Re-replication

When a replica is lost, the system creates new copies to restore the target replication factor.

Used by: GFS, HDFS, Kafka

In GFS and HDFS, the master/NameNode tracks the replication count for every chunk/block. When a node dies, the master identifies under-replicated data and schedules re-replication from surviving replicas to healthy nodes. Kafka's controller does the same for partition replicas.

The key design question is priority: which data gets re-replicated first? GFS prioritizes chunks with the fewest remaining replicas -- a chunk down to one copy is far more urgent than a chunk down to two.
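
That priority rule is simple to express as a min-heap ordered by surviving replica count. The chunk IDs and counts below are invented for illustration:

```python
import heapq

# Sketch of GFS-style re-replication scheduling: chunks with the fewest
# surviving replicas are repaired first, because they are closest to
# permanent data loss.
def replication_order(chunks):
    # chunks: {chunk_id: surviving_replica_count}
    heap = [(count, chunk_id) for chunk_id, count in chunks.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

surviving = {"chunk-a": 2, "chunk-b": 1, "chunk-c": 3}
print(replication_order(surviving))   # ['chunk-b', 'chunk-a', 'chunk-c']
```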

Hinted handoff

When the intended recipient of a write is unavailable, another node temporarily accepts the write and "hints" that it needs to be forwarded later.

Used by: Dynamo, Cassandra

This is the AP system's answer to node failure during writes. Instead of failing the write (which would sacrifice availability), the system redirects it. The hint ensures durability and eventual delivery. But hinted handoff is not a guarantee -- if the hinting node also dies before delivering, the data can be lost. This is why it works alongside other anti-entropy mechanisms.

See: Hinted Handoff pattern
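
The mechanism reduces to "store the write plus a note about who it was really for, then replay on recovery." The sketch below is illustrative (class and key names are invented, and real implementations persist hints durably and bound how long they are kept):

```python
# Sketch of hinted handoff: a write meant for a down node is accepted by a
# stand-in along with a hint naming the intended target, and delivered
# when the target is seen alive again.
class HintStore:
    def __init__(self):
        self.hints = {}   # intended target -> list of (key, value)

    def accept_for(self, target, key, value):
        # The write succeeds from the client's point of view -- availability
        # is preserved even though the "right" replica is down.
        self.hints.setdefault(target, []).append((key, value))

    def deliver(self, target, node_store):
        # Called when `target` is observed alive again.
        for key, value in self.hints.pop(target, []):
            node_store[key] = value

standby = HintStore()
standby.accept_for("node-3", "cart:42", ["book", "lamp"])

node3_data = {}
standby.deliver("node-3", node3_data)
print(node3_data)   # {'cart:42': ['book', 'lamp']}
```

The fragility is visible in the structure: if `standby` crashes before `deliver` runs, the hint -- and the write -- is gone, which is why anti-entropy backs it up.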

Anti-entropy with Merkle trees

Replicas can drift out of sync due to hinted handoff delays, missed writes, or partial failures. Merkle trees allow two replicas to efficiently identify exactly which data differs between them, so they can synchronize with minimum data transfer.

Used by: Dynamo, Cassandra

Think first
Why can't hinted handoff alone keep replicas in sync? Think about what happens if a node is down for an extended period and the hint-holding node runs out of storage, or if a network partition prevents hint delivery.

See: Merkle Trees pattern, Read Repair pattern
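
A minimal Merkle-tree sketch: hash the leaves, then hash pairs upward; two replicas first compare roots and descend only into subtrees whose hashes differ. The leaf encoding and replica contents below are invented for illustration:

```python
import hashlib

# Build a Merkle root over a list of key=value leaves. Comparing roots
# costs one hash exchange; only on mismatch do replicas recurse into
# children, so sync traffic is proportional to the *difference*, not the
# dataset size.
def merkle_root(leaves):
    level = [hashlib.sha256(v.encode()).hexdigest() for v in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])          # duplicate last node if odd
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
replica_b = ["k1=v1", "k2=v2", "k3=STALE", "k4=v4"]
print(merkle_root(replica_a) == merkle_root(replica_a))  # True: in sync
print(merkle_root(replica_a) == merkle_root(replica_b))  # False: descend to find k3
```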

Leader failover

When a leader/master dies, the system must elect or promote a replacement. This is the most complex recovery scenario because it involves coordinating all remaining nodes to agree on the new leader while preventing the old leader from causing damage (split-brain).

Used by: HDFS, Kafka, Chubby, BigTable (via Chubby)

The failover pipeline looks different in each system:

  • HDFS: Standby NameNode replays journal entries from shared storage (QJM), then takes over. ZKFC (ZooKeeper Failover Controller) coordinates.
  • Kafka: The controller broker detects leader failure for a partition and selects a new leader from the ISR (in-sync replica set).
  • Chubby: Paxos-based consensus among replicas elects a new master. Clients observe a brief "grace period" during which their sessions remain valid.
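
Kafka's ISR selection is the simplest of the three to sketch. The tie-break rule shown (first surviving ISR member) and all broker names are illustrative, not Kafka's exact implementation:

```python
# Sketch of ISR-based failover: when a partition leader dies, the
# controller promotes a surviving member of the in-sync replica set.
# Restricting the choice to the ISR is what prevents data loss -- every
# ISR member has all committed messages.
def elect_leader(current_leader, isr, live_brokers):
    candidates = [b for b in isr if b != current_leader and b in live_brokers]
    if not candidates:
        return None          # no clean failover possible without data loss
    return candidates[0]

isr = ["broker-1", "broker-2", "broker-3"]
live = {"broker-2", "broker-3"}
print(elect_leader("broker-1", isr, live))   # 'broker-2'
```

The `None` branch is a real policy decision: a system can wait for an ISR member to return (consistency) or promote an out-of-sync replica and accept loss (availability).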

Fencing

Fencing prevents the old leader from continuing to act after a new leader has been elected -- the critical defense against split-brain.

Used by: HDFS, Chubby, BigTable (via Chubby)

HDFS uses STONITH ("Shoot The Other Node In The Head") as a last resort -- literally power-cycling the old NameNode's hardware. Chubby uses sequencers (monotonically increasing tokens) so that servers can reject requests from stale lock holders. Both approaches solve the same problem: ensuring that at most one leader can modify shared state at any time.

See: Fencing pattern
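
Sequencer- and epoch-style fencing share one mechanic: the resource rejects any request carrying a token older than the newest it has seen. A minimal sketch, with invented names, of that check:

```python
# Sketch of token-based fencing (Chubby sequencers, Kafka controller
# epochs): each new leader gets a strictly higher token, and the shared
# resource refuses writes tagged with a stale one -- so a deposed leader
# that is merely slow, not dead, can do no damage.
class FencedStore:
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            return False                 # stale leader: fenced off
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(1, "config", "v1"))   # True: epoch-1 leader writes
print(store.write(2, "config", "v2"))   # True: new leader, higher epoch
print(store.write(1, "config", "old"))  # False: old leader is rejected
```

STONITH achieves the same invariant by brute force -- power off the old node -- where tokens achieve it by protocol.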

The failure detection to recovery pipeline

Master comparison table

| System | Detection | Is leader-based? | Recovery strategy | Anti-entropy | Fencing | Split-brain risk |
| --- | --- | --- | --- | --- | --- | --- |
| Dynamo | Gossip | No | Hinted handoff, Merkle trees, read repair | Yes | N/A | N/A (no leader) |
| Cassandra | Gossip + Phi Accrual | No | Hinted handoff, Merkle trees, read repair | Yes | N/A | N/A (no leader) |
| GFS | Heartbeat | Yes (single master) | Re-replication, checksums | Checksums | No (manual failover) | Low (manual) |
| HDFS | Heartbeat | Yes (NameNode) | Re-replication, standby NameNode | Checksums | STONITH | Mitigated by ZKFC |
| BigTable | Heartbeat + Chubby | Yes (master + tablet servers) | Tablet reassignment, WAL replay | Checksums | Chubby locks | Mitigated by Chubby |
| Kafka | Heartbeat + ZooKeeper | Yes (per-partition leaders) | ISR-based leader election, re-replication | High-water mark | Controller epoch | Mitigated by ZK |
| Chubby | Paxos consensus | Yes (master) | Paxos-based master election, grace period | Paxos log replication | Sequencers | Prevented by Paxos |

Key insights for interviews

Interview insight

When asked "how does system X handle failures?", structure your answer as a pipeline: detect (how do we know something failed?) then recover (what do we do about it?) then prevent cascading (how do we avoid making things worse?). This framework works for any distributed system, not just the seven in this course.

Common mistake

Don't assume that detecting failure is instant or binary. In real systems, detection is probabilistic and has latency. A node might be slow (not dead), and declaring it dead prematurely triggers expensive re-replication. This is why Cassandra uses Phi Accrual rather than fixed timeouts -- and why understanding the detection mechanism matters as much as understanding the recovery strategy.

The meta-pattern

Looking across all seven systems, a clear pattern emerges:

  1. Leader-based systems invest heavily in failover and fencing because leader failure is existential. They use centralized detection (heartbeats to the master) which is fast and simple but creates a detection bottleneck.

  2. Peer-to-peer systems invest heavily in anti-entropy (hinted handoff, Merkle trees, read repair) because any node can fail without threatening the system's ability to operate. They use decentralized detection (gossip) which is slower but has no single point of failure.

The choice between these two failure-handling philosophies is not about which is "better" -- it is about which failure mode your application can tolerate. If inconsistency is worse than downtime, choose leader-based with strong failover. If downtime is worse than inconsistency, choose peer-to-peer with anti-entropy.

Quiz

1. GFS originally had no automatic master failover; recovering from a master failure required manual intervention. Given that chunkserver failures were handled automatically via re-replication, why did Google accept this asymmetry?

2. Dynamo uses three separate anti-entropy mechanisms: hinted handoff, Merkle trees, and read repair. Why isn't one mechanism sufficient?