
How 7 Systems Handle Failure

Every distributed system in this course was designed with a specific failure model in mind. The differences in how they detect, tolerate, and recover from failure reveal the deepest architectural choices their designers made. This page compares all seven systems side by side -- not to summarize each one, but to expose the patterns and trade-offs that connect them.

The failure landscape

Not all failures are created equal, and not all systems care about the same ones.

Think first
Before reading the table below, consider: a peer-to-peer system like Dynamo and a leader-based system like GFS face the same hardware failures, but the *consequences* are radically different. Why? Think about what is lost when a random node dies in each system.
| System | Primary failure concern | Why this matters most |
| --- | --- | --- |
| Dynamo | Node unavailability during peak traffic | A shopping cart that errors out is a lost sale |
| Cassandra | Node failure in multi-datacenter deployments | Must survive entire datacenter outages |
| GFS | Chunkserver crashes and disk corruption | Hardware failure is routine at Google's scale |
| HDFS | NameNode single point of failure | One metadata server controls the entire cluster |
| BigTable | Tablet server crashes mid-compaction | Must recover tablet state without data loss |
| Kafka | Broker failure causing message loss | Financial and operational data cannot be lost |
| Chubby | Master failure disrupting lock service | Thousands of clients depend on lock correctness |

Detection mechanisms: how systems know something is wrong

Detection is the first step in any failure response. The seven systems use four fundamentally different approaches, and the choice of detection mechanism reveals what each system values most.

Heartbeat-based detection

The simplest model: a node periodically sends "I'm alive" messages to a central monitor. If enough consecutive heartbeats are missed, the node is declared dead.

Used by: GFS, HDFS, BigTable, Kafka

The heartbeat approach works well for leader-based systems where a central authority (the master or controller) monitors all workers. The master collects heartbeats, maintains a global view of cluster health, and makes unilateral decisions. This is simple and fast -- but it creates a dependency on the master itself.

See: Heartbeat pattern
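
The master-side bookkeeping can be sketched in a few lines. This is an illustrative model only -- the class name, timeout value, and node IDs are invented, not taken from GFS or HDFS:

```python
import time

# Hypothetical sketch of master-side heartbeat tracking. The master records
# the last heartbeat time per node and declares dead anything silent for
# longer than a fixed timeout -- the "fixed timeout" false-positive risk
# noted in the comparison table below.
class HeartbeatMonitor:
    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_seen = {}          # node id -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.monotonic()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.heartbeat("chunkserver-1", now=0.0)
monitor.heartbeat("chunkserver-2", now=5.0)
# At t=12, chunkserver-1 has missed its window; chunkserver-2 has not.
print(monitor.dead_nodes(now=12.0))   # ['chunkserver-1']
```

Note that the decision is unilateral: whatever this one monitor concludes is the cluster's truth, which is exactly the dependency on the master described above.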

Gossip-based detection

Nodes share failure information with each other through randomized peer-to-peer communication. No single node has a privileged view -- instead, failure knowledge spreads epidemically through the cluster.

Used by: Dynamo, Cassandra

Gossip is the natural choice for decentralized systems. There is no master to collect heartbeats, so nodes must detect failures collectively. The trade-off: convergence is slower (it takes multiple gossip rounds for all nodes to agree), but there is no single point of detection failure.

See: Gossip Protocol pattern
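
The epidemic-spread behavior can be simulated directly. The sketch below is a toy model (cluster size, seeding, and one-peer-per-round fanout are all illustrative simplifications of real gossip protocols):

```python
import random

# Toy model of epidemic dissemination: each round, every node that has
# heard "node X is down" forwards the rumor to one random peer. A guard
# caps the loop for safety; the seed makes the run deterministic.
def rounds_until_everyone_knows(nodes, first_knower, seed=0):
    rng = random.Random(seed)
    knows = {first_knower}
    rounds = 0
    while len(knows) < len(nodes) and rounds < 1000:
        for _ in list(knows):
            knows.add(rng.choice(nodes))   # each knower gossips to one peer
        rounds += 1
    return rounds

cluster = [f"node-{i}" for i in range(16)]
# Knowledge at most doubles per round, so reaching 16 nodes takes at
# least 4 rounds -- the convergence delay described above.
print(rounds_until_everyone_knows(cluster, "node-0"))
```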

Phi Accrual Failure Detection

Rather than a binary alive/dead decision, the Phi Accrual detector outputs a suspicion level -- a continuous value that represents how likely a node is to have failed, based on the statistical distribution of past heartbeat arrival times.

Used by: Cassandra

Think first
Why would Cassandra use Phi Accrual detection on top of gossip, rather than simple timeout-based failure detection? Think about what happens in a multi-datacenter deployment where network latencies vary dramatically between racks, datacenters, and continents.

See: Phi Accrual Failure Detection pattern
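
A minimal sketch of the idea, assuming heartbeat inter-arrival times are roughly normally distributed (the real detector from Hayashibara et al. uses a sliding window of samples; the history values here are invented):

```python
import math
from statistics import mean, stdev

# Sketch of the Phi Accrual idea: phi = -log10(P(heartbeat arrives even
# later than it already has)), computed from the observed distribution of
# past inter-arrival times. The normal-tail model is a simplification.
def phi(time_since_last, intervals):
    mu, sigma = mean(intervals), stdev(intervals)
    z = (time_since_last - mu) / sigma
    p_later = 0.5 * math.erfc(z / math.sqrt(2))   # normal upper tail
    p_later = max(p_later, 1e-12)                 # avoid log(0)
    return -math.log10(p_later)

history = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]   # seconds between heartbeats
print(phi(1.0, history))   # low suspicion: heartbeat is right on schedule
print(phi(3.0, history))   # high suspicion: far beyond any observed delay
```

Because suspicion grows out of the observed distribution, a node on a slow cross-datacenter link with naturally long intervals is not falsely declared dead by a timeout tuned for a fast local rack.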

ZooKeeper session-based detection

Instead of direct monitoring, some systems delegate failure detection to a coordination service. A node maintains an ephemeral session with ZooKeeper; when the session expires (due to missed heartbeats to ZooKeeper), the node is considered dead.

Used by: Kafka, BigTable (via Chubby, which is Google's equivalent)

This approach outsources the hard problem of failure detection to a specialized, consensus-backed system. The benefit: consistent failure decisions across all observers. The cost: an additional infrastructure dependency.
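
The core mechanic -- liveness equals "session not expired" -- can be modeled in a few lines. This is a toy stand-in for ZooKeeper's ephemeral sessions, not its API; all names are illustrative:

```python
# Toy model of session-based detection: a node is alive iff its session,
# kept fresh by heartbeats to the coordinator, has not expired. Every
# observer asking the coordinator gets the same answer -- which is the
# point: failure decisions are consistent across the cluster.
class SessionTracker:
    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.sessions = {}               # node -> last renewal time

    def renew(self, node, now):
        self.sessions[node] = now

    def expired(self, now):
        return sorted(n for n, t in self.sessions.items()
                      if now - t > self.session_timeout)

zk = SessionTracker(session_timeout=6.0)
zk.renew("broker-1", now=0.0)
zk.renew("broker-2", now=4.0)
print(zk.expired(now=8.0))   # ['broker-1']
```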

Detection comparison

| Mechanism | Latency to detect | False positive risk | Central dependency | Best for |
| --- | --- | --- | --- | --- |
| Heartbeat | Low (seconds) | Medium -- fixed timeout | Yes (master) | Leader-based systems |
| Gossip | Medium (multiple rounds) | Low -- collective decision | No | Peer-to-peer systems |
| Phi Accrual | Adaptive | Very low -- statistical | No | Variable-latency networks |
| ZooKeeper sessions | Medium (session timeout) | Low -- consensus-backed | Yes (ZooKeeper) | Systems needing consistent detection |

Recovery strategies: what happens after detection

Detection is the easy part. Recovery is where the real engineering happens. Each system must answer: how do we restore the system to a healthy state without losing data or violating correctness?

Re-replication

When a replica is lost, the system creates new copies to restore the target replication factor.

Used by: GFS, HDFS, Kafka

In GFS and HDFS, the master/NameNode tracks the replication count for every chunk/block. When a node dies, the master identifies under-replicated data and schedules re-replication from surviving replicas to healthy nodes. Kafka's controller does the same for partition replicas.

The key design question is priority: which data gets re-replicated first? GFS prioritizes chunks with the fewest remaining replicas -- a chunk down to one copy is far more urgent than a chunk down to two.
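
That priority rule is simple to express as a min-heap ordered by surviving replica count. The chunk IDs and counts below are invented for illustration:

```python
import heapq

# Sketch of GFS-style re-replication scheduling: chunks with the fewest
# surviving replicas are repaired first, because they are closest to
# permanent data loss.
def replication_order(chunks):
    # chunks: {chunk_id: surviving_replica_count}
    heap = [(count, chunk_id) for chunk_id, count in chunks.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

surviving = {"chunk-a": 2, "chunk-b": 1, "chunk-c": 3}
print(replication_order(surviving))   # ['chunk-b', 'chunk-a', 'chunk-c']
```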

Hinted handoff

When the intended recipient of a write is unavailable, another node temporarily accepts the write and "hints" that it needs to be forwarded later.

Used by: Dynamo, Cassandra

This is the AP system's answer to node failure during writes. Instead of failing the write (which would sacrifice availability), the system redirects it. The hint ensures durability and eventual delivery. But hinted handoff is not a guarantee -- if the hinting node also dies before delivering, the data can be lost. This is why it works alongside other anti-entropy mechanisms.

See: Hinted Handoff pattern
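
The mechanism reduces to "store the write plus a note about who it was really for, then replay on recovery." The sketch below is illustrative (class and key names are invented, and real implementations persist hints durably and bound how long they are kept):

```python
# Sketch of hinted handoff: a write meant for a down node is accepted by a
# stand-in along with a hint naming the intended target, and delivered
# when the target is seen alive again.
class HintStore:
    def __init__(self):
        self.hints = {}   # intended target -> list of (key, value)

    def accept_for(self, target, key, value):
        # The write succeeds from the client's point of view -- availability
        # is preserved even though the "right" replica is down.
        self.hints.setdefault(target, []).append((key, value))

    def deliver(self, target, node_store):
        # Called when `target` is observed alive again.
        for key, value in self.hints.pop(target, []):
            node_store[key] = value

standby = HintStore()
standby.accept_for("node-3", "cart:42", ["book", "lamp"])

node3_data = {}
standby.deliver("node-3", node3_data)
print(node3_data)   # {'cart:42': ['book', 'lamp']}
```

The fragility is visible in the structure: if `standby` crashes before `deliver` runs, the hint -- and the write -- is gone, which is why anti-entropy backs it up.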

Anti-entropy with Merkle trees

Replicas can drift out of sync due to hinted handoff delays, missed writes, or partial failures. Merkle trees allow two replicas to efficiently identify exactly which data differs between them, so they can synchronize with minimum data transfer.

Used by: Dynamo, Cassandra

Think first
Why can't hinted handoff alone keep replicas in sync? Think about what happens if a node is down for an extended period and the hint-holding node runs out of storage, or if a network partition prevents hint delivery.

See: Merkle Trees pattern, Read Repair pattern
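
A minimal Merkle-tree sketch: hash the leaves, then hash pairs upward; two replicas first compare roots and descend only into subtrees whose hashes differ. The leaf encoding and replica contents below are invented for illustration:

```python
import hashlib

# Build a Merkle root over a list of key=value leaves. Comparing roots
# costs one hash exchange; only on mismatch do replicas recurse into
# children, so sync traffic is proportional to the *difference*, not the
# dataset size.
def merkle_root(leaves):
    level = [hashlib.sha256(v.encode()).hexdigest() for v in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])          # duplicate last node if odd
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
replica_b = ["k1=v1", "k2=v2", "k3=STALE", "k4=v4"]
print(merkle_root(replica_a) == merkle_root(replica_a))  # True: in sync
print(merkle_root(replica_a) == merkle_root(replica_b))  # False: descend to find k3
```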

Leader failover

When a leader/master dies, the system must elect or promote a replacement. This is the most complex recovery scenario because it involves coordinating all remaining nodes to agree on the new leader while preventing the old leader from causing damage (split-brain).

Used by: HDFS, Kafka, Chubby, BigTable (via Chubby)

The failover pipeline looks different in each system:

  • HDFS: Standby NameNode replays journal entries from shared storage (QJM), then takes over. ZKFC (ZooKeeper Failover Controller) coordinates.
  • Kafka: The controller broker detects leader failure for a partition and selects a new leader from the ISR (in-sync replica set).
  • Chubby: Paxos-based consensus among replicas elects a new master. Clients observe a brief "grace period" during which their sessions remain valid.
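
Kafka's ISR selection is the simplest of the three to sketch. The tie-break rule shown (first surviving ISR member) and all broker names are illustrative, not Kafka's exact implementation:

```python
# Sketch of ISR-based failover: when a partition leader dies, the
# controller promotes a surviving member of the in-sync replica set.
# Restricting the choice to the ISR is what prevents data loss -- every
# ISR member has all committed messages.
def elect_leader(current_leader, isr, live_brokers):
    candidates = [b for b in isr if b != current_leader and b in live_brokers]
    if not candidates:
        return None          # no clean failover possible without data loss
    return candidates[0]

isr = ["broker-1", "broker-2", "broker-3"]
live = {"broker-2", "broker-3"}
print(elect_leader("broker-1", isr, live))   # 'broker-2'
```

The `None` branch is a real policy decision: a system can wait for an ISR member to return (consistency) or promote an out-of-sync replica and accept loss (availability).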

Fencing

Fencing prevents the old leader from continuing to act after a new leader has been elected -- the critical defense against split-brain.

Used by: HDFS, Chubby, BigTable (via Chubby)

HDFS uses STONITH ("Shoot The Other Node In The Head") as a last resort -- literally power-cycling the old NameNode's hardware. Chubby uses sequencers (monotonically increasing tokens) so that servers can reject requests from stale lock holders. Both approaches solve the same problem: ensuring that at most one leader can modify shared state at any time.

See: Fencing pattern
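
Sequencer- and epoch-style fencing share one mechanic: the resource rejects any request carrying a token older than the newest it has seen. A minimal sketch, with invented names, of that check:

```python
# Sketch of token-based fencing (Chubby sequencers, Kafka controller
# epochs): each new leader gets a strictly higher token, and the shared
# resource refuses writes tagged with a stale one -- so a deposed leader
# that is merely slow, not dead, can do no damage.
class FencedStore:
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            return False                 # stale leader: fenced off
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(1, "config", "v1"))   # True: epoch-1 leader writes
print(store.write(2, "config", "v2"))   # True: new leader, higher epoch
print(store.write(1, "config", "old"))  # False: old leader is rejected
```

STONITH achieves the same invariant by brute force -- power off the old node -- where tokens achieve it by protocol.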

The failure detection to recovery pipeline

Master comparison table

| System | Detection | Is leader-based? | Recovery strategy | Anti-entropy | Fencing | Split-brain risk |
| --- | --- | --- | --- | --- | --- | --- |
| Dynamo | Gossip | No | Hinted handoff, Merkle trees, read repair | Yes | N/A | N/A (no leader) |
| Cassandra | Gossip + Phi Accrual | No | Hinted handoff, Merkle trees, read repair | Yes | N/A | N/A (no leader) |
| GFS | Heartbeat | Yes (single master) | Re-replication, checksums | Checksums | No (manual failover) | Low (manual) |
| HDFS | Heartbeat | Yes (NameNode) | Re-replication, standby NameNode | Checksums | STONITH | Mitigated by ZKFC |
| BigTable | Heartbeat + Chubby | Yes (master + tablet servers) | Tablet reassignment, WAL replay | Checksums | Chubby locks | Mitigated by Chubby |
| Kafka | Heartbeat + ZooKeeper | Yes (per-partition leaders) | ISR-based leader election, re-replication | High-water mark | Controller epoch | Mitigated by ZK |
| Chubby | Paxos consensus | Yes (master) | Paxos-based master election, grace period | Paxos log replication | Sequencers | Prevented by Paxos |

Key insights for interviews

Interview insight

When asked "how does system X handle failures?", structure your answer as a pipeline: detect (how do we know something failed?) then recover (what do we do about it?) then prevent cascading (how do we avoid making things worse?). This framework works for any distributed system, not just the seven in this course.

Common mistake

Don't assume that detecting failure is instant or binary. In real systems, detection is probabilistic and has latency. A node might be slow (not dead), and declaring it dead prematurely triggers expensive re-replication. This is why Cassandra uses Phi Accrual rather than fixed timeouts -- and why understanding the detection mechanism matters as much as understanding the recovery strategy.

The meta-pattern

Looking across all seven systems, a clear pattern emerges:

  1. Leader-based systems invest heavily in failover and fencing because leader failure is existential. They use centralized detection (heartbeats to the master) which is fast and simple but creates a detection bottleneck.

  2. Peer-to-peer systems invest heavily in anti-entropy (hinted handoff, Merkle trees, read repair) because any node can fail without threatening the system's ability to operate. They use decentralized detection (gossip) which is slower but has no single point of failure.

The choice between these two failure-handling philosophies is not about which is "better" -- it is about which failure mode your application can tolerate. If inconsistency is worse than downtime, choose leader-based with strong failover. If downtime is worse than inconsistency, choose peer-to-peer with anti-entropy.

Quiz

1. GFS originally had no automatic master failover; recovering from a master failure required manual intervention. Given that chunkserver failures were handled automatically via re-replication, why did Google accept this asymmetry?

2. Dynamo uses three separate anti-entropy mechanisms: hinted handoff, Merkle trees, and read repair. Why isn't one mechanism sufficient?