18 Hinted Handoff

Node C is supposed to receive a replica of a write, but it's down. Do you fail the write? In a system that prioritizes availability, the answer is no -- you find another node to temporarily hold the data and forward it to Node C when it recovers.

Think first

A write should be replicated to Nodes A, B, and C. Node C is temporarily down. You don't want to fail the write or wait for Node C. How would you ensure the data eventually reaches Node C without losing it?

Background

In a replicated system using quorum writes, a write is successful as long as enough replicas acknowledge it. But what happens to the replicas that are temporarily unavailable? If a node is down for a few minutes (network blip, restart, GC pause), you don't want to:

Fail the write entirely (hurts availability)
Permanently lose that replica's copy (hurts durability)
Wait for the node to recover (hurts latency)

Hinted handoff solves this elegantly: temporarily store the data elsewhere, then deliver it when the target recovers.

Definition

When a write is meant for a node that is temporarily unavailable, another node accepts the write and stores a hint -- a note saying "this data belongs to Node C, deliver it when Node C comes back." When the target node recovers, the hinted data is forwarded to it.

How it works

Client sends a write that should be replicated to Nodes A, B, and C
Nodes A and B are healthy -- they accept the write normally
Node C is down -- the coordinating node (or another healthy node) writes the data locally as a hint
The hint includes: the data itself + the identity of the target node (C)
The coordinating node periodically checks if Node C is back
When Node C recovers, the coordinating node forwards the hinted write to Node C
Once Node C confirms receipt, the hint is deleted

The key insight

Hinted handoff decouples "accepting a write" from "delivering it to the right place." The system can stay available (accept writes even when replicas are down) without losing data (it will be delivered later). It's a form of store-and-forward -- an old networking pattern applied to distributed storage.

Hinted handoff and sloppy quorum

Hinted handoff is often paired with sloppy quorum (used by Dynamo):

In a strict quorum, a write must go to the designated replica nodes. If one is down, the write fails (or waits).
In a sloppy quorum, the write can go to any healthy node in the preference list. That node stores a hint and becomes a temporary custodian.

This combination is how Dynamo achieves its "always writeable" guarantee -- no matter how many nodes are down, the system can find healthy nodes to accept writes.

Limitations

Limitation	Why it matters
Not a replacement for full repair	If a node is down for a long time, hints accumulate and may be dropped (storage limits). Read repair and Merkle trees handle long-term divergence.
Temporary inconsistency	Until hints are delivered, the target node is missing data. Reads from that node return stale results.
Storage overhead on hint holders	The nodes holding hints use extra disk space. If many nodes are down simultaneously, this can become significant.

Examples

Dynamo

Dynamo uses hinted handoff with sloppy quorum as its primary mechanism for handling temporary node failures. This is a core part of Dynamo's "always writeable" philosophy -- writes are never rejected due to node failures.

Cassandra

Cassandra implements hinted handoff similarly. When a target replica is unavailable, the coordinator stores hints locally. Hints are replayed when the target node comes back online. Cassandra also has a configurable hint window -- if the node is down longer than this window, hints are no longer stored (and other repair mechanisms take over).

Interview angle

Hinted handoff answers: "What happens to writes when a replica is temporarily down?" It's the first line of defense for temporary failures. For long-term divergence, you need read repair (lazy, during reads) and Merkle trees (proactive, in the background). Show the interviewer you understand all three as a layered anti-entropy strategy.

Quiz

In Dynamo, a node is down for 3 days. Hinted handoff has a hint window of 1 hour. What data does the node miss when it comes back?

No data is missed because other nodes hold all the hints.

The node misses all writes from hours 2 through 72, because hinted handoff only covers the first hour. The remaining divergence must be repaired through read repair (for frequently accessed data) and Merkle tree anti-entropy (for all remaining data).

Dynamo would not allow the node to rejoin the cluster after 3 days.

The coordinator would have replicated all writes to another node permanently.

Background​

Definition​

How it works​

Hinted handoff and sloppy quorum​

Limitations​

Examples​

Dynamo​

Cassandra​