18 Hinted Handoff
Node C is supposed to receive a replica of a write, but it's down. Do you fail the write? In a system that prioritizes availability, the answer is no -- you find another node to temporarily hold the data and forward it to Node C when it recovers.
Background
In a replicated system using quorum writes, a write is successful as long as enough replicas acknowledge it. But what happens to the replicas that are temporarily unavailable? If a node is down for a few minutes (network blip, restart, GC pause), you don't want to:
- Fail the write entirely (hurts availability)
- Permanently lose that replica's copy (hurts durability)
- Wait for the node to recover (hurts latency)
Hinted handoff solves this elegantly: temporarily store the data elsewhere, then deliver it when the target recovers.
Definition
When a write is meant for a node that is temporarily unavailable, another node accepts the write and stores a hint -- a note saying "this data belongs to Node C, deliver it when Node C comes back." When the target node recovers, the hinted data is forwarded to it.
How it works
- Client sends a write that should be replicated to Nodes A, B, and C
- Nodes A and B are healthy -- they accept the write normally
- Node C is down -- the coordinating node (or another healthy node) writes the data locally as a hint
- The hint includes: the data itself + the identity of the target node (C)
- The coordinating node periodically checks if Node C is back
- When Node C recovers, the coordinating node forwards the hinted write to Node C
- Once Node C confirms receipt, the hint is deleted
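The steps above can be sketched in a few dozen lines. This is a toy in-memory model with made-up names (`Cluster`, `Hint`); real systems persist hints to disk and replay them asynchronously with retries:

```python
from collections import defaultdict

class Hint:
    """A stored write plus the identity of the node it really belongs to."""
    def __init__(self, target, key, value):
        self.target = target
        self.key = key
        self.value = value

class Cluster:
    def __init__(self, nodes):
        self.up = {n: True for n in nodes}    # node -> health status
        self.data = {n: {} for n in nodes}    # node -> local key/value store
        self.hints = defaultdict(list)        # hint holder -> list of Hints

    def write(self, key, value, replicas, coordinator):
        """Write to each replica; store hints for any that are down."""
        acks = 0
        for node in replicas:
            if self.up[node]:
                self.data[node][key] = value  # normal replica write
                acks += 1
            else:
                # Target is down: the coordinator stores a hint locally
                self.hints[coordinator].append(Hint(node, key, value))
        return acks

    def node_recovered(self, node):
        """Replay hints destined for `node`, then delete them."""
        self.up[node] = True
        for holder in self.hints:
            remaining = []
            for h in self.hints[holder]:
                if h.target == node:
                    self.data[node][h.key] = h.value  # deliver hinted write
                else:
                    remaining.append(h)
            self.hints[holder] = remaining
```

For example, with Node C down, a write to A, B, C gets two acks, A holds a hint for C, and calling `node_recovered("C")` delivers the data and clears the hint.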
Hinted handoff decouples "accepting a write" from "delivering it to the right place." The system can stay available (accept writes even when replicas are down) without losing data (it will be delivered later). It's a form of store-and-forward -- an old networking pattern applied to distributed storage.
Hinted handoff and sloppy quorum
Hinted handoff is often paired with sloppy quorum (used by Dynamo):
- In a strict quorum, a write must go to the designated replica nodes. If one is down, the write fails (or waits).
- In a sloppy quorum, the write can go to any healthy node in the preference list. That node stores a hint and becomes a temporary custodian.
This combination is how Dynamo achieves its "always writeable" guarantee -- as long as any healthy nodes remain reachable, the system can find somewhere to accept writes.
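The sloppy-quorum selection step can be sketched as: walk the preference list, take the first W designated replicas that are healthy, and substitute later nodes in the list for any that are down. The names here (`choose_write_targets`, `preference_list`) are illustrative, not Dynamo's actual API:

```python
def choose_write_targets(preference_list, is_up, w):
    """Return (node, hint_for) pairs. hint_for is the down replica a
    fallback node is standing in for, or None for a designated replica."""
    designated = preference_list[:w]          # strict-quorum targets
    down = [n for n in designated if not is_up(n)]
    targets = [(n, None) for n in designated if is_up(n)]
    # Fall through the rest of the preference list for substitutes
    for n in preference_list[w:]:
        if not down:
            break
        if is_up(n):
            targets.append((n, down.pop(0)))  # n will hold a hint for a down node
    return targets

# Example: C is down, so D stands in and will store a hint for C
up = {"A": True, "B": True, "C": False, "D": True, "E": True}
print(choose_write_targets(["A", "B", "C", "D", "E"], up.get, 3))
# -> [('A', None), ('B', None), ('D', 'C')]
```

A strict quorum is the special case where the fallback loop doesn't exist: if any of the first W nodes is down, the write simply fails.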
Limitations
| Limitation | Why it matters |
|---|---|
| Not a replacement for full repair | If a node is down for a long time, hints accumulate and may be dropped (storage limits). Read repair and Merkle trees handle long-term divergence. |
| Temporary inconsistency | Until hints are delivered, the target node is missing data. Reads served by that node can return stale or incomplete results until replay completes. |
| Storage overhead on hint holders | The nodes holding hints use extra disk space. If many nodes are down simultaneously, this can become significant. |
Examples
Dynamo
Dynamo uses hinted handoff with sloppy quorum as its primary mechanism for handling temporary node failures. This is a core part of Dynamo's "always writeable" philosophy -- writes are never rejected due to node failures.
Cassandra
Cassandra implements hinted handoff similarly. When a target replica is unavailable, the coordinator stores hints locally. Hints are replayed when the target node comes back online. Cassandra also has a configurable hint window -- if the node is down longer than this window, hints are no longer stored (and other repair mechanisms take over).
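As a rough illustration, the hint window and related knobs live in `cassandra.yaml`. The parameter names below match older 3.x-era configs (newer Cassandra versions rename some of them, e.g. `max_hint_window`), and the values shown are just the common defaults:

```yaml
# cassandra.yaml (illustrative excerpt; names from Cassandra 3.x-era configs)
hinted_handoff_enabled: true

# If a node stays down longer than this window (3 hours here),
# new hints for it are no longer generated -- read repair and
# anti-entropy repair take over from that point.
max_hint_window_in_ms: 10800000
```

The window is the dividing line between the two repair regimes: short outages are healed by hint replay, long outages by full repair.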
Hinted handoff answers: "What happens to writes when a replica is temporarily down?" It's the first line of defense for temporary failures. For long-term divergence, you need read repair (lazy, during reads) and Merkle trees (proactive, in the background). Show the interviewer you understand all three as a layered anti-entropy strategy.