Skip to main content

18 Hinted Handoff

Node C is supposed to receive a replica of a write, but it's down. Do you fail the write? In a system that prioritizes availability, the answer is no -- you find another node to temporarily hold the data and forward it to Node C when it recovers.

Think first
A write should be replicated to Nodes A, B, and C. Node C is temporarily down. You don't want to fail the write or wait for Node C. How would you ensure the data eventually reaches Node C without losing it?

Background

In a replicated system using quorum writes, a write is successful as long as enough replicas acknowledge it. But what happens to the replicas that are temporarily unavailable? If a node is down for a few minutes (network blip, restart, GC pause), you don't want to:

  • Fail the write entirely (hurts availability)
  • Permanently lose that replica's copy (hurts durability)
  • Wait for the node to recover (hurts latency)

Hinted handoff solves this elegantly: temporarily store the data elsewhere, then deliver it when the target recovers.

Definition

When a write is meant for a node that is temporarily unavailable, another node accepts the write and stores a hint -- a note saying "this data belongs to Node C, deliver it when Node C comes back." When the target node recovers, the hinted data is forwarded to it.

How it works

  1. Client sends a write that should be replicated to Nodes A, B, and C
  2. Nodes A and B are healthy -- they accept the write normally
  3. Node C is down -- the coordinating node (or another healthy node) writes the data locally as a hint
  4. The hint includes: the data itself + the identity of the target node (C)
  5. The coordinating node periodically checks if Node C is back
  6. When Node C recovers, the coordinating node forwards the hinted write to Node C
  7. Once Node C confirms receipt, the hint is deleted
The key insight

Hinted handoff decouples "accepting a write" from "delivering it to the right place." The system can stay available (accept writes even when replicas are down) without losing data (it will be delivered later). It's a form of store-and-forward -- an old networking pattern applied to distributed storage.

Hinted handoff and sloppy quorum

Hinted handoff is often paired with sloppy quorum (used by Dynamo):

  • In a strict quorum, a write must go to the designated replica nodes. If one is down, the write fails (or waits).
  • In a sloppy quorum, the write can go to any healthy node in the preference list. That node stores a hint and becomes a temporary custodian.

This combination is how Dynamo achieves its "always writeable" guarantee -- no matter how many nodes are down, the system can find healthy nodes to accept writes.

Limitations

LimitationWhy it matters
Not a replacement for full repairIf a node is down for a long time, hints accumulate and may be dropped (storage limits). Read repair and Merkle trees handle long-term divergence.
Temporary inconsistencyUntil hints are delivered, the target node is missing data. Reads from that node return stale results.
Storage overhead on hint holdersThe nodes holding hints use extra disk space. If many nodes are down simultaneously, this can become significant.

Examples

Dynamo

Dynamo uses hinted handoff with sloppy quorum as its primary mechanism for handling temporary node failures. This is a core part of Dynamo's "always writeable" philosophy -- writes are never rejected due to node failures.

Cassandra

Cassandra implements hinted handoff similarly. When a target replica is unavailable, the coordinator stores hints locally. Hints are replayed when the target node comes back online. Cassandra also has a configurable hint window -- if the node is down longer than this window, hints are no longer stored (and other repair mechanisms take over).

Interview angle

Hinted handoff answers: "What happens to writes when a replica is temporarily down?" It's the first line of defense for temporary failures. For long-term divergence, you need read repair (lazy, during reads) and Merkle trees (proactive, in the background). Show the interviewer you understand all three as a layered anti-entropy strategy.

Quiz
In Dynamo, a node is down for 3 days. Hinted handoff has a hint window of 1 hour. What data does the node miss when it comes back?