19 Read Repair
A node was down for an hour. It missed several writes. Hinted handoff delivered some of them, but not all (maybe the hint window expired). Now the node is back up and serving reads -- with stale data. How do you detect and fix this, without running a separate repair process?
Background
In eventually consistent systems, replicas can drift apart. Hinted handoff handles short-term failures, but it's not comprehensive -- hints can be lost, expire, or cover only a subset of missed writes. You need another mechanism to detect and fix stale replicas, ideally without adding a separate background process.
The clever insight: you're already reading from multiple replicas to satisfy the quorum. Why not compare them while you're at it?
Definition
During a read operation, the system reads from multiple replicas and compares their responses. If any replica returns stale data, the system pushes the latest version to it (synchronously or in the background, depending on the system). This "repair during read" is called read repair.
How it works
- Client sends a read request
- The coordinator reads the full data from one replica and a digest (checksum) from the others
- If digests match → all replicas are in sync, return the data
- If digests don't match → read full data from all replicas, determine the newest version
- Return the newest version to the client
- Asynchronously push the newest version to any replicas that had stale data
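Here is a minimal Python sketch of this flow. It assumes a simple last-write-wins scheme where each record carries a version number (e.g., a write timestamp); `Replica`, `Versioned`, and `coordinator_read` are illustrative names for this sketch, not any real system's API.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Versioned:
    """A value paired with a version -- here, a last-write-wins timestamp."""
    value: str
    version: int

class Replica:
    """Hypothetical in-memory replica, just enough to illustrate the flow."""
    def __init__(self) -> None:
        self.store: dict[str, Versioned] = {}

    def read(self, key: str) -> Versioned | None:
        return self.store.get(key)

    def digest(self, key: str) -> str:
        # Checksum of value+version: cheap to compute and ship over the wire.
        rec = self.store.get(key)
        payload = f"{rec.value}:{rec.version}" if rec else ""
        return hashlib.sha256(payload.encode()).hexdigest()

    def write(self, key: str, rec: Versioned) -> None:
        current = self.store.get(key)
        if current is None or rec.version > current.version:
            self.store[key] = rec

def coordinator_read(key: str, replicas: list[Replica]) -> Versioned | None:
    # Step 1: full data from one replica, digests from the rest.
    data = replicas[0].read(key)
    local_digest = replicas[0].digest(key)
    remote_digests = [r.digest(key) for r in replicas[1:]]

    # Step 2: if all digests match, the replicas agree -- return immediately.
    if all(d == local_digest for d in remote_digests):
        return data

    # Step 3: mismatch -- fetch full data everywhere, pick the newest version.
    records = [r.read(key) for r in replicas]
    newest = max((rec for rec in records if rec is not None),
                 key=lambda rec: rec.version)

    # Step 4: read repair -- push the newest version to every stale replica.
    # (Real systems typically do this asynchronously; inline here for clarity.)
    for replica, rec in zip(replicas, records):
        if rec is None or rec.version < newest.version:
            replica.write(key, newest)

    return newest
```

A quick trace: write version 2 to two of three replicas and leave the third stale; the next `coordinator_read` detects the digest mismatch, returns version 2, and backfills the straggler as a side effect.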
Optimization: probabilistic read repair
Comparing all replicas on every read is expensive. When the consistency level is less than ALL, many systems perform read repair probabilistically -- for example, only on 10% of reads. This reduces overhead while still gradually repairing stale replicas over time.
| Full read repair (every read) | Probabilistic read repair |
|---|---|
| Every read repairs inconsistencies immediately | Repairs happen gradually over time |
| Higher read latency (extra comparisons) | Lower overhead per read |
| Used when consistency level = ALL | Used when consistency level < ALL |
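The probabilistic variant is essentially a coin flip in front of the comparison path. A simplified sketch, reusing the hypothetical `coordinator_read`, `Replica`, and `Versioned` from above; real systems still compare the replicas contacted for the quorum even on the fast path, which this sketch omits for brevity:

```python
import random

READ_REPAIR_CHANCE = 0.1  # trigger a full comparison on ~10% of reads

def read_with_probabilistic_repair(key: str, replicas: list[Replica],
                                   quorum: int) -> Versioned | None:
    if random.random() < READ_REPAIR_CHANCE:
        # Slow path: compare all replicas and repair any stragglers.
        return coordinator_read(key, replicas)
    # Fast path: contact only enough replicas to satisfy the quorum.
    records = [r.read(key) for r in replicas[:quorum]]
    return max((rec for rec in records if rec is not None),
               key=lambda rec: rec.version, default=None)
```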
Read repair is lazy -- it only fixes data that's actually being read. Hot data (frequently accessed) gets repaired quickly. Cold data (rarely accessed) might remain stale for a long time. For cold data, you need Merkle trees to proactively find and fix divergence.
The three layers of anti-entropy
| Mechanism | When it runs | What it fixes | Speed |
|---|---|---|---|
| Hinted Handoff | During write (proactive) | Temporary node failures | Immediate (when node recovers) |
| Read Repair | During read (reactive) | Stale replicas for accessed data | On next read |
| Merkle Trees | Background process (proactive) | All divergence, including cold data | Eventually |
These three mechanisms form a layered defense: hinted handoff catches most temporary failures, read repair fixes stale data as it's accessed, and Merkle trees sweep up everything else in the background.
Examples
Cassandra
Cassandra performs blocking read repair whenever the digests from the replicas contacted for a read disagree; at ALL consistency, every replica participates, so any mismatch gets repaired on the spot. Versions before 4.0 also exposed a read_repair_chance setting: the probability of triggering an additional read repair across all replicas on reads below ALL. Cassandra 4.0 removed this probabilistic form, keeping only blocking read repair.
Dynamo
Dynamo uses read repair as part of its anti-entropy strategy. During reads, the coordinator compares responses and pushes updates to stale replicas. This works together with Merkle tree-based background synchronization.
Read repair is the answer to "How do you fix stale replicas without a separate repair process?" The key insight: since you're already reading from multiple replicas for quorum, compare them and fix discrepancies on the spot. Mention it alongside hinted handoff and Merkle trees as one of three complementary anti-entropy mechanisms; interviewers love seeing you understand the layered approach.