# Fault Tolerance and Compaction
What happens when a GFS master crashes? When a ChunkServer loses a disk? When data silently corrupts on disk? BigTable delegates durability to GFS, so understanding GFS's fault tolerance mechanisms is essential to understanding BigTable's reliability guarantees.
## Fault tolerance
GFS uses two strategies to handle failures:
| Strategy | Purpose |
|---|---|
| Fast recovery | Minimize downtime after component failure |
| Replication | Ensure data survives hardware loss |
### Recovery from failures
| Failure | Recovery mechanism |
|---|---|
| Master failure | All metadata mutations are recorded in an operation log that is checkpointed and replicated to remote machines. On recovery, the new master loads the latest checkpoint, replays only the log entries written after it, and resumes. An external monitoring system detects the failure, starts a new master process, and redirects clients to it. |
| Primary replica failure | The master detects the missing heartbeat, waits for the current lease to expire, then grants a new lease to another up-to-date replica. The old primary is detected as stale via chunk version numbers and eventually garbage-collected. |
| Secondary replica failure | Client operations fail on the dead replica. After retries, the client reports the failure to the master, which replaces the replica by cloning from a healthy one; the stale copy is eventually garbage-collected. |
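The master-failure row above boils down to checkpoint-plus-replay. Here is a minimal Python sketch of that idea; the state layout, operation encoding, and all names (`MasterState`, `recover`) are illustrative assumptions, not GFS's actual structures:

```python
# Minimal sketch of checkpoint + operation-log replay for master recovery.
# The state layout and operation encoding are illustrative, not GFS's format.

class MasterState:
    def __init__(self, namespace=None, applied_seq=0):
        self.namespace = dict(namespace or {})  # file path -> chunk handles
        self.applied_seq = applied_seq          # last log entry applied

    def apply(self, op):
        """Apply one logged metadata mutation."""
        kind, path, chunk = op
        if kind == "create":
            self.namespace[path] = []
        elif kind == "add_chunk":
            self.namespace[path].append(chunk)


def recover(checkpoint, checkpoint_seq, op_log):
    """Load the latest checkpoint, then replay only the log entries
    written after the checkpoint was taken."""
    state = MasterState(checkpoint, checkpoint_seq)
    for seq, op in op_log:
        if seq > state.applied_seq:  # already reflected in the checkpoint
            state.apply(op)
            state.applied_seq = seq
    return state
```

Checkpointing keeps replay time bounded: a restarted master replays only the log suffix, not the entire history.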
Shadow masters are read-only replicas of the GFS master. They stay current by replaying the primary master's operation log, and they serve read requests when the primary is down -- trading slight staleness for read availability.
GFS does not guarantee strong consistency on chunk reads. Stale replicas may be exposed to clients. Application developers must handle stale reads themselves. This is a deliberate trade-off for availability.
## High availability through chunk replication
Each chunk is replicated across multiple ChunkServers on different racks. Key properties:
- Default replication factor: three
- Users can configure different replication levels for different parts of the namespace
- The master clones replicas to maintain the target count as servers go offline or checksums reveal corruption
- A chunk is irreversibly lost only if all replicas fail before GFS can react
- Even in total loss, data becomes unavailable, not corrupted -- clients receive clear errors
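The cloning behavior in the bullets above can be sketched as a scheduling pass on the master: find under-replicated chunks, clone the most at-risk ones first, and prefer destination servers on racks that do not already hold a replica. The data shapes and the placement heuristic here are illustrative assumptions:

```python
# Sketch of a master re-replication pass. Chunks furthest below their
# target replica count are cloned first (they are closest to loss), and
# new replicas go to racks that do not already hold one.

def rereplication_plan(chunks, live_servers):
    """chunks: {handle: {"target": int, "replicas": {server: rack}}}
       live_servers: {server: rack}
       Returns a list of (handle, source_server, dest_server) clone orders."""
    plan = []
    # Most under-replicated chunks first.
    for handle, info in sorted(
        chunks.items(),
        key=lambda kv: kv[1]["target"] - len(kv[1]["replicas"]),
        reverse=True,
    ):
        missing = info["target"] - len(info["replicas"])
        if missing <= 0:
            continue
        used_racks = set(info["replicas"].values())
        source = next(iter(info["replicas"]))  # any healthy replica can serve as source
        for server, rack in live_servers.items():
            if missing == 0:
                break
            # Rack-aware placement: avoid racks that already hold a replica.
            if server not in info["replicas"] and rack not in used_racks:
                plan.append((handle, source, server))
                used_racks.add(rack)
                missing -= 1
    return plan
```

Prioritizing the most under-replicated chunks matters: a chunk with one surviving replica is one failure from permanent loss, while a chunk at two of three is not.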
## Data integrity through checksums
Each ChunkServer uses checksumming to detect corruption. Chunks are divided into 64 KB blocks, each with a 32-bit checksum stored in memory and persisted separately from user data.
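That layout maps naturally onto a per-chunk table with one checksum per 64 KB block. A quick sketch, using CRC32 as a stand-in (the design specifies a 32-bit checksum but not the algorithm):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksum-block size

def block_checksums(chunk_data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of a chunk. zlib.crc32 is an
    assumption; the source says only that the checksum is 32 bits."""
    return [
        zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
        for off in range(0, len(chunk_data), BLOCK_SIZE)
    ]
```

At 4 bytes per 64 KB, the table for a full 64 MB chunk is only 4 KB, which is why keeping all checksums in memory is cheap.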
| Operation | Checksum behavior |
|---|---|
| Reads | Verify the checksums of every block overlapping the read range before returning data. A corrupted block triggers an error to the requester and a report to the master, which clones a replacement from a healthy replica and then tells the corrupt ChunkServer to delete its copy. |
| Writes | For writes that overwrite an existing range, verify the checksums of the first and last overlapping blocks before writing; a partial overwrite would otherwise hide existing corruption in the unmodified portions of those blocks. Compute and record new checksums after the write. |
| Appends | Skip verification of the last partial block; incrementally update its checksum and compute fresh checksums for the new blocks. Any existing corruption in that block is caught on the next read. |
| Idle scans | ChunkServers scan inactive chunks during idle periods to detect latent corruption before it becomes critical. |
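The read row above translates into "verify every overlapping block, then slice". A sketch under the same assumptions as before (CRC32 for the unspecified 32-bit checksum), with the error path simplified to an exception:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksum-block size

class CorruptBlock(Exception):
    """Stands in for GFS's error-to-client plus report-to-master path."""

def verify_read(chunk: bytes, checksums: list[int],
                offset: int, length: int) -> bytes:
    """Verify every 64 KB block overlapping [offset, offset + length)
    before returning the requested bytes."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise CorruptBlock(b)  # real GFS also reports to the master
    return chunk[offset:offset + length]
```

Note that whole blocks are verified even when the read touches only part of them; that is what makes boundary-aligned reads (mentioned below) cheaper.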
Checksumming has minimal impact on read performance because:
- Most reads span multiple blocks, so the extra verification data is proportionally small
- Client code aligns reads to checksum block boundaries
- Checksum lookups require no I/O (they're in memory)
- Checksum calculation overlaps with disk I/O
The checksum strategy for appends is a subtle optimization worth mentioning in interviews. GFS skips verifying the last partial block during appends (which could mask existing corruption) but catches it on the next read. This trade-off prioritizes append throughput -- critical for BigTable's commit log writes, which are always appends. When discussing data integrity in system design, show that you understand these kinds of performance-vs-safety trade-offs.
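Assuming CRC32 again (whose running-checksum property, exposed by `zlib.crc32`'s second argument, is what makes the incremental update cheap), the append path might be sketched like this:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksum-block size

def append_with_checksums(chunk: bytes, checksums: list[int], data: bytes):
    """Append `data` to a chunk, extending the last partial block's
    checksum incrementally and computing fresh checksums for new blocks.
    The old tail checksum is deliberately NOT verified first, so existing
    corruption there surfaces only on the next read."""
    new_chunk = chunk + data
    new_sums = list(checksums)
    tail = len(chunk) % BLOCK_SIZE
    if tail:
        # Continue the partial tail block's CRC with the appended bytes.
        take = BLOCK_SIZE - tail
        new_sums[-1] = zlib.crc32(data[:take], new_sums[-1])
        data = data[take:]
    for off in range(0, len(data), BLOCK_SIZE):
        new_sums.append(zlib.crc32(data[off:off + BLOCK_SIZE]))
    return new_chunk, new_sums
```

The incremental update never re-reads the existing tail bytes from disk, which is exactly where the throughput win over full verification comes from.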