Fault Tolerance and Compaction

What happens when a GFS master crashes? When a ChunkServer loses a disk? When data silently corrupts on disk? BigTable delegates durability to GFS, so understanding GFS's fault tolerance mechanisms is essential to understanding BigTable's reliability guarantees.

Think first
GFS replicates data across multiple ChunkServers for durability, but how do you detect if one of those replicas has silently become corrupted on disk (bit rot) without anyone reading it?

Fault tolerance

GFS uses two strategies to handle failures:

| Strategy | Purpose |
| --- | --- |
| Fast recovery | Minimize downtime after component failure |
| Replication | Ensure data survives hardware loss |

Recovery from failures

| Failure | Recovery mechanism |
| --- | --- |
| Master failure | Operations are recorded in an operation log that is checkpointed and replicated to remote machines. On recovery, the new master loads the latest checkpoint, replays the log records that follow it, and resumes. An external monitoring system detects the failure and redirects traffic to a backup. |
| Primary replica failure | The master notices the missing heartbeat, waits for the current lease to expire, then grants the lease to another replica. The old primary is identified as stale via chunk version numbers and eventually garbage-collected. |
| Secondary replica failure | Client operations fail on the dead replica. After retries, the client reports the failure to the master, which replaces the stale replica; the old copy is garbage-collected. |
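The checkpoint-plus-log-replay pattern for master recovery can be sketched as follows. This is an illustrative model, not GFS's actual data structures: the record format, `MasterState`, and `recover` are hypothetical names, and real log records cover far more operation types.

```python
# Hypothetical sketch of checkpoint + operation-log recovery.
# Assumption: each log record carries a monotonically increasing sequence
# number, so records already reflected in the checkpoint can be skipped.

class MasterState:
    def __init__(self):
        self.namespace = {}      # file path -> list of chunk handles
        self.applied_seq = 0     # sequence number of last applied record

    def apply(self, record):
        if record["seq"] <= self.applied_seq:
            return               # already captured by the checkpoint
        op = record["op"]
        if op == "create":
            self.namespace[record["path"]] = []
        elif op == "add_chunk":
            self.namespace[record["path"]].append(record["handle"])
        self.applied_seq = record["seq"]

def recover(checkpoint, log_records):
    """Load the latest checkpoint, then replay only the newer log records."""
    state = MasterState()
    state.namespace = dict(checkpoint["namespace"])
    state.applied_seq = checkpoint["seq"]
    for record in log_records:   # records arrive in sequence order
        state.apply(record)
    return state
```

Because replay starts from the checkpoint's sequence number rather than the beginning of the log, recovery time depends on how much log accumulated since the last checkpoint, not on the total history.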

Shadow masters are read-only replicas of the GFS master. They stay current by replaying the primary master's operation log, and they serve read requests when the primary is down -- trading slight staleness for read availability.

warning

GFS does not guarantee strong consistency on chunk reads. Stale replicas may be exposed to clients. Application developers must handle stale reads themselves. This is a deliberate trade-off for availability.
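The chunk version numbers mentioned above are what let the master tell a stale replica from a current one. A minimal sketch of that check, with illustrative names (the master bumps the version each time it grants a new lease, so a replica that missed mutations while down reports an older version):

```python
# Illustrative sketch of stale-replica detection via chunk version numbers.
# Field names are assumptions, not GFS internals.

def find_stale_replicas(master_version, replica_versions):
    """Return IDs of replicas whose recorded version lags the master's.

    Such replicas missed mutations (e.g. while their server was down) and
    are excluded from reads where possible, then garbage-collected.
    """
    return [rid for rid, v in replica_versions.items() if v < master_version]
```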

High availability through chunk replication

Each chunk is replicated across multiple ChunkServers on different racks. Key properties:

  • Default replication factor: three
  • Users can configure different replication levels for different parts of the namespace
  • The master clones replicas to maintain the target count as servers go offline or checksums reveal corruption
  • A chunk is irreversibly lost only if all replicas fail before GFS can react
  • Even if every replica is lost, the data becomes unavailable rather than corrupted -- clients receive clear errors instead of silently bad data
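The master's job of maintaining the target replica count can be sketched as a deficit calculation. This is a simplification under stated assumptions: `TARGET_REPLICAS` reflects the default of three, and the real scheduler also prioritizes chunks by how close they are to total loss.

```python
# Hypothetical sketch: which chunks need cloning, and how many extra copies.
# Only counts replicas on currently live servers; names are illustrative.

TARGET_REPLICAS = 3  # GFS default replication factor

def chunks_to_clone(chunk_replicas, live_servers):
    """Map chunk handle -> number of additional replicas needed."""
    work = {}
    for chunk, servers in chunk_replicas.items():
        alive = [s for s in servers if s in live_servers]
        deficit = TARGET_REPLICAS - len(alive)
        if deficit > 0:
            work[chunk] = deficit
    return work
```

A replica lost to a dead server and a replica deleted for a failed checksum both show up the same way here: the chunk's live count drops below target, and the master schedules a clone from a healthy copy.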

Data integrity through checksums

Each ChunkServer uses checksumming to detect corruption. Chunks are divided into 64 KB blocks, each with a 32-bit checksum stored in memory and persisted separately from user data.
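The block/checksum layout can be sketched in a few lines. The GFS paper specifies 64 KB blocks with 32-bit checksums but not the exact checksum function, so CRC32 here is an assumption:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, as described above

def block_checksums(chunk_data: bytes) -> list:
    """Compute one 32-bit checksum per 64 KB block of a chunk.

    CRC32 stands in for whatever 32-bit function GFS actually uses;
    these values would be kept in memory and logged separately from
    the user data itself.
    """
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]
```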

| Operation | Checksum behavior |
| --- | --- |
| Reads | Verify the checksums of all blocks overlapping the read range before returning data. A corrupted block triggers an error to the requestor and a report to the master; the master clones the chunk from a healthy replica, then tells the corrupt ChunkServer to delete its copy. |
| Writes | Verify the checksums of the first and last overlapping blocks before writing, then compute and record new checksums after the write. |
| Appends | Skip verification of the last partial block; incrementally update its checksum and compute fresh checksums for any new blocks. Existing corruption in that block is caught on the next read. |
| Idle scans | ChunkServers scan inactive chunks during idle periods to detect latent corruption before all healthy replicas are lost. |
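The read path above can be sketched concretely. This is an illustrative model, again assuming CRC32 as the 32-bit checksum; `verify_read` and `ChecksumMismatch` are hypothetical names:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks

class ChecksumMismatch(Exception):
    pass

def verify_read(chunk_data, stored_sums, offset, length):
    """Verify every block overlapping [offset, offset+length) before
    returning the requested bytes -- a sketch of the ChunkServer read path."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != stored_sums[b]:
            # A real ChunkServer would return an error to the requestor
            # and report the block to the master for re-cloning.
            raise ChecksumMismatch(f"block {b} corrupt")
    return chunk_data[offset:offset + length]
```

Note that verification happens *before* any data is returned, so corruption is never propagated to other ChunkServers or clients.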

Checksumming has minimal impact on read performance because:

  • Most reads span only a few blocks, so only a small amount of extra data must be read for verification
  • Client code aligns reads to checksum block boundaries
  • Checksum lookups require no I/O (they're in memory)
  • Checksum calculation overlaps with disk I/O

Interview angle

The checksum strategy for appends is a subtle optimization worth mentioning in interviews. GFS skips verifying the last partial block during appends (which could mask existing corruption) but catches it on the next read. This trade-off prioritizes append throughput -- critical for BigTable's commit log writes, which are always appends. When discussing data integrity in system design, show that you understand these kinds of performance-vs-safety trade-offs.
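The incremental update itself falls out naturally if the checksum is a running function like CRC32 (an assumption; the paper doesn't name the function). The key property: the last partial block's checksum can be extended over the appended bytes without re-reading or re-verifying the existing data.

```python
import zlib

def extend_last_block_checksum(old_crc: int, appended: bytes) -> int:
    """Extend the last partial block's checksum over newly appended bytes.

    zlib.crc32 accepts a running value, so the result equals the checksum
    of (old_data + appended) without touching old_data again. If old_data
    was already corrupt on disk, this masks it -- which is exactly why the
    corruption is only caught on the next read.
    """
    return zlib.crc32(appended, old_crc)
```

For example, extending the checksum of `b"abc"` with `b"def"` yields the same value as checksumming `b"abcdef"` from scratch, at the cost of hashing only the new bytes.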

Quiz
GFS uses a replication factor of 3 by default. What would happen if a ChunkServer with a corrupted replica serves a read before the corruption is detected?