# Fault Tolerance and Compaction
What happens when a GFS master crashes? When a ChunkServer loses a disk? When data silently corrupts on disk? BigTable delegates durability to GFS, so understanding GFS's fault tolerance mechanisms is essential to understanding BigTable's reliability guarantees.
## Fault tolerance
GFS uses two strategies to handle failures:
| Strategy | Purpose |
|---|---|
| Fast recovery | Minimize downtime after component failure |
| Replication | Ensure data survives hardware loss |
### Recovery from failures
| Failure | Recovery mechanism |
|---|---|
| Master failure | All metadata mutations are recorded in an operation log that is checkpointed and replicated to remote machines. On recovery, the new master loads the latest checkpoint, replays only the log entries written after it, and resumes. An external monitoring system detects the failure, starts a new master process, and redirects clients to it. |
| Primary replica failure | The master detects the missing heartbeat, waits for the current lease to expire, then grants a new lease to another up-to-date replica. The old primary is detected as stale via chunk version numbers and eventually garbage-collected. |
| Secondary replica failure | Client operations fail on the dead replica. After retries, the client reports the failure to the master, which replaces the replica by cloning from a healthy one; the stale copy is eventually garbage-collected. |
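The master-failure row above boils down to checkpoint-plus-replay. Here is a minimal Python sketch of that idea; the state layout, operation encoding, and all names (`MasterState`, `recover`) are illustrative assumptions, not GFS's actual structures:

```python
# Minimal sketch of checkpoint + operation-log replay for master recovery.
# The state layout and operation encoding are illustrative, not GFS's format.

class MasterState:
    def __init__(self, namespace=None, applied_seq=0):
        self.namespace = dict(namespace or {})  # file path -> chunk handles
        self.applied_seq = applied_seq          # last log entry applied

    def apply(self, op):
        """Apply one logged metadata mutation."""
        kind, path, chunk = op
        if kind == "create":
            self.namespace[path] = []
        elif kind == "add_chunk":
            self.namespace[path].append(chunk)


def recover(checkpoint, checkpoint_seq, op_log):
    """Load the latest checkpoint, then replay only the log entries
    written after the checkpoint was taken."""
    state = MasterState(checkpoint, checkpoint_seq)
    for seq, op in op_log:
        if seq > state.applied_seq:  # already reflected in the checkpoint
            state.apply(op)
            state.applied_seq = seq
    return state
```

Checkpointing keeps replay time bounded: a restarted master replays only the log suffix, not the entire history.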
Shadow masters are read-only replicas of the GFS master. They stay current by replaying the primary master's operation log, and they serve read requests when the primary is down -- trading slight staleness for read availability.
GFS does not guarantee strong consistency on chunk reads. Stale replicas may be exposed to clients. Application developers must handle stale reads themselves. This is a deliberate trade-off for availability.
## High availability through chunk replication
Each chunk is replicated across multiple ChunkServers on different racks. Key properties:
- Default replication factor: three
- Users can configure different replication levels for different parts of the namespace
- The master clones replicas to maintain the target count as servers go offline or checksums reveal corruption
- A chunk is irreversibly lost only if all replicas fail before GFS can react
- Even in total loss, data becomes unavailable, not corrupted -- clients receive clear errors
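The cloning behavior in the bullets above can be sketched as a scheduling pass on the master: find under-replicated chunks, clone the most at-risk ones first, and prefer destination servers on racks that do not already hold a replica. The data shapes and the placement heuristic here are illustrative assumptions:

```python
# Sketch of a master re-replication pass. Chunks furthest below their
# target replica count are cloned first (they are closest to loss), and
# new replicas go to racks that do not already hold one.

def rereplication_plan(chunks, live_servers):
    """chunks: {handle: {"target": int, "replicas": {server: rack}}}
       live_servers: {server: rack}
       Returns a list of (handle, source_server, dest_server) clone orders."""
    plan = []
    # Most under-replicated chunks first.
    for handle, info in sorted(
        chunks.items(),
        key=lambda kv: kv[1]["target"] - len(kv[1]["replicas"]),
        reverse=True,
    ):
        missing = info["target"] - len(info["replicas"])
        if missing <= 0:
            continue
        used_racks = set(info["replicas"].values())
        source = next(iter(info["replicas"]))  # any healthy replica can serve as source
        for server, rack in live_servers.items():
            if missing == 0:
                break
            # Rack-aware placement: avoid racks that already hold a replica.
            if server not in info["replicas"] and rack not in used_racks:
                plan.append((handle, source, server))
                used_racks.add(rack)
                missing -= 1
    return plan
```

Prioritizing the most under-replicated chunks matters: a chunk with one surviving replica is one failure from permanent loss, while a chunk at two of three is not.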
## Data integrity through checksums
Each ChunkServer uses checksumming to detect corruption. Chunks are divided into 64 KB blocks, each with a 32-bit checksum stored in memory and persisted separately from user data.
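That layout maps naturally onto a per-chunk table with one checksum per 64 KB block. A quick sketch, using CRC32 as a stand-in (the design specifies a 32-bit checksum but not the algorithm):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksum-block size

def block_checksums(chunk_data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of a chunk. zlib.crc32 is an
    assumption; the source says only that the checksum is 32 bits."""
    return [
        zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
        for off in range(0, len(chunk_data), BLOCK_SIZE)
    ]
```

At 4 bytes per 64 KB, the table for a full 64 MB chunk is only 4 KB, which is why keeping all checksums in memory is cheap.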
| Operation | Checksum behavior |
|---|---|
| Reads | Verify the checksums of every block overlapping the read range before returning data. A corrupted block triggers an error to the requester and a report to the master, which clones a replacement from a healthy replica and then tells the corrupt ChunkServer to delete its copy. |
| Writes | For writes that overwrite an existing range, verify the checksums of the first and last overlapping blocks before writing; a partial overwrite would otherwise hide existing corruption in the unmodified portions of those blocks. Compute and record new checksums after the write. |
| Appends | Skip verification of the last partial block; incrementally update its checksum and compute fresh checksums for the new blocks. Any existing corruption in that block is caught on the next read. |
| Idle scans | ChunkServers scan inactive chunks during idle periods to detect latent corruption before it becomes critical. |
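The read row above translates into "verify every overlapping block, then slice". A sketch under the same assumptions as before (CRC32 for the unspecified 32-bit checksum), with the error path simplified to an exception:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksum-block size

class CorruptBlock(Exception):
    """Stands in for GFS's error-to-client plus report-to-master path."""

def verify_read(chunk: bytes, checksums: list[int],
                offset: int, length: int) -> bytes:
    """Verify every 64 KB block overlapping [offset, offset + length)
    before returning the requested bytes."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise CorruptBlock(b)  # real GFS also reports to the master
    return chunk[offset:offset + length]
```

Note that whole blocks are verified even when the read touches only part of them; that is what makes boundary-aligned reads (mentioned below) cheaper.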
Checksumming has minimal impact on read performance because:
- Most reads span multiple blocks, so the extra verification data is proportionally small
- Client code aligns reads to checksum block boundaries
- Checksum lookups require no I/O (they're in memory)
- Checksum calculation overlaps with disk I/O
The checksum strategy for appends is a subtle optimization worth mentioning in interviews. GFS skips verifying the last partial block during appends (which could mask existing corruption) but catches it on the next read. This trade-off prioritizes append throughput -- critical for BigTable's commit log writes, which are always appends. When discussing data integrity in system design, show that you understand these kinds of performance-vs-safety trade-offs.
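Assuming CRC32 again (whose running-checksum property, exposed by `zlib.crc32`'s second argument, is what makes the incremental update cheap), the append path might be sketched like this:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksum-block size

def append_with_checksums(chunk: bytes, checksums: list[int], data: bytes):
    """Append `data` to a chunk, extending the last partial block's
    checksum incrementally and computing fresh checksums for new blocks.
    The old tail checksum is deliberately NOT verified first, so existing
    corruption there surfaces only on the next read."""
    new_chunk = chunk + data
    new_sums = list(checksums)
    tail = len(chunk) % BLOCK_SIZE
    if tail:
        # Continue the partial tail block's CRC with the appended bytes.
        take = BLOCK_SIZE - tail
        new_sums[-1] = zlib.crc32(data[:take], new_sums[-1])
        data = data[take:]
    for off in range(0, len(data), BLOCK_SIZE):
        new_sums.append(zlib.crc32(data[off:off + BLOCK_SIZE]))
    return new_chunk, new_sums
```

The incremental update never re-reads the existing tail bytes from disk, which is exactly where the throughput win over full verification comes from.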