Fault Tolerance, High Availability, and Data Integrity
In a cluster of thousands of commodity machines, hardware failure is not an exception -- it is the steady state. Disks die, servers crash, network links flap. GFS's survival strategy rests on two simple pillars: fast recovery and replication.
Fault tolerance
Master failure
The master is a single point of failure. GFS mitigates this with:
- Operation log replication -- all metadata changes are written to a write-ahead log, replicated to multiple remote machines
- Checkpointing -- periodic snapshots of master state in a B-tree-like format for fast recovery
- External monitoring -- an infrastructure outside GFS detects master failure and redirects traffic to a backup
On recovery, the new master loads the checkpoint, replays the operation log, and polls ChunkServers for chunk locations via HeartBeat messages.
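The recovery sequence above can be sketched as follows. This is an illustrative model, not GFS code: the function name, the record shapes, and the namespace representation are all assumptions.

```python
# Hypothetical sketch of master recovery: start from the latest
# checkpoint, then replay operation-log records appended after it.
# Chunk *locations* are deliberately absent -- they are rediscovered
# by polling ChunkServers via HeartBeat messages after recovery.

def recover_master_state(checkpoint, log_records):
    """Rebuild file-namespace metadata after a master restart."""
    # Checkpointed namespace: {path: [chunk_handle, ...]}
    namespace = dict(checkpoint)
    # Replay records logged after the checkpoint, in order.
    for record in log_records:
        op, path = record["op"], record["path"]
        if op == "create":
            namespace[path] = []
        elif op == "add_chunk":
            namespace[path].append(record["chunk_handle"])
        elif op == "delete":
            namespace.pop(path, None)
    return namespace
```

Because the log is replicated to multiple remote machines before a mutation is acknowledged, any surviving log copy plus the checkpoint is enough to rebuild this state.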
Shadow masters provide read-only access to the file system while the primary is down. They replicate the primary's state by reading its operation log, though they may lag slightly. Since file data comes from ChunkServers (not the master), clients reading through a shadow master get current data -- only metadata might be slightly stale.
Shadow masters are not the same as a hot standby. They serve read-only metadata and may lag the primary. Compare this with HDFS, which uses a standby NameNode that can take over as primary via automatic failover. In an interview, distinguish between read-only replicas (shadows) and true failover replicas.
Primary replica failure
When an active primary replica fails:
- The master detects the failure (no heartbeat)
- The master waits for the current lease to expire (a network partition may have cut the primary off from the master while it is still serving clients directly)
- The master grants the lease to a new ChunkServer
- When the old primary recovers, the master detects its stale chunk version number, removes it from the replica set, and garbage-collects the stale data
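The key safety rule in the steps above is that the master must wait out the old lease before granting a new one. A minimal sketch of that decision, assuming the 60-second lease duration from the GFS paper and illustrative names:

```python
LEASE_DURATION = 60.0  # seconds; the GFS paper uses 60s initial leases

def can_grant_new_lease(now, lease_granted_at, heartbeat_ok):
    """Return True only when it is safe to pick a new primary."""
    if heartbeat_ok:
        # Current primary is alive; let it keep (and extend) its lease.
        return False
    # Heartbeats stopped, but a partitioned primary may still be
    # serving clients -- wait until its lease has provably expired.
    return now >= lease_granted_at + LEASE_DURATION
```

This is why a failed primary stalls writes for up to one lease period: availability is traded for the guarantee that two primaries never serve the same chunk at once.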
Secondary replica failure
When a secondary replica fails, client writes to that replica start failing. The client retries; if the retries are exhausted, it reports the failure to the master. The failed secondary misses mutations and becomes stale. The master eventually replaces it through re-replication and garbage-collects the stale replica.
Stale replicas can be exposed to clients through cached metadata. GFS does not guarantee strong consistency on chunk reads. Applications must tolerate reading slightly stale data -- this is an explicit design choice, not a bug.
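The client-side retry-then-report behavior can be sketched as a small loop. The function names and retry count are assumptions for illustration, not GFS's actual client library:

```python
def write_with_retry(write_fn, report_failure_fn, max_retries=3):
    """Attempt a mutation; on exhaustion, escalate to the master.

    write_fn: callable returning True on success (hypothetical).
    report_failure_fn: callable that notifies the master (hypothetical).
    """
    for _attempt in range(max_retries):
        if write_fn():
            return True
    # Retries exhausted: the master will eventually re-replicate the
    # chunk and garbage-collect the stale secondary.
    report_failure_fn()
    return False
```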
High availability through chunk replication
Each chunk is replicated across multiple ChunkServers on different racks (default: 3 replicas). The master monitors replica counts and triggers re-replication when:
- A ChunkServer goes offline
- Checksum verification detects corruption
- The replication factor is increased
A chunk is irreversibly lost only if all its replicas fail before re-replication completes. Even then, the data becomes unavailable, not corrupted -- applications get clear errors rather than garbage data.
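The master's re-replication scan can be sketched as below. The GFS paper prioritizes chunks furthest from their replication goal, so a chunk down to one live replica is repaired before one down to two; the names here are illustrative.

```python
REPLICATION_GOAL = 3  # default replica count per chunk

def rereplication_queue(chunk_replicas):
    """Order degraded chunks for repair, most-at-risk first.

    chunk_replicas: {chunk_handle: number_of_live_replicas}
    """
    degraded = [(handle, n) for handle, n in chunk_replicas.items()
                if n < REPLICATION_GOAL]
    # Fewest live replicas first: a 1-copy chunk outranks a 2-copy one.
    degraded.sort(key=lambda item: item[1])
    return [handle for handle, _ in degraded]
```

Prioritizing by degradation level minimizes the window in which a single additional failure causes permanent loss.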
Data integrity through checksums
Each ChunkServer independently verifies data integrity using checksums. Every chunk is divided into 64KB blocks, and each block has a 32-bit checksum stored in memory and persisted separately from user data.
| Operation | Checksum behavior |
|---|---|
| Read | Verify checksums of overlapping blocks before returning data. If mismatch: return error, report to master, master clones from good replica |
| Write | Verify checksums of first and last overlapping blocks before writing. Compute and store new checksums after write |
| Append | Skip verification of the last partial block; incrementally update its checksum and compute new checksums for new blocks. If the last block was already corrupted, the mismatch surfaces on the next read |
| Idle scan | ChunkServers scan inactive chunks in the background, preventing dormant corrupted replicas from silently reducing redundancy |
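The read path in the table can be made concrete with a short sketch. `zlib.crc32` stands in for GFS's actual 32-bit checksum function, and all names here are assumptions:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksums each 64 KB block of a chunk

def block_checksums(chunk_data):
    """Compute one 32-bit checksum per 64 KB block (illustrative CRC)."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, stored_checksums, offset, length):
    """Verify only the blocks a read overlaps before returning data."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != stored_checksums[i]:
            # Real GFS: return an error to the client and report the
            # corruption to the master, which clones a good replica.
            raise IOError(f"checksum mismatch in block {i}")
    return chunk_data[offset:offset + length]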
Why checksums don't hurt read performance
- Most reads span multiple blocks, so the extra verification data is proportionally small
- Checksum lookups happen in memory -- no extra disk I/O
- Checksum computation overlaps with data I/O
- GFS client code aligns reads to checksum block boundaries to minimize overhead
Data integrity checking is a universal concern. GFS uses per-block checksums; Dynamo uses Merkle trees to detect divergence between replicas. In an interview, pick the right tool: checksums for single-replica verification, Merkle trees for efficient cross-replica comparison.
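To make the contrast concrete, here is a toy Merkle comparison between two replicas. One root-hash exchange answers "do these replicas match?", and on a mismatch a real system would descend only into differing subtrees; the helper names are illustrative, not Dynamo's API:

```python
import hashlib

def merkle_root(blocks):
    """Build a Merkle root over an ordered list of data blocks."""
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:             # duplicate last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0] if level else b""

def replicas_agree(blocks_a, blocks_b):
    """Compare two replicas by exchanging a single root hash."""
    return merkle_root(blocks_a) == merkle_root(blocks_b)
```

Per-block checksums answer "is this one copy intact?" locally; the Merkle root answers "do two remote copies match?" with O(log n) traffic on divergence, which is the distinction worth articulating in an interview.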