
Fault Tolerance, High Availability, and Data Integrity

In a cluster of thousands of commodity machines, hardware failure is not an exception -- it is the steady state. Disks die, servers crash, network links flap. GFS's survival strategy rests on two simple pillars: fast recovery and replication.

Think first
In a cluster of thousands of machines, which failure is more dangerous -- master failure or ChunkServer failure -- and why would the recovery strategies differ?

Fault tolerance

Master failure

The master is a single point of failure. GFS mitigates this with:

  1. Operation log replication -- all metadata changes are written to a write-ahead log, replicated to multiple remote machines
  2. Checkpointing -- periodic snapshots of master state in a B-tree-like format for fast recovery
  3. External monitoring -- an infrastructure outside GFS detects master failure and redirects traffic to a backup

On recovery, the new master loads the checkpoint, replays the operation log, and polls ChunkServers for chunk locations via HeartBeat messages.
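The recovery sequence above can be sketched as replaying only the log records newer than the checkpoint. This is a minimal illustration with invented names (`recover_master`, `apply_op`, a dict as the namespace); the real master state and log format are more involved.

```python
# Sketch of master recovery: load the checkpoint, then replay only the
# operation-log records written after it. All names here are illustrative.

def apply_op(metadata, op, args):
    """Apply one logged metadata mutation to the namespace."""
    if op == "create":
        metadata[args["path"]] = {"chunks": []}
    elif op == "add_chunk":
        metadata[args["path"]]["chunks"].append(args["chunk_handle"])
    return metadata

def recover_master(checkpoint, checkpoint_seq, op_log):
    """Start from the checkpointed state; replay records after the checkpoint."""
    metadata = dict(checkpoint)  # snapshot already reflects records <= checkpoint_seq
    for seq, op, args in op_log:
        if seq > checkpoint_seq:
            metadata = apply_op(metadata, op, args)
    return metadata

# Checkpoint taken at sequence 2, so only records 3 and 4 are replayed.
ckpt = {"/a": {"chunks": ["c1"]}}
log = [(1, "create", {"path": "/a"}),
       (2, "add_chunk", {"path": "/a", "chunk_handle": "c1"}),
       (3, "create", {"path": "/b"}),
       (4, "add_chunk", {"path": "/b", "chunk_handle": "c2"})]
state = recover_master(ckpt, 2, log)
```

Note that chunk locations are deliberately absent from this state: the master rebuilds them by polling ChunkServers, so they never need to be logged.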

Shadow masters provide read-only access to the file system while the primary is down. They replicate the primary's state by reading its operation log, though they may lag slightly. Since file data comes from ChunkServers (not the master), clients reading through a shadow master get current data -- only metadata might be slightly stale.

Interview angle

Shadow masters are not the same as a hot standby. They serve read-only metadata and may lag the primary. Compare this with HDFS, which uses a standby NameNode that can take over as primary via automatic failover. In an interview, distinguish between read-only replicas (shadows) and true failover replicas.

Primary replica failure

When an active primary replica fails:

  1. The master detects the failure (no heartbeat)
  2. The master waits for the current lease to expire (an unreachable primary might still be alive and serving clients, e.g. across a network partition)
  3. The master grants the lease to a new ChunkServer
  4. When the old primary recovers, the master detects its stale version number, replaces it with new replicas, and garbage-collects the stale data
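Step 4 hinges on chunk version numbers: the master bumps a chunk's version whenever it grants a new lease, so any replica that missed that grant is detectably stale. A minimal sketch of the comparison, with invented names:

```python
# Sketch of stale-replica detection via chunk version numbers.
# Function and server names are illustrative, not from GFS.

def find_stale_replicas(master_version, replica_versions):
    """A replica whose version lags the master's missed at least one
    lease grant (and thus possibly mutations) -- it is stale."""
    return [server for server, version in replica_versions.items()
            if version < master_version]

# The master bumped the version to 5 when it granted the new lease;
# the recovered old primary still reports version 4.
stale = find_stale_replicas(5, {"cs1": 5, "cs2": 5, "cs_old_primary": 4})
```

The stale replica is then scheduled for garbage collection rather than trusted with reads or writes.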

Secondary replica failure

When a secondary replica fails, client writes to that replica start failing. The client retries the mutation; if the retries are exhausted, it reports the failure to the master. The failed secondary misses mutations and becomes stale, so the master eventually replaces it through re-replication and garbage-collects the stale replica.
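The retry-then-report behavior can be sketched as follows. This is a simplified model (the real GFS client library pipelines data through a replica chain); `write_with_retry` and `push_to_replica` are hypothetical names.

```python
# Sketch of client-side handling when a secondary replica fails a write.
# A mutation succeeds only if every replica accepts it; otherwise the
# client retries, and finally escalates to the master.

def write_with_retry(push_to_replica, replicas, data, max_retries=3):
    """Try the mutation on all replicas; retry a few times, then report
    failure so the master can re-replicate around the bad secondary."""
    for _attempt in range(max_retries):
        if all(push_to_replica(r, data) for r in replicas):
            return "ok"
    return "report_to_master"  # one replica kept failing; it is now stale

# A healthy cluster succeeds; a dead secondary ("cs2") exhausts retries.
ok = write_with_retry(lambda r, d: True, ["cs1", "cs2", "cs3"], b"data")
failed = write_with_retry(lambda r, d: r != "cs2", ["cs1", "cs2", "cs3"], b"data")
```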

warning

Stale replicas can be exposed to clients through cached metadata. GFS does not guarantee strong consistency on chunk reads. Applications must tolerate reading slightly stale data -- this is an explicit design choice, not a bug.

High availability through chunk replication

Each chunk is replicated across multiple ChunkServers on different racks (default: 3 replicas). The master monitors replica counts and triggers re-replication when:

  • A ChunkServer goes offline
  • Checksum verification detects corruption
  • The replication factor is increased

A chunk is irreversibly lost only if all its replicas fail before re-replication completes. Even then, the data becomes unavailable, not corrupted -- applications get clear errors rather than garbage data.
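The master's re-replication decision can be sketched as a scan that prioritizes the chunks closest to total loss, which matches the paper's policy of boosting priority as live replica counts drop. Names here (`rereplication_queue`, the server IDs) are illustrative.

```python
# Sketch of the master's re-replication check, assuming it knows each
# chunk's replica locations and which ChunkServers are currently live.

TARGET_REPLICAS = 3  # default replication factor

def rereplication_queue(chunk_locations, live_servers):
    """Return under-replicated chunks, most endangered first."""
    needy = []
    for chunk, servers in chunk_locations.items():
        live = [s for s in servers if s in live_servers]
        if len(live) < TARGET_REPLICAS:
            needy.append((len(live), chunk))
    needy.sort()  # fewest live replicas first -- closest to permanent loss
    return [chunk for _, chunk in needy]

# cs3 and cs3b went offline: "c2" is down to one replica, "c1" to two,
# so "c2" is re-replicated first.
queue = rereplication_queue(
    {"c1": ["cs1", "cs2", "cs3"], "c2": ["cs2", "cs3", "cs3b"]},
    {"cs1", "cs2"})
```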

Think first
How would you verify data integrity against silent disk corruption (bit rot) without comparing replicas byte-by-byte?

Data integrity through checksums

Each ChunkServer independently verifies data integrity using checksums. Every chunk is divided into 64KB blocks, and each block has a 32-bit checksum stored in memory and persisted separately from user data.
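The block layout above can be sketched directly: split the chunk into 64 KB blocks and keep one 32-bit checksum per block. CRC32 is used here as a stand-in 32-bit checksum; the paper does not specify GFS's exact algorithm.

```python
import zlib

# Sketch of per-block checksumming for a chunk. zlib.crc32 is an
# assumed stand-in for whatever 32-bit checksum GFS actually used.

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks

def block_checksums(chunk_data: bytes):
    """One 32-bit checksum per 64 KB block; the final block may be short."""
    return [zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
            for off in range(0, len(chunk_data), BLOCK_SIZE)]

# A 130 KB chunk yields three checksums: 64 KB + 64 KB + 2 KB.
sums = block_checksums(b"x" * (130 * 1024))
```

Because the checksums are tiny relative to the data (4 bytes per 64 KB), the ChunkServer can keep the whole table in memory.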

Checksum behavior by operation:

  • Read -- verify checksums of all overlapping blocks before returning data. On a mismatch: return an error, report to the master, and let the master clone a replacement from a good replica
  • Write -- verify checksums of the first and last overlapping blocks before writing; compute and store new checksums after the write
  • Append -- skip verification of the last partial block; incrementally update its checksum and compute new checksums for the new blocks. If the last block was already corrupted, the mismatch surfaces on the next read
  • Idle scan -- ChunkServers scan inactive chunks in the background, preventing dormant corrupted replicas from silently reducing redundancy
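The read path can be sketched as: recompute checksums only for the blocks the read overlaps, and refuse to return data if any of them fails. This is an illustrative model (`verified_read` is an invented name; CRC32 stands in for the unspecified 32-bit checksum).

```python
import zlib

# Sketch of read-path verification on a ChunkServer: check every block
# the requested byte range overlaps before returning any data.

BLOCK_SIZE = 64 * 1024

def verified_read(chunk_data, stored_sums, offset, length):
    """Return the requested bytes, or None if any overlapped block is corrupt."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != stored_sums[b]:
            return None  # corruption: report to master, clone from a good replica
    return chunk_data[offset:offset + length]

data = b"a" * (128 * 1024)
sums = [zlib.crc32(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)]
ok = verified_read(data, sums, 100, 10)        # clean data passes
corrupt = b"b" + data[1:]                      # one flipped byte in block 0
bad = verified_read(corrupt, sums, 0, 10)      # caught before returning
```

Note that only the overlapped blocks are checked, which is why aligning reads to block boundaries keeps the overhead small.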

Why checksums don't hurt read performance

  • Most reads span multiple blocks, so the extra verification data is proportionally small
  • Checksum lookups happen in memory -- no extra disk I/O
  • Checksum computation overlaps with data I/O
  • GFS client code aligns reads to checksum block boundaries to minimize overhead

Interview angle

Data integrity checking is a universal concern. GFS uses per-block checksums; Dynamo uses Merkle trees to detect divergence between replicas. In an interview, pick the right tool: checksums for single-replica verification, Merkle trees for efficient cross-replica comparison.

Quiz
What would happen if ChunkServers only verified checksums when clients read data, without running background block scans on idle chunks?