Fault Tolerance
In a cluster of thousands of commodity machines, hardware failure is not an exception -- it is a daily event. HDFS must tolerate DataNode crashes without data loss and NameNode crashes without losing the file-to-block mapping. These are two fundamentally different problems with different solutions.
How does HDFS handle DataNode failures?
Replication
Every block is replicated to multiple DataNodes (default: 3). When a DataNode dies, its blocks remain available on the surviving replicas. No data is lost, and reads continue without interruption.
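The replication factor alone does not capture where replicas go. HDFS's default rack-aware policy places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. A minimal sketch of that policy, using a simplified node/rack model (the function and variable names here are illustrative, not HDFS APIs):

```python
import random

def place_replicas(writer_node, nodes_by_rack, replication=3):
    """Sketch of HDFS's default rack-aware placement for 3 replicas:
    1st on the writer's node, 2nd on a node in a different rack,
    3rd on a different node in that same remote rack."""
    replicas = [writer_node]
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    # Second replica: any node in a rack other than the writer's.
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[remote_rack])
    replicas.append(second)
    # Third replica: a different node in that same remote rack.
    replicas.append(random.choice([n for n in nodes_by_rack[remote_rack] if n != second]))
    return replicas
```

This layout survives the loss of any single node or any single rack while keeping two of the three replica transfers within one rack's switch.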
Heartbeat
The NameNode detects DataNode failures through a heartbeat mechanism:
| Event | What happens |
|---|---|
| DataNode alive | Sends a periodic heartbeat (every 3 seconds by default) to the NameNode |
| Heartbeats stop | NameNode marks the DataNode as dead and stops routing requests to it |
| Under-replication detected | NameNode identifies blocks that now have fewer replicas than the configured minimum |
| Cluster rebalance | NameNode triggers re-replication of under-replicated blocks to other healthy DataNodes |
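The recovery rows of the table can be sketched as a scan over the block-to-replica map. This is a simplification under stated assumptions: the real NameNode tracks replica counts incrementally rather than by full scans, and these function names are illustrative:

```python
def find_under_replicated(block_replicas, live_nodes, target=3):
    """Return {block: missing_count} for blocks whose live replica
    count has fallen below the replication target."""
    under = {}
    for block, nodes in block_replicas.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target:
            under[block] = target - len(live)
    return under

def schedule_re_replication(block_replicas, live_nodes, target=3):
    """Pick healthy nodes not already holding the block as copy targets."""
    plan = {}
    for block, missing in find_under_replicated(block_replicas, live_nodes, target).items():
        current = set(block_replicas[block]) & live_nodes
        candidates = [n for n in sorted(live_nodes) if n not in current]
        plan[block] = candidates[:missing]
    return plan
```

Note that re-replication copies from a surviving replica to a new node; the dead DataNode is never involved.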
This is the same pattern GFS uses: the master monitors chunkservers via heartbeats and re-replicates chunks when a server is declared dead.
When discussing fault tolerance, distinguish between detection (heartbeats) and recovery (re-replication). The NameNode does not immediately re-replicate when a heartbeat is missed -- it waits for a configurable timeout (about 10.5 minutes by default) before declaring a node dead, to avoid unnecessary replication during transient network issues. This is a common follow-up question.
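The detection side can be sketched as follows. The timeout formula and the two configuration keys in the comments are HDFS's actual defaults; the class and method names are illustrative:

```python
import time

HEARTBEAT_INTERVAL = 3   # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL = 300   # dfs.namenode.heartbeat.recheck-interval (seconds)
# A DataNode is declared dead only after this much silence -- not on the
# first missed heartbeat -- to ride out transient network hiccups.
DEAD_TIMEOUT = 2 * RECHECK_INTERVAL + 10 * HEARTBEAT_INTERVAL  # 630 s

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # datanode id -> timestamp of last heartbeat

    def heartbeat(self, datanode, now=None):
        self.last_seen[datanode] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [dn for dn, t in self.last_seen.items() if now - t > DEAD_TIMEOUT]
```

With the defaults, a node that misses one or two heartbeats is left alone; only 10.5 minutes of silence triggers re-replication of its blocks.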
What happens when the NameNode fails?
The NameNode is a single point of failure (SPOF). If it goes down, no client can read, write, or list files -- the entire cluster is effectively offline. HDFS addresses this with metadata persistence and backup mechanisms.
FsImage and EditLog
The NameNode maintains two on-disk structures:
| Structure | Purpose |
|---|---|
| FsImage | A checkpoint (snapshot) of file system metadata at a specific point in time |
| EditLog | A write-ahead log recording every metadata transaction since the last FsImage was created |
At periodic intervals, FsImage and EditLog are merged to produce a new checkpoint, and the EditLog is cleared. On recovery, the NameNode loads the FsImage and replays the EditLog to reconstruct the latest state.
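The checkpoint-and-replay cycle can be illustrated with a toy metadata store. Real EditLog records are binary transactions with many operation types; this sketch models only file creation:

```python
class ToyNameNode:
    """Toy model of FsImage + EditLog recovery (not the real on-disk format)."""
    def __init__(self):
        self.fsimage = {}  # checkpointed namespace: path -> list of block ids
        self.editlog = []  # transactions since the last checkpoint

    def create_file(self, path, blocks):
        # Every mutation is appended to the write-ahead log first.
        self.editlog.append(("create", path, blocks))

    def checkpoint(self):
        # Merge: apply the log to the image, then clear the log.
        self.fsimage = self._replay(dict(self.fsimage), self.editlog)
        self.editlog = []

    def recover(self):
        # On restart: load the checkpoint, then replay the log on top of it.
        return self._replay(dict(self.fsimage), self.editlog)

    @staticmethod
    def _replay(namespace, log):
        for op, path, blocks in log:
            if op == "create":
                namespace[path] = blocks
        return namespace
```

The key property is that recovery time depends on the EditLog's length since the last checkpoint, which is why the periodic merge matters.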
Metadata backup
Disk failure on the NameNode is catastrophic -- without metadata, there is no way to reconstruct files from raw blocks scattered across DataNodes. HDFS provides two protection mechanisms:
| Mechanism | How it works | Trade-off |
|---|---|---|
| Multi-copy persistence | NameNode writes FsImage and EditLog to both a local disk and a remote NFS mount. Updates are synchronous and atomic. | Slightly reduces metadata transaction throughput, but HDFS workloads are data-intensive, not metadata-intensive |
| Secondary NameNode | A separate process that periodically merges FsImage with EditLog so the EditLog does not grow without bound | Does not provide failover -- its state always lags behind the primary. On primary failure, data loss is likely |
The Secondary NameNode is not a backup NameNode despite its misleading name. It only helps with checkpointing. If the primary NameNode dies, the Secondary NameNode's state is stale and data loss is nearly inevitable. For true high availability, you need HDFS HA.
A common interview mistake is confusing the Secondary NameNode with a standby/backup. The Secondary NameNode performs only one job: periodic checkpointing (merging FsImage + EditLog). It does not serve client requests and cannot seamlessly take over on failure. If asked about NameNode HA, skip straight to the Active/Standby architecture with ZooKeeper-based failover.
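For reference, automatic Active/Standby failover is enabled through configuration along these lines. This is an abbreviated sketch, not a complete HA setup: the nameservice name, NameNode ids, and host names are placeholders, and `ha.zookeeper.quorum` normally lives in core-site.xml rather than hdfs-site.xml:

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```

A ZooKeeper-based failover controller on each NameNode then elects the Active; the Standby stays hot by tailing the shared edit log.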