Fault Tolerance

In a cluster of thousands of commodity machines, hardware failure is not an exception -- it is a daily event. HDFS must tolerate DataNode crashes without data loss and NameNode crashes without losing the file-to-block mapping. These are two fundamentally different problems with different solutions.

Think first
HDFS must handle two fundamentally different types of failures: DataNode failures and NameNode failures. Which failure is more dangerous to the cluster, and why? Think about what is lost in each case and how recoverable it is.

How does HDFS handle DataNode failures?

Replication

Every block is replicated to multiple DataNodes (default: 3). When a DataNode dies, its blocks remain available on the surviving replicas. No data is lost, and reads continue without interruption.
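The read path this enables can be sketched in a few lines. The sketch below is illustrative, not the real HDFS client API: `replica_map`, `live_nodes`, and `fetch` are hypothetical stand-ins for the client's block-location metadata and transport layer.

```python
def read_block(block_id, replica_map, live_nodes, fetch):
    """Try each replica in turn; one dead DataNode just means moving on.

    replica_map: block id -> ordered list of DataNodes holding a copy
    live_nodes:  set of DataNodes currently considered alive
    fetch(node, block_id): hypothetical transport call; raises
                           ConnectionError if the node is unreachable
    """
    for node in replica_map[block_id]:
        if node not in live_nodes:
            continue  # skip nodes already marked dead by the NameNode
        try:
            return fetch(node, block_id)
        except ConnectionError:
            continue  # this replica failed mid-read; fall through to the next
    raise IOError(f"all replicas of {block_id} unavailable")
```

With three replicas, a read only fails if all three DataNodes are simultaneously unreachable.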

Heartbeat

The NameNode detects DataNode failures through a heartbeat mechanism:

| Event | What happens |
| --- | --- |
| DataNode alive | Sends a periodic heartbeat (every few seconds) to the NameNode |
| Heartbeats stop | NameNode marks the DataNode as dead and stops routing requests to it |
| Under-replication detected | NameNode identifies blocks that now have fewer replicas than the configured minimum |
| Cluster rebalance | NameNode triggers re-replication of under-replicated blocks to other healthy DataNodes |

This is the same pattern GFS uses: the master monitors chunkservers via heartbeats and re-replicates chunks when a server is declared dead.

Interview angle

When discussing fault tolerance, distinguish between detection (heartbeats) and recovery (re-replication). The NameNode does not immediately re-replicate when a heartbeat is missed -- it waits for a configurable timeout to avoid unnecessary replication during transient network issues. This is a common follow-up question.
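The detection/recovery split above can be sketched as follows. This is a minimal illustration, not HDFS source code; the interval and timeout values are assumptions (real clusters derive the dead-node timeout from configurable heartbeat settings), and the class and field names are invented for the example.

```python
import time

HEARTBEAT_INTERVAL = 3        # seconds between heartbeats (assumed value)
DEAD_NODE_TIMEOUT = 10 * 60   # grace period before declaring a node dead (assumed)
MIN_REPLICAS = 3              # default HDFS replication factor

class NameNodeMonitor:
    """Sketch of heartbeat-based detection and re-replication planning."""

    def __init__(self):
        self.last_heartbeat = {}   # DataNode id -> timestamp of last heartbeat
        self.block_replicas = {}   # block id -> set of DataNode ids holding it

    def on_heartbeat(self, node_id, now=None):
        self.last_heartbeat[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Detection: a single missed heartbeat is NOT enough. The node is only
        # declared dead after the full timeout, so transient network issues
        # do not trigger a storm of unnecessary re-replication.
        now = now if now is not None else time.time()
        return {n for n, t in self.last_heartbeat.items()
                if now - t > DEAD_NODE_TIMEOUT}

    def under_replicated(self, now=None):
        # Recovery planning: for each block, count only replicas on live
        # nodes and report how many new copies must be created.
        dead = self.dead_nodes(now)
        deficits = []
        for block, replicas in self.block_replicas.items():
            live = replicas - dead
            if len(live) < MIN_REPLICAS:
                deficits.append((block, MIN_REPLICAS - len(live)))
        return deficits
```

The key point for interviews is visible in `dead_nodes`: detection is deliberately conservative (wait out the timeout), while recovery (`under_replicated`) is a separate step driven by the resulting replica counts.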

Think first
The NameNode's FsImage + EditLog is analogous to GFS's checkpoint + operation log. The Secondary NameNode has a misleading name -- does it act as a hot standby that takes over on failure? What would actually be needed for true high availability?

What happens when the NameNode fails?

The NameNode is a single point of failure (SPOF). If it goes down, no client can read, write, or list files -- the entire cluster is effectively offline. HDFS addresses this with metadata persistence and backup mechanisms.

FsImage and EditLog

The NameNode maintains two on-disk structures:

| Structure | Purpose |
| --- | --- |
| FsImage | A checkpoint (snapshot) of file system metadata at a specific point in time |
| EditLog | A write-ahead log recording every metadata transaction since the last FsImage was created |

At periodic intervals, FsImage and EditLog are merged to produce a new checkpoint, and the EditLog is cleared. On recovery, the NameNode loads the FsImage and replays the EditLog to reconstruct the latest state.
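The checkpoint-plus-log recovery pattern can be sketched with a toy namespace. This is an analogy, not the real FsImage/EditLog binary formats: the JSON files, the `create`/`delete` operation types, and the function names are all invented for illustration.

```python
import json
import os

def write_checkpoint(fsimage_path, namespace):
    # FsImage analogue: a full snapshot of the metadata at one point in time.
    with open(fsimage_path, "w") as f:
        json.dump(namespace, f)

def append_edit(editlog_path, op):
    # EditLog analogue: an append-only write-ahead log, one transaction per
    # line, recording every change made since the last checkpoint.
    with open(editlog_path, "a") as f:
        f.write(json.dumps(op) + "\n")

def recover(fsimage_path, editlog_path):
    # On restart: load the last checkpoint, then replay every logged
    # transaction on top of it to reconstruct the latest state.
    namespace = {}
    if os.path.exists(fsimage_path):
        with open(fsimage_path) as f:
            namespace = json.load(f)
    if os.path.exists(editlog_path):
        with open(editlog_path) as f:
            for line in f:
                op = json.loads(line)
                if op["type"] == "create":
                    namespace[op["path"]] = op["blocks"]
                elif op["type"] == "delete":
                    namespace.pop(op["path"], None)
    return namespace
```

Checkpointing is exactly `recover` followed by `write_checkpoint` and truncating the log, which is why the EditLog stays short and restarts stay fast.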

Metadata backup

Disk failure on the NameNode is catastrophic -- without metadata, there is no way to reconstruct files from raw blocks scattered across DataNodes. HDFS provides two protection mechanisms:

| Mechanism | How it works | Trade-off |
| --- | --- | --- |
| Multi-copy persistence | NameNode writes FsImage and EditLog to both a local disk and a remote NFS mount. Updates are synchronous and atomic. | Slightly reduces metadata transaction throughput, but HDFS workloads are data-intensive, not metadata-intensive |
| Secondary NameNode | A separate process that periodically merges FsImage with EditLog to prevent the EditLog from growing unboundedly | Does not provide failover -- its state always lags behind the primary. On primary failure, data loss is likely |
Warning

The Secondary NameNode is not a backup NameNode despite its misleading name. It only helps with checkpointing. If the primary NameNode dies, the Secondary NameNode's state is stale and data loss is nearly inevitable. For true high availability, you need HDFS HA.

Interview angle

A common interview mistake is confusing the Secondary NameNode with a standby/backup. The Secondary NameNode performs only one job: periodic checkpointing (merging FsImage + EditLog). It does not serve client requests and cannot seamlessly take over on failure. If asked about NameNode HA, skip straight to the Active/Standby architecture with ZooKeeper-based failover.

Quiz
What would happen if the NameNode's disk fails and there is no remote backup of FsImage and EditLog?