Fault Tolerance

In a cluster of thousands of commodity machines, hardware failure is not an exception -- it is a daily event. HDFS must tolerate DataNode crashes without data loss and NameNode crashes without losing the file-to-block mapping. These are two fundamentally different problems with different solutions.

Think first
HDFS must handle two fundamentally different types of failures: DataNode failures and NameNode failures. Which failure is more dangerous to the cluster, and why? Think about what is lost in each case and how recoverable it is.

How does HDFS handle DataNode failures?

Replication

Every block is replicated to multiple DataNodes (default: 3). When a DataNode dies, its blocks remain available on the surviving replicas. No data is lost, and reads continue without interruption.
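The read path this enables can be sketched in a few lines. The sketch below is illustrative, not the real HDFS client API: `replica_map`, `live_nodes`, and `fetch` are hypothetical stand-ins for the client's block-location metadata and transport layer.

```python
def read_block(block_id, replica_map, live_nodes, fetch):
    """Try each replica in turn; one dead DataNode just means moving on.

    replica_map: block id -> ordered list of DataNodes holding a copy
    live_nodes:  set of DataNodes currently considered alive
    fetch(node, block_id): hypothetical transport call; raises
                           ConnectionError if the node is unreachable
    """
    for node in replica_map[block_id]:
        if node not in live_nodes:
            continue  # skip nodes already marked dead by the NameNode
        try:
            return fetch(node, block_id)
        except ConnectionError:
            continue  # this replica failed mid-read; fall through to the next
    raise IOError(f"all replicas of {block_id} unavailable")
```

With three replicas, a read only fails if all three DataNodes are simultaneously unreachable.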

Heartbeat

The NameNode detects DataNode failures through a heartbeat mechanism:

| Event | What happens |
| --- | --- |
| DataNode alive | Sends a periodic heartbeat (every few seconds) to the NameNode |
| Heartbeats stop | NameNode marks the DataNode as dead and stops routing requests to it |
| Under-replication detected | NameNode identifies blocks that now have fewer replicas than the configured minimum |
| Cluster rebalance | NameNode triggers re-replication of under-replicated blocks to other healthy DataNodes |

This is the same pattern GFS uses: the master monitors chunkservers via heartbeats and re-replicates chunks when a server is declared dead.

Interview angle

When discussing fault tolerance, distinguish between detection (heartbeats) and recovery (re-replication). The NameNode does not immediately re-replicate when a heartbeat is missed -- it waits for a configurable timeout to avoid unnecessary replication during transient network issues. This is a common follow-up question.
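The detection/recovery split above can be sketched as follows. This is a minimal illustration, not HDFS source code; the interval and timeout values are assumptions (real clusters derive the dead-node timeout from configurable heartbeat settings), and the class and field names are invented for the example.

```python
import time

HEARTBEAT_INTERVAL = 3        # seconds between heartbeats (assumed value)
DEAD_NODE_TIMEOUT = 10 * 60   # grace period before declaring a node dead (assumed)
MIN_REPLICAS = 3              # default HDFS replication factor

class NameNodeMonitor:
    """Sketch of heartbeat-based detection and re-replication planning."""

    def __init__(self):
        self.last_heartbeat = {}   # DataNode id -> timestamp of last heartbeat
        self.block_replicas = {}   # block id -> set of DataNode ids holding it

    def on_heartbeat(self, node_id, now=None):
        self.last_heartbeat[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Detection: a single missed heartbeat is NOT enough. The node is only
        # declared dead after the full timeout, so transient network issues
        # do not trigger a storm of unnecessary re-replication.
        now = now if now is not None else time.time()
        return {n for n, t in self.last_heartbeat.items()
                if now - t > DEAD_NODE_TIMEOUT}

    def under_replicated(self, now=None):
        # Recovery planning: for each block, count only replicas on live
        # nodes and report how many new copies must be created.
        dead = self.dead_nodes(now)
        deficits = []
        for block, replicas in self.block_replicas.items():
            live = replicas - dead
            if len(live) < MIN_REPLICAS:
                deficits.append((block, MIN_REPLICAS - len(live)))
        return deficits
```

The key point for interviews is visible in `dead_nodes`: detection is deliberately conservative (wait out the timeout), while recovery (`under_replicated`) is a separate step driven by the resulting replica counts.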

Think first
The NameNode's FsImage + EditLog is analogous to GFS's checkpoint + operation log. The Secondary NameNode has a misleading name -- does it act as a hot standby that takes over on failure? What would actually be needed for true high availability?

What happens when the NameNode fails?

The NameNode is a single point of failure (SPOF). If it goes down, no client can read, write, or list files -- the entire cluster is effectively offline. HDFS addresses this with metadata persistence and backup mechanisms.

FsImage and EditLog

The NameNode maintains two on-disk structures:

| Structure | Purpose |
| --- | --- |
| FsImage | A checkpoint (snapshot) of file system metadata at a specific point in time |
| EditLog | A write-ahead log recording every metadata transaction since the last FsImage was created |

At periodic intervals, FsImage and EditLog are merged to produce a new checkpoint, and the EditLog is cleared. On recovery, the NameNode loads the FsImage and replays the EditLog to reconstruct the latest state.
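The checkpoint-plus-log recovery pattern can be sketched with a toy namespace. This is an analogy, not the real FsImage/EditLog binary formats: the JSON files, the `create`/`delete` operation types, and the function names are all invented for illustration.

```python
import json
import os

def write_checkpoint(fsimage_path, namespace):
    # FsImage analogue: a full snapshot of the metadata at one point in time.
    with open(fsimage_path, "w") as f:
        json.dump(namespace, f)

def append_edit(editlog_path, op):
    # EditLog analogue: an append-only write-ahead log, one transaction per
    # line, recording every change made since the last checkpoint.
    with open(editlog_path, "a") as f:
        f.write(json.dumps(op) + "\n")

def recover(fsimage_path, editlog_path):
    # On restart: load the last checkpoint, then replay every logged
    # transaction on top of it to reconstruct the latest state.
    namespace = {}
    if os.path.exists(fsimage_path):
        with open(fsimage_path) as f:
            namespace = json.load(f)
    if os.path.exists(editlog_path):
        with open(editlog_path) as f:
            for line in f:
                op = json.loads(line)
                if op["type"] == "create":
                    namespace[op["path"]] = op["blocks"]
                elif op["type"] == "delete":
                    namespace.pop(op["path"], None)
    return namespace
```

Checkpointing is exactly `recover` followed by `write_checkpoint` and truncating the log, which is why the EditLog stays short and restarts stay fast.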

Metadata backup

Disk failure on the NameNode is catastrophic -- without metadata, there is no way to reconstruct files from raw blocks scattered across DataNodes. HDFS provides two protection mechanisms:

| Mechanism | How it works | Trade-off |
| --- | --- | --- |
| Multi-copy persistence | NameNode writes FsImage and EditLog to both a local disk and a remote NFS mount. Updates are synchronous and atomic. | Slightly reduces metadata transaction throughput, but HDFS workloads are data-intensive, not metadata-intensive |
| Secondary NameNode | A separate process that periodically merges FsImage with EditLog to prevent the EditLog from growing unboundedly | Does not provide failover -- its state always lags behind the primary. On primary failure, data loss is likely |
Warning

The Secondary NameNode is not a backup NameNode despite its misleading name. It only helps with checkpointing. If the primary NameNode dies, the Secondary NameNode's state is stale and data loss is nearly inevitable. For true high availability, you need HDFS HA.

Interview angle

A common interview mistake is confusing the Secondary NameNode with a standby/backup. The Secondary NameNode performs only one job: periodic checkpointing (merging FsImage + EditLog). It does not serve client requests and cannot seamlessly take over on failure. If asked about NameNode HA, skip straight to the Active/Standby architecture with ZooKeeper-based failover.

Quiz
What would happen if the NameNode's disk fails and there is no remote backup of FsImage and EditLog?