HDFS High Availability (HA)
A cold-start recovery of a NameNode on a large cluster can take 30 minutes or more -- loading the FsImage, replaying the EditLog, and receiving block reports from every DataNode. During that window, the entire cluster is offline. Planned maintenance creates the same problem. HDFS High Availability (HA) eliminates this downtime by keeping a hot standby ready to take over in seconds.
HDFS high availability architecture
HDFS HA runs two (or more) NameNodes in an Active-Standby configuration:
| Role | Responsibility |
|---|---|
| Active NameNode | Handles all client operations (reads, writes, metadata) |
| Standby NameNode | Maintains synchronized state; ready to take over on failure |
To keep the Standby in sync, HA introduces several architectural changes to HDFS:
- Shared EditLog: Both NameNodes access a highly available shared EditLog (via NFS or a Quorum Journal Manager). The Standby continuously reads and replays new entries as the Active writes them.
- Block reports to all NameNodes: DataNodes send block reports to every NameNode, because block mappings live in memory and are not persisted to the shared EditLog.
- Transparent client failover: The HDFS URI maps to a logical hostname that resolves to multiple NameNode addresses. The client library tries each address until one responds -- failover is invisible to applications.
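The failover logic above can be sketched in a few lines. This is not the real Hadoop client library (which configures a failover proxy provider per nameservice); the class and function names here are illustrative assumptions, and the addresses are hard-coded stand-ins for the configured NameNode list.

```python
# Hedged sketch of transparent client failover: try each NameNode address
# behind a logical name until one accepts the call. Illustrative only.

class StandbyException(Exception):
    """Raised by a NameNode that is in Standby state."""

def resolve(logical_name):
    # In real HDFS, the nameservice config lists NameNode IDs, each mapped
    # to an RPC address. Here we hard-code two hypothetical addresses.
    return ["nn1.example.com:8020", "nn2.example.com:8020"]

def call_with_failover(logical_name, rpc):
    """Try each NameNode in turn; only the Active will serve the request."""
    last_err = None
    for addr in resolve(logical_name):
        try:
            return rpc(addr)
        except StandbyException as err:   # Standby refuses; try the next one
            last_err = err
    raise last_err

# Simulated cluster state: nn1 is Standby, nn2 is Active.
def fake_rpc(addr):
    if addr.startswith("nn1"):
        raise StandbyException(addr)
    return f"served by {addr}"

print(call_with_failover("mycluster", fake_rpc))  # served by nn2.example.com:8020
```

Because the retry happens inside the client library, the application only ever sees the logical name -- exactly why failover is invisible to it.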
QJM (Quorum Journal Manager)
The QJM is the preferred shared-storage mechanism for the EditLog. It runs as a group of journal nodes (typically three), and every edit must be written to a quorum (majority) before it is considered committed.
| Property | Detail |
|---|---|
| Fault tolerance | Tolerates loss of (N-1)/2 journal nodes (1 out of 3 in a typical setup) |
| Consistency | Quorum writes ensure no edit is lost if a minority of journal nodes fail |
| Relationship to ZooKeeper | Similar quorum-based approach, but the QJM does not use ZooKeeper internally |
Note: HDFS HA does use ZooKeeper for Active NameNode election (see below). The QJM handles only EditLog replication.
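The commit rule in the table can be made concrete with a toy quorum write. This is a minimal sketch under simplifying assumptions (synchronous appends, no epoch numbers or recovery protocol, all names invented), not the actual QJM implementation.

```python
# Toy quorum write: an edit commits once a majority of journal nodes ack it,
# so losing a minority (1 of 3) does not lose the edit. Illustrative only.

class JournalNode:
    def __init__(self, up=True):
        self.up, self.log = up, []
    def append(self, edit):
        if not self.up:
            raise ConnectionError("journal node down")
        self.log.append(edit)

def quorum_write(edit, journal_nodes):
    """Append `edit` everywhere reachable; commit only on majority ack."""
    acks = 0
    for jn in journal_nodes:
        try:
            jn.append(edit)
            acks += 1
        except ConnectionError:
            pass                          # minority failures are tolerated
    needed = len(journal_nodes) // 2 + 1  # majority of N
    return acks >= needed                 # True = committed

jns = [JournalNode(), JournalNode(), JournalNode(up=False)]  # 1 of 3 down
print(quorum_write("mkdir /data", jns))  # True: 2 acks meet the quorum of 2
```

With two of three nodes down the same call returns False: the write is rejected rather than silently under-replicated, which is where the consistency guarantee in the table comes from.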
Because the Standby NameNode has the latest EditLog entries and an up-to-date block mapping in memory, failover takes seconds. In practice, the total failover time is closer to a minute because the system must be conservative in confirming the Active has actually failed.
If all Standbys happen to be down when the Active fails, an administrator can still cold-start a Standby -- no worse than the non-HA case.
ZooKeeper and the Failover Controller
Each NameNode runs a ZKFailoverController (ZKFC) -- a lightweight ZooKeeper client that:
- Monitors the local NameNode's health via heartbeats
- Maintains a ZooKeeper session and ephemeral znode representing its NameNode
- Triggers failover when the Active NameNode becomes unresponsive
HDFS HA is a textbook example of the Leader and Follower pattern with ZooKeeper-based leader election. Walk through it in order: Active NameNode holds a ZooKeeper ephemeral znode, Standby watches it, znode disappears on failure, Standby claims leadership. Then immediately bring up fencing -- because leader election alone does not prevent split-brain.
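The four-step walkthrough above can be simulated without a real ZooKeeper. The in-memory stand-in below only models the property that matters here -- ephemeral znodes vanish when their owning session ends; the real ZKFC uses the ZooKeeper client API, and the lock path shown is an assumption for illustration.

```python
# Toy model of ephemeral-znode leader election:
# claim the znode -> hold leadership -> session dies -> znode vanishes ->
# the watching Standby claims it. Not a real ZooKeeper client.

class FakeZooKeeper:
    def __init__(self):
        self.ephemeral = {}                # path -> owning session

    def create_ephemeral(self, path, session):
        if path in self.ephemeral:
            return False                   # someone else is already Active
        self.ephemeral[path] = session
        return True

    def expire_session(self, session):
        # ZooKeeper deletes a session's ephemeral znodes when it expires.
        self.ephemeral = {p: s for p, s in self.ephemeral.items()
                          if s != session}

zk = FakeZooKeeper()
LOCK = "/hadoop-ha/mycluster/lock"         # illustrative path

assert zk.create_ephemeral(LOCK, "zkfc-nn1") is True    # nn1 becomes Active
assert zk.create_ephemeral(LOCK, "zkfc-nn2") is False   # nn2 stays Standby
zk.expire_session("zkfc-nn1")                           # Active's session dies
assert zk.create_ephemeral(LOCK, "zkfc-nn2") is True    # Standby takes over
print("election sequence holds")
```

Note what the simulation deliberately omits: nothing here stops the old Active from continuing to serve clients after its session expires. That gap is exactly why fencing is required.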
Failover and fencing
Graceful failover
For planned maintenance, an administrator manually initiates a failover. The ZKFC arranges an orderly transition: the Active NameNode finishes in-flight operations, relinquishes its role, and the Standby takes over.
Ungraceful failover
When the Active NameNode crashes or a network partition occurs, the ZKFC on the Standby detects the failure and initiates automatic failover. The danger here is split-brain: the old Active might still be running (e.g., behind a slow network) and believe it is still the leader.
Fencing
Fencing prevents the old Active from corrupting shared state. HDFS uses two techniques:
| Technique | How it works |
|---|---|
| Resource fencing | Revoke the old Active's access to the shared storage directory (e.g., vendor-specific NFS commands) or disable its network port via remote management |
| Node fencing (STONITH) | Power off or hard-reset the old Active node entirely -- "Shoot The Other Node In The Head" |
Without fencing, two NameNodes can simultaneously believe they are Active and issue conflicting metadata updates, corrupting the file system. Fencing is not optional in a production HA deployment -- it is a hard requirement.
Fencing is a must-know concept for any system with leader election. The sequence is: (1) detect failure, (2) fence the old leader so it cannot cause damage, (3) promote the new leader. If you skip step 2, you risk split-brain. STONITH is the nuclear option -- it guarantees the old leader is dead by physically powering it off.
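The three-step sequence can be sketched as code. HDFS lets you configure an ordered list of fencing methods that are tried until one succeeds; the sketch below mirrors that idea, but every function and field name is an illustrative assumption, not the real ZKFC implementation.

```python
# Sketch of detect -> fence -> promote, with fencing methods tried in order.
# Promotion is refused unless some method confirms the old leader is fenced.

def failover(old_active, standby, fencing_methods):
    # Step 1 (detection) has already happened, e.g. the ephemeral znode
    # disappeared. Step 2: fence the old leader before doing anything else.
    for fence in fencing_methods:
        if fence(old_active):
            break
    else:
        # Skipping step 2 risks split-brain, so fail loudly instead.
        raise RuntimeError("all fencing methods failed; refusing to promote")
    # Step 3: only now is it safe to promote the standby.
    standby["state"] = "active"
    return standby

def ssh_fence(node):
    # Resource-fencing stand-in: log in and kill the process -- works only
    # if the node is reachable.
    return node.get("reachable", False)

def stonith(node):
    # Node fencing: hard power-off via remote management always succeeds.
    node["powered"] = False
    return True

nn1 = {"name": "nn1", "state": "active", "reachable": False, "powered": True}
nn2 = {"name": "nn2", "state": "standby"}
failover(nn1, nn2, [ssh_fence, stonith])
print(nn2["state"], nn1["powered"])  # active False
```

The ordering matters: cheaper, reversible methods come first, and STONITH sits last as the guaranteed fallback -- matching the "nuclear option" description above.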
To learn more about automatic failover, see the Apache documentation.