
HDFS High Availability (HA)

A cold-start recovery of a NameNode on a large cluster can take 30 minutes or more -- loading the FsImage, replaying the EditLog, and receiving block reports from every DataNode. During that window, the entire cluster is offline. Planned maintenance creates the same problem. HDFS High Availability (HA) eliminates this downtime by keeping a hot standby ready to take over in seconds.

Think first
A cold-start NameNode recovery can take 30+ minutes. What would you need to keep a second NameNode 'warm' and ready to take over in seconds? Consider what state must be synchronized continuously.

HDFS high availability architecture

HDFS HA runs two (or more) NameNodes in an Active-Standby configuration:

| Role | Responsibility |
| --- | --- |
| Active NameNode | Handles all client operations (reads, writes, metadata) |
| Standby NameNode | Maintains synchronized state; ready to take over on failure |

For the Standby to stay in sync, HDFS made several architectural changes:

  • Shared EditLog: Both NameNodes access a highly available shared EditLog (via NFS or a Quorum Journal Manager). The Standby continuously reads and replays new entries as the Active writes them.
  • Block reports to all NameNodes: DataNodes send block reports to every NameNode, because block mappings live in memory and are not persisted to the shared EditLog.
  • Transparent client failover: The HDFS URI maps to a logical hostname that resolves to multiple NameNode addresses. The client library tries each address until one responds -- failover is invisible to applications.
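The client-side failover in the last bullet can be sketched as a simple retry loop. This is a minimal simulation, not the real HDFS client API: the nameservice mapping, `rpc`, and `call_with_failover` are all illustrative names, and the fake `rpc` arbitrarily treats `nn1` as the Standby.

```python
# Hypothetical sketch: a logical nameservice maps to several NameNode
# addresses, and the client tries each until one accepts the request.

NAMESERVICE = {"mycluster": ["nn1.example.com:8020", "nn2.example.com:8020"]}

class StandbyError(Exception):
    """Raised by a NameNode that is currently in Standby state."""

def rpc(address, operation):
    # Placeholder for a real RPC; here nn1 pretends to be the Standby.
    if address.startswith("nn1"):
        raise StandbyError(address)
    return f"{operation} handled by {address}"

def call_with_failover(nameservice, operation):
    errors = []
    for address in NAMESERVICE[nameservice]:
        try:
            return rpc(address, operation)  # first Active NameNode wins
        except StandbyError as e:
            errors.append(e)                # Standby refused; try the next one
    raise RuntimeError(f"no Active NameNode found: {errors}")

print(call_with_failover("mycluster", "getFileInfo('/data')"))
```

The application only ever sees the logical name `mycluster`; which physical NameNode actually answered is invisible to it, which is what makes failover transparent.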

QJM (Quorum Journal Manager)

The QJM is the preferred shared-storage mechanism for the EditLog. It runs as a group of journal nodes (typically three), and every edit must be written to a quorum (majority) before it is considered committed.

| Property | Detail |
| --- | --- |
| Fault tolerance | Tolerates loss of (N-1)/2 journal nodes (1 out of 3 in a typical setup) |
| Consistency | Quorum writes ensure no edit is lost if a minority of journal nodes fail |
| Relationship to ZooKeeper | Similar quorum-based approach, but the QJM does not use ZooKeeper internally |

Note: HDFS HA does use ZooKeeper for Active NameNode election (see below). The QJM handles only EditLog replication.
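The majority rule behind quorum writes reduces to a few lines. This is a sketch of the commit condition only, not the real QJM protocol; `is_committed` is an illustrative name.

```python
# An edit counts as committed once a majority of journal nodes have
# acknowledged it, so losing any minority of nodes loses no committed edits.

def is_committed(acks, total_nodes=3):
    majority = total_nodes // 2 + 1   # 2 of 3, 3 of 5, ...
    return acks >= majority

# 3 journal nodes, 1 down: 2 acks still reach the majority of 2.
print(is_committed(2))   # True
# 2 nodes down: 1 ack is not a majority -- the write cannot commit.
print(is_committed(1))   # False
```

This is why the table says the QJM tolerates (N-1)/2 failures: with any minority down, a majority can still acknowledge every write.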

Because the Standby NameNode has the latest EditLog entries and an up-to-date block mapping in memory, failover takes seconds. In practice, the total failover time is closer to a minute because the system must be conservative in confirming the Active has actually failed.

If all Standbys happen to be down when the Active fails, an administrator can still cold-start a Standby -- no worse than the non-HA case.

ZooKeeper and the Failover Controller

Each NameNode runs a ZKFailoverController (ZKFC) -- a lightweight ZooKeeper client that:

  1. Monitors the local NameNode's health via heartbeats
  2. Maintains a ZooKeeper session and ephemeral znode representing its NameNode
  3. Triggers failover when the Active NameNode becomes unresponsive
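The election mechanism behind step 2 can be simulated with an in-memory stand-in for ZooKeeper. This is a sketch under simplifying assumptions: a real ZKFC uses the ZooKeeper client library, and `FakeZooKeeper` and its methods are invented here for illustration (the lock path shown follows the `/hadoop-ha/<nameservice>/...` convention).

```python
# Ephemeral-znode leader election, simulated in memory.

class FakeZooKeeper:
    def __init__(self):
        self.znodes = {}          # path -> owning session

    def create_ephemeral(self, path, session):
        if path in self.znodes:
            return False          # another session already holds the znode
        self.znodes[path] = session
        return True

    def session_expired(self, session):
        # ZooKeeper deletes a session's ephemeral znodes automatically.
        self.znodes = {p: s for p, s in self.znodes.items() if s != session}

zk = FakeZooKeeper()
LOCK = "/hadoop-ha/mycluster/ActiveStandbyElectorLock"

# Both ZKFCs race to create the lock znode; exactly one wins, and its
# NameNode becomes Active. The loser watches the znode.
print(zk.create_ephemeral(LOCK, "zkfc-nn1"))  # True  -> nn1 is Active
print(zk.create_ephemeral(LOCK, "zkfc-nn2"))  # False -> nn2 stays Standby

# If nn1's ZKFC loses its ZooKeeper session (crash, partition), the
# ephemeral znode vanishes and the watching Standby can claim leadership.
zk.session_expired("zkfc-nn1")
print(zk.create_ephemeral(LOCK, "zkfc-nn2"))  # True  -> nn2 promotes
```

The ephemeral property is the key design choice: leadership is tied to a live session, so a dead leader releases the lock without any explicit cleanup.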

Interview angle

HDFS HA is a textbook example of the Leader and Follower pattern with ZooKeeper-based leader election. Walk through it in order: Active NameNode holds a ZooKeeper ephemeral znode, Standby watches it, znode disappears on failure, Standby claims leadership. Then immediately bring up fencing -- because leader election alone does not prevent split-brain.

Think first
Leader election alone is not enough for safe failover. If the Active NameNode appears down (missed heartbeats) and the Standby promotes itself, but the old Active is actually still running behind a slow network, two NameNodes think they are the leader. How do you prevent this split-brain?

Failover and fencing

Graceful failover

For planned maintenance, an administrator manually initiates a failover. The ZKFC arranges an orderly transition: the Active NameNode finishes in-flight operations, relinquishes its role, and the Standby takes over.

Ungraceful failover

When the Active NameNode crashes or a network partition occurs, the ZKFC on the Standby detects the failure and initiates automatic failover. The danger here is split-brain: the old Active might still be running (e.g., behind a slow network) and believe it is still the leader.

Fencing

Fencing prevents the old Active from corrupting shared state. HDFS uses two techniques:

| Technique | How it works |
| --- | --- |
| Resource fencing | Revoke the old Active's access to the shared storage directory (e.g., vendor-specific NFS commands) or disable its network port via remote management |
| Node fencing (STONITH) | Power off or hard-reset the old Active node entirely -- "Shoot The Other Node In The Head" |
Warning

Without fencing, two NameNodes can simultaneously believe they are Active and issue conflicting metadata updates, corrupting the file system. Fencing is not optional in a production HA deployment -- it is a hard requirement.

Interview angle

Fencing is a must-know concept for any system with leader election. The sequence is: (1) detect failure, (2) fence the old leader so it cannot cause damage, (3) promote the new leader. If you skip step 2, you risk split-brain. STONITH is the nuclear option -- it guarantees the old leader is dead by physically powering it off.
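The detect-fence-promote sequence can be made concrete with a short sketch. The ordered list of fencing methods mirrors how HDFS tries configured fencing methods in turn (the `dfs.ha.fencing.methods` setting), but every function and data structure here is illustrative, not the real implementation.

```python
# Failover sequence: (1) detect, (2) fence, (3) promote.
# Promotion is refused unless at least one fencing method succeeds.

def failover(old_active, new_standby, fencing_methods):
    # 1. Detection happened upstream (missed heartbeats / vanished znode).
    # 2. Fence: at least one method must confirm the old Active is harmless.
    fenced = any(fence(old_active) for fence in fencing_methods)
    if not fenced:
        raise RuntimeError("refusing to promote: old Active not fenced")
    # 3. Only now is it safe to promote the Standby.
    new_standby["state"] = "active"
    return new_standby

def revoke_storage_access(node):   # resource fencing (e.g. NFS revoke)
    node["storage"] = False
    return True

def stonith(node):                 # node fencing: power the node off
    node["power"] = False
    return True

nn1 = {"name": "nn1", "state": "active", "storage": True, "power": True}
nn2 = {"name": "nn2", "state": "standby"}

# Resource fencing is tried first; STONITH is the fallback of last resort.
failover(nn1, nn2, [revoke_storage_access, stonith])
print(nn2["state"])  # active -- promoted only after nn1 was fenced
```

Note the ordering: if fencing fails, the Standby stays Standby. Skipping step 2 is exactly the split-brain scenario the warning above describes.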

To learn more about automatic failover, see the Apache documentation.

Quiz
What would happen if HDFS HA performed failover without fencing the old Active NameNode?