9 Heartbeat

A server in your cluster just stopped responding. Is it dead? Is the network down? Is it just slow? And critically -- how long do you wait before you start moving its work to other nodes?

Think first
A server in your cluster stopped responding. You can't tell if it's dead, partitioned, or just slow. How would you design a failure detection mechanism, and what is the fundamental trade-off in choosing a timeout value?

Background

In a distributed system, nodes need to know which other nodes are alive. This sounds trivial, but it's one of the fundamental challenges of distributed computing. You can't simply "check" if a remote node is alive -- all you can do is send it a message and see if it responds. If it doesn't respond, you don't know whether the node is dead, the network is partitioned, or the response is just delayed.

Despite this uncertainty, the system must make a decision. Waiting too long means requests pile up on a dead node. Reacting too quickly means you might replace a node that's just experiencing a brief GC pause.

Definition

Each server periodically sends a heartbeat -- a small "I'm alive" message -- to a monitoring server or to other nodes in the system. If no heartbeat is received within a timeout period, the node is presumed dead.

How it works

Centralized heartbeat (leader-based systems):

  1. Each worker node sends a heartbeat to the leader/master every few seconds
  2. The leader tracks the last heartbeat time for each node
  3. If a node misses heartbeats beyond a configured timeout, the leader marks it as dead
  4. The leader stops routing requests to the dead node and initiates recovery
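The leader's bookkeeping in steps 1-3 can be sketched as a small monitor that records the last heartbeat time per node and reports which nodes have been silent past the timeout (class and method names here are illustrative, not from any particular system):

```python
import time

class HeartbeatMonitor:
    """Leader-side failure detector: tracks last heartbeat per node."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def record_heartbeat(self, node_id, now=None):
        # Called whenever a heartbeat arrives from a worker.
        self.last_seen[node_id] = now if now is not None else time.monotonic()

    def dead_nodes(self, now=None):
        # A node is *presumed* dead if its last heartbeat is older than
        # the timeout -- the leader cannot distinguish dead from slow.
        now = now if now is not None else time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("worker-1", now=100.0)
monitor.record_heartbeat("worker-2", now=105.0)
print(monitor.dead_nodes(now=112.0))  # worker-1: silent for 12s > 10s timeout
```

In a real leader, `dead_nodes` would run on a periodic sweep, and a positive result would trigger step 4: stop routing to the node and start recovery.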

Decentralized heartbeat (peer-to-peer systems):

  1. Each node sends heartbeats to a randomly selected set of peers
  2. Nodes share heartbeat information through a gossip protocol
  3. If a node hasn't been heard from (directly or via gossip) beyond a timeout, it's suspected dead
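One common way to implement step 2 is for each node to keep a table of monotonically increasing heartbeat counters and merge tables received from peers, keeping the freshest counter per node. A minimal sketch of that merge step (the data layout is illustrative, not taken from any specific gossip implementation):

```python
def merge_gossip(local, remote, now):
    """Merge a peer's heartbeat counters into our local table.

    local:  node_id -> (counter, local_time_we_last_learned_something)
    remote: node_id -> counter, as received from a gossiping peer
    """
    for node, counter in remote.items():
        known_counter, _ = local.get(node, (-1, now))
        if counter > known_counter:
            # Newer information: record the counter and when *we* learned it.
            # A node is later suspected dead if its entry goes stale.
            local[node] = (counter, now)
    return local

# Node A's current view, plus a table gossiped over from a peer
local = {"a": (5, 100.0), "b": (3, 90.0)}
remote = {"b": 7, "c": 1}
merged = merge_gossip(local, remote, now=120.0)
print(merged)  # b refreshed to counter 7; c newly learned
```

Step 3 then falls out naturally: a node whose entry hasn't been refreshed, directly or via any peer, within the timeout is suspected dead.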

The timeout dilemma

Timeout too short:

  - Healthy but slow nodes get marked as dead
  - Unnecessary data migration and recovery work
  - "Flapping" -- nodes rapidly alternating between alive and dead states

Timeout too long:

  - Dead nodes continue receiving requests
  - System performance degrades as work piles up
  - Slow response to actual failures

There's no perfect timeout value. It depends on network latency, expected GC pauses, and how costly false positives are versus delayed detection. Some systems (like Cassandra) use a Phi Accrual failure detector, which replaces the binary alive/dead verdict with a continuously updated suspicion level based on the observed distribution of heartbeat arrival times.
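The core idea behind Phi Accrual can be sketched in a few lines: model the history of inter-heartbeat intervals as a distribution, and compute phi as the negative log-probability that a heartbeat would still arrive after this much silence. The sketch below assumes a normal distribution for simplicity (Cassandra's actual detector uses a different approximation; the function and parameter names are illustrative):

```python
import math

def phi(time_since_last, mean_interval, stddev_interval):
    """Suspicion level for a node silent for `time_since_last` seconds,
    given the observed mean/stddev of its past heartbeat intervals."""
    # P(the next heartbeat arrives later than time_since_last)
    # under a normal model of inter-arrival times.
    z = (time_since_last - mean_interval) / stddev_interval
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    # Clamp to avoid log(0) when the silence is far out in the tail.
    p_later = max(p_later, 1e-12)
    return -math.log10(p_later)

# Heartbeats normally arrive every ~1s with little jitter:
print(phi(1.0, mean_interval=1.0, stddev_interval=0.1))  # low suspicion
print(phi(2.0, mean_interval=1.0, stddev_interval=0.1))  # very high suspicion
```

Instead of picking a timeout, the operator picks a phi threshold; because the distribution is learned from observed traffic, the effective timeout adapts automatically to network conditions.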

Interview angle

Heartbeat is the simplest failure detection mechanism, and it's the right starting point in any interview answer about failure detection. But always mention the timeout trade-off -- it shows you understand that failure detection in distributed systems is fundamentally about probability, not certainty.

Examples

GFS

The GFS master communicates with each ChunkServer through periodic heartbeat messages. These serve a dual purpose: they detect dead ChunkServers, and they piggyback control traffic -- carrying instructions (like creating or deleting chunks) and collecting ChunkServer state.

HDFS

The NameNode tracks DataNodes through heartbeats sent every few seconds. If a DataNode misses enough consecutive heartbeats, the NameNode marks it as dead, stops routing I/O to it, and begins re-replicating the blocks it held to maintain the target replication factor.

Kafka

Kafka brokers send heartbeats to ZooKeeper to maintain their session. If a broker's session expires (no heartbeats received), ZooKeeper notifies the controller broker, which reassigns that broker's partition leadership to other brokers.

Chubby

Chubby uses a session-based heartbeat mechanism. Clients maintain a session with the Chubby master through periodic KeepAlive messages. If the session lapses, the client loses its locks and cached data -- a deliberate design choice that prevents stale locks from persisting.

Quiz
A 1,000-node cluster uses centralized heartbeat (every node sends heartbeats to a single leader). The leader receives 1,000 heartbeats per interval. What problem does this create if the cluster grows to 100,000 nodes?