9 Heartbeat

A server in your cluster just stopped responding. Is it dead? Is the network down? Is it just slow? And critically -- how long do you wait before you start moving its work to other nodes?

Think first
A server in your cluster stopped responding. You can't tell if it's dead, partitioned, or just slow. How would you design a failure detection mechanism, and what is the fundamental trade-off in choosing a timeout value?

Background

In a distributed system, nodes need to know which other nodes are alive. This sounds trivial, but it's one of the fundamental challenges of distributed computing. You can't simply "check" if a remote node is alive -- all you can do is send it a message and see if it responds. If it doesn't respond, you don't know whether the node is dead, the network is partitioned, or the response is just delayed.

Despite this uncertainty, the system must make a decision. Waiting too long means requests pile up on a dead node. Reacting too quickly means you might replace a node that's just experiencing a brief GC pause.

Definition

Each server periodically sends a heartbeat -- a small "I'm alive" message -- to a monitoring server or to other nodes in the system. If no heartbeat is received within a timeout period, the node is presumed dead.

How it works

Centralized heartbeat (leader-based systems):

  1. Each worker node sends a heartbeat to the leader/master every few seconds
  2. The leader tracks the last heartbeat time for each node
  3. If a node misses heartbeats beyond a configured timeout, the leader marks it as dead
  4. The leader stops routing requests to the dead node and initiates recovery
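The leader's bookkeeping in steps 1-3 can be sketched as a small monitor that records the last heartbeat time per node and reports which nodes have been silent past the timeout (class and method names here are illustrative, not from any particular system):

```python
import time

class HeartbeatMonitor:
    """Leader-side failure detector: tracks last heartbeat per node."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def record_heartbeat(self, node_id, now=None):
        # Called whenever a heartbeat arrives from a worker.
        self.last_seen[node_id] = now if now is not None else time.monotonic()

    def dead_nodes(self, now=None):
        # A node is *presumed* dead if its last heartbeat is older than
        # the timeout -- the leader cannot distinguish dead from slow.
        now = now if now is not None else time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("worker-1", now=100.0)
monitor.record_heartbeat("worker-2", now=105.0)
print(monitor.dead_nodes(now=112.0))  # worker-1: silent for 12s > 10s timeout
```

In a real leader, `dead_nodes` would run on a periodic sweep, and a positive result would trigger step 4: stop routing to the node and start recovery.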

Decentralized heartbeat (peer-to-peer systems):

  1. Each node sends heartbeats to a randomly selected set of peers
  2. Nodes share heartbeat information through a gossip protocol
  3. If a node hasn't been heard from (directly or via gossip) beyond a timeout, it's suspected dead
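One common way to implement step 2 is for each node to keep a table of monotonically increasing heartbeat counters and merge tables received from peers, keeping the freshest counter per node. A minimal sketch of that merge step (the data layout is illustrative, not taken from any specific gossip implementation):

```python
def merge_gossip(local, remote, now):
    """Merge a peer's heartbeat counters into our local table.

    local:  node_id -> (counter, local_time_we_last_learned_something)
    remote: node_id -> counter, as received from a gossiping peer
    """
    for node, counter in remote.items():
        known_counter, _ = local.get(node, (-1, now))
        if counter > known_counter:
            # Newer information: record the counter and when *we* learned it.
            # A node is later suspected dead if its entry goes stale.
            local[node] = (counter, now)
    return local

# Node A's current view, plus a table gossiped over from a peer
local = {"a": (5, 100.0), "b": (3, 90.0)}
remote = {"b": 7, "c": 1}
merged = merge_gossip(local, remote, now=120.0)
print(merged)  # b refreshed to counter 7; c newly learned
```

Step 3 then falls out naturally: a node whose entry hasn't been refreshed, directly or via any peer, within the timeout is suspected dead.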

The timeout dilemma

Timeout too short:

  - Healthy but slow nodes get marked as dead
  - Unnecessary data migration and recovery work
  - "Flapping" -- nodes rapidly alternating between alive and dead states

Timeout too long:

  - Dead nodes continue receiving requests
  - System performance degrades as work piles up
  - Slow response to actual failures

There's no perfect timeout value. It depends on network latency, expected GC pauses, and how costly false positives are versus delayed detection. Some systems (like Cassandra) use a Phi Accrual failure detector, which replaces the binary alive/dead verdict with a continuously updated suspicion level based on the observed distribution of heartbeat arrival times.
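The core idea behind Phi Accrual can be sketched in a few lines: model the history of inter-heartbeat intervals as a distribution, and compute phi as the negative log-probability that a heartbeat would still arrive after this much silence. The sketch below assumes a normal distribution for simplicity (Cassandra's actual detector uses a different approximation; the function and parameter names are illustrative):

```python
import math

def phi(time_since_last, mean_interval, stddev_interval):
    """Suspicion level for a node silent for `time_since_last` seconds,
    given the observed mean/stddev of its past heartbeat intervals."""
    # P(the next heartbeat arrives later than time_since_last)
    # under a normal model of inter-arrival times.
    z = (time_since_last - mean_interval) / stddev_interval
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    # Clamp to avoid log(0) when the silence is far out in the tail.
    p_later = max(p_later, 1e-12)
    return -math.log10(p_later)

# Heartbeats normally arrive every ~1s with little jitter:
print(phi(1.0, mean_interval=1.0, stddev_interval=0.1))  # low suspicion
print(phi(2.0, mean_interval=1.0, stddev_interval=0.1))  # very high suspicion
```

Instead of picking a timeout, the operator picks a phi threshold; because the distribution is learned from observed traffic, the effective timeout adapts automatically to network conditions.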

Interview angle

Heartbeat is the simplest failure detection mechanism, and it's the right starting point in any interview answer about failure detection. But always mention the timeout trade-off -- it shows you understand that failure detection in distributed systems is fundamentally about probability, not certainty.

Examples

GFS

The GFS master communicates with each ChunkServer through periodic heartbeat messages. These serve a dual purpose: they detect dead ChunkServers, and they piggyback control traffic -- carrying instructions (like creating or deleting chunks) and collecting ChunkServer state.

HDFS

The NameNode tracks DataNodes through heartbeats sent every few seconds. If a DataNode misses enough consecutive heartbeats, the NameNode marks it as dead, stops routing I/O to it, and begins re-replicating the blocks it held to maintain the target replication factor.

Kafka

Kafka brokers send heartbeats to ZooKeeper to maintain their session. If a broker's session expires (no heartbeats received), ZooKeeper notifies the controller broker, which reassigns that broker's partition leadership to other brokers.

Chubby

Chubby uses a session-based heartbeat mechanism. Clients maintain a session with the Chubby master through periodic KeepAlive messages. If the session lapses, the client loses its locks and cached data -- a deliberate design choice that prevents stale locks from persisting.

Quiz
A 1,000-node cluster uses centralized heartbeat (every node sends heartbeats to a single leader). The leader receives 1,000 heartbeats per interval. What problem does this create if the cluster grows to 100,000 nodes?