9 Heartbeat
A server in your cluster just stopped responding. Is it dead? Is the network down? Is it just slow? And critically -- how long do you wait before you start moving its work to other nodes?
Background
In a distributed system, nodes need to know which other nodes are alive. This sounds trivial, but it's one of the fundamental challenges of distributed computing. You can't simply "check" if a remote node is alive -- all you can do is send it a message and see if it responds. If it doesn't respond, you don't know whether the node is dead, the network is partitioned, or the response is just delayed.
Despite this uncertainty, the system must make a decision. Waiting too long means requests pile up on a dead node. Reacting too quickly means you might replace a node that's just experiencing a brief GC pause.
Definition
Each server periodically sends a heartbeat -- a small "I'm alive" message -- to a monitoring server or to other nodes in the system. If no heartbeat is received within a timeout period, the node is presumed dead.
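As a concrete sketch, the sending side can be as small as a periodic fire-and-forget datagram. Everything here (the payload format, node id, and addressing) is an illustrative assumption, not a standard:

```python
import json
import socket
import time

def heartbeat_payload(node_id, seq):
    # A heartbeat carries just enough to identify the sender; a sequence
    # number lets the receiver notice reordering or gaps.
    return json.dumps({"node": node_id, "seq": seq}).encode()

def run_heartbeater(node_id, monitor_addr, interval=2.0):
    # Fire-and-forget UDP: losing an occasional datagram is fine,
    # because the monitor only reacts after several missed beats.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    while True:
        sock.sendto(heartbeat_payload(node_id, seq), monitor_addr)
        seq += 1
        time.sleep(interval)
```

Real systems usually piggyback the heartbeat on existing RPC traffic rather than running a dedicated loop, but the shape is the same: a tiny message, on a fixed cadence, with no acknowledgment required.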
How it works
Centralized heartbeat (leader-based systems):
- Each worker node sends a heartbeat to the leader/master every few seconds
- The leader tracks the last heartbeat time for each node
- If a node misses heartbeats beyond a configured timeout, the leader marks it as dead
- The leader stops routing requests to the dead node and initiates recovery
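The leader-side bookkeeping reduces to a map from node id to last-heartbeat time, scanned against a timeout. A minimal sketch (class and parameter names are illustrative; the injectable clock is just for testability):

```python
import time

class LeaderFailureDetector:
    # Leader-side view of worker liveness, assuming workers push
    # heartbeats to the leader on a fixed interval.

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def on_heartbeat(self, node_id):
        # Record (or refresh) the node's last-seen time.
        self.last_seen[node_id] = self.clock()

    def dead_nodes(self):
        # Any node silent for longer than the timeout is presumed dead.
        now = self.clock()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]
```

In practice the leader would run `dead_nodes()` on a periodic sweep, deregister the dead nodes from routing, and kick off recovery for the work they held.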
Decentralized heartbeat (peer-to-peer systems):
- Each node sends heartbeats to a randomly selected set of peers
- Nodes share heartbeat information through gossip protocol
- If a node hasn't been heard from (directly or via gossip) beyond a timeout, it's suspected dead
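A sketch of the gossip-side bookkeeping, assuming each node gossips a map of per-peer heartbeat counters (a common scheme; the names here are illustrative). Liveness is refreshed only when a peer's counter advances, whether the update arrived directly or secondhand:

```python
import time

class GossipView:
    # One node's view of cluster liveness: the highest heartbeat counter
    # seen per peer, plus the local time that counter last advanced.

    def __init__(self, suspect_timeout=10.0, clock=time.monotonic):
        self.suspect_timeout = suspect_timeout
        self.clock = clock
        self.counters = {}     # peer_id -> highest counter seen
        self.last_update = {}  # peer_id -> local time counter advanced

    def merge(self, gossiped):
        # Merge a peer's gossiped view. Only a counter that actually
        # advanced counts as evidence the peer is still alive.
        now = self.clock()
        for peer, counter in gossiped.items():
            if counter > self.counters.get(peer, -1):
                self.counters[peer] = counter
                self.last_update[peer] = now

    def suspected(self):
        # Peers whose counters have been stale too long are suspected dead.
        now = self.clock()
        return [p for p, t in self.last_update.items()
                if now - t > self.suspect_timeout]
```

Note the word "suspected": in a gossip protocol a node is typically marked suspect first and only declared dead after further confirmation, which reduces false positives from a single slow gossip path.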
The timeout dilemma
| Timeout too short | Timeout too long |
|---|---|
| Healthy but slow nodes get marked as dead | Dead nodes continue receiving requests |
| Unnecessary data migration and recovery work | System performance degrades as work piles up |
| "Flapping" -- nodes rapidly alternating between alive and dead states | Slow response to actual failures |
There's no perfect timeout value. It depends on network latency, expected GC pauses, and how costly false positives are compared with delayed detection. Some systems (like Cassandra) use the Phi Accrual failure detector, which replaces the fixed timeout with a continuously updated suspicion level derived from observed heartbeat arrival times.
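To make the accrual idea concrete, here is a much-simplified sketch: model heartbeat inter-arrival times as a normal distribution and convert "time since the last heartbeat" into a suspicion value phi. This is an illustration of the principle, not Cassandra's actual implementation (which uses a different distribution and windowing):

```python
import math

class PhiAccrualDetector:
    # Simplified Phi Accrual sketch: phi = -log10(P(a heartbeat arrives
    # later than the time already elapsed)), under a normal model of
    # inter-arrival times. Higher phi = stronger suspicion of failure.

    def __init__(self, window=100):
        self.intervals = []        # recent inter-arrival times
        self.window = window
        self.last_arrival = None

    def heartbeat(self, now):
        # Record an arrival and the interval since the previous one.
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)  # avoid divide-by-zero
        elapsed = now - self.last_arrival
        # Normal tail: P(next heartbeat arrives later than `elapsed`).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))
```

The caller picks a phi threshold (Cassandra exposes this as a convict threshold) instead of a raw timeout, so the effective timeout adapts automatically when the network gets slower or jitterier.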
Heartbeat is the simplest failure detection mechanism, and it's the right starting point in any interview answer about failure detection. But always mention the timeout trade-off -- it shows you understand that failure detection in distributed systems is fundamentally about probability, not certainty.
Examples
GFS
The GFS master communicates with each ChunkServer through periodic heartbeat messages. These serve a dual purpose: they detect dead ChunkServers, and they piggyback control traffic -- instructions (like creating or deleting chunks) and collection of ChunkServer state -- on the same exchange.
HDFS
The NameNode tracks DataNodes through heartbeats sent every few seconds. If a DataNode misses enough consecutive heartbeats, the NameNode marks it as dead, stops routing I/O to it, and begins re-replicating the blocks it held to maintain the target replication factor.
Kafka
In ZooKeeper-based Kafka deployments, brokers send heartbeats to ZooKeeper to keep their session alive. If a broker's session expires (no heartbeats received), ZooKeeper notifies the controller broker, which reassigns that broker's partition leadership to other brokers.
Chubby
Chubby uses a session-based heartbeat mechanism. Clients maintain a session with the Chubby master through periodic KeepAlive messages. If the session lapses, the client loses its locks and cached data -- a deliberate design choice that prevents stale locks from persisting.