Gossiper

In a decentralized system with no master, how does each node learn about every other node's state? Cassandra's answer: the Gossiper -- its implementation of the gossip protocol.

Think first
In a 1000-node cluster with no central coordinator, how would you ensure every node eventually learns about a new node joining? What is the time complexity of information propagation?

How Cassandra gossips

Every second, each node:

  1. Picks 1 to 3 random nodes
  2. Exchanges state information -- what nodes it knows about, their status, their load, their schema version
  3. Merges received information with its own, keeping the higher-versioned entry for each node (every piece of gossiped state carries a version number)

Because the number of informed nodes grows roughly geometrically with each round, any cluster-wide change (node joining, node failing, schema update) reaches all N nodes in O(log N) gossip rounds with high probability.
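To make the O(log N) claim concrete, here is a minimal simulation sketch. It assumes a simplified push-only model in which every informed node tells a fixed number of random peers per round; the function name `rounds_to_spread` and its parameters are hypothetical, and Cassandra's real exchange is two-way, so treat the numbers as illustrative only.

```python
import random


def rounds_to_spread(n_nodes, fanout=3, seed=0):
    """Count rounds until every node has heard a rumor that starts at node 0.

    Simplified push-only gossip: each informed node pushes to `fanout`
    randomly chosen nodes per round (it may pick itself or a node that
    already knows, which is simply a wasted message).
    """
    rng = random.Random(seed)
    k = min(fanout, n_nodes)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        newly_informed = set()
        for _node in informed:
            newly_informed.update(rng.sample(range(n_nodes), k))
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, rounds_to_spread(n))
```

In this model the informed set can at most quadruple per round with a fanout of 3, so a 1000-node cluster needs at least 5 rounds; the simulated count stays in that logarithmic range rather than growing linearly with N.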

Generation number: Each node stores a generation number that increases on every restart. This number is included in gossip messages so other nodes can distinguish a node's current state from its pre-restart state. If gossip arrives carrying a higher generation number for a node, the receiver knows that node has restarted and replaces its older view of it, even if the version counter within the new generation has started over.
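The merge rule can be sketched as a comparison on the pair (generation, version). This is an illustrative model, not Cassandra's actual classes: `EndpointState`, `merge`, and the status strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointState:
    generation: int  # grows each time the node restarts
    version: int     # grows with each state change within one generation
    status: str      # illustrative payload, e.g. "NORMAL" or "JOINING"


def merge(local, remote):
    """Return a new view keeping, per node, the state with the newer
    (generation, version) pair. Generation is compared first, so a restarted
    node wins even if its version counter has started over."""
    merged = dict(local)
    for node, theirs in remote.items():
        ours = merged.get(node)
        if ours is None or (theirs.generation, theirs.version) > (
            ours.generation,
            ours.version,
        ):
            merged[node] = theirs
    return merged


if __name__ == "__main__":
    local = {"a": EndpointState(1, 5, "NORMAL")}
    remote = {"a": EndpointState(2, 0, "JOINING"), "b": EndpointState(1, 3, "NORMAL")}
    print(merge(local, remote))
```

Comparing tuples keeps the rule short: a state from generation 2 at version 0 still supersedes generation 1 at version 5, which is exactly the restart case the generation number exists to handle.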

Seed nodes: Like Dynamo, Cassandra uses designated seed nodes to prevent logical partitions. Seeds have no special operational role -- they're just well-known bootstrap points that new nodes can use to discover the rest of the cluster.

Think first
Why can't Cassandra use a simple fixed-timeout heartbeat (e.g., 'if no heartbeat in 5 seconds, declare the node dead') for failure detection in a multi-data-center cluster?

Node failure detection: Phi Accrual

Cassandra doesn't use simple heartbeat timeouts to declare nodes dead. Instead, it uses the Phi Accrual Failure Detector -- an adaptive algorithm that:

  1. Tracks historical heartbeat arrival times for each node
  2. Outputs a suspicion level (φ) rather than a binary alive/dead signal
  3. Adapts automatically to network conditions -- for a node whose heartbeats normally arrive slowly or irregularly, suspicion builds more slowly, so it is effectively given more leeway

When φ exceeds the configured threshold (default: 8), the node is declared down. This dramatically reduces false positives compared to fixed timeouts, especially in clusters spanning multiple data centers with varying network latencies.
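A rough sketch of how φ can be computed follows. It assumes heartbeat inter-arrival times are exponentially distributed, which gives φ = elapsed / (mean interval × ln 10); the class name `PhiAccrualDetector`, its method names, and the window size are hypothetical, and a production detector may model the distribution differently.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual detector for one monitored node."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat observed at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of the probability that a heartbeat is
        merely late, under the exponential assumption stated above."""
        if not self.intervals:
            return 0.0  # no history yet, so no basis for suspicion
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        return elapsed / (mean * math.log(10))

    def is_down(self, now):
        return self.phi(now) > self.threshold


if __name__ == "__main__":
    fast = PhiAccrualDetector()
    for t in range(11):          # heartbeats every 1 s
        fast.heartbeat(float(t))
    slow = PhiAccrualDetector()
    for t in range(0, 21, 2):    # heartbeats every 2 s
        slow.heartbeat(float(t))
    print(fast.phi(30.0), fast.is_down(30.0))
    print(slow.phi(40.0), slow.is_down(40.0))
```

Under this model, 20 seconds of silence pushes the node with 1-second heartbeats past the threshold of 8, while the node with 2-second heartbeats sits at roughly half that suspicion level after the same silence. The threshold never changes; what adapts is how quickly φ grows for each node.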

Interview angle

When discussing failure detection, mention that Cassandra uses Phi Accrual instead of fixed-timeout heartbeats. The key insight: binary alive/dead decisions with fixed timeouts force you to choose between fast detection (many false positives) and accuracy (slow detection). Phi Accrual eases that trade-off by adapting to observed network behavior, so the same threshold works for both fast and slow links.

Quiz
A Cassandra cluster spans two data centers connected by a WAN link. The WAN link becomes congested, doubling cross-DC latency for 30 seconds before recovering. With the Phi Accrual Failure Detector, what happens to cross-DC node status?