Skip to main content

11 Phi Accrual Failure Detection

A node hasn't sent a heartbeat in 5 seconds. Is it dead? Maybe. But it might also be experiencing a long GC pause, or the network might be congested. With a fixed timeout, you're forced to make a binary decision: alive or dead. Set the timeout too low and you get false positives. Set it too high and you detect real failures slowly.

What if, instead of a binary yes/no, you could get a probability that the node is down?

Think first
Node A's heartbeats typically arrive every 1 second. Node B's heartbeats typically arrive every 3 seconds (because it's across a slow WAN). Using the same fixed 5-second timeout for both, what problem do you create?

Background

Traditional failure detection uses a fixed timeout: if no heartbeat arrives within T seconds, the node is declared dead. The problem is choosing T:

Timeout too shortTimeout too long
Fast detection of real failuresSlow detection of real failures
Many false positives (healthy but slow nodes marked dead)Few false positives
Unnecessary failover and data migrationRequests pile up on dead nodes

No single timeout value works well across all network conditions. A value that's appropriate for a healthy LAN will trigger constant false alarms on a congested WAN. Networks fluctuate -- the right timeout changes over time.

Definition

Instead of a binary alive/dead output, the Phi Accrual Failure Detector outputs a suspicion level (φ, phi) -- a continuous value that represents how likely it is that the node has failed. The suspicion level is calculated from the statistical distribution of past heartbeat arrival times.

  • φ = 1 → about 10% chance the node is alive (suspicious)
  • φ = 2 → about 1% chance (very suspicious)
  • φ = 8 → about 0.00001% chance (almost certainly dead)

The application sets a threshold (e.g., φ > 8 means "declare dead"). The detector adapts automatically to network conditions because it's based on observed arrival time statistics, not a fixed constant.

How it works

  1. Collect heartbeat history: For each monitored node, record the inter-arrival times of heartbeats (e.g., last 1000 samples)
  2. Model the distribution: Fit the arrival times to a probability distribution (typically a normal distribution or exponential)
  3. On each check: Calculate the probability that the current silence duration would be observed if the node were still alive
  4. Compute φ: φ = -log₁₀(probability). The longer the silence relative to historical patterns, the higher φ gets.

Why this adapts: If a node's heartbeats typically arrive every 1 second, a 5-second gap produces a high φ. But if another node's heartbeats typically arrive every 3 seconds (because it's across a slow WAN), a 5-second gap produces a much lower φ. The detector automatically learns what's "normal" for each node.

Phi Accrual vs. fixed timeout

Fixed timeoutPhi Accrual
Binary output (alive/dead)Continuous suspicion level (0 to ∞)
Same threshold for all nodesAdapts per-node based on history
Requires manual tuningSelf-tuning from observed data
Hard cutoff → false positives or slow detectionGradual escalation → application decides threshold
The key insight

By outputting a probability instead of a boolean, the Phi Accrual detector lets the application decide its own trade-off between false positives and detection speed. A latency-sensitive application might set a low threshold (react quickly, accept some false alarms). A data-critical application might set a high threshold (wait longer, but be more certain).

Examples

Cassandra

Cassandra uses the Phi Accrual Failure Detector as its primary failure detection mechanism. Each node monitors heartbeat arrival times from other nodes and computes φ. The default threshold is φ > 8, meaning Cassandra won't declare a node dead unless there's overwhelming evidence. This avoids the flapping problem where nodes rapidly alternate between alive and dead states in a congested network.

Akka

The Akka actor framework uses Phi Accrual for failure detection in its cluster module, directly inspired by the same academic paper.

Interview angle

If an interviewer asks "How do you detect node failures?", start with heartbeat as the basic mechanism, then mention Phi Accrual as the sophisticated version. The key phrase: "Instead of a fixed timeout, we use an adaptive failure detector that learns from historical heartbeat patterns and outputs a suspicion level rather than a binary alive/dead signal."

Quiz
Cassandra uses Phi Accrual with a default threshold of phi > 8. What would happen if you lowered the threshold to phi > 1?