11 Phi Accrual Failure Detection
A node hasn't sent a heartbeat in 5 seconds. Is it dead? Maybe. But it might also be experiencing a long GC pause, or the network might be congested. With a fixed timeout, you're forced to make a binary decision: alive or dead. Set the timeout too low and you get false positives. Set it too high and you detect real failures slowly.
What if, instead of a binary yes/no, you could get a probability that the node is down?
Background
Traditional failure detection uses a fixed timeout: if no heartbeat arrives within T seconds, the node is declared dead. The problem is choosing T:
| Timeout too short | Timeout too long |
|---|---|
| Fast detection of real failures | Slow detection of real failures |
| Many false positives (healthy but slow nodes marked dead) | Few false positives |
| Unnecessary failover and data migration | Requests pile up on dead nodes |
No single timeout value works well across all network conditions. A value that's appropriate for a healthy LAN will trigger constant false alarms on a congested WAN. Networks fluctuate -- the right timeout changes over time.
Definition
Instead of a binary alive/dead output, the Phi Accrual Failure Detector outputs a suspicion level (φ, phi) -- a continuous value that represents how likely it is that the node has failed. The suspicion level is calculated from the statistical distribution of past heartbeat arrival times.
- φ = 1 → about 10% chance the node is alive (suspicious)
- φ = 2 → about 1% chance (very suspicious)
- φ = 8 → about 0.00001% chance (almost certainly dead)
The application sets a threshold (e.g., φ > 8 means "declare dead"). The detector adapts automatically to network conditions because it's based on observed arrival time statistics, not a fixed constant.
How it works
- Collect heartbeat history: For each monitored node, record the inter-arrival times of heartbeats (e.g., last 1000 samples)
- Model the distribution: Fit the arrival times to a probability distribution (typically a normal distribution or exponential)
- On each check: Calculate the probability that the current silence duration would be observed if the node were still alive
- Compute φ: φ = -log₁₀(probability). The longer the silence relative to historical patterns, the higher φ gets.
Why this adapts: If a node's heartbeats typically arrive every 1 second, a 5-second gap produces a high φ. But if another node's heartbeats typically arrive every 3 seconds (because it's across a slow WAN), a 5-second gap produces a much lower φ. The detector automatically learns what's "normal" for each node.
Phi Accrual vs. fixed timeout
| Fixed timeout | Phi Accrual |
|---|---|
| Binary output (alive/dead) | Continuous suspicion level (0 to ∞) |
| Same threshold for all nodes | Adapts per-node based on history |
| Requires manual tuning | Self-tuning from observed data |
| Hard cutoff → false positives or slow detection | Gradual escalation → application decides threshold |
By outputting a probability instead of a boolean, the Phi Accrual detector lets the application decide its own trade-off between false positives and detection speed. A latency-sensitive application might set a low threshold (react quickly, accept some false alarms). A data-critical application might set a high threshold (wait longer, but be more certain).
Examples
Cassandra
Cassandra uses the Phi Accrual Failure Detector as its primary failure detection mechanism. Each node monitors heartbeat arrival times from other nodes and computes φ. The default threshold is φ > 8, meaning Cassandra won't declare a node dead unless there's overwhelming evidence. This avoids the flapping problem where nodes rapidly alternate between alive and dead states in a congested network.
Akka
The Akka actor framework uses Phi Accrual for failure detection in its cluster module, directly inspired by the same academic paper.
If an interviewer asks "How do you detect node failures?", start with heartbeat as the basic mechanism, then mention Phi Accrual as the sophisticated version. The key phrase: "Instead of a fixed timeout, we use an adaptive failure detector that learns from historical heartbeat patterns and outputs a suspicion level rather than a binary alive/dead signal."