
Master Election and Chubby Events

When a Chubby master dies, the cell must elect a new one and reconstruct state -- without losing client sessions or violating lock guarantees. The sequence of steps a new master follows is one of the most operationally critical paths in Chubby.

Think first
A new Chubby master has just been elected after the old master died. The new master has no in-memory state (sessions, locks, handles). How should it reconstruct state, and in what order should it allow operations to resume?

New master initialization sequence

A newly elected master proceeds through these steps in order:

| Step | Action | Purpose |
| --- | --- | --- |
| 1 | Pick a new epoch number | Distinguishes this master from the previous one. Clients must present the epoch on every call; the master rejects calls with stale epochs. This prevents the new master from processing old packets meant for the previous master. Same concept as Split-brain epoch fencing. |
| 2 | Respond to master-location requests only | The master announces itself but does not yet handle session operations. |
| 3 | Rebuild in-memory state | Reconstruct session and lock data structures from the database. Extend session leases to the maximum the previous master may have used. |
| 4 | Allow KeepAlives only | Clients can maintain their sessions but cannot perform other operations yet. |
| 5 | Emit failover event to every session | Clients flush their caches (they may have missed invalidations) and receive warnings that other events may have been lost. |
| 6 | Wait for acknowledgments | The master waits until every session acknowledges the failover event or lets its session expire. |
| 7 | Allow all operations | Normal service resumes. |
| 8 | Honor pre-failover handles | If a client presents a handle created before the failover, the master reconstructs the in-memory handle representation and processes the request. |
| 9 | Delete stale ephemeral files | After ~1 minute, ephemeral files with no open handles are cleaned up. Clients must refresh ephemeral file handles within this window. |

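
The gating in steps 1 through 7 can be sketched as a small state machine. This is an illustrative model only, not Chubby's implementation: the class, the operation names (`locate_master`, `keep_alive`, `acquire_lock`), and the choice to derive the new epoch by adding one to the previous epoch are all assumptions made for the example.

```python
from enum import IntEnum


class Phase(IntEnum):
    LOCATION_ONLY = 1    # step 2: answer master-location requests only
    KEEPALIVE_ONLY = 2   # step 4: KeepAlives are also allowed
    ALL_OPERATIONS = 3   # step 7: normal service


# Hypothetical operation names, used only for this sketch.
ALLOWED = {
    Phase.LOCATION_ONLY: {"locate_master"},
    Phase.KEEPALIVE_ONLY: {"locate_master", "keep_alive"},
}


class NewMaster:
    def __init__(self, previous_epoch, session_ids):
        # Step 1: pick a new epoch (assumed here to be previous + 1).
        self.epoch = previous_epoch + 1
        self.phase = Phase.LOCATION_ONLY
        # Step 6 bookkeeping: sessions that must acknowledge or expire.
        self.pending = set(session_ids)

    def finish_rebuild(self):
        # Step 3 is done; move to step 4. A real master would emit the
        # failover event to every session here (step 5).
        self.phase = Phase.KEEPALIVE_ONLY
        self._open_if_ready()

    def ack_or_expire(self, session_id):
        # Step 6: a session acknowledged the failover event or expired.
        self.pending.discard(session_id)
        self._open_if_ready()

    def _open_if_ready(self):
        if self.phase == Phase.KEEPALIVE_ONLY and not self.pending:
            self.phase = Phase.ALL_OPERATIONS  # step 7

    def allows(self, operation):
        if self.phase == Phase.ALL_OPERATIONS:
            return True
        return operation in ALLOWED[self.phase]
```

In this model, a lock acquisition is refused until the last outstanding session has either acknowledged or expired, which is the property step 6 exists to provide.
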
Interview angle

The epoch number plays the same role as a fencing token, applied to the master itself. When explaining Chubby's failover, emphasize: "Each new master picks a new, strictly increasing epoch, and clients must present the epoch on every call. Any call carrying an old epoch is rejected. This is how Chubby guards against split-brain -- even if the old master is still alive, calls that carry its epoch are rejected by the new master."
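
A minimal sketch of the epoch check, assuming epochs are plain integers. `StaleEpochError` and `check_epoch` are hypothetical names, and returning the current epoch inside the error is a choice of this example so that the caller can retry with it.

```python
class StaleEpochError(Exception):
    """Raised when a call carries an epoch older than the master's."""

    def __init__(self, current_epoch):
        super().__init__(f"stale epoch; current epoch is {current_epoch}")
        self.current_epoch = current_epoch


def check_epoch(master_epoch, request_epoch):
    # Reject any call stamped with an epoch from a previous master.
    if request_epoch < master_epoch:
        raise StaleEpochError(master_epoch)


# Usage: a call stamped with epoch 7 reaches a master whose epoch is 8.
try:
    check_epoch(master_epoch=8, request_epoch=7)
except StaleEpochError as err:
    retry_epoch = err.current_epoch  # the client can retry with this value
```
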

Chubby events

Chubby provides a simple event mechanism. Clients subscribe to events when creating a handle, and events are delivered asynchronously via callbacks in the Chubby library.

File and lock events

| Event | Triggered when... |
| --- | --- |
| File contents modified | A file's data changes |
| Child node added/removed/modified | A directory's children change |
| Master failover | A new master takes over |
| Handle invalidated | A handle (and its associated lock) becomes invalid |
| Lock acquired | A lock transitions from free to held |
| Conflicting lock request | Another client requests a lock held by this client |
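
To make the subscription model concrete, here is a rough sketch in which a handle records its callbacks at creation time. The `Event` enum mirrors the list above, but the `Handle` class, its `deliver` method, and the path `/ls/cell/config` are invented for illustration and are not the Chubby library's API. The sketch also calls callbacks synchronously for simplicity, whereas the real library delivers events asynchronously.

```python
from enum import Enum, auto


class Event(Enum):
    FILE_CONTENTS_MODIFIED = auto()
    CHILD_NODE_CHANGED = auto()
    MASTER_FAILOVER = auto()
    HANDLE_INVALIDATED = auto()
    LOCK_ACQUIRED = auto()
    CONFLICTING_LOCK_REQUEST = auto()


class Handle:
    def __init__(self, path, subscriptions):
        self.path = path
        # Subscriptions are fixed when the handle is created:
        # a mapping from Event to a callback taking (path, detail).
        self._callbacks = dict(subscriptions)

    def deliver(self, event, detail=None):
        """Invoke the callback for `event`; return False if unsubscribed."""
        callback = self._callbacks.get(event)
        if callback is None:
            return False
        callback(self.path, detail)
        return True


# Usage: subscribe to content changes only.
seen = []
handle = Handle(
    "/ls/cell/config",  # hypothetical path
    {Event.FILE_CONTENTS_MODIFIED: lambda path, detail: seen.append((path, detail))},
)
handle.deliver(Event.FILE_CONTENTS_MODIFIED, "v2")
```
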

Session events (sent to the application)

| Event | Meaning |
| --- | --- |
| Jeopardy | Session lease timed out; grace period has begun. The client's cached data is no longer trustworthy. |
| Safe | Session survived a communication problem. Cache is re-enabled. |
| Expired | Session timed out. All handles, locks, and cached data are invalidated. |

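
The three session events correspond to transitions in a small client-side state machine. The following is a sketch under assumptions: the class and method names are hypothetical, and clearing the cache on jeopardy is how this example models "no longer trustworthy".

```python
from enum import Enum


class SessionState(Enum):
    HEALTHY = "healthy"
    JEOPARDY = "jeopardy"
    EXPIRED = "expired"


class ClientSession:
    def __init__(self):
        self.state = SessionState.HEALTHY
        self.cache_enabled = True
        self.cache = {}
        self.handles = set()

    def on_lease_timeout(self):
        """The lease ran out: enter the grace period."""
        if self.state is not SessionState.HEALTHY:
            return None
        self.state = SessionState.JEOPARDY
        self.cache.clear()          # cached data can no longer be trusted
        self.cache_enabled = False
        return "jeopardy"

    def on_keepalive_success(self):
        """A KeepAlive succeeded during the grace period."""
        if self.state is not SessionState.JEOPARDY:
            return None
        self.state = SessionState.HEALTHY
        self.cache_enabled = True
        return "safe"

    def on_grace_period_end(self):
        """The grace period ended without contact: the session is gone."""
        if self.state is not SessionState.JEOPARDY:
            return None
        self.state = SessionState.EXPIRED
        self.cache.clear()
        self.handles.clear()        # handles (and their locks) are invalid
        return "expired"
```

Each method returns the name of the event the application would see, or `None` when the transition does not apply in the current state.
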
Warning

Events can be lost during failover. The failover event (step 5 above) warns clients of this explicitly. Applications must treat the failover event as a signal to re-validate all assumptions -- re-read files, re-check lock ownership, and re-register for events. Treating the failover event as purely informational is a common mistake.
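
As an illustration of treating the failover event as a re-validation trigger, the sketch below re-reads watched files and re-checks lock ownership. `FakeClient`, its `read` and `holds_lock` methods, and the paths are all hypothetical stand-ins; a real application would call whatever its Chubby client library provides and would also re-register its event subscriptions.

```python
class FakeClient:
    """Stand-in for a Chubby client; both methods are hypothetical."""

    def __init__(self, files, locks_held):
        self.files = files
        self.locks_held = locks_held

    def read(self, path):
        return self.files[path]

    def holds_lock(self, path):
        return path in self.locks_held


def revalidate_after_failover(client, watched_paths, assumed_locks):
    """Return fresh file contents and the locks we no longer hold."""
    # Re-read every file instead of trusting cached contents.
    fresh = {path: client.read(path) for path in watched_paths}
    # Re-check each lock the application believes it holds.
    lost = [path for path in assumed_locks if not client.holds_lock(path)]
    return fresh, lost


# Usage: the application believed it held two locks; only one survived.
client = FakeClient({"/ls/cell/config": "v3"}, {"/ls/cell/leader"})
fresh, lost = revalidate_after_failover(
    client,
    watched_paths=["/ls/cell/config"],
    assumed_locks=["/ls/cell/leader", "/ls/cell/shard-7"],
)
```

An application that finds a lock in `lost` should stop acting as its holder before doing anything else.
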

Quiz
What would happen if the new Chubby master skipped step 6 (waiting for all sessions to acknowledge the failover event) and immediately allowed all operations?