Sessions and Events
What happens when the network between a client and the Chubby master goes silent for 30 seconds? Does the client lose all its locks? Does it keep operating on stale data? Chubby's session and lease protocol defines exactly when a client's state is valid -- and what happens when it isn't.
What is a Chubby session?
A session is the relationship between a Chubby cell and a single client. All of a client's handles, locks, and cached data are valid only as long as its session is valid.
Sessions are maintained through periodic handshakes called KeepAlives.
Session lifecycle
| Event | Behavior |
|---|---|
| Creation | Client requests a new session on first contact with the master |
| Active | Client and master exchange KeepAlive RPCs; session lease is periodically extended |
| Idle timeout | Session expires if it has no open handles and makes no calls for one minute |
| Explicit end | Client terminates the session |
Each session has an associated lease -- a time interval during which the master guarantees it will not unilaterally terminate the session. The master extends the lease in three cases:
- On session creation
- When a master failover occurs
- When the master responds to a KeepAlive RPC
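The three extension cases can be sketched as master-side bookkeeping. This is a minimal illustration, not Chubby's implementation: `SessionLease` and the three handler functions are hypothetical names, and the 12-second value is the default extension cited under KeepAlive protocol.

```python
from dataclasses import dataclass

DEFAULT_EXTENSION_S = 12.0  # default extension cited in this document


@dataclass
class SessionLease:
    """Hypothetical master-side record of one session's lease."""
    expires_at: float  # master-clock time until which the session is guaranteed

    def extend(self, now: float, extension_s: float = DEFAULT_EXTENSION_S) -> float:
        # Assumption: a lease only ever moves forward, so an earlier promise
        # is never shortened.
        self.expires_at = max(self.expires_at, now + extension_s)
        return self.expires_at


def on_session_created(now: float) -> SessionLease:
    # Case 1: the lease is granted when the session is created.
    return SessionLease(expires_at=now + DEFAULT_EXTENSION_S)


def on_failover(lease: SessionLease, now: float) -> float:
    # Case 2: a newly elected master extends the lease it inherits.
    return lease.extend(now)


def on_keepalive_reply(lease: SessionLease, now: float) -> float:
    # Case 3: each KeepAlive reply carries a fresh lease timeout.
    return lease.extend(now)
```

Using `max` in `extend` is a design choice of this sketch: it keeps the guarantee monotonic even if the handlers are called out of order.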
KeepAlive protocol
KeepAlive is the heartbeat mechanism that keeps sessions alive. The protocol works as follows:
- Client sends a KeepAlive RPC to the master (step "1" in the diagram).
- The master blocks the RPC -- it does not reply until the client's current lease is close to expiring.
- The master replies (step "2"), informing the client of the new lease timeout (lease M2). The default extension is 12 seconds; an overloaded master may use longer intervals to reduce KeepAlive traffic.
- The client immediately sends a new KeepAlive after receiving the reply, ensuring there is almost always a pending KeepAlive blocked at the master.
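The timing of the steps above can be sketched with a small deterministic simulation. It assumes zero network latency and a hypothetical `REPLY_MARGIN_S` for how close to expiry the master replies; neither value comes from Chubby itself.

```python
LEASE_EXTENSION_S = 12.0  # default extension from the text
REPLY_MARGIN_S = 2.0      # assumed: how long before expiry the master replies


def simulate_keepalives(rounds: int, start: float = 0.0):
    """Return (reply_time, new_lease_expiry) for each KeepAlive round trip."""
    lease_expiry = start + LEASE_EXTENSION_S  # lease granted at session creation
    history = []
    for _ in range(rounds):
        # A KeepAlive is already pending at the master, which holds it
        # until the current lease is close to expiring.
        reply_time = lease_expiry - REPLY_MARGIN_S
        # The reply carries the new lease timeout.
        lease_expiry = reply_time + LEASE_EXTENSION_S
        history.append((reply_time, lease_expiry))
        # The client would send its next KeepAlive immediately at this point.
    return history


for reply_time, expiry in simulate_keepalives(3):
    print(f"reply at t={reply_time:.0f}s, lease now runs to t={expiry:.0f}s")
```

Under these assumptions there is one round trip per lease interval, and the lease is always renewed before it runs out.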
In the diagram below, thick arrows represent lease intervals, upward arrows are KeepAlive requests, and downward arrows are KeepAlive responses. Master-side leases (M1, M2) and client-side local leases (C1, C2) are drawn separately because the client's local lease is a conservative approximation of the master's: the client must allow for the time the KeepAlive reply spent in flight and for differences in clock rate.
The KeepAlive pattern -- the client sends a request, the server holds it until the lease is about to expire, then replies with a new lease -- is a form of long polling. It keeps KeepAlive traffic to roughly one round trip per lease interval while keeping the client informed of its lease status. This is the same lease pattern used in many distributed systems.
Session states
| State | Trigger | Client behavior |
|---|---|---|
| Normal | KeepAlives succeeding | Cache enabled, locks valid |
| Jeopardy | Client's local lease expires without a KeepAlive response | Cache flushed and disabled; client uncertain whether master has terminated the session |
| Grace period | Jeopardy begins | Client waits an extra 45 seconds (default) to re-establish contact before giving up |
| Safe | Successful KeepAlive during grace period | Cache re-enabled; session continues |
| Expired | Grace period ends without successful KeepAlive | Session terminated; all handles, locks, and cached data invalidated |
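The transitions in the table can be sketched as a small client-side state machine. The class and method names are hypothetical. The sketch folds the grace period into the jeopardy state and treats "safe" as an event on returning to normal, rather than as a separate state.

```python
from enum import Enum

GRACE_PERIOD_S = 45.0  # default grace period from the table above


class SessionState(Enum):
    NORMAL = "normal"
    JEOPARDY = "jeopardy"  # grace period is running
    EXPIRED = "expired"


class ClientSession:
    """Hypothetical client-side view of one session."""

    def __init__(self, lease_expiry: float):
        self.state = SessionState.NORMAL
        self.lease_expiry = lease_expiry
        self.grace_deadline = None
        self.cache_enabled = True
        self.events = []  # events that would be delivered to the application

    def on_tick(self, now: float) -> SessionState:
        # Local lease ran out without a KeepAlive reply: enter jeopardy.
        if self.state is SessionState.NORMAL and now >= self.lease_expiry:
            self.state = SessionState.JEOPARDY
            self.cache_enabled = False  # cache flushed and disabled
            self.grace_deadline = self.lease_expiry + GRACE_PERIOD_S
            self.events.append("jeopardy")
        # Grace period ran out as well: the session is gone.
        if self.state is SessionState.JEOPARDY and now >= self.grace_deadline:
            self.state = SessionState.EXPIRED
            self.events.append("expired")
        return self.state

    def on_keepalive_reply(self, now: float, new_lease_expiry: float) -> SessionState:
        self.on_tick(now)  # account for time that passed before the reply
        if self.state is SessionState.EXPIRED:
            return self.state  # too late; the application must recover
        if self.state is SessionState.JEOPARDY:
            self.events.append("safe")
            self.cache_enabled = True
            self.grace_deadline = None
        self.state = SessionState.NORMAL
        self.lease_expiry = new_lease_expiry
        return self.state
```

For example, a session whose lease runs to t=12 enters jeopardy at t=12, and a reply at t=30 returns it to normal. With no reply by t=57, the session expires.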
Piggybacking events: KeepAlive replies carry event notifications and cache invalidations back to the client, avoiding the need for a separate notification channel.
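Piggybacking might look like the following sketch on the client side. The `KeepAliveReply` fields and the `apply_reply` helper are hypothetical and only illustrate the idea; they are not Chubby's wire format.

```python
from dataclasses import dataclass, field


@dataclass
class KeepAliveReply:
    """Hypothetical reply: a lease extension plus piggybacked payloads."""
    lease_extension_s: float
    events: list = field(default_factory=list)
    invalidated_paths: list = field(default_factory=list)


def apply_reply(reply: KeepAliveReply, cache: dict, deliver_event) -> float:
    # Drop stale entries first, so that event handlers never read old data.
    # (The ordering is a choice made for this sketch.)
    for path in reply.invalidated_paths:
        cache.pop(path, None)
    for event in reply.events:
        deliver_event(event)
    return reply.lease_extension_s


cache = {"/ls/cell/a": b"old", "/ls/cell/b": b"kept"}
seen = []
reply = KeepAliveReply(12.0, ["modified:/ls/cell/a"], ["/ls/cell/a"])
extension = apply_reply(reply, cache, seen.append)
```

After the call, `cache` holds only `/ls/cell/b`, `seen` contains the one event, and `extension` is the new lease length. No second connection was needed to deliver any of it.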
Failover scenario
When a master fails or otherwise loses mastership:
- The failing master discards its in-memory state (sessions, handles, locks).
- The session lease timer stops -- no leases expire during the election. Because the authoritative lease timer runs at the master, stopping it is equivalent to extending every client's lease.
- If master election completes quickly, clients reconnect before their local leases expire.
- If election is slow, clients flush caches (enter jeopardy) and wait through the 45-second grace period.
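The fast and slow election cases can be sketched as a simple classification from the client's point of view. The function and parameter names are hypothetical. `outage_s` is assumed to be the time from the master's failure until the client's first successful KeepAlive with the new master.

```python
GRACE_PERIOD_S = 45.0  # default grace period


def failover_outcome(lease_remaining_s: float, outage_s: float,
                     grace_period_s: float = GRACE_PERIOD_S) -> str:
    """Classify what a client experiences during a master failover."""
    if outage_s <= lease_remaining_s:
        # The new master was reached before the local lease ran out.
        return "uninterrupted"
    if outage_s <= lease_remaining_s + grace_period_s:
        # Cache was flushed, but contact resumed within the grace period.
        return "jeopardy_then_safe"
    # The grace period ran out: the session is reported as expired.
    return "expired"


print(failover_outcome(lease_remaining_s=8.0, outage_s=5.0))
print(failover_outcome(lease_remaining_s=8.0, outage_s=30.0))
print(failover_outcome(lease_remaining_s=8.0, outage_s=60.0))
```

With 8 seconds of lease left, a 5-second outage goes unnoticed, a 30-second outage costs a cache flush and a delay, and a 60-second outage exceeds the 53 seconds of combined lease and grace period.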
Failover walkthrough (diagram steps)
- Client holds lease M1 (local lease C1) and has a pending KeepAlive.
- Master starts lease M2 and replies to the KeepAlive.
- Client extends local lease to C2 and sends a new KeepAlive. Master dies before replying. No new leases can be assigned. C2 expires; client flushes cache and enters jeopardy. Grace period starts.
- New master is elected. It uses a conservative estimate M3 for the session lease the old master may have granted. Client sends KeepAlive to new master.
- First KeepAlive is rejected -- wrong epoch number (see next section).
- Client retries with corrected epoch.
- KeepAlive succeeds. Client extends lease to C3 and exits jeopardy (safe).
- Normal KeepAlive protocol resumes.
- Because the grace period covered the gap between C2 expiry and C3 start, the client experienced only a delay. If the grace period had been shorter than that gap, the client would have reported session expiry to the application.
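The rejected-then-retried KeepAlive in the walkthrough can be sketched as follows. All names are hypothetical. The sketch assumes that the rejection tells the client the current epoch, which this section does not spell out.

```python
class StaleEpochError(Exception):
    """Raised by the sketch master when a request carries an old epoch."""

    def __init__(self, current_epoch: int):
        super().__init__(f"stale epoch; current epoch is {current_epoch}")
        self.current_epoch = current_epoch


class Master:
    def __init__(self, epoch: int):
        self.epoch = epoch
        self.rejected = 0

    def keep_alive(self, client_epoch: int) -> float:
        # Requests addressed to a previous master's epoch are refused.
        if client_epoch != self.epoch:
            self.rejected += 1
            raise StaleEpochError(self.epoch)
        return 12.0  # lease extension in seconds (default from the text)


def keep_alive_with_retry(master: Master, client_epoch: int):
    """Send a KeepAlive; on an epoch mismatch, retry once with the new epoch."""
    try:
        return client_epoch, master.keep_alive(client_epoch)
    except StaleEpochError as err:
        return err.current_epoch, master.keep_alive(err.current_epoch)


new_master = Master(epoch=8)
epoch, extension = keep_alive_with_retry(new_master, client_epoch=7)
```

Here the first request carries epoch 7 and is rejected, and the retry with epoch 8 succeeds, so the client can extend its local lease and leave jeopardy.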
The 45-second grace period is not infinite. Applications must handle session expiry: re-acquire locks, re-read state, and re-register for events. Designing your application to recover from Chubby session loss is critical -- Google learned this the hard way when developers assumed Chubby would always be available (see Scaling Chubby).