Sessions and Events

What happens when the network between a client and the Chubby master goes silent for 30 seconds? Does the client lose all its locks? Does it keep operating on stale data? Chubby's session and lease protocol defines exactly when a client's state is valid -- and what happens when it isn't.

Think first

A client holds locks and cached data in Chubby. The network goes silent for 30 seconds. Should the client immediately assume its locks are lost, or should it wait? How would you design the session lifecycle to handle this uncertainty?

What is a Chubby session?

A session is the relationship between a Chubby cell and a single client. All of a client's handles, locks, and cached data are valid only as long as its session is valid.

Sessions are maintained through periodic handshakes called KeepAlives.

Session lifecycle

Event	Behavior
Creation	Client requests a new session on first contact with the master
Active	Client and master exchange KeepAlive RPCs; session lease is periodically extended
Idle timeout	Session expires if no open handles and no calls for one minute
Explicit end	Client terminates the session

Each session has an associated lease -- a time interval during which the master guarantees it will not unilaterally terminate the session. The master extends the lease in three cases:

On session creation
When a master failover occurs
When the master responds to a KeepAlive RPC
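
The three extension points can be sketched in a few lines. This is a minimal illustration assuming a simplified in-memory master; `MasterSession`, `create_session`, `on_failover`, `on_keepalive_reply`, and the explicit `now` argument are hypothetical names chosen for the example, not Chubby's actual interface.

```python
from dataclasses import dataclass

DEFAULT_LEASE_SECONDS = 12.0  # default extension quoted in the text


@dataclass
class MasterSession:
    """Master-side view of one client session (hypothetical structure)."""
    session_id: str
    lease_expiry: float = 0.0

    def extend(self, now: float, duration: float = DEFAULT_LEASE_SECONDS) -> float:
        # A lease is a promise not to end the session early, so the expiry
        # time may only move forward, never backward.
        self.lease_expiry = max(self.lease_expiry, now + duration)
        return self.lease_expiry


def create_session(session_id: str, now: float) -> MasterSession:
    session = MasterSession(session_id)
    session.extend(now)          # case 1: session creation
    return session


def on_failover(session: MasterSession, now: float) -> float:
    return session.extend(now)   # case 2: a master failover occurred


def on_keepalive_reply(session: MasterSession, now: float) -> float:
    return session.extend(now)   # case 3: master responds to a KeepAlive
```

Using `max` in `extend` is a design choice for this sketch: it keeps a late or out-of-order extension from shortening a lease that was already promised.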

KeepAlive protocol

KeepAlive is the heartbeat mechanism that keeps sessions alive. The protocol works as follows:

Client sends a KeepAlive RPC to the master (step "1" in the diagram).
The master blocks the RPC -- it does not reply until the client's current lease is close to expiring.
The master replies (step "2"), informing the client of the new lease timeout (lease M2). The default extension is 12 seconds; an overloaded master may use longer intervals to reduce KeepAlive traffic.
The client immediately sends a new KeepAlive after receiving the reply, ensuring there is almost always a pending KeepAlive blocked at the master.
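
The timing of these steps can be sketched with a simulated clock. The example below assumes zero network delay and a fixed `REPLY_MARGIN` (how long before expiry the master releases the blocked RPC); both are simplifications introduced for illustration, and `simulate_keepalives` is a hypothetical helper.

```python
LEASE_EXTENSION = 12.0  # default extension quoted in the text
REPLY_MARGIN = 1.0      # assumed: master replies this long before the lease expires


def simulate_keepalives(rounds, start=0.0):
    """Return (send_time, reply_time, new_lease_expiry) for each KeepAlive.

    Network delay is ignored, so the client re-sends at the instant the
    previous reply arrives.
    """
    timeline = []
    lease_expiry = start + LEASE_EXTENSION  # lease granted at session creation
    send_time = start
    for _ in range(rounds):
        # The master blocks the RPC until the current lease is close to expiring.
        reply_time = max(send_time, lease_expiry - REPLY_MARGIN)
        # The reply carries the new lease timeout.
        lease_expiry = reply_time + LEASE_EXTENSION
        timeline.append((send_time, reply_time, lease_expiry))
        # The client immediately sends the next KeepAlive.
        send_time = reply_time
    return timeline
```

Each tuple shows the request sitting at the master for most of the lease interval, which is why there is almost always a pending KeepAlive for the master to reply on.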

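In the diagram below, thick arrows represent lease intervals, upward arrows are KeepAlive requests, and downward arrows are KeepAlive responses. Note the difference between master-side leases (M1, M2) and client-side local leases (C1, C2): the client's local lease is a conservative approximation of the master's lease, because the client cannot know exactly how long the reply spent in flight or how fast the master's clock is advancing.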

Interview angle

The KeepAlive pattern -- client sends a request, server holds it until the lease is about to expire, then replies with a new lease -- is a form of long polling. It needs only one round trip per lease extension while keeping the client informed of its lease status. It is also an instance of the general lease pattern: a time-bounded grant that lapses on its own unless it is renewed.

Session states

State	Trigger	Client behavior
Normal	KeepAlives succeeding	Cache enabled, locks valid
Jeopardy	Client's local lease expires without a KeepAlive response	Cache flushed and disabled; client uncertain whether master has terminated the session
Grace period	Jeopardy begins	Client waits an extra 45 seconds (default) to re-establish contact before giving up
Safe	Successful KeepAlive during grace period	Cache re-enabled; session continues
Expired	Grace period ends without successful KeepAlive	Session terminated; all handles, locks, and cached data invalidated
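
The table can be read as a small client-side state machine. The sketch below is one possible encoding under stated assumptions: `ClientSession`, `tick`, and `on_keepalive_reply` are hypothetical names, time is passed in explicitly, and "safe" is modeled as the event that returns the session to normal rather than as a separate long-lived state.

```python
from enum import Enum

GRACE_PERIOD = 45.0  # default grace period quoted in the text


class SessionState(Enum):
    NORMAL = "normal"
    JEOPARDY = "jeopardy"  # grace period is running
    EXPIRED = "expired"


class ClientSession:
    """Client-side session tracking (illustrative, not Chubby's library)."""

    def __init__(self, local_lease_expiry):
        self.state = SessionState.NORMAL
        self.local_lease_expiry = local_lease_expiry
        self.grace_deadline = None
        self.cache_enabled = True
        self.events = []  # events that would be delivered to the application

    def tick(self, now):
        """Advance the state machine to time `now`."""
        if self.state is SessionState.NORMAL and now >= self.local_lease_expiry:
            # Local lease ran out with no KeepAlive reply: flush and disable cache.
            self.state = SessionState.JEOPARDY
            self.cache_enabled = False
            self.grace_deadline = self.local_lease_expiry + GRACE_PERIOD
            self.events.append("jeopardy")
        if self.state is SessionState.JEOPARDY and now >= self.grace_deadline:
            # Grace period ended without contact: the session is gone for good.
            self.state = SessionState.EXPIRED
            self.events.append("expired")

    def on_keepalive_reply(self, new_local_lease_expiry):
        """Handle a successful KeepAlive reply."""
        if self.state is SessionState.EXPIRED:
            return  # an expired session cannot be revived
        self.local_lease_expiry = new_local_lease_expiry
        if self.state is SessionState.JEOPARDY:
            self.state = SessionState.NORMAL
            self.cache_enabled = True
            self.events.append("safe")
```

For example, a session whose lease ends at t=12 enters jeopardy on `tick(13.0)`, and a reply arriving before t=57 brings it back with a "safe" event.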

Piggybacking events: KeepAlive replies carry event notifications and cache invalidations back to the client, avoiding the need for a separate notification channel.
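
A rough sketch of what piggybacking means for the client: one reply carries the lease extension, the invalidations, and the events together. The field names, the event kind `"file_modified"`, and the example paths are assumptions made for illustration; the real wire format is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class KeepAliveReply:
    """One KeepAlive reply with piggybacked payloads (hypothetical layout)."""
    lease_seconds: float
    events: list = field(default_factory=list)         # (kind, path) pairs
    invalidations: list = field(default_factory=list)  # cached paths to drop


def apply_reply(cache, reply, handlers):
    """Apply invalidations, dispatch events, and return the lease extension."""
    for path in reply.invalidations:
        cache.pop(path, None)        # drop stale entries; ignore unknown paths
    for kind, path in reply.events:
        handler = handlers.get(kind)
        if handler is not None:
            handler(path)            # deliver the event to the application
    return reply.lease_seconds
```

Because everything rides on the reply the client is already waiting for, the client needs no second connection or listener for notifications.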

Failover scenario

When a master fails or otherwise loses mastership:

The failing master discards its in-memory state (sessions, handles, locks).
The session lease timer stops -- no leases expire during election. This is safe because stopping the timer is equivalent to extending every client's lease.
If master election completes quickly, clients reconnect before their local leases expire.
If election is slow, clients flush their caches (enter jeopardy) and wait out the grace period (45 seconds by default) while trying to find the new master.

Failover walkthrough (diagram steps)

Client holds lease M1 (local lease C1) and has a pending KeepAlive.
Master starts lease M2 and replies to the KeepAlive.
Client extends local lease to C2 and sends a new KeepAlive. Master dies before replying. No new leases can be assigned. C2 expires; client flushes cache and enters jeopardy. Grace period starts.
New master is elected. It uses a conservative estimate M3 for the session lease the old master may have granted. Client sends KeepAlive to new master.
First KeepAlive is rejected -- wrong epoch number (see next section).
Client retries with corrected epoch.
KeepAlive succeeds. Client extends lease to C3 and exits jeopardy (safe).
Normal KeepAlive protocol resumes.
Because the grace period covered the gap between C2 expiry and C3 start, the client experienced only a delay. If the grace period had been shorter than that gap, the client would have reported session expiry to the application.
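
The arithmetic behind that conclusion is simple enough to write down. In this sketch, `failover_outcome` is a hypothetical helper and all times are seconds on the client's clock; it compares the gap between local lease expiry and the next successful KeepAlive against the grace period.

```python
GRACE_PERIOD = 45.0  # default grace period quoted in the text


def failover_outcome(local_lease_expired_at, contact_restored_at,
                     grace_period=GRACE_PERIOD):
    """Classify what the application observes across a master failover."""
    if contact_restored_at <= local_lease_expired_at:
        return "normal"   # reconnected before the local lease ran out
    gap = contact_restored_at - local_lease_expired_at
    # Inside the grace period the client only sees a delay; past it, expiry.
    return "safe" if gap < grace_period else "expired"
```

With the default of 45 seconds, a 30-second gap yields "safe", while the same gap with `grace_period=0` yields "expired", which is the situation the quiz below asks about.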

Warning

The 45-second grace period is not infinite. Applications must handle session expiry: re-acquire locks, re-read state, and re-register for events. Designing your application to recover from Chubby session loss is critical -- Google learned this the hard way when developers assumed Chubby would always be available (see Scaling Chubby).
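
One way to structure that recovery is to rebuild all session-scoped state from scratch whenever a session is lost. The sketch below is only an outline: `SessionExpired`, `run_with_recovery`, and the `open_session`, `setup`, and `work` callables are hypothetical stand-ins for whatever a real client library provides.

```python
class SessionExpired(Exception):
    """Raised by the (hypothetical) client library when the session is lost."""


def run_with_recovery(open_session, setup, work, max_attempts=3):
    """Run `work`, rebuilding all session-scoped state after each expiry."""
    for _ in range(max_attempts):
        session = open_session()      # fresh session; old handles are dead
        try:
            # Re-acquire locks, re-read state, re-register for events.
            state = setup(session)
            return work(session, state)
        except SessionExpired:
            continue                  # nothing from the old session is reusable
    raise RuntimeError("session could not be re-established")
```

The key point of the design is that `setup` runs again on every attempt, so no lock, handle, or cached value from an expired session is ever carried over.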

Quiz

What would happen if Chubby reduced its grace period from 45 seconds to 0 seconds (no grace period at all)?

Session reliability would improve because stale sessions would be cleaned up faster.

Every master failover or brief network hiccup would cause mass session expiry across all connected clients, forcing them to re-acquire locks, re-read state, and re-register for events -- creating a thundering herd that could overwhelm the new master.

There would be minimal impact because master elections are rare.

Clients would simply reconnect to the new master instantly.
