8 Lease
A client acquires a lock on a resource. Then it crashes. The lock is never released. The resource is now stuck -- unavailable to everyone -- until an administrator intervenes.
This is one of the oldest problems in distributed systems. Locks are essential for coordination, but they're dangerous because they assume the lock holder will always be around to release them.
Background
In distributed systems, clients frequently need exclusive access to a resource -- a file, a configuration entry, a piece of shared state. The natural solution is a distributed lock: acquire the lock, do your work, release it.
The problem: what if the client dies, hangs, or loses network connectivity after acquiring the lock but before releasing it? The lock is now held indefinitely by a process that will never release it. This is called a zombie lock, and it can make critical resources permanently unavailable.
You could have a monitoring system that detects dead lock holders and forcefully revokes their locks. But how do you know the difference between "dead" and "slow"? A client experiencing a GC pause or network delay might look dead but could still be holding and using the resource.
Definition
A lease is a lock with a built-in expiration time. When the time runs out, the lease automatically expires -- regardless of whether the holder is alive or dead. If the holder wants to keep the lease, it must explicitly renew it before expiration.
How it works
- Client requests a lease from the coordination service (e.g., Chubby, ZooKeeper) for a specific resource
- The service grants a lease for a fixed duration (e.g., 12 seconds)
- Client does its work while holding the lease
- Before the lease expires, the client sends a renewal request to extend it
- If the client fails to renew (because it crashed, lost connectivity, etc.), the lease expires automatically and the resource becomes available for other clients
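The steps above can be sketched as a small in-memory lease table. This is a minimal illustration, not the Chubby or ZooKeeper API: the names `Lease`, `LeaseService`, `acquire`, and `renew` are hypothetical, and the clock is injected so that expiry can be exercised without sleeping.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    resource: str
    holder: str
    expires_at: float  # service-side clock reading at which the lease lapses


class LeaseService:
    """Hypothetical in-memory lease table (illustration only)."""

    def __init__(self, clock: Callable[[], float], duration: float = 12.0):
        self._clock = clock        # injected clock, e.g. time.monotonic
        self._duration = duration  # fixed lease length in seconds
        self._leases: Dict[str, Lease] = {}

    def _live(self, resource: str) -> Optional[Lease]:
        """Return the unexpired lease on a resource, dropping a lapsed one."""
        lease = self._leases.get(resource)
        if lease is not None and lease.expires_at <= self._clock():
            # Expiry needs no action from the holder: a lapsed lease is
            # simply discarded the next time anyone looks at it.
            del self._leases[resource]
            return None
        return lease

    def acquire(self, resource: str, client: str) -> Optional[Lease]:
        """Grant a lease unless another client holds an unexpired one."""
        current = self._live(resource)
        if current is not None and current.holder != client:
            return None
        lease = Lease(resource, client, self._clock() + self._duration)
        self._leases[resource] = lease
        return lease

    def renew(self, resource: str, client: str) -> bool:
        """Extend the lease; fails if it already lapsed or belongs to someone else."""
        current = self._live(resource)
        if current is None or current.holder != client:
            return False
        current.expires_at = self._clock() + self._duration
        return True
```

With a 12-second duration, a client that acquires at t=0 and renews at t=10 holds the lease until t=22. If it then crashes, a second client's `acquire` starts succeeding at t=22 without any administrator stepping in.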
| Property | Lock | Lease |
|---|---|---|
| Duration | Indefinite (until explicitly released) | Time-bounded (auto-expires) |
| Client crash | Resource stuck forever | Resource freed after timeout |
| Complexity | Simple acquire/release | Must handle renewals |
| Safety during partitions | Dangerous (holder may be unreachable while the lock stays held) | Bounded (lease eventually expires, though a paused or partitioned holder may not notice right away) |
Leases convert a liveness problem (waiting for a dead client to release a lock) into a performance problem (waiting for a timeout). The resource is never permanently stuck -- after the holder's last successful renewal, it stays unavailable for at most one lease duration.
Lease duration trade-offs
| Short lease (seconds) | Long lease (minutes/hours) |
|---|---|
| Fast recovery from client failures | Slow recovery from failures |
| More renewal traffic (network overhead) | Less renewal overhead |
| Risk of false expiry during GC pauses or network blips | Tolerates GC pauses and network blips, but a dead holder's lease lingers longer |
The right duration depends on how expensive a false expiry is versus how long you can tolerate a dead lease holder blocking the resource.
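To make the trade-off concrete, here is a rough back-of-the-envelope calculator. It is a sketch under stated assumptions: `LeasePolicy` and `renew_fraction` are hypothetical names, every client is assumed to renew after a fixed fraction of the lease has elapsed, and network latency is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeasePolicy:
    duration_s: float      # lease length granted by the service
    renew_fraction: float  # assumption: renew once this fraction of the lease has elapsed

    def __post_init__(self) -> None:
        if self.duration_s <= 0 or not 0 < self.renew_fraction < 1:
            raise ValueError("need duration_s > 0 and 0 < renew_fraction < 1")

    @property
    def renew_interval_s(self) -> float:
        return self.duration_s * self.renew_fraction

    def renewals_per_second(self, clients: int) -> float:
        # Steady-state renewal traffic reaching the service.
        return clients / self.renew_interval_s

    def worst_case_recovery_s(self) -> float:
        # Holder crashes right after a successful renewal.
        return self.duration_s

    def tolerated_stall_s(self) -> float:
        # Longest pause, starting when a renewal falls due, that does
        # not cause a false expiry.
        return self.duration_s - self.renew_interval_s


short = LeasePolicy(duration_s=10.0, renew_fraction=0.5)
long_ = LeasePolicy(duration_s=600.0, renew_fraction=0.5)
```

Under these assumptions and with 3,000 clients, the 10-second lease costs 600 renewals per second, recovers from a crash within 10 seconds, and survives a 5-second stall. The 10-minute lease costs 10 renewals per second and survives a 5-minute stall, but a dead holder can block the resource for up to 10 minutes.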
Examples
Chubby
Chubby clients maintain a session lease with the Chubby master. The default lease duration is 12 seconds. The client must send KeepAlive messages (similar to heartbeats) before the lease expires. If the session lease lapses:
- All locks held by that client are released
- All cached data is invalidated
- Other clients can now acquire those locks
This is the canonical example of leases in distributed systems -- Chubby's entire coordination model is built on them.
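A client-side view of a session lease kept alive by KeepAlive messages might look like the sketch below. It is not Chubby's client library: `ClientSession`, `keepalive_acked`, and `check` are hypothetical names, and the model is simplified so that the client drops its locks and cache as soon as its own estimate of the lease runs out. The one design point worth noting is that the new expiry is measured from when the KeepAlive was sent, not from when the reply arrived, so the client's estimate never outlives the server's.

```python
from typing import Callable, Dict, Set


class ClientSession:
    """Hypothetical client-side session lease tracker (illustration only)."""

    def __init__(self, clock: Callable[[], float], lease_s: float = 12.0):
        self._clock = clock
        self._lease_s = lease_s
        self._expires_at = clock() + lease_s
        self.locks: Set[str] = set()    # locks this client believes it holds
        self.cache: Dict[str, str] = {}  # locally cached data

    def keepalive_acked(self, sent_at: float) -> None:
        """Record a KeepAlive reply for a request sent at `sent_at`."""
        if self.expired():
            return  # the reply arrived too late to save the session
        # Conservative: the server extended the lease no earlier than
        # sent_at, so sent_at + lease_s cannot overshoot its expiry
        # (assuming both clocks tick at a similar rate).
        self._expires_at = max(self._expires_at, sent_at + self._lease_s)

    def expired(self) -> bool:
        return self._clock() >= self._expires_at

    def check(self) -> bool:
        """Call before using a lock or cached value."""
        if self.expired():
            self.locks.clear()   # stop acting as a lock holder
            self.cache.clear()   # cached data can no longer be trusted
            return False
        return True
```

Calling `check()` before every use of a lock or cached value keeps a client that was paused past its lease from acting on state it no longer owns once it resumes.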
GFS
GFS uses leases to keep a consistent mutation order across the replicas of a chunk. The master grants a chunk lease, with an initial timeout of 60 seconds, to one replica (the primary) of each chunk being mutated. The primary picks the order of all mutations to that chunk during the lease period and can ask the master for extensions while writes continue. If the primary fails or becomes unreachable, the master waits for the lease to expire, then grants a new lease to another replica. This ensures at most one replica acts as primary for a chunk at any time.
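The rule that the master waits for the old lease to expire before choosing a new primary can be sketched as follows. This is an illustration, not GFS code: `ChunkLeaseTable` and `primary_for` are hypothetical names, liveness is passed in as a plain set, and lease extensions are left out.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

LEASE_S = 60.0  # lease timeout used in this sketch


class ChunkLeaseTable:
    """Hypothetical master-side table of per-chunk primary leases."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        # chunk id -> (primary replica, lease expiry time)
        self._primary: Dict[str, Tuple[str, float]] = {}

    def primary_for(self, chunk: str, replicas: List[str],
                    alive: Set[str]) -> Optional[str]:
        now = self._clock()
        held = self._primary.get(chunk)
        if held is not None and now < held[1]:
            # The lease has not expired, so the old primary keeps it even
            # if it looks dead: it might only be slow or partitioned and
            # still ordering writes. Mutations stall until expiry.
            return held[0]
        # No lease, or the old one lapsed: safe to pick a new primary.
        for replica in replicas:
            if replica in alive:
                self._primary[chunk] = (replica, now + LEASE_S)
                return replica
        return None  # no live replica to grant the lease to
```

If `r1` is granted the lease at t=0 and disappears at t=30, the table keeps answering `r1` until t=60 and only then hands the lease to `r2`. The wait is what rules out two primaries existing at once.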
HDFS
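HDFS uses leases for write access to files. A client writing to a file holds a lease on it and renews the lease periodically. The lease is bounded by two limits: a soft limit and a hard limit. Until the soft limit expires, the writer has exclusive access to the file. If the soft limit expires and the client has neither renewed the lease nor closed the file, another client can preempt the lease. If the hard limit also expires without a renewal, HDFS assumes the client has quit, closes the file on the writer's behalf, and recovers the lease. The hard limit acts as a final safety net.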
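The two expiry thresholds can be modeled as a function of the time since the last renewal. This is a sketch, not HDFS code: `LeaseState` and `lease_state` are hypothetical names, and the default values of 60 seconds and one hour are illustrative, since the actual limits depend on the HDFS version and configuration.

```python
from enum import Enum


class LeaseState(Enum):
    EXCLUSIVE = "exclusive"      # renewed recently; nobody may take over
    PREEMPTIBLE = "preemptible"  # soft threshold passed; another client may claim the file
    EXPIRED = "expired"          # hard threshold passed; the system reclaims the lease


def lease_state(last_renewed: float, now: float,
                soft_limit_s: float = 60.0,      # illustrative value
                hard_limit_s: float = 3600.0     # illustrative value
                ) -> LeaseState:
    """Classify a write lease by how long ago it was last renewed."""
    if soft_limit_s > hard_limit_s:
        raise ValueError("soft limit must not exceed hard limit")
    idle = now - last_renewed
    if idle >= hard_limit_s:
        return LeaseState.EXPIRED
    if idle >= soft_limit_s:
        return LeaseState.PREEMPTIBLE
    return LeaseState.EXCLUSIVE
```

A renewal resets `last_renewed`, so a healthy writer stays in `EXCLUSIVE` for as long as it keeps renewing. Only a writer that has gone silent drifts through `PREEMPTIBLE` to `EXPIRED`.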
Leases come up whenever you need to answer "How do you prevent a failed node from holding a resource forever?" The answer: time-bound all resource grants. This applies to leader election (the leader's authority is a lease), cache coherence (cached data has a TTL), and lock management. Leases are the distributed systems equivalent of a dead man's switch.