Scaling Chubby
At Google, a single Chubby master has served 90,000+ clients simultaneously. Since Chubby clients are individual processes (not machines), the client count scales with the number of processes running, which grows much faster than the machine count. How does a 5-node cell handle that load without collapsing?
Scaling techniques
| Technique | How it reduces load |
|---|---|
| More cells | Deploy additional Chubby cells so clients use a nearby cell (found via DNS). Reduces reliance on remote machines. |
| Longer leases | Increasing client lease time from 12s to 60s cuts KeepAlive traffic (the dominant request type) by ~5x. |
| Aggressive caching | Clients cache file data, metadata, handles, and even the absence of files. Most reads never reach the master. |
| Proxies | Intermediate servers that handle KeepAlives and reads on behalf of clients. |
| Partitioning | Splitting the namespace across multiple Chubby cells. |
Proxies
A proxy server sits between clients and the Chubby master, handling KeepAlives and read requests on the master's behalf.
| Operation | With proxy | Without proxy |
|---|---|---|
| KeepAlives | Proxy consolidates N client KeepAlives into fewer master KeepAlives (N:1 reduction) | Each client sends its own KeepAlive |
| Reads (cached) | Served from proxy cache | Served from client cache |
| Reads (first-time) | Extra hop: client -> proxy -> master | Direct: client -> master |
| Writes | Extra hop: client -> proxy -> master | Direct: client -> master |
The extra RPC hop for writes and first-time reads is acceptable because Chubby is a read-heavy service.
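The KeepAlive consolidation can be sketched in a few lines. This is an illustrative model, not Chubby's actual API: the proxy answers each client's KeepAlive locally and maintains a single session with the master on behalf of all of them.

```python
import time

class KeepAliveProxy:
    """Sketch of a proxy consolidating client KeepAlives (names are
    illustrative, not Chubby's real interface)."""

    def __init__(self, master):
        self.master = master
        self.client_leases = {}  # client_id -> local lease expiry

    def client_keepalive(self, client_id, lease_seconds=12):
        # Handled entirely at the proxy: no per-client RPC to the master.
        self.client_leases[client_id] = time.monotonic() + lease_seconds
        return lease_seconds

    def master_keepalive(self):
        # One RPC renews the proxy's master session for all its clients.
        return self.master.keepalive(proxy_id=id(self))


class FakeMaster:
    """Stand-in master that just counts KeepAlive RPCs."""

    def __init__(self):
        self.keepalive_count = 0

    def keepalive(self, proxy_id):
        self.keepalive_count += 1
        return 60  # granted lease, in seconds


master = FakeMaster()
proxy = KeepAliveProxy(master)
for n in range(1000):
    proxy.client_keepalive(f"client-{n}")
proxy.master_keepalive()
# 1000 client KeepAlives produced a single master RPC: the N:1 reduction.
```

The same shape explains why proxies help reads too: the proxy's cache answers repeated reads, and only cold misses pay the extra hop to the master.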
Proxies are a general scaling pattern for leader-based systems. When you mention "all reads/writes go through a single master" in an interview, the natural follow-up is "how does that scale?" Proxies that absorb KeepAlive and read traffic are the standard answer. The same pattern applies to ZooKeeper observers and Kafka follower fetching.
Partitioning
Chubby's file/directory interface was designed to support namespace partitioning across cells:
- `/ls/cell/foo` and everything below it can be served by one Chubby cell
- `/ls/cell/bar` and everything below it can be served by another
This reduces per-cell read/write traffic proportionally.
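One common way to implement this routing -- a sketch under the assumption that a node is assigned to a partition by hashing its parent directory, so siblings always land together -- looks like:

```python
import hashlib

NUM_PARTITIONS = 3  # illustrative number of partitions


def partition_for(path: str) -> int:
    """Route a node to a partition by hashing its parent directory.

    Hashing the parent (rather than the full path) keeps everything in
    one directory on one partition, so most single-directory operations
    touch only one cell. Names and scheme here are illustrative.
    """
    parent = path.rsplit("/", 1)[0] or "/"
    digest = hashlib.sha256(parent.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Siblings share a partition; different subtrees may land elsewhere.
assert partition_for("/ls/cell/foo/a") == partition_for("/ls/cell/foo/b")
```

Note how this scheme also explains the cross-partition limitations below: deleting `/ls/cell/foo` itself is routed by `/ls/cell`'s hash, which may be a different partition from `/ls/cell/foo`'s children.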
Partitioning limitations:
| Limitation | Explanation |
|---|---|
| Cross-partition directory deletes | Deleting a directory may require coordination across partition boundaries |
| KeepAlive traffic | Partitioning does not reduce KeepAlive traffic (each client still maintains a session per cell) |
| ACL lookups | ACLs stored in one partition may need cross-partition calls from another |
Lessons from production
Google's operational experience with Chubby revealed several anti-patterns that required intervention:
Lack of aggressive caching
Early clients failed to cache file absence or open handles. This led to pathological behavior:
- Retry loops that poll for non-existent files indefinitely
- Repeated open/close cycles on files that should stay open
Chubby's team addressed this by caching negative lookups (the absence of files) and reusing open handles in the client library, and by educating users to rely on that caching rather than polling.
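Negative caching is the key fix for the retry-loop anti-pattern. A minimal sketch (illustrative API, not Chubby's client library) that caches both file contents and file absence:

```python
class CachingClient:
    """Sketch of a client cache that also remembers missing files."""

    _ABSENT = object()  # sentinel: "we know this file does not exist"

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def read(self, path):
        if path in self.cache:
            hit = self.cache[path]
            return None if hit is self._ABSENT else hit
        data = self.master.read(path)  # one RPC on the first miss
        self.cache[path] = self._ABSENT if data is None else data
        return data

    def invalidate(self, path):
        # Called when the master pushes an invalidation for this path.
        self.cache.pop(path, None)


class FakeMaster:
    """Stand-in master that counts read RPCs."""

    def __init__(self, files):
        self.files = files
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files.get(path)


master = FakeMaster({"/ls/cell/ready": b"ok"})
client = CachingClient(master)
for _ in range(100):  # a polling loop waiting on a non-existent file
    client.read("/ls/cell/absent")
# Only the first poll reached the master; 99 hit the negative cache.
```

Without the `_ABSENT` entry, each of those 100 polls would be a master RPC -- exactly the pathological load the early clients generated.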
No storage quotas
Chubby was never intended for large data. Without quotas, some teams treated it as general-purpose storage. Chubby eventually introduced a 256 KB file size limit.
This is a common anti-pattern in any shared infrastructure service. If you design a coordination service, enforce size limits from day one. Retroactive limits are painful to impose on established users.
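Enforcing such a limit is a one-line check at write time. A sketch (the enforcement point and error type are illustrative; only the 256 KB figure comes from Chubby):

```python
MAX_FILE_BYTES = 256 * 1024  # Chubby's eventual per-file limit


def check_write(data: bytes) -> None:
    """Reject oversized writes up front, before touching storage."""
    if len(data) > MAX_FILE_BYTES:
        raise ValueError(
            f"write of {len(data)} bytes exceeds "
            f"{MAX_FILE_BYTES}-byte file size limit"
        )


check_write(b"x" * 1024)  # small metadata: accepted
try:
    check_write(b"x" * (1 << 20))  # a 1 MiB blob: rejected
except ValueError:
    pass
```

The check is trivial; the hard part, as the section notes, is that adding it *after* teams have stored large files in production forces a painful migration.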
Misuse as publish/subscribe
Teams attempted to use Chubby's event mechanism as a pub/sub system. Because Chubby maintains a strongly consistent cache, every event triggers invalidation across all caching clients -- making it a slow, inefficient message bus. For pub/sub workloads, use Kafka or a dedicated message queue.
Developers assume 100% availability
Teams rarely planned for Chubby outages, assuming it would always be up. When Chubby experienced even brief downtime, dependent services failed catastrophically. Chubby's team eventually intentionally scheduled brief outages to force dependent teams to build resilience -- testing in production by design.
The "planned outage" strategy is worth mentioning in system design interviews. Netflix's Chaos Monkey follows the same philosophy: deliberately inject failures so teams build systems that handle them gracefully. If your design depends on a single coordination service, explain how your application degrades gracefully when that service is briefly unavailable.
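Graceful degradation usually means falling back to a last-known-good value when the coordination service is briefly unreachable. A minimal sketch, assuming a generic `chubby_read` callable and a local cache (all names illustrative):

```python
import time


def read_config_with_fallback(chubby_read, cache, path, retries=3):
    """Read from the coordination service; on outage, retry with
    exponential backoff, then serve a stale cached value."""
    for attempt in range(retries):
        try:
            value = chubby_read(path)
            cache[path] = value  # refresh last-known-good
            return value
        except ConnectionError:
            time.sleep(0.01 * (2 ** attempt))  # backoff before retry
    if path in cache:
        return cache[path]  # stale but usable during the outage
    raise RuntimeError(f"no cached value for {path} during outage")


def flaky_read(path):
    # Simulates a brief Chubby outage: every call fails.
    raise ConnectionError("Chubby cell unavailable")


cache = {"/ls/cell/config": b"v1"}
result = read_config_with_fallback(flaky_read, cache, "/ls/cell/config")
# The service keeps running on the stale value instead of failing.
```

The trade-off to state in an interview: serving the stale value sacrifices freshness for availability, which is fine for configuration reads but never acceptable for lock acquisition itself.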