Scaling Chubby
At Google, a single Chubby master has served 90,000+ clients simultaneously. Since Chubby clients are individual processes (not machines), the client count scales with the number of processes running, which grows much faster than the machine count. How does a 5-node cell handle that load without collapsing?
Scaling techniques
| Technique | How it reduces load |
|---|---|
| More cells | Deploy additional Chubby cells so clients use a nearby cell (found via DNS). Reduces reliance on remote machines. |
| Longer leases | Increasing client lease time from 12s to 60s cuts KeepAlive traffic (the dominant request type) by ~5x. |
| Aggressive caching | Clients cache file data, metadata, handles, and even the absence of files. Most reads never reach the master. |
| Proxies | Intermediate servers that handle KeepAlives and reads on behalf of clients. |
| Partitioning | Splitting the namespace across multiple Chubby cells. |
Proxies
A proxy server sits between clients and the Chubby master, handling KeepAlives and read requests on the master's behalf.
| Operation | With proxy | Without proxy |
|---|---|---|
| KeepAlives | Proxy consolidates N client KeepAlives into fewer master KeepAlives (N:1 reduction) | Each client sends its own KeepAlive |
| Reads (cached) | Served from proxy cache | Served from client cache |
| Reads (first-time) | Extra hop: client -> proxy -> master | Direct: client -> master |
| Writes | Extra hop: client -> proxy -> master | Direct: client -> master |
The extra RPC hop for writes and first-time reads is acceptable because Chubby is a read-heavy service.
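The KeepAlive consolidation can be sketched in a few lines. This is an illustrative model, not Chubby's actual API: the proxy answers each client's KeepAlive locally and maintains a single session with the master on behalf of all of them.

```python
import time

class KeepAliveProxy:
    """Sketch of a proxy consolidating client KeepAlives (names are
    illustrative, not Chubby's real interface)."""

    def __init__(self, master):
        self.master = master
        self.client_leases = {}  # client_id -> local lease expiry

    def client_keepalive(self, client_id, lease_seconds=12):
        # Handled entirely at the proxy: no per-client RPC to the master.
        self.client_leases[client_id] = time.monotonic() + lease_seconds
        return lease_seconds

    def master_keepalive(self):
        # One RPC renews the proxy's master session for all its clients.
        return self.master.keepalive(proxy_id=id(self))


class FakeMaster:
    """Stand-in master that just counts KeepAlive RPCs."""

    def __init__(self):
        self.keepalive_count = 0

    def keepalive(self, proxy_id):
        self.keepalive_count += 1
        return 60  # granted lease, in seconds


master = FakeMaster()
proxy = KeepAliveProxy(master)
for n in range(1000):
    proxy.client_keepalive(f"client-{n}")
proxy.master_keepalive()
# 1000 client KeepAlives produced a single master RPC: the N:1 reduction.
```

The same shape explains why proxies help reads too: the proxy's cache answers repeated reads, and only cold misses pay the extra hop to the master.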
Proxies are a general scaling pattern for leader-based systems. When you mention "all reads/writes go through a single master" in an interview, the natural follow-up is "how does that scale?" Proxies that absorb KeepAlive and read traffic are the standard answer. The same pattern applies to ZooKeeper observers and Kafka follower fetching.
Partitioning
Chubby's file/directory interface was designed to support namespace partitioning across cells:
- `/ls/cell/foo` and everything below it can be served by one Chubby cell
- `/ls/cell/bar` and everything below it can be served by another
This reduces per-cell read/write traffic proportionally.
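One common way to implement this routing -- a sketch under the assumption that a node is assigned to a partition by hashing its parent directory, so siblings always land together -- looks like:

```python
import hashlib

NUM_PARTITIONS = 3  # illustrative number of partitions


def partition_for(path: str) -> int:
    """Route a node to a partition by hashing its parent directory.

    Hashing the parent (rather than the full path) keeps everything in
    one directory on one partition, so most single-directory operations
    touch only one cell. Names and scheme here are illustrative.
    """
    parent = path.rsplit("/", 1)[0] or "/"
    digest = hashlib.sha256(parent.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Siblings share a partition; different subtrees may land elsewhere.
assert partition_for("/ls/cell/foo/a") == partition_for("/ls/cell/foo/b")
```

Note how this scheme also explains the cross-partition limitations below: deleting `/ls/cell/foo` itself is routed by `/ls/cell`'s hash, which may be a different partition from `/ls/cell/foo`'s children.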
Partitioning limitations:
| Limitation | Explanation |
|---|---|
| Cross-partition directory deletes | Deleting a directory may require coordination across partition boundaries |
| KeepAlive traffic | Partitioning does not reduce KeepAlive traffic (each client still maintains a session per cell) |
| ACL lookups | ACLs stored in one partition may need cross-partition calls from another |
Lessons from production
Google's operational experience with Chubby revealed several anti-patterns that required intervention:
Lack of aggressive caching
Early clients failed to cache file absence or open handles. This led to pathological behavior:
- Retry loops that poll for non-existent files indefinitely
- Repeated open/close cycles on files that should stay open
Chubby's team addressed this by caching negative lookups (the absence of files) and reusing open handles in the client library, and by educating users to rely on that caching rather than polling.
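Negative caching is the key fix for the retry-loop anti-pattern. A minimal sketch (illustrative API, not Chubby's client library) that caches both file contents and file absence:

```python
class CachingClient:
    """Sketch of a client cache that also remembers missing files."""

    _ABSENT = object()  # sentinel: "we know this file does not exist"

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def read(self, path):
        if path in self.cache:
            hit = self.cache[path]
            return None if hit is self._ABSENT else hit
        data = self.master.read(path)  # one RPC on the first miss
        self.cache[path] = self._ABSENT if data is None else data
        return data

    def invalidate(self, path):
        # Called when the master pushes an invalidation for this path.
        self.cache.pop(path, None)


class FakeMaster:
    """Stand-in master that counts read RPCs."""

    def __init__(self, files):
        self.files = files
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files.get(path)


master = FakeMaster({"/ls/cell/ready": b"ok"})
client = CachingClient(master)
for _ in range(100):  # a polling loop waiting on a non-existent file
    client.read("/ls/cell/absent")
# Only the first poll reached the master; 99 hit the negative cache.
```

Without the `_ABSENT` entry, each of those 100 polls would be a master RPC -- exactly the pathological load the early clients generated.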
No storage quotas
Chubby was never intended for large data. Without quotas, some teams treated it as general-purpose storage. Chubby eventually introduced a 256 KB file size limit.
This is a common anti-pattern in any shared infrastructure service. If you design a coordination service, enforce size limits from day one. Retroactive limits are painful to impose on established users.
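Enforcing such a limit is a one-line check at write time. A sketch (the enforcement point and error type are illustrative; only the 256 KB figure comes from Chubby):

```python
MAX_FILE_BYTES = 256 * 1024  # Chubby's eventual per-file limit


def check_write(data: bytes) -> None:
    """Reject oversized writes up front, before touching storage."""
    if len(data) > MAX_FILE_BYTES:
        raise ValueError(
            f"write of {len(data)} bytes exceeds "
            f"{MAX_FILE_BYTES}-byte file size limit"
        )


check_write(b"x" * 1024)  # small metadata: accepted
try:
    check_write(b"x" * (1 << 20))  # a 1 MiB blob: rejected
except ValueError:
    pass
```

The check is trivial; the hard part, as the section notes, is that adding it *after* teams have stored large files in production forces a painful migration.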
Misuse as publish/subscribe
Teams attempted to use Chubby's event mechanism as a pub/sub system. Because Chubby maintains a strongly consistent cache, every event triggers invalidation across all caching clients -- making it a slow, inefficient message bus. For pub/sub workloads, use Kafka or a dedicated message queue.
Developers assume 100% availability
Teams rarely planned for Chubby outages, assuming it would always be up. When Chubby experienced even brief downtime, dependent services failed catastrophically. Chubby's team eventually intentionally scheduled brief outages to force dependent teams to build resilience -- testing in production by design.
The "planned outage" strategy is worth mentioning in system design interviews. Netflix's Chaos Monkey follows the same philosophy: deliberately inject failures so teams build systems that handle them gracefully. If your design depends on a single coordination service, explain how your application degrades gracefully when that service is briefly unavailable.
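Graceful degradation usually means falling back to a last-known-good value when the coordination service is briefly unreachable. A minimal sketch, assuming a generic `chubby_read` callable and a local cache (all names illustrative):

```python
import time


def read_config_with_fallback(chubby_read, cache, path, retries=3):
    """Read from the coordination service; on outage, retry with
    exponential backoff, then serve a stale cached value."""
    for attempt in range(retries):
        try:
            value = chubby_read(path)
            cache[path] = value  # refresh last-known-good
            return value
        except ConnectionError:
            time.sleep(0.01 * (2 ** attempt))  # backoff before retry
    if path in cache:
        return cache[path]  # stale but usable during the outage
    raise RuntimeError(f"no cached value for {path} during outage")


def flaky_read(path):
    # Simulates a brief Chubby outage: every call fails.
    raise ConnectionError("Chubby cell unavailable")


cache = {"/ls/cell/config": b"v1"}
result = read_config_with_fallback(flaky_read, cache, "/ls/cell/config")
# The service keeps running on the stale value instead of failing.
```

The trade-off to state in an interview: serving the stale value sacrifices freshness for availability, which is fine for configuration reads but never acceptable for lock acquisition itself.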