Scaling Chubby

At Google, a single Chubby master has served 90,000+ clients simultaneously. Since Chubby clients are individual processes (not machines), this number grows fast. How does a 5-node cell handle that load without collapsing?

Think first
A single Chubby master serves 90,000+ clients. Since all reads and writes go through the master for consistency, how would you scale the system without sacrificing strong consistency guarantees?

Scaling techniques

| Technique | How it reduces load |
| --- | --- |
| More cells | Deploy additional Chubby cells so clients use a nearby cell (found via DNS). Reduces reliance on remote machines. |
| Longer leases | Increasing the client lease time from 12s to 60s cuts KeepAlive traffic (the dominant request type) by ~5x. |
| Aggressive caching | Clients cache file data, metadata, handles, and even the absence of files. Most reads never reach the master. |
| Proxies | Intermediate servers that handle KeepAlives and reads on behalf of clients. |
| Partitioning | Splitting the namespace across multiple Chubby cells. |
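The lease-extension math is worth making explicit: each client sends roughly one KeepAlive per lease period, so master-bound traffic scales with clients divided by lease length. A minimal sketch of that arithmetic (the client count and lease values come from the figures above; the function is illustrative, not a Chubby API):

```python
# Each client sends roughly one KeepAlive RPC per lease period, so the
# master's KeepAlive load is about num_clients / lease_seconds.
# Lengthening the lease from 12s to 60s therefore cuts traffic ~5x.

def keepalives_per_second(num_clients: int, lease_seconds: float) -> float:
    """Approximate KeepAlive RPCs/sec arriving at the master."""
    return num_clients / lease_seconds

short_lease = keepalives_per_second(90_000, 12)   # 7500 RPCs/sec
long_lease = keepalives_per_second(90_000, 60)    # 1500 RPCs/sec
print(short_lease / long_lease)                   # 5.0
```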

Proxies

A proxy server sits between clients and the Chubby master, handling KeepAlives and read requests on the master's behalf.

| Operation | With proxy | Without proxy |
| --- | --- | --- |
| KeepAlives | Proxy consolidates N client KeepAlives into fewer master KeepAlives (N:1 reduction) | Each client sends its own KeepAlive |
| Reads (cached) | Served from proxy cache | Served from client cache |
| Reads (first-time) | Extra hop: client -> proxy -> master | Direct: client -> master |
| Writes | Extra hop: client -> proxy -> master | Direct: client -> master |

The extra RPC hop for writes and first-time reads is acceptable because Chubby is a read-heavy service.
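The N:1 KeepAlive consolidation can be sketched as follows. This is a hypothetical toy, not Chubby's protocol: the proxy absorbs per-client KeepAlives locally and refreshes a single upstream session with the master on behalf of everyone it fronts.

```python
# Hypothetical sketch of proxy-side KeepAlive consolidation: client
# KeepAlives terminate at the proxy, and one upstream RPC per refresh
# cycle covers all fronted clients (N:1 reduction).
import time

class KeepAliveProxy:
    def __init__(self):
        self.sessions = {}       # client_id -> last KeepAlive timestamp
        self.master_rpcs = 0     # RPCs actually sent to the master

    def client_keepalive(self, client_id: str) -> None:
        # Absorbed locally; no per-client RPC reaches the master.
        self.sessions[client_id] = time.monotonic()

    def refresh_master_lease(self) -> int:
        # A single upstream RPC keeps the proxy's session alive on
        # behalf of every client it fronts.
        self.master_rpcs += 1
        return len(self.sessions)

proxy = KeepAliveProxy()
for i in range(1000):
    proxy.client_keepalive(f"client-{i}")
covered = proxy.refresh_master_lease()
print(covered, proxy.master_rpcs)   # 1000 clients covered by 1 master RPC
```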

Interview angle

Proxies are a general scaling pattern for leader-based systems. When you mention "all reads/writes go through a single master" in an interview, the natural follow-up is "how does that scale?" Proxies that absorb KeepAlive and read traffic are the standard answer. The same pattern applies to ZooKeeper observers and Kafka follower fetching.

Partitioning

Chubby's file/directory interface was designed to support namespace partitioning across cells:

  • /ls/cell/foo and everything below it can be served by one Chubby cell
  • /ls/cell/bar and everything below it can be served by another

This reduces per-cell read/write traffic proportionally.
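A partition router for such a scheme could be as simple as mapping the top-level subtree of a path to a cell. This is a hypothetical sketch with invented cell names, not Chubby's actual routing logic:

```python
# Hypothetical path-based routing: the subtree directly under
# /ls/cell/ determines which Chubby cell serves the request.
# Cell names and the fallback are invented for illustration.

PARTITIONS = {
    "foo": "cell-a",
    "bar": "cell-b",
}

def route(path: str) -> str:
    """Pick the cell that owns a path like /ls/cell/foo/lockfile."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "ls":
        raise ValueError(f"not a Chubby path: {path}")
    subtree = parts[2]
    return PARTITIONS.get(subtree, "cell-default")

print(route("/ls/cell/foo/lockfile"))  # cell-a
print(route("/ls/cell/bar/config"))    # cell-b
```

Because routing is a pure function of the path prefix, each cell serves its subtree independently; coordination is only needed for the cross-partition operations listed below.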

Partitioning limitations:

| Limitation | Explanation |
| --- | --- |
| Cross-partition directory deletes | Deleting a directory may require coordination across partition boundaries |
| KeepAlive traffic | Partitioning does not reduce KeepAlive traffic (each client still maintains a session per cell) |
| ACL lookups | ACLs stored in one partition may need cross-partition calls from another |

Lessons from production

Google's operational experience with Chubby revealed several anti-patterns that required intervention:

Lack of aggressive caching

Early clients failed to cache file absence or open handles. This led to pathological behavior:

  • Retry loops that poll for non-existent files indefinitely
  • Repeated open/close cycles on files that should stay open

The Chubby team later educated users to cache aggressively in these cases: cache the absence of files and keep handles open rather than repeatedly reopening them.
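Negative caching is the fix for the retry-loop pathology above: once a lookup misses, the client remembers the absence and stops asking the master. A minimal sketch, assuming a hypothetical `lookup()` API on both client and master (not Chubby's actual interface):

```python
# Sketch of client-side negative caching: the first miss is recorded
# with a sentinel, so retry loops polling for a non-existent file
# cost one RPC total instead of one RPC per poll.

class CachingClient:
    _ABSENT = object()   # sentinel meaning "known not to exist"

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def lookup(self, path: str):
        if path in self.cache:
            hit = self.cache[path]
            return None if hit is self._ABSENT else hit
        value = self.master.lookup(path)   # one RPC on first miss only
        self.cache[path] = self._ABSENT if value is None else value
        return value

class FakeMaster:
    def __init__(self):
        self.rpcs = 0
    def lookup(self, path):
        self.rpcs += 1
        return None   # file does not exist

master = FakeMaster()
client = CachingClient(master)
for _ in range(100):
    client.lookup("/ls/cell/missing")
print(master.rpcs)   # 1 -- absence is cached after the first miss
```

(A real client would also invalidate the negative entry when the master announces the file's creation, since Chubby caches are kept consistent by invalidation.)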

No storage quotas

Chubby was never intended for large data. Without quotas, some teams treated it as general-purpose storage. Chubby eventually introduced a 256 KB file size limit.
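Enforcing such a quota is cheapest at write time, before anything touches storage. A minimal sketch mirroring the 256 KB limit (the store, function, and exception type are invented for illustration):

```python
# Sketch of a write-time size quota mirroring Chubby's eventual
# 256 KB per-file limit. The dict-backed store and exception type
# are illustrative, not Chubby internals.

MAX_FILE_BYTES = 256 * 1024

class FileTooLargeError(Exception):
    pass

def set_contents(store: dict, path: str, data: bytes) -> None:
    """Reject oversized writes before touching storage."""
    if len(data) > MAX_FILE_BYTES:
        raise FileTooLargeError(
            f"{path}: {len(data)} bytes exceeds {MAX_FILE_BYTES}")
    store[path] = data

store = {}
set_contents(store, "/ls/cell/svc/leader", b"host:port")       # fits
try:
    set_contents(store, "/ls/cell/svc/blob", b"x" * (300 * 1024))
except FileTooLargeError:
    print("rejected oversized write")
```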

Warning

This is a common anti-pattern in any shared infrastructure service. If you design a coordination service, enforce size limits from day one. Retroactive limits are painful to impose on established users.

Misuse as publish/subscribe

Teams attempted to use Chubby's event mechanism as a pub/sub system. Because Chubby maintains a strongly consistent cache, every event triggers invalidation across all caching clients -- making it a slow, inefficient message bus. For pub/sub workloads, use Kafka or a dedicated message queue.

Developers assume 100% availability

Teams rarely planned for Chubby outages, assuming it would always be up. When Chubby experienced even brief downtime, dependent services failed catastrophically. Chubby's team eventually intentionally scheduled brief outages to force dependent teams to build resilience -- testing in production by design.

Interview angle

The "planned outage" strategy is worth mentioning in system design interviews. Netflix's Chaos Monkey follows the same philosophy: deliberately inject failures so teams build systems that handle them gracefully. If your design depends on a single coordination service, explain how your application degrades gracefully when that service is briefly unavailable.

Quiz
Google's Chubby team intentionally scheduled brief outages to force dependent teams to build resilience. What would likely happen if they stopped doing this?