How Chubby Works

What happens when a Chubby cell boots up, a client connects for the first time, or an application needs to elect a leader? This section walks through the operational mechanics.

Think first
Multiple application instances need to elect exactly one leader. How would you implement this using a lock service, and what happens if the elected leader crashes?

Service initialization

When a Chubby cell starts:

  1. Replicas run Paxos to elect a master.
  2. The master's identity is persisted in storage and propagated to all replicas.

From this point, the master handles every client request. Replicas participate in consensus but do not serve clients directly.
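The boot sequence above can be sketched in a few lines. This is a toy illustration, not Paxos: every replica "votes" for a deterministic candidate, a majority settles the election, and each replica then records the winner. The names `elect_master` and `persisted` are invented for this sketch.

```python
def elect_master(replica_ids):
    """Toy stand-in for a Paxos round: each replica votes for the
    lowest-numbered replica; a candidate with a majority wins."""
    votes = {}
    for _voter in replica_ids:
        candidate = min(replica_ids)  # deterministic proposal
        votes[candidate] = votes.get(candidate, 0) + 1
    quorum = len(replica_ids) // 2 + 1
    for candidate, count in votes.items():
        if count >= quorum:
            return candidate
    return None  # no majority, no master

replicas = [1, 2, 3, 4, 5]
master = elect_master(replicas)

# Step 2: every replica persists the master's identity.
persisted = {r: master for r in replicas}
```

The quorum check matters: a cell that cannot reach a majority of replicas elects no master and serves no clients, which is how Chubby trades availability for consistency.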

Client initialization

When a Chubby client starts:

  1. The client queries DNS for the list of Chubby replicas in the cell.
  2. The client sends an RPC to any replica.
  3. If that replica is not the master, it returns the master's address.
  4. The client establishes a session with the master and sends all subsequent requests to it -- until the master indicates it is no longer the leader or stops responding.
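The four steps above amount to a contact-any-node, follow-the-redirect loop. Here is a minimal sketch of that flow; the `Replica` class and `find_master` helper are hypothetical stand-ins, not Chubby's client API.

```python
class Replica:
    """A replica that either serves requests (if it is the master)
    or redirects the caller to the master."""
    def __init__(self, name, master_name):
        self.name = name
        self.master_name = master_name

    def handle(self, request):
        if self.name != self.master_name:
            return {"redirect": self.master_name}  # "not me, ask the master"
        return {"ok": True, "served_by": self.name}

def find_master(replicas):
    """Contact any replica (as returned by DNS), follow one redirect."""
    by_name = {r.name: r for r in replicas}
    first = replicas[0]
    reply = first.handle("open-session")
    if "redirect" in reply:
        return by_name[reply["redirect"]]
    return first

cell = [Replica(n, master_name="r2") for n in ("r1", "r2", "r3")]
master = find_master(cell)
# All subsequent requests go to `master` until it stops responding
# or indicates it is no longer the leader.
```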
Interview angle

This client discovery pattern -- contact any node, get redirected to the leader -- appears in many distributed systems (etcd, ZooKeeper, Kafka controller). It avoids a single point of failure for discovery while keeping writes centralized for consistency.

Leader election example

Consider an application with multiple instances that need exactly one leader. Here is how they use Chubby:

  1. All candidates attempt to acquire a Chubby lock on a designated election file.
  2. The first to acquire the lock becomes the leader.
  3. The leader writes its identity into the file so other instances can discover who won.
  4. If the leader crashes and its session lease expires, it loses the lock. Another candidate acquires it and becomes the new leader.

Pseudocode for leader election

This shows how little code is needed to add Chubby-based leader election to an existing application:

/* Create these files in Chubby manually, once.
   Usually 3-5 paths are listed to satisfy the quorum requirement. */
lock_file_paths = {
    "ls/cell1/foo",
    "ls/cell2/foo",
    "ls/cell3/foo",
}

Main() {
    // Initialize the Chubby client library.
    chubbyLeader = newChubbyLeader(lock_file_paths)

    // Establish the client's connection with the Chubby service.
    chubbyLeader.Start()

    // Block until this instance becomes the leader.
    chubbyLeader.Wait()

    // Now the leader.
    Log("Is Leader: " + chubbyLeader.isLeader())

    while (chubbyLeader.renewLeader()) {
        // Do work.
    }
    // Not the leader anymore.
}
Warning

Leader election with locks requires careful handling of the "zombie leader" problem. If an old leader's session expires but it hasn't realized it yet, it may still issue commands. Chubby addresses this with sequencers and lock-delay (covered in Locks, Sequencers, and Lock-delays) -- the same concept as Fencing tokens.
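The fencing idea can be shown concretely. In this hedged sketch, the lock service hands each new leader a monotonically increasing token (Chubby's sequencer plays this role), the leader attaches it to every command, and workers reject anything older than the newest token they have seen. The `Worker` class and method names are illustrative, not Chubby's API.

```python
class Worker:
    """A downstream server that fences off stale leaders by token."""
    def __init__(self):
        self.highest_seen = -1  # newest leadership token observed

    def execute(self, command, token):
        if token < self.highest_seen:
            return "rejected: stale leader"  # zombie leader fenced off
        self.highest_seen = token
        return "executed " + command

w = Worker()
w.execute("write-a", token=33)            # current leader: accepted
w.execute("write-b", token=34)            # new leader takes over
late = w.execute("write-c", token=33)     # zombie leader's late command
# late == "rejected: stale leader"
```

Because tokens only ever increase, a leader whose session expired can never outrank its successor, no matter how delayed its messages are.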

Quiz
In Chubby's leader election, what would happen if a client's network connection drops for 30 seconds but the client process continues running and issuing commands to worker nodes?