Design a Service for Leader Election and Distributed Locking
The problem
You run a distributed system with several components that need coordination:
Problem 1 -- Leader election: You have 5 replicas of your payment processing service. Exactly one must be the "leader" that processes payments at any time. If the leader crashes, another replica must take over within seconds. Two replicas must never both believe they are the leader simultaneously -- double-processing a payment is worse than a brief outage.
Problem 2 -- Distributed locking: Your order fulfillment pipeline has a step that reserves inventory. When two orders for the last item arrive simultaneously at different servers, only one should succeed. You need a distributed lock that prevents concurrent access to the same inventory item.
Problem 3 -- Configuration management: You have 200 microservices that need to read shared configuration (feature flags, rate limits, database connection strings). When you update a config value, all services must see the change within seconds. You want a strongly consistent store -- not "eventually this will propagate."
These are all coordination problems. Building coordination into each service is error-prone and leads to subtle bugs (split-brain, deadlocks, phantom leaders). You need a dedicated coordination service that solves these primitives correctly once so that application developers do not have to.
Key requirements to identify
| Requirement | What it implies |
|---|---|
| Exactly one leader at a time | Need strong consistency -- this is a CP system, not AP |
| Leader failover in seconds | Need session tracking with timeouts and heartbeats |
| No split-brain | Need fencing -- a mechanism to prevent the old leader from acting after failover |
| Distributed locks | Need fencing/sequencer tokens so downstream services can reject requests from stale lock holders |
| Config changes visible quickly | Need an event/watch mechanism so clients are notified of changes |
| Small metadata store | The data is small (KBs to MBs), so replication overhead is acceptable |
The design approach
This is fundamentally different from the systems in the previous problems. Dynamo and Cassandra chose availability over consistency. Here, you need the opposite: consistency over availability. A lock service that sometimes grants the same lock to two clients is worse than useless.
The core of the design is a replicated state machine. You run 5 servers (an odd number for majority quorums). They use a consensus protocol like Paxos or Raft to agree on every state change. A write succeeds only when a majority (3 of 5) of servers have durably logged it. This guarantees that even if 2 servers crash, the remaining 3 have the latest state.
One server is the leader (or master) of the coordination service itself. All reads and writes go through the leader. This simplifies the protocol -- the leader proposes values, and followers accept or reject. If the leader crashes, the remaining servers elect a new leader using the consensus protocol.
For leader election of your application services, the coordination service offers a primitive: acquire a lock on a named path. The payment service replica that successfully acquires the lock at /services/payment/leader is the leader. The lock is tied to a session with a TTL. If the leader crashes (stops sending keepalives), the session expires, the lock is released, and another replica acquires it.
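The lock-plus-session mechanism above can be sketched as a toy in-memory model. This is not a real coordination service API; the class and method names are invented for illustration. It shows the two key behaviors: a lock is owned by a session, and a session that stops sending keepalives expires and releases its locks.

```python
class SessionLockService:
    """Toy in-memory model of session-scoped locks with TTL expiry.
    Real services (Chubby, ZooKeeper) replicate this state via consensus."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.sessions = {}   # session_id -> timestamp of last keepalive
        self.locks = {}      # lock path -> owning session_id

    def keepalive(self, session_id, now):
        self.sessions[session_id] = now

    def _expire_dead_sessions(self, now):
        dead = [s for s, t in self.sessions.items() if now - t > self.ttl]
        for s in dead:
            del self.sessions[s]
            # Release every lock held by the expired session.
            for path, owner in list(self.locks.items()):
                if owner == s:
                    del self.locks[path]

    def try_acquire(self, path, session_id, now):
        self._expire_dead_sessions(now)
        if session_id not in self.sessions:
            return False  # caller must have a live session
        if path not in self.locks:
            self.locks[path] = session_id
            return True
        return self.locks[path] == session_id  # idempotent re-acquire


# Replica A becomes leader, then stops sending keepalives (crash or pause);
# once its session expires, replica B takes over.
svc = SessionLockService(ttl_seconds=5)
svc.keepalive("replica-A", now=0)
svc.keepalive("replica-B", now=0)
assert svc.try_acquire("/services/payment/leader", "replica-A", now=0)
assert not svc.try_acquire("/services/payment/leader", "replica-B", now=0)
svc.keepalive("replica-B", now=6)   # A's session (last seen at 0) is now stale
assert svc.try_acquire("/services/payment/leader", "replica-B", now=6)
```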
But sessions and locks alone are not enough to prevent split-brain. Consider: the leader acquires the lock, then enters a long garbage collection pause. The coordination service thinks the leader is dead (missed keepalives), releases the lock, and another replica becomes leader. The old leader wakes up and still believes it holds the lock. Now you have two leaders.
The solution is fencing tokens (or sequencers). Each lock acquisition returns a monotonically increasing token number. When the leader makes a request to a downstream service (e.g., the database), it includes the token. The downstream service rejects requests with tokens older than the highest it has seen. This is a crucial safety mechanism.
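A minimal sketch of the downstream side of fencing, with invented names: the resource remembers the highest token it has seen and rejects anything older, so a paused-and-resumed old leader cannot overwrite the new leader's work. Equal tokens are accepted so the current holder can retry.

```python
class FencedResource:
    """Toy downstream service (e.g., a storage node) that enforces
    fencing tokens: requests carrying a token lower than the highest
    one seen are rejected as coming from a stale leader."""

    def __init__(self):
        self.highest_token = -1
        self.writes = []

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale leader: reject
        self.highest_token = token
        self.writes.append(value)
        return True


resource = FencedResource()
assert resource.write(33, "write from leader #33")
assert resource.write(34, "write from new leader #34")
# The old leader wakes up from its GC pause and tries to write:
assert not resource.write(33, "stale write")
```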
How the industry solved it
Two systems define this space: Google's Chubby and Apache ZooKeeper. They take slightly different approaches to the same fundamental problem.
Google Chubby
Chubby is Google's distributed lock service. Internally, it powers leader election for GFS, BigTable, and many other Google systems. It exposes a file-system-like API where "files" can be used as locks.
Start here: Chubby Introduction
Key design decisions:
- Coarse-grained locks: Chubby is designed for locks held for hours or days (e.g., "who is the GFS master"), not fine-grained per-request locks. See Design Rationale.
- Sessions and KeepAlives: Each client maintains a session with periodic KeepAlive RPCs. If KeepAlives stop, the session enters a grace period before expiring. See Sessions and Events.
- Sequencers for fencing: When acquiring a lock, the client receives a sequencer that it passes to downstream services. See Locks, Sequencers, and Lock Delays.
- Event notifications: Clients can subscribe to events on files (lock acquired, content changed, master failover). See Master Election and Events.
- Client-side caching: Chubby aggressively caches data on the client and uses invalidations to maintain consistency. See Caching.
Deep dive: How Chubby Works | Files, Directories, and Handles
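The event-notification mechanism above can be sketched with a toy store (names invented for illustration). Watches here are one-shot, as in ZooKeeper: a watch fires once on the next change and must be re-registered, which keeps server-side state bounded.

```python
class WatchableStore:
    """Toy key-value store with one-shot change notifications,
    loosely in the style of Chubby events / ZooKeeper watches."""

    def __init__(self):
        self.data = {}
        self.watches = {}  # path -> list of callbacks (fired once, then cleared)

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def set(self, path, value):
        self.data[path] = value
        # Fire and clear: clients must re-register after each event.
        for cb in self.watches.pop(path, []):
            cb(path, value)


# A service watches a config path; the first update triggers the callback,
# the second does not because the one-shot watch was consumed.
events = []
store = WatchableStore()
store.watch("/config/rate_limit", lambda p, v: events.append((p, v)))
store.set("/config/rate_limit", "100")
store.set("/config/rate_limit", "200")
assert events == [("/config/rate_limit", "100")]
```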
Apache ZooKeeper
ZooKeeper is the open-source coordination service used by Kafka, HBase, HDFS (for HA failover), and many other distributed systems. It provides a hierarchical namespace (like a file system) where nodes can store small amounts of data and watch for changes.
ZooKeeper is referenced throughout this course:
- Kafka uses it for broker coordination and controller election
- HDFS uses it for NameNode failover
- BigTable relies on Chubby, the service ZooKeeper is modeled after: GFS and Chubby dependencies
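ZooKeeper's standard leader-election recipe uses ephemeral sequential znodes: each candidate creates a numbered node under an election path, and the candidate holding the lowest number is the leader. The sketch below is a toy simulation of that recipe (not the ZooKeeper API); ephemeral nodes vanish when their session dies, which is what triggers failover.

```python
class SequentialElection:
    """Toy model of ZooKeeper-style election with ephemeral sequential
    nodes: lowest sequence number wins; a dead session's nodes vanish."""

    def __init__(self):
        self.counter = 0
        self.nodes = {}  # sequence number -> candidate id

    def join(self, candidate):
        seq = self.counter          # ZooKeeper assigns this server-side
        self.counter += 1
        self.nodes[seq] = candidate
        return seq

    def leader(self):
        return self.nodes[min(self.nodes)] if self.nodes else None

    def session_expired(self, candidate):
        # Ephemeral nodes are deleted when their owning session expires.
        self.nodes = {s: c for s, c in self.nodes.items() if c != candidate}


election = SequentialElection()
election.join("A")
election.join("B")
election.join("C")
assert election.leader() == "A"
election.session_expired("A")   # A crashes; B holds the next-lowest number
assert election.leader() == "B"
```

A nice property of this recipe is that each candidate only needs to watch the node immediately before its own, avoiding a "herd effect" where every candidate wakes up on every failover.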
The consensus protocol underneath
Both Chubby and ZooKeeper are powered by consensus protocols. Chubby uses Paxos; ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), which is similar to Raft. Understanding that these services are replicated state machines running consensus is key to understanding their consistency guarantees and performance characteristics.
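The majority-commit rule at the heart of these protocols can be sketched from the leader's point of view (a toy model, not Paxos or ZAB themselves): an entry is committed only once a majority of the cluster has acknowledged durably logging it.

```python
class ReplicatedLog:
    """Toy leader-side view of a replicated log. An entry is committed
    only when a majority of the cluster (3 of 5) has acknowledged it,
    so any majority of surviving servers contains all committed entries."""

    def __init__(self, cluster_size):
        self.majority = cluster_size // 2 + 1
        self.entries = []        # proposed values, in log order
        self.acks = []           # per-entry set of servers that logged it
        self.commit_index = -1   # highest committed entry

    def propose(self, value):
        self.entries.append(value)
        self.acks.append(set())
        return len(self.entries) - 1

    def ack(self, index, server):
        self.acks[index].add(server)
        # Advance the commit index over every entry with majority acks.
        while (self.commit_index + 1 < len(self.entries)
               and len(self.acks[self.commit_index + 1]) >= self.majority):
            self.commit_index += 1


log = ReplicatedLog(cluster_size=5)
i = log.propose("set /config/flag on")
log.ack(i, "s1")
log.ack(i, "s2")
assert log.commit_index == -1   # only 2 of 5 have logged it
log.ack(i, "s3")
assert log.commit_index == 0    # majority reached: committed
```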
Key patterns used
| Pattern | Why it is needed | Reference |
|---|---|---|
| Leader and Follower | All writes go through a single leader for ordering; followers replicate | Pattern |
| Lease | Time-limited authority that automatically expires if not renewed | Pattern |
| Fencing | Prevent a stale leader from performing actions after failover | Pattern |
| Split Brain | Detect and prevent multiple leaders operating simultaneously | Pattern |
| Write-Ahead Log | Durably log every state change before applying it | Pattern |
| Heartbeat | Detect client/session failures via KeepAlive messages | Pattern |
When NOT to use a coordination service
Coordination services are CP systems -- they sacrifice availability during network partitions. If a majority of the coordination servers are unreachable, the service stops accepting writes. This is the correct tradeoff for locks and leader election (a lock that works during partitions is broken by definition), but it means you should NOT use a coordination service as a general-purpose database or configuration store for data that can tolerate stale reads.