Design a Service for Leader Election and Distributed Locking

The problem

You run a distributed system with several components that need coordination:

Problem 1 -- Leader election: You have 5 replicas of your payment processing service. Exactly one must be the "leader" that processes payments at any time. If the leader crashes, another replica must take over within seconds. Two replicas must never both believe they are the leader simultaneously -- double-processing a payment is worse than a brief outage.

Problem 2 -- Distributed locking: Your order fulfillment pipeline has a step that reserves inventory. When two orders for the last item arrive simultaneously at different servers, only one should succeed. You need a distributed lock that prevents concurrent access to the same inventory item.

Problem 3 -- Configuration management: You have 200 microservices that need to read shared configuration (feature flags, rate limits, database connection strings). When you update a config value, all services must see the change within seconds. You want a strongly consistent store -- not "eventually this will propagate."

These are all coordination problems. Building coordination into each service is error-prone and leads to subtle bugs (split-brain, deadlocks, phantom leaders). You need a dedicated coordination service that solves these primitives correctly once so that application developers do not have to.

Key requirements to identify

| Requirement | What it implies |
| --- | --- |
| Exactly one leader at a time | Need strong consistency -- this is a CP system, not AP |
| Leader failover in seconds | Need session tracking with timeouts and heartbeats |
| No split-brain | Need fencing -- a mechanism to prevent the old leader from acting after failover |
| Distributed locks | Need sequencer tokens to detect stale locks after network delays |
| Config changes visible quickly | Need an event/watch mechanism so clients are notified of changes |
| Small metadata store | The data is small (KBs to MBs), so replication overhead is acceptable |

The design approach

This is fundamentally different from the systems in the previous problems. Dynamo and Cassandra chose availability over consistency. Here, you need the opposite: consistency over availability. A lock service that sometimes grants the same lock to two clients is worse than useless.

The core of the design is a replicated state machine. You run 5 servers (an odd number for majority quorums). They use a consensus protocol like Paxos or Raft to agree on every state change. A write succeeds only when a majority (3 of 5) of servers have durably logged it. This guarantees that even if 2 servers crash, the remaining 3 have the latest state.
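The quorum arithmetic above is simple enough to write down directly. A small illustration (not tied to any particular consensus library):

```python
def majority(n: int) -> int:
    """Smallest number of servers that forms a strict majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many servers can crash while a majority remains reachable."""
    return n - majority(n)

# With 5 servers, a write must be durably logged on 3 of them,
# so the cluster survives 2 simultaneous crashes.
print(majority(5), tolerated_failures(5))   # 3 2

# A 6th server does not buy extra fault tolerance (4 of 6 needed,
# still only 2 crashes tolerated) -- one reason clusters use odd sizes.
print(majority(6), tolerated_failures(6))   # 4 2
```

This also explains why any two quorums intersect: two majorities of the same 5 servers must share at least one server, which is what carries the latest state across leader changes.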

One server is the leader (or master) of the coordination service itself. All reads and writes go through the leader. This simplifies the protocol -- the leader proposes values, and followers accept or reject. If the leader crashes, the remaining servers elect a new leader using the consensus protocol.

For leader election of your application services, the coordination service offers a primitive: acquire a lock on a named path. The payment service replica that successfully acquires the lock at /services/payment/leader is the leader. The lock is tied to a session with a TTL. If the leader crashes (stops sending keepalives), the session expires, the lock is released, and another replica acquires it.

But sessions and locks alone are not enough to prevent split-brain. Consider: the leader acquires the lock, then enters a long garbage collection pause. The coordination service thinks the leader is dead (missed keepalives), releases the lock, and another replica becomes leader. The old leader wakes up and still believes it holds the lock. Now you have two leaders.

The solution is fencing tokens (or sequencers). Each lock acquisition returns a monotonically increasing token number. When the leader makes a request to a downstream service (e.g., the database), it includes the token. The downstream service rejects requests with tokens older than the highest it has seen. This is a crucial safety mechanism.
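The downstream-side fencing check can be sketched in a few lines (an illustrative model; in practice the token comes back with the lock grant, and the downstream store must persist the highest token it has accepted):

```python
class LockServiceWithTokens:
    """Hands out a monotonically increasing token with every lock grant."""
    def __init__(self):
        self.next_token = 0

    def grant(self) -> int:
        self.next_token += 1
        return self.next_token

class FencedDatabase:
    """Downstream service that rejects requests carrying a stale token."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        if token < self.highest_token:
            return False          # stale leader: fence it off
        self.highest_token = token
        self.data[key] = value
        return True

locks = LockServiceWithTokens()
db = FencedDatabase()
old = locks.grant()   # leader 1 acquires the lock, then pauses (e.g. GC)
new = locks.grant()   # lease expires; leader 2 acquires a higher token
assert db.write(new, "payment:42", "processed")   # new leader succeeds
assert not db.write(old, "payment:42", "again")   # old leader is rejected
```

The key property is that fencing does not depend on anyone's clock: the old leader is rejected purely because its token number is smaller, no matter how long its pause lasted.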

Think first
Before reading the solution, think about why a lease (time-limited lock) is not sufficient to prevent split-brain on its own. If the leader holds a 30-second lease, and its clock runs 5 seconds fast compared to the lock server, what happens?

How the industry solved it

Two systems define this space: Google's Chubby and Apache ZooKeeper. They take slightly different approaches to the same fundamental problem.

Google Chubby

Chubby is Google's distributed lock service. Internally, it powers leader election for GFS, BigTable, and many other Google systems. It exposes a file-system-like API where "files" can be used as locks.

Start here: Chubby Introduction

Key design decisions:

  • Coarse-grained locks: Chubby is designed for locks held for hours or days (e.g., "who is the GFS master"), not fine-grained per-request locks. See Design Rationale.
  • Sessions and KeepAlives: Each client maintains a session with periodic KeepAlive RPCs. If KeepAlives stop, the session enters a grace period before expiring. See Sessions and Events.
  • Sequencers for fencing: When acquiring a lock, the client receives a sequencer that it passes to downstream services. See Locks, Sequencers, and Lock Delays.
  • Event notifications: Clients can subscribe to events on files (lock acquired, content changed, master failover). See Master Election and Events.
  • Client-side caching: Chubby aggressively caches data on the client and uses invalidations to maintain consistency. See Caching.

Deep dive: How Chubby Works | Files, Directories, and Handles
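The session lifecycle from the KeepAlive bullet above can be modeled as a small state machine: a session is alive while KeepAlives arrive, enters "jeopardy" during the grace period, and only then expires. The timings below are illustrative, not Chubby's exact defaults:

```python
class Session:
    """Toy model of a Chubby-style client session."""

    def __init__(self, lease_s: float, grace_s: float, clock):
        self.lease, self.grace, self.clock = lease_s, grace_s, clock
        self.last_keepalive = clock()

    def keepalive(self):
        self.last_keepalive = self.clock()

    def state(self) -> str:
        silent = self.clock() - self.last_keepalive
        if silent < self.lease:
            return "alive"
        if silent < self.lease + self.grace:
            # Client blocks API calls and tries to reach a (new) master.
            return "jeopardy"
        # Session gone: ephemeral state and locks held by it are released.
        return "expired"

now = [0.0]
s = Session(lease_s=12, grace_s=45, clock=lambda: now[0])
assert s.state() == "alive"
now[0] = 20
assert s.state() == "jeopardy"   # missed keepalives, not yet expired
now[0] = 60
assert s.state() == "expired"
```

The grace period exists so that a brief master failover does not spuriously kill every client session; clients pause rather than fail during jeopardy.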

Apache ZooKeeper

ZooKeeper is the open-source coordination service used by Kafka, HBase, HDFS (for HA failover), and many other distributed systems. It provides a hierarchical namespace (like a file system) where nodes can store small amounts of data and watch for changes.
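ZooKeeper's watches are one-shot notifications: a client registers a watch when it reads a node and is notified once when that node next changes, after which it must re-read and re-register. A tiny in-memory sketch of the idea (not the real ZooKeeper API):

```python
class WatchableStore:
    """In-memory znode-like store with ZooKeeper-style one-shot watches."""

    def __init__(self):
        self.data = {}
        self.watches = {}   # path -> list of callbacks, each fired once

    def get(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        for cb in self.watches.pop(path, []):   # fire and clear: one-shot
            cb(path, value)

events = []
store = WatchableStore()
store.set("/config/rate_limit", "100")
store.get("/config/rate_limit", watch=lambda p, v: events.append((p, v)))
store.set("/config/rate_limit", "200")   # watcher fires exactly once
store.set("/config/rate_limit", "300")   # no watch registered any more
assert events == [("/config/rate_limit", "200")]
```

One-shot semantics keep the server's bookkeeping cheap and make the contract explicit: a watch tells you *that* something changed, and the client re-reads to learn the current value.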

ZooKeeper is referenced throughout this course:

The consensus protocol underneath

Both Chubby and ZooKeeper are powered by consensus protocols. Chubby uses Paxos; ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), which is similar to Raft. Understanding that these services are replicated state machines running consensus is key to understanding their consistency guarantees and performance characteristics.

Key patterns used

| Pattern | Why it is needed | Reference |
| --- | --- | --- |
| Leader and Follower | All writes go through a single leader for ordering; followers replicate | Pattern |
| Lease | Time-limited authority that automatically expires if not renewed | Pattern |
| Fencing | Prevent a stale leader from performing actions after failover | Pattern |
| Split Brain | Detect and prevent multiple leaders operating simultaneously | Pattern |
| Write-Ahead Log | Durably log every state change before applying it | Pattern |
| Heartbeat | Detect client/session failures via KeepAlive messages | Pattern |

When NOT to use a coordination service

Coordination services are CP systems -- they sacrifice availability during network partitions. If a majority of the coordination servers are unreachable, the service stops accepting writes. This is the correct tradeoff for locks and leader election (a lock that works during partitions is broken by definition), but it means you should NOT use a coordination service as a general-purpose database or configuration store for data that can tolerate stale reads.

Design Challenge

Variation: Design a distributed rate limiter

You need a distributed rate limiter that enforces 'no more than 100 API requests per user per minute' across 20 API gateway servers. The rate limiter must be accurate (not approximate) and must not become a bottleneck for request processing (sub-millisecond overhead).