
Design a Key-Value Store That Never Says No

The problem

It is Black Friday. Your e-commerce platform processes millions of shopping-cart operations per hour. Customers add items, remove items, and merge carts across devices -- all at the same time. A single datacenter cannot handle the load, so you run servers in three regions. Network links between those regions occasionally drop, and individual machines crash several times a day at this scale.

Your product team has one non-negotiable requirement: the cart must always accept writes. A customer who clicks "Add to Cart" and sees an error message is a lost sale. Returning a slightly stale cart is acceptable -- losing a write is not. The team can tolerate brief inconsistencies as long as they resolve eventually.

Here are the concrete numbers you must design for:

  • 100,000+ operations per second (reads and writes combined)
  • Sub-10 ms latency at the 99th percentile
  • Horizontal scalability -- adding nodes should increase throughput linearly
  • No single point of failure -- the system keeps working when any node goes down
  • Eventual consistency is acceptable; strong consistency is NOT required

Key requirements to identify

  • Always-writable -- choose AP under the CAP theorem: sacrifice consistency for availability
  • No single point of failure -- fully decentralized architecture with no leader node
  • Horizontal scalability -- a partitioning scheme that distributes data evenly and handles node additions gracefully
  • Sub-10 ms latency -- reads and writes should contact only a small subset of nodes, not all of them
  • Eventual consistency -- a conflict-resolution mechanism for concurrent writes
  • Multi-region -- replication across failure domains with tolerance for temporary node unavailability

The design approach

Start by thinking about data placement. You have N nodes and millions of keys. How do you decide which node owns which key? Naive hashing (key % N) breaks catastrophically when you add or remove a node -- nearly every key remaps. You need a scheme where adding a node moves only a fraction of the keys.
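To see how badly key % N behaves, here is a minimal Python sketch (the node_for helper and key names are illustrative, not part of any real system) that counts how many keys change owners when a cluster grows from 10 to 11 nodes:

```python
import hashlib

def node_for(key, n):
    # map a key to one of n nodes by hashing it, then taking the remainder
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"cart:{i}" for i in range(10_000)]

# Going from 10 nodes to 11 remaps almost every key (roughly 90%),
# because h % 10 and h % 11 rarely agree.
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"naive hashing: {moved / len(keys):.0%} of keys moved")
```

With consistent hashing, the same membership change would move only about 1/N of the keys.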

Next, consider replication. If a key lives on node A, you probably want copies on nodes B and C. But what happens when node A goes down during a write? You could reject the write (sacrificing availability) or accept it on a different node and reconcile later (sacrificing consistency). The requirements tell you to accept it.

Now the hard part: two clients update the same cart on two different nodes simultaneously. Neither write is "wrong." You need a way to detect that a conflict happened and merge the results. Timestamps alone are unreliable across machines. You need a data structure that captures causal history -- which updates happened before which.

Finally, consider failure recovery. When a node comes back online after a crash, how does it catch up on the writes it missed? And how do you detect that two replicas have silently diverged over time?

Think first
Before reading the solution, try to sketch your own design. How would you partition keys across nodes? How would you replicate data? What data structure would you use to detect conflicting writes?

How the industry solved it

Amazon built Dynamo to solve exactly this problem. The core insight: for a shopping cart, availability is worth more than consistency. A customer who sees a slightly outdated cart can still check out; a customer who gets an error cannot.

Dynamo's architecture is fully decentralized -- every node is identical, there is no leader, and any node can accept reads or writes for any key. This is the foundation of its availability guarantee.

Read the full Dynamo deep-dive starting from Dynamo Introduction.

Data partitioning with consistent hashing

Dynamo places keys on a hash ring. Each node owns a range of the ring. When a node joins, it takes over a portion of its neighbors' ranges -- only those keys move. Virtual nodes (vnodes) ensure an even distribution even when physical machines have different capacities.
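A minimal ring can be sketched in a few lines of Python. This is an illustrative toy, not Dynamo's implementation -- the HashRing class, vnode count, and node names are all assumptions for the example:

```python
import bisect, hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # each physical node owns many small slices via its virtual nodes
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def get_node(self, key):
        # walk clockwise to the first vnode at or after the key's hash
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(k) for k in (f"cart:{i}" for i in range(1000))}
ring.add_node("node-d")
moved = sum(1 for k, n in before.items() if ring.get_node(k) != n)
print(f"{moved / len(before):.0%} of keys moved")  # roughly 1/4, not ~100%
```

Note that every key that moves lands on the new node -- existing nodes never shuffle keys among themselves.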

Deep dive: Data Partitioning and the Consistent Hashing pattern

Replication and sloppy quorum

Each key is replicated on N nodes (typically 3). Reads and writes use a quorum protocol: a write succeeds when W replicas acknowledge, a read succeeds when R replicas respond. By tuning W and R (e.g., W=1 for maximum write availability), you control the consistency-availability tradeoff.
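The quorum logic can be sketched as follows. This is a simplified single-process model (the Replica class, version numbers, and last-write-wins tie-breaking are assumptions for illustration; Dynamo itself uses vector clocks rather than scalar versions):

```python
N, W, R = 3, 2, 2  # replication factor and quorum sizes; R + W > N
                   # guarantees read and write sets overlap on healthy replicas

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)
        self.up = True

    def put(self, key, version, value):
        if not self.up:
            raise ConnectionError("replica down")
        # keep the newer version (last-write-wins for this sketch)
        if key not in self.store or self.store[key][0] < version:
            self.store[key] = (version, value)

    def get(self, key):
        if not self.up:
            raise ConnectionError("replica down")
        return self.store.get(key)

def quorum_put(replicas, key, version, value):
    acks = 0
    for r in replicas:
        try:
            r.put(key, version, value)
            acks += 1
        except ConnectionError:
            pass
    return acks >= W  # write succeeds once W replicas acknowledge

def quorum_get(replicas, key):
    responses = []
    for r in replicas:
        try:
            v = r.get(key)
            if v is not None:
                responses.append(v)
        except ConnectionError:
            pass
    if len(responses) < R:
        raise RuntimeError("quorum not reached")
    return max(responses)[1]  # newest version wins

replicas = [Replica() for _ in range(N)]
quorum_put(replicas, "cart:42", 1, ["socks"])
replicas[0].up = False                                # one replica fails
quorum_put(replicas, "cart:42", 2, ["socks", "mug"])  # still succeeds: W=2
print(quorum_get(replicas, "cart:42"))                # ['socks', 'mug']
```

Lowering W to 1 would keep writes succeeding even with two replicas down, at the cost of reads possibly missing the latest value.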

Deep dive: Replication and the Quorum pattern

Conflict detection with vector clocks

When two nodes accept writes to the same key concurrently, Dynamo uses vector clocks to detect the conflict. Each write carries a vector of logical timestamps. During a read, if Dynamo finds conflicting versions, it returns all of them to the client for reconciliation (e.g., merging two shopping carts by taking the union of items).
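The comparison rule is simple enough to sketch directly: clock a supersedes clock b if a has seen every update b has seen; if neither supersedes the other, the writes were concurrent. A minimal Python sketch (function names are illustrative):

```python
def increment(clock, node):
    """Return a new vector clock with node's counter bumped."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a, b):
    """True if clock a has seen everything clock b has seen (a >= b)."""
    return all(a.get(n, 0) >= cnt for n, cnt in b.items())

def concurrent(a, b):
    # neither clock descends from the other -> conflicting writes
    return not descends(a, b) and not descends(b, a)

# A cart is written through node A, then updated on two nodes independently.
base = increment({}, "A")   # {'A': 1}
v1 = increment(base, "A")   # another update routed through A: {'A': 2}
v2 = increment(base, "B")   # concurrent update through B: {'A': 1, 'B': 1}

print(concurrent(v1, v2))   # True  -> client must reconcile (e.g., union carts)
print(descends(v1, base))   # True  -> v1 simply supersedes base, no conflict
```

Only genuinely concurrent versions are surfaced to the client; a version that causally descends from another silently replaces it.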

Deep dive: Vector Clocks and Conflicting Data and the Vector Clocks pattern

Failure handling

  • Hinted handoff: When a replica is down, another node temporarily accepts writes on its behalf and forwards them when it recovers. See the Hinted Handoff pattern.
  • Merkle trees: Background anti-entropy compares hash trees between replicas to detect and repair divergence. See the Merkle Trees pattern.
  • Gossip protocol: Nodes share membership and failure information through periodic randomized communication. See the Gossip Protocol pattern.
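Hinted handoff, the first mechanism above, can be sketched in miniature. This is a toy single-process model (the Node class and write/deliver_hints helpers are assumptions for illustration):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> value
        self.hints = []  # (intended_node, key, value) held for unreachable peers
        self.up = True

def write(preferred, fallback, key, value):
    """Write to the preferred replica, or park the write on a fallback with a hint."""
    if preferred.up:
        preferred.store[key] = value
    else:
        fallback.store[key] = value                     # stay available
        fallback.hints.append((preferred, key, value))  # remember the real owner

def deliver_hints(node):
    """Forward any writes this node held on behalf of recovered peers."""
    remaining = []
    for target, key, value in node.hints:
        if target.up:
            target.store[key] = value
        else:
            remaining.append((target, key, value))
    node.hints = remaining

a, b = Node("A"), Node("B")
a.up = False
write(a, b, "cart:7", ["socks"])  # A is down; B accepts the write plus a hint
a.up = True
deliver_hints(b)                  # B forwards the parked write to A
print(a.store)                    # {'cart:7': ['socks']}
```

The hint is what distinguishes a sloppy quorum from a strict one: the write lands on a node outside the key's preference list, but the system remembers where it ultimately belongs.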

Key patterns used

  • Consistent Hashing -- partition data across nodes with minimal redistribution on membership changes
  • Quorum -- tunable consistency; balance availability vs. consistency per operation
  • Vector Clocks -- detect causal ordering and conflicts between concurrent writes
  • Hinted Handoff -- maintain write availability when a replica is temporarily unreachable
  • Gossip Protocol -- decentralized failure detection and membership management
  • Merkle Trees -- efficient background synchronization between diverged replicas
  • Read Repair -- fix stale replicas opportunistically during read operations

Connections to the CAP and PACELC theorems

Dynamo is a textbook AP system. Under the CAP theorem, it chooses availability over consistency during partitions. Under the PACELC theorem, it trades consistency for lower latency even when there is no partition (PA/EL).

Design Challenge

Variation: Design a distributed session store

You need a distributed session store for a web app with 50 million active users. Sessions expire after 30 minutes of inactivity. The system handles 100K writes/second across 3 data centers. Sessions must be writable during network partitions, but stale reads are acceptable.