Design a Key-Value Store That Never Says No
The problem
It is Black Friday. Your e-commerce platform processes millions of shopping-cart operations per hour. Customers add items, remove items, and merge carts across devices -- all at the same time. A single datacenter cannot handle the load, so you run servers in three regions. Network links between those regions occasionally drop, and individual machines crash several times a day at this scale.
Your product team has one non-negotiable requirement: the cart must always accept writes. A customer who clicks "Add to Cart" and sees an error message is a lost sale. Returning a slightly stale cart is acceptable -- losing a write is not. The team can tolerate brief inconsistencies as long as they resolve eventually.
Here are the concrete numbers you must design for:
- 100,000+ operations per second (reads and writes combined)
- Sub-10 ms latency at the 99th percentile
- Horizontal scalability -- adding nodes should increase throughput linearly
- No single point of failure -- the system keeps working when any node goes down
- Eventual consistency is acceptable; strong consistency is NOT required
Key requirements to identify
| Requirement | What it implies |
|---|---|
| Always-writable | Choose AP in the CAP theorem -- sacrifice consistency for availability |
| No single point of failure | Fully decentralized architecture -- no leader node |
| Horizontal scalability | Need a partitioning scheme that distributes data evenly and handles node additions gracefully |
| Sub-10 ms latency | Reads/writes should contact only a small subset of nodes, not all of them |
| Eventual consistency | Need a conflict resolution mechanism for concurrent writes |
| Multi-region | Replication across failure domains with tolerance for temporary node unavailability |
The design approach
Start by thinking about data placement. You have N nodes and millions of keys. How do you decide which node owns which key? Naive modulo hashing (`hash(key) % N`) breaks catastrophically when you add or remove a node -- nearly every key remaps to a different owner. You need a scheme where adding a node moves only a small fraction of the keys.
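To see how badly modulo placement behaves, the sketch below (illustrative, not Dynamo's code) measures how many keys change owners when a cluster grows from 4 to 5 nodes; the key names and cluster sizes are assumptions.

```python
# Sketch: fraction of keys that remap under naive hash(key) % N placement
# when one node is added. Uses md5 for a stable, unsalted hash.
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"cart:{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys remap when going from 4 to 5 nodes")
# Roughly 80% move: under hash % N, only keys where the hash agrees
# modulo both N and N+1 stay put.
```

This is exactly the failure mode consistent hashing avoids: with N -> N+1 nodes, only about 1/(N+1) of the keys should move.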
Next, consider replication. If a key lives on node A, you probably want copies on nodes B and C. But what happens when node A goes down during a write? You could reject the write (sacrificing availability) or accept it on a different node and reconcile later (sacrificing consistency). The requirements tell you to accept it.
Now the hard part: two clients update the same cart on two different nodes simultaneously. Neither write is "wrong." You need a way to detect that a conflict happened and merge the results. Timestamps alone are unreliable across machines. You need a data structure that captures causal history -- which updates happened before which.
Finally, consider failure recovery. When a node comes back online after a crash, how does it catch up on the writes it missed? And how do you detect that two replicas have silently diverged over time?
How the industry solved it
Amazon built Dynamo to solve exactly this problem. The core insight: for a shopping cart, availability is worth more than consistency. A customer who sees a slightly outdated cart can still check out; a customer who gets an error cannot.
Dynamo's architecture is fully decentralized -- every node is identical, there is no leader, and any node can accept reads or writes for any key. This is the foundation of its availability guarantee.
Read the full Dynamo deep-dive starting from Dynamo Introduction.
Data partitioning with consistent hashing
Dynamo places keys on a hash ring. Each node owns a range of the ring. When a node joins, it takes over a portion of its neighbors' ranges -- only those keys move. Virtual nodes (vnodes) ensure an even distribution even when physical machines have different capacities.
Deep dive: Data Partitioning and the Consistent Hashing pattern
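The ring can be sketched in a few lines. This is a minimal consistent-hash ring with virtual nodes, not Dynamo's actual implementation; the node names, vnode count, and key format are illustrative assumptions.

```python
# Minimal consistent-hash ring with virtual nodes (vnodes).
# Each physical node is hashed to many positions on the ring;
# a key belongs to the first vnode clockwise from its hash.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node, vnodes_per_node)

    def add_node(self, node, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        pos = _hash(key)
        idx = bisect.bisect(self._ring, (pos, ""))  # first vnode clockwise
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner_before = {f"cart:{i}": ring.node_for(f"cart:{i}") for i in range(1000)}
ring.add_node("node-d")
moved = sum(ring.node_for(k) != v for k, v in owner_before.items())
print(f"{moved} of 1000 keys moved after adding node-d")
```

Adding `node-d` moves only roughly a quarter of the keys -- the ranges it takes over from its ring neighbors -- instead of nearly all of them as with modulo hashing.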
Replication and sloppy quorum
Each key is replicated on N nodes (typically N=3). Reads and writes use a quorum protocol: a write succeeds when W replicas acknowledge, and a read succeeds when R replicas respond. When R + W > N, the read and write sets must overlap, so every read sees at least one replica holding the latest acknowledged write. Tuning W and R (e.g., W=1 for maximum write availability) controls the consistency-availability tradeoff per operation.
Deep dive: Replication and the Quorum pattern
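The quorum accounting can be sketched as follows. This is a toy in-memory model, not Dynamo's API: replicas are plain dicts, the "RPC" is a direct assignment, and the N/R/W values are assumptions chosen so that R + W > N.

```python
# Toy quorum sketch: a write succeeds once W replicas ack; a read
# collects R responses and returns the highest-versioned value seen.
N, R, W = 3, 2, 2  # R + W > N => read and write quorums overlap

def write(replicas, key, value, version):
    acks = 0
    for replica in replicas:
        replica[key] = (version, value)  # in reality an RPC that may fail
        acks += 1
        if acks >= W:
            return True  # quorum reached; remaining replicas catch up async
    return False

def read(replicas, key):
    responses = [r[key] for r in replicas[:R] if key in r]
    return max(responses) if responses else None  # newest version wins

replicas = [{}, {}, {}]
write(replicas, "cart:42", ["book"], version=1)
print(read(replicas, "cart:42"))  # (1, ['book'])
```

With W=1 instead, a write returns after a single ack -- maximally available, but a subsequent R=1 read may miss it, which is exactly the tradeoff the table above describes.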
Conflict detection with vector clocks
When two nodes accept writes to the same key concurrently, Dynamo uses vector clocks to detect the conflict. Each write carries a vector of logical timestamps. During a read, if Dynamo finds conflicting versions, it returns all of them to the client for reconciliation (e.g., merging two shopping carts by taking the union of items).
Deep dive: Vector Clocks and Conflicting Data and the Vector Clocks pattern
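The causality check behind this is small enough to sketch directly. A minimal vector clock, with clocks as plain dicts mapping node name to counter (the node names below are illustrative): each node increments its own counter on a write, and two versions conflict when neither clock dominates the other.

```python
# Minimal vector clock sketch: {node: counter} dicts.
def increment(clock: dict, node: str) -> dict:
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a: dict, b: dict) -> bool:
    # True if 'a' has seen everything in 'b' (b happened before, or equals, a).
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflict(a: dict, b: dict) -> bool:
    # Concurrent writes: neither version descends from the other.
    return not descends(a, b) and not descends(b, a)

base = increment({}, "node-a")   # original cart version: {'node-a': 1}
v1 = increment(base, "node-a")   # client 1 updates via node-a
v2 = increment(base, "node-b")   # client 2 updates via node-b concurrently
print(conflict(v1, v2))    # True  -- concurrent, must be merged by the client
print(conflict(v1, base))  # False -- v1 simply supersedes base
```

On a read that finds `v1` and `v2`, the store returns both; the client merges them (e.g., a union of cart items) and writes back a version whose clock descends from both.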
Failure handling
- Hinted handoff: When a replica is down, another node temporarily accepts writes on its behalf and forwards them when it recovers. See the Hinted Handoff pattern.
- Merkle trees: Background anti-entropy compares hash trees between replicas to detect and repair divergence. See the Merkle Trees pattern.
- Gossip protocol: Nodes share membership and failure information through periodic randomized communication. See the Gossip Protocol pattern.
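Hinted handoff, the first mechanism above, can be sketched in miniature. This toy model (not Dynamo's implementation; node names and the API are assumptions) shows a stand-in node storing a write plus a "hint" naming the intended owner, then replaying it when the owner recovers.

```python
# Toy hinted-handoff sketch: when the home replica is down, a fallback
# node accepts the write and records a hint; hints are replayed later.
class Node:
    def __init__(self, name):
        self.name, self.up, self.data, self.hints = name, True, {}, []

def write(key, value, home: Node, fallback: Node):
    if home.up:
        home.data[key] = value
    else:
        fallback.data[key] = value                  # stay available
        fallback.hints.append((home, key, value))   # remember the real owner

def deliver_hints(node: Node):
    remaining = []
    for home, key, value in node.hints:
        if home.up:
            home.data[key] = value  # hand the write back to its home replica
        else:
            remaining.append((home, key, value))
    node.hints = remaining

a, b = Node("node-a"), Node("node-b")
a.up = False
write("cart:7", ["shoes"], home=a, fallback=b)  # node-a is down; node-b covers
a.up = True
deliver_hints(b)
print(a.data)  # {'cart:7': ['shoes']}
```

The write never fails from the client's point of view, which is the always-writable guarantee the requirements demand; Merkle-tree anti-entropy then catches any divergence hinted handoff misses.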
Key patterns used
| Pattern | Why it is needed | Reference |
|---|---|---|
| Consistent Hashing | Partition data across nodes with minimal redistribution on membership changes | Pattern |
| Quorum | Tunable consistency -- balance availability vs. consistency per operation | Pattern |
| Vector Clocks | Detect causal ordering and conflicts between concurrent writes | Pattern |
| Hinted Handoff | Maintain write availability when a replica is temporarily unreachable | Pattern |
| Gossip Protocol | Decentralized failure detection and membership management | Pattern |
| Merkle Trees | Efficient background synchronization between diverged replicas | Pattern |
| Read Repair | Fix stale replicas opportunistically during read operations | Pattern |
Connections to the CAP and PACELC theorems
Dynamo is a textbook AP system. Under the CAP theorem, it chooses availability over consistency during partitions. Under the PACELC theorem, it trades consistency for lower latency even when there is no partition (PA/EL).