Design a Key-Value Store That Never Says No
The problem
It is Black Friday. Your e-commerce platform processes millions of shopping-cart operations per hour. Customers add items, remove items, and merge carts across devices -- all at the same time. A single datacenter cannot handle the load, so you run servers in three regions. Network links between those regions occasionally drop, and individual machines crash several times a day at this scale.
Your product team has one non-negotiable requirement: the cart must always accept writes. A customer who clicks "Add to Cart" and sees an error message is a lost sale. Returning a slightly stale cart is acceptable -- losing a write is not. The team can tolerate brief inconsistencies as long as they resolve eventually.
Here are the concrete numbers you must design for:
- 100,000+ operations per second (reads and writes combined)
- Sub-10 ms latency at the 99th percentile
- Horizontal scalability -- adding nodes should increase throughput linearly
- No single point of failure -- the system keeps working when any node goes down
- Eventual consistency is acceptable; strong consistency is NOT required
Key requirements to identify
| Requirement | What it implies |
|---|---|
| Always-writable | Choose AP in the CAP theorem -- sacrifice consistency for availability |
| No single point of failure | Fully decentralized architecture -- no leader node |
| Horizontal scalability | Need a partitioning scheme that distributes data evenly and handles node additions gracefully |
| Sub-10 ms latency | Reads/writes should contact only a small subset of nodes, not all of them |
| Eventual consistency | Need a conflict resolution mechanism for concurrent writes |
| Multi-region | Replication across failure domains with tolerance for temporary node unavailability |
The design approach
Start by thinking about data placement. You have N nodes and millions of keys. How do you decide which node owns which key? Naive modulo hashing (`hash(key) % N`) breaks catastrophically when you add or remove a node -- nearly every key remaps to a different owner. You need a scheme where adding a node moves only a small fraction of the keys.
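To see how badly modulo placement behaves, the sketch below (illustrative, not Dynamo's code) measures how many keys change owners when a cluster grows from 4 to 5 nodes; the key names and cluster sizes are assumptions.

```python
# Sketch: fraction of keys that remap under naive hash(key) % N placement
# when one node is added. Uses md5 for a stable, unsalted hash.
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"cart:{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys remap when going from 4 to 5 nodes")
# Roughly 80% move: under hash % N, only keys where the hash agrees
# modulo both N and N+1 stay put.
```

This is exactly the failure mode consistent hashing avoids: with N -> N+1 nodes, only about 1/(N+1) of the keys should move.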
Next, consider replication. If a key lives on node A, you probably want copies on nodes B and C. But what happens when node A goes down during a write? You could reject the write (sacrificing availability) or accept it on a different node and reconcile later (sacrificing consistency). The requirements tell you to accept it.
Now the hard part: two clients update the same cart on two different nodes simultaneously. Neither write is "wrong." You need a way to detect that a conflict happened and merge the results. Timestamps alone are unreliable across machines. You need a data structure that captures causal history -- which updates happened before which.
Finally, consider failure recovery. When a node comes back online after a crash, how does it catch up on the writes it missed? And how do you detect that two replicas have silently diverged over time?
How the industry solved it
Amazon built Dynamo to solve exactly this problem. The core insight: for a shopping cart, availability is worth more than consistency. A customer who sees a slightly outdated cart can still check out; a customer who gets an error cannot.
Dynamo's architecture is fully decentralized -- every node is identical, there is no leader, and any node can accept reads or writes for any key. This is the foundation of its availability guarantee.
Read the full Dynamo deep-dive starting from Dynamo Introduction.
Data partitioning with consistent hashing
Dynamo places keys on a hash ring. Each node owns a range of the ring. When a node joins, it takes over a portion of its neighbors' ranges -- only those keys move. Virtual nodes (vnodes) ensure an even distribution even when physical machines have different capacities.
Deep dive: Data Partitioning and the Consistent Hashing pattern
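The ring can be sketched in a few lines. This is a minimal consistent-hash ring with virtual nodes, not Dynamo's actual implementation; the node names, vnode count, and key format are illustrative assumptions.

```python
# Minimal consistent-hash ring with virtual nodes (vnodes).
# Each physical node is hashed to many positions on the ring;
# a key belongs to the first vnode clockwise from its hash.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=100):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node, vnodes_per_node)

    def add_node(self, node, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        pos = _hash(key)
        idx = bisect.bisect(self._ring, (pos, ""))  # first vnode clockwise
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner_before = {f"cart:{i}": ring.node_for(f"cart:{i}") for i in range(1000)}
ring.add_node("node-d")
moved = sum(ring.node_for(k) != v for k, v in owner_before.items())
print(f"{moved} of 1000 keys moved after adding node-d")
```

Adding `node-d` moves only roughly a quarter of the keys -- the ranges it takes over from its ring neighbors -- instead of nearly all of them as with modulo hashing.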
Replication and sloppy quorum
Each key is replicated on N nodes (typically N=3). Reads and writes use a quorum protocol: a write succeeds when W replicas acknowledge, and a read succeeds when R replicas respond. When R + W > N, the read and write sets must overlap, so every read sees at least one replica holding the latest acknowledged write. Tuning W and R (e.g., W=1 for maximum write availability) controls the consistency-availability tradeoff per operation.
Deep dive: Replication and the Quorum pattern
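The quorum accounting can be sketched as follows. This is a toy in-memory model, not Dynamo's API: replicas are plain dicts, the "RPC" is a direct assignment, and the N/R/W values are assumptions chosen so that R + W > N.

```python
# Toy quorum sketch: a write succeeds once W replicas ack; a read
# collects R responses and returns the highest-versioned value seen.
N, R, W = 3, 2, 2  # R + W > N => read and write quorums overlap

def write(replicas, key, value, version):
    acks = 0
    for replica in replicas:
        replica[key] = (version, value)  # in reality an RPC that may fail
        acks += 1
        if acks >= W:
            return True  # quorum reached; remaining replicas catch up async
    return False

def read(replicas, key):
    responses = [r[key] for r in replicas[:R] if key in r]
    return max(responses) if responses else None  # newest version wins

replicas = [{}, {}, {}]
write(replicas, "cart:42", ["book"], version=1)
print(read(replicas, "cart:42"))  # (1, ['book'])
```

With W=1 instead, a write returns after a single ack -- maximally available, but a subsequent R=1 read may miss it, which is exactly the tradeoff the table above describes.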
Conflict detection with vector clocks
When two nodes accept writes to the same key concurrently, Dynamo uses vector clocks to detect the conflict. Each write carries a vector of logical timestamps. During a read, if Dynamo finds conflicting versions, it returns all of them to the client for reconciliation (e.g., merging two shopping carts by taking the union of items).
Deep dive: Vector Clocks and Conflicting Data and the Vector Clocks pattern
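The causality check behind this is small enough to sketch directly. A minimal vector clock, with clocks as plain dicts mapping node name to counter (the node names below are illustrative): each node increments its own counter on a write, and two versions conflict when neither clock dominates the other.

```python
# Minimal vector clock sketch: {node: counter} dicts.
def increment(clock: dict, node: str) -> dict:
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a: dict, b: dict) -> bool:
    # True if 'a' has seen everything in 'b' (b happened before, or equals, a).
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflict(a: dict, b: dict) -> bool:
    # Concurrent writes: neither version descends from the other.
    return not descends(a, b) and not descends(b, a)

base = increment({}, "node-a")   # original cart version: {'node-a': 1}
v1 = increment(base, "node-a")   # client 1 updates via node-a
v2 = increment(base, "node-b")   # client 2 updates via node-b concurrently
print(conflict(v1, v2))    # True  -- concurrent, must be merged by the client
print(conflict(v1, base))  # False -- v1 simply supersedes base
```

On a read that finds `v1` and `v2`, the store returns both; the client merges them (e.g., a union of cart items) and writes back a version whose clock descends from both.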
Failure handling
- Hinted handoff: When a replica is down, another node temporarily accepts writes on its behalf and forwards them when it recovers. See the Hinted Handoff pattern.
- Merkle trees: Background anti-entropy compares hash trees between replicas to detect and repair divergence. See the Merkle Trees pattern.
- Gossip protocol: Nodes share membership and failure information through periodic randomized communication. See the Gossip Protocol pattern.
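Hinted handoff, the first mechanism above, can be sketched in miniature. This toy model (not Dynamo's implementation; node names and the API are assumptions) shows a stand-in node storing a write plus a "hint" naming the intended owner, then replaying it when the owner recovers.

```python
# Toy hinted-handoff sketch: when the home replica is down, a fallback
# node accepts the write and records a hint; hints are replayed later.
class Node:
    def __init__(self, name):
        self.name, self.up, self.data, self.hints = name, True, {}, []

def write(key, value, home: Node, fallback: Node):
    if home.up:
        home.data[key] = value
    else:
        fallback.data[key] = value                  # stay available
        fallback.hints.append((home, key, value))   # remember the real owner

def deliver_hints(node: Node):
    remaining = []
    for home, key, value in node.hints:
        if home.up:
            home.data[key] = value  # hand the write back to its home replica
        else:
            remaining.append((home, key, value))
    node.hints = remaining

a, b = Node("node-a"), Node("node-b")
a.up = False
write("cart:7", ["shoes"], home=a, fallback=b)  # node-a is down; node-b covers
a.up = True
deliver_hints(b)
print(a.data)  # {'cart:7': ['shoes']}
```

The write never fails from the client's point of view, which is the always-writable guarantee the requirements demand; Merkle-tree anti-entropy then catches any divergence hinted handoff misses.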
Key patterns used
| Pattern | Why it is needed | Reference |
|---|---|---|
| Consistent Hashing | Partition data across nodes with minimal redistribution on membership changes | Pattern |
| Quorum | Tunable consistency -- balance availability vs. consistency per operation | Pattern |
| Vector Clocks | Detect causal ordering and conflicts between concurrent writes | Pattern |
| Hinted Handoff | Maintain write availability when a replica is temporarily unreachable | Pattern |
| Gossip Protocol | Decentralized failure detection and membership management | Pattern |
| Merkle Trees | Efficient background synchronization between diverged replicas | Pattern |
| Read Repair | Fix stale replicas opportunistically during read operations | Pattern |
Connections to the CAP and PACELC theorems
Dynamo is a textbook AP system. Under the CAP theorem, it chooses availability over consistency during partitions. Under the PACELC theorem, it trades consistency for lower latency even when there is no partition (PA/EL).