Summary: Dynamo
The big picture
Dynamo is Amazon's answer to a specific business question: "How do we build a data store that never says no to a customer?" Every design decision flows from that requirement -- from choosing eventual consistency, to the peer-to-peer architecture, to the conflict resolution strategy that pushes decisions to the application layer.
What makes Dynamo remarkable isn't any single technique. It's how it pulls together a dozen distributed systems patterns into a coherent design. If you understand Dynamo, you've seen most of the fundamental patterns in action.
Key takeaways
- Business requirements drive architecture. Amazon chose availability over consistency because downtime = lost revenue. This isn't a technical preference -- it's a business decision with system-wide implications.
- Peer-to-peer by design. Every node is equal. No leader, no single point of failure. This means no coordination bottleneck, but it also means every node must be able to handle any request.
- Always writable. Dynamo accepts writes even during failures using sloppy quorum and hinted handoff. The cost: multiple conflicting versions can exist simultaneously.
- Conflicts are the client's problem. When Dynamo can't automatically reconcile versions using vector clocks, it hands all versions to the application. The reasoning: the app has semantic knowledge that a generic database doesn't.
- No security. Dynamo was built for Amazon's trusted internal network. This is a deliberate scope limitation, not an oversight.
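To make the conflict-detection idea concrete, here is a minimal vector-clock sketch. This is illustrative code, not Dynamo's implementation; the function names (`descends`, `compare`) and node IDs (`Sx`, `Sy`, `Sz`) are invented for the example. A clock maps each coordinating node to a logical counter, and comparing two clocks tells us whether one version causally supersedes the other or they are concurrent:

```python
def descends(a, b):
    """True if clock `a` causally descends from (or equals) clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Classify the causal relationship between two vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    # Neither descends from the other: this is the case where Dynamo
    # returns all conflicting versions to the application.
    return "concurrent"

# Two clients both read version {Sx: 1}, then write through different
# coordinator nodes, producing sibling versions:
v1 = {"Sx": 1, "Sy": 1}   # write coordinated by node Sy
v2 = {"Sx": 1, "Sz": 1}   # write coordinated by node Sz
print(compare(v1, v2))    # concurrent -> the app must reconcile
```

The `compare` result is exactly what drives Dynamo's behavior: a "supersedes" outcome lets the system discard the older version automatically, while "concurrent" forces client-side reconciliation.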
How Dynamo uses system design patterns
This is where everything connects. Each row links a problem Dynamo faces to the pattern that solves it and explains why that pattern fits:
| Problem | Pattern | Why this pattern? | Advantage |
|---|---|---|---|
| Distributing data across nodes | Consistent Hashing | Adding/removing nodes moves only neighboring keys, not everything | Incremental scalability |
| Ensuring writes survive failures | Quorum (sloppy) | Doesn't require all replicas -- a majority is enough. Sloppy quorum goes further by accepting any healthy node | High availability for writes |
| Detecting conflicting writes | Vector Clocks | Tracks causal history per-node, so you can tell if versions are sequential or concurrent | Version tracking decoupled from update rate |
| Handling temporarily failed nodes | Hinted Handoff | Healthy nodes temporarily accept writes for failed ones and replay them later | No data loss during transient failures |
| Repairing stale replicas | Read Repair | During reads, if a replica is stale, update it immediately | Repairs inconsistencies lazily without background overhead |
| Detecting diverged replicas | Merkle Trees | Efficiently compare data between nodes by hashing tree structure -- only transfer the parts that differ | Background synchronization of divergent replicas |
| Discovering nodes and failures | Gossip Protocol | Decentralized information spreading -- no central membership service needed | Preserves symmetry, no central monitoring |
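The first row of the table, consistent hashing with virtual nodes, can be sketched in a few lines. This is a toy ring, not Dynamo's code: the class name, the MD5 hash choice, and the `vnodes=8` default are assumptions made for the example. The point it demonstrates is the "Advantage" column: removing a node only remaps the keys that node owned.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node gets `vnodes` positions on the ring,
        # which smooths out load when nodes join or leave.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first vnode clockwise from its hash.
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C"])
owners = {k: ring.node_for(k) for k in (f"key{i}" for i in range(200))}
smaller = ConsistentHashRing(["A", "B"])  # node C leaves the ring
# Every key NOT owned by C keeps its owner; only C's keys move.
assert all(smaller.node_for(k) == v for k, v in owners.items() if v != "C")
```

Contrast this with naive `hash(key) % num_nodes` partitioning, where changing the node count remaps almost every key in the cluster.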
Dynamo's influence
Dynamo isn't open-source, but its ideas spread everywhere. Here's how its spiritual successors adopted (or rejected) its techniques:
| Technique | Cassandra | Riak |
|---|---|---|
| Consistent hashing with virtual nodes | ✔ | ✔ |
| Hinted handoff | ✔ | ✔ |
| Anti-entropy with Merkle trees | ✔ (manual repair) | ✔ |
| Vector clocks | ✘ (uses last-write-wins) | ✔ |
| Gossip-based protocol | ✔ | ✔ |
Cassandra chose last-write-wins (LWW) instead of vector clocks because it dramatically simplifies the client API -- applications don't need to handle conflict resolution. The trade-off: silently lost writes when concurrent updates happen. For Cassandra's typical use cases (time-series data, event logs), this is acceptable.
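The silent-loss trade-off is easy to see in code. A hedged sketch of LWW merging (function and variable names are invented for illustration): each version carries a timestamp, merge keeps the newer one, and the concurrent older write simply vanishes.

```python
def lww_merge(a, b):
    """Last-write-wins merge: `a` and `b` are (timestamp, value) pairs."""
    return a if a[0] >= b[0] else b

# Two clients update the same key concurrently on different replicas:
replica1 = (1000, ["book"])        # client X adds a book
replica2 = (1001, ["lamp"])        # client Y adds a lamp, 1ms "later"
merged = lww_merge(replica1, replica2)
# merged == (1001, ["lamp"]): the book silently disappears. Under
# Dynamo's vector clocks, both versions would be surfaced to the app,
# which could merge the carts into ["book", "lamp"].
```

This is why the choice is workload-dependent: for append-heavy time-series data, concurrent updates to the same key are rare, so LWW's simplicity wins; for a shopping cart, it loses customer intent.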
Criticism
No system is perfect. Dynamo's design has real limitations:
- Full routing table on every node -- every node stores the entire ring topology. As the cluster grows, this becomes expensive to maintain and gossip.
- Seeds violate symmetry -- despite claiming "every node is equal," Dynamo uses special seed nodes for bootstrapping. These are externally discoverable and serve as entry points, which is a form of asymmetry.
- Leaky abstraction -- clients sometimes have to resolve conflicts themselves (e.g., merging shopping cart versions). The user experience isn't fully transparent.
- No security model -- built for Amazon's internal trusted network, making it unsuitable for untrusted environments without additional security layers.
Quick reference card
| Property | Value |
|---|---|
| Type | Key-value store |
| CAP classification | AP (available + partition-tolerant) |
| Consistency model | Eventually consistent (tunable) |
| Data partitioning | Consistent hashing with virtual nodes |
| Replication | Sloppy quorum (configurable N, R, W) |
| Conflict resolution | Vector clocks → client reconciliation |
| Failure handling | Hinted handoff + Merkle tree anti-entropy |
| Node communication | Gossip protocol |
| API | get(key), put(key, context, object) |
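The "configurable N, R, W" row deserves one worked detail: the quorum overlap condition. With N replicas, a write waits for W acknowledgments and a read consults R replicas; choosing R + W > N guarantees every read set intersects every write set, so a read always touches at least one replica holding the latest write. A small sketch (the helper name is invented; the N=3, R=2, W=2 configuration is the common one described in the Dynamo paper), with a brute-force check of the overlap claim:

```python
import itertools

def quorum_is_consistent(n, r, w):
    """True if any R-replica read must intersect any W-replica write."""
    return r + w > n

print(quorum_is_consistent(3, 2, 2))   # True: reads see the latest write
print(quorum_is_consistent(3, 1, 1))   # False: fast, but reads may be stale

# Brute-force verification for N=3, R=2, W=2: every possible read set
# shares at least one replica with every possible write set.
replicas = range(3)
pairs = list(itertools.combinations(replicas, 2))
assert all(set(rd) & set(wr) for rd in pairs for wr in pairs)
```

Note that under sloppy quorum this overlap guarantee weakens: during failures the W acknowledgments may come from fallback nodes outside the key's preference list, which is exactly why Dynamo still needs hinted handoff and anti-entropy to converge.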
Exercise: design a distributed session store using the patterns above.
References and further reading
- Amazon's Dynamo paper -- the original 2007 paper
- Eventually Consistent -- Werner Vogels on the consistency model
- A Decade of Dynamo -- retrospective 10 years later
- Dynamo: A flawed architecture -- critical discussion
- Riak's Dynamo implementation