Dynamo Introduction
Design a distributed key-value store that is highly available, highly scalable, and completely decentralized.
- High availability -- the system keeps serving requests even when nodes fail. No downtime, no "please try again later."
- Scalability -- throw more machines at the problem and get proportional improvement. No re-architecture needed.
- Decentralized -- no single node is special. No leader, no coordinator, no single point of failure.
Why Dynamo matters
It's Black Friday at Amazon. Millions of customers are adding items to their shopping carts simultaneously. A few servers go down -- because at this scale, hardware failure isn't a possibility, it's a certainty. What happens to those shopping carts?
This is the problem Amazon built Dynamo to solve. The insight that drove the entire design: a customer who can't add an item to their cart is a lost sale. Availability isn't a nice-to-have -- it's revenue. Amazon observed that even brief outages directly correlated with fewer customers served, and therefore chose to build a system that is always writable, even at the cost of occasionally returning stale data.
Think about that trade-off: Amazon would rather show you a slightly outdated cart than show you an error page. Most inconsistencies get resolved in the background, and the customer never notices.
Dynamo is one of the best examples of business requirements driving system design. In an interview, when you're asked to design a system, always start with: "What matters more -- consistency or availability?" Dynamo shows what happens when you go all-in on availability.
What is Dynamo?
Dynamo is a highly available key-value store that Amazon built for internal use. Many Amazon services -- shopping cart, bestseller lists, sales rank, product catalog -- only need simple primary-key access to data. A relational database would be overkill for these use cases and would actually limit the scalability and availability these services need.
Dynamo provides a flexible design where applications choose their own trade-off point between availability and consistency.
Dynamo (the subject of this chapter) is Amazon's internal system, described in their 2007 paper. DynamoDB is the commercial managed service Amazon later built, inspired by Dynamo's design but with significant differences. They are not the same thing.
Background
In CAP theorem terms, Dynamo is an AP system -- it chooses availability and partition tolerance over strong consistency. When a network partition happens, Dynamo keeps serving reads and writes on both sides of the partition, and reconciles later.
The Dynamo design was hugely influential. It directly inspired:
If you understand Dynamo deeply, you already understand the foundation of half the NoSQL databases in production today.
Design goals
| Goal | What it means | Why it matters |
|---|---|---|
| Highly available | The system always responds, even during failures | Downtime = lost revenue for Amazon |
| Scalable | Add a node, get proportional capacity increase | Must handle Black Friday-scale traffic spikes |
| Decentralized | No leader node, no single point of failure | A leader is a bottleneck and a risk |
| Eventually consistent | Writes propagate asynchronously; reads may see stale data briefly | The price of being always-writable |
Dynamo uses optimistic replication -- it accepts writes without waiting for all replicas to confirm. This makes writes fast and highly available, but means different nodes can temporarily have different versions of the same data. Dynamo resolves these conflicts later, during reads.
Dynamo's use cases
Dynamo works well when:
- You need simple key-value access (no complex queries, no joins)
- Availability matters more than consistency (e.g., a shopping cart should always be writable)
- You need predictable low-latency at massive scale
Dynamo is a poor fit when:
- You need strong consistency (e.g., financial transactions, inventory counts)
- You need complex queries across multiple keys
- Your data model requires relations between entities
When an interviewer asks "Why not just use a relational database?", the answer for Dynamo's use case is clear: Amazon's services only need primary-key lookups. A relational DB would impose coordination overhead (for joins, transactions, schemas) that hurts availability and scalability without providing any benefit these services actually need.
System APIs
Dynamo exposes just two operations -- radical simplicity by design:
-
get(key)-- Returns the object(s) associated with the key. If there are conflicting versions (because of concurrent writes), it returns all of them along with acontextthat carries version metadata. The application decides which version wins. -
put(key, context, object)-- Stores the object for the given key. Thecontext(from a previousget) is passed back so Dynamo can track version lineage -- think of it as a cookie that helps Dynamo know which version you're updating.
Dynamo treats both keys and values as opaque byte arrays (typically under 1 MB). It applies MD5 hashing to the key to produce a 128-bit identifier, which determines which nodes are responsible for storing that key.
MD5 produces a uniform distribution of hash values, which means keys get spread evenly across nodes. The specific hash function doesn't matter much -- what matters is that it distributes well and is fast to compute. Dynamo doesn't use MD5 for security, just for distribution.
What's next
In the following chapters, we'll walk through Dynamo's architecture piece by piece:
- How data gets partitioned across nodes using consistent hashing
- How data gets replicated for fault tolerance
- How conflicts are detected and resolved using vector clocks
- How reads and writes actually flow through the system
- What happens when nodes fail -- hinted handoff, read repair, and Merkle trees
Each of these maps directly to a system design pattern you'll see in Part 2 of this course.