Dynamo Introduction

Design a distributed key-value store that is highly available, highly scalable, and completely decentralized.

Core concepts

High availability -- the system keeps serving requests even when nodes fail. No downtime, no "please try again later."
Scalability -- add more machines and get a roughly proportional increase in capacity. No re-architecture needed.
Decentralized -- no single node is special. No leader, no coordinator, no single point of failure.

Why Dynamo matters

It's Black Friday at Amazon. Millions of customers are adding items to their shopping carts simultaneously. A few servers go down -- because at this scale, hardware failure isn't a possibility, it's a certainty. What happens to those shopping carts?

This is the problem Amazon built Dynamo to solve. The insight that drove the entire design: a customer who can't add an item to their cart is a lost sale. Availability isn't a nice-to-have -- it's revenue. Amazon observed that even brief outages directly correlated with fewer customers served, and therefore chose to build a system that is always writable, even at the cost of occasionally returning stale data.

Think about that trade-off: Amazon would rather show you a slightly outdated cart than show you an error page. Most inconsistencies get resolved in the background, and the customer never notices.

Interview insight

Dynamo is one of the best examples of business requirements driving system design. In an interview, when you're asked to design a system, always start with: "What matters more -- consistency or availability?" Dynamo shows what happens when you go all-in on availability.

What is Dynamo?

Dynamo is a highly available key-value store that Amazon built for internal use. Many Amazon services -- shopping cart, bestseller lists, sales rank, product catalog -- only need simple primary-key access to data. A relational database would be overkill for these use cases and would actually limit the scalability and availability these services need.

Dynamo provides a flexible design where applications choose their own trade-off point between availability and consistency.

Don't confuse these

Dynamo (the subject of this chapter) is Amazon's internal system, described in their 2007 paper. DynamoDB is the commercial managed service Amazon later built, inspired by Dynamo's design but with significant differences. They are not the same thing.

Background

In CAP theorem terms, Dynamo is an AP system -- it chooses availability and partition tolerance over strong consistency. When a network partition happens, Dynamo keeps serving reads and writes on both sides of the partition, and reconciles later.

The Dynamo design was hugely influential. It directly inspired:

Cassandra (which we cover in this course)
Riak
Voldemort
Amazon's own DynamoDB

If you understand Dynamo deeply, you already understand the foundation of many of the NoSQL databases in production today.

Design goals

Goal	What it means	Why it matters
Highly available	The system always responds, even during failures	Downtime = lost revenue for Amazon
Scalable	Add a node, get proportional capacity increase	Must handle Black Friday-scale traffic spikes
Decentralized	No leader node, no single point of failure	A leader is a bottleneck and a risk
Eventually consistent	Writes propagate asynchronously; reads may see stale data briefly	The price of being always-writable

The key trade-off

Dynamo uses optimistic replication -- it accepts writes without waiting for all replicas to confirm. This makes writes fast and highly available, but means different nodes can temporarily have different versions of the same data. Dynamo resolves these conflicts later, during reads.
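
To make the trade-off concrete, here is a minimal sketch of a write that is accepted before every replica has seen it. The `Replica` class, the `optimistic_put` function, and the `w` parameter (how many acknowledgements count as success) are hypothetical names chosen for illustration. Real Dynamo sends writes over the network in parallel, which this single-process toy does not model.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; `up` simulates node health."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


def optimistic_put(replicas, key, value, w):
    """Apply the write on every reachable replica; succeed if w of them ack."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
    return acks >= w


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
accepted = optimistic_put(replicas, "cart:42", ["book"], w=2)

# The write is accepted even though replica "c" never saw it, so the
# replicas disagree until some later reconciliation step runs.
versions = [r.data.get("cart:42") for r in replicas]
print(accepted, versions)
```

The point of the sketch is the last two lines: the caller gets a success result while one replica still holds no value for the key, which is exactly the temporary divergence described above.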

Dynamo's use cases

Dynamo works well when:

You need simple key-value access (no complex queries, no joins)
Availability matters more than consistency (e.g., a shopping cart should always be writable)
You need predictable low latency at massive scale

Dynamo is a poor fit when:

You need strong consistency (e.g., financial transactions, inventory counts)
You need complex queries across multiple keys
Your data model requires relations between entities

Interview angle

When an interviewer asks "Why not just use a relational database?", the answer for Dynamo's use case is clear: Amazon's services only need primary-key lookups. A relational DB would impose coordination overhead (for joins, transactions, schemas) that hurts availability and scalability without providing any benefit these services actually need.

System APIs

Dynamo exposes just two operations -- radical simplicity by design:

get(key) -- Returns the object(s) associated with the key. If there are conflicting versions (because of concurrent writes), it returns all of them along with a context that carries version metadata. The application decides which version wins.
put(key, context, object) -- Stores the object for the given key. The context (from a previous get) is passed back so Dynamo can track version lineage -- think of it as a cookie that helps Dynamo know which version you're updating.
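
The two calls above can be sketched with a toy in-memory store. This is only an illustration of how a context ties a put to the versions a client has read. The `ToyStore` class and its use of integer version ids are assumptions made for this example; Dynamo's real context carries vector clock metadata, which later chapters cover.

```python
import itertools


class ToyStore:
    """Hypothetical single-process stand-in for a Dynamo-style store."""

    def __init__(self):
        self._versions = {}              # key -> {version_id: object}
        self._ids = itertools.count(1)   # source of fresh version ids

    def get(self, key):
        """Return (objects, context); the context names the versions seen."""
        versions = self._versions.get(key, {})
        return list(versions.values()), frozenset(versions)

    def put(self, key, context, obj):
        """Store obj, superseding only the versions named in the context."""
        versions = self._versions.setdefault(key, {})
        for version_id in context:
            versions.pop(version_id, None)
        versions[next(self._ids)] = obj


store = ToyStore()
store.put("cart:42", frozenset(), ["book"])

# Two clients read the same version, then write without seeing each other.
_, ctx_a = store.get("cart:42")
_, ctx_b = store.get("cart:42")
store.put("cart:42", ctx_a, ["book", "pen"])
store.put("cart:42", ctx_b, ["book", "mug"])

# Both writes survive as conflicting versions; the application merges them
# and writes the result back using the context from the latest get.
objects, ctx = store.get("cart:42")
merged = sorted({item for cart in objects for item in cart})
store.put("cart:42", ctx, merged)
final_objects, _ = store.get("cart:42")
print(objects, final_objects)
```

In this sketch the second client's put cannot overwrite the first client's write, because its context only names the version it actually read. The merge shown here (a union of cart items) is one possible application-level policy, not something the store decides.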

Dynamo treats both keys and values as opaque byte arrays; stored objects are relatively small, typically under 1 MB. It applies an MD5 hash to the key to produce a 128-bit identifier, which determines which nodes are responsible for storing that key.

Why MD5?

MD5 produces a uniform distribution of hash values, which means keys get spread evenly across nodes. The specific hash function doesn't matter much -- what matters is that it distributes well and is fast to compute. Dynamo doesn't use MD5 for security, just for distribution.
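
As a rough illustration of both points, the sketch below hashes a key to a 128-bit integer with Python's standard `hashlib.md5` and then counts how 10,000 sample keys spread over four equal slices of the hash space. The `key_to_id` helper, the sample key names, and the four-slice split are assumptions for this example. The split is not Dynamo's actual placement scheme, which uses consistent hashing.

```python
import hashlib
from collections import Counter


def key_to_id(key: bytes) -> int:
    """Hash an opaque key to a 128-bit integer using MD5."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


ident = key_to_id(b"cart:42")
print(f"{ident:032x}")  # 32 hex digits = 128 bits

# Split the 128-bit space into equal slices (hypothetical, for illustration)
# and count how many sample keys fall into each slice.
NUM_SLICES = 4
counts = Counter(
    (key_to_id(f"user:{i}".encode()) * NUM_SLICES) >> 128
    for i in range(10_000)
)
print(sorted(counts.items()))
```

Each slice should receive close to a quarter of the keys, even though the sample keys are sequential and look nothing like random input.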

Think first

Amazon needs a data store for its shopping cart. The store must always accept writes, even during network partitions or node failures. What trade-off is Amazon implicitly making, and what problems will this create?

Quiz

A Dynamo cluster has N=3, W=2, R=2. One of the three replica nodes goes down. What happens to write operations?

Writes fail because not all replicas are available

Writes succeed because the two remaining healthy nodes can still satisfy W=2

Writes are queued until the node recovers

The system switches to strong consistency mode

What's next

In the following chapters, we'll walk through Dynamo's architecture piece by piece:

How data gets partitioned across nodes using consistent hashing
How data gets replicated for fault tolerance
How conflicts are detected and resolved using vector clocks
How reads and writes actually flow through the system
What happens when nodes fail -- hinted handoff, read repair, and Merkle trees

Each of these maps directly to a system design pattern you'll see in Part 2 of this course.
