
High-Level Architecture

At a high level, Dynamo is a distributed hash table (DHT) whose data is replicated across multiple nodes for high availability and fault tolerance.

Key terms
  • DHT -- A distributed storage system that provides key-value lookups. Any node in the DHT can efficiently retrieve the value for a given key.
  • Fault tolerance -- The ability to continue operating when components fail. Dynamo achieves this through redundancy: every piece of data exists on multiple nodes.
Think first
Dynamo needs to be 'always writeable' — even when nodes fail or network partitions occur. What are the key problems a distributed key-value store must solve to achieve this?

Dynamo's architecture in one picture

Dynamo combines six techniques into a single coherent system. Here's how they fit together:

| Component | Technique | Pattern |
| --- | --- | --- |
| Data distribution | Consistent hashing with virtual nodes | Spread data evenly; minimize movement when nodes join/leave |
| Data replication | Optimistic replication → eventual consistency | Replicate asynchronously for speed; tolerate temporary inconsistency |
| Temporary failure handling | Sloppy quorum + hinted handoff | Always accept writes, even when replicas are down |
| Node communication & failure detection | Gossip protocol | Decentralized -- every node knows the state of every other node |
| Conflict detection | Vector clocks | Track causal history to detect concurrent writes |
| Permanent failure recovery | Merkle trees for anti-entropy | Efficiently find and fix divergence between replicas |
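To make the first row concrete, here is a minimal sketch of consistent hashing with virtual nodes. The class and method names, the number of virtual nodes, and the use of MD5 as the ring hash are all illustrative assumptions, not Dynamo's actual implementation:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a string to a position on the ring (0 .. 2**32 - 1).
    # MD5 is an arbitrary choice here; any uniform hash works.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (a sketch, not Dynamo's code)."""

    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes = vnodes_per_node
        self.ring = []  # sorted list of (ring position, physical node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several positions ("virtual nodes"),
        # which spreads its load evenly around the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        # Only the departed node's ranges move; other assignments are untouched.
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def preference_list(self, key: str, n: int = 3) -> list[str]:
        # Walk clockwise from the key's position, collecting n distinct
        # physical nodes -- the replicas responsible for this key.
        start = bisect.bisect(self.ring, (ring_hash(key),))
        result = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result
```

Removing a node only reassigns the ranges it owned; every key not in those ranges keeps the same preference list, which is the "minimize movement" property in the table.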
The design philosophy

Notice that every component prioritizes availability over consistency. Sloppy quorum keeps writes flowing during failures. Hinted handoff stores data for downed nodes. Optimistic replication avoids waiting for confirmations. Conflict resolution happens later, during reads. This isn't an accident -- it's the direct consequence of Amazon's business requirement that the system must always accept writes.

How they work together

The flow for a typical write:

  1. Client sends put(key, value) to any Dynamo node
  2. Consistent hashing determines which nodes should store this key
  3. The coordinator sends the write to the top N nodes on the preference list
  4. If a target node is down, a healthy node accepts the write via sloppy quorum and stores a hint
  5. Once W nodes confirm, the write is acknowledged to the client
  6. Gossip protocol keeps all nodes informed about who's alive and what ranges they own
  7. If a node was down, hinted handoff delivers the write when it recovers
  8. If a node was down for a long time, Merkle trees detect and repair divergence in the background
  9. If concurrent writes created conflicts, vector clocks detect them during the next read
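Steps 2-5 of this flow can be sketched in a few lines. Everything here is a simplified model under stated assumptions: `alive` stands in for the node liveness view that gossip provides, and `store`/`hints` are plain dicts standing in for real per-node storage:

```python
def put(key, value, preference_list, alive, store, hints, n=3, w=2):
    """Sloppy-quorum write sketch: store on the first n reachable nodes
    of the preference list; leave hints for any downed 'home' replicas.
    Returns True once at least w nodes confirmed the write."""
    intended = preference_list[:n]                    # the top-N home replicas
    downed = [m for m in intended if m not in alive]  # replicas being covered for
    confirmed = 0
    for node in preference_list:
        if node not in alive:
            continue  # skip downed nodes; a later healthy node covers for them
        store[node][key] = value
        if node not in intended and downed:
            # Stand-in replica: record a hint naming the node this write was
            # meant for, so hinted handoff can deliver it on recovery.
            hints[node].append((key, value, downed.pop(0)))
        confirmed += 1  # in this sketch, every stored write confirms
        if confirmed == n:
            break
    return confirmed >= w
```

With N=3 and W=2, the client gets an acknowledgement as soon as two nodes have the write, even if one of the three home replicas is down -- that is the "always accept writes" property from the table.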
Think first
In Dynamo's write flow, what happens if one of the designated replica nodes is down when a write arrives?

Each of the following chapters dives deep into one of these components. But keep this big picture in mind -- Dynamo's power comes from how these pieces fit together, not from any single technique.
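Step 9 of the write flow -- detecting concurrent writes with vector clocks -- reduces to a single comparison. A minimal sketch, where a clock is a dict mapping a coordinator node's name to a counter (the names below are made up):

```python
def descends(a: dict, b: dict) -> bool:
    # Clock a descends from clock b if a has seen at least as many
    # events from every node that b has seen events from.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither clock descends from the other: the writes happened
    # concurrently, and the conflict must be surfaced at read time.
    return not descends(a, b) and not descends(b, a)

# Two clients update the same ancestor {"Sx": 1} via different coordinators:
v1 = {"Sx": 1, "Sy": 1}   # written through node Sy
v2 = {"Sx": 1, "Sz": 1}   # written through node Sz
```

Here `concurrent(v1, v2)` is true, so a read returns both versions and the client (or a merge rule) reconciles them -- exactly the "conflict resolution happens later, during reads" behavior described above.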

Quiz
A Dynamo node goes down for 2 hours and then recovers. Which mechanisms will help bring its data back up to date?
Interview angle

If asked to describe Dynamo's architecture, walk through this table. It shows you understand not just the individual techniques but why each was chosen and how they compose into a system that's always available for writes.