Vector Clocks and Conflicting Data
As we saw in the previous chapter, sloppy quorum allows concurrent writes to the same key on non-overlapping sets of nodes. This means multiple conflicting versions can coexist. How does Dynamo detect and resolve these conflicts?
Why wall clocks don't work
On a single machine, timestamps are enough to order events: write at t=1, then write at t=2 -- clearly, the second is newer.
In a distributed system, this breaks because of clock skew -- different machines have different clocks, and they drift apart over time. Even with NTP synchronization, clocks can differ by milliseconds. You can't reliably say that an event at time t on Node A happened before an event at time t+1 on Node B.
Network Time Protocol synchronizes clocks across machines but cannot guarantee perfect synchronization. Typical accuracy is 1-10ms on a LAN -- enough uncertainty to make timestamp-based ordering unreliable for concurrent writes.
What is a vector clock?
Instead of timestamps, Dynamo uses vector clocks to track causal relationships between versions. A vector clock is a list of (node, counter) pairs -- one entry for each node that has written to the data.
Rule for comparing versions: if every counter in version A is ≤ the corresponding counter in version B (treating a missing entry as 0), then A is an ancestor of B -- B is newer, and A can safely be discarded. If neither version dominates the other, they are concurrent (conflicting) and must be reconciled.
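The comparison rule can be sketched in a few lines. This is a hypothetical illustration, not Dynamo's actual code: clocks are modeled as plain dicts mapping node name to counter, with a missing entry treated as 0.

```python
# Sketch of Dynamo-style vector clock comparison (illustrative, not
# Dynamo's real implementation). Clocks are dicts: node name -> counter.

def dominates(a: dict, b: dict) -> bool:
    """True if clock `a` descends from (or equals) clock `b`:
    every counter in b is <= the corresponding counter in a
    (a missing entry counts as 0)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    a_dom, b_dom = dominates(a, b), dominates(b, a)
    if a_dom and b_dom:
        return "identical"
    if a_dom:
        return "a is newer"   # b is an ancestor, safe to discard
    if b_dom:
        return "b is newer"
    return "concurrent"       # conflict: must be reconciled

print(compare({"A": 2}, {"A": 1}))          # a is newer
print(compare({"A": 3}, {"A": 2, "B": 1}))  # concurrent
```

The second call shows the conflict case: neither clock dominates, because each has one counter ahead of the other.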
Walkthrough: how conflicts arise and get detected
Let's trace a concrete scenario:
| Step | What happens | Vector clock |
|---|---|---|
| 1 | Server A writes k1 = "foo" | [A:1] → replicated to B |
| 2 | Server A updates k1 = "bar" | [A:2] → replicated to B |
| 3 | Network partition -- A and B can't communicate | |
| 4 | Server A writes k1 = "baz" | [A:3] (can't reach B, stored as hint elsewhere) |
| 5 | Server B writes k1 = "bax" | [A:2, B:1] (can't reach A, stored as hint elsewhere) |
| 6 | Partition heals | |
| 7 | Read for k1 sees both versions | [A:3] vs. [A:2, B:1] -- concurrent! |
At step 7, neither version dominates: [A:3] wins on A's counter (3 > 2), but [A:2, B:1] wins on B's (1 > the implicit 0 in [A:3]). The system returns both versions to the client and says: "You figure it out."
Most of the time, new versions cleanly supersede old ones (e.g., [A:2] is an obvious successor to [A:1]). Branching only happens when failures combine with concurrent writes. But when it does happen, Dynamo pushes resolution to the client -- because the client has semantic knowledge about the data.
Think of it like Git. If Git can merge two branches automatically, it does. If not, the developer resolves the conflict manually. Dynamo works the same way -- vector clocks detect the conflict, the client resolves it.
For Amazon's shopping cart, the reconciliation strategy is a union: merge all items from all conflicting versions. This means an add is never lost -- if you added an item on one version and your friend added another item on a different version, both items appear in the merged cart. The downside: deleted items can "resurface" if a conflict merges an older version that still had the deleted item.
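The union strategy itself is tiny. A minimal sketch (function and variable names are mine, not Amazon's):

```python
# Hypothetical sketch of shopping-cart reconciliation by set union.
# Each conflicting version carries its own item set; the merged cart
# is the union, so no add is ever lost -- but deletes can resurface.

def merge_carts(versions: list) -> set:
    merged = set()
    for items in versions:
        merged |= items
    return merged

# Two concurrent versions of the same cart:
mine   = {"book", "headphones"}          # I deleted "umbrella", added "headphones"
theirs = {"book", "umbrella", "socks"}   # friend's version still has "umbrella"

print(merge_carts([mine, theirs]))
# "umbrella" resurfaces in the merged result even though I deleted it.
```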
Vector clock truncation: a known weakness
Vector clocks grow over time -- one entry per node that has ever written to the key. In a large cluster, this list can become expensive to store and transmit. Dynamo truncates vector clocks, removing the oldest entries once the clock exceeds a size threshold.
The risk: if a truncated entry is needed to reconcile a conflict, Dynamo loses the ability to determine causal ordering and falls back to less precise reconciliation. The Dynamo authors acknowledge this as a potential problem but report it hasn't surfaced in production.
Alternatives to vector clocks
Conflict-free Replicated Data Types (CRDTs)
If you model your data so that concurrent changes can be applied in any order and produce the same result, conflicts resolve automatically. Amazon's shopping cart is a natural CRDT: adding items A and B can happen in any order, and the result is always {A, B}. Deletions are modeled as "negative adds."
This gives you strong eventual consistency: any two nodes that have received the same set of updates will converge to the same state, regardless of the order they applied them.
Riak provides built-in CRDTs.
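One simple CRDT that captures the "deletions as negative adds" idea is a two-phase set (2P-set): adds and removes each accumulate in their own grow-only set, and merging is element-wise union, so replicas converge regardless of update order. This is a minimal sketch of the general technique, not Riak's or Amazon's implementation:

```python
# Hypothetical 2P-set sketch: adds and removes ("negative adds") are
# each grow-only sets; merge is union of both sides, so replicas that
# have seen the same updates converge regardless of order.

class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()   # tombstones: once removed, always removed

    def add(self, x):
        self.added.add(x)

    def remove(self, x):
        self.removed.add(x)

    def value(self):
        return self.added - self.removed

    def merge(self, other):
        m = TwoPhaseSet()
        m.added = self.added | other.added
        m.removed = self.removed | other.removed
        return m

# Two replicas apply the same adds in different orders...
r1, r2 = TwoPhaseSet(), TwoPhaseSet()
r1.add("A"); r1.add("B")
r2.add("B"); r2.add("A")
# ...and merge to the same state either way: strong eventual consistency.
assert r1.merge(r2).value() == r2.merge(r1).value() == {"A", "B"}
```

The tombstone set is what makes removal commute with concurrent adds; the trade-off is that a removed element can never be re-added, which is why production CRDTs (like Riak's OR-sets) use more sophisticated schemes.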
Last-Write-Wins (LWW)
The simplest approach: use wall-clock timestamps and keep whichever version has the later timestamp. This is what Cassandra chose instead of vector clocks. The advantage: no client-side conflict resolution. The disadvantage: concurrent writes silently lose data -- equivalent to flipping a coin on which write to discard.
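The whole strategy fits in one function, which is both its appeal and its danger. A minimal sketch (the timestamps and values are made up for illustration):

```python
# Hypothetical last-write-wins sketch: each write carries a wall-clock
# timestamp; reconciliation keeps whichever version has the later one.

def lww_resolve(a, b):
    """a and b are (timestamp, value) pairs. The later timestamp wins;
    the other write is silently discarded."""
    return a if a[0] >= b[0] else b

v1 = (1700000000.120, "baz")   # written on Node A
v2 = (1700000000.115, "bax")   # written concurrently on Node B

print(lww_resolve(v1, v2))
# "baz" survives; "bax" is dropped -- and with clock skew of a few
# milliseconds, which one survives is effectively a coin flip.
```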
Taken together, these conflict-resolution strategies form a spectrum:
- Vector clocks -- detect conflicts, push resolution to the client (Dynamo)
- CRDTs -- model data so conflicts resolve automatically (Riak)
- Last-write-wins -- simple but lossy (Cassandra)
The right choice depends on whether data loss is acceptable. For a shopping cart, LWW could lose items -- unacceptable. For a timestamp or counter, LWW is fine.