Vector Clocks and Conflicting Data
As we saw in the previous chapter, sloppy quorum allows concurrent writes to the same key on non-overlapping sets of nodes. This means multiple conflicting versions can coexist. How does Dynamo detect and resolve these conflicts?
Why wall clocks don't work
On a single machine, timestamps are enough to order events: write at t=1, then write at t=2 -- clearly, the second is newer.
In a distributed system, this breaks because of clock skew -- different machines have different clocks, and they drift apart over time. Even with NTP synchronization, clocks can differ by milliseconds. You can't reliably say that an event at time t on Node A happened before an event at time t+1 on Node B.
Network Time Protocol synchronizes clocks across machines but cannot guarantee perfect synchronization. Typical accuracy is 1-10ms on a LAN -- enough uncertainty to make timestamp-based ordering unreliable for concurrent writes.
What is a vector clock?
Instead of timestamps, Dynamo uses vector clocks to track causal relationships between versions. A vector clock is a list of (node, counter) pairs -- one entry for each node that has written to the data.
Rule for comparing versions: if every counter in version A is ≤ the corresponding counter in version B (treating a missing entry as 0), then A is an ancestor of B -- B is newer, and A can safely be discarded. If neither version dominates the other, they are concurrent (conflicting) and must be reconciled.
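The comparison rule can be sketched in a few lines. This is a hypothetical illustration, not Dynamo's actual code: clocks are modeled as plain dicts mapping node name to counter, with a missing entry treated as 0.

```python
# Sketch of Dynamo-style vector clock comparison (illustrative, not
# Dynamo's real implementation). Clocks are dicts: node name -> counter.

def dominates(a: dict, b: dict) -> bool:
    """True if clock `a` descends from (or equals) clock `b`:
    every counter in b is <= the corresponding counter in a
    (a missing entry counts as 0)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    a_dom, b_dom = dominates(a, b), dominates(b, a)
    if a_dom and b_dom:
        return "identical"
    if a_dom:
        return "a is newer"   # b is an ancestor, safe to discard
    if b_dom:
        return "b is newer"
    return "concurrent"       # conflict: must be reconciled

print(compare({"A": 2}, {"A": 1}))          # a is newer
print(compare({"A": 3}, {"A": 2, "B": 1}))  # concurrent
```

The second call shows the conflict case: neither clock dominates, because each has one counter ahead of the other.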
Walkthrough: how conflicts arise and get detected
Let's trace a concrete scenario:
| Step | What happens | Vector clock |
|---|---|---|
| 1 | Server A writes k1 = "foo" | [A:1] → replicated to B |
| 2 | Server A updates k1 = "bar" | [A:2] → replicated to B |
| 3 | Network partition -- A and B can't communicate | |
| 4 | Server A writes k1 = "baz" | [A:3] (can't reach B, stored as hint elsewhere) |
| 5 | Server B writes k1 = "bax" | [A:2, B:1] (can't reach A, stored as hint elsewhere) |
| 6 | Partition heals | |
| 7 | Read for k1 sees both versions | [A:3] vs. [A:2, B:1] -- concurrent! |
At step 7, neither version dominates: [A:3] wins on A's counter (3 > 2), but [A:2, B:1] wins on B's (1 > the implicit 0 in [A:3]). The system returns both versions to the client and says: "You figure it out."
Most of the time, new versions cleanly supersede old ones (e.g., [A:2] is an obvious successor to [A:1]). Branching only happens when failures combine with concurrent writes. But when it does happen, Dynamo pushes resolution to the client -- because the client has semantic knowledge about the data.
Think of it like Git. If Git can merge two branches automatically, it does. If not, the developer resolves the conflict manually. Dynamo works the same way -- vector clocks detect the conflict, the client resolves it.
For Amazon's shopping cart, the reconciliation strategy is a union: merge all items from all conflicting versions. This means an add is never lost -- if you added an item on one version and your friend added another item on a different version, both items appear in the merged cart. The downside: deleted items can "resurface" if a conflict merges an older version that still had the deleted item.
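The union strategy itself is tiny. A minimal sketch (function and variable names are mine, not Amazon's):

```python
# Hypothetical sketch of shopping-cart reconciliation by set union.
# Each conflicting version carries its own item set; the merged cart
# is the union, so no add is ever lost -- but deletes can resurface.

def merge_carts(versions: list) -> set:
    merged = set()
    for items in versions:
        merged |= items
    return merged

# Two concurrent versions of the same cart:
mine   = {"book", "headphones"}          # I deleted "umbrella", added "headphones"
theirs = {"book", "umbrella", "socks"}   # friend's version still has "umbrella"

print(merge_carts([mine, theirs]))
# "umbrella" resurfaces in the merged result even though I deleted it.
```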
Vector clock truncation: a known weakness
Vector clocks grow over time -- one entry per node that has ever written to the key. In a large cluster, this list can become expensive to store and transmit. Dynamo truncates vector clocks, removing the oldest entries once the clock exceeds a size threshold.
The risk: if a truncated entry is needed to reconcile a conflict, Dynamo loses the ability to determine causal ordering and falls back to less precise reconciliation. The Dynamo authors acknowledge this as a potential problem but report it hasn't surfaced in production.
Alternatives to vector clocks
Conflict-free Replicated Data Types (CRDTs)
If you model your data so that concurrent changes can be applied in any order and produce the same result, conflicts resolve automatically. Amazon's shopping cart is a natural CRDT: adding items A and B can happen in any order, and the result is always {A, B}. Deletions are modeled as "negative adds."
This gives you strong eventual consistency: any two nodes that have received the same set of updates will converge to the same state, regardless of the order they applied them.
Riak provides built-in CRDTs.
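One simple CRDT that captures the "deletions as negative adds" idea is a two-phase set (2P-set): adds and removes each accumulate in their own grow-only set, and merging is element-wise union, so replicas converge regardless of update order. This is a minimal sketch of the general technique, not Riak's or Amazon's implementation:

```python
# Hypothetical 2P-set sketch: adds and removes ("negative adds") are
# each grow-only sets; merge is union of both sides, so replicas that
# have seen the same updates converge regardless of order.

class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()   # tombstones: once removed, always removed

    def add(self, x):
        self.added.add(x)

    def remove(self, x):
        self.removed.add(x)

    def value(self):
        return self.added - self.removed

    def merge(self, other):
        m = TwoPhaseSet()
        m.added = self.added | other.added
        m.removed = self.removed | other.removed
        return m

# Two replicas apply the same adds in different orders...
r1, r2 = TwoPhaseSet(), TwoPhaseSet()
r1.add("A"); r1.add("B")
r2.add("B"); r2.add("A")
# ...and merge to the same state either way: strong eventual consistency.
assert r1.merge(r2).value() == r2.merge(r1).value() == {"A", "B"}
```

The tombstone set is what makes removal commute with concurrent adds; the trade-off is that a removed element can never be re-added, which is why production CRDTs (like Riak's OR-sets) use more sophisticated schemes.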
Last-Write-Wins (LWW)
The simplest approach: use wall-clock timestamps and keep whichever version has the later timestamp. This is what Cassandra chose instead of vector clocks. The advantage: no client-side conflict resolution. The disadvantage: concurrent writes silently lose data -- equivalent to flipping a coin on which write to discard.
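The whole strategy fits in one function, which is both its appeal and its danger. A minimal sketch (the timestamps and values are made up for illustration):

```python
# Hypothetical last-write-wins sketch: each write carries a wall-clock
# timestamp; reconciliation keeps whichever version has the later one.

def lww_resolve(a, b):
    """a and b are (timestamp, value) pairs. The later timestamp wins;
    the other write is silently discarded."""
    return a if a[0] >= b[0] else b

v1 = (1700000000.120, "baz")   # written on Node A
v2 = (1700000000.115, "bax")   # written concurrently on Node B

print(lww_resolve(v1, v2))
# "baz" survives; "bax" is dropped -- and with clock skew of a few
# milliseconds, which one survives is effectively a coin flip.
```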
Taken together, these conflict-resolution strategies form a spectrum:
- Vector clocks -- detect conflicts, push resolution to the client (Dynamo)
- CRDTs -- model data so conflicts resolve automatically (Riak)
- Last-write-wins -- simple but lossy (Cassandra)
The right choice depends on whether data loss is acceptable. For a shopping cart, LWW could lose items -- unacceptable. For a timestamp or counter, LWW is fine.