The Consistency Spectrum: From Eventually Consistent to Linearizable

Consistency is not a binary choice. The seven systems in this course span the entire consistency spectrum, and each chose its position for specific, defensible reasons. Understanding why each system landed where it did -- not just what consistency it offers -- is what separates a surface-level answer from a deep one in an interview.

The spectrum at a glance

Eventual consistency: Dynamo

Position: Furthest toward availability on the spectrum.

The business reason: Dynamo was built for Amazon's shopping cart. The core insight was brutally pragmatic: a customer who sees a slightly stale cart still buys things; a customer who sees an error page leaves. Amazon quantified this -- even brief unavailability correlated with measurable revenue loss.

Think first
Dynamo allows conflicting writes to coexist and pushes conflict resolution to the application. Why is this acceptable for a shopping cart but would be catastrophic for, say, a bank account balance?

How it works: Dynamo uses vector clocks to track causality between writes. When concurrent writes create divergent versions, Dynamo stores all versions and presents them to the application during the next read. The application (not the database) decides how to reconcile. For shopping carts, Amazon uses a "merge all items" strategy -- the union of conflicting carts.
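The causality check and the cart merge described above can be sketched in a few lines. This is an illustrative model, not Dynamo's actual API; the clock representation and node names are assumptions.

```python
# Sketch of Dynamo-style vector-clock causality detection and
# application-level cart merging. Names and structures are illustrative.

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is causally equal to or after clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock descends from the other: the writes conflict."""
    return not descends(a, b) and not descends(b, a)

# Two replicas accepted writes independently (e.g. during a partition).
v1 = {"node_a": 2, "node_b": 1}   # cart version written via node A
v2 = {"node_a": 1, "node_b": 2}   # cart version written via node B

merged = None
if concurrent(v1, v2):
    # Application-level resolution: union of the conflicting carts.
    cart1 = {"book", "lamp"}
    cart2 = {"book", "mug"}
    merged = cart1 | cart2
```

For a bank balance, no such automatic union exists, which is why the same strategy would be catastrophic there.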

The cost: Application-level conflict resolution is complex and error-prone. Every application team must understand distributed systems well enough to write correct merge logic. This is a significant operational burden that Cassandra later tried to reduce.

See: Dynamo's vector clocks and conflicting data

Tunable consistency: Cassandra

Position: Spans a range -- from eventual to strong consistency on a per-query basis -- depending on the R and W parameters chosen by the client.

The business reason: Cassandra was built at Facebook for inbox search, then evolved into a general-purpose database. Different use cases within the same cluster need different consistency levels. A notification counter can tolerate staleness; a financial ledger cannot.

How it works: Cassandra uses quorum reads and writes with configurable consistency levels. The key formula:

  • W + R > N guarantees strong consistency (any read quorum overlaps with any write quorum)
  • W + R ≤ N allows eventual consistency for lower latency

Where N is the replication factor, W is the number of write acknowledgments required, and R is the number of replicas read.
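The overlap argument behind the formula can be verified by brute force for small clusters. This sketch (function names are my own) checks that every possible write quorum intersects every possible read quorum exactly when W + R > N.

```python
# Brute-force verification of the quorum-overlap rule W + R > N
# for small clusters. Function names are illustrative.
from itertools import combinations

def strongly_consistent(n: int, w: int, r: int) -> bool:
    """The formula: any read quorum overlaps any write quorum iff W + R > N."""
    return w + r > n

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Check overlap directly: enumerate all W-subsets and R-subsets of N nodes."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

# With N=3: QUORUM/QUORUM (2,2) overlaps, ONE/ONE (1,1) does not.
for w, r in [(2, 2), (1, 1), (3, 1), (1, 3)]:
    assert quorums_always_overlap(3, w, r) == strongly_consistent(3, w, r)
```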

Cassandra Setting | W                    | R                    | Consistency      | Latency       | Use Case
ONE/ONE           | 1                    | 1                    | Eventual         | Lowest        | Analytics, logging
QUORUM/QUORUM     | N/2+1                | N/2+1                | Strong           | Medium        | General purpose
ALL/ONE           | N                    | 1                    | Strong           | Highest write | Rare writes, frequent reads
LOCAL_QUORUM      | Majority in local DC | Majority in local DC | Strong within DC | Low within DC | Multi-datacenter apps

The advancement over Dynamo: Cassandra uses last-write-wins (LWW) with timestamps as the default conflict resolution, eliminating the need for application-level merge logic. This is simpler but introduces a different problem: clock skew between nodes can cause "later" writes to overwrite "earlier" ones incorrectly.
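The clock-skew failure mode is easy to demonstrate. In this toy model (not Cassandra code), a write that logically happened first carries a higher timestamp because its node's clock runs fast, so LWW keeps the stale value.

```python
# Toy model of last-write-wins (LWW) with wall-clock timestamps,
# showing how clock skew lets a stale write win. Illustrative only.

def lww_resolve(versions):
    """Keep the version with the highest timestamp, as LWW does."""
    return max(versions, key=lambda v: v["ts"])

# Real order: write A happens first, then write B overwrites it.
# But node A's clock runs a few seconds fast.
write_a = {"value": "old", "ts": 1_000_005}  # skewed clock: stamped "later"
write_b = {"value": "new", "ts": 1_000_002}  # correct clock, logically later

winner = lww_resolve([write_a, write_b])
# The stale value survives: LWW trusts timestamps, not real causality.
```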

See: Cassandra consistency levels

Relaxed consistency: GFS

Position: Stronger than eventual, but weaker than traditional strong consistency. GFS defines a nuanced model with three region states: defined, consistent but undefined, and inconsistent.

Think first
GFS allows concurrent writes to the same region of a file, which can result in 'consistent but undefined' data -- all replicas have the same content, but that content is an interleaving of multiple writes. Why would Google accept this? Think about GFS's primary workload.

The business reason: GFS serves Google's batch processing infrastructure. Throughput on multi-gigabyte files matters far more than read-your-writes semantics. The applications (MapReduce, web crawlers) are designed to tolerate duplicate and out-of-order records.

How it works: GFS distinguishes between:

  • Record append (atomic, at-least-once): the primary assigns an offset and all replicas write there. If a replica fails mid-write, the append is retried, potentially creating duplicates -- but each successful append is defined (all replicas agree on its content at that offset).
  • Random write (concurrent): multiple clients writing to the same region produces consistent-but-undefined data. Replicas agree on the bytes, but the content is an unpredictable interleaving.
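The at-least-once behavior of record append can be modeled in a few lines. This is a toy simulation under assumed semantics, not GFS code: a failed attempt leaves the record on some replicas, the client retries, and readers deduplicate by record ID.

```python
# Toy model of at-least-once record append: a partial failure plus a
# retry leaves duplicate records, which readers dedupe by ID.

class ChunkReplica:
    def __init__(self):
        self.log = []  # appended records on this replica

class AppendClient:
    def __init__(self, replicas):
        self.replicas = replicas

    def record_append(self, record, flaky_replica=None):
        """Retry until the record lands on ALL replicas. Replicas that
        succeeded before a failure keep their copy, so the retry
        re-appends the record there (a duplicate)."""
        while True:
            ok = True
            for i, replica in enumerate(self.replicas):
                if i == flaky_replica:
                    flaky_replica = None  # fail this attempt once, then recover
                    ok = False
                    break
                replica.log.append(record)
            if ok:
                return

replicas = [ChunkReplica() for _ in range(3)]
client = AppendClient(replicas)
client.record_append(("id1", "payload"), flaky_replica=1)
# Replica 0 now holds the record twice (first attempt + retry);
# a reader deduplicates: {rec_id for rec_id, _ in replicas[0].log}
```

This mirrors why MapReduce-style consumers are written to tolerate duplicate records.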

See: GFS consistency model

Strong consistency: BigTable, HDFS, Kafka

These three systems provide strong consistency but scoped to different units -- a tablet, a file, or a partition.

BigTable: strong per-tablet

The business reason: BigTable serves Google's latency-sensitive applications (web search indexing, Google Earth, Google Finance). Users querying a table must see the latest data. Stale search results are acceptable for web search (seconds-old index is fine), but the index build itself must be strongly consistent to avoid corruption.

How it works: Each tablet (a range of rows) is served by exactly one tablet server at a time. All reads and writes for a tablet go through that single server, which provides strong consistency trivially -- there is only one writer. Chubby ensures that the assignment of tablets to servers is itself consistent (no two servers believe they own the same tablet).

The single-server-per-tablet model means BigTable's consistency is a consequence of its architecture rather than a distributed consensus protocol running per operation.

See: BigTable components, Working with tablets

HDFS: strong per-file (single-writer)

The business reason: HDFS stores input and output for MapReduce and other batch processing frameworks. Each file is written once by a single writer and read many times. The single-writer model eliminates write conflicts entirely.

How it works: The NameNode controls all metadata and block placement. A file being written has a single client holding a lease. Writes are pipelined to DataNodes in a chain, and a write is acknowledged only after all replicas confirm. Once a file is closed, it is immutable. This write-once-read-many model provides strong consistency without any consensus protocol during normal writes.
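The pipeline's acknowledgment rule is the crux: the client is acked only after every DataNode in the chain holds the block. A minimal sketch, with an assumed chain representation rather than HDFS's actual API:

```python
# Minimal sketch (not HDFS's real API) of a pipelined chain write:
# an ack means every DataNode in the chain stored the block.

def pipeline_write(block, datanodes):
    """Forward the block down the chain; return True only if all succeed."""
    for node in datanodes:
        if not node["healthy"]:
            return False            # no ack: the client must rebuild the pipeline
        node["blocks"].append(block)
    return True

chain = [{"healthy": True, "blocks": []} for _ in range(3)]
acked = pipeline_write("blk_0001", chain)
# acked is True, and every node in `chain` now stores "blk_0001";
# a reader contacting any replica sees the same block.
```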

Kafka: strong per-partition

The business reason: Kafka was built for LinkedIn's activity stream and operational metrics. Within a single partition, messages must be strictly ordered -- if a user likes a post and then unlikes it, consumers must see those events in order or the derived state will be wrong.

How it works: Each partition has a single leader broker that handles all reads and writes. The leader maintains a high-water mark -- the last offset that has been replicated to all in-sync replicas (ISR). Consumers only see messages up to the high-water mark, ensuring they never read data that could be lost if the leader fails.
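The high-water mark is simply the minimum log-end offset across the ISR. A one-function sketch (the offset values are illustrative):

```python
# Sketch of the high-water-mark rule: consumers read only up to the
# slowest in-sync replica's log-end offset. Values are illustrative.

def high_water_mark(isr_log_end_offsets):
    """Last offset replicated to every in-sync replica."""
    return min(isr_log_end_offsets)

# Leader has written through offset 10, but the slowest ISR member
# has only replicated through offset 7.
leo = {"leader": 10, "follower1": 9, "follower2": 7}
hwm = high_water_mark(leo.values())
# Offsets above hwm stay invisible to consumers until fully replicated,
# so a leader failover cannot expose messages that might be lost.
```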

Kafka Setting                   | Guarantee           | Trade-off
acks=0                          | No durability       | Fastest
acks=1                          | Leader has written  | Leader failure can lose data
acks=all, min.insync.replicas=2 | Majority-replicated | Strongest, higher latency

See: Kafka delivery semantics

Linearizable consistency: Chubby

Position: Furthest toward consistency on the spectrum.

The business reason: Chubby is a distributed lock service. If two processes both believe they hold the same lock, the invariant that the lock is supposed to protect is violated. Stale reads are not just inconvenient -- they are correctness failures. A DNS entry served from a stale cache might direct traffic to a decommissioned server. A master election with inconsistent lock state might produce two masters.

Think first
Chubby could have been designed as a highly available, eventually consistent system -- after all, availability seems important for a lock service. Why did Google explicitly choose linearizability over availability?

How it works: Chubby uses Paxos consensus among typically five replicas. All reads and writes go through the elected master, which is the only replica that can serve requests. The Paxos protocol ensures that all replicas agree on the sequence of operations, even in the presence of failures. Client sessions include a grace period to handle master failover without immediately invalidating all locks.
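One way Chubby guards against a process that wrongly believes it still holds a lock is the sequencer: an opaque token carrying the lock's generation, which downstream servers validate. The sketch below is a simplified illustration of the idea, not Chubby's interface.

```python
# Illustrative sketch of lock sequencers: a stale holder's token is
# rejected because its generation number is out of date.

class LockService:
    def __init__(self):
        self.generation = 0

    def acquire(self):
        """Each successful acquisition bumps the generation and
        returns a sequencer (lock name, generation)."""
        self.generation += 1
        return ("my-lock", self.generation)

    def is_current(self, sequencer):
        """A resource server asks: does this sequencer match the
        current lock generation?"""
        _name, gen = sequencer
        return gen == self.generation

svc = LockService()
old = svc.acquire()   # holder A acquires, then stalls (GC pause, partition)
new = svc.acquire()   # A's lease expires; holder B acquires

# B's requests are accepted; A's stale sequencer is rejected,
# preserving the mutual-exclusion invariant.
```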

See: Chubby sessions and events, Locks and sequencers

The PACELC extension

The CAP theorem tells us what happens during a network partition: choose between consistency and availability. But partitions are rare. What about normal operation?

The PACELC theorem extends CAP: if there is a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency.

System    | During Partition (PAC)  | Normal Operation (ELC) | PACELC Classification
Dynamo    | Availability            | Latency                | PA/EL
Cassandra | Availability (default)  | Latency (tunable)      | PA/EL (default)
GFS       | Consistency             | Consistency            | PC/EC
HDFS      | Consistency             | Consistency            | PC/EC
BigTable  | Consistency             | Consistency            | PC/EC
Kafka     | Consistency (acks=all)  | Consistency            | PC/EC
Chubby    | Consistency             | Consistency            | PC/EC

Notice the pattern: all leader-based systems in this course choose consistency, while the peer-to-peer systems choose availability. This is not a coincidence -- it is a direct consequence of the architecture. A system with a single leader naturally provides consistency (one authoritative copy) and naturally sacrifices availability when the leader is unreachable. A peer-to-peer system naturally provides availability (any node can serve requests) and naturally sacrifices consistency (multiple copies may diverge).

Practical guidance: choosing a consistency level

Interview framework

When an interviewer asks you to design a system, the consistency choice should flow from three questions:

  1. What is the cost of a stale read? (Inconvenience? Data corruption? Financial loss?)
  2. What is the cost of unavailability? (User retry? Lost revenue? Safety hazard?)
  3. What is the read/write ratio and latency budget?

Map the answers to the spectrum:

  • Stale reads are harmless, unavailability is costly: eventual consistency (Dynamo model)
  • Depends on the operation: tunable consistency (Cassandra model)
  • Stale reads are costly, unavailability is tolerable: strong consistency (BigTable/Kafka model)
  • Stale reads are correctness violations: linearizable (Chubby model)
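The mapping above can be written down as a small helper. This is a hypothetical function for interview practice, not an established rule engine; the cost labels are assumptions.

```python
# Hypothetical helper encoding the mapping above: answers to the
# framework questions in, a suggested consistency model out.

def suggest_consistency(stale_read_cost: str, unavailability_cost: str) -> str:
    """Costs are 'low' or 'high'; 'violation' marks reads where
    staleness breaks correctness (locks, leader election)."""
    if stale_read_cost == "violation":
        return "linearizable (Chubby model)"
    if stale_read_cost == "low" and unavailability_cost == "high":
        return "eventual (Dynamo model)"
    if stale_read_cost == "high":
        return "strong (BigTable/Kafka model)"
    return "tunable (Cassandra model)"  # mixed or per-operation needs
```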

Decision table

If your application...                                     | Choose                         | Example system model
Can merge conflicting writes                               | Eventual                       | Shopping cart (Dynamo)
Has mixed read/write patterns with varying freshness needs | Tunable                        | Social media with analytics + messaging (Cassandra)
Does large sequential writes, single writer                | Relaxed / strong single-writer | Log processing (GFS, HDFS)
Needs strict ordering within a stream                      | Strong per-partition           | Event streaming, change data capture (Kafka)
Needs mutual exclusion or coordination                     | Linearizable                   | Leader election, distributed locking (Chubby)

The hidden cost of strong consistency

Strong consistency is not free. Every system that chooses it pays a price:

  1. Latency: Every write must be acknowledged by multiple replicas (or the single authoritative server) before returning. In GFS, the pipeline write to three chunkservers adds latency. In Chubby, Paxos requires a majority round-trip.

  2. Availability: If the leader or a quorum of replicas is unreachable, the system cannot serve writes (and often cannot serve reads either). GFS's single master is a well-known availability bottleneck.

  3. Throughput: Serializing all operations through a single leader limits write throughput to what one node can handle. BigTable mitigates this by partitioning into tablets (one leader per tablet), and Kafka mitigates it by partitioning into partitions (one leader per partition). But within a single tablet or partition, throughput is bounded by one machine.

The peer-to-peer systems avoid all three costs -- any node can handle any request, there is no quorum delay, and throughput scales with node count. The price they pay is the complexity of conflict resolution.

Quiz

  1. Cassandra offers tunable consistency, allowing clients to choose between eventual and strong consistency per query. Under what condition does setting R=1, W=1 (with N=3) risk returning data that was never successfully written to any other replica?
  2. GFS distinguishes between 'consistent' and 'defined' states for file regions. A record append that fails on one replica but succeeds on others results in the failed replica having padding at that offset. What state is this region in?