16 CAP Theorem
The CAP theorem is probably the most cited -- and most misunderstood -- concept in distributed systems. It sounds simple, but the implications are subtle and shape every major design decision in the systems we study in this course.
Background
In a distributed system, your data lives on multiple machines connected by a network. Networks are unreliable -- cables get cut, switches fail, data centers lose connectivity. When this happens, parts of your system can't talk to each other. This is called a network partition, and it's not a theoretical concern -- it will happen in any system that runs long enough at scale.
The question becomes: when your system is partitioned, what do you sacrifice?
Definition
The CAP theorem (proposed by Eric Brewer in 2000, proven by Gilbert and Lynch in 2002) states that a distributed system can provide at most two of these three guarantees simultaneously:
Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. It's as if you had a single copy of the data.
Availability (A): Every request to a non-failing node receives a response -- no errors, no timeouts. The system is always usable.
Partition tolerance (P): The system continues to operate despite network partitions -- messages being lost or delayed between nodes.
Why it's really a two-choice problem
Here's what most explanations get wrong: this isn't "pick any two from a menu." In any real distributed system, partitions will happen. You don't get to choose "no partitions" -- the network will eventually fail. So partition tolerance isn't optional. It's a given.
That means the real choice is:
When a network partition occurs, do you sacrifice consistency or availability?
| Choice | What happens during a partition | Systems that make this choice |
|---|---|---|
| CP (Consistency + Partition tolerance) | The system refuses to respond to requests it can't guarantee are consistent. Some requests get errors or timeouts. | BigTable, HBase, MongoDB, Chubby/ZooKeeper |
| AP (Availability + Partition tolerance) | The system responds to every request, but some responses may return stale data. Different nodes may return different values. | Dynamo, Cassandra, CouchDB, Riak |
You'll sometimes see "CA systems" mentioned -- systems that have consistency and availability but not partition tolerance. In a truly distributed system, this doesn't exist. A "CA" system is just a single-node database (like a traditional RDBMS that doesn't replicate). The moment you distribute data across a network, partitions become possible, and you must choose.
What CAP actually means in practice
CAP is about behavior during a partition, not during normal operation. When everything is healthy, you can absolutely have both consistency and availability. The theorem only constrains you during failures.
This is a crucial distinction. Most of the time, your system is running fine. CAP tells you what your system does in the worst case -- and that's exactly what interviewers want to hear you reason about.
CAP doesn't mean you get zero consistency in an AP system or zero availability in a CP system. It's a spectrum. Dynamo (AP) still resolves most inconsistencies quickly. BigTable (CP) is still highly available in practice -- it just might reject requests during partitions. The question is: what's the guarantee under the worst conditions?
Examples from this course
Dynamo (AP): Amazon's philosophy is "the shopping cart must always be writable." During a partition, Dynamo keeps accepting writes on both sides. This creates conflicting versions, which are reconciled later using vector clocks. The customer's experience is uninterrupted, but they might briefly see stale data.
BigTable (CP): Google chose strong consistency for BigTable. Every read returns the latest write, guaranteed. During a partition, some tablets may become unavailable until the partition heals. Google's use case (indexing, analytics) values correctness over constant availability.
Cassandra (tunable): Cassandra is interesting because it lets you choose per-query. With a quorum read (R + W > N), you get consistency. With lower consistency levels, you get higher availability. This makes Cassandra a powerful case study for understanding CAP as a spectrum, not a binary choice.
When asked about CAP in an interview, don't just recite the definition. Show that you understand:
- Partitions are inevitable, so it's really "C or A during partitions"
- It's about guarantees during failure, not normal operation
- Real systems treat it as a spectrum, not a binary switch (see PACELC for the full picture)
- The right choice depends on the business requirement -- there's no universally "correct" answer
CAP doesn't tell the whole story
CAP is a useful mental model, but it's incomplete. It says nothing about what happens when there is no partition -- yet real systems still have to choose between latency and consistency during normal operation. That's where the PACELC theorem comes in, extending CAP to cover the full picture.