Introduction: System Design Patterns
If Part 1 of this course is about studying specific houses, Part 2 is about understanding the bricks, beams, and blueprints they're all built from.
Every distributed system -- no matter how different it looks on the surface -- is assembled from a surprisingly small set of recurring techniques. Dynamo uses consistent hashing for data distribution. So does Cassandra. Kafka uses write-ahead logs for durability. So does GFS. BigTable uses Bloom filters to avoid unnecessary disk reads. So does Cassandra.
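To give a flavor of how small these building blocks can be, here is a minimal Bloom filter sketch in Python. The class and method names are illustrative, not taken from any of these systems, and real implementations size the bit array and hash count from a target false-positive rate:

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: each item sets k positions in a bit array.
    Lookups may return false positives, but never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as an arbitrarily long bit array

    def _positions(self, item: str):
        # Derive k independent positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True only if every position is set; "maybe present" vs "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

This is how a store like Cassandra or BigTable can skip a disk read entirely: consult the in-memory filter first, and only touch the SSTable when `might_contain` says the key may be there.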
These techniques keep showing up because distributed systems keep facing the same fundamental problems:
- How do you spread data across machines? (Consistent Hashing)
- How do you agree on the state of the world? (Quorum, Leader and Follower)
- How do you survive crashes? (Write-ahead Log, Checksum)
- How do you detect failures? (Heartbeat, Gossip Protocol, Phi Accrual)
- How do you handle conflicts? (Vector Clocks, Read Repair, Merkle Trees)
- How do you reason about trade-offs? (CAP Theorem, PACELC)
We call these System Design Patterns -- reusable solutions to the problems that come up in every distributed system. Knowing them gives you a vocabulary for system design conversations and a toolkit for interviews.
The Patterns
| # | Pattern | One-line summary |
|---|---|---|
| 1 | Bloom Filters | Probabilistically check set membership without reading data |
| 2 | Consistent Hashing | Distribute data so adding/removing nodes moves minimal keys |
| 3 | Quorum | Require a minimum number of nodes to agree on reads/writes |
| 4 | Leader and Follower | Elect one node to coordinate, others replicate |
| 5 | Write-ahead Log | Log every change before applying it, so you can recover |
| 6 | Segmented Log | Split logs into segments to manage size and enable cleanup |
| 7 | High-Water Mark | Track how far replication has progressed |
| 8 | Lease | Grant time-limited ownership to prevent conflicts |
| 9 | Heartbeat | Periodic signals to prove a node is alive |
| 10 | Gossip Protocol | Spread information by telling random neighbors |
| 11 | Phi Accrual Failure Detection | Probabilistic failure detection that adapts to network conditions |
| 12 | Split-brain | When a network partition leaves two nodes each believing they're the leader |
| 13 | Fencing | Prevent stale leaders from corrupting data |
| 14 | Checksum | Detect data corruption with hash verification |
| 15 | Vector Clocks | Track causal ordering across distributed writes |
| 16 | CAP Theorem | Under a network partition, you must choose between consistency and availability |
| 17 | PACELC Theorem | Even without partitions, there's a latency vs. consistency trade-off |
| 18 | Hinted Handoff | Temporarily store writes meant for a downed node |
| 19 | Read Repair | Fix stale replicas when you detect inconsistency during reads |
| 20 | Merkle Trees | Efficiently detect which data differs between replicas |
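To make pattern #2 concrete, here is a minimal consistent-hash ring in Python. This is a sketch under simplifying assumptions -- one point per node and no virtual nodes, which real systems add to balance load -- and every name in it is illustrative:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (md5 here, purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Nodes and keys are hashed onto the same ring; a key belongs to the
    first node clockwise from its hash. Adding or removing a node only
    reassigns the keys in that node's arc -- everything else stays put."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (_hash(node), node))

    def remove_node(self, node: str) -> None:
        self._ring.remove((_hash(node), node))
```

The payoff is the "minimal keys move" property from the table: after `add_node`, the only keys whose owner changes are the ones now landing on the new node, instead of the near-total reshuffle a `hash(key) % num_nodes` scheme would cause.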
How Patterns Connect to Systems
Each pattern page includes Examples showing which real systems use it. Here's a preview of how densely connected these patterns are:
| Pattern | Dynamo | Cassandra | BigTable | Kafka | GFS | HDFS | Chubby |
|---|---|---|---|---|---|---|---|
| Consistent Hashing | ✔ | ✔ | | | | | |
| Quorum | ✔ | ✔ | | | | | |
| Write-ahead Log | | ✔ | ✔ | ✔ | ✔ | | |
| Leader and Follower | | | ✔ | ✔ | ✔ | ✔ | ✔ |
| Gossip Protocol | ✔ | ✔ | | | | | |
| Vector Clocks | ✔ | | | | | | |
| Bloom Filters | | ✔ | ✔ | | | | |
| Heartbeat | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Checksum | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| Lease | | | | | ✔ | ✔ | ✔ |
Notice how the same patterns appear in system after system. That's the whole point -- learn the pattern once, recognize it everywhere.
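Many of these patterns also boil down to remarkably little math once you see them. Quorum, for example, is just an overlap condition: if a write waits for W replicas and a read consults R replicas out of N, then any R + W > N guarantees the two sets intersect. A sketch, using the N=3, R=2, W=2 configuration the Dynamo paper describes as a common choice:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum must intersect every write quorum,
    so a successful read is guaranteed to see the latest acknowledged
    write on at least one of the replicas it contacts."""
    return r + w > n


def writes_conflict_free(n: int, w: int) -> bool:
    """True when two concurrent writes cannot both succeed on disjoint
    replica sets, because each needs a majority."""
    return w > n / 2
```

With N=3, W=2, R=2 both conditions hold; drop to W=1, R=1 and stale reads become possible -- exactly the latency-versus-consistency dial that PACELC describes.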