Introduction: System Design Patterns

If Part 1 of this course is about studying specific houses, Part 2 is about understanding the bricks, beams, and blueprints they're all built from.

Every distributed system -- no matter how different it looks on the surface -- is assembled from a surprisingly small set of recurring techniques. Dynamo uses consistent hashing for data distribution. So does Cassandra. Kafka uses write-ahead logs for durability. So does GFS. BigTable uses Bloom filters to avoid unnecessary disk reads. So does Cassandra.

These techniques keep showing up because distributed systems keep facing the same fundamental problems:

  • How do you spread data across machines? (Consistent Hashing)
  • How do you agree on the state of the world? (Quorum, Leader and Follower)
  • How do you survive crashes? (Write-ahead Log, Checksum)
  • How do you detect failures? (Heartbeat, Gossip Protocol, Phi Accrual)
  • How do you handle conflicts? (Vector Clocks, Read Repair, Merkle Trees)
  • How do you reason about trade-offs? (CAP Theorem, PACELC)

We call these System Design Patterns -- reusable solutions to the problems that come up in every distributed system. Knowing them gives you a vocabulary for system design conversations and a toolkit for interviews.
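To make one of these concrete before diving in: the quorum answer to "how do you agree?" reduces to simple arithmetic. Here's a minimal sketch in Python (the function name is ours, purely for illustration):

```python
# Quorum sketch: with N replicas, a write quorum W and read quorum R
# that satisfy W + R > N force every read set to overlap every write set
# in at least one replica -- so a read can always find the latest write.

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if any R-node read set must intersect any W-node write set."""
    return w + r > n

# Dynamo-style defaults: N=3 replicas, write to 2, read from 2.
assert quorums_overlap(3, 2, 2)      # 2 + 2 > 3: overlap guaranteed
assert not quorums_overlap(3, 1, 1)  # 1 + 1 <= 3: a read can miss a write
```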

The patterns

| # | Pattern | One-line summary |
|---|---------|------------------|
| 1 | Bloom Filters | Probabilistically check set membership without reading data |
| 2 | Consistent Hashing | Distribute data so adding/removing nodes moves minimal keys |
| 3 | Quorum | Require a minimum number of nodes to agree on reads/writes |
| 4 | Leader and Follower | Elect one node to coordinate, others replicate |
| 5 | Write-ahead Log | Log every change before applying it, so you can recover |
| 6 | Segmented Log | Split logs into segments to manage size and enable cleanup |
| 7 | High-Water Mark | Track how far replication has progressed |
| 8 | Lease | Grant time-limited ownership to prevent conflicts |
| 9 | Heartbeat | Periodic signals to prove a node is alive |
| 10 | Gossip Protocol | Spread information by telling random neighbors |
| 11 | Phi Accrual Failure Detection | Probabilistic failure detection that adapts to network conditions |
| 12 | Split-brain | When a network partition makes two nodes think they're both leader |
| 13 | Fencing | Prevent stale leaders from corrupting data |
| 14 | Checksum | Detect data corruption with hash verification |
| 15 | Vector Clocks | Track causal ordering across distributed writes |
| 16 | CAP Theorem | During a network partition, you must sacrifice either consistency or availability |
| 17 | PACELC Theorem | Even without partitions, there's a latency vs. consistency trade-off |
| 18 | Hinted Handoff | Temporarily store writes meant for a downed node |
| 19 | Read Repair | Fix stale replicas when you detect inconsistency during reads |
| 20 | Merkle Trees | Efficiently detect which data differs between replicas |
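As a taste of pattern 2, here's a minimal consistent-hash ring with virtual nodes. This is a sketch, not Cassandra's or Dynamo's actual implementation; the class and node names are illustrative:

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes (illustrative sketch;
# real systems add replication, token management, and rebalancing).

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str, vnodes: int = 100) -> None:
        # Each node owns many small arcs of the ring, smoothing the load.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        # Walk clockwise: first ring point at or after the key's hash.
        i = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")          # some node owns this key
ring.remove(owner)                      # take that node away
assert ring.lookup("user:42") != owner  # only the removed node's keys move
```

Removing a node only reassigns the keys it owned; everything else stays put, which is exactly the "minimal keys move" property in the table above.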

How patterns connect to systems

Each pattern page includes Examples showing which real systems use it. Here's a representative (not exhaustive) preview of how densely connected these patterns are:

| Pattern | Dynamo | Cassandra | BigTable | Kafka | GFS | HDFS | Chubby |
|---------|--------|-----------|----------|-------|-----|------|--------|
| Consistent Hashing | ✓ | ✓ | | | | | |
| Quorum | ✓ | ✓ | | | | | ✓ |
| Write-ahead Log | | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Leader and Follower | | | ✓ | ✓ | ✓ | ✓ | ✓ |
| Gossip Protocol | ✓ | ✓ | | | | | |
| Vector Clocks | ✓ | | | | | | |
| Bloom Filters | | ✓ | ✓ | | | | |
| Heartbeat | | ✓ | | ✓ | ✓ | ✓ | ✓ |
| Checksum | | | | ✓ | ✓ | ✓ | |
| Lease | | | ✓ | | ✓ | ✓ | ✓ |

Notice how the same patterns appear in system after system. That's the whole point -- learn the pattern once, recognize it everywhere.
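For instance, the Bloom filter that lets both BigTable and Cassandra skip unnecessary disk reads fits in a few lines. A toy sketch (real implementations derive the bit-array size and hash count from a target false-positive rate; the class name is ours):

```python
import hashlib

# Toy Bloom filter: a bit array plus k hash functions. It can say
# "definitely absent" or "possibly present" -- never a false negative.

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item: str):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means "possibly present".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("row:alice")
assert bf.might_contain("row:alice")  # no false negatives, ever
# An unseen key is almost always reported absent, so the disk read is skipped.
```

A storage engine consults the filter before touching disk: a False answer guarantees the key isn't in that SSTable, so the read can be skipped entirely.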

Quick review: do you know these patterns?

Flashcards
How do you distribute data across nodes so that adding/removing a node moves minimal keys?