# Systems at a Glance
This page is your cheat sheet. Use it before an interview to refresh your memory, or as a map while reading the course to see how everything connects.
## All seven systems compared
| | Dynamo | Cassandra | BigTable | Kafka | GFS | HDFS | Chubby |
|---|---|---|---|---|---|---|---|
| Category | Key-value store | Wide-column NoSQL | Wide-column NoSQL | Message broker / streaming | Distributed file system | Distributed file system | Coordination / locking service |
| Built by | Amazon (2007) | Facebook/Apache (2008) | Google (2006) | LinkedIn/Apache (2011) | Google (2003) | Apache/Yahoo (2006) | Google (2006) |
| CAP choice | AP | AP (tunable) | CP | CP (within partitions) | CP | CP | CP |
| Architecture | Peer-to-peer (decentralized) | Peer-to-peer (decentralized) | Leader-based (master + tablet servers) | Leader-based (controller broker + brokers) | Leader-based (single master + chunkservers) | Leader-based (NameNode + DataNodes) | Leader-based (master + replicas) |
| Data model | Key → opaque blob | Row key → columns (column families) | Row key → columns (column families) | Topic → ordered log of messages | Hierarchical file paths → chunks | Hierarchical file paths → blocks | Hierarchical file paths → small objects |
| Consistency | Eventually consistent | Tunable (per-query) | Strongly consistent | Strongly consistent (per-partition) | Relaxed (defined per operation) | Strongly consistent | Strongly consistent |
| Replication | Sloppy quorum (N, R, W configurable) | Quorum-based (replication factor configurable) | GFS-level replication (3 replicas per chunk) | In-sync replicas (ISR) | 3 replicas per chunk | 3 replicas per block (configurable) | Paxos consensus (5 replicas typical) |
| Partitioning | Consistent hashing + vnodes | Consistent hashing + vnodes | Range-based (tablet splitting) | Topic partitions | Fixed-size chunks (64MB) | Fixed-size blocks (128MB) | Not partitioned (small data) |
| Failure handling | Hinted handoff, Merkle trees, read repair | Hinted handoff, Merkle trees, read repair | Master re-assigns tablets, GFS handles chunk recovery | ISR tracks healthy replicas, controller re-assigns leaders | Master re-replicates under-replicated chunks | NameNode re-replicates under-replicated blocks | Paxos elects new master |
| Optimized for | High availability, write-heavy workloads | High write throughput, flexible consistency | Fast reads, batch analytics, sparse data | High-throughput message streaming | Large sequential reads/writes | Large batch processing (MapReduce) | Small metadata, coordination, locking |
| Open source? | No (internal Amazon) | Yes (Apache) | No (internal Google) | Yes (Apache) | No (internal Google) | Yes (Apache) | No (internal Google; ZooKeeper is the open-source equivalent) |
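The "consistent hashing + vnodes" partitioning scheme that Dynamo and Cassandra share is worth being able to sketch from scratch. Below is a minimal, hypothetical illustration of the idea: each physical node owns many points (virtual nodes) on a hash ring, and a key belongs to the first point clockwise from its hash. Real systems layer replication, token metadata, and gossip on top; the class and node names here are invented for the example.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a point on the ring using MD5 (stable across runs)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node contributes `vnodes` points on the ring,
        # which smooths out load when nodes join or leave.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """The first vnode clockwise from the key's hash owns the key."""
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic for a given ring
```

The payoff is visible when membership changes: removing a node only remaps the keys that node owned, while every other key keeps its owner, which is exactly why Dynamo-style systems can rebalance incrementally.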
## Architecture styles
The systems in this course fall into two fundamentally different architectural camps:
### Decentralized (peer-to-peer)
Dynamo, Cassandra -- Every node is equal, so there is no single point of failure. Nodes discover each other and share state through a gossip protocol. Trade-off: strong consistency is harder to maintain.
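The gossip idea can be captured in a few lines. The sketch below is a simplified push-style round, assuming heartbeat-counter membership views like Cassandra's; the node names, fanout value, and `gossip_round` helper are all invented for illustration.

```python
import random

def gossip_round(views, fanout=2):
    """One push round: every node sends its membership view to `fanout`
    random peers; receivers keep the highest heartbeat seen per member."""
    updates = []
    for node, view in views.items():
        peers = random.sample([n for n in views if n != node],
                              k=min(fanout, len(views) - 1))
        for peer in peers:
            updates.append((peer, dict(view)))  # snapshot of sender's view
    for peer, view in updates:
        for member, heartbeat in view.items():
            views[peer][member] = max(views[peer].get(member, 0), heartbeat)

# Three nodes; "a" has bumped its own heartbeat, but "b" and "c" have not heard yet.
views = {
    "a": {"a": 5, "b": 1, "c": 1},
    "b": {"a": 1, "b": 1, "c": 1},
    "c": {"a": 1, "b": 1, "c": 1},
}
for _ in range(5):
    gossip_round(views)
# After a few rounds every node converges on a's heartbeat of 5.
```

Information spreads epidemically: each round roughly multiplies the number of informed nodes, so convergence takes O(log N) rounds with no coordinator involved.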
### Centralized (leader-based)
BigTable, Kafka, GFS, HDFS, Chubby -- A designated leader coordinates operations. Simpler consistency model, but the leader is a potential bottleneck and single point of failure (mitigated by leader election and failover).
When designing a system in an interview, one of the first decisions to articulate is: "Should this be leader-based or peer-to-peer?" Leader-based designs are simpler and make strong consistency straightforward, but the leader can become a bottleneck and needs failover machinery. Peer-to-peer designs scale more smoothly but make strong consistency much harder.
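How do peer-to-peer systems claw consistency back? Through quorums: with N replicas, requiring W acknowledgments per write and R per read such that R + W > N guarantees every read set overlaps every write set, so some replica in the read always holds the latest write. A brute-force check of that pigeonhole argument (the function name and parameter values are hypothetical):

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every R-replica read set shares at least one replica
    with every W-replica write set, i.e. reads always see the latest write."""
    replicas = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(replicas, r)
               for writes in combinations(replicas, w))

# Classic Dynamo-style setting: N=3, W=2, R=2 satisfies R + W > N.
print(quorums_overlap(3, 2, 2))  # True
# R + W <= N leaves room for a stale read.
print(quorums_overlap(3, 1, 2))  # False
```

This is exactly the knob Cassandra exposes per query: crank W up for durable writes, crank R up for fresh reads, or keep both low and accept eventual consistency.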
## Which patterns appear where?
| Pattern | Dynamo | Cassandra | BigTable | Kafka | GFS | HDFS | Chubby |
|---|---|---|---|---|---|---|---|
| Consistent Hashing | ✔ | ✔ | | | | | |
| Quorum | ✔ | ✔ | | | | | |
| Leader and Follower | | | ✔ | ✔ | ✔ | ✔ | ✔ |
| Write-ahead Log | | ✔ | ✔ | ✔ | ✔ | ✔ | |
| Segmented Log | | | | ✔ | | | |
| High-Water Mark | | | | ✔ | | | |
| Lease | | | | | ✔ | ✔ | ✔ |
| Heartbeat | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Gossip Protocol | ✔ | ✔ | | | | | |
| Split-brain | | | | ✔ | | ✔ | |
| Fencing | | | | | | ✔ | |
| Checksum | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| Vector Clocks | ✔ | | | | | | |
| Hinted Handoff | ✔ | ✔ | | | | | |
| Read Repair | ✔ | ✔ | | | | | |
| Merkle Trees | ✔ | ✔ | | | | | |
| Bloom Filters | | ✔ | ✔ | | | | |
Notice the clusters: Dynamo and Cassandra share nearly identical pattern sets (because Cassandra was built on Dynamo's ideas). GFS and HDFS overlap heavily (because HDFS is an open-source reimplementation of GFS). BigTable, GFS, and Chubby form a stack (BigTable runs on top of both).
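Merkle trees are the pattern that lets Dynamo and Cassandra compare replicas without shipping whole datasets: replicas exchange a tree of hashes over their key ranges, and if the roots match they are in sync after comparing a single hash. A minimal sketch of that descent, assuming a power-of-two number of leaf ranges (the helper names and range contents are invented):

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Hash one or more byte strings together."""
    return hashlib.sha256(b"".join(parts)).digest()

def build_levels(leaf_hashes):
    """All levels of a Merkle tree: leaves first, root last.
    Assumes len(leaf_hashes) is a power of two."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diverging_leaves(a_levels, b_levels):
    """Walk down from the roots, expanding only subtrees whose hashes differ;
    returns indices of the leaf ranges that need repair."""
    if a_levels[-1] == b_levels[-1]:
        return []  # equal roots: in sync after one comparison
    suspects = [0]
    for depth in range(len(a_levels) - 2, -1, -1):
        suspects = [c for i in suspects for c in (2 * i, 2 * i + 1)
                    if a_levels[depth][c] != b_levels[depth][c]]
    return suspects

# Two replicas that disagree on one of four key ranges.
replica_a = build_levels([h(x) for x in [b"r0", b"r1", b"r2", b"r3"]])
replica_b = build_levels([h(x) for x in [b"r0", b"r1", b"XX", b"r3"]])
print(diverging_leaves(replica_a, replica_b))  # [2]
```

The anti-entropy process then streams only range 2 between the replicas, which is why Merkle-tree repair costs scale with the amount of divergence rather than the size of the dataset.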
## Decision framework: which system for which problem?
| If you need... | Consider... | Because... |
|---|---|---|
| Simple key-value access, always-available writes | Dynamo pattern (or Cassandra/Riak) | AP design, no coordination overhead |
| Flexible queries with tunable consistency | Cassandra | Column families + configurable R/W levels |
| Strong consistency for structured data | BigTable pattern (or HBase) | CP design, single-master simplicity |
| High-throughput event streaming | Kafka | Append-only log, consumer groups, ISR |
| Storing very large files (GBs+) | GFS/HDFS | Chunk/block architecture, optimized for sequential I/O |
| Distributed locking or leader election | Chubby pattern (or ZooKeeper) | Paxos-based consensus, strong consistency for metadata |