Centralized vs. Decentralized: The Architecture Decision
The most fundamental fork in distributed system design is whether to have a distinguished leader or treat all nodes as equal peers. Five of the seven systems in this course chose leader-based architectures; two chose peer-to-peer. Neither choice is inherently superior -- each is a rational response to a different set of constraints. This page dissects the decision, system by system.
The two camps
Leader-based systems: why they chose centralization
GFS: the single master
What the master does: Holds the entire file system namespace and the mapping of chunks to chunkservers. All metadata operations (file creation, deletion, chunk allocation) go through it. Data operations flow directly between clients and chunkservers.
Why Google chose this: Simplicity. A single master eliminates the need for distributed consensus on metadata operations. File creation, chunk allocation, and garbage collection are trivially correct when one node makes all decisions. Google's engineering culture prizes simplicity over theoretical scalability -- they would rather build a system that is easy to reason about and straightforward to repair when it fails.
What they gain: Simple, correct metadata management. Global knowledge enables intelligent chunk placement (rack-aware replication). Straightforward garbage collection and re-replication.
What they risk: The master is a single point of failure and a throughput bottleneck. Google mitigated the bottleneck by keeping metadata small (in-memory), using large chunk sizes (64MB) to reduce metadata volume, and ensuring data never flows through the master. They accepted the availability risk of single-master failure by relying on fast manual recovery.
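The arithmetic behind "keep metadata small" is worth seeing once. A back-of-the-envelope sketch, assuming roughly 64 bytes of master metadata per chunk (the figure the GFS paper cites; the exact constant is illustrative):

```python
# Why 64 MB chunks keep GFS master metadata in one node's RAM.
# Assumes ~64 bytes of in-memory metadata per chunk (illustrative constant).
CHUNK_SIZE = 64 * 2**20          # 64 MB chunks
BYTES_PER_CHUNK_META = 64        # approximate per-chunk metadata footprint

def master_metadata_bytes(total_data_bytes: int) -> int:
    chunks = total_data_bytes // CHUNK_SIZE
    return chunks * BYTES_PER_CHUNK_META

petabyte = 2**50
print(master_metadata_bytes(petabyte) / 2**30)  # → 1.0 (GiB of metadata per PB of data)
```

One petabyte of file data costs the master about a gibibyte of memory -- which is why shrinking the chunk size (say, to 4 MB) would multiply the metadata footprint sixteen-fold and break the in-memory design.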
See: GFS single master and large chunk size, GFS master operations
HDFS: the NameNode (learning from GFS's mistakes)
What the NameNode does: Functionally identical to GFS's master -- all metadata, all namespace operations, block-to-DataNode mapping.
Why Hadoop chose this: HDFS was explicitly modeled on GFS. The single-master architecture was proven to work at Google's scale. But the open-source community learned from GFS's single-point-of-failure weakness and invested heavily in solving it.
What they improved over GFS:
- High Availability: Standby NameNode with shared edit log (Quorum Journal Manager) enables automatic failover in seconds rather than the 30+ minutes of cold recovery
- Federation: Multiple independent NameNodes, each owning a portion of the namespace, break through the single-node memory limit
- Fencing: STONITH and ZooKeeper-based failover controller prevent split-brain during NameNode transitions
See: HDFS High Availability, HDFS deep dive
BigTable: master plus Chubby
What the master does: Assigns tablets to tablet servers, detects tablet server additions and removals, balances load, and handles schema changes. The master does not serve data reads or writes -- clients talk directly to tablet servers.
Why Google chose this: BigTable delegates the hardest parts of coordination (lock management, leader election, consistent configuration storage) to Chubby rather than building them into the master itself. This separation of concerns keeps BigTable's master relatively simple -- it is a coordinator, not a consensus participant.
See: BigTable partitioning and architecture, BigTable components
Kafka: controller broker plus per-partition leaders
What the controller does: One broker is elected as the controller (via ZooKeeper). The controller manages partition-to-broker assignment, detects broker failures, and orchestrates leader election for partitions.
What partition leaders do: Each partition has a leader broker that handles all produce and consume requests for that partition. Followers replicate from the leader.
Why LinkedIn chose this: Kafka's architecture is a hybrid. The controller is a single leader for cluster-wide coordination, but the actual data plane has thousands of independent leaders (one per partition). This is the key insight: by partitioning the leadership responsibility, Kafka avoids the throughput bottleneck of a single master while retaining the simplicity of single-leader-per-partition consistency.
What they gain: Per-partition ordering guarantees with minimal coordination. Partition leaders operate independently -- no cross-partition consensus needed.
What they risk: Controller failure temporarily blocks partition reassignment (but existing partitions continue operating). ZooKeeper dependency adds operational complexity.
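The partitioned-leadership idea reduces to a routing function. A minimal sketch (broker names and the assignment map are invented for illustration; Kafka's real partitioner uses murmur2, not MD5):

```python
# Sketch of Kafka's partitioned leadership: each partition has exactly one
# leader broker, and a producer routes a record to the leader of the
# partition its key hashes to.
import hashlib

NUM_PARTITIONS = 4
# partition -> leader broker, as the controller might have assigned it
partition_leader = {0: "broker-1", 1: "broker-2", 2: "broker-3", 3: "broker-1"}

def partition_for(key: bytes) -> int:
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS

def leader_for(key: bytes) -> str:
    return partition_leader[partition_for(key)]

# Records with the same key always land on the same partition, hence the
# same leader -- which is what yields per-key ordering with no cross-
# partition coordination.
assert leader_for(b"user-42") == leader_for(b"user-42")
```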
See: Kafka controller broker, Kafka high-level architecture
Chubby: Paxos-elected master
What the master does: Serves all client requests (reads and writes). The other four replicas in a typical Chubby cell participate in Paxos consensus to replicate the master's state but do not serve client requests directly.
Why Google chose this: Chubby provides distributed locking and consistent configuration storage. These operations require linearizability -- the strongest consistency guarantee. Routing all operations through a single master that replicates via Paxos is the most straightforward way to achieve this.
What they gain: Linearizable reads and writes. Simple client programming model (talk to the master, get correct answers).
What they risk: The master is a throughput bottleneck. Chubby mitigates this with aggressive client-side caching and by being designed for low-volume, high-importance operations (lock acquisition, leader election, small config files) rather than high-throughput data.
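The caching mitigation can be sketched as invalidate-before-write: the master invalidates client caches before committing a change, so cached reads never return stale data. This toy keeps everything in one process; real Chubby does this with leased sessions and keep-alive RPCs, and the names below are illustrative:

```python
# Chubby-style client caching sketch: the master invalidates every
# client cache for a key before applying a write to it.
class Master:
    def __init__(self):
        self.store = {}
        self.clients = []

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        for c in self.clients:       # invalidate all caches first...
            c.invalidate(key)
        self.store[key] = value      # ...then commit the write

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}
        master.clients.append(self)

    def invalidate(self, key):
        self.cache.pop(key, None)

    def read(self, key):
        if key not in self.cache:                 # only misses hit the master
            self.cache[key] = self.master.read(key)
        return self.cache[key]

m = Master()
c = Client(m)
m.write("config", "v1")
assert c.read("config") == "v1"   # miss, fetched from master, now cached
m.write("config", "v2")           # invalidation precedes the commit
assert c.read("config") == "v2"   # cache was invalidated, so the read is fresh
```

Repeated reads of a cached key never touch the master, which is how a single node can serve tens of thousands of clients for low-volume, high-importance data.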
See: Chubby design rationale, Scaling Chubby
Peer-to-peer systems: why they chose decentralization
Dynamo: all nodes are equal
Why Amazon chose this: Dynamo powers shopping carts during peak traffic. Amazon needed a system with no single point of failure -- not even a momentary one during failover. If a master must be elected, there is a window where the system cannot accept writes. For Amazon, that window is lost revenue.

How it works without a leader:
- Data placement: Consistent hashing determines which nodes own which data. No coordinator assigns ranges.
- Request routing: Any node can handle any request. The client (or a load balancer) picks a coordinator node, which forwards to the appropriate replicas.
- Failure detection: Gossip protocol spreads membership and failure information. No central authority declares nodes dead.
- Conflict resolution: Vector clocks track causality. The application resolves conflicts.
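The data-placement mechanism above can be sketched as a consistent-hashing ring with virtual nodes, in the style the Dynamo paper describes: each physical node owns many points on the ring, and a key is stored on the first N distinct nodes clockwise from its hash. Node names and the vnode count are illustrative:

```python
# Minimal consistent-hashing ring with virtual nodes: no coordinator
# assigns ranges, the hash function alone determines placement.
import bisect
import hashlib

def _hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=8):
        # each node contributes `vnodes` points scattered around the ring
        self.points = sorted((_hash(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def preference_list(self, key: str, n: int):
        """First n distinct nodes clockwise from the key's position."""
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.points)
        out = []
        while len(out) < n:
            node = self.points[idx % len(self.points)][1]
            if node not in out:
                out.append(node)
            idx += 1
        return out

ring = Ring(["A", "B", "C", "D"])
replicas = ring.preference_list("cart:alice", 3)
assert len(set(replicas)) == 3    # three distinct replicas, no coordinator needed
```

Adding or removing a node moves only the keys adjacent to its ring positions, which is why membership changes need no global rebalancing step.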
What they gain: True zero-downtime operation. Any node can serve any request. No failover window. Writes succeed as long as any node is reachable (sloppy quorums with hinted handoff).
What they risk: Consistency is sacrificed. Conflicting writes require application-level resolution. Gossip-based failure detection is slower than centralized heartbeats. Consistent hashing can create load imbalances that are harder to fix without a central coordinator.
See: Dynamo high-level architecture, Dynamo's put/get operations
Cassandra: Dynamo's architecture with BigTable's data model
Why Facebook chose this: Cassandra was originally built for Facebook's inbox search. It needed Dynamo's availability guarantees (always-on for billions of users) combined with a richer data model than simple key-value (column families, range queries). Facebook took Dynamo's decentralized architecture and married it to BigTable's column-family data model.
Key differences from Dynamo:
- Tunable consistency: Where Dynamo fixes its quorum parameters (N, R, W) per instance, Cassandra lets clients choose a consistency level per query (e.g. ONE, QUORUM, ALL)
- Simpler conflict resolution: Last-write-wins with timestamps rather than vector clocks and application-level merging
- Better failure detection: Phi Accrual failure detector adapts to network conditions, reducing false positives in multi-datacenter deployments
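The "simpler conflict resolution" point is easy to make concrete. A sketch of last-write-wins as Cassandra applies it: each write carries a timestamp, and on read the value with the highest timestamp wins silently, whereas Dynamo would surface both versions to the application:

```python
# Last-write-wins reconciliation sketch: the replica value with the
# highest write timestamp wins; no application-level merge needed.
def lww_merge(replica_values):
    """replica_values: list of (timestamp, value) pairs from replicas."""
    return max(replica_values, key=lambda tv: tv[0])[1]

# Two replicas disagree after a partition; the newer write wins.
assert lww_merge([(100, "blue"), (250, "green")]) == "green"
```

The trade-off: if two writes are truly concurrent, one of them is silently discarded -- simplicity for the application, at the cost of the causality information vector clocks would have preserved.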
See: Cassandra high-level architecture, Cassandra gossiper
The master bottleneck problem and solutions
Every leader-based system faces the same fundamental tension: the leader provides simplicity and correctness, but it is a bottleneck and a single point of failure.
Solutions the systems employ
| Solution | System | How it works |
|---|---|---|
| Shadow/Standby master | HDFS | Standby NameNode continuously replays edit log; can take over in seconds |
| Master off the data path | GFS, BigTable | Master handles metadata only; data flows directly between clients and workers |
| Partitioned leadership | Kafka | One leader per partition; thousands of independent leaders across the cluster |
| Federation | HDFS | Multiple NameNodes, each owning a namespace slice; breaks memory bottleneck |
| Client-side caching | Chubby, BigTable | Clients cache metadata aggressively, reducing master load |
| Delegation to coordination service | BigTable, Kafka | Hard coordination problems delegated to Chubby/ZooKeeper |
[Diagram: the architecture of master resilience]
The consensus problem: how each camp solves it
Both architectures must solve a fundamental problem: how do nodes agree on the state of the system?
Leader-based approach to consensus
Leader-based systems solve consensus by avoiding distributed consensus for most operations. When one node makes all decisions, there is no disagreement to resolve. The consensus problem shrinks to a single point: who is the leader?
- GFS: No automatic consensus. The master is a known, fixed node. Recovery is manual.
- HDFS: ZooKeeper Failover Controller runs leader election via ZooKeeper (which uses ZAB, a Paxos-like protocol internally).
- BigTable: Chubby (running Paxos) handles tablet server membership and master election.
- Kafka: ZooKeeper handles controller election. Partition leaders are assigned by the controller, not elected.
- Chubby: Paxos consensus among five replicas elects the master and replicates all state.
Peer-to-peer approach to consensus
Peer-to-peer systems cannot avoid distributed consensus entirely -- they must agree on data placement, membership, and (for reads) which version of data is correct.
- Dynamo: No formal consensus. Consistent hashing determines placement. Vector clocks track causality. Conflicts are pushed to the application.
- Cassandra: Same as Dynamo for placement and membership. Quorum reads/writes provide "consensus" at the operation level -- if R+W > N, the read and write quorums overlap, ensuring at least one node in the read set has the latest write.
The key insight: peer-to-peer systems replace global consensus with per-operation quorum agreement. This is weaker than full consensus but sufficient for many workloads, and it avoids the throughput bottleneck of a single leader.
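The overlap argument is a pigeonhole fact, and it can be checked exhaustively for small quorums. A sketch for N = 3, R = 2, W = 2 (the parameter values are illustrative):

```python
# Demonstrating R + W > N: for N = 3 replicas with W = 2 and R = 2,
# every possible read quorum shares at least one node with every
# possible write quorum, so a read always contacts some node that
# acknowledged the latest write.
from itertools import combinations

N, R, W = 3, 2, 2
replicas = range(N)

overlaps = all(set(r) & set(w)                 # nonempty intersection?
               for r in combinations(replicas, R)
               for w in combinations(replicas, W))
assert overlaps            # holds exactly because R + W = 4 > 3 = N
```

With R + W <= N the check fails: some read quorum and write quorum are disjoint, and a read can miss the latest write entirely -- which is the regime Dynamo's defaults deliberately accept for availability.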
Comparison table
| Dimension | Leader-Based (GFS, HDFS, BigTable, Kafka, Chubby) | Peer-to-Peer (Dynamo, Cassandra) |
|---|---|---|
| Consistency | Strong (naturally) | Eventual or tunable |
| Availability | Limited by leader health | High (any node serves requests) |
| Throughput ceiling | One leader's capacity (mitigated by partitioning) | Scales linearly with nodes |
| Failure blast radius | Leader failure affects all clients | Single node failure affects only its data range |
| Complexity location | In failover and fencing logic | In conflict resolution and anti-entropy |
| Operational burden | Monitor and protect the leader | Monitor and tune consistency parameters |
| Data placement | Centrally decided (optimal) | Hash-determined (potentially unbalanced) |
| Adding/removing nodes | Master coordinates rebalancing | Consistent hashing absorbs changes |
| Split-brain risk | Yes -- requires fencing | No leader to split |
| Coordination service needed | Often (ZooKeeper, Chubby) | No |
Decision framework
When an interviewer asks you to choose between leader-based and peer-to-peer architecture, use this framework:
Use leader-based when:
- Strong consistency is required (financial data, metadata, coordination)
- The workload can be partitioned into independent units (tablets, partitions) to distribute leadership
- You can tolerate brief unavailability during failover
- Operations need global ordering or coordination
- You want simpler application-level logic (no conflict resolution)
Use peer-to-peer when:
- Availability must be absolute (zero-downtime requirement)
- The workload is simple key-value access without cross-key transactions
- You are willing to handle conflicts at the application layer (or can use last-write-wins)
- Multi-datacenter deployment is a first-class requirement
- No single node should be more important than any other
The hybrid option (Kafka's approach):
- Use a leader for coordination (controller) but partition the data plane into independent units with their own leaders
- This gives you strong per-partition consistency with horizontally scalable throughput
- This is the most common pattern in modern distributed systems
The evolution of the debate
It is worth noting that the industry has largely converged on a hybrid approach since these systems were designed:
- Pure single-master (GFS, 2003) proved too fragile. HDFS added HA and federation.
- Pure peer-to-peer (Dynamo, 2007) proved too complex for application developers. Cassandra added tunable consistency to reduce the burden.
- Modern systems (Kafka, CockroachDB, Spanner) partition leadership so that no single node bottlenecks the system, while each partition gets the simplicity of a single leader.
The lesson: the question is not "centralized or decentralized?" but "at what granularity should leadership operate?" A system with one leader for the whole cluster is fragile. A system with one leader per partition scales well. A system with no leader at all pushes complexity to the application.