
Centralized vs. Decentralized: The Architecture Decision

The most fundamental fork in distributed system design is whether to have a distinguished leader or treat all nodes as equal peers. Five of the seven systems in this course chose leader-based architectures; two chose peer-to-peer. Neither choice is inherently superior -- each is a rational response to a different set of constraints. This page dissects the decision, system by system.

The two camps

Leader-based systems: why they chose centralization

GFS: the single master

What the master does: Holds the entire file system namespace and the mapping of chunks to chunkservers. All metadata operations (file creation, deletion, chunk allocation) go through it. Data operations flow directly between clients and chunkservers.

Why Google chose this: Simplicity. A single master eliminates the need for distributed consensus on metadata operations. File creation, chunk allocation, and garbage collection are trivially correct when one node makes all decisions. Google's engineering culture prizes simplicity over theoretical scalability -- they would rather build a system that is easy to reason about and fix operationally.

What they gain: Simple, correct metadata management. Global knowledge enables intelligent chunk placement (rack-aware replication). Straightforward garbage collection and re-replication.

What they risk: The master is a single point of failure and a throughput bottleneck. Google mitigated the bottleneck by keeping metadata small (in-memory), using large chunk sizes (64MB) to reduce metadata volume, and ensuring data never flows through the master. They accepted the availability risk of single-master failure by relying on fast manual recovery.
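The arithmetic behind this mitigation is worth seeing. The GFS paper cites less than 64 bytes of master metadata per 64 MB chunk; a back-of-envelope sketch (the 64-byte figure is the paper's ballpark, not an exact constant):

```python
# Why 64 MB chunks keep GFS master metadata in one node's RAM.
CHUNK_SIZE = 64 * 2**20      # 64 MB chunks
META_PER_CHUNK = 64          # bytes per chunk, the paper's ballpark figure

def master_metadata_bytes(total_data_bytes, chunk_size=CHUNK_SIZE):
    chunks = total_data_bytes // chunk_size
    return chunks * META_PER_CHUNK

petabyte = 2**50
# A full petabyte of file data needs only about 1 GiB of master metadata.
print(master_metadata_bytes(petabyte) / 2**30)
```

With smaller chunks (say 4 MB), the same petabyte would need 16x the metadata, which is the pressure that pushed Google toward large chunks.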

See: GFS single master and large chunk size, GFS master operations

HDFS: the NameNode (learning from GFS's mistakes)

What the NameNode does: Functionally identical to GFS's master -- all metadata, all namespace operations, block-to-DataNode mapping.

Why Hadoop chose this: HDFS was explicitly modeled on GFS. The single-master architecture was proven to work at Google's scale. But the open-source community learned from GFS's single-point-of-failure weakness and invested heavily in solving it.

What they improved over GFS:

  • High Availability: Standby NameNode with shared edit log (Quorum Journal Manager) enables automatic failover in seconds rather than the 30+ minutes of cold recovery
  • Federation: Multiple independent NameNodes, each owning a portion of the namespace, break through the single-node memory limit
  • Fencing: STONITH and ZooKeeper-based failover controller prevent split-brain during NameNode transitions
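The epoch-based fencing in the Quorum Journal Manager can be sketched in a few lines. This is a simplified illustration, not Hadoop's actual API; class and method names are invented for clarity:

```python
# Illustrative epoch fencing: a journal node rejects edit-log writes from a
# NameNode whose epoch is stale, which prevents split-brain after failover.
class JournalNode:
    def __init__(self):
        self.promised_epoch = 0
        self.log = []

    def new_epoch(self, epoch):
        """A NameNode taking over claims a strictly higher epoch first."""
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def write(self, epoch, entry):
        if epoch < self.promised_epoch:
            raise PermissionError("fenced: stale writer")
        self.log.append(entry)

jn = JournalNode()
jn.new_epoch(1)
jn.write(1, "edit-A")          # active NameNode writes normally
jn.new_epoch(2)                # standby takes over with a higher epoch
try:
    jn.write(1, "edit-B")      # the old active's writes are now rejected
except PermissionError:
    print("old NameNode fenced")
```

The key property: once a majority of journal nodes have promised epoch 2, the old NameNode cannot get a majority to accept its writes, so at most one NameNode's edits survive.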

See: HDFS High Availability, HDFS deep dive

BigTable: master plus Chubby

What the master does: Assigns tablets to tablet servers, detects tablet server additions and removals, balances load, and handles schema changes. The master does not serve data reads or writes -- clients talk directly to tablet servers.

Think first
BigTable's master assigns tablets to tablet servers, but the master is not on the critical read/write path. What does this imply about the system's tolerance for master downtime compared to GFS, where the master IS on the metadata-read path?

Why Google chose this: BigTable delegates the hardest parts of coordination (lock management, leader election, consistent configuration storage) to Chubby rather than building them into the master itself. This separation of concerns keeps BigTable's master relatively simple -- it is a coordinator, not a consensus participant.

See: BigTable partitioning and architecture, BigTable components

Kafka: controller broker plus per-partition leaders

What the controller does: One broker is elected as the controller (via ZooKeeper). The controller manages partition-to-broker assignment, detects broker failures, and orchestrates leader election for partitions.

What partition leaders do: Each partition has a leader broker that handles all produce and consume requests for that partition. Followers replicate from the leader.

Why LinkedIn chose this: Kafka's architecture is a hybrid. The controller is a single leader for cluster-wide coordination, but the actual data plane has thousands of independent leaders (one per partition). This is the key insight: by partitioning the leadership responsibility, Kafka avoids the throughput bottleneck of a single master while retaining the simplicity of single-leader-per-partition consistency.
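Partitioned leadership can be sketched in a few lines. The hash function and broker assignments below are illustrative (Kafka's Java client actually defaults to murmur2 for key hashing):

```python
# Sketch of Kafka-style partitioned leadership: the record key selects a
# partition, and each partition has exactly one leader broker.
import zlib

# Hypothetical partition -> leader assignment, maintained by the controller.
leaders = {0: "broker-1", 1: "broker-2", 2: "broker-3", 3: "broker-1"}

def leader_for(key: str, num_partitions: int = 4) -> str:
    partition = zlib.crc32(key.encode()) % num_partitions
    return leaders[partition]

# All records with the same key hash to the same partition, hence the same
# leader -- per-key ordering with no cross-partition coordination.
print(leader_for("user-42") == leader_for("user-42"))
```

Because leadership is spread across all brokers, aggregate write throughput scales with the cluster, while each individual partition still enjoys single-leader simplicity.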

What they gain: Per-partition ordering guarantees with minimal coordination. Partition leaders operate independently -- no cross-partition consensus needed.

What they risk: Controller failure temporarily blocks partition reassignment (but existing partitions continue operating). ZooKeeper dependency adds operational complexity.

See: Kafka controller broker, Kafka high-level architecture

Chubby: Paxos-elected master

What the master does: Serves all client requests (reads and writes). The other four replicas in a typical Chubby cell participate in Paxos consensus to replicate the master's state but do not serve client requests directly.

Why Google chose this: Chubby provides distributed locking and consistent configuration storage. These operations require linearizability -- the strongest consistency guarantee. Routing all operations through a single master that replicates via Paxos is the most straightforward way to achieve this.

What they gain: Linearizable reads and writes. Simple client programming model (talk to the master, get correct answers).

What they risk: The master is a throughput bottleneck. Chubby mitigates this with aggressive client-side caching and by being designed for low-volume, high-importance operations (lock acquisition, leader election, small config files) rather than high-throughput data.

See: Chubby design rationale, Scaling Chubby

Peer-to-peer systems: why they chose decentralization

Dynamo: all nodes are equal

Why Amazon chose this: Dynamo powers shopping carts during peak traffic. Amazon needed a system with no single point of failure -- not even a momentary one during failover. If a master must be elected, there is a window where the system cannot accept writes. For Amazon, that window is lost revenue.

How it works without a leader:

  • Data placement: Consistent hashing determines which nodes own which data. No coordinator assigns ranges.
  • Request routing: Any node can handle any request. The client (or a load balancer) picks a coordinator node, which forwards to the appropriate replicas.
  • Failure detection: Gossip protocol spreads membership and failure information. No central authority declares nodes dead.
  • Conflict resolution: Vector clocks track causality. The application resolves conflicts.
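The data-placement step above can be made concrete with a toy consistent-hash ring. Node names, virtual-node count, and hashing details are illustrative, not Dynamo's exact scheme:

```python
# Minimal consistent-hash ring with virtual nodes: any node can compute the
# replica set for a key locally, with no coordinator consulted.
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each physical node owns many points on the ring to smooth load.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def owners(self, key, n=3):
        """Walk clockwise from the key's hash, collecting n distinct nodes
        (Dynamo calls this the preference list)."""
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        found = []
        while len(found) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.owners("cart:alice"))  # three distinct replicas for this key
```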

What they gain: True zero-downtime operation. Any node can serve any request. No failover window. Writes always succeed (to at least one node).

What they risk: Consistency is sacrificed. Conflicting writes require application-level resolution. Gossip-based failure detection is slower than centralized heartbeats. Consistent hashing can create load imbalances that are harder to fix without a central coordinator.
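The vector-clock causality check behind that conflict detection can be sketched directly (server IDs and counter values are illustrative):

```python
# Vector clocks: each entry counts updates coordinated by one server.
def happens_before(a: dict, b: dict) -> bool:
    """True if clock a causally precedes clock b: a <= b on every server,
    and the clocks are not identical."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in a) and a != b

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock precedes the other: a genuine conflict the
    application must resolve."""
    return not happens_before(a, b) and not happens_before(b, a)

v1 = {"sx": 2, "sy": 1}           # written via coordinators sx, then sy
v2 = {"sx": 2, "sy": 1, "sz": 1}  # extends v1, so it supersedes it
v3 = {"sx": 3}                    # diverged: advanced sx without seeing sy
print(happens_before(v1, v2))  # True  -> v1 can be discarded
print(concurrent(v2, v3))      # True  -> both versions returned to client
```

When a read finds concurrent versions, Dynamo returns all of them and the application (e.g., the shopping cart) merges; this is exactly the complexity that leader-based systems avoid.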

See: Dynamo high-level architecture, Dynamo's put/get operations

Cassandra: Dynamo's architecture with BigTable's data model

Why Facebook chose this: Cassandra was originally built for Facebook's inbox search. It needed Dynamo's availability guarantees (always-on for billions of users) combined with a richer data model than simple key-value (column families, range queries). Facebook took Dynamo's decentralized architecture and married it to BigTable's column-family data model.

Key differences from Dynamo:

  • Tunable consistency: Dynamo fixes its quorum parameters (N, R, W) per instance; Cassandra lets clients choose a consistency level per query (ONE, QUORUM, ALL)
  • Simpler conflict resolution: Last-write-wins with timestamps rather than vector clocks and application-level merging
  • Better failure detection: Phi Accrual failure detector adapts to network conditions, reducing false positives in multi-datacenter deployments
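The Phi Accrual idea can be sketched in a few lines, under the simplifying assumption that heartbeat inter-arrival times fit a normal distribution (the real detector maintains a sliding window of intervals; the numbers below are illustrative):

```python
# Phi Accrual: instead of a binary alive/dead verdict, emit a suspicion
# level phi = -log10 P(the heartbeat is merely late).
import math

def phi(time_since_last_heartbeat: float, intervals: list) -> float:
    mean = sum(intervals) / len(intervals)
    var = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    std = max(math.sqrt(var), 1e-6)
    # Probability the gap exceeds t under a normal fit of observed intervals.
    z = (time_since_last_heartbeat - mean) / std
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    return -math.log10(max(p_later, 1e-12))

history = [100, 102, 98, 101, 99]   # recent heartbeat intervals in ms
print(round(phi(100, history), 2))  # on-time heartbeat: low suspicion
print(round(phi(130, history), 2))  # long gap: suspicion climbs sharply
```

Because phi adapts to the observed interval distribution, a slow WAN link between datacenters widens the tolerance automatically instead of triggering false positives against a fixed timeout.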

See: Cassandra high-level architecture, Cassandra gossiper

The master bottleneck problem and solutions

Every leader-based system faces the same fundamental tension: the leader provides simplicity and correctness, but it is a bottleneck and a single point of failure.

Think first
GFS, HDFS, BigTable, Kafka, and Chubby all have a 'master' or 'leader' concept. But the master's role is different in each system. In which system is the master MOST critical to ongoing operations (i.e., its failure has the most immediate impact)?

Solutions the systems employ

Solution                           | System           | How it works
Shadow/Standby master              | HDFS             | Standby NameNode continuously replays edit log; can take over in seconds
Master off the data path           | GFS, BigTable    | Master handles metadata only; data flows directly between clients and workers
Partitioned leadership             | Kafka            | One leader per partition; thousands of independent leaders across the cluster
Federation                         | HDFS             | Multiple NameNodes, each owning a namespace slice; breaks memory bottleneck
Client-side caching                | Chubby, BigTable | Clients cache metadata aggressively, reducing master load
Delegation to coordination service | BigTable, Kafka  | Hard coordination problems delegated to Chubby/ZooKeeper

The architecture of master resilience

The consensus problem: how each camp solves it

Both architectures must solve a fundamental problem: how do nodes agree on the state of the system?

Leader-based approach to consensus

Leader-based systems solve consensus by avoiding distributed consensus for most operations. When one node makes all decisions, there is no disagreement to resolve. The consensus problem shrinks to a single point: who is the leader?

  • GFS: No automatic consensus. The master is a known, fixed node. Recovery is manual.
  • HDFS: ZooKeeper Failover Controller runs leader election via ZooKeeper (which uses ZAB, a Paxos-like protocol internally).
  • BigTable: Chubby (running Paxos) handles tablet server membership and master election.
  • Kafka: ZooKeeper handles controller election. Partition leaders are assigned by the controller, not elected.
  • Chubby: Paxos consensus among five replicas elects the master and replicates all state.

Peer-to-peer approach to consensus

Peer-to-peer systems cannot avoid distributed consensus entirely -- they must agree on data placement, membership, and (for reads) which version of data is correct.

  • Dynamo: No formal consensus. Consistent hashing determines placement. Vector clocks track causality. Conflicts are pushed to the application.
  • Cassandra: Same as Dynamo for placement and membership. Quorum reads/writes provide "consensus" at the operation level -- if R+W > N, the read and write quorums overlap, ensuring at least one node in the read set has the latest write.
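The R + W > N overlap argument can be checked exhaustively for small clusters:

```python
# Verify that every possible read quorum intersects every possible write
# quorum exactly when R + W > N.
import itertools

def overlap_always(n: int, r: int, w: int) -> bool:
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in itertools.combinations(nodes, r)
               for ws in itertools.combinations(nodes, w))

print(overlap_always(3, 2, 2))  # True:  R+W = 4 > N = 3, overlap guaranteed
print(overlap_always(3, 1, 1))  # False: disjoint singletons exist
```

With N = 3, the common QUORUM/QUORUM setting (R = W = 2) guarantees every read sees at least one replica holding the latest write; ONE/ONE trades that guarantee for latency.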

The key insight: peer-to-peer systems replace global consensus with per-operation quorum agreement. This is weaker than full consensus but sufficient for many workloads, and it avoids the throughput bottleneck of a single leader.

Comparison table

Dimension                   | Leader-Based (GFS, HDFS, BigTable, Kafka, Chubby) | Peer-to-Peer (Dynamo, Cassandra)
Consistency                 | Strong (naturally)                                | Eventual or tunable
Availability                | Limited by leader health                          | High (any node serves requests)
Throughput ceiling          | One leader's capacity (mitigated by partitioning) | Scales linearly with nodes
Failure blast radius        | Leader failure affects all clients                | Single node failure affects only its data range
Complexity location         | In failover and fencing logic                     | In conflict resolution and anti-entropy
Operational burden          | Monitor and protect the leader                    | Monitor and tune consistency parameters
Data placement              | Centrally decided (optimal)                       | Hash-determined (potentially unbalanced)
Adding/removing nodes       | Master coordinates rebalancing                    | Consistent hashing absorbs changes
Split-brain risk            | Yes -- requires fencing                           | No leader to split
Coordination service needed | Often (ZooKeeper, Chubby)                         | No

Decision framework

Interview framework

When an interviewer asks you to choose between leader-based and peer-to-peer architecture, use this framework:

Use leader-based when:

  • Strong consistency is required (financial data, metadata, coordination)
  • The workload can be partitioned into independent units (tablets, partitions) to distribute leadership
  • You can tolerate brief unavailability during failover
  • Operations need global ordering or coordination
  • You want simpler application-level logic (no conflict resolution)

Use peer-to-peer when:

  • Availability must be absolute (zero-downtime requirement)
  • The workload is simple key-value access without cross-key transactions
  • You are willing to handle conflicts at the application layer (or can use last-write-wins)
  • Multi-datacenter deployment is a first-class requirement
  • No single node should be more important than any other

The hybrid option (Kafka's approach):

  • Use a leader for coordination (controller) but partition the data plane into independent units with their own leaders
  • This gives you strong per-partition consistency with horizontally scalable throughput
  • This is the most common pattern in modern distributed systems

The evolution of the debate

It is worth noting that the industry has largely converged on a hybrid approach since these systems were designed:

  1. Pure single-master (GFS, 2003) proved too fragile. HDFS added HA and federation.
  2. Pure peer-to-peer (Dynamo, 2007) proved too complex for application developers. Cassandra added tunable consistency to reduce the burden.
  3. Modern systems (Kafka, CockroachDB, Spanner) partition leadership so that no single node bottlenecks the system, while each partition gets the simplicity of a single leader.

The lesson: the question is not "centralized or decentralized?" but "at what granularity should leadership operate?" A system with one leader for the whole cluster is fragile. A system with one leader per partition scales well. A system with no leader at all pushes complexity to the application.

Quiz

Quiz
BigTable's master is responsible for assigning tablets to tablet servers, but it is NOT on the read/write path. If the master crashes, what happens to ongoing BigTable read and write operations?
Quiz
Dynamo uses consistent hashing to determine data placement without a central coordinator. What fundamental problem does this create that a leader-based system like HDFS does not face?