Anatomy of a Write Operation

Reads are simple -- pick a replica and fetch bytes. Writes are where distributed file systems get hard. GFS must ensure that a mutation reaches all replicas in the same order, even when multiple clients write concurrently. The mechanism that makes this work is the chunk lease.

Think first
If three replicas must all receive the same mutation, how do you ensure they apply mutations in the same order, and who decides the order?

Chunk leases

When a mutation (a write or a record append) targets a chunk, the master grants a 60-second lease to one of the replicas holding it. (File deletion is a namespace operation handled by the master, not a chunk mutation.) That replica becomes the primary for the chunk and is responsible for:

  • Assigning a serial order to all pending mutations
  • Ensuring every replica applies mutations in that same order

Only one lease exists per chunk at any time. If two write requests arrive, both see the same primary. The primary can request lease extensions if mutations take longer than 60 seconds.

When the master grants a lease, it increments the chunk version number and notifies all replicas of the new version. This is how GFS detects stale replicas later.
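The master-side bookkeeping can be sketched for a single chunk. This is a minimal illustration; the names (`Master`, `grant_lease`, `extend_lease`) are assumptions, not GFS's real interfaces:

```python
import time

LEASE_SECONDS = 60  # initial lease timeout from the GFS design

class Master:
    """Sketch of master-side lease state for one chunk (illustrative only)."""
    def __init__(self):
        self.version = 0        # chunk version number
        self.primary = None     # replica currently holding the lease
        self.lease_expiry = 0.0

    def grant_lease(self, replicas, now=None):
        now = time.time() if now is None else now
        # Reuse the current lease if it is still live: only one lease
        # exists per chunk at any time.
        if self.primary is not None and now < self.lease_expiry:
            return self.primary, self.version
        # Otherwise pick a new primary and bump the version; in real GFS
        # the master also notifies every replica of the new version.
        self.primary = replicas[0]
        self.version += 1
        self.lease_expiry = now + LEASE_SECONDS
        return self.primary, self.version

    def extend_lease(self, replica, now=None):
        now = time.time() if now is None else now
        # Only the current primary may extend, and only while the lease is live.
        if replica == self.primary and now < self.lease_expiry:
            self.lease_expiry = now + LEASE_SECONDS
            return True
        return False
```

Note how primary failure needs no special handling: the master simply stops extending and waits out the expiry, then the next `grant_lease` call picks a new primary with a higher version.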

Interview angle

The lease mechanism is GFS's alternative to distributed locking. It provides bounded-time mutual exclusion without the complexity of a full lock manager. If the primary crashes, the master simply waits for the lease to expire (at most 60 seconds) and grants a new one. This pattern also appears in BigTable tablet assignments and HDFS block leases.

The two-phase write protocol

GFS splits data writing into two distinct phases that decouple data flow from control flow:

Phase 1: Data pushing (data flow)

  1. The client gets a list of replicas (primary + secondaries) from the master
  2. The client sends data to the closest replica
  3. That replica forwards data in a chain to the next closest replica, and so on
  4. Each replica stores the data in an internal buffer cache -- nothing is written to disk yet
  5. All replicas acknowledge receipt
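The chained data push can be sketched as follows. `Replica`, `receive`, and `push_data` are hypothetical names; real GFS forwards over TCP and stages data in an LRU buffer cache, but the shape of the flow is the same:

```python
class Replica:
    """Illustrative ChunkServer receiving Phase 1 data pushes."""
    def __init__(self, name):
        self.name = name
        self.buffer = {}  # in-memory staging area; nothing on disk yet

    def receive(self, data_id, data, rest_of_chain):
        # Buffer the data locally; it is NOT applied to the chunk yet.
        self.buffer[data_id] = data
        # Forward to the next-closest replica, so each machine's full
        # outbound bandwidth carries exactly one copy.
        downstream_acks = []
        if rest_of_chain:
            downstream_acks = rest_of_chain[0].receive(
                data_id, data, rest_of_chain[1:])
        # Acks flow back up the chain to the client.
        return [self.name] + downstream_acks

def push_data(data_id, data, chain):
    """Client side of Phase 1: send to the closest replica; the chain does the rest."""
    return chain[0].receive(data_id, data, chain[1:])
```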

Phase 2: Write ordering (control flow)

  1. The client sends a write request to the primary
  2. The primary assigns consecutive serial numbers to all pending mutations
  3. The primary applies mutations to its own copy in serial-number order
  4. The primary forwards the write requests (in the same order) to all secondaries
  5. Secondaries apply mutations in the same serial-number order and acknowledge
  6. The primary replies to the client with success or error
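A minimal sketch of this control flow, assuming hypothetical `Primary` and `Secondary` classes. Note that a failure at a secondary does not roll back the primary; the client just sees an error:

```python
class Secondary:
    def __init__(self):
        self.log = []      # mutations applied, in serial order
        self.expected = 0

    def apply(self, serial, data_id):
        # Apply strictly in serial-number order; reject anything out of order.
        if serial != self.expected:
            return False
        self.log.append((serial, data_id))
        self.expected += 1
        return True

class Primary:
    """Assigns serial numbers, applies locally, then forwards in order."""
    def __init__(self, secondaries):
        self.next_serial = 0
        self.log = []
        self.secondaries = secondaries

    def write(self, data_id):
        serial = self.next_serial  # one total order per chunk
        self.next_serial += 1
        self.log.append((serial, data_id))  # apply to own copy first
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        if all(acks):
            return "success"
        # No rollback on partial failure: the primary keeps its copy
        # and the client must retry.
        return "error"
```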

Step-by-step breakdown

Step  Flow                    Description
1     Client -> Master        Ask which ChunkServer holds the lease and where replicas are
2     Master -> Client        Reply with the primary's identity and secondary locations
3     Client -> Replicas      Push data to the closest replica; data chains to all others
4     Client -> Primary       Send the write request (after all replicas acknowledge data receipt)
5     Primary -> Secondaries  Forward the mutation in serial-number order
6     Secondaries -> Primary  Acknowledge completion
7     Primary -> Client       Reply with success or error
Think first
Why is it beneficial to push data to replicas first and only then issue the write command, rather than combining them?

Why separate data flow from control flow?

Concern      Handled by                                       Optimization
Bandwidth    Data flow (client -> chain of ChunkServers)      Chain topology maximizes each machine's outbound bandwidth
Ordering     Control flow (client -> primary -> secondaries)  Primary serializes all mutations; replicas apply in identical order
Consistency  Version numbers                                  Stale replicas detected by version mismatch on ChunkServer restart
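The bandwidth claim can be made concrete with the back-of-the-envelope model from the GFS design: a pipelined chain delivers B bytes to R replicas in roughly B/T + R·L (T = per-link throughput, L = per-hop latency), while naive fan-out splits the client's single uplink R ways. A sketch with illustrative numbers:

```python
def chain_time(num_bytes, replicas, link_mbps, hop_latency_s=0.001):
    """Pipelined chain: the client's full uplink carries one copy;
    each hop adds only its latency once the pipeline fills."""
    bytes_per_s = link_mbps * 1e6 / 8
    return num_bytes / bytes_per_s + replicas * hop_latency_s

def fanout_time(num_bytes, replicas, link_mbps):
    """Naive fan-out: the uplink is divided across all replicas,
    so each copy gets 1/replicas of the bandwidth."""
    bytes_per_s = link_mbps * 1e6 / 8
    return replicas * num_bytes / bytes_per_s

# Pushing one 64 MB chunk to 3 replicas over 100 Mbps links:
B, R = 64 * 2**20, 3
# chain_time(B, R, 100)  -> ~5.37 s
# fanout_time(B, R, 100) -> ~16.11 s
```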
warning

If a write fails on some secondaries but succeeds on the primary, GFS does not roll back the primary. The affected chunk region enters an inconsistent state -- different replicas hold different data. The client retries the operation, but GFS does not guarantee that all replicas converge. Applications must handle this (e.g., by checksumming records and deduplicating on read). This is a direct consequence of GFS's relaxed consistency model.
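One way an application copes, sketched below: frame each record with a unique id and a checksum at write time, then validate and deduplicate at read time. The framing format and function names are illustrative, not a GFS API:

```python
import hashlib

def frame(record_id, payload):
    """Writer side: attach an id (for dedup) and a checksum (for validity)."""
    return (record_id, payload, hashlib.sha256(payload).hexdigest())

def read_valid(records):
    """Reader side: skip corrupt regions, keep the first copy of each id."""
    seen, out = set(), []
    for record_id, payload, checksum in records:
        if hashlib.sha256(payload).hexdigest() != checksum:
            continue  # corrupt or partially-written region
        if record_id in seen:
            continue  # duplicate left behind by a client retry
        seen.add(record_id)
        out.append(payload)
    return out
```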

Quiz
What would happen if GFS sent data directly from the client to ALL replicas simultaneously (fan-out) instead of using a chain topology?