Anatomy of a Write Operation

Reads are simple -- pick a replica and fetch bytes. Writes are where distributed file systems get hard. GFS must ensure that a mutation reaches all replicas in the same order, even when multiple clients write concurrently. The mechanism that makes this work is the chunk lease.

Think first
If three replicas must all receive the same mutation, how do you ensure they apply mutations in the same order, and who decides the order?

Chunk leases

When a mutation (a write or a record append) targets a chunk, the master grants a 60-second lease to one of the replicas holding it. (File deletion is a namespace operation handled by the master, not a chunk mutation.) That replica becomes the primary for the chunk and is responsible for:

  • Assigning a serial order to all pending mutations
  • Ensuring every replica applies mutations in that same order

Only one lease exists per chunk at any time. If two write requests arrive, both see the same primary. The primary can request lease extensions if mutations take longer than 60 seconds.

When the master grants a lease, it increments the chunk version number and notifies all replicas of the new version. This is how GFS detects stale replicas later.
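The master-side bookkeeping can be sketched for a single chunk. This is a minimal illustration; the names (`Master`, `grant_lease`, `extend_lease`) are assumptions, not GFS's real interfaces:

```python
import time

LEASE_SECONDS = 60  # initial lease timeout from the GFS design

class Master:
    """Sketch of master-side lease state for one chunk (illustrative only)."""
    def __init__(self):
        self.version = 0        # chunk version number
        self.primary = None     # replica currently holding the lease
        self.lease_expiry = 0.0

    def grant_lease(self, replicas, now=None):
        now = time.time() if now is None else now
        # Reuse the current lease if it is still live: only one lease
        # exists per chunk at any time.
        if self.primary is not None and now < self.lease_expiry:
            return self.primary, self.version
        # Otherwise pick a new primary and bump the version; in real GFS
        # the master also notifies every replica of the new version.
        self.primary = replicas[0]
        self.version += 1
        self.lease_expiry = now + LEASE_SECONDS
        return self.primary, self.version

    def extend_lease(self, replica, now=None):
        now = time.time() if now is None else now
        # Only the current primary may extend, and only while the lease is live.
        if replica == self.primary and now < self.lease_expiry:
            self.lease_expiry = now + LEASE_SECONDS
            return True
        return False
```

Note how primary failure needs no special handling: the master simply stops extending and waits out the expiry, then the next `grant_lease` call picks a new primary with a higher version.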

Interview angle

The lease mechanism is GFS's alternative to distributed locking. It provides bounded-time mutual exclusion without the complexity of a full lock manager. If the primary crashes, the master simply waits for the lease to expire (at most 60 seconds) and grants a new one. This pattern also appears in BigTable tablet assignments and HDFS block leases.

The two-phase write protocol

GFS splits data writing into two distinct phases that decouple data flow from control flow:

Phase 1: Data pushing (data flow)

  1. The client gets a list of replicas (primary + secondaries) from the master
  2. The client sends data to the closest replica
  3. That replica forwards data in a chain to the next closest replica, and so on
  4. Each replica stores the data in an internal buffer cache -- nothing is written to disk yet
  5. All replicas acknowledge receipt
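The chained data push can be sketched as follows. `Replica`, `receive`, and `push_data` are hypothetical names; real GFS forwards over TCP and stages data in an LRU buffer cache, but the shape of the flow is the same:

```python
class Replica:
    """Illustrative ChunkServer receiving Phase 1 data pushes."""
    def __init__(self, name):
        self.name = name
        self.buffer = {}  # in-memory staging area; nothing on disk yet

    def receive(self, data_id, data, rest_of_chain):
        # Buffer the data locally; it is NOT applied to the chunk yet.
        self.buffer[data_id] = data
        # Forward to the next-closest replica, so each machine's full
        # outbound bandwidth carries exactly one copy.
        downstream_acks = []
        if rest_of_chain:
            downstream_acks = rest_of_chain[0].receive(
                data_id, data, rest_of_chain[1:])
        # Acks flow back up the chain to the client.
        return [self.name] + downstream_acks

def push_data(data_id, data, chain):
    """Client side of Phase 1: send to the closest replica; the chain does the rest."""
    return chain[0].receive(data_id, data, chain[1:])
```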

Phase 2: Write ordering (control flow)

  1. The client sends a write request to the primary
  2. The primary assigns consecutive serial numbers to all pending mutations
  3. The primary applies mutations to its own copy in serial-number order
  4. The primary forwards the write requests (in the same order) to all secondaries
  5. Secondaries apply mutations in the same serial-number order and acknowledge
  6. The primary replies to the client with success or error
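A minimal sketch of this control flow, assuming hypothetical `Primary` and `Secondary` classes. Note that a failure at a secondary does not roll back the primary; the client just sees an error:

```python
class Secondary:
    def __init__(self):
        self.log = []      # mutations applied, in serial order
        self.expected = 0

    def apply(self, serial, data_id):
        # Apply strictly in serial-number order; reject anything out of order.
        if serial != self.expected:
            return False
        self.log.append((serial, data_id))
        self.expected += 1
        return True

class Primary:
    """Assigns serial numbers, applies locally, then forwards in order."""
    def __init__(self, secondaries):
        self.next_serial = 0
        self.log = []
        self.secondaries = secondaries

    def write(self, data_id):
        serial = self.next_serial  # one total order per chunk
        self.next_serial += 1
        self.log.append((serial, data_id))  # apply to own copy first
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        if all(acks):
            return "success"
        # No rollback on partial failure: the primary keeps its copy
        # and the client must retry.
        return "error"
```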

Step-by-step breakdown

Step  Flow                    Description
1     Client -> Master        Ask which ChunkServer holds the lease and where replicas are
2     Master -> Client        Reply with the primary's identity and secondary locations
3     Client -> Replicas      Push data to the closest replica; data chains to all others
4     Client -> Primary       Send the write request (after all replicas acknowledge data receipt)
5     Primary -> Secondaries  Forward the mutation in serial-number order
6     Secondaries -> Primary  Acknowledge completion
7     Primary -> Client       Reply with success or error
Think first
Why is it beneficial to push data to replicas first and only then issue the write command, rather than combining them?

Why separate data flow from control flow?

Concern      Handled by                                       Optimization
Bandwidth    Data flow (client -> chain of ChunkServers)      Chain topology maximizes each machine's outbound bandwidth
Ordering     Control flow (client -> primary -> secondaries)  Primary serializes all mutations; replicas apply in identical order
Consistency  Version numbers                                  Stale replicas detected by version mismatch on ChunkServer restart
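The bandwidth claim can be made concrete with the back-of-the-envelope model from the GFS design: a pipelined chain delivers B bytes to R replicas in roughly B/T + R·L (T = per-link throughput, L = per-hop latency), while naive fan-out splits the client's single uplink R ways. A sketch with illustrative numbers:

```python
def chain_time(num_bytes, replicas, link_mbps, hop_latency_s=0.001):
    """Pipelined chain: the client's full uplink carries one copy;
    each hop adds only its latency once the pipeline fills."""
    bytes_per_s = link_mbps * 1e6 / 8
    return num_bytes / bytes_per_s + replicas * hop_latency_s

def fanout_time(num_bytes, replicas, link_mbps):
    """Naive fan-out: the uplink is divided across all replicas,
    so each copy gets 1/replicas of the bandwidth."""
    bytes_per_s = link_mbps * 1e6 / 8
    return replicas * num_bytes / bytes_per_s

# Pushing one 64 MB chunk to 3 replicas over 100 Mbps links:
B, R = 64 * 2**20, 3
# chain_time(B, R, 100)  -> ~5.37 s
# fanout_time(B, R, 100) -> ~16.11 s
```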
warning

If a write fails on some secondaries but succeeds on the primary, GFS does not roll back the primary. The affected chunk region enters an inconsistent state -- different replicas hold different data. The client retries the operation, but GFS does not guarantee that all replicas converge. Applications must handle this (e.g., by checksumming records and deduplicating on read). This is a direct consequence of GFS's relaxed consistency model.
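One way an application copes, sketched below: frame each record with a unique id and a checksum at write time, then validate and deduplicate at read time. The framing format and function names are illustrative, not a GFS API:

```python
import hashlib

def frame(record_id, payload):
    """Writer side: attach an id (for dedup) and a checksum (for validity)."""
    return (record_id, payload, hashlib.sha256(payload).hexdigest())

def read_valid(records):
    """Reader side: skip corrupt regions, keep the first copy of each id."""
    seen, out = set(), []
    for record_id, payload, checksum in records:
        if hashlib.sha256(payload).hexdigest() != checksum:
            continue  # corrupt or partially-written region
        if record_id in seen:
            continue  # duplicate left behind by a client retry
        seen.add(record_id)
        out.append(payload)
    return out
```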

Quiz
What would happen if GFS sent data directly from the client to ALL replicas simultaneously (fan-out) instead of using a chain topology?