Anatomy of a Write Operation
Reads are simple -- pick a replica and fetch bytes. Writes are where distributed file systems get hard. GFS must ensure that mutations are applied to every replica in the same order, even when multiple clients write concurrently. The mechanism that makes this work is the chunk lease.
Chunk leases
When a mutation (a write or a record append) targets a chunk, the master grants a 60-second lease to one of the replicas. That replica becomes the primary for the chunk and is responsible for:
- Assigning a serial order to all pending mutations
- Ensuring every replica applies mutations in that same order
Only one lease exists per chunk at any time, so concurrent writers are all directed to the same primary. The primary can request lease extensions from the master if mutations take longer than 60 seconds.
When the master grants a lease, it increments the chunk version number and notifies all replicas of the new version. This is how GFS detects stale replicas later.
The lease mechanism is GFS's alternative to distributed locking. It provides bounded-time mutual exclusion without the complexity of a full lock manager. If the primary crashes, the master simply waits for the lease to expire (at most 60 seconds) and grants a new one. This pattern also appears in BigTable tablet assignments and HDFS block leases.
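The lease discipline above can be sketched in a few lines. This is an illustrative model, not the GFS implementation: the class and method names (`LeaseManager`, `grant`, `extend`) are invented for the example, but it captures the three rules from the text -- one lease per chunk, a version bump on each new grant, and recovery by simply waiting out the expiry.

```python
import time

LEASE_DURATION = 60.0  # seconds, per the GFS design

class LeaseManager:
    """Toy model of the master's lease table (names are illustrative)."""

    def __init__(self):
        self.leases = {}    # chunk_id -> (primary, expiry_time)
        self.versions = {}  # chunk_id -> chunk version number

    def grant(self, chunk_id, replica, now=None):
        now = time.monotonic() if now is None else now
        lease = self.leases.get(chunk_id)
        if lease and lease[1] > now:
            # An unexpired lease exists: every writer sees the same primary.
            return lease[0], self.versions[chunk_id]
        # Lease absent or expired: bump the version and pick a new primary.
        # No explicit revocation is needed -- waiting out the 60 s bound
        # is the crash-recovery path.
        self.versions[chunk_id] = self.versions.get(chunk_id, 0) + 1
        self.leases[chunk_id] = (replica, now + LEASE_DURATION)
        return replica, self.versions[chunk_id]

    def extend(self, chunk_id, primary, now=None):
        now = time.monotonic() if now is None else now
        lease = self.leases.get(chunk_id)
        if lease and lease[0] == primary and lease[1] > now:
            self.leases[chunk_id] = (primary, now + LEASE_DURATION)
            return True
        return False
```

Note that the version number is incremented at grant time, before any mutation is applied; that is what lets the master later recognize a replica that missed mutations while the lease was held elsewhere.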
The two-phase write protocol
GFS splits data writing into two distinct phases that decouple data flow from control flow:
Phase 1: Data pushing (data flow)
- The client gets a list of replicas (primary + secondaries) from the master
- The client sends data to the closest replica
- That replica forwards data in a chain to the next closest replica, and so on
- Each replica stores the data in an internal buffer cache -- nothing is written to disk yet
- All replicas acknowledge receipt
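The chained push can be sketched as a recursion: each replica buffers the data, forwards it to the next-closest replica, and passes the acknowledgements back. The function and parameter names here are invented for illustration; real GFS pipelines the transfer so forwarding begins before the full payload arrives, which this sketch omits.

```python
def push_data(data, chain, buffers):
    """Push data along a chain of replicas ordered by network distance.

    chain:   list of replica ids, closest first
    buffers: dict mapping replica id -> list of buffered payloads
    Returns the list of replicas that acknowledged receipt.
    """
    if not chain:
        return []
    head, rest = chain[0], chain[1:]
    buffers.setdefault(head, []).append(data)  # buffer only -- nothing hits disk yet
    acks = push_data(data, rest, buffers)      # forward to the next-closest replica
    return [head] + acks                       # acks flow back toward the client
```

The chain topology means each machine sends the payload exactly once, so every node's full outbound bandwidth is used instead of fanning all copies out from the client.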
Phase 2: Write ordering (control flow)
- The client sends a write request to the primary
- The primary assigns consecutive serial numbers to all pending mutations
- The primary applies mutations to its own copy in serial-number order
- The primary forwards the write requests (in the same order) to all secondaries
- Secondaries apply mutations in the same serial-number order and acknowledge
- The primary replies to the client with success or error
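The control-flow phase can be modeled in miniature: the primary stamps each pending mutation with a serial number, applies it locally, then forwards it so secondaries apply in the identical order. The `Replica`/`Primary` classes are illustrative stand-ins, not GFS interfaces.

```python
class Replica:
    """Toy secondary: applies mutations in the order the primary dictates."""

    def __init__(self):
        self.data = []  # list of (serial, mutation) pairs, i.e. the chunk contents

    def apply(self, serial, mutation):
        self.data.append((serial, mutation))

class Primary(Replica):
    """Toy lease holder: serializes concurrent mutations for all replicas."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, mutations):
        for m in mutations:
            serial = self.next_serial      # assign a consecutive serial number
            self.next_serial += 1
            self.apply(serial, m)          # apply to the primary's own copy first
            for s in self.secondaries:     # forward in the same serial order
                s.apply(serial, m)
        return "success"
```

Because ordering comes only from the primary's serial counter, the buffered data from Phase 1 can arrive at replicas in any order without affecting the final byte layout.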
Step-by-step breakdown
| Step | Flow | Description |
|---|---|---|
| 1 | Client -> Master | Ask which ChunkServer holds the lease and where replicas are |
| 2 | Master -> Client | Reply with primary identity and secondary locations |
| 3 | Client -> Replicas | Push data to closest replica; data chains to all others |
| 4 | Client -> Primary | Send write request (after all replicas acknowledge data receipt) |
| 5 | Primary -> Secondaries | Forward mutation in serial-number order |
| 6 | Secondaries -> Primary | Acknowledge completion |
| 7 | Primary -> Client | Reply with success or error |
Why separate data flow from control flow?
| Concern | Handled by | Optimization |
|---|---|---|
| Bandwidth | Data flow (client -> chain of ChunkServers) | Chain topology maximizes each machine's outbound bandwidth |
| Ordering | Control flow (client -> primary -> secondaries) | Primary serializes all mutations; replicas apply in identical order |
| Consistency | Version numbers | Stale replicas detected by version mismatch on ChunkServer restart |
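The version-number row above is worth making concrete. A minimal sketch of stale-replica detection, with an invented helper name: when a ChunkServer restarts, it reports the versions of its chunks, and the master flags any chunk whose reported version lags the version recorded at the most recent lease grant.

```python
def find_stale_chunks(reported, current):
    """Return chunk ids whose reported version lags the master's record.

    reported: chunk_id -> version the restarting ChunkServer holds
    current:  chunk_id -> version the master recorded at lease grant
    """
    # A lower version means the replica missed mutations while it was down;
    # the master stops handing that replica out and schedules re-replication.
    return {cid for cid, v in reported.items() if v < current.get(cid, v)}
```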
If a write fails on some secondaries but succeeds on the primary, GFS does not roll back the primary. The affected chunk region enters an inconsistent state -- different replicas hold different data. The client retries the operation, but GFS does not guarantee that all replicas converge. Applications must handle this (e.g., by checksumming records and deduplicating on read). This is a direct consequence of GFS's relaxed consistency model.
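The checksum-and-deduplicate strategy mentioned above can be sketched as follows. The record format and helper names (`encode_record`, `read_records`) are invented for the example: each record carries a CRC32 checksum and a client-chosen unique id, so readers can skip corrupt regions and drop duplicates left behind by client retries.

```python
import zlib

def encode_record(uid, payload):
    """Frame a payload as: 4-byte CRC32 | uid | NUL | payload."""
    body = uid.encode() + b"\x00" + payload
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_records(records):
    """Validate and deduplicate records, in order of first appearance."""
    seen, out = set(), []
    for rec in records:
        checksum, body = rec[:4], rec[4:]
        if zlib.crc32(body).to_bytes(4, "big") != checksum:
            continue                       # corrupt/inconsistent region: skip it
        uid, _, payload = body.partition(b"\x00")
        if uid in seen:
            continue                       # duplicate from a client retry: skip it
        seen.add(uid)
        out.append(payload)
    return out
```

This pushes the cost of the relaxed consistency model onto readers, which is exactly the trade GFS makes: the file system stays simple and applications that care filter at read time.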