Anatomy of an Append Operation
What happens when hundreds of MapReduce workers need to write results to the same output file concurrently? Random writes would require external locking and serialization. GFS solves this with record append -- a first-class operation that lets multiple clients append to the same file atomically, without coordination among themselves.
Record append vs. random write
| Property | Random write | Record append |
|---|---|---|
| Offset | Client specifies | GFS chooses |
| Concurrency | Race conditions; region may contain mixed fragments | Atomic; each record written as a contiguous byte sequence |
| Guarantee | Undefined under concurrent writes | At-least-once atomicity |
| Analogy | pwrite() at a specific offset | open() with O_APPEND -- but safe with many concurrent clients across machines |
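The analogy row can be made concrete with local POSIX calls. This is an illustrative sketch (temp file, toy record names), not GFS itself: `pwrite` targets a caller-chosen offset, while `O_APPEND` lets the file system pick the offset, which is the property record append generalizes to a distributed setting.

```python
import os
import tempfile

# Throwaway temp file for the demo.
fd0, path = tempfile.mkstemp()
os.close(fd0)

# Random write: the caller specifies the offset explicitly (like pwrite()).
fd = os.open(path, os.O_WRONLY)
os.pwrite(fd, b"record-A", 0)   # offset 0 chosen by the caller
os.close(fd)

# Append: the kernel atomically positions each write at end-of-file,
# so concurrent appenders on one machine never overwrite each other.
fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(fd, b"record-B")       # offset chosen by the file system
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
print(data)                      # b'record-Arecord-B'
os.unlink(path)
```

GFS record append keeps the `O_APPEND` shape (the system picks the offset) but makes it hold across replicas and across clients on different machines.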
Record append is the operation GFS is most heavily optimized for. Google's dominant workloads -- producer-consumer queues, multi-way merge results, log collection -- all map naturally to concurrent appends.
How record append works
The flow mirrors the write operation with one critical difference: the primary decides where the data goes.
- The client pushes data to all replicas of the last chunk of the file (same chain topology as writes)
- The client sends the append request to the primary
- The primary checks: will this record fit in the current chunk (64MB limit)?
| Scenario | Primary's action |
|---|---|
| Record fits | Append data at the current offset, tell secondaries to write at the exact same offset, reply success to client |
| Record won't fit | Pad the current chunk to 64MB, instruct secondaries to do the same, tell the client to retry on the next chunk (records are capped at 1/4 of the chunk size to bound this padding waste) |
- On success, the primary returns the offset where the data was written
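The primary's fit-or-pad decision above can be sketched in a few lines. All names here (`Chunk`, `primary_append`) are hypothetical; real GFS also forwards the chosen offset to the secondaries and waits for their acknowledgments.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunk limit

class Chunk:
    """Toy stand-in for one GFS chunk replica (hypothetical structure)."""
    def __init__(self):
        self.data = bytearray()

def primary_append(chunk, record):
    """Primary's decision: if the record fits, write it at the current
    offset and return that offset; otherwise pad to the chunk boundary
    and tell the client to retry on the next chunk."""
    offset = len(chunk.data)
    if offset + len(record) <= CHUNK_SIZE:
        chunk.data += record  # real GFS: secondaries write at this same offset
        return ("ok", offset)
    # Won't fit: pad to 64MB so the next append starts a fresh chunk.
    chunk.data += b"\0" * (CHUNK_SIZE - offset)
    return ("retry_next_chunk", None)

chunk = Chunk()
status, offset = primary_append(chunk, b"hello")
print(status, offset)   # ok 0

chunk.data = bytearray(CHUNK_SIZE - 2)        # nearly full chunk
status2, _ = primary_append(chunk, b"too-big")
print(status2, len(chunk.data) == CHUNK_SIZE) # retry_next_chunk True
```

Note the padding: the tail of a chunk may contain filler bytes that readers must recognize and skip, which is one reason records carry their own checksums.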
At-least-once semantics
If an append fails on any replica, the client retries the entire operation. This means:
- Some replicas may contain duplicate records (the append succeeded on some replicas before failing on another; the retry then writes the record again on all of them)
- Replicas of the same chunk may not be byte-wise identical
- GFS guarantees only that the data is written at-least-once as an atomic unit
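A toy simulation makes the first two bullets concrete. All names are illustrative: the append "fails" before reaching replica B, so the client retries the whole operation, leaving a duplicate on A and byte-wise different replicas.

```python
def append_to_replicas(replicas, record, fail_before=None):
    """Apply an append to each replica in turn; simulate a failure
    that stops the operation before a given replica gets the record."""
    for name, log in replicas.items():
        if name == fail_before:
            return False  # replica B never sees the record; client sees failure
        log.append(record)
    return True

replicas = {"A": [], "B": []}

ok = append_to_replicas(replicas, "rec-1", fail_before="B")
if not ok:
    # Client retries the ENTIRE operation, not just the failed replica.
    append_to_replicas(replicas, "rec-1")

print(replicas["A"])  # ['rec-1', 'rec-1']  <- duplicate
print(replicas["B"])  # ['rec-1']           <- replicas are not identical
```

The guarantee that survives is exactly the third bullet: every replica contains at least one intact copy of the record; the bookkeeping for extras is the reader's problem.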
At-least-once means applications must handle duplicates. Google's approach: embed a unique record ID and checksum in each record. Readers skip duplicates and verify integrity. This pushes deduplication logic to the application layer, which has the context to do it correctly.
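A minimal sketch of that reader-side convention, assuming a hypothetical wire format (4-byte CRC32, 8-byte record ID, then payload); the source describes the idea but not a concrete encoding:

```python
import zlib

def encode(record_id, payload):
    """Frame a record as: crc32(id || payload) || id || payload."""
    body = record_id.to_bytes(8, "big") + payload
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_records(raw_records):
    """Reader: drop records that fail the checksum (padding, fragments)
    and skip record IDs already seen (duplicates from retried appends)."""
    seen, out = set(), []
    for raw in raw_records:
        crc, body = raw[:4], raw[4:]
        if zlib.crc32(body).to_bytes(4, "big") != crc:
            continue                      # corrupt or padding: skip
        rid = int.from_bytes(body[:8], "big")
        if rid in seen:
            continue                      # duplicate: skip
        seen.add(rid)
        out.append(body[8:])
    return out

log = [
    encode(1, b"alpha"),
    encode(1, b"alpha"),   # duplicate from a retried append
    b"\0" * 12,            # chunk padding / garbage
    encode(2, b"beta"),
]
records = read_records(log)
print(records)   # [b'alpha', b'beta']
```

The storage layer stays simple because it never inspects record contents; only the application knows what a record ID means and which duplicates are safe to drop.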
"Why at-least-once instead of exactly-once?" is a classic distributed systems question. Exactly-once requires either two-phase commit (expensive, blocks on failure) or idempotent operations with deduplication at the storage layer (complex). GFS chose the simplest approach: guarantee atomicity, accept duplicates, let applications deduplicate. This is the same philosophy behind Kafka's at-least-once delivery and many message queue systems. Always discuss the tradeoff between system complexity and application complexity.