Anatomy of an Append Operation

What happens when hundreds of MapReduce workers need to write results to the same output file concurrently? Random writes would require external locking and serialization. GFS solves this with record append -- a first-class operation that lets multiple clients append to the same file atomically, without coordination among themselves.

Think first
Why is a regular write (where the client specifies the offset) problematic when hundreds of clients write to the same file concurrently?

Record append vs. random write

| Property | Random write | Record append |
| --- | --- | --- |
| Offset | Client specifies | GFS chooses |
| Concurrency | Race conditions; region may contain mixed fragments | Atomic; each record written as a contiguous byte sequence |
| Guarantee | Undefined under concurrent writes | At-least-once atomicity |
| Analogy | `pwrite()` at a specific offset | `open()` with `O_APPEND` -- but safe under concurrency |

Record append is the operation GFS is most heavily optimized for. Google's dominant workloads -- producer-consumer queues, multi-way merge results, log collection -- all map naturally to concurrent appends.
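The contrast in the table can be sketched as two call shapes. This is an illustrative in-memory model, not the real GFS client API -- `GfsFile`, `write`, and `record_append` are hypothetical names; the point is who chooses the offset.

```python
class GfsFile:
    """Toy single-process model of a GFS file (illustrative only)."""

    def __init__(self):
        self.data = bytearray()

    def write(self, offset: int, record: bytes) -> None:
        # Random write: the CLIENT picks the offset. Two concurrent
        # writers that pick the same offset silently clobber each other.
        end = offset + len(record)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = record

    def record_append(self, record: bytes) -> int:
        # Record append: the SYSTEM picks the offset and returns it,
        # so each record lands as one contiguous byte sequence.
        offset = len(self.data)
        self.data.extend(record)
        return offset


f = GfsFile()
off_a = f.record_append(b"worker-A result")
off_b = f.record_append(b"worker-B result")
# Each caller learns where its record ended up; no coordination needed.
```

With `record_append`, hundreds of MapReduce workers can call the same operation on the same file and simply record the offset they get back.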

How record append works

The flow mirrors the write operation with one critical difference: the primary decides where the data goes.

  1. The client pushes data to all replicas of the last chunk of the file (same chain topology as writes).
  2. The client sends the append request to the primary.
  3. The primary checks: will this record fit in the current chunk (64 MB limit)?

| Scenario | Primary's action |
| --- | --- |
| Record fits | Append the data at the current offset, tell the secondaries to write at the exact same offset, and reply success to the client |
| Record won't fit | Pad the current chunk to 64 MB, instruct the secondaries to do the same, and tell the client to retry on the next chunk |

  4. On success, the primary returns the offset where the data was written.
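The primary's decision in step 3 can be sketched as a small function. This is a simplified model (function and return values are illustrative); the real system also bounds record size relative to the chunk size to limit padding waste.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk limit

def primary_append(chunk_used: int, record_len: int) -> tuple:
    """Sketch of the primary's fit check for a record append.

    chunk_used: bytes already written in the last chunk.
    record_len: size of the record the client wants to append.
    """
    if chunk_used + record_len <= CHUNK_SIZE:
        # Record fits: append here, and tell every secondary to
        # write the record at exactly this offset.
        offset = chunk_used
        return ("append", offset)
    # Record won't fit: pad the rest of the chunk on all replicas,
    # then tell the client to retry against the next chunk.
    padding = CHUNK_SIZE - chunk_used
    return ("pad_and_retry", padding)
```

Because every replica is told the exact same offset, a successful append leaves the record at the same position in all replicas -- that is what makes the operation atomic despite concurrent clients.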

At-least-once semantics

If an append fails on any replica, the client retries the entire operation. This means:

  • Some replicas may contain duplicate records (the append succeeded on some replicas but failed on another, so the retry writes the record again)
  • Replicas of the same chunk may not be byte-wise identical
  • GFS guarantees only that the data is written at-least-once as an atomic unit
Warning
At-least-once means applications must handle duplicates. Google's approach: embed a unique record ID and checksum in each record. Readers skip duplicates and verify integrity. This pushes deduplication logic to the application layer, which has the context to do it correctly.
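A reader that tolerates at-least-once semantics might look like the sketch below. The record framing here (`frame`, the `id`/CRC/length header layout) is a hypothetical format invented for illustration, not Google's actual one; it just shows the two checks the text describes: verify the checksum, and skip record IDs already seen.

```python
import struct
import zlib

# Illustrative framing: [8-byte id][4-byte CRC32 of payload][4-byte length][payload]
HEADER = struct.Struct(">QII")

def frame(record_id: int, payload: bytes) -> bytes:
    """Writer side: prepend a unique id and a checksum to each record."""
    return HEADER.pack(record_id, zlib.crc32(payload), len(payload)) + payload

def read_records(stream: bytes):
    """Reader side: yield each valid record exactly once.

    Duplicates (from retried appends) are skipped by id; records whose
    checksum does not match are skipped as corrupt.
    """
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(stream):
        rid, crc, length = HEADER.unpack_from(stream, pos)
        payload = stream[pos + HEADER.size : pos + HEADER.size + length]
        pos += HEADER.size + length
        if zlib.crc32(payload) != crc:
            continue  # corrupt record: skip
        if rid in seen:
            continue  # duplicate from a retried append: skip
        seen.add(rid)
        yield rid, payload

# A retried append duplicated record id 2; the reader sees it once.
log = frame(1, b"alpha") + frame(2, b"beta") + frame(2, b"beta")
results = list(read_records(log))
```

The storage layer stays simple; the reader, which knows what a "record" means, does the deduplication.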

Interview angle

"Why at-least-once instead of exactly-once?" is a classic distributed systems question. Exactly-once requires either two-phase commit (expensive, blocks on failure) or idempotent operations with deduplication at the storage layer (complex). GFS chose the simplest approach: guarantee atomicity, accept duplicates, let applications deduplicate. This is the same philosophy behind Kafka's at-least-once delivery and many message queue systems. Always discuss the tradeoff between system complexity and application complexity.

Quiz
What would happen if GFS tried to provide exactly-once append semantics instead of at-least-once?