Anatomy of an Append Operation
What happens when hundreds of MapReduce workers need to write results to the same output file concurrently? Random writes would require external locking and serialization. GFS solves this with record append -- a first-class operation that lets multiple clients append to the same file atomically, without coordination among themselves.
Record append vs. random write
| Property | Random write | Record append |
|---|---|---|
| Offset | Client specifies | GFS chooses |
| Concurrency | Race conditions; region may contain mixed fragments | Atomic; each record written as a contiguous byte sequence |
| Guarantee | Undefined under concurrent writes | At-least-once atomicity |
| Analogy | pwrite() at a specific offset | open() with O_APPEND -- but safe with many concurrent clients across machines |
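The analogy row can be made concrete with local POSIX calls. This is an illustrative sketch (temp file, toy record names), not GFS itself: `pwrite` targets a caller-chosen offset, while `O_APPEND` lets the file system pick the offset, which is the property record append generalizes to a distributed setting.

```python
import os
import tempfile

# Throwaway temp file for the demo.
fd0, path = tempfile.mkstemp()
os.close(fd0)

# Random write: the caller specifies the offset explicitly (like pwrite()).
fd = os.open(path, os.O_WRONLY)
os.pwrite(fd, b"record-A", 0)   # offset 0 chosen by the caller
os.close(fd)

# Append: the kernel atomically positions each write at end-of-file,
# so concurrent appenders on one machine never overwrite each other.
fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(fd, b"record-B")       # offset chosen by the file system
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
print(data)                      # b'record-Arecord-B'
os.unlink(path)
```

GFS record append keeps the `O_APPEND` shape (the system picks the offset) but makes it hold across replicas and across clients on different machines.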
Record append is the operation GFS is most heavily optimized for. Google's dominant workloads -- producer-consumer queues, multi-way merge results, log collection -- all map naturally to concurrent appends.
How record append works
The flow mirrors the write operation with one critical difference: the primary decides where the data goes.
- The client pushes data to all replicas of the last chunk of the file (same chain topology as writes)
- The client sends the append request to the primary
- The primary checks: will this record fit in the current chunk (64MB limit)?
| Scenario | Primary's action |
|---|---|
| Record fits | Append data at the current offset, tell secondaries to write at the exact same offset, reply success to client |
| Record won't fit | Pad the current chunk to 64MB, instruct secondaries to do the same, tell the client to retry on the next chunk (records are capped at 1/4 of the chunk size to bound this padding waste) |
- On success, the primary returns the offset where the data was written
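The primary's fit-or-pad decision above can be sketched in a few lines. All names here (`Chunk`, `primary_append`) are hypothetical; real GFS also forwards the chosen offset to the secondaries and waits for their acknowledgments.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunk limit

class Chunk:
    """Toy stand-in for one GFS chunk replica (hypothetical structure)."""
    def __init__(self):
        self.data = bytearray()

def primary_append(chunk, record):
    """Primary's decision: if the record fits, write it at the current
    offset and return that offset; otherwise pad to the chunk boundary
    and tell the client to retry on the next chunk."""
    offset = len(chunk.data)
    if offset + len(record) <= CHUNK_SIZE:
        chunk.data += record  # real GFS: secondaries write at this same offset
        return ("ok", offset)
    # Won't fit: pad to 64MB so the next append starts a fresh chunk.
    chunk.data += b"\0" * (CHUNK_SIZE - offset)
    return ("retry_next_chunk", None)

chunk = Chunk()
status, offset = primary_append(chunk, b"hello")
print(status, offset)   # ok 0

chunk.data = bytearray(CHUNK_SIZE - 2)        # nearly full chunk
status2, _ = primary_append(chunk, b"too-big")
print(status2, len(chunk.data) == CHUNK_SIZE) # retry_next_chunk True
```

Note the padding: the tail of a chunk may contain filler bytes that readers must recognize and skip, which is one reason records carry their own checksums.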
At-least-once semantics
If an append fails on any replica, the client retries the entire operation. This means:
- Some replicas may contain duplicate records (the append succeeded on some replicas before failing on another; the retry then writes the record again on all of them)
- Replicas of the same chunk may not be byte-wise identical
- GFS guarantees only that the data is written at-least-once as an atomic unit
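A toy simulation makes the first two bullets concrete. All names are illustrative: the append "fails" before reaching replica B, so the client retries the whole operation, leaving a duplicate on A and byte-wise different replicas.

```python
def append_to_replicas(replicas, record, fail_before=None):
    """Apply an append to each replica in turn; simulate a failure
    that stops the operation before a given replica gets the record."""
    for name, log in replicas.items():
        if name == fail_before:
            return False  # replica B never sees the record; client sees failure
        log.append(record)
    return True

replicas = {"A": [], "B": []}

ok = append_to_replicas(replicas, "rec-1", fail_before="B")
if not ok:
    # Client retries the ENTIRE operation, not just the failed replica.
    append_to_replicas(replicas, "rec-1")

print(replicas["A"])  # ['rec-1', 'rec-1']  <- duplicate
print(replicas["B"])  # ['rec-1']           <- replicas are not identical
```

The guarantee that survives is exactly the third bullet: every replica contains at least one intact copy of the record; the bookkeeping for extras is the reader's problem.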
At-least-once means applications must handle duplicates. Google's approach: embed a unique record ID and checksum in each record. Readers skip duplicates and verify integrity. This pushes deduplication logic to the application layer, which has the context to do it correctly.
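A minimal sketch of that reader-side convention, assuming a hypothetical wire format (4-byte CRC32, 8-byte record ID, then payload); the source describes the idea but not a concrete encoding:

```python
import zlib

def encode(record_id, payload):
    """Frame a record as: crc32(id || payload) || id || payload."""
    body = record_id.to_bytes(8, "big") + payload
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_records(raw_records):
    """Reader: drop records that fail the checksum (padding, fragments)
    and skip record IDs already seen (duplicates from retried appends)."""
    seen, out = set(), []
    for raw in raw_records:
        crc, body = raw[:4], raw[4:]
        if zlib.crc32(body).to_bytes(4, "big") != crc:
            continue                      # corrupt or padding: skip
        rid = int.from_bytes(body[:8], "big")
        if rid in seen:
            continue                      # duplicate: skip
        seen.add(rid)
        out.append(body[8:])
    return out

log = [
    encode(1, b"alpha"),
    encode(1, b"alpha"),   # duplicate from a retried append
    b"\0" * 12,            # chunk padding / garbage
    encode(2, b"beta"),
]
records = read_records(log)
print(records)   # [b'alpha', b'beta']
```

The storage layer stays simple because it never inspects record contents; only the application knows what a record ID means and which duplicates are safe to drop.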
"Why at-least-once instead of exactly-once?" is a classic distributed systems question. Exactly-once requires either two-phase commit (expensive, blocks on failure) or idempotent operations with deduplication at the storage layer (complex). GFS chose the simplest approach: guarantee atomicity, accept duplicates, let applications deduplicate. This is the same philosophy behind Kafka's at-least-once delivery and many message queue systems. Always discuss the tradeoff between system complexity and application complexity.