Anatomy of a Write Operation

A write to HDFS must replicate every block to multiple DataNodes before acknowledging success. This replication pipeline is the heart of HDFS's durability guarantee -- and the reason writes are slower than reads.

Think first
HDFS must write every block to three DataNodes before acknowledging success. Should the client send the block to all three DataNodes simultaneously (fan-out), or should the DataNodes forward the block to each other in a chain (pipeline)? What are the bandwidth implications of each approach?
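A back-of-envelope calculation (numbers illustrative) shows what is at stake for the client's uplink in each approach:

```python
# Hypothetical comparison (not HDFS code): bytes the client's own uplink
# must carry to store one 128 MB block at replication factor 3.
BLOCK_MB = 128
REPLICAS = 3

# Fan-out: the client itself sends a full copy to each DataNode.
fan_out_client_mb = BLOCK_MB * REPLICAS

# Pipeline: the client sends one copy to the first DataNode; the
# DN1 -> DN2 and DN2 -> DN3 hops use the cluster's internal network.
pipeline_client_mb = BLOCK_MB

print(fan_out_client_mb)   # 384
print(pipeline_client_mb)  # 128
```

Pipelining triples nothing on the client side: the replication traffic moves onto the datacenter network between DataNodes, which is typically much faster than the client's link.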

Write flow step by step

| Step | What happens |
| --- | --- |
| 1 | Client calls create() on the DistributedFileSystem object |
| 2 | DistributedFileSystem sends a file creation request to the NameNode |
| 3 | NameNode verifies that the file does not already exist and that the client has write permission, then creates the file record and acknowledges |
| 4 | Client writes data through FSDataOutputStream |
| 5 | FSDataOutputStream buffers data into a local data queue until a full block (128 MB by default) accumulates |
| 6 | The DataStreamer component is notified that a complete block is ready |
| 7 | DataStreamer asks the NameNode to allocate a new block and select target DataNodes (using the rack-aware placement policy) |
| 8 | NameNode returns the block location list (e.g., DN1, DN2, DN3) |
| 9 | DataStreamer begins transferring the block to the nearest DataNode |
| 10 | The first DataNode pipelines the block to the second, which pipelines it to the third -- replicas are written during the file write itself |
| 11 | DataStreamer waits for acknowledgments from all DataNodes in the pipeline |
| 12 | After all blocks are written and acknowledged, the client calls close() |
| 13 | DistributedFileSystem notifies the NameNode that the write is complete; the NameNode commits the file, making it visible to readers |
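The flow above can be condensed into a toy simulation. All class and method names here are illustrative stand-ins, not Hadoop APIs:

```python
# Toy model of the HDFS write flow: create, allocate blocks, pipeline
# each block through the DataNodes, then commit on close.

class NameNode:
    def __init__(self):
        self.files = {}        # path -> list of committed block ids
        self.next_block = 0

    def create(self, path):
        # Steps 2-3: reject duplicates, then record the new file.
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = []

    def allocate_block(self, path):
        # Steps 7-8: new block id plus target DataNodes
        # (chosen rack-aware in real HDFS; hardcoded here).
        self.next_block += 1
        return self.next_block, ["dn1", "dn2", "dn3"]

    def complete(self, path, blocks):
        # Step 13: commit -- the file becomes visible to readers.
        self.files[path] = blocks

def write_file(namenode, path, data, block_size=4):
    namenode.create(path)                                   # steps 1-3
    blocks = []
    for off in range(0, len(data), block_size):             # steps 4-6
        block_id, pipeline = namenode.allocate_block(path)  # steps 7-8
        # Steps 9-11: stream through the chain; every DataNode must ACK.
        acks = [True for _dn in pipeline]
        assert all(acks)
        blocks.append(block_id)
    namenode.complete(path, blocks)                         # steps 12-13

nn = NameNode()
write_file(nn, "/logs/a.txt", b"0123456789")
print(nn.files["/logs/a.txt"])   # [1, 2, 3]
```

Note how the NameNode is contacted once per block, not once per write call: that is the block-level buffering discussed below.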
Warning

If the NameNode dies before step 13 (the final commit), the file is lost. The blocks exist on DataNodes, but without the NameNode's metadata record, there is no way to reconstruct the file. This is why NameNode resilience (EditLog, FsImage, HA) is critical.

Key design choices

| Choice | Rationale |
| --- | --- |
| Pipeline replication | DataNodes forward blocks to each other in a chain rather than the client sending to all replicas. This distributes network load and spares the client's limited uplink bandwidth. GFS uses the same pipelining approach. |
| All-replica acknowledgment | HDFS does not acknowledge a write until every replica is written. This provides strong consistency but increases write latency. |
| Block-level buffering | Accumulating a full 128 MB block before requesting placement reduces NameNode interactions: one allocation per block, not per write call. |
| Single writer | Only one client can write to a file at a time. HDFS enforces this with a lease mechanism -- the writing client holds a lease that expires if not renewed. |
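The single-writer lease mentioned above can be sketched as a table of expiring grants. This is illustrative only: real leases live in the NameNode and are renewed by the client in the background, and the TTL below is an assumed value:

```python
# Toy single-writer lease table (not the real HDFS LeaseManager).
import time

LEASE_TTL = 60.0  # assumed seconds; HDFS has its own soft/hard limits

class LeaseTable:
    def __init__(self):
        self.leases = {}  # path -> (client_id, expiry_time)

    def acquire(self, path, client, now=None):
        now = time.monotonic() if now is None else now
        holder = self.leases.get(path)
        # Reject a second writer while another client's lease is live.
        if holder and holder[0] != client and holder[1] > now:
            raise PermissionError(f"{path} is held by {holder[0]}")
        # Grant a new lease, or renew the caller's existing one.
        self.leases[path] = (client, now + LEASE_TTL)

leases = LeaseTable()
leases.acquire("/data/f", "client-A", now=0.0)
try:
    leases.acquire("/data/f", "client-B", now=10.0)   # rejected: lease live
except PermissionError:
    pass
leases.acquire("/data/f", "client-B", now=100.0)      # allowed: lease expired
```

The expiry is what makes the scheme crash-safe: a writer that dies without releasing the lease blocks other writers only until the lease times out.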
Interview angle

The write pipeline is a frequent interview topic. Walk through it in order: client buffers a block, DataStreamer gets target nodes from the NameNode, data pipelines through the chain, all replicas ACK, then the next block starts. The critical detail is that HDFS does not ACK the client until all replicas confirm -- this is how it guarantees strong consistency without concurrent writers.

What about the EditLog?

Every metadata change (file creation, block allocation, file close) gets written to the EditLog before the NameNode applies it. This write-ahead log ensures that metadata survives NameNode crashes. The data blocks themselves are safe on the DataNodes; the EditLog protects the mapping between files and blocks.
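The write-ahead idea can be shown with a minimal sketch (not the real FSEditLog format): append the mutation to a durable log first, and after a crash rebuild the in-memory namespace by replaying that log:

```python
# Minimal write-ahead-log sketch for namespace metadata (illustrative).
import json

class EditLog:
    def __init__(self):
        self.entries = []          # stands in for an fsync'd on-disk log

    def log(self, op, **args):
        # Append the mutation BEFORE the in-memory state is changed.
        self.entries.append(json.dumps({"op": op, **args}))

def replay(entries):
    # Crash recovery: rebuild the file -> blocks mapping from the log.
    files = {}
    for raw in entries:
        e = json.loads(raw)
        if e["op"] == "create":
            files[e["path"]] = []
        elif e["op"] == "add_block":
            files[e["path"]].append(e["block"])
    return files

log = EditLog()
log.log("create", path="/a")
log.log("add_block", path="/a", block=1)
print(replay(log.entries))   # {'/a': [1]}
```

Real HDFS periodically folds the EditLog into an FsImage checkpoint so replay stays fast; the sketch omits that step.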

Quiz
What would happen if HDFS acknowledged a write to the client after only one replica is written (instead of waiting for all three)?