Anatomy of a Write Operation

A write to HDFS must replicate every block to multiple DataNodes before acknowledging success. This replication pipeline is the heart of HDFS's durability guarantee -- and the reason writes are slower than reads.

Think first
HDFS must write every block to three DataNodes before acknowledging success. Should the client send the block to all three DataNodes simultaneously (fan-out), or should the DataNodes forward the block to each other in a chain (pipeline)? What are the bandwidth implications of each approach?
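A back-of-envelope calculation (numbers illustrative) shows what is at stake for the client's uplink in each approach:

```python
# Hypothetical comparison (not HDFS code): bytes the client's own uplink
# must carry to store one 128 MB block at replication factor 3.
BLOCK_MB = 128
REPLICAS = 3

# Fan-out: the client itself sends a full copy to each DataNode.
fan_out_client_mb = BLOCK_MB * REPLICAS

# Pipeline: the client sends one copy to the first DataNode; the
# DN1 -> DN2 and DN2 -> DN3 hops use the cluster's internal network.
pipeline_client_mb = BLOCK_MB

print(fan_out_client_mb)   # 384
print(pipeline_client_mb)  # 128
```

Pipelining triples nothing on the client side: the replication traffic moves onto the datacenter network between DataNodes, which is typically much faster than the client's link.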

Write flow step by step

| Step | What happens |
| --- | --- |
| 1 | Client calls create() on the DistributedFileSystem object |
| 2 | DistributedFileSystem sends a file creation request to the NameNode |
| 3 | NameNode verifies that the file does not already exist and that the client has write permission, then creates the file record and acknowledges |
| 4 | Client writes data through FSDataOutputStream |
| 5 | FSDataOutputStream buffers data into a local data queue until a full block (128 MB by default) accumulates |
| 6 | The DataStreamer component is notified that a complete block is ready |
| 7 | DataStreamer asks the NameNode to allocate a new block and select target DataNodes (using the rack-aware placement policy) |
| 8 | NameNode returns the block location list (e.g., DN1, DN2, DN3) |
| 9 | DataStreamer begins transferring the block to the nearest DataNode |
| 10 | The first DataNode pipelines the block to the second, which pipelines it to the third -- replicas are written during the file write itself |
| 11 | DataStreamer waits for acknowledgments from all DataNodes in the pipeline |
| 12 | After all blocks are written and acknowledged, the client calls close() |
| 13 | DistributedFileSystem notifies the NameNode that the write is complete; the NameNode commits the file, making it visible to readers |
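The flow above can be condensed into a toy simulation. All class and method names here are illustrative stand-ins, not Hadoop APIs:

```python
# Toy model of the HDFS write flow: create, allocate blocks, pipeline
# each block through the DataNodes, then commit on close.

class NameNode:
    def __init__(self):
        self.files = {}        # path -> list of committed block ids
        self.next_block = 0

    def create(self, path):
        # Steps 2-3: reject duplicates, then record the new file.
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = []

    def allocate_block(self, path):
        # Steps 7-8: new block id plus target DataNodes
        # (chosen rack-aware in real HDFS; hardcoded here).
        self.next_block += 1
        return self.next_block, ["dn1", "dn2", "dn3"]

    def complete(self, path, blocks):
        # Step 13: commit -- the file becomes visible to readers.
        self.files[path] = blocks

def write_file(namenode, path, data, block_size=4):
    namenode.create(path)                                   # steps 1-3
    blocks = []
    for off in range(0, len(data), block_size):             # steps 4-6
        block_id, pipeline = namenode.allocate_block(path)  # steps 7-8
        # Steps 9-11: stream through the chain; every DataNode must ACK.
        acks = [True for _dn in pipeline]
        assert all(acks)
        blocks.append(block_id)
    namenode.complete(path, blocks)                         # steps 12-13

nn = NameNode()
write_file(nn, "/logs/a.txt", b"0123456789")
print(nn.files["/logs/a.txt"])   # [1, 2, 3]
```

Note how the NameNode is contacted once per block, not once per write call: that is the block-level buffering discussed below.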
Warning

If the NameNode dies before step 13 (the final commit), the file is lost. The blocks exist on DataNodes, but without the NameNode's metadata record, there is no way to reconstruct the file. This is why NameNode resilience (EditLog, FsImage, HA) is critical.

Key design choices

| Choice | Rationale |
| --- | --- |
| Pipeline replication | DataNodes forward blocks to each other in a chain rather than the client sending to all replicas. This distributes network load and spares the client's limited uplink bandwidth. GFS uses the same pipelining approach. |
| All-replica acknowledgment | HDFS does not acknowledge a write until every replica is written. This provides strong consistency but increases write latency. |
| Block-level buffering | Accumulating a full 128 MB block before requesting placement reduces NameNode interactions: one allocation per block, not per write call. |
| Single writer | Only one client can write to a file at a time. HDFS enforces this with a lease mechanism -- the writing client holds a lease that expires if not renewed. |
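The single-writer lease mentioned above can be sketched as a table of expiring grants. This is illustrative only: real leases live in the NameNode and are renewed by the client in the background, and the TTL below is an assumed value:

```python
# Toy single-writer lease table (not the real HDFS LeaseManager).
import time

LEASE_TTL = 60.0  # assumed seconds; HDFS has its own soft/hard limits

class LeaseTable:
    def __init__(self):
        self.leases = {}  # path -> (client_id, expiry_time)

    def acquire(self, path, client, now=None):
        now = time.monotonic() if now is None else now
        holder = self.leases.get(path)
        # Reject a second writer while another client's lease is live.
        if holder and holder[0] != client and holder[1] > now:
            raise PermissionError(f"{path} is held by {holder[0]}")
        # Grant a new lease, or renew the caller's existing one.
        self.leases[path] = (client, now + LEASE_TTL)

leases = LeaseTable()
leases.acquire("/data/f", "client-A", now=0.0)
try:
    leases.acquire("/data/f", "client-B", now=10.0)   # rejected: lease live
except PermissionError:
    pass
leases.acquire("/data/f", "client-B", now=100.0)      # allowed: lease expired
```

The expiry is what makes the scheme crash-safe: a writer that dies without releasing the lease blocks other writers only until the lease times out.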
Interview angle

The write pipeline is a frequent interview topic. Walk through it in order: client buffers a block, DataStreamer gets target nodes from the NameNode, data pipelines through the chain, all replicas ACK, then the next block starts. The critical detail is that HDFS does not ACK the client until all replicas confirm -- this is how it guarantees strong consistency without concurrent writers.

What about the EditLog?

Every metadata change (file creation, block allocation, file close) gets written to the EditLog before the NameNode applies it. This write-ahead log ensures that metadata survives NameNode crashes. The data blocks themselves are safe on the DataNodes; the EditLog protects the mapping between files and blocks.
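The write-ahead idea can be shown with a minimal sketch (not the real FSEditLog format): append the mutation to a durable log first, and after a crash rebuild the in-memory namespace by replaying that log:

```python
# Minimal write-ahead-log sketch for namespace metadata (illustrative).
import json

class EditLog:
    def __init__(self):
        self.entries = []          # stands in for an fsync'd on-disk log

    def log(self, op, **args):
        # Append the mutation BEFORE the in-memory state is changed.
        self.entries.append(json.dumps({"op": op, **args}))

def replay(entries):
    # Crash recovery: rebuild the file -> blocks mapping from the log.
    files = {}
    for raw in entries:
        e = json.loads(raw)
        if e["op"] == "create":
            files[e["path"]] = []
        elif e["op"] == "add_block":
            files[e["path"]].append(e["block"])
    return files

log = EditLog()
log.log("create", path="/a")
log.log("add_block", path="/a", block=1)
print(replay(log.entries))   # {'/a': [1]}
```

Real HDFS periodically folds the EditLog into an FsImage checkpoint so replay stays fast; the sketch omits that step.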

Quiz
What would happen if HDFS acknowledged a write to the client after only one replica is written (instead of waiting for all three)?