Anatomy of a Write Operation
A write to HDFS must replicate every block to multiple DataNodes before acknowledging success. This replication pipeline is the heart of HDFS's durability guarantee -- and the reason writes are slower than reads.
Write flow step by step
| Step | What happens |
|---|---|
| 1 | Client calls create() on the DistributedFileSystem object |
| 2 | DistributedFileSystem sends a file-creation request to the NameNode |
| 3 | NameNode verifies that the file does not already exist and that the client has write permission, then creates the file record and acknowledges |
| 4 | Client writes data through FSDataOutputStream |
| 5 | FSDataOutputStream splits the data into packets and queues them in a local data queue |
| 6 | The DataStreamer consumes the data queue; at the start of the file, and whenever the current block (128 MB by default) fills, it needs a fresh block |
| 7 | DataStreamer asks the NameNode to allocate a new block and select target DataNodes (using the rack-aware placement policy) |
| 8 | NameNode returns the block location list (e.g., DN1, DN2, DN3) |
| 9 | DataStreamer begins transferring data to the first DataNode in the list (the nearest one, per the NameNode's proximity ordering) |
| 10 | The first DataNode pipelines the block to the second, which pipelines to the third -- replicas are written during the file write itself |
| 11 | DataStreamer waits for acknowledgments from all DataNodes in the pipeline |
| 12 | After all blocks are written and acknowledged, the client calls close() |
| 13 | DistributedFileSystem notifies the NameNode that the write is complete; the NameNode commits the file, making it visible to readers |
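The steps above can be condensed into a toy simulation. Everything here (MockNameNode, MockDataNode, write_file) is a hypothetical stand-in for illustration, not the real Hadoop client classes:

```python
# Toy model of the HDFS write pipeline -- illustrative only, not the Hadoop API.
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size
REPLICATION = 3

class MockNameNode:
    def __init__(self):
        self.files = {}      # path -> list of block IDs (the file-to-block mapping)
        self.next_block = 0

    def create(self, path):                      # steps 1-3
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = []

    def allocate_block(self, path, datanodes):   # steps 7-8
        block_id = self.next_block
        self.next_block += 1
        targets = datanodes[:REPLICATION]        # real NameNode is rack-aware
        self.files[path].append(block_id)
        return block_id, targets

class MockDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        """Store the data, forward down the pipeline, return the ACK upstream."""
        self.blocks[block_id] = data
        if downstream:                           # step 10: chain to next replica
            downstream[0].receive(block_id, data, downstream[1:])
        return "ACK"                             # step 11: ack flows back

def write_file(namenode, datanodes, path, data):
    namenode.create(path)
    for off in range(0, len(data), BLOCK_SIZE):  # one allocation per block
        chunk = data[off:off + BLOCK_SIZE]
        block_id, targets = namenode.allocate_block(path, datanodes)
        ack = targets[0].receive(block_id, chunk, targets[1:])  # steps 9-11
        assert ack == "ACK"                      # client only proceeds after ACK
    # close(): steps 12-13 -- NameNode commits the file record
```

After the write, every DataNode in the pipeline holds every block, which is the durability guarantee the acknowledgment step enforces.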
If the NameNode's metadata is lost before step 13 (the final commit), the file is lost. The blocks still exist on DataNodes, but without the NameNode's file-to-block mapping there is no way to reconstruct the file. This is why NameNode resilience (EditLog, FsImage, HA) is critical.
Key design choices
| Choice | Rationale |
|---|---|
| Pipeline replication | DataNodes forward blocks to each other in a chain rather than the client sending to all replicas. This distributes network load and uses the client's bandwidth efficiently. GFS uses the same pipelining approach. |
| All-replica acknowledgment | HDFS does not acknowledge a write until every replica is written. This provides strong consistency but increases write latency. |
| Block-level allocation | Allocating storage in large 128 MB blocks reduces the number of NameNode interactions (one block allocation per 128 MB of data, not one per write call). |
| Single writer | Only one client can write to a file at a time. HDFS uses a lease mechanism to enforce this -- the writing client holds a lease that expires if not renewed. |
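The single-writer lease in the last row can be sketched as a table of (holder, expiry) pairs. LeaseManager and LEASE_PERIOD are illustrative names, not HDFS's actual API; the real implementation uses a soft limit (default 60 s), after which another client can trigger lease recovery, and a hard limit (1 h):

```python
# Minimal sketch of lease-based single-writer enforcement (illustrative names).
import time

LEASE_PERIOD = 60.0  # seconds; stands in for HDFS's soft limit

class LeaseManager:
    def __init__(self):
        self.leases = {}  # path -> (holder client ID, expiry timestamp)

    def acquire(self, path, client, now=None):
        """Grant or renew a lease; refuse if another client holds a live lease."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(path)
        if holder and holder[0] != client and holder[1] > now:
            raise PermissionError(f"{path} is leased to {holder[0]}")
        self.leases[path] = (client, now + LEASE_PERIOD)

    def renew(self, path, client, now=None):
        self.acquire(path, client, now)  # renewing just pushes out the expiry
```

A writer that stops renewing (crashes, hangs) eventually loses the lease, so another client can take over the file rather than it staying locked forever.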
The write pipeline is a frequent interview topic. Walk through it in order: the client queues data, the DataStreamer gets target DataNodes from the NameNode, data pipelines through the chain, all replicas ACK, then the next block starts. The critical detail is that HDFS does not ACK the client until every replica confirms; combined with the single-writer rule, this is what keeps all replicas consistent.
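One way to see why chained pipelining beats the client fanning out to all replicas is a back-of-envelope latency model (an assumption-laden sketch: it ignores acks, disk I/O, and TCP dynamics, and all numbers are illustrative):

```python
# Back-of-envelope comparison: pipelined replication vs. client fan-out.
def transfer_s(size_mb, link_mbps):
    return size_mb * 8 / link_mbps

def pipelined(block_mb, packet_kb, link_mbps, replicas):
    """Each hop only adds ~one packet of delay, because hops overlap in time."""
    return transfer_s(block_mb, link_mbps) + (replicas - 1) * transfer_s(packet_kb / 1024, link_mbps)

def client_fanout(block_mb, link_mbps, replicas):
    """Client sends the block to every replica itself over one uplink."""
    return replicas * transfer_s(block_mb, link_mbps)
```

With a 128 MB block, 64 KB packets, a 1 Gb/s link, and 3 replicas, the pipelined transfer takes barely longer than a single copy, while the fan-out ties up the client's uplink for roughly three full block transfers.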
What about the EditLog?
Every metadata change (file creation, block allocation, file close) gets written to the EditLog before the NameNode applies it. This write-ahead log ensures that metadata survives NameNode crashes. The data blocks themselves are safe on the DataNodes; the EditLog protects the mapping between files and blocks.
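The write-ahead pattern is simple to sketch (with hypothetical names): log the operation first, mutate in-memory state second, and replay the log on restart:

```python
# Sketch of write-ahead logging for NameNode metadata -- illustrative, not HDFS source.
class EditLog:
    def __init__(self):
        self.entries = []

    def append(self, op):
        self.entries.append(op)   # real HDFS also syncs this to durable storage

class NameNodeMeta:
    def __init__(self, editlog):
        self.editlog = editlog
        self.files = {}           # path -> list of block IDs

    def apply(self, op):
        kind, path, payload = op
        if kind == "create":
            self.files[path] = []
        elif kind == "add_block":
            self.files[path].append(payload)

    def log_and_apply(self, op):
        self.editlog.append(op)   # write-ahead: log first...
        self.apply(op)            # ...then mutate in-memory state

def recover(editlog):
    """After a crash, replay the EditLog to rebuild the file-to-block mapping."""
    meta = NameNodeMeta(editlog)
    for op in editlog.entries:
        meta.apply(op)
    return meta
```

Because every change hits the log before it hits memory, a crash at any point leaves the log with either the complete operation or nothing, and replay reconstructs exactly the committed state.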