Deep Dive
Where you place replicas in a data center determines whether a single rack failure takes your data offline or passes unnoticed. HDFS's rack-aware replication, synchronization semantics, and consistency model all flow from this insight.
Cluster topology
A typical Hadoop data center has 30--40 servers per rack. Each rack has a dedicated gigabit switch connecting its servers, plus an uplink to a core switch shared across racks.
HDFS maps every server to a specific rack and measures network distance in hops using a tree-style topology. The distance between two servers equals the sum of their distances to their closest common ancestor:
| Path | Hops |
|---|---|
| Same node (Node 1 to Node 1) | 0 |
| Same rack (Node 1 to Node 2) | 2 |
| Different rack (Node 3 to Node 4) | 4 |
This topology awareness drives replica placement decisions, read-path optimizations (prefer the closest replica), and pipeline construction during writes.
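The hop counts in the table fall out of a simple shared-prefix calculation over topology paths. Here is a minimal sketch, assuming each server is identified by a hypothetical `/datacenter/rack/node` path (the path names are illustrative, not real HDFS configuration):

```python
def distance(path_a: str, path_b: str) -> int:
    """Hops = each node's distance up to their closest common ancestor, summed."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Length of the shared prefix = depth of the closest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

print(distance("/dc1/rack1/node1", "/dc1/rack1/node1"))  # 0: same node
print(distance("/dc1/rack1/node1", "/dc1/rack1/node2"))  # 2: same rack
print(distance("/dc1/rack1/node3", "/dc1/rack2/node4"))  # 4: different racks
```

The same calculation generalizes to deeper trees (e.g., multiple data centers) without changing the code.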
Rack-aware replication
HDFS's replica placement policy directly optimizes for surviving rack-level failures while keeping write latency reasonable. For a replication factor of three:
| Replica | Placement rule |
|---|---|
| 1st | Same node as the writing client (or a random node if the client is external) |
| 2nd | A node on a different rack (off-rack replica) |
| 3rd | Another random node on the same rack as the 2nd replica |
| Additional | Random nodes, but the system avoids concentrating too many replicas on one rack |
Two invariants govern placement:
- No DataNode holds more than one replica of the same block.
- If enough racks exist, no rack holds more than two replicas of the same block.
Rack-aware replication is a classic trade-off question. Writing across racks costs more network bandwidth (cross-rack links are shared and slower), but it guarantees availability even when an entire rack loses power or network connectivity. When an interviewer asks "why not put all replicas on the same rack for faster writes?", explain that a rack failure would make the block completely unavailable -- violating the whole point of replication.
This scheme closely mirrors GFS's replication strategy, where the Master also spreads chunk replicas across racks.
Synchronization semantics
HDFS follows a write-once, read-many access pattern:
| Version | Write behavior |
|---|---|
| Early HDFS | Strict immutable semantics -- once written, a file could never be reopened for writes (only deleted) |
| Current HDFS | Supports append, but existing data cannot be modified in place |
This design choice aligns with MapReduce workloads: reducers write independent output files to HDFS, and those files are read many times by downstream jobs. Dropping random writes and concurrent writers dramatically simplified HDFS's design compared to GFS, which supported both.
HDFS does not support multiple concurrent writers to the same file. If your use case requires concurrent writes (e.g., a logging pipeline with many producers), you need a different system or an intermediary like Kafka.
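One way to picture single-writer enforcement is a per-file lease granted by the NameNode: the first client to open a file for write holds the lease, and any other writer is rejected until it is released. The sketch below is a toy in-memory model of that idea (class and method names are invented for illustration, not the actual NameNode API):

```python
import threading

class LeaseManager:
    """Toy model: at most one write lease per file path."""
    def __init__(self):
        self._lock = threading.Lock()
        self._leases = {}  # path -> client id

    def acquire(self, path: str, client: str) -> None:
        with self._lock:
            holder = self._leases.get(path)
            if holder is not None and holder != client:
                raise IOError(f"{path} already open for write by {holder}")
            self._leases[path] = client

    def release(self, path: str, client: str) -> None:
        with self._lock:
            if self._leases.get(path) == client:
                del self._leases[path]

leases = LeaseManager()
leases.acquire("/logs/part-0", "client-A")
try:
    leases.acquire("/logs/part-0", "client-B")  # rejected: lease held by A
except IOError as e:
    print(e)
leases.release("/logs/part-0", "client-A")
leases.acquire("/logs/part-0", "client-B")  # succeeds after release
```

Real HDFS leases also expire if a writer dies, so a crashed client cannot lock a file forever.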
HDFS consistency model
HDFS enforces strong consistency: a write is declared successful only after all replicas have been written. Every subsequent read sees the same data regardless of which replica it hits.
Strong consistency is straightforward in HDFS because it forbids concurrent writers -- there is no need for complex conflict resolution or versioning. This is a deliberate simplification over GFS, which had a more nuanced (and harder to reason about) consistency model due to its support for concurrent appends.
If asked how HDFS achieves strong consistency, the answer is simple: single-writer semantics plus all-replica-acknowledged writes. No concurrent writers means no write conflicts, no need for vector clocks, and no "consistent but undefined" regions like GFS had.
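The all-replica-acknowledged rule can be reduced to a few lines. This sketch uses invented in-memory replicas and glosses over HDFS's real write pipeline (chained DataNode forwarding with pipeline recovery); it only illustrates the acknowledgment semantics:

```python
class Replica:
    """Toy in-memory replica."""
    def __init__(self):
        self.blocks = []
        self.healthy = True

    def write(self, block: bytes) -> None:
        if not self.healthy:
            raise IOError("replica unreachable")
        self.blocks.append(block)

    def read(self) -> list:
        return list(self.blocks)

def replicated_write(replicas: list, block: bytes) -> bool:
    # Success is declared only after EVERY replica acknowledges the block;
    # a single failure aborts the write before it is ever acknowledged.
    for r in replicas:
        r.write(block)
    return True

replicas = [Replica() for _ in range(3)]
replicated_write(replicas, b"block-1")
# Every replica now agrees, so a read is consistent regardless of target.
print(all(r.read() == [b"block-1"] for r in replicas))  # True
```

Because there is only ever one writer and a write is acknowledged only when all replicas have it, any replica a reader happens to hit returns the same data.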