Deep Dive
Where you place replicas in a data center determines whether a single rack failure takes your data offline or passes unnoticed. HDFS's rack-aware replication, synchronization semantics, and consistency model all flow from this insight.
Cluster topology
A typical Hadoop data center has 30--40 servers per rack. Each rack has a dedicated gigabit switch connecting its servers, plus an uplink to a core switch shared across racks.
HDFS maps every server to a specific rack and measures network distance in hops using a tree-style topology. The distance between two servers equals the sum of their distances to their closest common ancestor:
| Path | Hops |
|---|---|
| Same node (Node 1 to Node 1) | 0 |
| Same rack (Node 1 to Node 2) | 2 |
| Different rack (Node 3 to Node 4) | 4 |
This topology awareness drives replica placement decisions, read-path optimizations (prefer the closest replica), and pipeline construction during writes.
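The hop counts in the table fall out of a simple shared-prefix calculation over topology paths. Here is a minimal sketch, assuming each server is identified by a hypothetical `/datacenter/rack/node` path (the path names are illustrative, not real HDFS configuration):

```python
def distance(path_a: str, path_b: str) -> int:
    """Hops = each node's distance up to their closest common ancestor, summed."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Length of the shared prefix = depth of the closest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

print(distance("/dc1/rack1/node1", "/dc1/rack1/node1"))  # 0: same node
print(distance("/dc1/rack1/node1", "/dc1/rack1/node2"))  # 2: same rack
print(distance("/dc1/rack1/node3", "/dc1/rack2/node4"))  # 4: different racks
```

The same calculation generalizes to deeper trees (e.g., multiple data centers) without changing the code.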
Rack-aware replication
HDFS's replica placement policy directly optimizes for surviving rack-level failures while keeping write latency reasonable. For a replication factor of three:
| Replica | Placement rule |
|---|---|
| 1st | Same node as the writing client (or a random node if the client is external) |
| 2nd | A node on a different rack (off-rack replica) |
| 3rd | Another random node on the same rack as the 2nd replica |
| Additional | Random nodes, but the system avoids concentrating too many replicas on one rack |
Two invariants govern placement:
- No DataNode holds more than one replica of the same block.
- If enough racks exist, no rack holds more than two replicas of the same block.
Rack-aware replication is a classic trade-off question. Writing across racks costs more network bandwidth (cross-rack links are shared and slower), but it guarantees availability even when an entire rack loses power or network connectivity. When an interviewer asks "why not put all replicas on the same rack for faster writes?", explain that a rack failure would make the block completely unavailable -- violating the whole point of replication.
This scheme closely mirrors GFS's replication strategy, where the Master also spreads chunk replicas across racks.
Synchronization semantics
HDFS follows a write-once, read-many access pattern:
| Version | Write behavior |
|---|---|
| Early HDFS | Strict immutable semantics -- once written, a file could never be reopened for writes (only deleted) |
| Current HDFS | Supports append, but existing data cannot be modified in place |
This design choice aligns with MapReduce workloads: reducers write independent output files to HDFS, and those files are read many times by downstream jobs. Dropping random writes and concurrent writers dramatically simplified HDFS's design compared to GFS, which supported both.
HDFS does not support multiple concurrent writers to the same file. If your use case requires concurrent writes (e.g., a logging pipeline with many producers), you need a different system or an intermediary like Kafka.
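One way to picture single-writer enforcement is a per-file lease granted by the NameNode: the first client to open a file for write holds the lease, and any other writer is rejected until it is released. The sketch below is a toy in-memory model of that idea (class and method names are invented for illustration, not the actual NameNode API):

```python
import threading

class LeaseManager:
    """Toy model: at most one write lease per file path."""
    def __init__(self):
        self._lock = threading.Lock()
        self._leases = {}  # path -> client id

    def acquire(self, path: str, client: str) -> None:
        with self._lock:
            holder = self._leases.get(path)
            if holder is not None and holder != client:
                raise IOError(f"{path} already open for write by {holder}")
            self._leases[path] = client

    def release(self, path: str, client: str) -> None:
        with self._lock:
            if self._leases.get(path) == client:
                del self._leases[path]

leases = LeaseManager()
leases.acquire("/logs/part-0", "client-A")
try:
    leases.acquire("/logs/part-0", "client-B")  # rejected: lease held by A
except IOError as e:
    print(e)
leases.release("/logs/part-0", "client-A")
leases.acquire("/logs/part-0", "client-B")  # succeeds after release
```

Real HDFS leases also expire if a writer dies, so a crashed client cannot lock a file forever.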
HDFS consistency model
HDFS enforces strong consistency: a write is declared successful only after all replicas have been written. Every subsequent read sees the same data regardless of which replica it hits.
Strong consistency is straightforward in HDFS because it forbids concurrent writers -- there is no need for complex conflict resolution or versioning. This is a deliberate simplification over GFS, which had a more nuanced (and harder to reason about) consistency model due to its support for concurrent appends.
If asked how HDFS achieves strong consistency, the answer is simple: single-writer semantics plus all-replica-acknowledged writes. No concurrent writers means no write conflicts, no need for vector clocks, and no "consistent but undefined" regions like GFS had.
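The all-replica-acknowledged rule can be reduced to a few lines. This sketch uses invented in-memory replicas and glosses over HDFS's real write pipeline (chained DataNode forwarding with pipeline recovery); it only illustrates the acknowledgment semantics:

```python
class Replica:
    """Toy in-memory replica."""
    def __init__(self):
        self.blocks = []
        self.healthy = True

    def write(self, block: bytes) -> None:
        if not self.healthy:
            raise IOError("replica unreachable")
        self.blocks.append(block)

    def read(self) -> list:
        return list(self.blocks)

def replicated_write(replicas: list, block: bytes) -> bool:
    # Success is declared only after EVERY replica acknowledges the block;
    # a single failure aborts the write before it is ever acknowledged.
    for r in replicas:
        r.write(block)
    return True

replicas = [Replica() for _ in range(3)]
replicated_write(replicas, b"block-1")
# Every replica now agrees, so a read is consistent regardless of target.
print(all(r.read() == [b"block-1"] for r in replicas))  # True
```

Because there is only ever one writer and a write is acknowledged only when all replicas have it, any replica a reader happens to hit returns the same data.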