HDFS Characteristics
Every distributed system makes trade-offs. HDFS optimizes for throughput on large files at the cost of latency, small-file efficiency, and POSIX compliance. Understanding these trade-offs tells you when HDFS is the right tool -- and when it is not.
Security and permissions
HDFS implements a POSIX-like permissions model:
| Permission | Files | Directories |
|---|---|---|
| Read (r) | Required to read the file | Required to list directory contents |
| Write (w) | Required to write or append | Required to create or delete children |
| Execute (x) | Ignored (files cannot be executed on HDFS) | Required to access a child of the directory |
Each file and directory has an owner, a group, and separate permission bits for the owner, group members, and all other users. HDFS also supports optional POSIX ACLs (Access Control Lists) for finer-grained rules targeting specific named users or groups.
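The evaluation order described above (owner class first, then group, then everyone else) can be sketched as follows. This is an illustrative model of POSIX-style permission checking, not HDFS source code; the function name and arguments are made up for this example.

```python
# Illustrative sketch of POSIX-style permission evaluation (not HDFS internals).
# Mode bits are grouped as (owner, group, other), each holding r/w/x flags.

def check_access(user, user_groups, owner, group, mode, want):
    """Return True if `user` has permission `want` ('r', 'w', or 'x')."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:                    # owner class is checked first
        return bool((mode >> 6) & bit)
    if group in user_groups:             # then the owning group
        return bool((mode >> 3) & bit)
    return bool(mode & bit)              # finally, everyone else

# 0o640: owner rw-, group r--, other ---
print(check_access("alice", {"staff"}, "alice", "staff", 0o640, "w"))  # True
print(check_access("bob", {"staff"}, "alice", "staff", 0o640, "r"))    # True
print(check_access("eve", {"guests"}, "alice", "staff", 0o640, "r"))   # False
```

Note that only one class applies per request: if the user is the owner, the group and other bits are never consulted, matching POSIX semantics. ACL entries, when enabled, add named-user and named-group classes between owner and group.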
HDFS federation
The NameNode keeps all metadata in memory. On clusters with hundreds of millions of files, memory becomes the scaling bottleneck. A single NameNode also becomes a performance bottleneck when it must serve all metadata requests.
HDFS Federation (introduced in Hadoop 2.x) addresses both problems by allowing multiple independent NameNodes, each managing a portion of the namespace:
| Property | Detail |
|---|---|
| Independence | NameNodes operate independently -- no coordination required between them |
| Shared storage | All NameNodes share the same pool of DataNodes |
| Isolation | A NameNode failure affects only its namespace, not others |
| Client routing | Clients use client-side mount tables to map file paths to the correct NameNode |
For example, one NameNode might manage /user while another handles /share.
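The /user and /share split above can be modeled with a longest-prefix lookup. The real client-side mount table (ViewFs) is configured in XML; the dictionary, hostnames, and `resolve` helper below are assumptions made for a minimal sketch of the routing logic.

```python
# Hypothetical mount table mapping namespace prefixes to NameNodes.
# The hostnames are placeholders, not real configuration.
MOUNT_TABLE = {
    "/user":  "namenode1.example.com:8020",
    "/share": "namenode2.example.com:8020",
}

def resolve(path):
    """Route an HDFS path to its NameNode via longest-prefix match."""
    matches = [p for p in MOUNT_TABLE if path == p or path.startswith(p + "/")]
    if not matches:
        raise FileNotFoundError(f"no mount point for {path}")
    return MOUNT_TABLE[max(matches, key=len)]

print(resolve("/user/alice/data.csv"))  # namenode1.example.com:8020
print(resolve("/share/datasets/logs"))  # namenode2.example.com:8020
```

Longest-prefix matching matters when mount points nest (e.g., /user and /user/archive could route to different NameNodes).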
Block Pool IDs
Multiple independent NameNodes could generate the same 64-bit BlockID. HDFS solves this with Block Pools: each namespace owns one or more block pools, identified by a unique Block Pool ID. The extended block ID is a tuple of (Block Pool ID, Block ID), guaranteeing uniqueness across the federated cluster.
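The uniqueness guarantee can be seen in a two-line sketch. The Block Pool ID strings below are invented for illustration; only the tuple structure matches the description above.

```python
# Two independent NameNodes may mint the same raw 64-bit Block ID,
# but pairing it with each namespace's unique Block Pool ID keeps
# the extended IDs distinct. (Pool ID strings here are made up.)

block_a = ("BP-1073741825-ns1", 4242)  # (Block Pool ID, Block ID) from NameNode 1
block_b = ("BP-2147483649-ns2", 4242)  # same raw Block ID from NameNode 2

assert block_a[1] == block_b[1]  # raw block IDs collide...
assert block_a != block_b        # ...but extended block IDs stay unique
```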
Federation is the answer to "how does HDFS scale metadata beyond one machine's RAM?" Each NameNode manages a disjoint portion of the namespace, so total metadata capacity scales linearly with the number of NameNodes. This is a horizontal scaling solution for metadata, while the data layer (DataNodes) was already horizontally scalable.
Erasure coding
Default 3x replication imposes a 200% storage overhead. Erasure Coding (EC) achieves the same fault tolerance with roughly 50% overhead -- effectively doubling usable storage capacity.
| Approach | Storage overhead | Fault tolerance | CPU cost |
|---|---|---|---|
| 3x replication | 200% | Tolerates 2 replica losses | Minimal |
| Erasure coding (e.g., RS-6-3) | ~50% | Tolerates up to 3 fragment losses | Higher -- encoding/decoding requires computation |
Under EC, data is broken into fragments, encoded with redundant parity pieces, and distributed across DataNodes. If fragments are lost (disk failure, corruption), the original data is reconstructed from the surviving fragments.
Erasure coding trades CPU for storage. It works well for cold or archival data that is rarely read, but the computational overhead of encoding/decoding makes it less suitable for hot data with frequent access. HDFS lets you set EC policies at the directory level, so you can mix replication (for hot data) and EC (for cold data) in the same cluster.
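The overhead figures in the table follow from simple arithmetic: replication stores `factor - 1` extra copies per byte, while RS-style coding stores `parity / data` extra bytes. A quick sketch (helper names are ours, not an HDFS API):

```python
# Storage-overhead arithmetic for the two redundancy schemes.
# Overhead = extra bytes stored per user byte, as a percentage.

def replication_overhead(factor):
    """3x replication stores 2 extra copies -> 200% overhead."""
    return (factor - 1) * 100

def ec_overhead(data_units, parity_units):
    """RS-6-3 stores 3 parity units per 6 data units -> 50% overhead."""
    return parity_units / data_units * 100

print(replication_overhead(3))  # 200
print(ec_overhead(6, 3))        # 50.0

# Usable capacity on 1000 TB of raw disk:
print(1000 / 3)      # ~333 TB under 3x replication
print(1000 * 6 / 9)  # ~667 TB under RS-6-3 -- roughly double
```

This is where the "effectively doubling usable storage" claim comes from: the same raw disks hold about twice as much user data under RS-6-3 as under 3x replication.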
HDFS in practice
HDFS is used across the Hadoop ecosystem -- Pig, Hive, HBase, Giraph, Spark -- and beyond (e.g., GraphLab).
Advantages
| Strength | Why |
|---|---|
| High bandwidth | Large clusters sustain up to 1 TB/s of continuous writes |
| High reliability | Replication (or EC) across racks ensures availability even during disk and server failures |
| Low cost per byte | Runs on commodity hardware with software-managed redundancy -- no SAN or enterprise disk arrays needed |
| Horizontal scalability | Add DataNodes to a running cluster and rebalance without downtime |
Disadvantages
| Limitation | Why |
|---|---|
| Small file inefficiency | Each file consumes ~150 bytes of NameNode memory. Millions of small files exhaust the NameNode. Workaround: combine small files into sequence files (binary key-value containers where the file name is the key and the contents are the value) |
| POSIX non-compliance | Applications must use the HDFS client or Hadoop API. A FUSE driver exists but does not support writes after close |
| Write-once model | A file has a single writer at a time and data can only be appended, never modified in place. Append support arrived in later versions, but random writes remain impossible |
| Latency | Optimized for throughput, not latency. For millisecond reads, use BigTable/HBase instead |
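The small-file math from the table above is worth making concrete. A rough model, assuming ~150 bytes per namespace object and that each file costs at least one file object plus one block object (exact per-object sizes vary by Hadoop version):

```python
# Back-of-the-envelope NameNode heap estimate. The 150-bytes-per-object
# figure is the commonly cited approximation; real sizes vary by version.

BYTES_PER_OBJECT = 150

def namenode_memory_gb(num_files, blocks_per_file=1):
    """Estimate NameNode heap in GB: one file object + N block objects per file."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9

print(namenode_memory_gb(1_000_000))      # 0.3 GB -- a million small files is fine
print(namenode_memory_gb(1_000_000_000))  # 300 GB -- a billion exhausts typical heaps
```

The estimate shows why the bottleneck is file *count*, not total bytes: a billion 1 KB files (~1 TB of data) needs far more NameNode memory than a thousand 1 TB files, which is exactly what the sequence-file workaround exploits.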
When an interviewer asks "what are HDFS's limitations?", lead with the NameNode memory bottleneck (it caps the total number of files) and the write-once model (no random writes, no concurrent writers). These are the two constraints most likely to disqualify HDFS for a given use case. Then mention that HDFS Federation and erasure coding address the first and storage-cost concerns respectively.
In short, HDFS excels when storing a moderate number of very large files for batch processing. It struggles with billions of small files, low-latency access, and workloads requiring concurrent or random writes.