HDFS Characteristics

Every distributed system makes trade-offs. HDFS optimizes for throughput on large files at the cost of latency, small-file efficiency, and POSIX compliance. Understanding these trade-offs tells you when HDFS is the right tool -- and when it is not.

Think first
HDFS optimizes for throughput on large files at the cost of latency and small-file efficiency. Why would storing millions of small files (each a few KB) be problematic in a system where the NameNode keeps all metadata in memory?

Security and permissions

HDFS implements a POSIX-like permissions model:

| Permission | Files | Directories |
| --- | --- | --- |
| Read (r) | Required to read the file | Required to list directory contents |
| Write (w) | Required to write or append | Required to create or delete children |
| Execute (x) | Ignored (files cannot be executed on HDFS) | Required to access a child of the directory |

Each file and directory has an owner, a group, and separate permission bits for the owner, group members, and all other users. HDFS also supports optional POSIX ACLs (Access Control Lists) for finer-grained rules targeting specific named users or groups.
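The resolution order matters: HDFS checks exactly one set of bits, chosen by whether the caller is the owner, a member of the group, or neither. A minimal sketch of that logic (illustrative only, not HDFS source; the names and example mode are made up):

```python
# Illustrative sketch of POSIX-style permission resolution as HDFS
# applies it: pick owner, group, or other bits -- never a combination.
from typing import NamedTuple

class Inode(NamedTuple):
    owner: str
    group: str
    mode: int  # e.g. 0o750

def check_access(inode: Inode, user: str, groups: set, want: int) -> bool:
    """want is a 3-bit mask: 4 = read, 2 = write, 1 = execute."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 0o7   # owner bits
    elif inode.group in groups:
        bits = (inode.mode >> 3) & 0o7   # group bits
    else:
        bits = inode.mode & 0o7          # other bits
    return (bits & want) == want

d = Inode(owner="alice", group="analytics", mode=0o750)
check_access(d, "alice", set(), 0o7)        # owner has rwx -> True
check_access(d, "bob", {"analytics"}, 0o2)  # group has r-x, no write -> False
```

Note that only one class of bits is ever consulted: if the owner's bits deny an action, group membership is never checked, exactly as in POSIX.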

HDFS federation

The NameNode keeps all metadata in memory. On clusters with hundreds of millions of files, memory becomes the scaling bottleneck. A single NameNode also becomes a performance bottleneck when it must serve all metadata requests.
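The bottleneck is easy to quantify with the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block). A back-of-envelope sketch, with the per-object size and file counts as assumptions:

```python
# Back-of-envelope estimate of NameNode heap for metadata, using the
# rough rule of thumb of ~150 bytes per namespace object.
BYTES_PER_OBJECT = 150  # approximate; varies by Hadoop version and JVM

def namenode_heap_gb(num_files: int, blocks_per_file: float = 1.0) -> float:
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * BYTES_PER_OBJECT / 1e9

# 500 million tiny files (one block each) vs. the same order of data
# stored as 5 million large multi-block files:
small = namenode_heap_gb(500_000_000)                    # ~150 GB of heap
large = namenode_heap_gb(5_000_000, blocks_per_file=8)   # a few GB
```

The file count, not the data volume, drives memory use, which is why the small-file problem and federation both center on the NameNode.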

HDFS Federation (introduced in Hadoop 2.x) addresses both problems by allowing multiple independent NameNodes, each managing a portion of the namespace:

| Property | Detail |
| --- | --- |
| Independence | NameNodes operate independently -- no coordination required between them |
| Shared storage | All NameNodes share the same pool of DataNodes |
| Isolation | A NameNode failure affects only its namespace, not others |
| Client routing | Clients use client-side mount tables to map file paths to the correct NameNode |

For example, one NameNode might manage /user while another handles /share.
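Conceptually, a client-side mount table is longest-prefix matching from path to NameNode, in the spirit of ViewFs. A sketch with made-up mount points and addresses:

```python
# Sketch of client-side mount-table routing (ViewFs-style): resolve a
# path to the NameNode owning the longest matching mount point.
# Mount points and NameNode addresses below are illustrative.
MOUNT_TABLE = {
    "/user":  "nn1.example.com:8020",
    "/share": "nn2.example.com:8020",
}

def resolve_namenode(path: str) -> str:
    matches = [m for m in MOUNT_TABLE if path == m or path.startswith(m + "/")]
    if not matches:
        raise FileNotFoundError(f"no mount point covers {path}")
    return MOUNT_TABLE[max(matches, key=len)]  # longest prefix wins

resolve_namenode("/user/alice/data.csv")  # -> "nn1.example.com:8020"
```

Because routing happens in the client, no NameNode needs to know about any other's namespace, which is what keeps them fully independent.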

Block Pool IDs

Multiple independent NameNodes could generate the same 64-bit BlockID. HDFS solves this with Block Pools: each namespace owns one or more block pools, identified by a unique Block Pool ID. The extended block ID is a tuple of (Block Pool ID, Block ID), guaranteeing uniqueness across the federated cluster.
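The pairing can be pictured as a two-field key, where equality requires both fields to match (a sketch; the pool ID format shown is illustrative):

```python
# Sketch: extended block IDs under federation. Two NameNodes may mint
# the same 64-bit block ID, but pairing it with the block pool ID keeps
# extended IDs distinct cluster-wide.
from typing import NamedTuple

class ExtendedBlock(NamedTuple):
    block_pool_id: str  # unique per namespace
    block_id: int       # 64-bit, unique only within its pool

a = ExtendedBlock("BP-12345-10.0.0.1-1700000000", 1073741825)
b = ExtendedBlock("BP-67890-10.0.0.2-1700000001", 1073741825)  # same block_id
a != b  # still distinct, because the pool IDs differ
```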

Interview angle

Federation is the answer to "how does HDFS scale metadata beyond one machine's RAM?" Each NameNode manages a disjoint portion of the namespace, so total metadata capacity scales linearly with the number of NameNodes. This is a horizontal scaling solution for metadata, while the data layer (DataNodes) was already horizontally scalable.

Think first
3x replication provides simple fault tolerance but uses 3x the storage. If you have petabytes of archival data that is rarely read, is there a more storage-efficient way to provide the same level of fault tolerance? Think about how RAID systems protect against disk failures with less overhead than full mirroring.

Erasure coding

Default 3x replication imposes a 200% storage overhead. Erasure Coding (EC) achieves the same fault tolerance with roughly 50% overhead -- effectively doubling usable storage capacity.

| Approach | Storage overhead | Fault tolerance | CPU cost |
| --- | --- | --- | --- |
| 3x replication | 200% | Tolerates 2 replica losses | Minimal |
| Erasure coding (e.g., RS-6-3) | ~50% | Tolerates up to 3 fragment losses | Higher -- encoding/decoding requires computation |

Under EC, data is broken into fragments, encoded with redundant parity pieces, and distributed across DataNodes. If fragments are lost (disk failure, corruption), the original data is reconstructed from the surviving fragments.
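HDFS uses Reed-Solomon codes (e.g., RS-6-3), which are too involved to sketch here, but the core reconstruct-from-survivors idea shows up in the simplest possible code: a single XOR parity fragment, as in RAID-5, which tolerates one loss. An illustrative sketch:

```python
# Minimal erasure-coding sketch using one XOR parity fragment (RAID-5
# style, tolerates a single loss). Real HDFS EC uses Reed-Solomon,
# which tolerates more losses, but the rebuild idea is the same:
# a lost fragment is recomputed from the surviving ones.
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal fragments plus one XOR parity fragment."""
    n = len(data) // k
    frags = [data[i * n:(i + 1) * n] for i in range(k)]
    return frags + [reduce(xor, frags)]

def reconstruct(frags: list, lost: int) -> bytes:
    """Rebuild the fragment at index `lost` by XOR-ing all the others."""
    return reduce(xor, (f for i, f in enumerate(frags) if i != lost))

frags = encode(b"hdfs erasure", k=4)  # 4 data fragments + 1 parity
reconstruct(frags, lost=2) == frags[2]  # True: recovered from survivors
```

The storage math follows directly: 4 data + 1 parity is 25% overhead here, and RS-6-3 (6 data + 3 parity) is 50%, versus 200% for 3x replication.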

warning

Erasure coding trades CPU for storage. It works well for cold or archival data that is rarely read, but the computational overhead of encoding/decoding makes it less suitable for hot data with frequent access. HDFS lets you set EC policies at the directory level, so you can mix replication (for hot data) and EC (for cold data) in the same cluster.

HDFS in practice

HDFS is used across the Hadoop ecosystem -- Pig, Hive, HBase, Giraph, Spark -- and beyond (e.g., GraphLab).

Advantages

| Strength | Why |
| --- | --- |
| High bandwidth | Large clusters sustain up to 1 TB/s of continuous writes |
| High reliability | Replication (or EC) across racks ensures availability even during disk and server failures |
| Low cost per byte | Runs on commodity hardware with software-managed redundancy -- no SAN or enterprise disk arrays needed |
| Horizontal scalability | Add DataNodes to a running cluster and rebalance without downtime |

Disadvantages

| Limitation | Why |
| --- | --- |
| Small-file inefficiency | Each file consumes ~150 bytes of NameNode memory, so millions of small files exhaust the NameNode. Workaround: combine small files into sequence files (binary key-value containers where the file name is the key and the contents are the value) |
| POSIX non-compliance | Applications must use the HDFS client or Hadoop API. A FUSE driver exists but does not support writes after close |
| Write-once model | A single writer per file and no random writes. Later versions support append, but in-place modification remains impossible |
| Latency | Optimized for throughput, not latency. For millisecond reads, use BigTable/HBase instead |

Interview angle

When an interviewer asks "what are HDFS's limitations?", lead with the NameNode memory bottleneck (it caps the total number of files) and the write-once model (no random writes, no concurrent writers). These are the two constraints most likely to disqualify HDFS for a given use case. Then mention that HDFS Federation and erasure coding address the first and storage-cost concerns respectively.

In short, HDFS excels when storing a moderate number of very large files for batch processing. It struggles with billions of small files, low-latency access, and workloads requiring concurrent or random writes.

Quiz
A team wants to use HDFS to store 500 million log files, each 10KB in size. What is the most critical problem they will face?