Hadoop Distributed File System Introduction
Design a distributed system that can store huge files (terabytes and larger). The system should be scalable, reliable, and highly available.
Why HDFS matters
Google published the GFS paper in 2003. It described a brilliant file system, but it was proprietary -- no one outside Google could use it. The open-source world needed its own version.
HDFS is that version. Built as part of the Apache Hadoop project, HDFS is a simplified, open-source reimplementation of GFS's core ideas. It anchors the big data ecosystem -- Hadoop MapReduce jobs, Spark computations, and Hive queries have traditionally read from and written to HDFS.
Understanding HDFS alongside GFS is valuable because it shows how the same architectural principles get adapted when constraints change. HDFS made different trade-offs than GFS (no concurrent writers, simpler consistency model) and those differences reveal what's essential in the GFS design versus what's Google-specific.
HDFS is a cleaner, more approachable version of GFS. If an interviewer asks you to design a distributed file system, HDFS's architecture (NameNode + DataNodes, block replication, rack-aware placement) is the standard answer. You can then discuss how GFS differs for extra depth.
What is HDFS?
HDFS is a distributed file system designed to store unstructured data reliably across clusters of commodity hardware, streaming it at high bandwidth to user applications.
| Property | Detail |
|---|---|
| Based on | Google File System (GFS) |
| Architecture | Leader-based: NameNode (master) + DataNodes (workers) |
| Optimized for | Large sequential reads and writes, batch processing |
| Block size | 128MB (vs. GFS's 64MB chunks) |
| Replication | 3 replicas per block (configurable), rack-aware placement |
| Consistency | Strong consistency; write-once, read-many |
| Access pattern | Write once, append only, read many times |
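The block size and replication factor in the table drive a simple capacity calculation: a file is split into fixed-size blocks, and each block is stored three times. A quick back-of-envelope sketch (using the defaults above; both values are configurable per cluster):

```python
# Back-of-envelope sketch: how HDFS splits a file into blocks and how much
# raw cluster capacity replication consumes. Uses the defaults quoted above.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128MB default block size
REPLICATION = 3                  # default replication factor

def block_count(file_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def physical_bytes(file_size_bytes: int) -> int:
    """Raw disk consumed once every block is replicated."""
    return file_size_bytes * REPLICATION

one_tb = 1024 ** 4
print(block_count(one_tb))                   # 8192 blocks for a 1 TB file
print(physical_bytes(one_tb) // 1024 ** 4)   # 3 TB of raw disk
```

Note the asymmetry this creates: a 1 TB file costs the NameNode 8,192 block records, while the DataNodes pay for it in raw disk at 3x the logical size.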
HDFS vs. GFS: key differences
| Aspect | GFS | HDFS |
|---|---|---|
| Chunk/block size | 64MB | 128MB |
| Concurrent appends | Yes (multiple writers) | No (single writer per file) |
| Random writes | Limited support | No support -- append-only |
| Consistency model | Relaxed (defined per operation) | Stronger (single-writer simplifies things) |
| Snapshots | Built-in | Supported (added in Hadoop 2) |
| Open source | No | Yes |
GFS was designed for Google's specific workloads, which included concurrent appends from many producers (e.g., web crawlers writing to the same log file). HDFS's primary workload is MapReduce, where files are written by a single job and read by many. Dropping concurrent writes dramatically simplified HDFS's design.
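The single-writer rule is enforced by the NameNode through a lease mechanism: only the client holding the lease on a file may write to it. The sketch below is a toy model of that idea -- class and method names are illustrative, not the real NameNode code, and real leases also expire so a crashed writer doesn't block the file forever.

```python
# Toy sketch of HDFS's single-writer rule: the NameNode grants a write
# lease on a path to one client at a time; a second open-for-write is
# rejected until the holder closes the file. Illustrative names only.
class LeaseManager:
    def __init__(self):
        self._leases = {}  # path -> client currently holding the write lease

    def open_for_write(self, path: str, client: str) -> None:
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            raise PermissionError(f"{path} is already leased to {holder}")
        self._leases[path] = client

    def close(self, path: str, client: str) -> None:
        if self._leases.get(path) == client:
            del self._leases[path]

leases = LeaseManager()
leases.open_for_write("/logs/day1", "job-A")
try:
    leases.open_for_write("/logs/day1", "job-B")  # rejected: single writer
except PermissionError:
    pass
leases.close("/logs/day1", "job-A")
leases.open_for_write("/logs/day1", "job-B")      # succeeds after close
```

Compare this with GFS, where record append lets many clients write to the same file concurrently at the cost of a much looser consistency contract.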
When HDFS fits (and when it doesn't)
Good fit:
- Batch processing -- MapReduce, Spark, Hive jobs that read and write large files sequentially
- Data lakes -- Store raw, unprocessed data at massive scale for later analysis
- Log storage -- Aggregate and store logs from thousands of servers
- Write-once, read-many workloads -- Data that's produced once and analyzed repeatedly
Poor fit:
| Anti-pattern | Why |
|---|---|
| Low-latency access | HDFS optimizes for throughput, not latency. For millisecond reads, use BigTable/HBase |
| Billions of small files | The NameNode holds all metadata in memory. Each file, directory, and block object consumes roughly 150 bytes of NameNode memory, so billions of files would exhaust it |
| Concurrent writers | Unlike GFS, HDFS does not support multiple writers to the same file |
| Random writes | Writes are append-only; you cannot modify data at arbitrary offsets |
The NameNode is HDFS's biggest limitation. All metadata lives in its memory, which caps the number of files the system can manage. This is a direct consequence of choosing a single-master architecture (like GFS) for simplicity. It works well for millions of large files but breaks down with billions of small ones.
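That limitation is easy to quantify. Using the ~150 bytes per metadata object figure, each file costs one file object plus one object per block, so small files are disproportionately expensive. A rough estimate (the 64 GB heap size is an assumption for illustration):

```python
# Rough NameNode capacity estimate using ~150 bytes per metadata object.
# Each file contributes one file object plus one object per block, so
# many small files cost far more metadata than a few large ones.
BYTES_PER_OBJECT = 150           # approximate NameNode memory per object

def metadata_bytes(num_files: int, blocks_per_file: int) -> int:
    objects = num_files * (1 + blocks_per_file)   # file object + its blocks
    return objects * BYTES_PER_OBJECT

heap = 64 * 1024 ** 3            # assume a 64 GB NameNode heap

# 10 million large files (8 blocks = ~1 GB each) fit comfortably...
print(metadata_bytes(10_000_000, 8) < heap)       # True (~13.5 GB)

# ...but a billion tiny one-block files blow past the same heap.
print(metadata_bytes(1_000_000_000, 1) > heap)    # True (~300 GB)
```

The same storage volume can be cheap or ruinous for the NameNode depending entirely on file count, which is why "billions of small files" is the canonical HDFS anti-pattern.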
APIs
Like GFS, HDFS does not provide POSIX-compliant APIs. It exposes user-level APIs through the Hadoop libraries:
- Files and directories can be created, deleted, renamed, and moved
- Symbolic links are supported
- Reads and writes are append-only
- All operations go through the HDFS client or Hadoop API calls
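The write-once, append-only contract is the key behavioral difference from a POSIX file. The sketch below is a toy in-memory model of that contract -- illustrative only, not the real Hadoop client API: appends extend the file, while overwriting bytes at an arbitrary offset is rejected.

```python
# Toy in-memory model of HDFS's write-once, append-only file semantics.
# Not the real Hadoop client API -- just the behavioral contract.
class AppendOnlyFile:
    def __init__(self):
        self._data = bytearray()

    def append(self, chunk: bytes) -> None:
        self._data.extend(chunk)   # the only operation that mutates data

    def write_at(self, offset: int, chunk: bytes) -> None:
        # Random writes at arbitrary offsets are not supported.
        raise OSError("HDFS files are append-only; random writes rejected")

    def read(self, offset: int = 0, length: int = None) -> bytes:
        end = None if length is None else offset + length
        return bytes(self._data[offset:end])

f = AppendOnlyFile()
f.append(b"line 1\n")
f.append(b"line 2\n")
print(f.read())                    # b'line 1\nline 2\n'
try:
    f.write_at(0, b"X")            # rejected: cannot modify existing bytes
except OSError:
    pass
```

Readers can seek to any offset, but writers can only extend the end of the file -- which is exactly the pattern batch jobs need and nothing more.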
The Portable Operating System Interface (POSIX) is an IEEE standard that defines uniform APIs for Unix-like operating systems. HDFS skips POSIX compliance for the same reason GFS did -- it would impose constraints that don't match the actual workload.
What's next
In the following chapters, we'll explore:
- HDFS's high-level architecture -- NameNode, DataNodes, and block management
- A deep dive into HDFS internals
- The anatomy of read and write operations
- Data integrity and caching
- Fault tolerance -- what happens when nodes die
- High availability -- solving the NameNode single-point-of-failure problem