Hadoop Distributed File System Introduction

Design a distributed system that can store huge files (terabytes and larger). The system should be scalable, reliable, and highly available.

Why HDFS matters

Google published the GFS paper in 2003. It described a brilliant file system, but it was proprietary -- no one outside Google could use it. The open-source world needed its own version.

HDFS is that version. Built as part of the Apache Hadoop project, HDFS is a simplified, open-source reimplementation of GFS's core ideas. It powers the big data ecosystem -- every Hadoop MapReduce job, every Spark computation, every Hive query reads from and writes to HDFS.

Understanding HDFS alongside GFS is valuable because it shows how the same architectural principles get adapted when constraints change. HDFS made different trade-offs than GFS (no concurrent writers, simpler consistency model) and those differences reveal what's essential in the GFS design versus what's Google-specific.

Interview insight

HDFS is a cleaner, more approachable version of GFS. If an interviewer asks you to design a distributed file system, HDFS's architecture (NameNode + DataNodes, block replication, rack-aware placement) is the standard answer. You can then discuss how GFS differs for extra depth.

What is HDFS?

HDFS is a distributed file system designed to store unstructured data reliably across clusters of commodity hardware, streaming it at high bandwidth to user applications.

| Property | Detail |
| --- | --- |
| Based on | Google File System (GFS) |
| Architecture | Leader-based: NameNode (master) + DataNodes (workers) |
| Optimized for | Large sequential reads and writes, batch processing |
| Block size | 128 MB (vs. GFS's 64 MB chunks) |
| Replication | 3 replicas per block (configurable), rack-aware placement |
| Consistency | Strong consistency; write-once, read-many |
| Access pattern | Write once, append only, read many times |
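The rack-aware placement mentioned above follows a well-known default policy: put the first replica on the writer's node, the second on a node in a different rack (to survive a rack failure), and the third on a different node in the same rack as the second (to limit cross-rack traffic). A minimal sketch of that policy, with a hypothetical cluster layout:

```python
import random

def place_replicas(cluster, writer_node):
    """Sketch of HDFS's default 3-replica placement policy.

    cluster: dict mapping rack name -> list of node names (illustrative).
    """
    rack_of = {n: r for r, nodes in cluster.items() for n in nodes}
    # First replica: on the node doing the writing.
    first = writer_node
    # Second replica: any node on a *different* rack, for rack-failure tolerance.
    other_racks = [r for r in cluster if r != rack_of[first]]
    second_rack = random.choice(other_racks)
    second = random.choice(cluster[second_rack])
    # Third replica: same rack as the second, different node -- only one
    # cross-rack transfer is needed for the whole pipeline.
    third = random.choice([n for n in cluster[second_rack] if n != second])
    return [first, second, third]

cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
replicas = place_replicas(cluster, "n1")
```

This trades a little durability (two of three replicas share a rack) for much lower write bandwidth across the rack switch.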

HDFS vs. GFS: key differences

| Aspect | GFS | HDFS |
| --- | --- | --- |
| Chunk/block size | 64 MB | 128 MB |
| Concurrent appends | Yes (multiple writers) | No (single writer per file) |
| Random writes | Limited support | No support -- append-only |
| Consistency model | Relaxed (defined per operation) | Stronger (single-writer simplifies things) |
| Snapshots | Built-in | Supported (added in later versions) |
| Open source | No | Yes |

Why the differences?

GFS was designed for Google's specific workloads, which included concurrent appends from many producers (e.g., web crawlers writing to the same log file). HDFS's primary workload is MapReduce, where files are written by a single job and read by many. Dropping concurrent writes dramatically simplified HDFS's design.

When HDFS fits (and when it doesn't)

Good fit:

  • Batch processing -- MapReduce, Spark, Hive jobs that read and write large files sequentially
  • Data lakes -- Store raw, unprocessed data at massive scale for later analysis
  • Log storage -- Aggregate and store logs from thousands of servers
  • Write-once, read-many workloads -- Data that's produced once and analyzed repeatedly

Poor fit:

| Anti-pattern | Why |
| --- | --- |
| Low-latency access | HDFS optimizes for throughput, not latency. For millisecond reads, use BigTable/HBase |
| Billions of small files | The NameNode holds all metadata in memory; each file consumes ~150 bytes. Billions of files would exhaust the NameNode |
| Concurrent writers | Unlike GFS, HDFS does not support multiple writers to the same file |
| Random writes | Writes are append-only; you cannot modify data at arbitrary offsets |

The NameNode bottleneck

The NameNode is HDFS's biggest limitation. All metadata lives in its memory, which caps the number of files the system can manage. This is a direct consequence of choosing a single-master architecture (like GFS) for simplicity. It works well for millions of large files but breaks down with billions of small ones.
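The commonly cited figure of ~150 bytes of NameNode heap per metadata object (file or block) makes this limit easy to estimate. A back-of-the-envelope calculation, assuming the default 128 MB block size:

```python
# Rule-of-thumb estimate of NameNode heap usage. The ~150 bytes/object
# figure is a widely quoted approximation, not an exact measurement.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024**2  # default HDFS block size (128 MB)

def namenode_memory(num_files, avg_file_size):
    # Each file contributes one file object plus one object per block.
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 100 million 1 GB files (8 blocks each): ~135 GB of heap -- large but feasible.
big_files = namenode_memory(100_000_000, 1024**3)

# 10 billion 1 KB files (1 block each): ~3 TB of heap -- impractical.
small_files = namenode_memory(10_000_000_000, 1024)
```

The total data stored is roughly the same in both scenarios per byte of metadata cost shown; it is the *file count*, not the data volume, that exhausts the NameNode.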

APIs

Like GFS, HDFS does not provide POSIX-compliant APIs. It exposes user-level APIs through the Hadoop libraries:

  • Files and directories can be created, deleted, renamed, and moved
  • Symbolic links are supported
  • Writes are append-only; reads may access any offset, but existing data cannot be modified in place
  • All operations go through the HDFS client or Hadoop API calls
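The write-once, append-only, single-writer semantics above can be modeled in a few lines. This is a toy sketch of the *semantics*, not the Hadoop API (the class and method names here are illustrative):

```python
# Toy model of HDFS file semantics: one writer at a time (a "lease"),
# append-only writes, and unrestricted concurrent reads.
class HdfsFile:
    def __init__(self):
        self.data = bytearray()
        self.writer_open = False

    def open_for_append(self):
        # HDFS grants a lease to a single writer; a second open fails.
        if self.writer_open:
            raise IOError("lease already held: only one writer per file")
        self.writer_open = True

    def append(self, chunk: bytes):
        if not self.writer_open:
            raise IOError("file not open for append")
        self.data.extend(chunk)  # bytes can only be added at the end

    def close(self):
        self.writer_open = False

    def read(self, offset=0, length=None):
        # Reads at arbitrary offsets are fine; only writes are restricted.
        end = len(self.data) if length is None else offset + length
        return bytes(self.data[offset:end])

f = HdfsFile()
f.open_for_append()
f.append(b"log line 1\n")
f.close()
```

Note that there is no `write(offset, data)` operation at all: dropping it is exactly what lets HDFS use a simpler consistency model than GFS.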
POSIX

The Portable Operating System Interface (POSIX) is an IEEE standard that defines uniform APIs for Unix-like operating systems. HDFS skips POSIX compliance for the same reason GFS did -- it would impose constraints that don't match the actual workload.

What's next

In the following chapters, we'll explore: