Hadoop Distributed File System Introduction
Design a distributed system that can store huge files (terabytes and larger). The system should be scalable, reliable, and highly available.
Why HDFS matters
Google published the GFS paper in 2003. It described a brilliant file system, but it was proprietary -- no one outside Google could use it. The open-source world needed its own version.
HDFS is that version. Built as part of the Apache Hadoop project, HDFS is a simplified, open-source reimplementation of GFS's core ideas. It anchors the big data ecosystem -- Hadoop MapReduce jobs, Spark computations, and Hive queries have traditionally read from and written to HDFS.
Understanding HDFS alongside GFS is valuable because it shows how the same architectural principles get adapted when constraints change. HDFS made different trade-offs than GFS (no concurrent writers, simpler consistency model) and those differences reveal what's essential in the GFS design versus what's Google-specific.
HDFS is a cleaner, more approachable version of GFS. If an interviewer asks you to design a distributed file system, HDFS's architecture (NameNode + DataNodes, block replication, rack-aware placement) is the standard answer. You can then discuss how GFS differs for extra depth.
What is HDFS?
HDFS is a distributed file system designed to store unstructured data reliably across clusters of commodity hardware, streaming it at high bandwidth to user applications.
| Property | Detail |
|---|---|
| Based on | Google File System (GFS) |
| Architecture | Leader-based: NameNode (master) + DataNodes (workers) |
| Optimized for | Large sequential reads and writes, batch processing |
| Block size | 128MB (vs. GFS's 64MB chunks) |
| Replication | 3 replicas per block (configurable), rack-aware placement |
| Consistency | Strong consistency; write-once, read-many |
| Access pattern | Write once, append only, read many times |
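The block size and replication factor in the table drive a simple capacity calculation: a file is split into fixed-size blocks, and each block is stored three times. A quick back-of-envelope sketch (using the defaults above; both values are configurable per cluster):

```python
# Back-of-envelope sketch: how HDFS splits a file into blocks and how much
# raw cluster capacity replication consumes. Uses the defaults quoted above.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128MB default block size
REPLICATION = 3                  # default replication factor

def block_count(file_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def physical_bytes(file_size_bytes: int) -> int:
    """Raw disk consumed once every block is replicated."""
    return file_size_bytes * REPLICATION

one_tb = 1024 ** 4
print(block_count(one_tb))                   # 8192 blocks for a 1 TB file
print(physical_bytes(one_tb) // 1024 ** 4)   # 3 TB of raw disk
```

Note the asymmetry this creates: a 1 TB file costs the NameNode 8,192 block records, while the DataNodes pay for it in raw disk at 3x the logical size.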
HDFS vs. GFS: key differences
| Aspect | GFS | HDFS |
|---|---|---|
| Chunk/block size | 64MB | 128MB |
| Concurrent appends | Yes (multiple writers) | No (single writer per file) |
| Random writes | Limited support | No support -- append-only |
| Consistency model | Relaxed (defined per operation) | Stronger (single-writer simplifies things) |
| Snapshots | Built-in | Supported (added in Hadoop 2) |
| Open source | No | Yes |
GFS was designed for Google's specific workloads, which included concurrent appends from many producers (e.g., web crawlers writing to the same log file). HDFS's primary workload is MapReduce, where files are written by a single job and read by many. Dropping concurrent writes dramatically simplified HDFS's design.
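The single-writer rule is enforced by the NameNode through a lease mechanism: only the client holding the lease on a file may write to it. The sketch below is a toy model of that idea -- class and method names are illustrative, not the real NameNode code, and real leases also expire so a crashed writer doesn't block the file forever.

```python
# Toy sketch of HDFS's single-writer rule: the NameNode grants a write
# lease on a path to one client at a time; a second open-for-write is
# rejected until the holder closes the file. Illustrative names only.
class LeaseManager:
    def __init__(self):
        self._leases = {}  # path -> client currently holding the write lease

    def open_for_write(self, path: str, client: str) -> None:
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            raise PermissionError(f"{path} is already leased to {holder}")
        self._leases[path] = client

    def close(self, path: str, client: str) -> None:
        if self._leases.get(path) == client:
            del self._leases[path]

leases = LeaseManager()
leases.open_for_write("/logs/day1", "job-A")
try:
    leases.open_for_write("/logs/day1", "job-B")  # rejected: single writer
except PermissionError:
    pass
leases.close("/logs/day1", "job-A")
leases.open_for_write("/logs/day1", "job-B")      # succeeds after close
```

Compare this with GFS, where record append lets many clients write to the same file concurrently at the cost of a much looser consistency contract.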
When HDFS fits (and when it doesn't)
Good fit:
- Batch processing -- MapReduce, Spark, Hive jobs that read and write large files sequentially
- Data lakes -- Store raw, unprocessed data at massive scale for later analysis
- Log storage -- Aggregate and store logs from thousands of servers
- Write-once, read-many workloads -- Data that's produced once and analyzed repeatedly
Poor fit:
| Anti-pattern | Why |
|---|---|
| Low-latency access | HDFS optimizes for throughput, not latency. For millisecond reads, use BigTable/HBase |
| Billions of small files | The NameNode holds all metadata in memory. Each file, directory, and block object consumes roughly 150 bytes of NameNode memory, so billions of files would exhaust it |
| Concurrent writers | Unlike GFS, HDFS does not support multiple writers to the same file |
| Random writes | Writes are append-only; you cannot modify data at arbitrary offsets |
The NameNode is HDFS's biggest limitation. All metadata lives in its memory, which caps the number of files the system can manage. This is a direct consequence of choosing a single-master architecture (like GFS) for simplicity. It works well for millions of large files but breaks down with billions of small ones.
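That limitation is easy to quantify. Using the ~150 bytes per metadata object figure, each file costs one file object plus one object per block, so small files are disproportionately expensive. A rough estimate (the 64 GB heap size is an assumption for illustration):

```python
# Rough NameNode capacity estimate using ~150 bytes per metadata object.
# Each file contributes one file object plus one object per block, so
# many small files cost far more metadata than a few large ones.
BYTES_PER_OBJECT = 150           # approximate NameNode memory per object

def metadata_bytes(num_files: int, blocks_per_file: int) -> int:
    objects = num_files * (1 + blocks_per_file)   # file object + its blocks
    return objects * BYTES_PER_OBJECT

heap = 64 * 1024 ** 3            # assume a 64 GB NameNode heap

# 10 million large files (8 blocks = ~1 GB each) fit comfortably...
print(metadata_bytes(10_000_000, 8) < heap)       # True (~13.5 GB)

# ...but a billion tiny one-block files blow past the same heap.
print(metadata_bytes(1_000_000_000, 1) > heap)    # True (~300 GB)
```

The same storage volume can be cheap or ruinous for the NameNode depending entirely on file count, which is why "billions of small files" is the canonical HDFS anti-pattern.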
APIs
Like GFS, HDFS does not provide POSIX-compliant APIs. It exposes user-level APIs through the Hadoop libraries:
- Files and directories can be created, deleted, renamed, and moved
- Symbolic links are supported
- Reads and writes are append-only
- All operations go through the HDFS client or Hadoop API calls
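The write-once, append-only contract is the key behavioral difference from a POSIX file. The sketch below is a toy in-memory model of that contract -- illustrative only, not the real Hadoop client API: appends extend the file, while overwriting bytes at an arbitrary offset is rejected.

```python
# Toy in-memory model of HDFS's write-once, append-only file semantics.
# Not the real Hadoop client API -- just the behavioral contract.
class AppendOnlyFile:
    def __init__(self):
        self._data = bytearray()

    def append(self, chunk: bytes) -> None:
        self._data.extend(chunk)   # the only operation that mutates data

    def write_at(self, offset: int, chunk: bytes) -> None:
        # Random writes at arbitrary offsets are not supported.
        raise OSError("HDFS files are append-only; random writes rejected")

    def read(self, offset: int = 0, length: int = None) -> bytes:
        end = None if length is None else offset + length
        return bytes(self._data[offset:end])

f = AppendOnlyFile()
f.append(b"line 1\n")
f.append(b"line 2\n")
print(f.read())                    # b'line 1\nline 2\n'
try:
    f.write_at(0, b"X")            # rejected: cannot modify existing bytes
except OSError:
    pass
```

Readers can seek to any offset, but writers can only extend the end of the file -- which is exactly the pattern batch jobs need and nothing more.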
The Portable Operating System Interface (POSIX) is an IEEE standard that defines uniform APIs for Unix-like operating systems. HDFS skips POSIX compliance for the same reason GFS did -- it would impose constraints that don't match the actual workload.
What's next
In the following chapters, we'll explore:
- HDFS's high-level architecture -- NameNode, DataNodes, and block management
- A deep dive into HDFS internals
- The anatomy of read and write operations
- Data integrity and caching
- Fault tolerance -- what happens when nodes die
- High availability -- solving the NameNode single-point-of-failure problem