High-level Architecture
How do you store a 10TB file on commodity machines with 1TB disks? You split it into fixed-size blocks, scatter those blocks across a cluster, and keep a single node responsible for tracking where everything lives. That is HDFS in one sentence.
HDFS architecture
Every file in HDFS is split into fixed-size blocks (128MB by default, configurable per file). The system stores two things for each file: the actual data (blocks on DataNodes) and the metadata (block locations, file size, replication info on the NameNode).
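The opening 10TB example is easy to make concrete. A back-of-envelope sketch (illustrative, not HDFS client code) of how a file size maps to block sizes:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies.
    All blocks are full-size except possibly the last one."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # only the last block may be smaller
    return sizes

ten_tb = 10 * 1024**4
blocks = split_into_blocks(ten_tb)
print(len(blocks))  # 81920 blocks, scattered across the cluster
```

10 TiB divides evenly by 128 MiB, so here even the last block is full-size; a 200 MB file would instead get one 128 MB block and one 72 MB block.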
| Component | Role |
|---|---|
| NameNode | Single master that manages all file system metadata |
| DataNodes | Worker nodes that store actual data blocks as local files |
| HDFS Client | Application-facing library; talks to NameNode for metadata, directly to DataNodes for data |
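The split of responsibilities in the table shows up clearly in the read path: the client asks the NameNode only *where* blocks live, then streams the bytes straight from DataNodes. A toy sketch with stub classes (illustrative names, not the real client API):

```python
class StubNameNode:
    """Toy NameNode: holds block metadata only, never touches block data."""
    def __init__(self, files):
        self.files = files  # path -> list of {"id": ..., "locations": [...]}

    def get_block_locations(self, path):
        return self.files[path]

class StubDataNode:
    """Toy DataNode: stores block bytes keyed by block id."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, path):
    """Client read path: one metadata RPC, then direct block transfers."""
    data = b""
    for block in namenode.get_block_locations(path):
        replica = block["locations"][0]        # pick any replica
        data += replica.read_block(block["id"])
    return data
```

Note that the NameNode never sees a byte of file data, which is what keeps a single metadata server from becoming a throughput bottleneck.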
Key design details:
- All blocks of a file are the same size except the last one.
- HDFS uses large block sizes because it targets extremely large files processed by MapReduce jobs -- large blocks reduce metadata overhead and improve sequential throughput.
- Each block is identified by a unique 64-bit BlockID.
- All read/write operations operate at the block level.
- DataNodes store each block as a separate local file and serve read/write requests.
- On startup, each DataNode scans its local file system and sends a BlockReport (list of hosted blocks) to the NameNode.
- The NameNode persists file system state using two on-disk structures: FsImage (a checkpoint of metadata at a point in time) and EditLog (a write-ahead log of all metadata changes since the last checkpoint). Together they enable crash recovery.
- The client never sends data through the NameNode -- all data transfers happen directly between the client and DataNodes.
- HDFS replicates every block to multiple DataNodes (default: 3) for fault tolerance and availability.
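The FsImage/EditLog pair above is the classic checkpoint-plus-write-ahead-log recovery pattern. A toy sketch of the idea (illustrative names and JSON format, not actual NameNode code): every mutation is logged before being applied in memory, and a checkpoint dumps the in-memory state and truncates the log.

```python
import json
import os

class TinyMetadataStore:
    """Toy FsImage/EditLog pattern: log first, apply second, checkpoint
    periodically. Recovery = load last checkpoint + replay the log."""

    def __init__(self, fsimage: str, editlog: str):
        self.fsimage, self.editlog = fsimage, editlog
        self.files = {}  # path -> metadata dict (in-memory namespace)
        self._recover()

    def _recover(self):
        if os.path.exists(self.fsimage):       # load the last checkpoint
            with open(self.fsimage) as f:
                self.files = json.load(f)
        if os.path.exists(self.editlog):       # replay edits since then
            with open(self.editlog) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, op):
        if op["op"] == "create":
            self.files[op["path"]] = {"replication": op["replication"]}
        elif op["op"] == "delete":
            self.files.pop(op["path"], None)

    def _log_and_apply(self, op):
        with open(self.editlog, "a") as f:     # write-ahead: durable first
            f.write(json.dumps(op) + "\n")
        self._apply(op)                        # then mutate memory

    def create(self, path, replication=3):
        self._log_and_apply({"op": "create", "path": path,
                             "replication": replication})

    def delete(self, path):
        self._log_and_apply({"op": "delete", "path": path})

    def checkpoint(self):
        with open(self.fsimage, "w") as f:     # write a new "FsImage"
            json.dump(self.files, f)
        open(self.editlog, "w").close()        # truncate the "EditLog"
```

After a crash, constructing a new store from the same two files reproduces the pre-crash namespace, which is exactly the guarantee the NameNode needs.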
When asked "how would you design a distributed file system," start with this architecture: a single metadata server (NameNode) plus many storage servers (DataNodes). Then immediately call out the trade-off -- the NameNode is a single point of failure and a scalability bottleneck (all metadata must fit in its memory). This shows you understand the design's strengths and limits.
The NameNode holds all metadata in memory. Every file, directory, and block consumes approximately 150 bytes. At billions of small files, the NameNode runs out of RAM -- this is a fundamental scalability ceiling, not a bug.
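That ceiling is easy to quantify. A rough estimate, assuming the ~150-bytes-per-object figure above and one metadata object per file plus one per block:

```python
BYTES_PER_OBJECT = 150  # approximate NameNode heap cost per metadata object

def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Estimated NameNode heap in GiB: one object per file,
    plus one object per block of that file."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1024**3

# One billion small files (one block each) needs roughly 280 GiB of heap,
# which is why HDFS struggles with many-small-files workloads.
print(round(namenode_heap_gb(1_000_000_000), 1))
```

The same data stored as fewer, larger files shrinks the object count dramatically, which is one more reason HDFS favors large files and large blocks.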
Comparison between GFS and HDFS
HDFS's architecture mirrors GFS closely, but the terminology and some design choices differ:
| Feature | GFS | HDFS |
|---|---|---|
| Full Name | Google File System | Hadoop Distributed File System |
| Storage Node | ChunkServer | DataNode |
| File Unit | Chunk | Block |
| Default Size | 64 MB, adjustable | 128 MB, adjustable |
| Metadata Checkpoint | Checkpoint image | FsImage |
| Write-Ahead Log | Operation log | EditLog |
| Platform | Linux | Cross-platform (JVM) |
| Language | C++ | Java |
| Availability | Internal to Google | Open source |
| Monitoring | Master receives HeartBeat from ChunkServers | NameNode receives HeartBeat from DataNodes |
| Concurrency Model | Multiple concurrent writers and readers | Write-once, read-many; a single writer per file |
| File Operations | Random writes and record appends supported | Append only; no random writes |
| Garbage Collection | Deleted files renamed to a hidden name, lazily garbage-collected | Deleted files moved to a trash directory, purged after a delay |
| Master Communication | RPC over TCP | RPC over TCP |
| Data Transfer | Pipelining and streaming over TCP | Pipelining and streaming over TCP |
| Cache Management | Clients cache metadata; ChunkServers rely on the Linux buffer cache | Clients cache block locations; DataNodes offer an off-heap block cache |
| Replication Strategy | Replicas spread across racks; Master auto re-replicates when replicas fall below threshold | Rack-aware replication: 2 copies in one rack, 1 in a different rack by default |
| Default Replication | User configurable | 3 by default, configurable |
| Namespace | Hierarchical directory structure | Hierarchical directory structure; Hadoop's FileSystem abstraction also supports stores such as S3 |
| Database Using It | Bigtable | HBase |
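The rack-aware replication row can be sketched as a small placement function. This is illustrative only (not the real BlockPlacementPolicy API), and it assumes every rack has at least two nodes: the 1st replica goes on the writer's rack, the 2nd on a node in a different rack, and the 3rd on another node in that same remote rack, giving 2 copies in one rack and 1 in another.

```python
import random

def place_replicas(racks: dict, writer_rack: str) -> list:
    """Sketch of HDFS's default 3-replica placement.
    racks: mapping of rack name -> list of node names."""
    first = random.choice(racks[writer_rack])          # writer's rack
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)  # same remote rack
    return [first, second, third]
```

The design trades write-pipeline bandwidth (two replicas share a rack, so only one cross-rack transfer) against the fault tolerance of surviving the loss of an entire rack.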