High-level Architecture

How do you store a 10TB file on commodity machines with 1TB disks? You split it into fixed-size blocks, scatter those blocks across a cluster, and keep a single node responsible for tracking where everything lives. That is HDFS in one sentence.

Think first
HDFS was inspired by GFS. If you were building an open-source version of GFS, what would you keep the same and what might you simplify? Think about the concurrency model -- do most Hadoop workloads need concurrent writers to the same file?

HDFS architecture

Every file in HDFS is split into fixed-size blocks (128MB by default, configurable per file). The system stores two things for each file: the actual data (blocks on DataNodes) and the metadata (block locations, file size, replication info on the NameNode).
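The split described above is simple arithmetic: every block is full-size except possibly the last. A minimal sketch (illustrative Python, not Hadoop code):

```python
# Sketch (not HDFS code): how a file maps onto fixed-size blocks.
# All blocks are full-size except possibly the last one.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the final, possibly smaller, block
    return sizes

# A 10 TB file becomes 81,920 blocks of 128 MB each -- small enough to
# scatter across commodity machines with 1 TB disks.
print(len(split_into_blocks(10 * 1024**4)))  # 81920
```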

| Component | Role |
| --- | --- |
| NameNode | Single master that manages all file system metadata |
| DataNodes | Worker nodes that store actual data blocks as local files |
| HDFS Client | Application-facing library; talks to the NameNode for metadata and directly to DataNodes for data |

Key design details:

  • All blocks of a file are the same size except the last one.
  • HDFS uses large block sizes because it targets extremely large files processed by MapReduce jobs -- large blocks reduce metadata overhead and improve sequential throughput.
  • Each block is identified by a unique 64-bit BlockID.
  • All read/write operations operate at the block level.
  • DataNodes store each block as a separate local file and serve read/write requests.
  • On startup, each DataNode scans its local file system and sends a BlockReport (the list of blocks it hosts) to the NameNode; it repeats this report periodically so the NameNode's view stays current.
  • The NameNode persists file system state using two on-disk structures: FsImage (a checkpoint of metadata at a point in time) and EditLog (a write-ahead log of all metadata changes since the last checkpoint). Together they enable crash recovery.
  • The client never sends data through the NameNode -- all data transfers happen directly between the client and DataNodes.
  • HDFS replicates every block to multiple DataNodes (default: 3) for high availability.
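The read path implied by these details can be sketched as a toy model: the client asks the NameNode where a file's blocks live, then streams each block directly from a DataNode. All class and method names below are illustrative, not the real Hadoop API.

```python
# Toy model of the HDFS read path. Data never flows through the NameNode;
# it only hands out block locations.

class NameNode:
    def __init__(self):
        # Metadata only: path -> ordered list of (block_id, [replica DataNode, ...])
        self.files = {}

    def get_block_locations(self, path):
        return self.files[path]

class DataNode:
    def __init__(self):
        # block_id -> bytes; real DataNodes store each block as a local file
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, path):
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        dn = datanodes[replicas[0]]       # pick the first (closest) replica
        data += dn.read_block(block_id)   # direct client-to-DataNode transfer
    return data

# Usage: a two-DataNode "cluster" holding one two-block file.
nn = NameNode()
dns = {"dn1": DataNode(), "dn2": DataNode()}
dns["dn1"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_2"] = b"world"
nn.files["/greeting.txt"] = [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]
print(read_file(nn, dns, "/greeting.txt"))  # b'hello world'
```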
Interview angle

When asked "how would you design a distributed file system," start with this architecture: a single metadata server (NameNode) plus many storage servers (DataNodes). Then immediately call out the trade-off -- the NameNode is a single point of failure and a scalability bottleneck (all metadata must fit in its memory). This shows you understand the design's strengths and limits.

warning

The NameNode holds all metadata in memory. Every file, directory, and block consumes approximately 150 bytes. At billions of small files, the NameNode runs out of RAM -- this is a fundamental scalability ceiling, not a bug.
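A quick back-of-the-envelope calculation makes the ceiling concrete, using the approximate 150-bytes-per-object figure above:

```python
# Back-of-the-envelope NameNode memory estimate. The 150-byte figure is the
# commonly cited approximation for one metadata object in the NameNode heap.

BYTES_PER_OBJECT = 150  # file, directory, or block entry

def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    # Each file costs roughly one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# One billion single-block (small) files: roughly 280 GB of heap
# just for metadata -- the small-files problem in one number.
gb = namenode_heap_bytes(1_000_000_000) / 1024**3
print(f"{gb:.0f} GB")
```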

Think first
HDFS uses 128MB blocks (double GFS's 64MB chunks). Both systems also use a single metadata server. Given what you've learned about GFS's limitations, which problems do you expect HDFS to share with GFS, and which might it avoid by making different design choices?

Comparison between GFS and HDFS

HDFS's architecture mirrors GFS closely, but the terminology and some design choices differ:

| Feature | GFS | HDFS |
| --- | --- | --- |
| Full Name | Google File System | Hadoop Distributed File System |
| Storage Node | ChunkServer | DataNode |
| File Unit | Chunk | Block |
| Default Size | 64 MB, adjustable | 128 MB, adjustable |
| Metadata Checkpoint | Checkpoint image | FsImage |
| Write-Ahead Log | Operation log | EditLog |
| Platform | Linux | Cross-platform |
| Language | C++ | Java |
| Availability | Internal to Google | Open source |
| Monitoring | Master receives HeartBeats from ChunkServers | NameNode receives HeartBeats from DataNodes |
| Concurrency Model | Multiple writers, multiple readers | Write-once, read-many; no concurrent writers |
| File Operations | Appends and random writes supported | Appends only |
| Garbage Collection | Deleted files renamed into a special folder for later GC | Deleted files renamed to a hidden name for later GC |
| Master Communication | RPC over TCP | RPC over TCP |
| Data Transfer | Pipelining and streaming over TCP | Pipelining and streaming over TCP |
| Cache Management | Clients cache metadata; ChunkServers rely on the Linux buffer cache | Distributed cache; off-heap block cache on DataNodes |
| Replication Strategy | Replicas spread across racks; master re-replicates when replicas fall below threshold | Rack-aware: 2 copies in one rack, 1 in a different rack by default |
| Default Replication | User-configurable | 3 by default, configurable |
| Namespace | Hierarchical directory structure | Hierarchical directory structure; also supports S3 and CloudStore backends |
| Database Built on It | Bigtable | HBase |
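HDFS's rack-aware placement can be sketched in a few lines: one replica in the writer's rack, and the remaining two on different nodes within a single other rack, so a whole-rack failure still leaves at least one copy. This is a simplification; the real BlockPlacementPolicyDefault handles many more corner cases.

```python
# Simplified sketch of HDFS's default rack-aware placement for 3 replicas.
import random

def place_replicas(racks: dict, writer_rack: str) -> list:
    """racks maps rack name -> list of node names. Returns 3 chosen nodes."""
    # Replica 1: a node in the writer's own rack (the writer's node in real HDFS).
    first = random.choice(racks[writer_rack])
    # Replicas 2 and 3: two distinct nodes in a single different rack,
    # so losing any one rack never loses all copies of a block.
    other_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[other_rack], 2)
    return [first, second, third]
```

This placement trades a little cross-rack write bandwidth (one inter-rack transfer per block) for rack-level fault tolerance, which is why it is the default rather than spreading all three replicas across three racks.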
Quiz
HDFS uses a write-once, read-many model while GFS supports concurrent writers. What would happen if HDFS allowed multiple concurrent writers to the same file?