Design a File System for Petabyte-Scale Data
The problem
You work at a company that processes massive amounts of data: web crawl archives, satellite imagery, genomics datasets, and application logs. Your data pipeline generates petabytes of files. Individual files are large -- typically hundreds of megabytes to several gigabytes each. Files are written once and then read many times, usually sequentially from beginning to end (e.g., a MapReduce job scanning an entire file).
You run a cluster of 10,000 commodity machines. At this scale, hardware failure is not an exception -- it is a statistical certainty. On any given day, several machines will crash, several disks will fail, and a network switch will flap. Your file system must handle all of this transparently.
The requirements:
- Petabyte-scale storage across thousands of machines
- Large files (100 MB to multi-GB) -- small files are rare
- Write-once, read-many access pattern -- no random writes, no concurrent writers to the same file
- Sequential read throughput is far more important than random read latency
- Automatic fault tolerance -- data survives machine failures without manual intervention
- High aggregate bandwidth -- the system should saturate the network when many clients read simultaneously
What you do NOT need:
- POSIX compliance or low-latency file access
- Support for millions of small files
- Random writes within a file
- File locking or concurrent writer support
Key requirements to identify
| Requirement | What it implies |
|---|---|
| Petabyte scale, 10K machines | Data must be chunked and distributed across machines |
| Large files, sequential reads | Use large chunk sizes (64 MB+) to amortize metadata overhead and reduce seeks |
| Write-once, read-many | Simplifies consistency -- no concurrent write conflicts |
| Hardware failures daily | Replicate each chunk to 3+ machines across different racks |
| High aggregate bandwidth | Clients read directly from chunk servers (the master stays off the data path) and pick the closest replica |
| Simple metadata (not millions of small files) | A single master can hold all metadata in memory |
The design approach
Split every file into large, fixed-size chunks (say 64 MB). Store each chunk on a regular Linux file system on a commodity machine (a "chunk server" or "data node"). A central master node maintains the mapping from file names to chunk lists and from chunks to their physical locations. The master holds all of this metadata in memory for fast lookups.
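The master's two core mappings can be sketched in a few lines. This is an illustrative model, not GFS or HDFS code; the class and field names (`Master`, `ChunkInfo`, `allocate_chunk`) are invented for the example.

```python
# Sketch of the master's in-memory metadata: file -> chunk list,
# chunk -> replica locations. Names here are illustrative, not from
# any real implementation.
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

@dataclass
class ChunkInfo:
    chunk_id: int
    locations: list  # chunk server addresses holding a replica

@dataclass
class Master:
    files: dict = field(default_factory=dict)   # path -> ordered chunk ids
    chunks: dict = field(default_factory=dict)  # chunk id -> ChunkInfo
    next_chunk_id: int = 0

    def create_file(self, path: str) -> None:
        self.files[path] = []

    def allocate_chunk(self, path: str, servers: list) -> int:
        cid = self.next_chunk_id
        self.next_chunk_id += 1
        self.chunks[cid] = ChunkInfo(cid, list(servers))
        self.files[path].append(cid)
        return cid

    def lookup(self, path: str, offset: int) -> ChunkInfo:
        # Translate a byte offset into (chunk, replica locations);
        # the client then talks to a chunk server directly.
        cid = self.files[path][offset // CHUNK_SIZE]
        return self.chunks[cid]
```

Note that the master only answers "which chunk, on which servers"; the data itself never flows through it.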
Why a single master? Because the metadata is small relative to the data. With 64 MB chunks, a petabyte of data is roughly 16 million chunks (and, at 256 MB per file on average, about 4 million files). At tens of bytes of metadata per chunk, that is on the order of a gigabyte -- it fits comfortably in RAM. The master is the simplest correct design -- it avoids the complexity of distributed consensus for metadata operations.
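A quick back-of-the-envelope check of this sizing, assuming ~64 bytes of master metadata per chunk (the GFS paper reports under 64 bytes per 64 MB chunk):

```python
# Back-of-the-envelope metadata sizing for 1 PB of data.
PB = 2**50                  # 1 pebibyte
avg_file = 256 * 2**20      # 256 MB average file size
chunk = 64 * 2**20          # 64 MB chunk size
meta_per_chunk = 64         # assumption: ~64 bytes of metadata per chunk

files = PB // avg_file      # ~4.2 million files
chunks = PB // chunk        # ~16.8 million chunks
meta_bytes = chunks * meta_per_chunk
print(files, chunks, meta_bytes / 2**30)  # 4194304 16777216 1.0 (GiB)
```

About a gigabyte of metadata for a petabyte of data -- a 1000:1 ratio, which is why a single machine's RAM suffices.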
For fault tolerance, replicate each chunk on three different machines, preferably across two racks. When a chunk server fails (detected via missed heartbeats), the master knows which chunks it held and re-replicates them from surviving replicas to other servers.
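The detection-and-repair loop above can be sketched as follows. This is a minimal model with invented names; a real master also rate-limits and prioritizes re-replication (e.g., chunks with only one surviving replica first).

```python
# Heartbeat-based failure detection driving re-replication decisions.
HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat -> server presumed dead
REPLICATION = 3           # target replica count per chunk

class FailureDetector:
    def __init__(self):
        self.last_seen = {}      # server -> timestamp of last heartbeat
        self.server_chunks = {}  # server -> set of chunk ids it reported

    def heartbeat(self, server, chunk_ids, now):
        self.last_seen[server] = now
        self.server_chunks[server] = set(chunk_ids)

    def dead_servers(self, now):
        return [s for s, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def chunks_to_rereplicate(self, now):
        # Chunks held by a dead server whose live replica count has
        # dropped below the target.
        dead = set(self.dead_servers(now))
        lost = set()
        for s in dead:
            lost |= self.server_chunks.get(s, set())
        live_copies = {}
        for s, cs in self.server_chunks.items():
            if s in dead:
                continue
            for c in cs:
                live_copies[c] = live_copies.get(c, 0) + 1
        return {c for c in lost if live_copies.get(c, 0) < REPLICATION}
```

The key point is that the master already knows which chunks the dead server held, so repair needs no scan of the surviving data.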
Writes use a pipeline rather than fan-out: the client sends data to the nearest replica, which forwards it to the next, which forwards it to the third. This avoids the client's network link becoming a bottleneck.
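A toy comparison makes the bandwidth difference concrete. The `send` callback below models one network transfer; everything here is illustrative.

```python
# Pipelined push vs. naive fan-out: count bytes leaving the client.
def pipeline_push(data, replicas, send):
    """Client sends one copy to the first replica; each replica
    forwards to the next, so the client's uplink carries the data once."""
    send("client", replicas[0], data)
    for src, dst in zip(replicas, replicas[1:]):
        send(src, dst, data)

def fanout_push(data, replicas, send):
    """Naive alternative: the client's uplink carries the data N times."""
    for r in replicas:
        send("client", r, data)

def client_bytes(push):
    sent = {}
    def send(src, dst, data):
        sent[src] = sent.get(src, 0) + len(data)
    push(b"x" * 100, ["cs1", "cs2", "cs3"], send)
    return sent.get("client", 0)

print(client_bytes(pipeline_push), client_bytes(fanout_push))  # 100 300
```

With three replicas, fan-out triples the load on the client's uplink; the pipeline instead spreads the forwarding cost across the chunk servers' own links.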
A write-ahead log on the master (called the operation log) records every metadata change before it is applied. If the master crashes, it replays the log to recover state. Periodic checkpoints truncate the log.
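The write-ahead rule (log and fsync before mutating state) can be sketched like this. The JSON-lines log format is an assumption for readability; real operation logs are binary, and checkpointing (omitted here) would snapshot the in-memory state and truncate the log.

```python
# Minimal operation-log sketch: every mutation is made durable in the
# log before the in-memory state changes, so replay repairs a crash.
import json
import os

class MetadataStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.files = {}  # path -> list of chunk ids
        self._log = open(log_path, "a")

    def _apply(self, op):
        if op["type"] == "create":
            self.files[op["path"]] = []
        elif op["type"] == "add_chunk":
            self.files[op["path"]].append(op["chunk"])

    def record(self, op):
        self._log.write(json.dumps(op) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # durable BEFORE applying
        self._apply(op)

    def recover(self):
        # After a crash: rebuild in-memory state by replaying the log.
        self.files = {}
        with open(self.log_path) as f:
            for line in f:
                self._apply(json.loads(line))
```

Ordering matters: if the process applies an operation before the log write is durable, a crash in between loses a change the master may already have acknowledged.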
How the industry solved it
Google built GFS (Google File System), described in a 2003 paper, and the open-source community built HDFS (Hadoop Distributed File System) as an open implementation of the same design. Both follow the same fundamental architecture.
Google File System (GFS)
GFS introduced the single-master, many-chunkserver architecture that defined a generation of distributed storage systems.
Start here: GFS Introduction
Key design decisions:
- 64 MB chunk size -- much larger than typical file system blocks. Reduces metadata size and the number of interactions with the master. See Single Master and Large Chunk Size.
- Operation log -- the master's WAL and the definitive record of metadata changes. See Metadata.
- Lease-based mutation ordering -- the master grants a lease to one replica (the "primary"), which determines the order of mutations. See Write Operation.
- Record append -- GFS's most distinctive operation, allowing multiple clients to append to the same file concurrently with at-least-once semantics. See Append Operation.
Deep dive on fault tolerance: Fault Tolerance, HA, and Data Integrity
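The lease mechanism from the list above can be sketched as a small bookkeeping table on the master. This is a simplified model with invented names: the real GFS master extends leases piggybacked on heartbeats, and the 60-second figure is the initial timeout reported in the paper.

```python
# Master-side lease table: at most one replica holds the primary
# lease for a chunk at a time; expiry lets the master re-grant after
# a primary fails without risking two simultaneous primaries.
class LeaseTable:
    LEASE_DURATION = 60.0  # GFS's initial lease timeout was 60 s

    def __init__(self):
        self.leases = {}  # chunk_id -> (primary_server, expiry_time)

    def grant(self, chunk_id, server, now):
        primary, expiry = self.leases.get(chunk_id, (None, 0.0))
        if expiry > now and primary != server:
            return primary  # an unexpired lease is held elsewhere
        self.leases[chunk_id] = (server, now + self.LEASE_DURATION)
        return server
```

Because the master never grants a second lease while one is unexpired, mutation ordering always has a single authority, even if the old primary is merely partitioned rather than dead.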
Hadoop Distributed File System (HDFS)
HDFS is the open-source implementation of the GFS paper, built as the storage layer for the Hadoop ecosystem. The architecture is nearly identical: a NameNode (master) holds metadata in memory, and DataNodes (chunk servers) store blocks of data.
Start here: HDFS Introduction
Key differences from GFS:
- HDFS targets write-once, read-many and omits GFS's concurrent record append; append support was added later, but for a single writer per file
- HDFS added a high-availability mode with a standby NameNode that replays the edit log in real time. See HDFS High Availability.
- Block size is typically 128 MB (larger than GFS's 64 MB)
- Uses rack-aware replication: first replica on the writer's node, second on a node in a different (remote) rack, third on a different node in that same remote rack. See Deep Dive.
Deep dive: Write Operation | Read Operation | Fault Tolerance
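HDFS's default placement policy (local node, remote rack, same remote rack) can be sketched given a node-to-rack topology map. The function and topology names are illustrative.

```python
# Sketch of HDFS-style rack-aware placement for replication factor 3:
# writer's node, a node on a remote rack, and a second node on that
# same remote rack -- surviving a full rack failure while keeping
# two of three transfers within one rack.
import random

def place_replicas(writer_node, topology):
    """topology maps node -> rack. Returns three distinct target nodes."""
    first = writer_node
    remote = [n for n, r in topology.items() if r != topology[first]]
    second = random.choice(remote)
    third = random.choice([n for n, r in topology.items()
                           if r == topology[second] and n != second])
    return [first, second, third]
```

This layout trades a little read diversity (only two racks instead of three) for cheaper writes, since cross-rack bandwidth is usually the scarce resource.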
Key patterns used
| Pattern | Why it is needed | Reference |
|---|---|---|
| Write-Ahead Log | Master durability -- recover metadata after crash by replaying the operation log | Pattern |
| Heartbeat | Detect failed chunk/data nodes and trigger re-replication | Pattern |
| Checksum | Verify data integrity -- detect bit rot on disk | Pattern |
| Leader and Follower | Primary replica orders mutations; followers replicate | Pattern |
| Lease | Master grants time-limited authority to the primary chunk server | Pattern |
| Split Brain | Prevent two NameNodes from both acting as active (HDFS HA fencing) | Pattern |
| Fencing | Ensure the old active NameNode can no longer mutate state before the standby takes over | Pattern |
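The Checksum row above is easy to make concrete. HDFS checksums every 512 bytes of a block (CRC32C by default); the sketch below uses Python's `zlib.crc32` as a stand-in, which illustrates the mechanism but is not the same polynomial.

```python
# Per-unit checksums to detect bit rot: verify on every read, and
# repair a bad replica from a good one when verification fails.
import zlib

BYTES_PER_CHECKSUM = 512  # HDFS's default checksum granularity

def checksums(data):
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

def verify(data, expected):
    return checksums(data) == expected

block = bytes(2048)
sums = checksums(block)                           # stored beside the block
flipped = block[:1000] + b"\x01" + block[1001:]   # simulate a flipped bit
print(verify(block, sums), verify(flipped, sums))  # True False
```

Fine-grained checksums mean a single flipped bit invalidates only one 512-byte unit, so the reader knows exactly which region is corrupt without re-reading the whole chunk.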
The single-master tradeoff
Both GFS and HDFS bet on a single master, and this bet has consequences:
Advantages: Simplicity, strong consistency for metadata, easy reasoning about correctness.
Disadvantages: The master is a scalability bottleneck (it limits the total number of files) and a single point of failure (mitigated by standby/shadow masters). Google eventually outgrew single-master GFS and built successors with distributed metadata. See Criticism on GFS.