Design a File System for Petabyte-Scale Data

The problem

You work at a company that processes massive amounts of data: web crawl archives, satellite imagery, genomics datasets, and application logs. Your data pipeline generates petabytes of files. Individual files are large -- typically hundreds of megabytes to several gigabytes each. Files are written once and then read many times, usually sequentially from beginning to end (e.g., a MapReduce job scanning an entire file).

You run a cluster of 10,000 commodity machines. At this scale, hardware failure is not an exception -- it is a statistical certainty. On any given day, several machines will crash, several disks will fail, and a network switch will flap. Your file system must handle all of this transparently.

The requirements:

  • Petabyte-scale storage across thousands of machines
  • Large files (100 MB to multi-GB) -- small files are rare
  • Write-once, read-many access pattern -- no random writes, no concurrent writers to the same file
  • Sequential read throughput is far more important than random read latency
  • Automatic fault tolerance -- data survives machine failures without manual intervention
  • High aggregate bandwidth -- the system should saturate the network when many clients read simultaneously

What you do NOT need:

  • POSIX compliance or low-latency file access
  • Support for millions of small files
  • Random writes within a file
  • File locking or concurrent writer support

Key requirements to identify

Requirement                                  | What it implies
---------------------------------------------|------------------------------------------------------------------------
Petabyte scale, 10K machines                 | Data must be chunked and distributed across machines
Large files, sequential reads                | Use large chunk sizes (64 MB+) to amortize metadata overhead and reduce seeks
Write-once, read-many                        | Simplifies consistency -- no concurrent write conflicts
Hardware failures daily                      | Replicate each chunk to 3+ machines across different racks
High aggregate bandwidth                     | Clients should read from the nearest replica to distribute load
Simple metadata (not millions of small files)| A single master can hold all metadata in memory

The design approach

Split every file into large, fixed-size chunks (say 64 MB). Store each chunk on a regular Linux file system on a commodity machine (a "chunk server" or "data node"). A central master node maintains the mapping from file names to chunk lists and from chunks to their physical locations. The master holds all of this metadata in memory for fast lookups.
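The two mappings the master maintains can be sketched as a pair of in-memory dictionaries. This is a minimal illustration with hypothetical names (`Master`, `lookup`, the server IDs), not the actual GFS or HDFS data structures:

```python
# Minimal sketch (hypothetical names) of the master's two in-memory maps:
# file path -> ordered chunk IDs, and chunk ID -> replica locations.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Master:
    def __init__(self):
        self.file_to_chunks = {}   # "/crawl/part-0001" -> ["c1", "c2", ...]
        self.chunk_locations = {}  # "c1" -> {"server-17", "server-42", ...}

    def lookup(self, path, offset):
        """Translate (file path, byte offset) into (chunk ID, replica set)."""
        chunk_index = offset // CHUNK_SIZE
        chunk_id = self.file_to_chunks[path][chunk_index]
        return chunk_id, self.chunk_locations[chunk_id]

m = Master()
m.file_to_chunks["/crawl/part-0001"] = ["c1", "c2"]
m.chunk_locations["c1"] = {"server-17", "server-42", "server-88"}
m.chunk_locations["c2"] = {"server-03", "server-42", "server-91"}

# A read at byte offset 100 MB falls in chunk index 1, i.e. chunk "c2".
chunk_id, replicas = m.lookup("/crawl/part-0001", 100 * 1024 * 1024)
```

The client caches the result of `lookup` and then talks to the chunk servers directly, so the master stays off the data path.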

Why a single master? Because the metadata is small relative to the data. If files average 256 MB and chunks are 64 MB, a petabyte of data is roughly 4 million files and 16 million chunks. At a few tens of bytes of metadata per chunk, all of it fits comfortably in RAM. The master is the simplest correct design -- it avoids the complexity of distributed consensus for metadata operations.
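The back-of-envelope arithmetic is worth making explicit. Assuming roughly 64 bytes of master metadata per chunk (about the figure the GFS paper reports for its master; treat it as an order-of-magnitude estimate):

```python
# Back-of-envelope sizing: master metadata per petabyte of stored data,
# assuming ~64 bytes of master state per chunk (an approximation).

PB = 10**15
chunk_size = 64 * 10**6          # 64 MB chunks
bytes_per_chunk_meta = 64        # assumption: ~64 B of metadata per chunk

chunks = PB // chunk_size        # 15,625,000 chunks per petabyte
meta_bytes = chunks * bytes_per_chunk_meta

print(f"{chunks:,} chunks, {meta_bytes / 10**9:.1f} GB of metadata")
```

Roughly a gigabyte of RAM per petabyte of data: easily within reach of a single commodity machine, which is the whole justification for the single-master bet.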

For fault tolerance, replicate each chunk on three different machines, preferably across two racks. When a chunk server fails (detected via missed heartbeats), the master knows which chunks it held and re-replicates them from surviving replicas to other servers.
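The failure-detection loop above can be sketched in a few lines. This is a hypothetical simplification (names and the 30-second timeout are invented for illustration); real systems also rate-limit and prioritize re-replication:

```python
# Hypothetical sketch: the master tracks the last heartbeat time from each
# chunk server; once a server has been silent too long, chunks that fall
# below the replication target are queued for re-replication from survivors.

HEARTBEAT_TIMEOUT = 30.0   # seconds of silence before a server is presumed dead
REPLICATION_FACTOR = 3

def find_under_replicated(chunk_locations, last_heartbeat, now):
    """Return {chunk_id: surviving_replicas} for chunks below the target count."""
    alive = {s for s, t in last_heartbeat.items() if now - t < HEARTBEAT_TIMEOUT}
    under = {}
    for chunk_id, replicas in chunk_locations.items():
        survivors = replicas & alive
        if len(survivors) < REPLICATION_FACTOR:
            under[chunk_id] = survivors    # copy from any survivor to a new node
    return under

# Server "c" last reported 40 s ago, so it is presumed dead: chunk c1 drops
# to two live replicas and must be re-replicated, while c2 still has three.
chunk_locations = {"c1": {"a", "b", "c"}, "c2": {"a", "b", "d"}}
last_heartbeat = {"a": 95.0, "b": 99.0, "c": 60.0, "d": 98.0}
under = find_under_replicated(chunk_locations, last_heartbeat, now=100.0)
```

Note that the master never scans the data itself; it only consults its in-memory chunk-to-server map, which is why recovery can start within seconds of a missed heartbeat.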

Writes use a pipeline rather than fan-out: the client sends data to the nearest replica, which forwards it to the next, which forwards it to the third. This avoids the client's network link becoming a bottleneck.
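The difference from fan-out is that each machine's uplink carries the payload exactly once. A toy sketch of the forwarding chain (names are illustrative):

```python
# Illustrative sketch of the pipelined push: the client hands the bytes to
# the nearest replica only; each replica stores them and forwards the rest
# of the chain, so no single link carries the data more than once.

def receive(replica, data, rest_of_chain, delivered):
    delivered.append(replica)           # this replica stores the data locally
    if rest_of_chain:                   # ...then forwards down the pipeline
        receive(rest_of_chain[0], data, rest_of_chain[1:], delivered)

# The client sorts the replica chain by network distance before pushing.
delivered = []
receive("r-nearest", b"chunk-bytes", ["r-same-rack", "r-remote-rack"], delivered)
```

In the real systems the forwarding is cut-through (a replica starts forwarding as soon as bytes arrive, before the full chunk lands), which this sequential sketch does not capture.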

A write-ahead log on the master (called the operation log) records every metadata change before it is applied. If the master crashes, it replays the log to recover state. Periodic checkpoints truncate the log.

Think first
Before reading the solution, think about the single-master design. It simplifies consistency enormously, but what are its failure modes? What happens when the master crashes? What happens when the master's memory is full? How would you make the master highly available?

How the industry solved it

Google described GFS (Google File System) in its 2003 paper, and the open-source community built HDFS (Hadoop Distributed File System) as a close reimplementation of the same design. Both follow the same fundamental architecture.

Google File System (GFS)

GFS introduced the single-master, many-chunkserver architecture that defined a generation of distributed storage systems.

Start here: GFS Introduction

Key design decisions:

  • 64 MB chunk size -- much larger than typical file system blocks. Reduces metadata size and the number of interactions with the master. See Single Master and Large Chunk Size.
  • Operation log -- the master's WAL and the definitive record of metadata changes. See Metadata.
  • Lease-based mutation ordering -- the master grants a lease to one replica (the "primary"), which determines the order of mutations. See Write Operation.
  • Record append -- GFS's most distinctive operation, allowing multiple clients to append to the same file concurrently with at-least-once semantics. See Append Operation.
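Record append's at-least-once contract is the subtle one in the list above. A toy illustration (hypothetical names, deliberately simplified): if an append fails partway, the client just retries the whole record, so the same record can land more than once and readers must deduplicate or tolerate repeats:

```python
# Toy illustration of record append's at-least-once semantics: on an error
# the client retries the whole append, so a record may appear twice.

def record_append(chunk, record, fail_after_partial_write=False):
    """The primary replica chooses the offset and appends the record."""
    if fail_after_partial_write:
        chunk.append(record)              # some replicas applied it...
        raise IOError("replica timeout")  # ...but the client saw an error
    chunk.append(record)

chunk = []
try:
    record_append(chunk, "rec-1", fail_after_partial_write=True)
except IOError:
    record_append(chunk, "rec-1")   # retry => duplicate; readers dedupe by ID
```

GFS guarantees each record is written atomically at *some* offset of the primary's choosing, not that it is written exactly once; applications embed record IDs or checksums to skip duplicates and padding.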

Deep dive on fault tolerance: Fault Tolerance, HA, and Data Integrity

Hadoop Distributed File System (HDFS)

HDFS is the open-source implementation of the GFS paper, built as the storage layer for the Hadoop ecosystem. The architecture is nearly identical: a NameNode (master) holds metadata in memory, and DataNodes (chunk servers) store blocks of data.

Start here: HDFS Introduction

Key differences from GFS:

  • HDFS targets write-once, read-many and omits GFS's record append operation
  • HDFS added a high-availability mode with a standby NameNode that replays the edit log in real-time. See HDFS High Availability.
  • Block size is typically 128 MB (larger than GFS's 64 MB)
  • Uses rack-aware replication: first replica on the local node, second on a different node in the same rack, third on a node in a different rack. See Deep Dive.
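The placement rule above can be sketched directly. This follows the policy as stated here, with hypothetical names; real HDFS placement policies differ slightly across versions and also account for node load and free space:

```python
# Sketch of rack-aware placement as described above: local node first, then
# a different node in the same rack, then a node in a different rack.
import random

def place_replicas(writer_node, writer_rack, nodes_by_rack):
    """nodes_by_rack maps rack name -> list of node names in that rack."""
    same_rack = [n for n in nodes_by_rack[writer_rack] if n != writer_node]
    other_racks = [n for rack, nodes in nodes_by_rack.items()
                   if rack != writer_rack for n in nodes]
    return [writer_node, random.choice(same_rack), random.choice(other_racks)]

topology = {"rack-A": ["a1", "a2"], "rack-B": ["b1", "b2"]}
replicas = place_replicas("a1", "rack-A", topology)
```

The intent of the rule: surviving a full rack failure (the third replica is off-rack) while keeping two of the three replica transfers on cheap intra-rack links.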

Deep dive: Write Operation | Read Operation | Fault Tolerance

Key patterns used

Pattern             | Why it is needed                                                              | Reference
--------------------|-------------------------------------------------------------------------------|----------
Write-Ahead Log     | Master durability -- recover metadata after a crash by replaying the operation log | Pattern
Heartbeat           | Detect failed chunk/data nodes and trigger re-replication                     | Pattern
Checksum            | Verify data integrity -- detect bit rot on disk                               | Pattern
Leader and Follower | Primary replica orders mutations; followers replicate                         | Pattern
Lease               | Master grants time-limited authority to the primary chunk server              | Pattern
Split Brain         | Prevent two NameNodes from both acting as active (HDFS HA fencing)            | Pattern
Fencing             | Ensure the old active NameNode cannot serve requests before the standby takes over | Pattern

The single-master tradeoff

Both GFS and HDFS bet on a single master, and this bet has consequences:

Advantages: Simplicity, strong consistency for metadata, easy reasoning about correctness.

Disadvantages: The master is a scalability bottleneck (it limits the total number of files) and a single point of failure (mitigated by standby/shadow masters). Google eventually outgrew single-master GFS and built successors with distributed metadata. See Criticism on GFS.

Design Challenge

Variation: Design a distributed file system for small files

Your platform hosts 500 million user-uploaded photos, each 200 KB to 5 MB. Users upload 10,000 photos per second and read 100,000 per second. Reads must be low-latency (under 50 ms). Unlike GFS/HDFS, you have many small files rather than few large files.