Metadata

The master is only as good as its metadata management. Every file lookup, every chunk placement decision, and every failure recovery depends on metadata being fast, consistent, and durable. GFS handles these three requirements with three distinct strategies.

The master stores three types of metadata:

| Metadata type | Example | Persisted to disk? |
| --- | --- | --- |
| File and chunk namespaces | Directory hierarchy, file names | Yes |
| File-to-chunk mapping | "File X consists of chunks A, B, C" | Yes |
| Chunk replica locations | "Chunk A lives on servers 1, 3, 7" | No -- rebuilt at startup |

The first two types are persisted via an operation log. The third is intentionally not persisted. Why? Because ChunkServers are the source of truth for their own chunks -- the master simply asks them at startup.

Think first
Why would GFS choose NOT to persist chunk locations to disk, instead rebuilding them at startup by polling ChunkServers?

In-memory metadata

All metadata lives in the master's main memory, giving two benefits:

  1. Speed -- metadata lookups are memory-speed, not disk-speed
  2. Background scanning -- the master periodically scans its entire state to perform:
    • Chunk garbage collection
    • Re-replication after ChunkServer failures
    • Chunk migration for load balancing
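Because all chunk state sits in RAM, these background tasks reduce to a cheap linear scan. The sketch below is illustrative, not GFS code: the names `Chunk`, `scan`, and `REPLICATION_TARGET` are assumptions for the example.

```python
# Hypothetical sketch of the master's periodic background scan.
from dataclasses import dataclass, field

REPLICATION_TARGET = 3  # assumed default replica count


@dataclass
class Chunk:
    handle: int
    replicas: set = field(default_factory=set)  # ChunkServer ids
    deleted: bool = False                       # marked for garbage collection


def scan(chunks):
    """One pass over all in-memory chunk state, collecting work to do."""
    to_garbage_collect, to_rereplicate = [], []
    for c in chunks:
        if c.deleted:
            to_garbage_collect.append(c.handle)
        elif len(c.replicas) < REPLICATION_TARGET:
            to_rereplicate.append(c.handle)
    return to_garbage_collect, to_rereplicate


gc, rerep = scan([
    Chunk(1, {1, 3, 7}),
    Chunk(2, {3}),                        # lost two replicas -> re-replicate
    Chunk(3, {1, 2, 5}, deleted=True),    # deleted file -> garbage collect
])
```

A scan like this would be infeasible at this frequency if the state lived on disk; holding it in memory is what makes the periodic full sweep practical.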

The obvious concern with an all-in-memory design is the master running out of RAM. In practice, this is a non-issue: the master keeps less than 64 bytes of metadata per 64MB chunk, and file names are stored compactly using prefix compression, keeping namespace data under 64 bytes per file. If the cluster outgrows available memory, adding RAM to the master is a trivial cost compared to the simplicity and performance it buys.

Interview angle

"Doesn't keeping everything in memory limit the system?" is a common follow-up question. The math shuts it down: at 64 bytes per chunk and 64MB per chunk, you need roughly 1GB of master memory per petabyte of stored data. That's well within the range of commodity hardware.
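The arithmetic behind that claim is easy to verify:

```python
# Back-of-envelope check: master metadata needed per petabyte of stored data.
CHUNK_SIZE = 64 * 2**20      # 64 MB per chunk
METADATA_PER_CHUNK = 64      # < 64 bytes of master metadata per chunk
PETABYTE = 2**50

chunks_per_pb = PETABYTE // CHUNK_SIZE            # 16,777,216 chunks
metadata_bytes = chunks_per_pb * METADATA_PER_CHUNK
print(metadata_bytes / 2**30)                     # -> 1.0 (GiB per PB stored)
```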

Chunk locations

The master does not persist chunk locations. Instead:

  1. At startup, the master polls every ChunkServer for its chunk inventory
  2. As the cluster runs, HeartBeat messages keep the master updated
  3. The master controls all chunk placements, so it learns about new chunks as they're created
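The steps above amount to treating the location map as a soft-state cache of what ChunkServers report. A minimal sketch, with invented names (`handle_report`, `handle_server_death`) standing in for the real message handlers:

```python
# Illustrative sketch (not GFS code): the master's chunk-location map is a
# plain in-memory dict, rebuilt from ChunkServer reports rather than disk.
chunk_locations = {}   # chunk handle -> set of ChunkServer ids


def handle_report(server_id, chunk_handles):
    """Process a startup inventory or periodic HeartBeat from one server."""
    for handle in chunk_handles:
        chunk_locations.setdefault(handle, set()).add(server_id)


def handle_server_death(server_id):
    """Drop a failed server; its chunks may now need re-replication."""
    for replicas in chunk_locations.values():
        replicas.discard(server_id)


handle_report(1, ["A", "B"])   # startup inventory from server 1
handle_report(3, ["A"])        # HeartBeat from server 3
handle_server_death(1)         # server 1 fails; chunk B is now replica-less
```

Note there is no persistence and no reconciliation logic: the map is simply overwritten by whatever the ChunkServers say, which is exactly why it never drifts out of sync.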

This design eliminates the problem of keeping the master and ChunkServers in sync. In a cluster with hundreds of machines, disks fail, servers get renamed, and chunks vanish spontaneously. Trying to maintain a persistent, consistent view of chunk locations on the master would create an endless synchronization headache.

warning

This means that after a master restart, the system cannot serve reads until it has heard back from enough ChunkServers to know where chunks live. This is one reason master recovery time matters -- and why GFS uses checkpointing to minimize it.

Think first
If the master crashes and restarts, how would you recover its state while minimizing recovery time?

Operation log

The operation log records every metadata change (namespace mutations and file-to-chunk mapping updates) in a persistent, ordered log. It serves two critical functions:

  1. Durability -- it is the persistent record of all metadata
  2. Ordering -- it defines a global timeline for concurrent operations

For fault tolerance, the operation log is replicated to multiple remote machines. No metadata change becomes visible to clients until it has been flushed to all replicas. The master batches log records before flushing to reduce the performance impact.
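The commit rule can be sketched as follows. This is a simplification under assumed names: `OperationLog` and the list-based replicas stand in for real remote log files, and the in-process `extend` stands in for a network flush.

```python
# Hedged sketch of the commit rule: a batch of log records becomes durable
# (and the mutations visible) only after it reaches every replica.
class OperationLog:
    def __init__(self, replicas):
        self.replicas = replicas    # e.g. local disk plus remote machines
        self.buffer = []            # batch records to amortize flush cost

    def append(self, record):
        """Buffer a metadata mutation; not yet visible to clients."""
        self.buffer.append(record)

    def commit(self):
        """Flush the batch to all replicas; only then acknowledge clients."""
        batch, self.buffer = self.buffer, []
        for replica in self.replicas:
            replica.extend(batch)   # must succeed on every replica
        return batch


disk, remote = [], []
log = OperationLog([disk, remote])
log.append(("create", "/foo"))
log.append(("map", "/foo", ["A"]))
log.commit()
```

Batching several records into one `commit` is what keeps the flush-to-all-replicas requirement from throttling mutation throughput.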

On restart, the master replays the operation log to reconstruct its state. To keep this fast, GFS periodically checkpoints the log.

Checkpointing

The master serializes its state into a compact B-tree-like format that can be memory-mapped directly -- no parsing needed on recovery. The checkpoint process works as follows:

  1. The master switches to a new log file
  2. A background thread creates the checkpoint from the current state
  3. New mutations write to the new log file concurrently -- no blocking

On recovery, the master loads the most recent checkpoint into memory and replays only the mutations recorded after that checkpoint. This keeps startup time short regardless of how long the system has been running.
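The recovery path can be sketched with a toy state machine. The names here (`recover`, sequence-numbered records, a dict for master state) are assumptions for illustration, not the real GFS format:

```python
# Recovery sketch: start from the checkpoint, replay only the log records
# written after it. In real GFS the checkpoint is a memory-mapped B-tree-like
# image; a dict stands in for it here.
def recover(checkpoint_state, log_records, checkpoint_seq):
    """Rebuild master state from a checkpoint plus the log's tail."""
    state = dict(checkpoint_state)
    for seq, op, key, value in log_records:
        if seq <= checkpoint_seq:
            continue                        # already reflected in checkpoint
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


state = recover(
    {"/a": "chunks A,B"},                  # checkpoint taken at seq 2
    [(1, "set", "/a", "chunks A"),         # older than checkpoint: skipped
     (3, "set", "/b", "chunk C"),
     (4, "delete", "/a", None)],
    checkpoint_seq=2,
)
```

Replay cost is proportional to the log tail since the last checkpoint, not to the cluster's lifetime, which is the whole point of checkpointing.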

Interview angle

The operation log + checkpoint pattern appears everywhere in distributed systems. BigTable uses a similar write-ahead log, and HDFS NameNode uses an equivalent fsimage + edit log structure. When discussing any system with a single coordinator, explaining how you'd persist and recover its state is essential.

Quiz
What would happen if the GFS master persisted chunk locations to disk instead of rebuilding them from ChunkServer reports at startup?