Master Operations

The master does far more than store metadata. It actively manages the cluster: acquiring locks, placing replicas, rebalancing load, and detecting stale data. Each of these operations must work correctly even under concurrent access and frequent hardware failure.

The master's key responsibilities:

  • Making replica placement decisions
  • Creating new chunks and replicas
  • Ensuring chunks meet their replication factor
  • Balancing load across ChunkServers
  • Reclaiming unused storage
Think first
If the file system required a write lock on the parent directory for every file creation, what would happen when hundreds of MapReduce tasks create files in the same directory simultaneously?

Namespace management and locking

GFS does not use a traditional i-node tree structure for its namespace. Instead, it uses a flat hash map that maps full pathnames to metadata, with reader-writer locks on each entry for synchronization.

inode

An 'i-node' (index node) is a data structure used in Unix-style file systems to manage metadata about a file or directory. It holds everything about a file except its name and its data. Files are stored on disk as fixed-size blocks, and any file larger than one block spans several of them. The i-node records which storage blocks hold the file's data, along with metadata such as size, owner, timestamps of last change, and permissions.

The locking scheme works as follows:

| Operation | Locks acquired |
| --- | --- |
| Read/write /dir1/dir2/leaf | Read locks on /dir1 and /dir1/dir2; read or write lock on /dir1/dir2/leaf |
| Create file in /dir1/dir2/ | Read locks on /dir1 and /dir1/dir2 (no write lock on the parent!), plus a write lock on the new file's full pathname |
| Delete /dir1/dir2/ | Read lock on /dir1, write lock on /dir1/dir2 -- blocked while any operation on a child holds a read lock on /dir1/dir2 |

Key design choices:

  • Concurrent writes to the same leaf are prevented by the write lock
  • Concurrent modifications to files in the same directory are allowed (file creation only needs a read lock on the parent)
  • Deadlock prevention: locks are acquired in a consistent order -- first by depth in the tree, then lexicographically within the same level
Interview angle

The insight that file creation needs only a read lock on the parent directory (not a write lock) is a concurrency optimization worth calling out. It allows multiple files to be created concurrently in the same directory -- critical for workloads where many MapReduce tasks write output files to the same directory simultaneously.
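The lock-acquisition rules above can be sketched as a small helper that computes which locks an operation takes and in what order. This is an illustrative sketch, not GFS's actual code; `lock_plan` and its return format are invented for this example. The key properties it demonstrates: ancestors get read locks only, and the plan follows a fixed order (by depth, then lexicographically by pathname), so any two operations acquire overlapping locks in the same order and cannot deadlock.

```python
def lock_plan(path, mode):
    """Return the ordered list of (pathname, lock_mode) pairs a
    GFS-style master would acquire for an operation on `path`.

    `mode` is the lock taken on the final component ("read" or
    "write"); every ancestor directory gets only a read lock.
    Prefixes are naturally ordered by depth, matching the
    deadlock-free global ordering described above.
    """
    parts = path.strip("/").split("/")
    # Build every prefix: /dir1, /dir1/dir2, ..., full path.
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    # Read locks on all ancestors, requested mode on the leaf.
    return [(p, "read") for p in prefixes[:-1]] + [(prefixes[-1], mode)]
```

Note that two concurrent file creations in `/dir1/dir2` both take only read locks on the parent, so they proceed in parallel; a concurrent delete of `/dir1/dir2` needs the write lock and must wait.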

Replica placement

The master places replicas across different racks, not just different machines. This ensures data survives rack-level failures (e.g., a top-of-rack switch dies) and allows clients to read from multiple racks in parallel.

| Goal | Benefit | Tradeoff |
| --- | --- | --- |
| Cross-rack placement | Survives rack failures; parallel reads from multiple racks | Writes must travel across racks (higher latency) |
| Low-utilization servers first | Balances disk usage | May conflict with locality |
| Limit recent creations per server | Prevents write-traffic storms after chunk creation | Slightly less optimal placement |
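The three placement heuristics can be combined in a short selection routine. This is a hedged sketch: the function name, the server fields (`name`, `rack`, `disk_used`, `recent_creations`), and the thresholds are all invented for illustration, not taken from GFS.

```python
def choose_replica_servers(servers, replication_factor=3, max_recent=2):
    """Pick ChunkServers for a new chunk's replicas.

    Heuristics from the section above:
      - skip servers with too many recent chunk creations
        (avoids write-traffic storms),
      - prefer servers with low disk utilization,
      - never place two replicas on the same rack.
    `servers` is a list of dicts with hypothetical fields.
    """
    # Filter out servers that have absorbed many recent creations.
    eligible = [s for s in servers if s["recent_creations"] < max_recent]
    # Try the least-utilized servers first.
    eligible.sort(key=lambda s: s["disk_used"])
    chosen, used_racks = [], set()
    for s in eligible:
        if s["rack"] in used_racks:  # enforce cross-rack placement
            continue
        chosen.append(s["name"])
        used_racks.add(s["rack"])
        if len(chosen) == replication_factor:
            break
    return chosen
```

If fewer distinct racks are available than the replication factor, a real system would need a fallback (e.g., relax the rack constraint); the sketch simply returns fewer servers.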

Replica creation and re-replication

When the available replica count drops below the replication factor (due to corruption, server failure, or disk death), the master triggers re-replication. Instead of cloning all under-replicated chunks at once (which would overwhelm the network), the master prioritizes:

  1. Distance from replication goal -- a chunk missing two replicas gets priority over one missing a single replica
  2. Live files over deleted files -- chunks belonging to active files take precedence over chunks from files pending garbage collection

The master also throttles re-replication bandwidth per ChunkServer to avoid starving client requests.
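The two-level priority order can be expressed as a simple sort key. The function below is an illustrative sketch with invented field names (`id`, `live_replicas`, `file_deleted`), not GFS's scheduler; it shows only the ordering rules named above.

```python
def rereplication_order(chunks, replication_factor=3):
    """Order under-replicated chunks for cloning.

    Priority rules from the section above:
      1. farther below the replication goal first
         (a chunk missing two replicas beats one missing one),
      2. chunks of live files before chunks of deleted files
         awaiting garbage collection.
    """
    under = [c for c in chunks if c["live_replicas"] < replication_factor]
    # Sort by deficit descending, then live (False) before deleted (True).
    under.sort(key=lambda c: (-(replication_factor - c["live_replicas"]),
                              c["file_deleted"]))
    return [c["id"] for c in under]
```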

Replica rebalancing

The master periodically rebalances replicas to even out disk usage and load. When new ChunkServers join the cluster, the master fills them gradually rather than flooding them with writes.

Think first
How would the master detect that a ChunkServer's data is stale after it comes back online, without comparing data byte-by-byte?

Stale replica detection

A ChunkServer that was down during a mutation will have a stale chunk replica. GFS detects this using chunk version numbers:

  1. Every time the master grants a lease for a chunk, it increments the version number
  2. All up-to-date replicas (and the master) record the new version
  3. A ChunkServer that missed the update still has the old version number
  4. On restart, the ChunkServer reports its chunks and versions -- the master detects the mismatch

Stale replicas are excluded from client reads and mutations. They are eventually removed during garbage collection.
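The four steps above amount to simple bookkeeping on the master. Here is a minimal sketch (class and method names are invented for illustration) of the version table and the staleness check performed when a ChunkServer reports in:

```python
class ChunkVersionTable:
    """Master-side chunk version bookkeeping (illustrative sketch)."""

    def __init__(self):
        self.versions = {}  # chunk_id -> current version number

    def grant_lease(self, chunk_id):
        # Step 1: bump the version every time a lease is granted.
        # Up-to-date replicas record this new version (step 2).
        self.versions[chunk_id] = self.versions.get(chunk_id, 0) + 1
        return self.versions[chunk_id]

    def find_stale(self, reported):
        # Steps 3-4: a restarting ChunkServer reports
        # (chunk_id, version) pairs; anything older than the
        # master's recorded version is stale.
        return [cid for cid, v in reported
                if v < self.versions.get(cid, 0)]
```

No byte-by-byte comparison is needed: a single integer per chunk is enough to tell that a replica missed a mutation.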

warning

Clients may still briefly read stale data from cached chunk locations pointing to a stale replica. However, because GFS workloads are append-dominant, a stale replica typically returns a premature end-of-chunk rather than incorrect data -- the client sees less data, not wrong data. This is a deliberate consistency tradeoff.

Quiz
What would happen if GFS placed all three replicas of a chunk on the same rack instead of spreading them across different racks?