Master Operations

The master does far more than store metadata. It actively manages the cluster: acquiring locks, placing replicas, rebalancing load, and detecting stale data. Each of these operations must work correctly even under concurrent access and frequent hardware failure.

The master's key responsibilities:

  • Making replica placement decisions
  • Creating new chunks and replicas
  • Ensuring chunks meet their replication factor
  • Balancing load across ChunkServers
  • Reclaiming unused storage
Think first
If the file system required a write lock on the parent directory for every file creation, what would happen when hundreds of MapReduce tasks create files in the same directory simultaneously?

Namespace management and locking

GFS does not use a traditional i-node tree structure for its namespace. Instead, it uses a flat hash map that maps full pathnames to metadata, with reader-writer locks on each entry for synchronization.

inode

An 'i-node' (index node) is a data structure used in Unix-style file systems to manage metadata about a file or directory. It holds everything about a file except its name and its data. Files are stored on disk as fixed-size blocks, and any file larger than one block spans several of them. The i-node records which storage blocks hold the file's data, along with metadata such as size, owner, timestamps of last change, and permissions.

The locking scheme works as follows:

| Operation | Locks acquired |
| --- | --- |
| Read/write /dir1/dir2/leaf | Read locks on /dir1 and /dir1/dir2; read or write lock on /dir1/dir2/leaf |
| Create file in /dir1/dir2/ | Read locks on /dir1 and /dir1/dir2 (no write lock on the parent!), plus a write lock on the new file's full pathname |
| Delete /dir1/dir2/ | Read lock on /dir1, write lock on /dir1/dir2 -- blocked while any operation on a child holds a read lock on /dir1/dir2 |

Key design choices:

  • Concurrent writes to the same leaf are prevented by the write lock
  • Concurrent modifications to files in the same directory are allowed (file creation only needs a read lock on the parent)
  • Deadlock prevention: locks are acquired in a consistent order -- first by depth in the tree, then lexicographically within the same level
Interview angle

The insight that file creation needs only a read lock on the parent directory (not a write lock) is a concurrency optimization worth calling out. It allows multiple files to be created concurrently in the same directory -- critical for workloads where many MapReduce tasks write output files to the same directory simultaneously.
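The lock-acquisition rules above can be sketched as a small helper that computes which locks an operation takes and in what order. This is an illustrative sketch, not GFS's actual code; `lock_plan` and its return format are invented for this example. The key properties it demonstrates: ancestors get read locks only, and the plan follows a fixed order (by depth, then lexicographically by pathname), so any two operations acquire overlapping locks in the same order and cannot deadlock.

```python
def lock_plan(path, mode):
    """Return the ordered list of (pathname, lock_mode) pairs a
    GFS-style master would acquire for an operation on `path`.

    `mode` is the lock taken on the final component ("read" or
    "write"); every ancestor directory gets only a read lock.
    Prefixes are naturally ordered by depth, matching the
    deadlock-free global ordering described above.
    """
    parts = path.strip("/").split("/")
    # Build every prefix: /dir1, /dir1/dir2, ..., full path.
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    # Read locks on all ancestors, requested mode on the leaf.
    return [(p, "read") for p in prefixes[:-1]] + [(prefixes[-1], mode)]
```

Note that two concurrent file creations in `/dir1/dir2` both take only read locks on the parent, so they proceed in parallel; a concurrent delete of `/dir1/dir2` needs the write lock and must wait.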

Replica placement

The master places replicas across different racks, not just different machines. This ensures data survives rack-level failures (e.g., a top-of-rack switch dies) and allows clients to read from multiple racks in parallel.

| Goal | Benefit | Tradeoff |
| --- | --- | --- |
| Cross-rack placement | Survives rack failures; parallel reads from multiple racks | Writes must travel across racks (higher latency) |
| Low-utilization servers first | Balances disk usage | May conflict with locality |
| Limit recent creations per server | Prevents write-traffic storms after chunk creation | Slightly less optimal placement |
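The three placement heuristics can be combined in a short selection routine. This is a hedged sketch: the function name, the server fields (`name`, `rack`, `disk_used`, `recent_creations`), and the thresholds are all invented for illustration, not taken from GFS.

```python
def choose_replica_servers(servers, replication_factor=3, max_recent=2):
    """Pick ChunkServers for a new chunk's replicas.

    Heuristics from the section above:
      - skip servers with too many recent chunk creations
        (avoids write-traffic storms),
      - prefer servers with low disk utilization,
      - never place two replicas on the same rack.
    `servers` is a list of dicts with hypothetical fields.
    """
    # Filter out servers that have absorbed many recent creations.
    eligible = [s for s in servers if s["recent_creations"] < max_recent]
    # Try the least-utilized servers first.
    eligible.sort(key=lambda s: s["disk_used"])
    chosen, used_racks = [], set()
    for s in eligible:
        if s["rack"] in used_racks:  # enforce cross-rack placement
            continue
        chosen.append(s["name"])
        used_racks.add(s["rack"])
        if len(chosen) == replication_factor:
            break
    return chosen
```

If fewer distinct racks are available than the replication factor, a real system would need a fallback (e.g., relax the rack constraint); the sketch simply returns fewer servers.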

Replica creation and re-replication

When the available replica count drops below the replication factor (due to corruption, server failure, or disk death), the master triggers re-replication. Instead of cloning all under-replicated chunks at once (which would overwhelm the network), the master prioritizes:

  1. Distance from replication goal -- a chunk missing two replicas gets priority over one missing a single replica
  2. Live files over deleted files -- chunks belonging to active files take precedence over chunks from files pending garbage collection

The master also throttles re-replication bandwidth per ChunkServer to avoid starving client requests.
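The two-level priority order can be expressed as a simple sort key. The function below is an illustrative sketch with invented field names (`id`, `live_replicas`, `file_deleted`), not GFS's scheduler; it shows only the ordering rules named above.

```python
def rereplication_order(chunks, replication_factor=3):
    """Order under-replicated chunks for cloning.

    Priority rules from the section above:
      1. farther below the replication goal first
         (a chunk missing two replicas beats one missing one),
      2. chunks of live files before chunks of deleted files
         awaiting garbage collection.
    """
    under = [c for c in chunks if c["live_replicas"] < replication_factor]
    # Sort by deficit descending, then live (False) before deleted (True).
    under.sort(key=lambda c: (-(replication_factor - c["live_replicas"]),
                              c["file_deleted"]))
    return [c["id"] for c in under]
```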

Replica rebalancing

The master periodically rebalances replicas to even out disk usage and load. When new ChunkServers join the cluster, the master fills them gradually rather than flooding them with writes.

Think first
How would the master detect that a ChunkServer's data is stale after it comes back online, without comparing data byte-by-byte?

Stale replica detection

A ChunkServer that was down during a mutation will have a stale chunk replica. GFS detects this using chunk version numbers:

  1. Every time the master grants a lease for a chunk, it increments the version number
  2. All up-to-date replicas (and the master) record the new version
  3. A ChunkServer that missed the update still has the old version number
  4. On restart, the ChunkServer reports its chunks and versions -- the master detects the mismatch

Stale replicas are excluded from client reads and mutations. They are eventually removed during garbage collection.
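The four steps above amount to simple bookkeeping on the master. Here is a minimal sketch (class and method names are invented for illustration) of the version table and the staleness check performed when a ChunkServer reports in:

```python
class ChunkVersionTable:
    """Master-side chunk version bookkeeping (illustrative sketch)."""

    def __init__(self):
        self.versions = {}  # chunk_id -> current version number

    def grant_lease(self, chunk_id):
        # Step 1: bump the version every time a lease is granted.
        # Up-to-date replicas record this new version (step 2).
        self.versions[chunk_id] = self.versions.get(chunk_id, 0) + 1
        return self.versions[chunk_id]

    def find_stale(self, reported):
        # Steps 3-4: a restarting ChunkServer reports
        # (chunk_id, version) pairs; anything older than the
        # master's recorded version is stale.
        return [cid for cid, v in reported
                if v < self.versions.get(cid, 0)]
```

No byte-by-byte comparison is needed: a single integer per chunk is enough to tell that a replica missed a mutation.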

warning

Clients may still briefly read stale data from cached chunk locations pointing to a stale replica. However, because GFS workloads are append-dominant, a stale replica typically returns a premature end-of-chunk rather than incorrect data -- the client sees less data, not wrong data. This is a deliberate consistency tradeoff.

Quiz
What would happen if GFS placed all three replicas of a chunk on the same rack instead of spreading them across different racks?