Master Operations
The master does far more than store metadata. It actively manages the cluster: acquiring locks, placing replicas, rebalancing load, and detecting stale data. Each of these operations must work correctly even under concurrent access and frequent hardware failure.
The master's key responsibilities:
- Making replica placement decisions
- Creating new chunks and replicas
- Ensuring chunks meet their replication factor
- Balancing load across ChunkServers
- Reclaiming unused storage
Namespace management and locking
GFS does not use a traditional i-node tree structure for its namespace. Instead, it uses a flat hash map that maps full pathnames to metadata, with reader-writer locks on each entry for synchronization.
An 'i-node' (index node) is a data structure used in Unix-style file systems to manage metadata about a file or directory. It stores everything about a file except its name and its data. A file is stored on disk as fixed-size blocks; any file larger than one block is split across multiple blocks, and the i-node records which blocks hold the file's data. It also stores metadata such as size, owner, timestamps of last change, and permissions.
The locking scheme works as follows:
| Operation on /dir1/dir2/leaf | Locks acquired |
|---|---|
| Read/write leaf | Read lock on /dir1, read lock on /dir1/dir2, read or write lock on /dir1/dir2/leaf |
| Create file in /dir1/dir2/ | Read lock on /dir1, read lock on /dir1/dir2, write lock on the new file's full pathname (no write lock on the parent!) |
| Delete /dir1/dir2/ | Requires write lock on /dir1/dir2 -- blocked while any operation on a child path holds a read lock on /dir1/dir2 |
Key design choices:
- Concurrent writes to the same leaf are prevented by the write lock
- Concurrent modifications to files in the same directory are allowed (file creation only needs a read lock on the parent)
- Deadlock prevention: locks are acquired in a consistent order -- first by depth in the tree, then lexicographically within the same level
The insight that file creation needs only a read lock on the parent directory (not a write lock) is a concurrency optimization worth calling out. It allows multiple files to be created concurrently in the same directory -- critical for workloads where many MapReduce tasks write output files to the same directory simultaneously.
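The lock set for a given pathname can be sketched in a few lines. The helper below is illustrative (not part of GFS): it computes which per-path locks an operation must take, read-locking every ancestor and read- or write-locking the full pathname itself, then orders them by depth and lexicographically to preserve the deadlock-free acquisition order.

```python
def locks_needed(path, write=False):
    """Return the (path, mode) lock set for an operation on `path`.

    Illustrative sketch of the GFS namespace-locking scheme: read
    locks on every ancestor directory name, plus a read or write
    lock on the full pathname being operated on.
    """
    parts = path.strip("/").split("/")
    # "/dir1/dir2/leaf" -> ancestors ["/dir1", "/dir1/dir2"]
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    needed = [(p, "read") for p in ancestors]
    needed.append((path, "write" if write else "read"))
    # Acquire in a consistent global order -- by depth in the tree,
    # then lexicographically -- so two operations can never wait on
    # each other in a cycle.
    return sorted(needed, key=lambda t: (t[0].count("/"), t[0]))
```

Creating a file would call this with `write=True` on the new file's pathname: the parent directories get only read locks, so many creations in the same directory proceed concurrently.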
Replica placement
The master places replicas across different racks, not just different machines. This ensures data survives rack-level failures (e.g., a top-of-rack switch dies) and allows clients to read from multiple racks in parallel.
| Goal | Benefit | Tradeoff |
|---|---|---|
| Cross-rack placement | Survives rack failures; parallel reads from multiple racks | Writes must travel across racks (higher latency) |
| Low-utilization servers first | Balances disk usage | May conflict with locality |
| Limit recent creations per server | Prevents write-traffic storms after chunk creation | Slightly less optimal placement |
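The three placement rules in the table compose naturally into a selection policy. The sketch below is a hypothetical implementation, assuming each server is described by illustrative fields (`name`, `rack`, `disk_used`, `recent_creates`): it filters out servers that recently absorbed many new chunks, prefers low disk utilization, and spreads replicas across racks first.

```python
def place_replicas(servers, num_replicas=3, max_recent_creates=5):
    """Pick chunkservers for a new chunk's replicas.

    Sketch of the placement heuristics above; `servers` is a list of
    dicts with illustrative fields: name, rack, disk_used (0.0-1.0),
    and recent_creates.
    """
    # Limit recent creations per server: new chunks attract imminent
    # write traffic, so a server that just received many chunks is
    # skipped to avoid a write-traffic storm.
    eligible = [s for s in servers if s["recent_creates"] < max_recent_creates]
    # Prefer servers with low disk utilization to even out disk usage.
    eligible.sort(key=lambda s: s["disk_used"])

    chosen, racks_used = [], set()
    # First pass: one replica per rack, for rack-failure tolerance.
    for s in eligible:
        if len(chosen) == num_replicas:
            break
        if s["rack"] not in racks_used:
            chosen.append(s)
            racks_used.add(s["rack"])
    # Second pass: if there are fewer racks than replicas, fill the
    # remaining slots from other machines (never the same machine twice).
    for s in eligible:
        if len(chosen) == num_replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return chosen
```

Note how the rack-spreading pass runs before the fill pass: surviving a rack failure outranks perfectly even disk usage, which mirrors the tradeoff column above.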
Replica creation and re-replication
When the available replica count drops below the replication factor (due to corruption, server failure, or disk death), the master triggers re-replication. Instead of cloning all under-replicated chunks at once (which would overwhelm the network), the master prioritizes:
- Distance from replication goal -- a chunk missing two replicas gets priority over one missing a single replica
- Live files over deleted files -- chunks belonging to active files take precedence over chunks from files pending garbage collection
The master also throttles re-replication bandwidth per ChunkServer to avoid starving client requests.
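The two priority rules amount to a sort key over under-replicated chunks. A minimal sketch, assuming illustrative chunk fields (`live_replicas`, `file_live`) and a cluster-wide replication factor of 3:

```python
import heapq

REPLICATION_FACTOR = 3

def reclone_priority(chunk):
    """Sort key for re-replication urgency: lower sorts first."""
    missing = REPLICATION_FACTOR - chunk["live_replicas"]
    # A chunk missing two replicas outranks one missing a single
    # replica; among equals, chunks of live files outrank chunks of
    # files pending garbage collection.
    return (-missing, 0 if chunk["file_live"] else 1)

def next_chunks_to_clone(under_replicated, batch_size):
    """Pop the most urgent chunks. batch_size caps how many clone
    operations run at once, so re-replication is throttled rather
    than overwhelming the network."""
    return heapq.nsmallest(batch_size, under_replicated, key=reclone_priority)
```

The `batch_size` cap stands in for the per-ChunkServer bandwidth throttle: the master only ever schedules a bounded amount of cloning, leaving headroom for client requests.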
Replica rebalancing
The master periodically rebalances replicas to even out disk usage and load. When new ChunkServers join the cluster, the master fills them gradually rather than flooding them with writes.
Stale replica detection
A ChunkServer that was down during a mutation will have a stale chunk replica. GFS detects this using chunk version numbers:
- Every time the master grants a lease for a chunk, it increments the version number
- All up-to-date replicas (and the master) record the new version
- A ChunkServer that missed the update still has the old version number
- On restart, the ChunkServer reports its chunks and versions -- the master detects the mismatch
Stale replicas are excluded from client reads and mutations. They are eventually removed during garbage collection.
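The version-number protocol is simple enough to sketch end to end. The structures below are illustrative (a dict of chunk versions on the master, one per replica); the two functions show the lease grant bumping the version on every reachable replica, and the mismatch check the master runs when a ChunkServer reports in:

```python
def grant_lease(master_versions, chunk_id, replicas_up):
    """Master grants a lease: bump the chunk's version number and
    record it on the master and every currently reachable replica.
    A replica that is down at this moment keeps the old version."""
    master_versions[chunk_id] += 1
    for replica in replicas_up:
        replica[chunk_id] = master_versions[chunk_id]

def find_stale(master_versions, reported):
    """On restart, a ChunkServer reports {chunk_id: version}; any
    version behind the master's marks a stale replica, which is
    excluded from reads and later garbage-collected."""
    return [cid for cid, version in reported.items()
            if version < master_versions.get(cid, version)]
```

A quick walk-through: if replica `r2` is down when a lease is granted, `r1` and the master move to version 5 while `r2` stays at 4; when `r2` restarts and reports its chunks, the mismatch flags it as stale.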
Clients may still briefly read stale data from cached chunk locations pointing to a stale replica. However, because GFS workloads are append-dominant, a stale replica typically returns a premature end-of-chunk rather than incorrect data -- the client sees less data, not wrong data. This is a deliberate consistency tradeoff.