Anatomy of a Read Operation
When a client wants to read a file, how does it find the right data across a cluster of hundreds of ChunkServers? The answer is a single metadata lookup followed by direct data transfer: the master is consulted briefly, then stays out of the data path entirely.
Step-by-step read flow
| Step | Actor | Action |
|---|---|---|
| 1 | Client | Translates the file name + byte offset into a chunk index (trivial math given the fixed 64 MB chunk size) |
| 2 | Client -> Master | Sends an RPC with the file name and chunk index |
| 3 | Master -> Client | Returns the chunk handle and locations of all replicas holding that chunk |
| 4 | Client | Caches the metadata (keyed by file name + chunk index) for future reads |
| 5 | Client -> ChunkServer | Sends a read request to the closest replica, specifying the chunk handle and byte range |
| 6 | ChunkServer -> Client | Returns the requested data |
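The six steps above can be sketched in a few lines. This is a minimal illustration, not GFS's actual client library: the `master` stub, the `chunkservers` map, and the method names are all assumptions, and "closest replica" is reduced to picking the first entry.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB GFS chunk size

def chunk_coordinates(byte_offset):
    """Step 1: trivial math from a byte offset to (chunk index, offset within chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

class Client:
    def __init__(self, master, chunkservers):
        self.master = master              # hypothetical master RPC stub
        self.chunkservers = chunkservers  # hypothetical {address: server} map
        self.cache = {}                   # (file name, chunk index) -> (handle, replicas)

    def read(self, file_name, offset, length):
        index, chunk_off = chunk_coordinates(offset)
        key = (file_name, index)
        if key not in self.cache:
            # Steps 2-4: one master RPC, result cached for future reads.
            self.cache[key] = self.master.get_locations(file_name, index)
        handle, replicas = self.cache[key]
        closest = replicas[0]  # stand-in for topology-aware replica selection
        # Steps 5-6: bytes flow directly between client and ChunkServer.
        return self.chunkservers[closest].read(handle, chunk_off, length)
```

Note that the master never sees the second read of the same chunk: the cached `(handle, replicas)` pair is enough to go straight to a ChunkServer.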
Key optimizations
- Batch metadata requests: The client typically requests metadata for multiple chunks in a single RPC. The master also proactively includes information for chunks immediately following the requested ones, anticipating sequential reads.
- Metadata caching: Cached chunk locations eliminate repeated master lookups. The cache uses timeouts to avoid serving stale location data.
- Closest replica selection: The client picks the nearest ChunkServer (by network topology), minimizing read latency.
- No data caching: Neither clients nor ChunkServers cache file data at the application level. The OS buffer cache on ChunkServers handles frequently accessed data. This avoids cache coherence complexity entirely.
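The metadata-caching bullet above can be made concrete with a small TTL cache. This is a sketch under assumptions: the paper does not prescribe a cache structure or timeout value, so the 60-second default and all names here are illustrative.

```python
import time

class ChunkLocationCache:
    """TTL'd map from (file name, chunk index) to replica locations.
    An expired entry forces a fresh master lookup rather than serving
    potentially stale locations."""

    def __init__(self, ttl_seconds=60.0):  # TTL value is an assumption
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (expiry time, locations)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self.entries.get(key)
        if hit and now < hit[0]:
            return hit[1]
        self.entries.pop(key, None)  # expired or missing: caller asks the master
        return None

    def put(self, key, locations, now=None):
        now = time.monotonic() if now is None else now
        self.entries[key] = (now + self.ttl, locations)
```

The timeout trades freshness for master load: a longer TTL means fewer master RPCs but a larger window in which a client can contact a replica that no longer holds the chunk.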
The GFS read path is a textbook example of separating control flow from data flow. The master handles the lightweight "where is my data?" question, then the heavy "give me the bytes" transfer happens directly between client and ChunkServer. This pattern appears in virtually every distributed storage system -- HDFS's NameNode works the same way, and even object stores like S3 separate metadata routing from data transfer. Call out this separation explicitly when designing storage systems in interviews.
The read path has no strong consistency guarantee. A client holding a cached location could read from a replica that missed a recent mutation and is therefore slightly stale. For GFS's append-heavy workloads, this typically means the client sees a premature end-of-chunk rather than corrupted data -- but applications must be written to tolerate it.
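One common way an application tolerates a premature end-of-chunk is to treat a short read as "data not visible here yet," invalidate the cached locations, and retry against a freshly looked-up replica. The sketch below is a hypothetical pattern, not a GFS API; `read_chunk` and `invalidate_cache` are placeholder callables.

```python
import time

def tolerant_read(read_chunk, invalidate_cache, offset, length, attempts=3):
    """Retry a short read: a stale replica may simply not have the
    freshly appended bytes yet. All names here are hypothetical."""
    data = b""
    for i in range(attempts):
        data = read_chunk(offset, length)
        if len(data) == length:
            return data
        invalidate_cache()           # drop cached locations, re-ask the master
        time.sleep(0.05 * (i + 1))   # brief backoff before retrying
    return data  # still short: caller must handle the premature end-of-chunk
```

If the data never appears, the short result is returned as-is -- the burden of interpreting an incomplete record stays with the application, exactly as the paragraph above describes.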