Criticism of GFS
GFS solved Google's storage problem in 2003. But as Google's data and workloads grew by orders of magnitude, the very design choices that made GFS simple started to break. Understanding these failure modes is just as important as understanding the original design -- they reveal the limits of specific architectural tradeoffs.
Problems with the single master
The single master was GFS's most deliberate simplification, and it became its biggest liability at scale:
| Problem | Root cause | Consequence |
|---|---|---|
| CPU bottleneck | All metadata operations route through one machine | As client count grows, the master cannot serve them all -- CPU saturates |
| Memory limit | All metadata must fit in master RAM | Even though the large chunk size keeps metadata compact, total data volume eventually outgrows what a single machine's memory can index |
The separation of control and data flow helped, but it only delays the bottleneck -- it doesn't eliminate it. Every file open, every chunk lookup, every lease grant still hits the single master.
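A back-of-envelope calculation makes the memory ceiling concrete. Using the ~64 bytes of metadata per 64 MB chunk cited below, a sketch (the master RAM size is a hypothetical figure, not from the GFS paper):

```python
CHUNK_SIZE = 64 * 1024**2   # 64 MiB per chunk (the GFS default)
META_PER_CHUNK = 64         # ~64 bytes of master metadata per chunk

def max_addressable_bytes(master_ram_bytes: int) -> int:
    """Upper bound on file data a single master can index, assuming
    all of its RAM is spent on chunk metadata (illustrative only)."""
    max_chunks = master_ram_bytes // META_PER_CHUNK
    return max_chunks * CHUNK_SIZE

ram = 64 * 1024**3  # hypothetical 64 GiB of master RAM
print(max_addressable_bytes(ram) // 1024**5, "PiB")  # → 64 PiB
```

Tens of petabytes sounds generous, but at Google's growth rate that ceiling arrives quickly, and the CPU bottleneck typically bites even sooner.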
The single-master bottleneck is one of the most frequently discussed GFS limitations in interviews. Google's successor system, Colossus, addresses this by distributing metadata across multiple servers. HDFS tackled the same problem with HDFS Federation (multiple NameNodes, each managing a portion of the namespace). When designing a distributed file system in an interview, explain that a single master works at moderate scale but needs partitioning or federation at large scale. Reference the leader and follower pattern and discuss when a single leader stops being sufficient.
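The partitioning idea can be sketched as a thin client-side router that maps each path to one of several metadata servers. The server addresses and the hash-by-top-level-directory scheme below are hypothetical simplifications, not how Colossus or HDFS Federation actually route requests:

```python
import hashlib

# Hypothetical metadata server addresses for the sketch.
MASTERS = ["meta-0:9000", "meta-1:9000", "meta-2:9000"]

def master_for(path: str) -> str:
    """Route a file path to a metadata server by hashing its top-level
    directory, so each server owns a slice of the namespace (a
    simplified, static, Federation-style partitioning)."""
    top = path.strip("/").split("/", 1)[0]
    digest = int(hashlib.sha256(top.encode()).hexdigest(), 16)
    return MASTERS[digest % len(MASTERS)]

# Paths under the same top-level directory land on the same server,
# which keeps directory-local operations on a single machine.
assert master_for("/logs/2003/01.dat") == master_for("/logs/2004/02.dat")
```

Static partitioning like this trades the single bottleneck for a rebalancing problem: if one namespace slice grows hot, it must be split or moved, which is exactly the operational complexity the single master avoided.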
Problems with the large chunk size
The 64MB chunk size optimized for large sequential I/O but created problems for small files:
| Problem | Why it happens | Workaround |
|---|---|---|
| Hotspots | A small file occupies a single chunk on one set of ChunkServers, so heavy read traffic concentrates on those few servers | Store small, popular files with a higher replication factor |
| Metadata overhead for tiny files | A 1KB file still consumes a full chunk entry in the master's metadata | Largely tolerable -- each chunk entry is only ~64 bytes |
| Startup bursts | Even with extra replicas, hundreds of clients starting at once can overwhelm the ChunkServers holding a hot chunk | Applications stagger their start times (e.g., with small random delays) to spread the load |
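The staggered-start mitigation is a client-side pattern, not a GFS feature. A minimal sketch (the function and parameter names are illustrative):

```python
import random
import time

def staggered_start(worker, max_jitter_s: float = 30.0) -> float:
    """Sleep for a random delay before running `worker`, so a fleet of
    clients launched at the same moment doesn't hit the replicas of a
    hot chunk simultaneously. Returns the delay that was applied."""
    delay = random.uniform(0.0, max_jitter_s)
    time.sleep(delay)
    worker()
    return delay
```

The same jitter idea shows up throughout distributed systems, from retry backoff to cron-job scheduling, whenever synchronized clients would otherwise create a thundering herd.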
The hotspot problem is not unique to GFS. Any system that maps small objects to large fixed-size storage units risks uneven load. HDFS inherited this problem and recommends storing small files in archive formats (HAR files, SequenceFiles) rather than as individual entries. In your designs, consider whether your chunk/block size matches your expected file size distribution.
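One way to check that match is to compare metadata cost across file-size distributions. The workload mix below is invented for illustration, reusing the ~64-byte-per-chunk figure from the table above:

```python
import math

CHUNK = 64 * 1024**2   # 64 MiB chunk size
META = 64              # ~64 bytes of master metadata per chunk

def chunks_needed(file_size: int, chunk_size: int = CHUNK) -> int:
    """A file always consumes at least one chunk entry."""
    return max(1, math.ceil(file_size / chunk_size))

# Hypothetical workloads: many tiny files vs. a few huge ones.
small_files = [1024] * 1_000_000        # one million 1 KiB files (~1 GiB)
large_files = [10 * 1024**3] * 100      # one hundred 10 GiB files (1 TiB)

meta_small = sum(chunks_needed(s) for s in small_files) * META
meta_large = sum(chunks_needed(s) for s in large_files) * META
# ~64 MB of metadata to index ~1 GiB of small files, versus
# ~1 MB of metadata to index a full TiB of large files.
print(meta_small, meta_large)
```

The small-file workload pays orders of magnitude more metadata per byte stored, which is exactly why HDFS pushes small files into archive formats rather than individual entries.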
The broader lesson
GFS's limitations are not design flaws -- they are the natural consequences of optimizing for a specific workload at a specific scale. When that workload and scale changed, the design needed to evolve. This is the lifecycle of every distributed system: design for today's constraints, measure where the design breaks, and iterate.
Google's response was not to patch GFS, but to build Colossus -- a ground-up redesign that distributed metadata, supported smaller files, and scaled to exabytes. The lesson for interviews: every design has a shelf life. The mark of a good engineer is knowing when the current design will break, not just how it works today.