Data Integrity & Caching
A block stored on disk today can silently become corrupted tomorrow -- bit rot, faulty network hardware, and software bugs all introduce errors. HDFS cannot blindly trust that the bytes a DataNode returns are the bytes a client originally wrote. It uses checksums to catch corruption and a centralized caching layer to accelerate hot data.
Data integrity
When a client writes a file to HDFS, it computes a checksum for each block and stores these checksums in a separate hidden file within the same HDFS namespace. On every read, the client verifies the received data against the stored checksum. If the checksum does not match, the client retrieves that block from a different replica.
| Event | Action |
|---|---|
| Client writes a block | Checksum computed and stored in a separate hidden file in the same namespace |
| Client reads a block | Data verified against stored checksum |
| Checksum mismatch on read | Client fetches block from another replica |
| Corruption detected | NameNode notified; corrupt replica marked; new good replica created from a healthy copy |
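The write and read paths above can be sketched in a few lines. This is a minimal illustration only: the function names, the replica-dictionary layout, and the use of a single CRC32 per block (standing in for HDFS's per-chunk checksums) are assumptions made for the sketch, not the actual HDFS API.

```python
import zlib

def write_block(storage, checksums, block_id, data):
    """Store a block and record its checksum separately,
    mirroring HDFS's hidden checksum file."""
    storage[block_id] = data
    checksums[block_id] = zlib.crc32(data)

def read_block(replicas, checksums, block_id):
    """Try each replica in turn; return the first whose data matches
    the stored checksum, collecting corrupt replicas along the way."""
    expected = checksums[block_id]
    corrupt = []
    for node, storage in replicas.items():
        data = storage.get(block_id)
        if data is not None and zlib.crc32(data) == expected:
            return data, corrupt
        corrupt.append(node)  # in HDFS, reported to the NameNode
    raise IOError(f"all replicas of block {block_id} are corrupt")
```

A client that hits a bad replica simply falls through to the next one, and the list of corrupt replicas is what would be reported to the NameNode so re-replication can begin.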
Block scanner
Corruption is not always caught at read time -- some blocks may sit unread for weeks. To catch these cases, each DataNode runs a block scanner process that periodically verifies every block against its checksum.
Additionally, when a client reads a complete block and checksum verification succeeds, it informs the DataNode, which treats this as a successful verification of that replica.
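Both behaviors -- periodic background scanning and treating a successful client read as a verification -- can be modeled together. This is a simplified sketch, not the real DataNode scanner, which also throttles its disk I/O and persists scan state; all names here are invented for illustration.

```python
import time
import zlib

def scan_blocks(storage, checksums, last_verified, scan_interval_s):
    """One pass of a block scanner: re-verify any replica whose last
    successful verification is older than scan_interval_s."""
    now = time.time()
    corrupt = []
    for block_id, data in storage.items():
        if now - last_verified.get(block_id, 0) < scan_interval_s:
            continue  # recently verified, e.g. by a client read
        if zlib.crc32(data) == checksums[block_id]:
            last_verified[block_id] = now
        else:
            corrupt.append(block_id)  # would be reported to the NameNode
    return corrupt

def record_client_verification(last_verified, block_id):
    """A successful client-side checksum match counts as a verification,
    postponing the scanner's next check of this replica."""
    last_verified[block_id] = time.time()
```

Counting client reads as verifications lets the scanner skip hot blocks and spend its limited I/O budget on the cold blocks nobody is reading.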
Checksums are a recurring theme across distributed systems. GFS uses checksums at the chunkserver level. HDFS uses them at both the client and DataNode level. In an interview, mention that passive verification (checking on reads) is not enough -- you also need active background scanning because some data may never be read before it's needed for recovery.
Checksums detect corruption but do not prevent it. If all three replicas of a block become corrupted before the block scanner or a client read catches it, the data is permanently lost. This is rare but possible, which is why HDFS also supports erasure coding as an alternative to pure replication.
Caching
By default, every block read goes to disk. For frequently accessed files, this disk I/O becomes a bottleneck. HDFS provides a Centralized Cache Management scheme that lets users explicitly specify paths to cache in DataNode memory.
How it works
- A client (or administrator) tells the NameNode which files or directories to cache.
- The NameNode identifies the DataNodes holding those blocks and instructs them to load the blocks into off-heap memory caches.
- Cached block locations are tracked by the NameNode and exposed to schedulers.
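The three steps above can be modeled as a toy NameNode-side cache manager. The class and method names are invented for this sketch and are not the HDFS API; in real HDFS the caching instructions are piggybacked on heartbeat replies to DataNodes.

```python
class CacheManager:
    """Toy model of centralized cache management: the NameNode's
    view of which DataNodes have been told to pin which paths."""

    def __init__(self, block_locations):
        # path -> DataNodes holding that path's blocks
        self.block_locations = block_locations
        self.cached = {}  # path -> DataNodes instructed to cache it

    def add_directive(self, path):
        """Steps 1-2: record the directive and instruct the DataNodes
        holding the blocks to load them into off-heap memory."""
        nodes = self.block_locations[path]
        self.cached[path] = nodes
        return nodes

    def cached_locations(self, path):
        """Step 3: expose cached locations to schedulers so tasks
        can be co-located with in-memory data."""
        return self.cached.get(path, [])
```

The key point the sketch captures is that caching decisions live at the NameNode, which is exactly what gives schedulers a single place to ask "where is this data already in memory?"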
Why centralized cache management matters
| Benefit | Explanation |
|---|---|
| No eviction of hot data | Explicitly pinned blocks stay in memory even when the working set exceeds DataNode RAM (which it usually does for HDFS workloads) |
| Locality-aware scheduling | MapReduce schedulers can query the NameNode for cached block locations and co-locate tasks with cached data |
| Zero-copy reads | When a block is cached and already checksum-verified by the DataNode, clients use a zero-copy read API with essentially zero overhead |
| Efficient memory utilization | Instead of the OS buffer cache eventually pulling all n replicas into memory across repeated reads, centralized caching pins only m of n replicas -- saving (n - m) replicas' worth of memory across the cluster |
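The arithmetic in the last row is easy to make concrete (the numbers below are illustrative, not measured):

```python
def cache_savings_gb(block_gb, n_replicas, m_cached):
    """Aggregate memory saved per block by pinning m of n replicas
    instead of letting the OS buffer cache hold all n copies."""
    return block_gb * (n_replicas - m_cached)

# A 1 GB block with 3 replicas, pinned on a single node:
# the cluster holds 1 in-memory copy instead of up to 3.
```

For a heavily read block, that is 2 GB of cluster memory freed for other data -- and the saving scales linearly with the number of hot blocks.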
Centralized cache management is a good example of a system-wide optimization that a distributed OS buffer cache cannot provide. The NameNode has a global view of the cluster, so it can decide which replicas to cache and avoid redundant caching across nodes. This is a strong talking point when discussing the advantages of a centralized metadata server.