Data Integrity & Caching
A block stored on disk today can silently become corrupted tomorrow -- bit rot, faulty network hardware, and software bugs all introduce errors. HDFS cannot blindly trust that the bytes a DataNode returns are the bytes a client originally wrote. It uses checksums to catch corruption and a centralized caching layer to accelerate hot data.
Data integrity
When a client writes a file to HDFS, it computes a checksum for each block and stores these checksums in a separate hidden file within the same HDFS namespace. On every read, the client verifies the received data against the stored checksum. If the checksum does not match, the client retrieves that block from a different replica.
| Event | Action |
|---|---|
| Client writes a block | Checksum computed and stored in a separate hidden file in the same namespace |
| Client reads a block | Data verified against stored checksum |
| Checksum mismatch on read | Client fetches block from another replica |
| Corruption detected | NameNode notified; corrupt replica marked; new good replica created from a healthy copy |
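The write and read paths above can be sketched in a few lines. This is a minimal illustration only: the function names, the replica-dictionary layout, and the use of a single CRC32 per block (standing in for HDFS's per-chunk checksums) are assumptions made for the sketch, not the actual HDFS API.

```python
import zlib

def write_block(storage, checksums, block_id, data):
    """Store a block and record its checksum separately,
    mirroring HDFS's hidden checksum file."""
    storage[block_id] = data
    checksums[block_id] = zlib.crc32(data)

def read_block(replicas, checksums, block_id):
    """Try each replica in turn; return the first whose data matches
    the stored checksum, collecting corrupt replicas along the way."""
    expected = checksums[block_id]
    corrupt = []
    for node, storage in replicas.items():
        data = storage.get(block_id)
        if data is not None and zlib.crc32(data) == expected:
            return data, corrupt
        corrupt.append(node)  # in HDFS, reported to the NameNode
    raise IOError(f"all replicas of block {block_id} are corrupt")
```

A client that hits a bad replica simply falls through to the next one, and the list of corrupt replicas is what would be reported to the NameNode so re-replication can begin.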
Block scanner
Corruption is not always caught at read time -- some blocks may sit unread for weeks. To catch these cases, each DataNode runs a block scanner process that periodically verifies every block against its checksum.
Additionally, when a client reads a complete block and checksum verification succeeds, it informs the DataNode, which treats this as a successful verification of that replica.
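Both behaviors -- periodic background scanning and treating a successful client read as a verification -- can be modeled together. This is a simplified sketch, not the real DataNode scanner, which also throttles its disk I/O and persists scan state; all names here are invented for illustration.

```python
import time
import zlib

def scan_blocks(storage, checksums, last_verified, scan_interval_s):
    """One pass of a block scanner: re-verify any replica whose last
    successful verification is older than scan_interval_s."""
    now = time.time()
    corrupt = []
    for block_id, data in storage.items():
        if now - last_verified.get(block_id, 0) < scan_interval_s:
            continue  # recently verified, e.g. by a client read
        if zlib.crc32(data) == checksums[block_id]:
            last_verified[block_id] = now
        else:
            corrupt.append(block_id)  # would be reported to the NameNode
    return corrupt

def record_client_verification(last_verified, block_id):
    """A successful client-side checksum match counts as a verification,
    postponing the scanner's next check of this replica."""
    last_verified[block_id] = time.time()
```

Counting client reads as verifications lets the scanner skip hot blocks and spend its limited I/O budget on the cold blocks nobody is reading.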
Checksums are a recurring theme across distributed systems. GFS uses checksums at the chunkserver level. HDFS uses them at both the client and DataNode level. In an interview, mention that passive verification (checking on reads) is not enough -- you also need active background scanning because some data may never be read before it's needed for recovery.
Checksums detect corruption but do not prevent it. If all three replicas of a block become corrupted before the block scanner or a client read catches it, the data is permanently lost. This is rare but possible, which is why HDFS also supports erasure coding as an alternative to pure replication.
Caching
By default, every block read goes to disk. For frequently accessed files, this disk I/O becomes a bottleneck. HDFS provides a Centralized Cache Management scheme that lets users explicitly specify paths to cache in DataNode memory.
How it works
- A client (or administrator) tells the NameNode which files or directories to cache.
- The NameNode identifies the DataNodes holding those blocks and instructs them to load the blocks into off-heap memory caches.
- Cached block locations are tracked by the NameNode and exposed to schedulers.
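The three steps above can be modeled as a toy NameNode-side cache manager. The class and method names are invented for this sketch and are not the HDFS API; in real HDFS the caching instructions are piggybacked on heartbeat replies to DataNodes.

```python
class CacheManager:
    """Toy model of centralized cache management: the NameNode's
    view of which DataNodes have been told to pin which paths."""

    def __init__(self, block_locations):
        # path -> DataNodes holding that path's blocks
        self.block_locations = block_locations
        self.cached = {}  # path -> DataNodes instructed to cache it

    def add_directive(self, path):
        """Steps 1-2: record the directive and instruct the DataNodes
        holding the blocks to load them into off-heap memory."""
        nodes = self.block_locations[path]
        self.cached[path] = nodes
        return nodes

    def cached_locations(self, path):
        """Step 3: expose cached locations to schedulers so tasks
        can be co-located with in-memory data."""
        return self.cached.get(path, [])
```

The key point the sketch captures is that caching decisions live at the NameNode, which is exactly what gives schedulers a single place to ask "where is this data already in memory?"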
Why centralized cache management matters
| Benefit | Explanation |
|---|---|
| No eviction of hot data | Explicitly pinned blocks stay in memory even when the working set exceeds DataNode RAM (which it usually does for HDFS workloads) |
| Locality-aware scheduling | MapReduce schedulers can query the NameNode for cached block locations and co-locate tasks with cached data |
| Zero-copy reads | When a block is cached and already checksum-verified by the DataNode, clients use a zero-copy read API with essentially zero overhead |
| Efficient memory utilization | Instead of the OS buffer cache eventually pulling all n replicas into memory across repeated reads, centralized caching pins only m of n replicas -- saving (n - m) replicas' worth of memory across the cluster |
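The arithmetic in the last row is easy to make concrete (the numbers below are illustrative, not measured):

```python
def cache_savings_gb(block_gb, n_replicas, m_cached):
    """Aggregate memory saved per block by pinning m of n replicas
    instead of letting the OS buffer cache hold all n copies."""
    return block_gb * (n_replicas - m_cached)

# A 1 GB block with 3 replicas, pinned on a single node:
# the cluster holds 1 in-memory copy instead of up to 3.
```

For a heavily read block, that is 2 GB of cluster memory freed for other data -- and the saving scales linearly with the number of hot blocks.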
Centralized cache management is a good example of a system-wide optimization that a distributed OS buffer cache cannot provide. The NameNode has a global view of the cluster, so it can decide which replicas to cache and avoid redundant caching across nodes. This is a strong talking point when discussing the advantages of a centralized metadata server.