Data Integrity & Caching

A block stored on disk today might silently corrupt tomorrow -- bit rot, faulty network hardware, and software bugs all introduce errors. HDFS cannot blindly trust that the bytes a DataNode returns are the bytes a client originally wrote. It uses checksums to catch corruption and a centralized caching layer to accelerate hot data.

Think first
HDFS stores blocks on multiple DataNodes, but replication alone does not detect corruption -- it only provides redundancy. How would you verify that the bytes a DataNode returns are the bytes originally written, without comparing against other replicas?

Data integrity

When a client writes a file to HDFS, it computes a checksum for each block and stores these checksums in a separate hidden file within the same HDFS namespace. On every read, the client verifies the received data against the stored checksum. If the checksum does not match, the client retrieves that block from a different replica.
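The write-then-verify flow can be sketched in a few lines. This is an illustrative model, not HDFS's actual implementation: real HDFS checksums fixed-size chunks within a block (CRC-based), and the chunk size and function names below are assumptions for the sketch.

```python
import zlib

CHUNK_SIZE = 512  # illustrative chunk size; HDFS checksums chunks, not whole blocks

def compute_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Write path: compute one CRC32 per chunk of the block."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def verify(data: bytes, stored: list[int], chunk_size: int = CHUNK_SIZE) -> bool:
    """Read path: re-compute checksums and compare against the stored ones."""
    return compute_checksums(data, chunk_size) == stored

block = b"some block contents " * 100
stored = compute_checksums(block)     # persisted separately from the data
assert verify(block, stored)          # clean read passes verification

corrupted = b"X" + block[1:]          # a single flipped byte
assert not verify(corrupted, stored)  # mismatch detected on read
```

A failed `verify` is what triggers the fallback to another replica described above.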

| Event | Action |
| --- | --- |
| Client writes a block | Checksum computed and stored alongside the block |
| Client reads a block | Data verified against the stored checksum |
| Checksum mismatch on read | Client fetches the block from another replica |
| Corruption detected | NameNode notified; corrupt replica marked; new good replica created from a healthy copy |
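The read-with-fallback behavior in the table can be modeled as trying replicas in order until one verifies. The function and parameter names here are illustrative; `report` stands in for notifying the NameNode so it can mark the bad replica and re-replicate.

```python
import zlib

def crc_chunks(data: bytes, chunk: int = 512) -> list[int]:
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def read_block(replicas: dict[str, bytes], stored: list[int], report) -> bytes:
    """Return data from the first replica whose checksums verify."""
    for node, data in replicas.items():
        if crc_chunks(data) == stored:
            return data
        report(node)  # stands in for telling the NameNode this replica is corrupt
    raise IOError("all replicas of this block are corrupt")

good = b"block contents " * 200
stored = crc_chunks(good)
bad_nodes = []
data = read_block({"dn1": b"X" + good[1:], "dn2": good}, stored, bad_nodes.append)
assert data == good and bad_nodes == ["dn1"]  # dn1 reported, dn2 served the read
```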

Block scanner

Corruption is not always caught at read time -- some blocks may sit unread for weeks. To catch these cases, each DataNode runs a block scanner process that periodically verifies every block against its checksum.

Additionally, when a client reads a complete block and checksum verification succeeds, it informs the DataNode, which treats this as a successful verification of that replica.
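One pass of such a scanner can be sketched as follows. All names are illustrative; the real DataNode scanner also rate-limits its disk I/O and spreads a full pass over a configurable period so it does not compete with client traffic.

```python
import zlib

def scan_once(blocks: dict[str, bytes], checksums: dict[str, int], on_corrupt) -> None:
    """One scanner pass: re-verify every local block against its stored checksum."""
    for block_id, data in blocks.items():
        if zlib.crc32(data) != checksums[block_id]:
            on_corrupt(block_id)  # stands in for reporting the replica to the NameNode

blocks = {"blk_1": b"a" * 1024, "blk_2": b"b" * 1024}
checksums = {bid: zlib.crc32(data) for bid, data in blocks.items()}
blocks["blk_2"] = b"b" * 1023 + b"X"  # simulate bit rot after checksumming

corrupt = []
scan_once(blocks, checksums, corrupt.append)
assert corrupt == ["blk_2"]  # the silently corrupted block is caught without any read
```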

Interview angle

Checksums are a recurring theme across distributed systems. GFS uses checksums at the chunkserver level. HDFS uses them at both the client and DataNode level. In an interview, mention that passive verification (checking on reads) is not enough -- you also need active background scanning because some data may never be read before it's needed for recovery.

warning

Checksums detect corruption but do not prevent it. If all three replicas of a block become corrupted before the block scanner or a client read catches it, the data is permanently lost. This is rare but possible, which is why HDFS also supports erasure coding as an alternative to pure replication.

Think first
Most distributed file systems rely on the OS buffer cache for caching, but the OS cache has no global view. What advantage would a centralized cache managed by the NameNode provide over independent OS caches on each DataNode?

Caching

By default, every block read goes to disk. For frequently accessed files, this disk I/O becomes a bottleneck. HDFS provides a Centralized Cache Management scheme that lets users explicitly specify paths to cache in DataNode memory.

How it works

  1. A client (or administrator) tells the NameNode which files or directories to cache.
  2. The NameNode identifies the DataNodes holding those blocks and instructs them to load the blocks into off-heap memory caches.
  3. Cached block locations are tracked by the NameNode and exposed to schedulers.
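In practice, step 1 is done with the `hdfs cacheadmin` command-line tool. A minimal sketch, assuming a cluster with caching enabled (the pool name and path are placeholders):

```shell
# Create a cache pool, then pin a hot directory into DataNode memory.
hdfs cacheadmin -addPool hot-data
hdfs cacheadmin -addDirective -path /datasets/hot -pool hot-data -replication 2

# Inspect which directives exist and how much is actually cached.
hdfs cacheadmin -listDirectives -stats
```

The `-replication` flag controls how many of a block's replicas get cached, which is the `m` of `n` trade-off discussed below.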

Why centralized cache management matters

| Benefit | Explanation |
| --- | --- |
| No eviction of hot data | Explicitly pinned blocks stay in memory even when the working set exceeds DataNode RAM (which it usually does for HDFS workloads) |
| Locality-aware scheduling | MapReduce schedulers can query the NameNode for cached block locations and co-locate tasks with cached data |
| Zero-copy reads | When a block is cached and already checksum-verified by the DataNode, clients can use a zero-copy read API with essentially no overhead |
| Efficient memory utilization | Instead of the OS buffer cache on every replica's node independently pulling all n copies into memory on repeated reads, centralized caching pins only m of n replicas, saving the memory for the other n - m copies across the cluster |
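The memory-utilization argument is easy to quantify. A worked example with assumed numbers (a 128 MB block, replication factor 3, one pinned replica):

```python
BLOCK_MB = 128  # a common HDFS block size (illustrative)
n, m = 3, 1     # replication factor vs. replicas pinned by the NameNode

os_cache_mb = n * BLOCK_MB  # each replica's node ends up caching its own copy
pinned_mb = m * BLOCK_MB    # centralized caching pins only m replicas
saved_mb = os_cache_mb - pinned_mb
print(saved_mb)             # memory saved per hot block, cluster-wide: 256
```

Across thousands of hot blocks, that (n - m) factor is substantial cluster memory reclaimed for other work.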
Interview angle

Centralized cache management is a good example of a system-wide optimization that a distributed OS buffer cache cannot provide. The NameNode has a global view of the cluster, so it can decide which replicas to cache and avoid redundant caching across nodes. This is a strong talking point when discussing the advantages of a centralized metadata server.

Quiz
What would happen if HDFS relied solely on OS buffer caches on each DataNode instead of providing centralized cache management?