14 Checksum

You read a file from a remote server. It arrives intact -- or does it? Bits can flip in memory, on disk, or in transit. A single corrupted byte in a database record can cascade into application-level bugs that are incredibly hard to diagnose. How do you know the data you received is the data that was sent?

Think first
Data corruption can happen silently -- bit rot on disk, flipped bits in memory, corrupted packets in transit. The system doesn't crash; it just returns wrong data. How would you detect this before it causes application-level damage?

Background

In a distributed system, data moves constantly -- between disks, across networks, through memory. At every step, corruption is possible:

  • Disk degradation -- sectors go bad silently (this is called "bit rot")
  • Network errors -- packets can arrive corrupted despite TCP's checksum, a weak 16-bit sum that lets some errors slip through
  • Memory faults -- cosmic rays and hardware defects can flip individual bits
  • Software bugs -- serialization errors, truncated writes, race conditions

The insidious part: most of these corruptions are silent. The system doesn't crash -- it just returns wrong data. Without explicit verification, you might never know.

Definition

Compute a checksum (a fixed-size fingerprint) of the data and store it alongside the data. When the data is read, recompute the checksum and compare. If they don't match, the data is corrupt.

How it works

  1. On write: Compute a hash of the data using a function like MD5, SHA-1, SHA-256, or CRC32. Store the resulting checksum alongside the data.
  2. On read: Retrieve both the data and the stored checksum. Recompute the checksum from the data. If the computed checksum matches the stored one, the data is intact. If not, the data is corrupt.
  3. On corruption detection: Retrieve the data from another replica, log the corruption event, and potentially mark the corrupted copy for repair.
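The write/read cycle above can be sketched in a few lines of Python. This is a minimal illustration using SHA-256 from the standard library; the function names are illustrative, not part of any real storage API.

```python
import hashlib

def write_with_checksum(data: bytes) -> tuple[bytes, str]:
    """On write: compute a checksum and store it alongside the data."""
    checksum = hashlib.sha256(data).hexdigest()
    return data, checksum

def read_with_verification(data: bytes, stored_checksum: str) -> bytes:
    """On read: recompute the checksum and compare with the stored one."""
    if hashlib.sha256(data).hexdigest() != stored_checksum:
        raise IOError("checksum mismatch: data is corrupt")
    return data

data, checksum = write_with_checksum(b"hello, world")
assert read_with_verification(data, checksum) == b"hello, world"

# A single flipped bit is detected on the next read:
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
try:
    read_with_verification(corrupted, checksum)
except IOError as e:
    print(e)  # checksum mismatch: data is corrupt
```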
| Hash function | Speed | Collision resistance | Typical use |
| --- | --- | --- | --- |
| CRC32 | Very fast | Low | Network packets, quick integrity checks |
| MD5 | Fast | Medium (broken for cryptography, fine for integrity) | Data distribution (Dynamo key hashing) |
| SHA-256 | Moderate | High | When stronger guarantees are needed |

Checksums vs. cryptographic hashes

For data integrity, you don't need cryptographic strength -- you only need to catch accidental corruption, not resist adversarial attacks. CRC32 is often sufficient and much faster. Use stronger hashes (SHA-256) when you need to detect intentional tampering.
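The trade-off is easy to see side by side. Both functions below come from Python's standard library; the payload is just example bytes.

```python
import zlib
import hashlib

payload = b"some record bytes"

# CRC32: fast, 32-bit result, good at catching accidental corruption.
crc = zlib.crc32(payload)

# SHA-256: slower, 256-bit result, resists deliberate tampering.
digest = hashlib.sha256(payload).hexdigest()

print(f"CRC32:   {crc:#010x}")
print(f"SHA-256: {digest}")
```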

Where checksums live in a system

Checksums operate at multiple layers, and distributed systems often stack them:

  • Per-block/chunk checksums: GFS computes a checksum for each 64KB block within a chunk; HDFS checksums every 512 bytes. This lets them pinpoint which part of a large file is corrupt without re-reading everything.
  • Per-file checksums: The entire file's content is hashed and stored as metadata.
  • Per-message checksums: Kafka includes a CRC in each message record so consumers can verify data integrity end-to-end.
  • Per-SSTable checksums: BigTable and Cassandra compute checksums for SSTable blocks.
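The advantage of per-block checksums is localization: a mismatch names the bad block instead of just the bad file. A sketch, using CRC32 and a 64KB block size to mirror GFS (the helper names are illustrative):

```python
import zlib

BLOCK_SIZE = 64 * 1024

def block_checksums(data: bytes) -> list[int]:
    """One CRC32 per fixed-size block."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def find_corrupt_blocks(data: bytes, stored: list[int]) -> list[int]:
    """Indices of blocks whose recomputed CRC disagrees with the stored one."""
    return [i for i, crc in enumerate(block_checksums(data))
            if crc != stored[i]]

original = bytes(200 * 1024)        # 200 KB of zeros -> 4 blocks
stored = block_checksums(original)

damaged = bytearray(original)
damaged[70 * 1024] ^= 0xFF          # flip a byte inside the second block
print(find_corrupt_blocks(bytes(damaged), stored))  # [1]
```

Only block 1 needs to be re-fetched from a replica; the other 3 blocks are verified intact without any extra I/O.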

Examples

HDFS

HDFS computes checksums for every 512 bytes of data (configurable) and stores them in a separate hidden file. When a client reads data, it verifies the checksums. If corruption is detected, the client reports it to the NameNode and reads from another replica. The NameNode then schedules re-replication of the corrupted block.

GFS

ChunkServers verify checksums during reads. Each 64KB block within a 64MB chunk has a 32-bit checksum. Checksums are kept in memory and persisted in logs. For idle chunks, the master runs background "patrol reads" to detect corruption proactively.

Kafka

Each message in Kafka includes a CRC32 checksum. Brokers verify checksums on receipt. Consumers verify again on read. This end-to-end checking catches corruption at any point in the pipeline.
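The end-to-end idea can be sketched as a CRC-framed record that each hop re-verifies. The framing below (4-byte big-endian CRC prefix) is illustrative, not Kafka's actual wire format:

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Producer side: prefix the payload with its CRC32 (4 bytes, big-endian)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def unframe(record: bytes) -> bytes:
    """Broker/consumer side: verify the CRC, raising on mismatch."""
    crc, payload = int.from_bytes(record[:4], "big"), record[4:]
    if zlib.crc32(payload) != crc:
        raise IOError("CRC mismatch: message corrupt")
    return payload

record = frame(b"order:42")            # producer attaches the CRC
assert unframe(record) == b"order:42"  # broker verifies, consumer verifies again
```

Because every hop re-runs the same check, a bit flipped on the broker's disk is caught at the consumer even if it slipped past the broker's verification on receipt.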

Chubby

Chubby stores checksums with each file to verify integrity of the small metadata objects it manages.

Interview angle

Checksums are the answer to any interview question about "How do you ensure data integrity?" The key insight to show: checksums alone detect corruption, but you need replication to recover from it. The two patterns work together -- checksums tell you which copy is bad, replication gives you a good copy to fall back on.
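That interplay can be sketched directly: the checksum decides which replica's copy is good, and replication supplies the fallback. Replicas here are plain byte strings standing in for remote reads; the function name is illustrative.

```python
import hashlib

def read_verified(replicas: list[bytes], stored_checksum: str) -> bytes:
    """Return the first replica whose content matches the stored checksum."""
    for data in replicas:
        if hashlib.sha256(data).hexdigest() == stored_checksum:
            return data
    raise IOError("all replicas corrupt")

good = b"record-v1"
checksum = hashlib.sha256(good).hexdigest()
replicas = [b"recXrd-v1", good, good]   # first replica holds a corrupt copy
assert read_verified(replicas, checksum) == good
```

A real system would also report the bad replica (as HDFS clients report to the NameNode) so it can be re-replicated from a verified copy.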

Quiz
A distributed storage system uses per-file checksums but not per-block checksums. A file is 1 GB. The per-file checksum fails. What problem does this create for repair?