14 Checksum
You read a file from a remote server. It arrives intact -- or does it? Bits can flip in memory, on disk, or in transit. A single corrupted byte in a database record can cascade into application-level bugs that are incredibly hard to diagnose. How do you know the data you received is the data that was sent?
Background
In a distributed system, data moves constantly -- between disks, across networks, through memory. At every step, corruption is possible:
- Disk degradation -- sectors go bad silently (this is called "bit rot")
- Network errors -- packets can arrive corrupted despite TCP's checksum, which is a weak 16-bit check that can miss certain multi-bit errors
- Memory faults -- cosmic rays and hardware defects can flip individual bits
- Software bugs -- serialization errors, truncated writes, race conditions
The insidious part: most of these corruptions are silent. The system doesn't crash -- it just returns wrong data. Without explicit verification, you might never know.
Definition
Compute a checksum (a fixed-size fingerprint) of the data and store it alongside the data. When the data is read, recompute the checksum and compare. If they don't match, the data is corrupt.
How it works
- On write: Compute a hash of the data using a function like MD5, SHA-1, SHA-256, or CRC32. Store the resulting checksum alongside the data.
- On read: Retrieve both the data and the stored checksum. Recompute the checksum from the data. If the computed checksum matches the stored one, the data is intact. If not, the data is corrupt.
- On corruption detection: Retrieve the data from another replica, log the corruption event, and potentially mark the corrupted copy for repair.
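The write/read cycle above can be sketched in a few lines. This is a minimal illustration using SHA-256 from Python's standard library; the function names (`write_record`, `read_record`) are hypothetical, not from any particular system:

```python
import hashlib

def write_record(data: bytes) -> tuple[bytes, str]:
    """On write: store the data together with its checksum (SHA-256 hex digest)."""
    return data, hashlib.sha256(data).hexdigest()

def read_record(data: bytes, stored_checksum: str) -> bytes:
    """On read: recompute the checksum and compare against the stored one."""
    if hashlib.sha256(data).hexdigest() != stored_checksum:
        raise IOError("corruption detected: checksum mismatch")
    return data

data, checksum = write_record(b"user:42|balance:100")
assert read_record(data, checksum) == data  # intact data passes verification

# Simulate a single flipped bit: verification now fails.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
try:
    read_record(corrupted, checksum)
except IOError as e:
    print(e)  # corruption detected: checksum mismatch
```

In a real system, the `except` branch is where you would fall back to another replica and report the corruption.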
| Hash function | Speed | Collision resistance | Typical use |
|---|---|---|---|
| CRC32 | Very fast | Low | Network packets, quick integrity checks |
| MD5 | Fast | Medium (broken for cryptography, fine for integrity) | Data distribution (Dynamo key hashing) |
| SHA-256 | Moderate | High | When stronger guarantees are needed |
For data integrity, you don't need cryptographic strength -- you need to detect accidental corruption, not withstand adversarial attacks. CRC32 is often sufficient and much faster. Use a cryptographic hash (SHA-256) when you also need to detect intentional tampering.
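Both kinds of checksum are one call away in Python's standard library. A quick sketch of computing each over the same payload (the payload itself is arbitrary example data):

```python
import hashlib
import zlib

payload = b"x" * (1 << 20)  # 1 MiB of example data

# CRC32: a 32-bit non-cryptographic check; cheap, catches accidental corruption.
crc = zlib.crc32(payload)

# SHA-256: a 256-bit cryptographic hash; detects tampering as well as corruption.
digest = hashlib.sha256(payload).hexdigest()

print(f"CRC32:   {crc:08x}")     # 8 hex digits
print(f"SHA-256: {digest}")      # 64 hex digits
```

The size difference matters for storage overhead too: 4 bytes per CRC32 versus 32 bytes per SHA-256 digest adds up when you keep a checksum per block.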
Where checksums live in a system
Checksums operate at multiple layers, and distributed systems often stack them:
- Per-block/chunk checksums: GFS computes a checksum for each 64KB block within a chunk; HDFS does the same per 512 bytes by default. This lets them pinpoint which part of a large file is corrupt without re-reading everything.
- Per-file checksums: The entire file's content is hashed and stored as metadata.
- Per-message checksums: Kafka includes a CRC in each message record so consumers can verify data integrity end-to-end.
- Per-SSTable checksums: BigTable and Cassandra compute checksums for SSTable blocks.
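The per-block approach is worth sketching, because it shows why systems bother with fine-grained checksums: a mismatch localizes the damage to one block instead of condemning the whole file. A minimal sketch using CRC32 over 64KB blocks (the block size matches GFS; the function names are illustrative):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in GFS

def block_checksums(data: bytes) -> list[int]:
    """Compute one CRC32 per fixed-size block of the data."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def find_corrupt_blocks(data: bytes, stored: list[int]) -> list[int]:
    """Recompute per-block checksums and return the indices that no longer match."""
    return [i for i, (fresh, old) in enumerate(zip(block_checksums(data), stored))
            if fresh != old]

original = bytes(range(256)) * 1024   # 256 KB -> 4 blocks
stored = block_checksums(original)

# Flip one byte in the third block; only that block is reported corrupt.
damaged = bytearray(original)
damaged[2 * BLOCK_SIZE + 7] ^= 0xFF
print(find_corrupt_blocks(bytes(damaged), stored))  # [2]
```

With this, a repair only needs to re-fetch block 2 from a healthy replica, not the whole file.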
Examples
HDFS
HDFS computes checksums for every 512 bytes of data (configurable) and stores them in a separate hidden file. When a client reads data, it verifies the checksums. If corruption is detected, the client reports it to the NameNode and reads from another replica. The NameNode then schedules re-replication of the corrupted block.
GFS
ChunkServers verify checksums during reads. Each 64KB block within a 64MB chunk has a 32-bit checksum. Checksums are kept in memory and persisted in a log, separate from user data. During idle periods, chunkservers scan and verify inactive chunks to detect corruption proactively, before a corrupted replica silently becomes the only surviving copy.
Kafka
Each message in Kafka includes a CRC32 checksum. Brokers verify checksums on receipt. Consumers verify again on read. This end-to-end checking catches corruption at any point in the pipeline.
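The idea of a per-message CRC can be sketched as a simple framing scheme: prepend the payload's CRC32, and verify it before delivering the payload. This is not Kafka's actual wire format, just an illustration of the same end-to-end principle:

```python
import struct
import zlib

def frame_message(payload: bytes) -> bytes:
    """Producer side: prepend a 4-byte big-endian CRC32 of the payload."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def unframe_message(frame: bytes) -> bytes:
    """Consumer side: verify the CRC before handing the payload onward."""
    (stored,) = struct.unpack(">I", frame[:4])
    payload = frame[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("corrupt message: CRC mismatch")
    return payload

frame = frame_message(b'{"event": "order_created", "id": 7}')
assert unframe_message(frame) == b'{"event": "order_created", "id": 7}'
```

Because the check travels with the message, any hop -- broker, disk, network -- that corrupts it is caught at the next verification point.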
Chubby
Chubby stores checksums with each file to verify integrity of the small metadata objects it manages.
Checksums are the answer to any interview question about "How do you ensure data integrity?" The key insight to show: checksums alone detect corruption, but you need replication to recover from it. The two patterns work together -- checksums tell you which copy is bad, replication gives you a good copy to fall back on.