14 Checksum
You read a file from a remote server. It arrives intact -- or does it? Bits can flip in memory, on disk, or in transit. A single corrupted byte in a database record can cascade into application-level bugs that are incredibly hard to diagnose. How do you know the data you received is the data that was sent?
Background
In a distributed system, data moves constantly -- between disks, across networks, through memory. At every step, corruption is possible:
- Disk degradation -- sectors go bad silently (this is called "bit rot")
- Network errors -- packets can arrive corrupted despite TCP's checksum, which is a weak 16-bit check that can miss certain multi-bit errors
- Memory faults -- cosmic rays and hardware defects can flip individual bits
- Software bugs -- serialization errors, truncated writes, race conditions
The insidious part: most of these corruptions are silent. The system doesn't crash -- it just returns wrong data. Without explicit verification, you might never know.
Definition
Compute a checksum (a fixed-size fingerprint) of the data and store it alongside the data. When the data is read, recompute the checksum and compare. If they don't match, the data is corrupt.
How it works
- On write: Compute a hash of the data using a function like MD5, SHA-1, SHA-256, or CRC32. Store the resulting checksum alongside the data.
- On read: Retrieve both the data and the stored checksum. Recompute the checksum from the data. If the computed checksum matches the stored one, the data is intact. If not, the data is corrupt.
- On corruption detection: Retrieve the data from another replica, log the corruption event, and potentially mark the corrupted copy for repair.
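The write/read cycle above can be sketched in a few lines. This is a minimal illustration using SHA-256 from Python's standard library; the function names (`write_record`, `read_record`) are hypothetical, not from any particular system:

```python
import hashlib

def write_record(data: bytes) -> tuple[bytes, str]:
    """On write: store the data together with its checksum (SHA-256 hex digest)."""
    return data, hashlib.sha256(data).hexdigest()

def read_record(data: bytes, stored_checksum: str) -> bytes:
    """On read: recompute the checksum and compare against the stored one."""
    if hashlib.sha256(data).hexdigest() != stored_checksum:
        raise IOError("corruption detected: checksum mismatch")
    return data

data, checksum = write_record(b"user:42|balance:100")
assert read_record(data, checksum) == data  # intact data passes verification

# Simulate a single flipped bit: verification now fails.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
try:
    read_record(corrupted, checksum)
except IOError as e:
    print(e)  # corruption detected: checksum mismatch
```

In a real system, the `except` branch is where you would fall back to another replica and report the corruption.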
| Hash function | Speed | Collision resistance | Typical use |
|---|---|---|---|
| CRC32 | Very fast | Low | Network packets, quick integrity checks |
| MD5 | Fast | Medium (broken for cryptography, fine for integrity) | Data distribution (Dynamo key hashing) |
| SHA-256 | Moderate | High | When stronger guarantees are needed |
For data integrity, you don't need cryptographic strength -- you need to detect accidental corruption, not withstand adversarial attacks. CRC32 is often sufficient and much faster. Use a cryptographic hash (SHA-256) when you also need to detect intentional tampering.
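Both kinds of checksum are one call away in Python's standard library. A quick sketch of computing each over the same payload (the payload itself is arbitrary example data):

```python
import hashlib
import zlib

payload = b"x" * (1 << 20)  # 1 MiB of example data

# CRC32: a 32-bit non-cryptographic check; cheap, catches accidental corruption.
crc = zlib.crc32(payload)

# SHA-256: a 256-bit cryptographic hash; detects tampering as well as corruption.
digest = hashlib.sha256(payload).hexdigest()

print(f"CRC32:   {crc:08x}")     # 8 hex digits
print(f"SHA-256: {digest}")      # 64 hex digits
```

The size difference matters for storage overhead too: 4 bytes per CRC32 versus 32 bytes per SHA-256 digest adds up when you keep a checksum per block.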
Where checksums live in a system
Checksums operate at multiple layers, and distributed systems often stack them:
- Per-block/chunk checksums: GFS computes a checksum for each 64KB block within a chunk; HDFS does the same per 512 bytes by default. This lets them pinpoint which part of a large file is corrupt without re-reading everything.
- Per-file checksums: The entire file's content is hashed and stored as metadata.
- Per-message checksums: Kafka includes a CRC in each message record so consumers can verify data integrity end-to-end.
- Per-SSTable checksums: BigTable and Cassandra compute checksums for SSTable blocks.
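The per-block approach is worth sketching, because it shows why systems bother with fine-grained checksums: a mismatch localizes the damage to one block instead of condemning the whole file. A minimal sketch using CRC32 over 64KB blocks (the block size matches GFS; the function names are illustrative):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in GFS

def block_checksums(data: bytes) -> list[int]:
    """Compute one CRC32 per fixed-size block of the data."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def find_corrupt_blocks(data: bytes, stored: list[int]) -> list[int]:
    """Recompute per-block checksums and return the indices that no longer match."""
    return [i for i, (fresh, old) in enumerate(zip(block_checksums(data), stored))
            if fresh != old]

original = bytes(range(256)) * 1024   # 256 KB -> 4 blocks
stored = block_checksums(original)

# Flip one byte in the third block; only that block is reported corrupt.
damaged = bytearray(original)
damaged[2 * BLOCK_SIZE + 7] ^= 0xFF
print(find_corrupt_blocks(bytes(damaged), stored))  # [2]
```

With this, a repair only needs to re-fetch block 2 from a healthy replica, not the whole file.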
Examples
HDFS
HDFS computes checksums for every 512 bytes of data (configurable) and stores them in a separate hidden file. When a client reads data, it verifies the checksums. If corruption is detected, the client reports it to the NameNode and reads from another replica. The NameNode then schedules re-replication of the corrupted block.
GFS
ChunkServers verify checksums during reads. Each 64KB block within a 64MB chunk has a 32-bit checksum. Checksums are kept in memory and persisted in a log, separate from user data. During idle periods, chunkservers scan and verify inactive chunks to detect corruption proactively, before a corrupted replica silently becomes the only surviving copy.
Kafka
Each message in Kafka includes a CRC32 checksum. Brokers verify checksums on receipt. Consumers verify again on read. This end-to-end checking catches corruption at any point in the pipeline.
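The idea of a per-message CRC can be sketched as a simple framing scheme: prepend the payload's CRC32, and verify it before delivering the payload. This is not Kafka's actual wire format, just an illustration of the same end-to-end principle:

```python
import struct
import zlib

def frame_message(payload: bytes) -> bytes:
    """Producer side: prepend a 4-byte big-endian CRC32 of the payload."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def unframe_message(frame: bytes) -> bytes:
    """Consumer side: verify the CRC before handing the payload onward."""
    (stored,) = struct.unpack(">I", frame[:4])
    payload = frame[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("corrupt message: CRC mismatch")
    return payload

frame = frame_message(b'{"event": "order_created", "id": 7}')
assert unframe_message(frame) == b'{"event": "order_created", "id": 7}'
```

Because the check travels with the message, any hop -- broker, disk, network -- that corrupts it is caught at the next verification point.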
Chubby
Chubby stores checksums with each file to verify integrity of the small metadata objects it manages.
Checksums are the answer to any interview question about "How do you ensure data integrity?" The key insight to show: checksums alone detect corruption, but you need replication to recover from it. The two patterns work together -- checksums tell you which copy is bad, replication gives you a good copy to fall back on.