5 Write-ahead Log
Your database is halfway through writing a record when the power goes out. When the machine restarts, what state is the data in? Was the write applied completely, partially, or not at all? Without some form of recovery mechanism, you have no way to know.
Background
Machines crash. Disks fail. Power gets cut. In any system that modifies data, you need a way to recover from mid-operation failures. The core challenge: how do you ensure that either a modification is fully applied, or it can be fully redone from scratch?
Definition
Before applying any modification to the actual data, first write the modification to an append-only log file on disk. This log -- the write-ahead log (WAL), also called a commit log or transaction log -- serves as the system's source of truth. If the system crashes, it replays the log on restart to recover to a consistent state.
How it works
- Client sends a write to the system
- System appends the write to the WAL on disk (sequential I/O -- fast)
- System flushes (fsyncs) the log and acknowledges the write to the client -- the data is now durable
- System applies the write to the in-memory data structure (e.g., MemTable, cache)
- On crash: restart, read the WAL, replay any entries that weren't yet applied
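The steps above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not a production design; the `SimpleWAL` class, the one-JSON-entry-per-line format, and the `put` method are inventions of this sketch:

```python
import json
import os
import tempfile

class SimpleWAL:
    """Minimal write-ahead log: append to disk first, then apply in memory."""

    def __init__(self, path):
        self.path = path
        self.state = {}  # in-memory data (stand-in for a MemTable)
        self.log = open(path, "a")

    def put(self, key, value):
        # 1. Append the modification to the log (sequential I/O).
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Flush to disk; only now is the write durable and safe to ack.
        self.log.flush()
        os.fsync(self.log.fileno())
        # 3. Apply to the in-memory structure.
        self.state[key] = value

# Usage: every put() is durable before it returns.
path = tempfile.NamedTemporaryFile(delete=False, suffix=".wal").name
wal = SimpleWAL(path)
wal.put("user:1", "alice")
wal.put("user:2", "bob")
```

Note the ordering: the fsync happens before the in-memory apply, so a crash can lose an applied-but-unlogged write only if you reverse these steps.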
The key insight: the WAL is append-only and sequential. This makes it extremely fast to write to (sequential disk writes are orders of magnitude faster than random writes on spinning disks, and still markedly faster on SSDs). And since it's sequential, replaying it on recovery is straightforward -- just read from start to end.
Why not just write directly to the data files?
Data files support random access (update any record at any position), which means random disk I/O -- slow and complex. If you crash during a random write, you might have a partially written record. With a WAL:
- Every write is a sequential append (fast, and a partially written entry at the tail can be detected -- e.g., via a per-entry checksum -- and discarded on recovery)
- The log is the sole authority -- even if the data files are corrupted, you can rebuild from the log
- Recovery is deterministic: replay the log and you're back to a consistent state
WAL in the bigger picture
WAL interacts with several other patterns:
- Segmented Log -- Splits the WAL into manageable segments for cleanup and management
- High-Water Mark -- Tracks how much of the WAL has been replicated to followers
- Checksum -- Each log entry can include a checksum to detect corruption
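To make the Checksum point concrete, each entry can be framed with a CRC so corruption is caught at replay time. The 4-byte-CRC / 4-byte-length framing below is invented for this sketch (real systems differ in the details):

```python
import struct
import zlib

def encode_entry(payload: bytes) -> bytes:
    """Frame a log entry as: 4-byte CRC32, 4-byte length, payload."""
    return struct.pack(">II", zlib.crc32(payload), len(payload)) + payload

def decode_entry(buf: bytes, offset: int = 0):
    """Return (payload, next_offset), or None if the entry is corrupt or truncated."""
    header = buf[offset:offset + 8]
    if len(header) < 8:
        return None  # truncated header (torn write)
    crc, length = struct.unpack(">II", header)
    payload = buf[offset + 8:offset + 8 + length]
    if len(payload) < length or zlib.crc32(payload) != crc:
        return None  # torn or corrupted entry
    return payload, offset + 8 + length
```

On recovery, the reader decodes entries one after another and stops at the first `None`, discarding the corrupt tail.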
Examples
Cassandra
When Cassandra receives a write, it first appends it to the commit log (WAL), then writes it to the in-memory MemTable. This ensures durability: if the node crashes before the MemTable is flushed to an SSTable, the commit log is replayed on startup to recover the data.
Kafka
Kafka is essentially a distributed WAL. Every message produced to Kafka is appended to a commit log (partition file on disk). Kafka's entire value proposition -- durable, replayable, ordered message storage -- is built on the WAL pattern.
GFS
The GFS master writes all metadata changes to an operation log before applying them. The operation log is replicated to remote machines for fault tolerance. On recovery, the master replays the log to reconstruct its metadata state.
Chubby
Chubby stores all database transactions in a transaction log (WAL) for fault tolerance. If the Chubby master crashes, the new master replays the log to recover the database state.
HDFS
The HDFS NameNode writes all metadata changes to an EditLog (WAL). The EditLog is replicated to a secondary NameNode or shared storage for recovery.
WAL is the answer to any interview question about "How do you ensure durability?" or "How do you recover from crashes?" The key points: (1) write to the log before acknowledging the client, (2) sequential I/O makes it fast, (3) replay the log on recovery. Nearly every database and distributed system uses this pattern -- it's as fundamental as it gets.