5 Write-ahead Log
Your database is halfway through writing a record when the power goes out. When the machine restarts, what state is the data in? Was the write applied completely, partially, or not at all? Without some form of recovery mechanism, you have no way to know.
Background
Machines crash. Disks fail. Power gets cut. In any system that modifies data, you need a way to recover from mid-operation failures. The core challenge: how do you ensure that either a modification is fully applied, or it can be fully redone from scratch?
Definition
Before applying any modification to the actual data, first write the modification to an append-only log file on disk. This log -- the write-ahead log (WAL), also called a commit log or transaction log -- serves as the system's source of truth. If the system crashes, it replays the log on restart to recover to a consistent state.
How it works
- Client sends a write to the system
- System appends the write to the WAL on disk (sequential I/O -- fast)
- System flushes (fsyncs) the log and acknowledges the write to the client -- the data is now durable
- System applies the write to the in-memory data structure (e.g., MemTable, cache)
- On crash: restart, read the WAL, replay any entries that weren't yet applied
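The steps above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not a production design; the `SimpleWAL` class, the one-JSON-entry-per-line format, and the `put` method are inventions of this sketch:

```python
import json
import os
import tempfile

class SimpleWAL:
    """Minimal write-ahead log: append to disk first, then apply in memory."""

    def __init__(self, path):
        self.path = path
        self.state = {}  # in-memory data (stand-in for a MemTable)
        self.log = open(path, "a")

    def put(self, key, value):
        # 1. Append the modification to the log (sequential I/O).
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Flush to disk; only now is the write durable and safe to ack.
        self.log.flush()
        os.fsync(self.log.fileno())
        # 3. Apply to the in-memory structure.
        self.state[key] = value

# Usage: every put() is durable before it returns.
path = tempfile.NamedTemporaryFile(delete=False, suffix=".wal").name
wal = SimpleWAL(path)
wal.put("user:1", "alice")
wal.put("user:2", "bob")
```

Note the ordering: the fsync happens before the in-memory apply, so a crash can lose an applied-but-unlogged write only if you reverse these steps.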
The key insight: the WAL is append-only and sequential. This makes it extremely fast to write to (sequential disk writes are orders of magnitude faster than random writes on spinning disks, and still markedly faster on SSDs). And since it's sequential, replaying it on recovery is straightforward -- just read from start to end.
Why not just write directly to the data files?
Data files support random access (update any record at any position), which means random disk I/O -- slow and complex. If you crash during a random write, you might have a partially written record. With a WAL:
- Every write is a sequential append (fast, and a partially written entry at the tail can be detected -- e.g., via a per-entry checksum -- and discarded on recovery)
- The log is the sole authority -- even if the data files are corrupted, you can rebuild from the log
- Recovery is deterministic: replay the log and you're back to a consistent state
WAL in the bigger picture
WAL interacts with several other patterns:
- Segmented Log -- Splits the WAL into manageable segments for cleanup and management
- High-Water Mark -- Tracks how much of the WAL has been replicated to followers
- Checksum -- Each log entry can include a checksum to detect corruption
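To make the Checksum point concrete, each entry can be framed with a CRC so corruption is caught at replay time. The 4-byte-CRC / 4-byte-length framing below is invented for this sketch (real systems differ in the details):

```python
import struct
import zlib

def encode_entry(payload: bytes) -> bytes:
    """Frame a log entry as: 4-byte CRC32, 4-byte length, payload."""
    return struct.pack(">II", zlib.crc32(payload), len(payload)) + payload

def decode_entry(buf: bytes, offset: int = 0):
    """Return (payload, next_offset), or None if the entry is corrupt or truncated."""
    header = buf[offset:offset + 8]
    if len(header) < 8:
        return None  # truncated header (torn write)
    crc, length = struct.unpack(">II", header)
    payload = buf[offset + 8:offset + 8 + length]
    if len(payload) < length or zlib.crc32(payload) != crc:
        return None  # torn or corrupted entry
    return payload, offset + 8 + length
```

On recovery, the reader decodes entries one after another and stops at the first `None`, discarding the corrupt tail.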
Examples
Cassandra
When Cassandra receives a write, it first appends it to the commit log (WAL), then writes it to the in-memory MemTable. This ensures durability: if the node crashes before the MemTable is flushed to an SSTable, the commit log is replayed on startup to recover the data.
Kafka
Kafka is essentially a distributed WAL. Every message produced to Kafka is appended to a commit log (partition file on disk). Kafka's entire value proposition -- durable, replayable, ordered message storage -- is built on the WAL pattern.
GFS
The GFS master writes all metadata changes to an operation log before applying them. The operation log is replicated to remote machines for fault tolerance. On recovery, the master replays the log to reconstruct its metadata state.
Chubby
Chubby stores all database transactions in a transaction log (WAL) for fault tolerance. If the Chubby master crashes, the new master replays the log to recover the database state.
HDFS
The HDFS NameNode writes all metadata changes to an EditLog (WAL). The EditLog is replicated to a secondary NameNode or shared storage for recovery.
WAL is the answer to any interview question about "How do you ensure durability?" or "How do you recover from crashes?" The key points: (1) write to the log before acknowledging the client, (2) sequential I/O makes it fast, (3) replay the log on recovery. Nearly every database and distributed system uses this pattern -- it's as fundamental as it gets.