6 Segmented Log
Your write-ahead log has been running for months. It's now a single, multi-gigabyte file. Startup takes forever because the system replays the entire log. Purging old entries requires rewriting the whole file. Searching for a specific entry means scanning from the beginning.
Background
A WAL is essential for durability, but a single, ever-growing log file creates practical problems:
- Slow startup -- replaying the entire log from scratch takes longer as it grows
- Difficult cleanup -- you can't easily delete old entries from the middle of a file
- Performance degradation -- a single huge file makes seeking, rotating, and managing the log expensive
- Error-prone operations -- truncating or compacting a single large file is risky
Definition
Instead of one monolithic log file, split the log into smaller, fixed-size segments. New writes go to the current active segment. When it reaches a size or time threshold, it's closed and a new segment begins. Old segments can be independently cleaned up, archived, or deleted.
How it works
- The system writes to the active segment (the newest one)
- When the active segment reaches a threshold (e.g., 1GB or 4 hours), it's sealed (made read-only) and a new active segment is created
- Sealed segments can be independently:
- Deleted -- if all their data has been flushed elsewhere or is beyond the retention window
- Archived -- moved to cold storage
- Compacted -- merged with other segments to remove duplicate or superseded entries
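The write path above can be sketched in a few dozen lines. This is a minimal illustration, not any real system's implementation: the file-naming scheme, the size threshold, and the `seal` step (marking the file read-only) are all assumptions made for the example.

```python
import os

class SegmentedLog:
    """Minimal segmented-log sketch: writes go to the active segment,
    which is sealed and a new one opened once it exceeds max_segment_bytes.
    Names and thresholds are illustrative, not from any real system."""

    def __init__(self, directory, max_segment_bytes=1024):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.segment_index = 0
        self.active_path = None
        self.active_size = 0
        os.makedirs(directory, exist_ok=True)
        self._open_new_segment()

    def _open_new_segment(self):
        # Segment files are named by a monotonically increasing index,
        # so sorting filenames recovers write order.
        self.active_path = os.path.join(
            self.directory, f"segment-{self.segment_index:08d}.log")
        self.active_size = 0
        self.segment_index += 1

    def append(self, entry: bytes):
        # Roll over *before* writing if the entry would overflow
        # the active segment.
        if self.active_size + len(entry) + 1 > self.max_segment_bytes:
            self._seal_active_segment()
            self._open_new_segment()
        with open(self.active_path, "ab") as f:
            f.write(entry + b"\n")
        self.active_size += len(entry) + 1

    def _seal_active_segment(self):
        # Sealed segments are immutable; mark the file read-only.
        if os.path.exists(self.active_path):
            os.chmod(self.active_path, 0o444)

    def segments(self):
        return sorted(
            p for p in os.listdir(self.directory) if p.endswith(".log"))
```

Because sealed segments are never modified, readers and cleanup jobs can touch them without coordinating with the writer.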
| Aspect | Single log file | Segmented log |
|---|---|---|
| Cleanup | Rewrite entire file | Delete individual segments |
| Startup recovery | Replay entire log | Replay only recent segments (older data is in checkpoints) |
| Disk management | One growing file | Many small, manageable files |
| Concurrent access | Lock contention on single file | Old segments are immutable (no locks needed) |
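The "replay only recent segments" row deserves a concrete sketch. Assuming segment filenames encode a monotonically increasing index, and assuming a checkpoint that records the newest segment already reflected in a snapshot (both conventions are hypothetical, chosen for this example), recovery can skip everything at or below the checkpoint:

```python
import os

def replay_from_checkpoint(segment_paths, checkpoint_segment):
    """Return entries from segments newer than the last checkpoint.
    checkpoint_segment is the index of the newest segment whose data is
    already captured in a checkpoint/snapshot (hypothetical scheme)."""
    entries = []
    for path in sorted(segment_paths):
        # Segment index is encoded in the filename: segment-00000003.log
        name = os.path.basename(path)
        index = int(name.rsplit("-", 1)[1].split(".")[0])
        if index <= checkpoint_segment:
            continue  # already covered by the checkpoint; skip replay
        with open(path, "rb") as f:
            entries.extend(line.rstrip(b"\n") for line in f)
    return entries
```

Startup cost now scales with the data written since the last checkpoint, not with the total history of the log.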
Examples
Cassandra
Cassandra splits its commit log into segments. Once every entry in a segment has been flushed from the memtable to SSTables, that segment can be archived, deleted, or recycled. This keeps the commit log from growing without bound and reduces disk seeks.

Kafka
Kafka uses segmented logs for each partition. Each partition's log is split into fixed-size segment files. Kafka regularly needs to purge old messages from disk according to its retention policy, and segmented files make this cheap -- delete an entire segment instead of editing one large file. Segments also make Kafka's time- and size-based retention and its log compaction tractable, since compaction rewrites only sealed segments and never blocks writes to the active one.
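Retention-style cleanup falls out of the segment structure almost for free: expiring old data is a whole-file delete. The sketch below uses file modification time as a stand-in for a segment's last-write time; it illustrates the idea behind Kafka-style time-based retention but is not Kafka's actual code.

```python
import os
import time

def purge_expired_segments(directory, retention_seconds, now=None):
    """Delete sealed segments older than the retention window, using
    file mtime as a stand-in for the segment's last-write time.
    A sketch of time-based retention, not any real system's code."""
    now = now if now is not None else time.time()
    removed = []
    segments = sorted(p for p in os.listdir(directory) if p.endswith(".log"))
    # Never delete the active (newest) segment.
    for name in segments[:-1]:
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > retention_seconds:
            os.remove(path)  # whole-segment delete: one unlink, no rewrite
            removed.append(name)
    return removed
```

Contrast this with a single log file, where enforcing retention means rewriting everything that survives.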
Segmented log is the natural follow-up whenever you mention write-ahead logs in an interview. The interviewer may ask "What happens when the log gets too large?" The answer: split it into segments, use checkpoints so you only replay recent segments, and independently clean up old ones. This shows you think about operational concerns, not just correctness.