6 Segmented Log

Your write-ahead log has been running for months. It's now a single, multi-gigabyte file. Startup takes forever because the system replays the entire log. Purging old entries requires rewriting the whole file. Searching for a specific entry means scanning from the beginning.

Think first
Your WAL has been running for months and is now a single multi-gigabyte file. What operational problems does this create, and how would you solve them?

Background

A WAL is essential for durability, but a single, ever-growing log file creates practical problems:

  • Slow startup -- replaying the entire log from scratch takes longer as it grows
  • Difficult cleanup -- you can't easily delete old entries from the middle of a file
  • Performance degradation -- a single huge file makes seeking, rotating, and managing the log expensive
  • Error-prone operations -- truncating or compacting a single large file is risky

Definition

Instead of one monolithic log file, split the log into smaller, fixed-size segments. New writes go to the current active segment. When it reaches a size or time threshold, it's closed and a new segment begins. Old segments can be independently cleaned up, archived, or deleted.

How it works

  1. The system writes to the active segment (the newest one)
  2. When the active segment reaches a threshold (e.g., 1GB or 4 hours), it's sealed (made read-only) and a new active segment is created
  3. Sealed segments can be independently:
    • Deleted -- if all their data has been flushed elsewhere or is beyond the retention window
    • Archived -- moved to cold storage
    • Compacted -- merged with other segments to remove duplicate or superseded entries
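The rollover logic above can be sketched in a few lines of Python. This is a minimal illustration, not any real system's API: the `SegmentedLog` class name, the `segment-<index>.log` naming scheme, and the size threshold are all assumptions made for the example, and restart/recovery of the active segment is omitted.

```python
import os

class SegmentedLog:
    """Sketch of a segmented log: writes go to the active segment,
    which is sealed (made read-only) and replaced once the next write
    would push it past a size threshold. Names are illustrative."""

    def __init__(self, directory, max_segment_bytes=1024):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.segment_index = 0
        os.makedirs(directory, exist_ok=True)
        # For simplicity, always start at segment 0 (restart handling omitted).
        self.active = open(self._path(0), "ab")

    def _path(self, index):
        return os.path.join(self.directory, f"segment-{index:08d}.log")

    def append(self, entry: bytes):
        # Roll over before the write would exceed the threshold.
        if self.active.tell() + len(entry) + 1 > self.max_segment_bytes:
            self._roll()
        self.active.write(entry + b"\n")
        self.active.flush()
        os.fsync(self.active.fileno())  # durability: force bytes to disk

    def _roll(self):
        # Seal the current segment: close it and mark it read-only,
        # then open a fresh active segment.
        self.active.close()
        os.chmod(self._path(self.segment_index), 0o444)
        self.segment_index += 1
        self.active = open(self._path(self.segment_index), "ab")

    def segments(self):
        return sorted(os.listdir(self.directory))
```

Because sealed segments are never written again, readers and cleanup jobs can touch them without coordinating with the writer.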
Aspect              Single log file                  Segmented log
------              ---------------                  -------------
Cleanup             Rewrite entire file              Delete individual segments
Startup recovery    Replay entire log                Replay only recent segments (older data is in checkpoints)
Disk management     One growing file                 Many small, manageable files
Concurrent access   Lock contention on single file   Old segments are immutable (no locks needed)

Examples

Cassandra

Cassandra splits its commit log into segments. Once a segment's data has been fully flushed from the MemTable to SSTables, that segment can be archived, deleted, or recycled. This keeps the commit log from growing without bound and reduces disk seeks.

Kafka

Kafka uses segmented logs for each partition. Each partition's log is split into fixed-size segment files. Kafka regularly purges old messages according to time-based and size-based retention policies, and segment files make this cheap: delete an entire segment instead of rewriting one large file. Segments also enable Kafka's log compaction, which rewrites sealed segments to keep only the latest value for each key while leaving the active segment untouched.
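Retention-style cleanup falls out of the segment structure naturally: drop whole expired segments instead of editing any file. Below is a hedged sketch of that idea (not Kafka's actual implementation; the function name and the use of file modification time as the segment's age are assumptions for illustration).

```python
import os
import time

def purge_expired_segments(directory, retention_seconds, now=None):
    """Delete sealed segments older than the retention window.
    The newest segment is treated as active and never deleted.
    Illustrative sketch: real systems track segment age in metadata
    rather than relying on file mtime."""
    now = now if now is not None else time.time()
    segments = sorted(f for f in os.listdir(directory) if f.endswith(".log"))
    deleted = []
    # Skip the last (active) segment; only sealed segments are candidates.
    for name in segments[:-1]:
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > retention_seconds:
            os.remove(path)  # one cheap unlink per segment, no rewriting
            deleted.append(name)
    return deleted
```

Compare this with a single log file, where honoring the same retention policy would mean copying every surviving entry into a new file.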

Interview angle

Segmented log is the natural follow-up whenever you mention write-ahead logs in an interview. The interviewer may ask "What happens when the log gets too large?" The answer: split it into segments, use checkpoints so you only replay recent segments, and independently clean up old ones. This shows you think about operational concerns, not just correctness.
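The "replay only recent segments" answer can be made concrete with a short sketch. Assume, purely for illustration, that segments are named `segment-<index>.log` and that a checkpoint records the index of the last segment whose effects it already captures; recovery then skips everything at or below that index.

```python
import os

def recover(directory, checkpoint_segment):
    """Replay only segments newer than the last checkpoint.
    `checkpoint_segment` is the index of the newest segment already
    reflected in the checkpoint/snapshot (illustrative scheme)."""
    entries = []
    for name in sorted(os.listdir(directory)):
        index = int(name.split("-")[1].split(".")[0])
        if index <= checkpoint_segment:
            continue  # already captured by the checkpoint; no replay needed
        with open(os.path.join(directory, name), "rb") as f:
            entries.extend(line.rstrip(b"\n") for line in f)
    return entries
```

Startup cost is now proportional to the data written since the last checkpoint, not to the lifetime of the log.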

Quiz
In Kafka, each partition's log is split into segments with a time-based retention policy (e.g., 7 days). What would happen if Kafka used a single unsegmented log file per partition instead?