6 Segmented Log
Your write-ahead log has been running for months. It's now a single, multi-gigabyte file. Startup takes forever because the system replays the entire log. Purging old entries requires rewriting the whole file. Searching for a specific entry means scanning from the beginning.
Background
A WAL is essential for durability, but a single, ever-growing log file creates practical problems:
- Slow startup -- replaying the entire log from scratch takes longer as it grows
- Difficult cleanup -- you can't easily delete old entries from the middle of a file
- Performance degradation -- a single huge file makes seeking, rotating, and managing the log expensive
- Error-prone operations -- truncating or compacting a single large file is risky
Definition
Instead of one monolithic log file, split the log into smaller, fixed-size segments. New writes go to the current active segment. When it reaches a size or time threshold, it's closed and a new segment begins. Old segments can be independently cleaned up, archived, or deleted.
How it works
- The system writes to the active segment (the newest one)
- When the active segment reaches a threshold (e.g., 1GB or 4 hours), it's sealed (made read-only) and a new active segment is created
- Sealed segments can be independently:
- Deleted -- if all their data has been flushed elsewhere or is beyond the retention window
- Archived -- moved to cold storage
- Compacted -- merged with other segments to remove duplicate or superseded entries
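The write path above can be sketched in a few dozen lines. This is a minimal illustration, not any real system's implementation: the file-naming scheme, the size threshold, and the `seal` step (marking the file read-only) are all assumptions made for the example.

```python
import os

class SegmentedLog:
    """Minimal segmented-log sketch: writes go to the active segment,
    which is sealed and a new one opened once it exceeds max_segment_bytes.
    Names and thresholds are illustrative, not from any real system."""

    def __init__(self, directory, max_segment_bytes=1024):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.segment_index = 0
        self.active_path = None
        self.active_size = 0
        os.makedirs(directory, exist_ok=True)
        self._open_new_segment()

    def _open_new_segment(self):
        # Segment files are named by a monotonically increasing index,
        # so sorting filenames recovers write order.
        self.active_path = os.path.join(
            self.directory, f"segment-{self.segment_index:08d}.log")
        self.active_size = 0
        self.segment_index += 1

    def append(self, entry: bytes):
        # Roll over *before* writing if the entry would overflow
        # the active segment.
        if self.active_size + len(entry) + 1 > self.max_segment_bytes:
            self._seal_active_segment()
            self._open_new_segment()
        with open(self.active_path, "ab") as f:
            f.write(entry + b"\n")
        self.active_size += len(entry) + 1

    def _seal_active_segment(self):
        # Sealed segments are immutable; mark the file read-only.
        if os.path.exists(self.active_path):
            os.chmod(self.active_path, 0o444)

    def segments(self):
        return sorted(
            p for p in os.listdir(self.directory) if p.endswith(".log"))
```

Because sealed segments are never modified, readers and cleanup jobs can touch them without coordinating with the writer.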
| Aspect | Single log file | Segmented log |
|---|---|---|
| Cleanup | Rewrite entire file | Delete individual segments |
| Startup recovery | Replay entire log | Replay only recent segments (older data is in checkpoints) |
| Disk management | One growing file | Many small, manageable files |
| Concurrent access | Lock contention on single file | Old segments are immutable (no locks needed) |
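The "replay only recent segments" row deserves a concrete sketch. Assuming segment filenames encode a monotonically increasing index, and assuming a checkpoint that records the newest segment already reflected in a snapshot (both conventions are hypothetical, chosen for this example), recovery can skip everything at or below the checkpoint:

```python
import os

def replay_from_checkpoint(segment_paths, checkpoint_segment):
    """Return entries from segments newer than the last checkpoint.
    checkpoint_segment is the index of the newest segment whose data is
    already captured in a checkpoint/snapshot (hypothetical scheme)."""
    entries = []
    for path in sorted(segment_paths):
        # Segment index is encoded in the filename: segment-00000003.log
        name = os.path.basename(path)
        index = int(name.rsplit("-", 1)[1].split(".")[0])
        if index <= checkpoint_segment:
            continue  # already covered by the checkpoint; skip replay
        with open(path, "rb") as f:
            entries.extend(line.rstrip(b"\n") for line in f)
    return entries
```

Startup cost now scales with the data written since the last checkpoint, not with the total history of the log.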
Examples
Cassandra
Cassandra splits its commit log into segments. Once every entry in a segment has been flushed from the memtable to SSTables, that segment can be archived, deleted, or recycled. This keeps the commit log from growing without bound and reduces disk seeks.

Kafka
Kafka uses segmented logs for each partition. Each partition's log is split into fixed-size segment files. Kafka regularly needs to purge old messages from disk according to its retention policy, and segmented files make this cheap -- delete an entire segment instead of editing one large file. Segments also make Kafka's time- and size-based retention and its log compaction tractable, since compaction rewrites only sealed segments and never blocks writes to the active one.
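Retention-style cleanup falls out of the segment structure almost for free: expiring old data is a whole-file delete. The sketch below uses file modification time as a stand-in for a segment's last-write time; it illustrates the idea behind Kafka-style time-based retention but is not Kafka's actual code.

```python
import os
import time

def purge_expired_segments(directory, retention_seconds, now=None):
    """Delete sealed segments older than the retention window, using
    file mtime as a stand-in for the segment's last-write time.
    A sketch of time-based retention, not any real system's code."""
    now = now if now is not None else time.time()
    removed = []
    segments = sorted(p for p in os.listdir(directory) if p.endswith(".log"))
    # Never delete the active (newest) segment.
    for name in segments[:-1]:
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > retention_seconds:
            os.remove(path)  # whole-segment delete: one unlink, no rewrite
            removed.append(name)
    return removed
```

Contrast this with a single log file, where enforcing retention means rewriting everything that survives.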
Segmented log is the natural follow-up whenever you mention write-ahead logs in an interview. The interviewer may ask "What happens when the log gets too large?" The answer: split it into segments, use checkpoints so you only replay recent segments, and independently clean up old ones. This shows you think about operational concerns, not just correctness.