Skip to main content

Anatomy of Cassandra's Write Operation

Cassandra's write path is designed for extreme write performance. Every write is sequential I/O -- no seeks, no reads, no random access. Here's the complete path:

  1. Write to the commit log (on disk) -- durability
  2. Write to the MemTable (in memory) -- fast access
  3. Acknowledge to the client -- done!
  4. Later: flush MemTable to SSTable (on disk) -- permanent storage
  5. Later: compaction merges SSTables -- cleanup
Think first
Cassandra writes data to both a commit log and a MemTable. Why not skip the commit log and write directly to the MemTable, then flush to disk later?

Step 1: Commit log

The first thing a node does with a write is append it to the commit log -- a write-ahead log stored on disk. The write is not considered successful until it's in the commit log. This ensures that if the node crashes before the MemTable is flushed, the data can be recovered by replaying the log.

The commit log is append-only and sequential -- the fastest possible disk operation.

Step 2: MemTable

After the commit log, data goes to the MemTable -- an in-memory, sorted data structure (one per table per node).

PropertyDetail
Stored inMemory
Sorted byPartition key + clustering columns
PurposeFast reads for recently written data
LifetimeUntil flushed to an SSTable

After writing to both the commit log and MemTable, the node acknowledges success to the coordinator.

Think first
After flushing a MemTable to an SSTable on disk, why does Cassandra make SSTables immutable rather than updating them in place when new writes arrive?

Step 3: SSTable flush

When the MemTable reaches a size threshold, it's flushed to disk as an SSTable (Sorted String Table -- a term borrowed from BigTable). At this point:

  • A new empty MemTable is created for subsequent writes
  • The corresponding commit log entries are removed (the data is now safely on disk in the SSTable)
  • The SSTable is immutable -- once written, it's never modified
Why immutable SSTables?

Immutability is key to Cassandra's write performance. Since SSTables are never modified after creation, there's no need for locks, no risk of partial updates, and reads can happen without coordination. The cost: updates and deletes create new entries rather than modifying existing ones, which is handled by compaction and tombstones.

The current state of any Cassandra table is the union of its MemTable (in memory) and all its SSTables (on disk). Reads must check both.

Why writes are so fast

Every step in the write path is sequential I/O: commit log append → MemTable insert (memory, sorted) → SSTable flush (sequential write). No reads. No seeks. No random access. This is why Cassandra can achieve hundreds of thousands of writes per second per node.

Quiz
A Cassandra node crashes immediately after acknowledging a write to the client but before the MemTable is flushed to an SSTable. What happens to that write when the node restarts?