Anatomy of Cassandra's Write Operation

Cassandra's write path is designed for extreme write performance. Every write is sequential I/O -- no seeks, no reads, no random access. Here's the complete path:

Write to the commit log (on disk) -- durability
Write to the MemTable (in memory) -- fast access
Acknowledge to the client -- done!
Later: flush MemTable to SSTable (on disk) -- permanent storage
Later: compaction merges SSTables -- cleanup

Think first

Cassandra writes data to both a commit log and a MemTable. Why not skip the commit log and write directly to the MemTable, then flush to disk later?

Step 1: Commit log

The first thing a node does with a write is append it to the commit log -- a write-ahead log stored on disk. The write is not considered successful until it's in the commit log. This ensures that if the node crashes before the MemTable is flushed, the data can be recovered by replaying the log.

The commit log is append-only and sequential -- the fastest possible disk operation.

Step 2: MemTable

After the commit log, data goes to the MemTable -- an in-memory, sorted data structure (one per table per node).

Property	Detail
Stored in	Memory
Sorted by	Partition key + clustering columns
Purpose	Fast reads for recently written data
Lifetime	Until flushed to an SSTable

After writing to both the commit log and MemTable, the node acknowledges success to the coordinator.

Think first

After flushing a MemTable to an SSTable on disk, why does Cassandra make SSTables immutable rather than updating them in place when new writes arrive?

Step 3: SSTable flush

When the MemTable reaches a size threshold, it's flushed to disk as an SSTable (Sorted String Table -- a term borrowed from BigTable). At this point:

A new empty MemTable is created for subsequent writes
The corresponding commit log entries are removed (the data is now safely on disk in the SSTable)
The SSTable is immutable -- once written, it's never modified

Why immutable SSTables?

Immutability is key to Cassandra's write performance. Since SSTables are never modified after creation, there's no need for locks, no risk of partial updates, and reads can happen without coordination. The cost: updates and deletes create new entries rather than modifying existing ones, which is handled by compaction and tombstones.

The current state of any Cassandra table is the union of its MemTable (in memory) and all its SSTables (on disk). Reads must check both.

Why writes are so fast

Every step in the write path is sequential I/O: commit log append → MemTable insert (memory, sorted) → SSTable flush (sequential write). No reads. No seeks. No random access. This is why Cassandra can achieve hundreds of thousands of writes per second per node.

Quiz

A Cassandra node crashes immediately after acknowledging a write to the client but before the MemTable is flushed to an SSTable. What happens to that write when the node restarts?

The write is lost because the MemTable was in memory and is gone after the crash.

The write is recovered by replaying the commit log, which reconstructs the MemTable entry.

The write is recovered from other replicas via read repair.

The write is lost, but the coordinator retries it automatically when the node comes back.

Step 1: Commit log​

Step 2: MemTable​

Step 3: SSTable flush​

Step 1: Commit log

Step 2: MemTable

Step 3: SSTable flush