Anatomy of Cassandra's Write Operation
Cassandra's write path is designed for very high write throughput. Every disk write on this path is sequential I/O -- no seeks, no read-before-write, no random access. Here's the complete path:
- Write to the commit log (on disk) -- durability
- Write to the MemTable (in memory) -- fast access
- Acknowledge to the client -- done!
- Later: flush MemTable to SSTable (on disk) -- permanent storage
- Later: compaction merges SSTables -- cleanup
Step 1: Commit log
The first thing a node does with a write is append it to the commit log -- a write-ahead log stored on disk. The write is not considered successful until it's in the commit log. This ensures that if the node crashes before the MemTable is flushed, the data can be recovered by replaying the log. (Whether the node waits for an fsync before acknowledging depends on the commitlog_sync setting.)
The commit log is append-only and sequential, which is the cheapest kind of disk write: nothing is read first and no existing data is touched.
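The append-then-replay idea can be sketched in a few lines of Python. This is a toy illustration, not Cassandra's actual commit log format: the `CommitLog` class, the `replay` function, and the one-JSON-record-per-line layout are all assumptions made for the example.

```python
import json
import os
import tempfile


class CommitLog:
    """Toy append-only write-ahead log: one JSON record per line (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        # Mode "a" means every write lands at the end of the file.
        self._file = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        record = json.dumps({"key": key, "value": value})
        self._file.write(record + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # force the record to stable storage

    def close(self):
        self._file.close()


def replay(path):
    """Rebuild in-memory state by re-applying every logged write in order."""
    state = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            state[record["key"]] = record["value"]
    return state


# Usage: write, "crash" (drop the in-memory state), then recover from the log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "commitlog.jsonl")
    log = CommitLog(log_path)
    log.append("user:1", "alice")
    log.append("user:2", "bob")
    log.append("user:1", "alice-v2")  # a later write to the same key wins on replay
    log.close()
    recovered = replay(log_path)

print(recovered)  # {'user:1': 'alice-v2', 'user:2': 'bob'}
```

Calling `fsync` on every append is the simplest way to show durability here; it is also the slowest, which is why real systems usually batch or schedule their syncs.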
Step 2: MemTable
After the commit log, data goes to the MemTable -- an in-memory, sorted data structure (one per table per node).
| Property | Detail |
|---|---|
| Stored in | Memory |
| Sorted by | Partition key (by token), then clustering columns |
| Purpose | Buffer writes in sorted order; fast reads for recently written data |
| Lifetime | Until flushed to an SSTable |
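After writing to both the commit log and MemTable, the node acknowledges success to the coordinator, which responds to the client once enough replicas have acknowledged to satisfy the write's consistency level.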
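A minimal sketch of a sorted in-memory table, assuming rows are addressed by a (partition key, clustering key) pair. The `MemTable` class and its method names are hypothetical, and the sketch sorts by the raw partition key for simplicity, whereas Cassandra orders partitions by the key's token.

```python
import bisect


class MemTable:
    """Toy in-memory table kept sorted by (partition_key, clustering_key)."""

    def __init__(self):
        self._keys = []    # sorted list of (partition_key, clustering_key)
        self._values = {}  # key tuple -> value

    def put(self, partition_key, clustering_key, value):
        key = (partition_key, clustering_key)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value           # overwrite if the row already exists

    def get(self, partition_key, clustering_key):
        return self._values.get((partition_key, clustering_key))

    def scan_partition(self, partition_key):
        """Yield (clustering_key, value) for one partition in clustering order."""
        # A 1-tuple sorts before every 2-tuple with the same first element.
        start = bisect.bisect_left(self._keys, (partition_key,))
        for key in self._keys[start:]:
            if key[0] != partition_key:
                break
            yield key[1], self._values[key]


# Usage: rows arrive out of order but are read back sorted.
mt = MemTable()
mt.put("sensor-2", 1, 20.5)
mt.put("sensor-1", 2, 18.0)
mt.put("sensor-1", 1, 17.5)
rows = list(mt.scan_partition("sensor-1"))
print(rows)  # [(1, 17.5), (2, 18.0)]
```

Keeping the keys sorted at insert time is what lets a later flush write them out in order without a separate sort step.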
Step 3: SSTable flush
When the MemTable reaches a size threshold, it's flushed to disk as an SSTable (Sorted Strings Table -- a term borrowed from Google's Bigtable). At this point:
- A new empty MemTable is created for subsequent writes
- The corresponding commit log entries are no longer needed, so the segments holding them can be recycled or deleted (the data is now safely on disk in the SSTable)
- The SSTable is immutable -- once written, it's never modified
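The flush steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the MemTable is a plain dict, the threshold counts entries rather than bytes, the file layout is invented for the example, and commit log cleanup is left out. `flush_memtable`, `read_sstable`, and `FLUSH_THRESHOLD` are hypothetical names.

```python
import json
import os
import tempfile

FLUSH_THRESHOLD = 3  # hypothetical: flush after 3 entries (real thresholds are size-based)


def flush_memtable(memtable, sstable_dir, generation):
    """Write the memtable to a new sorted file; return (path, fresh empty memtable)."""
    path = os.path.join(sstable_dir, f"sstable-{generation}.jsonl")
    # Mode "x" fails if the file already exists, so a flushed file is never overwritten.
    with open(path, "x", encoding="utf-8") as f:
        for key in sorted(memtable):
            f.write(json.dumps([key, memtable[key]]) + "\n")
    return path, {}


def read_sstable(path):
    """Load a flushed file back as a list of (key, value) pairs."""
    with open(path, encoding="utf-8") as f:
        return [tuple(json.loads(line)) for line in f]


# Usage: the fourth write lands in a fresh memtable after the first three are flushed.
with tempfile.TemporaryDirectory() as tmp:
    memtable, sstables = {}, []
    for i, key in enumerate(["c", "a", "b", "d"]):
        memtable[key] = i
        if len(memtable) >= FLUSH_THRESHOLD:
            path, memtable = flush_memtable(memtable, tmp, generation=len(sstables))
            sstables.append(path)
    flushed = read_sstable(sstables[0])

print(flushed)   # [('a', 1), ('b', 2), ('c', 0)]
print(memtable)  # {'d': 3}
```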
Immutability is key to Cassandra's write performance. Since SSTables are never modified after creation, there's no need for locks, no risk of partial updates, and reads can happen without coordination. The cost: updates and deletes create new entries rather than modifying existing ones. An update is written as a newer version of the row, a delete is written as a marker called a tombstone, and compaction later merges SSTables and discards the superseded entries.
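As a rough sketch of what compaction does with those extra entries, the function below merges several sorted runs, keeps only the newest version of each key, and drops keys whose newest version is a tombstone. The `(key, timestamp, value)` tuple layout, the `TOMBSTONE` marker, and the `compact` function are assumptions for illustration. Dropping tombstones immediately is also a simplification: Cassandra retains them for a grace period before they become eligible for removal.

```python
import heapq

TOMBSTONE = None  # hypothetical marker meaning "this key was deleted"


def compact(sstables):
    """Merge sorted runs of (key, timestamp, value) into a single run.

    Each input must already be sorted by (key, timestamp).
    """
    merged = []
    # heapq.merge streams the sorted inputs in (key, timestamp) order,
    # so versions of the same key arrive together, oldest first.
    for key, ts, value in heapq.merge(*sstables, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, ts, value)  # newer version replaces the older one
        else:
            merged.append((key, ts, value))
    # Keys whose newest version is a tombstone disappear from the output.
    return [entry for entry in merged if entry[2] is not TOMBSTONE]


# Usage: "a" was updated, "b" was deleted, "c" was never touched again.
old = [("a", 1, "v1"), ("b", 1, "v1"), ("c", 1, "v1")]
new = [("a", 2, "v2"), ("b", 3, TOMBSTONE)]
compacted = compact([old, new])
print(compacted)  # [('a', 2, 'v2'), ('c', 1, 'v1')]
```

Note that the inputs are only read and a new run is produced, which matches the rule that existing SSTables are never modified.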
The current state of any Cassandra table is the union of its MemTable (in memory) and all its SSTables (on disk). Reads must check both.
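A minimal sketch of that union, assuming each source maps a key to a `(timestamp, value)` pair and the entry with the highest timestamp wins. The `read` function and the use of `None` as a deletion marker are hypothetical simplifications. A real read path also uses structures such as bloom filters and indexes to avoid opening every SSTable.

```python
def read(key, memtable, sstables):
    """Return the newest value for key across the memtable and all SSTables.

    Each source maps key -> (timestamp, value); a value of None stands for
    a deletion marker in this sketch.
    """
    candidates = [source[key] for source in [memtable, *sstables] if key in source]
    if not candidates:
        return None  # key was never written
    _, value = max(candidates, key=lambda c: c[0])  # highest timestamp wins
    return value


# Usage: one key lives in memory, one only on disk, one has been deleted.
memtable = {"a": (5, "newest")}
sstables = [
    {"a": (1, "old"), "b": (2, "only-on-disk"), "c": (1, "before-delete")},
    {"c": (3, None)},
]
print(read("a", memtable, sstables))  # newest
print(read("b", memtable, sstables))  # only-on-disk
print(read("c", memtable, sstables))  # None
```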
Every disk operation in the write path is sequential: commit log append (sequential write) → MemTable insert (in memory, sorted, no disk I/O) → SSTable flush (sequential write). No read-before-write. No seeks. No random access. This is why Cassandra can reach hundreds of thousands of writes per second per node, depending on hardware and workload.