Tombstones
In a distributed system, deleting data is surprisingly dangerous. If you delete a record on Node A while Node B is down, Node B still has the old data. When Node B comes back online and syncs with Node A, the deleted data could reappear -- Node B thinks it has data that A is missing and helpfully "repairs" it.
Cassandra prevents this with tombstones -- markers that say "this data was intentionally deleted."
How tombstones work
When you execute a DELETE in Cassandra, the data isn't actually removed. Instead, Cassandra writes a tombstone -- a special marker with:
- The key of the deleted data
- A timestamp (when the delete happened)
- A grace period (`gc_grace_seconds`) -- how long the tombstone must be kept before it is eligible for garbage collection (default: 10 days)
The tombstone is treated like any other write: it goes to the commit log, the MemTable, and eventually an SSTable. During reads, tombstones override the data they mark -- the system knows this data was deleted, not missing.
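To make the read-time behavior concrete, here is a minimal sketch (not Cassandra's actual code) of last-write-wins cells where a delete is just another write. The `Table` class, the `DELETED` sentinel, and the plain-dict storage are all illustrative simplifications:

```python
# Illustrative model: a cell is a (timestamp, value) pair, and a tombstone
# is a cell whose value is the DELETED sentinel. On read, the cell with
# the highest timestamp wins, so a newer tombstone shadows older data.
DELETED = object()

class Table:
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        # Last-write-wins: keep the cell with the higher timestamp.
        current = self.cells.get(key)
        if current is None or ts > current[0]:
            self.cells[key] = (ts, value)

    def delete(self, key, ts):
        # A DELETE is just another write: it records a tombstone cell.
        self.write(key, DELETED, ts)

    def read(self, key):
        cell = self.cells.get(key)
        if cell is None or cell[1] is DELETED:
            return None  # tombstoned or never written
        return cell[1]

t = Table()
t.write("user:1", "alice", ts=100)
t.delete("user:1", ts=200)
# A recovering replica replays the stale write -- the tombstone's newer
# timestamp wins, so the deleted value does not come back.
t.write("user:1", "alice", ts=100)
assert t.read("user:1") is None
```

This is the key point: because the tombstone carries a timestamp, a stale write replayed from a recovered replica loses the comparison and the data stays deleted.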
Why the 10-day default?
The grace period gives downed nodes time to recover and receive the tombstone via repair, read repair, or hinted handoff. If a node is down for less than 10 days -- and repair runs before the window closes -- it will learn about the deletion when it comes back. If it's down for longer, it should be treated as permanently failed and replaced rather than allowed to rejoin, because it may reintroduce data whose tombstones have already been purged.
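The window itself is simple arithmetic; Cassandra exposes it as the `gc_grace_seconds` table option. A hypothetical check, just to show the timing logic:

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra's default: 10 days (864,000 s)

def tombstone_droppable(delete_ts, now, gc_grace=GC_GRACE_SECONDS):
    # A tombstone may only be garbage-collected once the full grace
    # period has elapsed since the delete was written. Any node that
    # recovers and repairs before this returns True still sees the marker.
    return now - delete_ts >= gc_grace

# A node down for 3 days comes back well inside the window:
assert not tombstone_droppable(delete_ts=0, now=3 * 24 * 3600)
# After 10 full days the tombstone becomes eligible for removal:
assert tombstone_droppable(delete_ts=0, now=10 * 24 * 3600)
```

Lowering `gc_grace_seconds` reclaims space faster but shrinks the safety window for recovering replicas, which is why changing it is usually paired with more frequent repairs.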
Tombstone removal
Tombstones are removed during compaction. When compaction encounters a tombstone whose grace period has expired, it permanently discards both the tombstone and the older data it shadows.
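A compaction pass can be sketched as a merge of cells from several SSTables followed by a purge of expired tombstones. This is a deliberately simplified model (dicts standing in for SSTables, a `TOMBSTONE` sentinel for delete markers), not Cassandra's implementation:

```python
TOMBSTONE = object()

def compact(sstables, now, gc_grace):
    """Merge cells from several SSTables and purge expired tombstones.

    Each SSTable is modeled as a dict of key -> (timestamp, value), with
    TOMBSTONE as the value marking a delete.
    """
    # Step 1: merge, keeping the newest cell per key. Older data shadowed
    # by a tombstone loses the timestamp comparison here and is dropped.
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Step 2: drop tombstones whose grace period has fully elapsed;
    # younger tombstones are kept so recovering replicas can still see them.
    return {
        key: cell for key, cell in merged.items()
        if not (cell[1] is TOMBSTONE and now - cell[0] >= gc_grace)
    }

old = {"a": (100, "x"), "b": (100, "y")}
new = {"a": (200, TOMBSTONE), "b": (1500, TOMBSTONE)}
result = compact([old, new], now=2000, gc_grace=1000)
assert "a" not in result               # expired tombstone: gone for good
assert result["b"][1] is TOMBSTONE     # still inside the grace period
```

Note that "a" disappears entirely: both the marker and the shadowed value are gone after this pass, while "b" keeps its tombstone because other replicas may still need to learn of the delete.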
Common problems with tombstones
Tombstones are a necessary mechanism, but they create real operational challenges:
| Problem | Why it happens | Impact |
|---|---|---|
| Increased storage | A delete adds data (the tombstone) instead of removing it | Disk usage temporarily increases after deletes |
| Read performance degradation | Reads must scan through tombstones to find live data | Tables with many tombstones become slow to read; can cause timeouts |
| Tombstone accumulation | Frequent deletes without compaction build up tombstones | Memory and disk pressure; degraded read latency |
If your application frequently deletes and re-inserts data in the same partition, tombstones accumulate. A read on that partition must scan through all tombstones to find live data. This can cause read timeouts and is one of the most common Cassandra performance issues in production. The fix: model your data to minimize deletes, or use TTLs with the Time-Window Compaction Strategy (TWCS) so that entire SSTables age out and can be dropped wholesale.
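Why tombstone-heavy partitions get slow can be shown with a toy read path. The sketch below (assumed names, not Cassandra code) counts how many cells must be inspected to return a handful of live rows:

```python
def read_partition(cells, limit):
    # cells: list of (clustering_key, value) in clustering order, with
    # value None marking a tombstone. To return `limit` live rows, the
    # read must step over every tombstone in between -- cells scanned,
    # not rows returned, drive the cost.
    live, scanned = [], 0
    for key, value in cells:
        scanned += 1
        if value is not None:
            live.append((key, value))
            if len(live) == limit:
                break
    return live, scanned

# A partition with 1,000 tombstoned rows followed by 10 live rows:
cells = [(i, None) for i in range(1000)] + [(1000 + i, "v") for i in range(10)]
live, scanned = read_partition(cells, limit=10)
assert len(live) == 10
assert scanned == 1010  # 101x more work than the rows returned
```

This scanned-to-returned ratio is essentially what Cassandra's tombstone warning and failure thresholds guard against: a query that touches far more dead cells than live ones is aborted before it can time out the whole node.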
Tombstones illustrate a fundamental challenge in distributed systems: you can't just "delete" data because replicas that are offline might bring it back. The solution (soft deletes with TTLs) is a trade-off: you prevent zombie data but temporarily use more space and slow down reads. This applies beyond Cassandra -- any eventually consistent system with background repair faces the same problem.