Tombstones

In a distributed system, deleting data is surprisingly dangerous. If you delete a record on Node A while Node B is down, Node B still has the old data. When Node B comes back online and syncs with Node A, the deleted data could reappear -- Node B thinks it has data that A is missing and helpfully "repairs" it.

Cassandra prevents this with tombstones -- markers that say "this data was intentionally deleted."

Think first
In a distributed database with eventual consistency, why can't you simply delete a record by removing it from disk? What could go wrong when an offline replica comes back?

How tombstones work

When you execute a DELETE in Cassandra, the data isn't actually removed. Instead, Cassandra writes a tombstone -- a special marker with:

  • The key of the deleted data
  • A timestamp (when the delete happened)
  • A TTL (time-to-live) -- how long the tombstone persists before it can be purged, controlled by gc_grace_seconds (default: 10 days)

The tombstone is treated like any other write: it goes to the commit log, the MemTable, and eventually an SSTable. During reads, tombstones override the data they mark -- the system knows this data was deleted, not missing.
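The read-side behavior above can be sketched in a few lines. This is a hypothetical in-memory model, not Cassandra's actual storage engine: a delete is written as just another timestamped cell, and reads resolve conflicts by timestamp, so a tombstone shadows any older write.

```python
# Toy model of tombstone semantics (illustrative only, not Cassandra's
# real implementation). Each cell is (timestamp, value, is_tombstone);
# the newest timestamp wins.

class ToyStore:
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value, is_tombstone)

    def write(self, key, value, ts):
        existing = self.cells.get(key)
        if existing is None or ts >= existing[0]:
            self.cells[key] = (ts, value, False)

    def delete(self, key, ts):
        # A delete is just another write: a tombstone with a timestamp.
        existing = self.cells.get(key)
        if existing is None or ts >= existing[0]:
            self.cells[key] = (ts, None, True)

    def read(self, key):
        entry = self.cells.get(key)
        if entry is None or entry[2]:  # missing, or shadowed by a tombstone
            return None
        return entry[1]

store = ToyStore()
store.write("user:1", "alice", ts=100)
store.delete("user:1", ts=200)
print(store.read("user:1"))  # None: the tombstone overrides the data

# A stale replica replaying its old copy cannot resurrect the row,
# because the tombstone's newer timestamp wins:
store.write("user:1", "alice", ts=100)
print(store.read("user:1"))  # still None
```

Note the last two lines: this is exactly the "zombie data" scenario from the introduction. Without the tombstone, the stale write would look like data the replica was missing.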

Why the 10-day default?

The TTL gives downed nodes time to recover and receive the tombstone via hinted handoff or repair. If a node is down for less than 10 days, it learns about the deletion when it comes back. If it's down for longer, the tombstone may already have been purged, so letting the node rejoin could resurrect deleted data -- it should instead be treated as permanently failed and replaced.

Tombstone removal

Tombstones are removed during compaction. When compaction encounters a tombstone whose TTL has expired, it permanently discards both the tombstone and the data it shadows. (Compaction must be able to see every copy of the shadowed data before dropping the tombstone; otherwise an older copy in another SSTable could resurface.)
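The purge rule can be sketched as a merge over SSTables. This is a hedged, simplified model (each "SSTable" is a dict, timestamps are in seconds, and the grace period mirrors the 10-day gc_grace_seconds default of 864,000):

```python
# Illustrative sketch of tombstone purging during compaction, not
# Cassandra's actual implementation. Each SSTable maps
# key -> (timestamp, value, is_tombstone).

GC_GRACE = 10 * 24 * 3600  # 864,000 seconds, the 10-day default

def compact(sstables, now):
    # Merge: for each key, keep the cell with the newest timestamp.
    merged = {}
    for table in sstables:
        for key, (ts, value, is_tombstone) in table.items():
            if key not in merged or ts >= merged[key][0]:
                merged[key] = (ts, value, is_tombstone)

    # Purge: drop tombstones older than the grace period. The deletion
    # has had time to reach all replicas, so both the marker and the
    # data it shadowed disappear for good.
    return {
        key: cell
        for key, cell in merged.items()
        if not (cell[2] and now - cell[0] > GC_GRACE)
    }

old = {"a": (1_000, "x", False), "b": (1_000, "y", False)}
new = {"a": (2_000, None, True)}  # "a" was deleted at t=2,000

# Shortly after the delete, the tombstone survives compaction:
print(compact([old, new], now=3_000))
# Well past the grace period, tombstone and shadowed data are both gone:
print(compact([old, new], now=2_000 + GC_GRACE + 1))
```

Note that the tombstone for "a" shadows the older value "x" during the merge, so once the tombstone itself is dropped, nothing of "a" remains.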

Think first
Tombstones have a default TTL of 10 days. What would happen if you set the TTL to just 1 hour to save disk space?

Common problems with tombstones

Tombstones are a necessary mechanism, but they create real operational challenges:

  • Increased storage -- a delete adds data (the tombstone) instead of removing it. Impact: disk usage temporarily increases after deletes.
  • Read performance degradation -- reads must scan through tombstones to find live data. Impact: tables with many tombstones become slow to read; can cause timeouts.
  • Tombstone accumulation -- frequent deletes without compaction build up tombstones. Impact: memory and disk pressure; degraded read latency.
The tombstone read problem

If your application frequently deletes and re-inserts data in the same partition, tombstones accumulate there. Every read of that partition must scan past all of those tombstones to find the live data, which can push reads past their timeout -- one of the most common Cassandra performance problems in production. The fix: model your data to minimize deletes (for example, write to a new partition per time window instead of rewriting one), or use the Time-Window Compaction Strategy (TWCS) with TTL'd data so that entire SSTables age out at once.
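The cost of this pattern is easy to see in a toy simulation. The numbers and names below are made up; the point is that each delete/re-insert cycle leaves one tombstone behind until compaction, so a read of the partition scans work proportional to the churn, not to the live data:

```python
# Illustrative sketch of the tombstone read problem (not real
# Cassandra internals): churn leaves one tombstone per cycle, and a
# read must scan past all of them to find the single live value.

def churn_partition(cycles):
    """Delete and re-insert the same row `cycles` times."""
    cells = []
    for i in range(cycles):
        cells.append(("row", i * 2, None, True))          # tombstone from the delete
        cells.append(("row", i * 2 + 1, f"v{i}", False))  # the re-insert
    return cells

def read(cells):
    """Return the newest live value, counting tombstones scanned."""
    scanned = 0
    newest = None
    for _key, ts, value, is_tombstone in cells:
        if is_tombstone:
            scanned += 1
        if newest is None or ts > newest[0]:
            newest = (ts, value, is_tombstone)
    value = None if newest[2] else newest[1]
    return value, scanned

cells = churn_partition(5_000)
value, scanned = read(cells)
print(value, scanned)  # one live value, but 5,000 tombstones scanned
```

One live row, five thousand markers to wade through -- which is why this pattern shows up as rising read latency on a partition that "only has one row."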

Interview angle

Tombstones illustrate a fundamental challenge in distributed systems: you can't just "delete" data because replicas that are offline might bring it back. The solution (soft deletes with TTLs) is a trade-off: you prevent zombie data but temporarily use more space and slow down reads. This applies beyond Cassandra -- any eventually consistent system with background repair faces the same problem.

Quiz
Your application deletes and re-inserts rows in the same partition thousands of times per day. Users report increasing read timeouts on that partition. What is the most likely cause and the best fix?