Tombstones

In a distributed system, deleting data is surprisingly dangerous. If you delete a record on Node A while Node B is down, Node B still has the old data. When Node B comes back online and syncs with Node A, the deleted data could reappear -- Node B thinks it has data that A is missing and helpfully "repairs" it.

Cassandra prevents this with tombstones -- markers that say "this data was intentionally deleted."

Think first
In a distributed database with eventual consistency, why can't you simply delete a record by removing it from disk? What could go wrong when an offline replica comes back?

How tombstones work

When you execute a DELETE in Cassandra, the data isn't actually removed. Instead, Cassandra writes a tombstone -- a special marker with:

  • The key of the deleted data
  • A timestamp (when the delete happened)
  • A TTL (time-to-live) -- how long the tombstone persists before it can be purged, controlled by gc_grace_seconds (default: 10 days)

The tombstone is treated like any other write: it goes to the commit log, the MemTable, and eventually an SSTable. During reads, tombstones override the data they mark -- the system knows this data was deleted, not missing.
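The read-side behavior above can be sketched in a few lines. This is a hypothetical in-memory model, not Cassandra's actual storage engine: a delete is written as just another timestamped cell, and reads resolve conflicts by timestamp, so a tombstone shadows any older write.

```python
# Toy model of tombstone semantics (illustrative only, not Cassandra's
# real implementation). Each cell is (timestamp, value, is_tombstone);
# the newest timestamp wins.

class ToyStore:
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value, is_tombstone)

    def write(self, key, value, ts):
        existing = self.cells.get(key)
        if existing is None or ts >= existing[0]:
            self.cells[key] = (ts, value, False)

    def delete(self, key, ts):
        # A delete is just another write: a tombstone with a timestamp.
        existing = self.cells.get(key)
        if existing is None or ts >= existing[0]:
            self.cells[key] = (ts, None, True)

    def read(self, key):
        entry = self.cells.get(key)
        if entry is None or entry[2]:  # missing, or shadowed by a tombstone
            return None
        return entry[1]

store = ToyStore()
store.write("user:1", "alice", ts=100)
store.delete("user:1", ts=200)
print(store.read("user:1"))  # None: the tombstone overrides the data

# A stale replica replaying its old copy cannot resurrect the row,
# because the tombstone's newer timestamp wins:
store.write("user:1", "alice", ts=100)
print(store.read("user:1"))  # still None
```

Note the last two lines: this is exactly the "zombie data" scenario from the introduction. Without the tombstone, the stale write would look like data the replica was missing.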

Why the 10-day default?

The TTL gives downed nodes time to recover and receive the tombstone via hinted handoff or repair. If a node is down for less than 10 days, it learns about the deletion when it comes back. If it's down for longer, the tombstone may already have been purged, so letting the node rejoin could resurrect deleted data -- it should instead be treated as permanently failed and replaced.

Tombstone removal

Tombstones are removed during compaction. When compaction encounters a tombstone whose TTL has expired, it permanently discards both the tombstone and the data it shadows. (Compaction must be able to see every copy of the shadowed data before dropping the tombstone; otherwise an older copy in another SSTable could resurface.)
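The purge rule can be sketched as a merge over SSTables. This is a hedged, simplified model (each "SSTable" is a dict, timestamps are in seconds, and the grace period mirrors the 10-day gc_grace_seconds default of 864,000):

```python
# Illustrative sketch of tombstone purging during compaction, not
# Cassandra's actual implementation. Each SSTable maps
# key -> (timestamp, value, is_tombstone).

GC_GRACE = 10 * 24 * 3600  # 864,000 seconds, the 10-day default

def compact(sstables, now):
    # Merge: for each key, keep the cell with the newest timestamp.
    merged = {}
    for table in sstables:
        for key, (ts, value, is_tombstone) in table.items():
            if key not in merged or ts >= merged[key][0]:
                merged[key] = (ts, value, is_tombstone)

    # Purge: drop tombstones older than the grace period. The deletion
    # has had time to reach all replicas, so both the marker and the
    # data it shadowed disappear for good.
    return {
        key: cell
        for key, cell in merged.items()
        if not (cell[2] and now - cell[0] > GC_GRACE)
    }

old = {"a": (1_000, "x", False), "b": (1_000, "y", False)}
new = {"a": (2_000, None, True)}  # "a" was deleted at t=2,000

# Shortly after the delete, the tombstone survives compaction:
print(compact([old, new], now=3_000))
# Well past the grace period, tombstone and shadowed data are both gone:
print(compact([old, new], now=2_000 + GC_GRACE + 1))
```

Note that the tombstone for "a" shadows the older value "x" during the merge, so once the tombstone itself is dropped, nothing of "a" remains.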

Think first
Tombstones have a default TTL of 10 days. What would happen if you set the TTL to just 1 hour to save disk space?

Common problems with tombstones

Tombstones are a necessary mechanism, but they create real operational challenges:

  • Increased storage -- a delete adds data (the tombstone) instead of removing it. Impact: disk usage temporarily increases after deletes.
  • Read performance degradation -- reads must scan through tombstones to find live data. Impact: tables with many tombstones become slow to read; can cause timeouts.
  • Tombstone accumulation -- frequent deletes without compaction build up tombstones. Impact: memory and disk pressure; degraded read latency.
The tombstone read problem

If your application frequently deletes and re-inserts data in the same partition, tombstones accumulate there. Every read of that partition must scan past all of those tombstones to find the live data, which can push reads past their timeout -- one of the most common Cassandra performance problems in production. The fix: model your data to minimize deletes (for example, write to a new partition per time window instead of rewriting one), or use the Time-Window Compaction Strategy (TWCS) with TTL'd data so that entire SSTables age out at once.
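The cost of this pattern is easy to see in a toy simulation. The numbers and names below are made up; the point is that each delete/re-insert cycle leaves one tombstone behind until compaction, so a read of the partition scans work proportional to the churn, not to the live data:

```python
# Illustrative sketch of the tombstone read problem (not real
# Cassandra internals): churn leaves one tombstone per cycle, and a
# read must scan past all of them to find the single live value.

def churn_partition(cycles):
    """Delete and re-insert the same row `cycles` times."""
    cells = []
    for i in range(cycles):
        cells.append(("row", i * 2, None, True))          # tombstone from the delete
        cells.append(("row", i * 2 + 1, f"v{i}", False))  # the re-insert
    return cells

def read(cells):
    """Return the newest live value, counting tombstones scanned."""
    scanned = 0
    newest = None
    for _key, ts, value, is_tombstone in cells:
        if is_tombstone:
            scanned += 1
        if newest is None or ts > newest[0]:
            newest = (ts, value, is_tombstone)
    value = None if newest[2] else newest[1]
    return value, scanned

cells = churn_partition(5_000)
value, scanned = read(cells)
print(value, scanned)  # one live value, but 5,000 tombstones scanned
```

One live row, five thousand markers to wade through -- which is why this pattern shows up as rising read latency on a partition that "only has one row."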

Interview angle

Tombstones illustrate a fundamental challenge in distributed systems: you can't just "delete" data because replicas that are offline might bring it back. The solution (soft deletes with TTLs) is a trade-off: you prevent zombie data but temporarily use more space and slow down reads. This applies beyond Cassandra -- any eventually consistent system with background repair faces the same problem.

Quiz
Your application deletes and re-inserts rows in the same partition thousands of times per day. Users report increasing read timeouts on that partition. What is the most likely cause and the best fix?