Skip to main content

SSTable

Every Tablet in BigTable ultimately lives on disk as a set of SSTables (Sorted String Tables). Understanding the SSTable format is essential because it determines how BigTable achieves fast reads from disk -- often with a single disk seek.

Think first
You need a file format for storing sorted key-value data on disk that supports fast lookups (ideally one disk seek) and efficient range scans. The file should never be modified after creation. How would you design it?

How Tablets are stored in GFS

BigTable persists data in GFS using the SSTable file format:

  • An SSTable is a persistent, ordered, immutable map from keys to values (both arbitrary byte strings).
  • Each Tablet is stored as a sequence of SSTables in GFS.
  • An SSTable consists of a sequence of data blocks (typically 64 KB each).

Reading from an SSTable

A block index at the end of the SSTable maps keys to block offsets. When the SSTable is opened, this index loads into memory.

StepOperation
1Binary search the in-memory index to find the target block
2Single disk seek to read that block

Two loading strategies exist:

StrategyTrade-off
Load entire SSTable into memoryNo disk seeks for lookups, but higher memory usage
Load only the indexOne disk seek per lookup, lower memory footprint

SSTables expose two operations:

  • Get: retrieve the value for a given key
  • Iterate: scan values over a key range

Immutability

Once written to GFS, an SSTable is read-only. New data produces a new SSTable; old SSTables are garbage-collected when no longer needed. This immutability provides:

  • No synchronization needed for concurrent reads
  • Simpler Tablet splitting -- SSTables can be shared across Tablets
  • Clean garbage collection -- the GC handles permanent removal of deleted or stale data
Interview angle

SSTable immutability is the foundation of the Log-Structured Merge Tree (LSM) pattern used by BigTable, HBase, Cassandra, RocksDB, and LevelDB. In an interview, explain the key insight: writes are always sequential (fast), while reads may need to merge across multiple SSTables (mitigated by Bloom Filters and compaction).

Table vs. Tablet vs. SSTable

ConceptDescription
TableThe logical container; composed of multiple Tablets
TabletA contiguous row range; the unit of distribution
SSTableAn immutable on-disk file; the unit of storage

Key relationships:

  • Multiple Tablets make up a Table.
  • SSTables can be shared by multiple Tablets.
  • Tablets do not overlap; SSTables can overlap.

MemTable and the write path

To avoid writing a new SSTable for every mutation, BigTable buffers recent writes in a MemTable -- an in-memory, mutable, sorted buffer:

  1. New writes go into the MemTable.
  2. When the MemTable reaches a size threshold, it is frozen and flushed to GFS as a new SSTable.
  3. A fresh MemTable is created for subsequent writes.

Every mutation is also written to a commit log on GFS -- a Write-Ahead Log that contains redo records for recovery if a Tablet server crashes before flushing the MemTable.

For reads, BigTable merges data from both the MemTable and SSTables. Since both are sorted, the merge is efficient.

warning

The MemTable is the only mutable data structure in the storage path. If a Tablet server crashes before flushing, the MemTable contents are lost -- the commit log is the sole recovery mechanism. This is why the Write-Ahead Log pattern is non-negotiable in LSM-based systems.

Quiz
What would happen if BigTable made SSTables mutable -- allowing in-place updates instead of always creating new SSTables?