SSTable
Every Tablet in BigTable ultimately lives on disk as a set of SSTables (Sorted String Tables). Understanding the SSTable format is essential because it determines how BigTable achieves fast reads from disk -- often with a single disk seek.
How Tablets are stored in GFS
BigTable persists data in GFS using the SSTable file format:
- An SSTable is a persistent, ordered, immutable map from keys to values (both arbitrary byte strings).
- Each Tablet is stored as a sequence of SSTables in GFS.
- An SSTable consists of a sequence of data blocks (typically 64 KB each).
Reading from an SSTable
A block index at the end of the SSTable maps keys to block offsets. When the SSTable is opened, this index loads into memory.
| Step | Operation |
|---|---|
| 1 | Binary search the in-memory index to find the target block |
| 2 | Single disk seek to read that block |
Two loading strategies exist:
| Strategy | Trade-off |
|---|---|
| Load entire SSTable into memory | No disk seeks for lookups, but higher memory usage |
| Load only the index | One disk seek per lookup, lower memory footprint |
SSTables expose two operations:
- Get: retrieve the value for a given key
- Iterate: scan values over a key range
Immutability
Once written to GFS, an SSTable is read-only. New data produces a new SSTable; old SSTables are garbage-collected when no longer needed. This immutability provides:
- No synchronization needed for concurrent reads
- Simpler Tablet splitting -- SSTables can be shared across Tablets
- Clean garbage collection -- the GC handles permanent removal of deleted or stale data
SSTable immutability is the foundation of the Log-Structured Merge Tree (LSM) pattern used by BigTable, HBase, Cassandra, RocksDB, and LevelDB. In an interview, explain the key insight: writes are always sequential (fast), while reads may need to merge across multiple SSTables (mitigated by Bloom Filters and compaction).
Table vs. Tablet vs. SSTable
| Concept | Description |
|---|---|
| Table | The logical container; composed of multiple Tablets |
| Tablet | A contiguous row range; the unit of distribution |
| SSTable | An immutable on-disk file; the unit of storage |
Key relationships:
- Multiple Tablets make up a Table.
- SSTables can be shared by multiple Tablets.
- Tablets do not overlap; SSTables can overlap.
MemTable and the write path
To avoid writing a new SSTable for every mutation, BigTable buffers recent writes in a MemTable -- an in-memory, mutable, sorted buffer:
- New writes go into the MemTable.
- When the MemTable reaches a size threshold, it is frozen and flushed to GFS as a new SSTable.
- A fresh MemTable is created for subsequent writes.
Every mutation is also written to a commit log on GFS -- a Write-Ahead Log that contains redo records for recovery if a Tablet server crashes before flushing the MemTable.
For reads, BigTable merges data from both the MemTable and SSTables. Since both are sorted, the merge is efficient.
The MemTable is the only mutable data structure in the storage path. If a Tablet server crashes before flushing, the MemTable contents are lost -- the commit log is the sole recovery mechanism. This is why the Write-Ahead Log pattern is non-negotiable in LSM-based systems.