BigTable Refinements
The core BigTable design -- MemTable, SSTables, commit log -- works, but raw performance requires additional engineering. Google implemented several refinements to squeeze out better throughput, lower latency, and faster recovery. Each refinement targets a specific bottleneck.
Locality groups
Clients can group multiple column families into a locality group. BigTable generates a separate SSTable for each locality group.
| Benefit | Detail |
|---|---|
| Read performance | Columns accessed together are stored together on disk |
| In-memory access | Clients can declare a locality group as in-memory for frequently accessed, small column families |
| Scan efficiency | Scanning one locality group reads `bytes_in_locality_group`, not `bytes_in_table` |
Locality groups are BigTable's answer to the "wide row" problem. If a row has 100 column families but a query needs only 2, reading the entire row wastes I/O. Locality groups ensure that only the relevant SSTables are read. The same concept appears as column families in HBase, where each family is stored in its own set of files -- a near-universal optimization in wide-column stores.
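The I/O savings can be sketched in a few lines. This is an illustrative model, not BigTable's API: the mapping of column families to locality groups and the file names are invented for the example.

```python
# Sketch: how locality groups limit the SSTables touched by a read.
# All names here are illustrative, not the real BigTable API.

# Each column family is assigned (by the client) to a locality group.
LOCALITY_GROUPS = {
    "anchor": "links",       # families often scanned together
    "language": "metadata",  # small family; in-memory in real BigTable
    "contents": "content",   # large page bodies
}

def sstables_for_read(column_families, sstables_by_group):
    """Return only the SSTables whose locality group covers the
    requested column families -- the other groups' files are never opened."""
    needed = {LOCALITY_GROUPS[cf] for cf in column_families}
    return [t for g, tables in sstables_by_group.items()
            if g in needed for t in tables]

sstables_by_group = {
    "links": ["links-0001.sst"],
    "metadata": ["metadata-0001.sst", "metadata-0002.sst"],
    "content": ["content-0001.sst"],
}

# A read of just "language" touches the metadata SSTables only.
print(sstables_for_read(["language"], sstables_by_group))
```

A read of two families in the same group still opens only that group's files; the cost scales with the groups touched, not the table width.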
Compression
Clients can enable compression on a per-locality-group basis:
- Compression technique is client-configurable based on application requirements
- Compression is applied to each SSTable block independently (enables random access without decompressing the entire file)
- Compression ratios improve significantly when multiple versions of the same data are stored (common in BigTable's versioned cells)
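The second bullet is the important one: because each block is compressed independently, a reader can inflate a single block given its offset. A minimal sketch, using `zlib` and an invented block-index layout (real SSTables use their own formats and codecs):

```python
import zlib

# Sketch of per-block compression in an SSTable-like file. Blocks are
# compressed independently, and an index of compressed-block offsets
# lets a reader decompress one block without touching the rest.

BLOCK_SIZE = 64  # tiny for the demo; real SSTable blocks are ~64 KB

def write_blocks(data: bytes):
    blocks, index, offset = [], [], 0
    for i in range(0, len(data), BLOCK_SIZE):
        comp = zlib.compress(data[i:i + BLOCK_SIZE])
        blocks.append(comp)
        index.append(offset)       # where this compressed block starts
        offset += len(comp)
    return b"".join(blocks), index

def read_block(packed: bytes, index, n: int) -> bytes:
    start = index[n]
    end = index[n + 1] if n + 1 < len(index) else len(packed)
    return zlib.decompress(packed[start:end])  # only one block inflated

data = b"versioned cell contents " * 20  # repetitive data compresses well
packed, index = write_blocks(data)
assert read_block(packed, index, 2) == data[2 * BLOCK_SIZE:3 * BLOCK_SIZE]
```

The trade-off is a slightly worse compression ratio than whole-file compression, paid for random access without inflating the entire SSTable.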
Caching
Tablet servers use two levels of caching to reduce disk reads:
| Cache | What it stores | Best for |
|---|---|---|
| Scan Cache | (key, value) pairs returned by SSTable lookups | Applications that re-read the same data |
| Block Cache | Raw SSTable blocks read from GFS | Sequential reads or reads of nearby keys within the same locality group |
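The two levels compose on the read path: a Scan Cache hit skips everything, and a Block Cache hit skips the GFS read even for a key never seen before. A toy model (class and field names are invented; real BigTable uses LRU eviction for both caches, omitted here):

```python
# Sketch of the two cache levels on the read path. The Scan Cache
# serves exact re-reads; the Block Cache serves *different* keys that
# happen to land in the same SSTable block.

DISK = {  # block_id -> decoded (row, col) pairs; stands in for GFS
    "blk0": {("r1", "c1"): "a", ("r1", "c2"): "b"},
}

class TabletServerCaches:
    def __init__(self):
        self.scan_cache = {}    # (row, col) -> value
        self.block_cache = {}   # block_id -> decoded block
        self.disk_reads = 0

    def read(self, row, col, block_id):
        if (row, col) in self.scan_cache:        # level 1: Scan Cache hit
            return self.scan_cache[(row, col)]
        if block_id not in self.block_cache:     # level 2: Block Cache miss
            self.disk_reads += 1                 # only now touch "disk"
            self.block_cache[block_id] = DISK[block_id]
        value = self.block_cache[block_id][(row, col)]
        self.scan_cache[(row, col)] = value
        return value

c = TabletServerCaches()
c.read("r1", "c1", "blk0")   # cold: one disk read
c.read("r1", "c1", "blk0")   # Scan Cache hit
c.read("r1", "c2", "blk0")   # Block Cache hit: same block, new key
assert c.disk_reads == 1
```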
Bloom filters
Without optimization, every read must check every SSTable that makes up the Tablet. Bloom filters solve this by predicting whether an SSTable might contain a given (row, column) pair.
- Created per SSTable (specifically per locality group)
- Small memory footprint with dramatic read performance improvement
- Eliminate disk accesses for SSTables that definitely don't contain the target key
Bloom filters produce false positives (saying a key might exist when it doesn't) but never false negatives. This means a Bloom filter can save you from reading an SSTable, but it can't guarantee you'll find the key in an SSTable it flags. The false-positive rate is tunable via the filter's size.
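Both properties fall out of the mechanism: inserting a key sets k hashed bit positions, and a lookup that finds any of its k positions clear proves the key was never inserted. A minimal sketch (not the production implementation; sizing and hashing choices here are arbitrary):

```python
import hashlib

# Minimal Bloom filter: k hash functions set k bits per key. A lookup
# that finds any bit clear proves the key was never added (no false
# negatives); all-bits-set is only "maybe" (false positives possible).

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = 0  # bitmap packed into one int for simplicity

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row7:anchor")
assert bf.might_contain("row7:anchor")  # always true once added
# An absent key is *probably* reported absent -> skip that SSTable.
```

Growing `m_bits` relative to the number of keys lowers the false-positive rate, which is exactly the size-vs-accuracy knob mentioned above.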
Unified commit log
Instead of maintaining a separate commit log per Tablet, BigTable keeps one log file per Tablet server.
| Aspect | Detail |
|---|---|
| Write performance | One log file means sequential writes; per-Tablet logs would cause disk seeks across many files |
| Recovery cost | When a Tablet server dies, all Tablets' mutations are interleaved in one log. Recovering a single Tablet requires reading the entire log. |
Recovery optimization
To avoid reading the full log N times when N Tablets are reassigned to N different servers, BigTable sorts the commit log by (table, row name, log sequence number). After sorting, all mutations for a single Tablet are contiguous and can be read in one sequential pass.
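The effect of the sort can be shown with a toy log. Entry fields and mutation strings are invented for the example; the key order matches the one above:

```python
from itertools import groupby

# Sketch of the recovery sort: entries in the shared commit log are
# interleaved across Tablets; sorting by (table, row, sequence number)
# makes each Tablet's mutations one contiguous, ordered run.

log = [  # as written: interleaved across Tablets
    ("users", "u2", 3, "set age=31"),
    ("pages", "p1", 1, "set body=..."),
    ("users", "u1", 2, "set name=bo"),
    ("pages", "p1", 4, "set rank=7"),
]

log.sort(key=lambda e: (e[0], e[1], e[2]))

# One sequential pass now yields each table's mutations in order.
runs = {t: [e[3] for e in g]
        for t, g in groupby(log, key=lambda e: e[0])}
assert runs["pages"] == ["set body=...", "set rank=7"]
```

Grouping here is by table for brevity; in real recovery each run corresponds to a Tablet (a row range within a table), but the contiguity argument is identical.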
Dual log-writing threads
Each Tablet server maintains two log-writing threads, each writing to its own log file. Only one is active at a time. If the active thread performs poorly (e.g., due to network congestion), writing switches to the other thread. Log entries carry sequence numbers so the recovery process can reconcile both logs.
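Because a switch can leave the tail of one log duplicated in the other, recovery merges both files and de-duplicates by sequence number. A sketch under that assumption (log contents are invented):

```python
# Sketch of reconciling the two log files after a thread switch.
# Sequence numbers are globally ordered, and a switch may leave an
# entry duplicated across files, so recovery merges and de-duplicates.

log_a = [(1, "put x=1"), (2, "put y=2"), (3, "put z=3")]
log_b = [(3, "put z=3"), (4, "put x=9")]   # thread B took over at seq 3

def reconcile(*logs):
    seen, merged = set(), []
    for seq, mutation in sorted(e for log in logs for e in log):
        if seq not in seen:                # drop duplicated entries
            seen.add(seq)
            merged.append((seq, mutation))
    return merged

assert [s for s, _ in reconcile(log_a, log_b)] == [1, 2, 3, 4]
```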
Speeding up Tablet recovery
When the master moves a Tablet between servers, the source server performs compactions to minimize recovery work on the destination:
| Step | Action |
|---|---|
| 1 | Source performs a minor compaction (flush MemTable to SSTable), reducing uncommitted data in the log |
| 2 | Source stops serving the Tablet |
| 3 | Source performs a second minor compaction to flush any mutations that arrived while the first compaction was running |
| 4 | Destination loads the Tablet with no log replay needed |
This two-compaction handoff is a great example of minimizing recovery time through proactive work. The same principle appears in database failover (WAL shipping), VM live migration (iterative memory copy), and leader handoff protocols. When discussing availability in interviews, mention that planned transfers can be made nearly instantaneous by doing the heavy lifting before the handoff.
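The handoff sequence can be sketched as a tiny state machine. Class and method names are illustrative, and the mid-transfer write is simulated directly; the point is that the second flush leaves nothing for the destination to replay:

```python
# Sketch of the two-compaction Tablet handoff (names are illustrative).

class SourceTabletServer:
    def __init__(self):
        self.memtable = ["m1", "m2"]    # unflushed mutations
        self.serving = True

    def minor_compaction(self):
        flushed, self.memtable = self.memtable, []
        return flushed                   # becomes a new SSTable

    def transfer(self):
        self.minor_compaction()          # step 1: flush while still serving
        self.memtable.append("m3")       # a write arrives meanwhile
        self.serving = False             # step 2: stop serving
        self.minor_compaction()          # step 3: flush the stragglers
        return self.memtable             # step 4: nothing left to replay

src = SourceTabletServer()
assert src.transfer() == []              # destination needs no log replay
```

The first compaction does the bulk of the flushing while writes are still accepted; the unavailability window covers only the (small) second flush.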