BigTable Refinements
The core BigTable design -- MemTable, SSTables, commit log -- works, but raw performance requires additional engineering. Google implemented several refinements to squeeze out better throughput, lower latency, and faster recovery. Each refinement targets a specific bottleneck.
Locality groups
Clients can group multiple column families into a locality group. BigTable generates a separate SSTable for each locality group.
| Benefit | Detail |
|---|---|
| Read performance | Columns accessed together are stored together on disk |
| In-memory access | Clients can declare a locality group as in-memory for frequently accessed, small column families |
| Scan efficiency | Scanning one locality group reads `bytes_in_locality_group`, not `bytes_in_table` |
Locality groups are BigTable's answer to the "wide row" problem. If a row has 100 column families but a query needs only 2, reading the entire row wastes I/O. Locality groups ensure that only the relevant SSTables are read. The same concept appears as column families in HBase, where each family is stored in its own set of files -- a near-universal optimization in wide-column stores.
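The I/O savings can be sketched in a few lines. This is an illustrative model, not BigTable's API: the mapping of column families to locality groups and the file names are invented for the example.

```python
# Sketch: how locality groups limit the SSTables touched by a read.
# All names here are illustrative, not the real BigTable API.

# Each column family is assigned (by the client) to a locality group.
LOCALITY_GROUPS = {
    "anchor": "links",       # families often scanned together
    "language": "metadata",  # small family; in-memory in real BigTable
    "contents": "content",   # large page bodies
}

def sstables_for_read(column_families, sstables_by_group):
    """Return only the SSTables whose locality group covers the
    requested column families -- the other groups' files are never opened."""
    needed = {LOCALITY_GROUPS[cf] for cf in column_families}
    return [t for g, tables in sstables_by_group.items()
            if g in needed for t in tables]

sstables_by_group = {
    "links": ["links-0001.sst"],
    "metadata": ["metadata-0001.sst", "metadata-0002.sst"],
    "content": ["content-0001.sst"],
}

# A read of just "language" touches the metadata SSTables only.
print(sstables_for_read(["language"], sstables_by_group))
```

A read of two families in the same group still opens only that group's files; the cost scales with the groups touched, not the table width.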
Compression
Clients can enable compression on a per-locality-group basis:
- Compression technique is client-configurable based on application requirements
- Compression is applied to each SSTable block independently (enables random access without decompressing the entire file)
- Compression ratios improve significantly when multiple versions of the same data are stored (common in BigTable's versioned cells)
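The second bullet is the important one: because each block is compressed independently, a reader can inflate a single block given its offset. A minimal sketch, using `zlib` and an invented block-index layout (real SSTables use their own formats and codecs):

```python
import zlib

# Sketch of per-block compression in an SSTable-like file. Blocks are
# compressed independently, and an index of compressed-block offsets
# lets a reader decompress one block without touching the rest.

BLOCK_SIZE = 64  # tiny for the demo; real SSTable blocks are ~64 KB

def write_blocks(data: bytes):
    blocks, index, offset = [], [], 0
    for i in range(0, len(data), BLOCK_SIZE):
        comp = zlib.compress(data[i:i + BLOCK_SIZE])
        blocks.append(comp)
        index.append(offset)       # where this compressed block starts
        offset += len(comp)
    return b"".join(blocks), index

def read_block(packed: bytes, index, n: int) -> bytes:
    start = index[n]
    end = index[n + 1] if n + 1 < len(index) else len(packed)
    return zlib.decompress(packed[start:end])  # only one block inflated

data = b"versioned cell contents " * 20  # repetitive data compresses well
packed, index = write_blocks(data)
assert read_block(packed, index, 2) == data[2 * BLOCK_SIZE:3 * BLOCK_SIZE]
```

The trade-off is a slightly worse compression ratio than whole-file compression, paid for random access without inflating the entire SSTable.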
Caching
Tablet servers use two levels of caching to reduce disk reads:
| Cache | What it stores | Best for |
|---|---|---|
| Scan Cache | (key, value) pairs returned by SSTable lookups | Applications that re-read the same data |
| Block Cache | Raw SSTable blocks read from GFS | Sequential reads or reads of nearby keys within the same locality group |
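The two levels compose on the read path: a Scan Cache hit skips everything, and a Block Cache hit skips the GFS read even for a key never seen before. A toy model (class and field names are invented; real BigTable uses LRU eviction for both caches, omitted here):

```python
# Sketch of the two cache levels on the read path. The Scan Cache
# serves exact re-reads; the Block Cache serves *different* keys that
# happen to land in the same SSTable block.

DISK = {  # block_id -> decoded (row, col) pairs; stands in for GFS
    "blk0": {("r1", "c1"): "a", ("r1", "c2"): "b"},
}

class TabletServerCaches:
    def __init__(self):
        self.scan_cache = {}    # (row, col) -> value
        self.block_cache = {}   # block_id -> decoded block
        self.disk_reads = 0

    def read(self, row, col, block_id):
        if (row, col) in self.scan_cache:        # level 1: Scan Cache hit
            return self.scan_cache[(row, col)]
        if block_id not in self.block_cache:     # level 2: Block Cache miss
            self.disk_reads += 1                 # only now touch "disk"
            self.block_cache[block_id] = DISK[block_id]
        value = self.block_cache[block_id][(row, col)]
        self.scan_cache[(row, col)] = value
        return value

c = TabletServerCaches()
c.read("r1", "c1", "blk0")   # cold: one disk read
c.read("r1", "c1", "blk0")   # Scan Cache hit
c.read("r1", "c2", "blk0")   # Block Cache hit: same block, new key
assert c.disk_reads == 1
```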
Bloom filters
Without optimization, every read must check every SSTable that makes up the Tablet. Bloom filters solve this by predicting whether an SSTable might contain a given (row, column) pair.
- Created per SSTable (specifically per locality group)
- Small memory footprint with dramatic read performance improvement
- Eliminate disk accesses for SSTables that definitely don't contain the target key
Bloom filters produce false positives (saying a key might exist when it doesn't) but never false negatives. This means a Bloom filter can save you from reading an SSTable, but it can't guarantee you'll find the key in an SSTable it flags. The false-positive rate is tunable via the filter's size.
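Both properties fall out of the mechanism: inserting a key sets k hashed bit positions, and a lookup that finds any of its k positions clear proves the key was never inserted. A minimal sketch (not the production implementation; sizing and hashing choices here are arbitrary):

```python
import hashlib

# Minimal Bloom filter: k hash functions set k bits per key. A lookup
# that finds any bit clear proves the key was never added (no false
# negatives); all-bits-set is only "maybe" (false positives possible).

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = 0  # bitmap packed into one int for simplicity

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row7:anchor")
assert bf.might_contain("row7:anchor")  # always true once added
# An absent key is *probably* reported absent -> skip that SSTable.
```

Growing `m_bits` relative to the number of keys lowers the false-positive rate, which is exactly the size-vs-accuracy knob mentioned above.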
Unified commit log
Instead of maintaining a separate commit log per Tablet, BigTable keeps one log file per Tablet server.
| Aspect | Detail |
|---|---|
| Write performance | One log file means sequential writes; per-Tablet logs would cause disk seeks across many files |
| Recovery cost | When a Tablet server dies, all Tablets' mutations are interleaved in one log. Recovering a single Tablet requires reading the entire log. |
Recovery optimization
To avoid reading the full log N times when N Tablets are reassigned to N different servers, BigTable sorts the commit log by (table, row name, log sequence number). After sorting, all mutations for a single Tablet are contiguous and can be read in one sequential pass.
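The effect of the sort can be shown with a toy log. Entry fields and mutation strings are invented for the example; the key order matches the one above:

```python
from itertools import groupby

# Sketch of the recovery sort: entries in the shared commit log are
# interleaved across Tablets; sorting by (table, row, sequence number)
# makes each Tablet's mutations one contiguous, ordered run.

log = [  # as written: interleaved across Tablets
    ("users", "u2", 3, "set age=31"),
    ("pages", "p1", 1, "set body=..."),
    ("users", "u1", 2, "set name=bo"),
    ("pages", "p1", 4, "set rank=7"),
]

log.sort(key=lambda e: (e[0], e[1], e[2]))

# One sequential pass now yields each table's mutations in order.
runs = {t: [e[3] for e in g]
        for t, g in groupby(log, key=lambda e: e[0])}
assert runs["pages"] == ["set body=...", "set rank=7"]
```

Grouping here is by table for brevity; in real recovery each run corresponds to a Tablet (a row range within a table), but the contiguity argument is identical.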
Dual log-writing threads
Each Tablet server maintains two log-writing threads, each writing to its own log file. Only one is active at a time. If the active thread performs poorly (e.g., due to network congestion), writing switches to the other thread. Log entries carry sequence numbers so the recovery process can reconcile both logs.
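Because a switch can leave the tail of one log duplicated in the other, recovery merges both files and de-duplicates by sequence number. A sketch under that assumption (log contents are invented):

```python
# Sketch of reconciling the two log files after a thread switch.
# Sequence numbers are globally ordered, and a switch may leave an
# entry duplicated across files, so recovery merges and de-duplicates.

log_a = [(1, "put x=1"), (2, "put y=2"), (3, "put z=3")]
log_b = [(3, "put z=3"), (4, "put x=9")]   # thread B took over at seq 3

def reconcile(*logs):
    seen, merged = set(), []
    for seq, mutation in sorted(e for log in logs for e in log):
        if seq not in seen:                # drop duplicated entries
            seen.add(seq)
            merged.append((seq, mutation))
    return merged

assert [s for s, _ in reconcile(log_a, log_b)] == [1, 2, 3, 4]
```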
Speeding up Tablet recovery
When the master moves a Tablet between servers, the source server performs compactions to minimize recovery work on the destination:
| Step | Action |
|---|---|
| 1 | Source performs a minor compaction (flush MemTable to SSTable), reducing uncommitted data in the log |
| 2 | Source stops serving the Tablet |
| 3 | Source performs a second minor compaction to flush any mutations that arrived while the first compaction was running |
| 4 | Destination loads the Tablet with no log replay needed |
This two-compaction handoff is a great example of minimizing recovery time through proactive work. The same principle appears in database failover (WAL shipping), VM live migration (iterative memory copy), and leader handoff protocols. When discussing availability in interviews, mention that planned transfers can be made nearly instantaneous by doing the heavy lifting before the handoff.
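The handoff sequence can be sketched as a tiny state machine. Class and method names are illustrative, and the mid-transfer write is simulated directly; the point is that the second flush leaves nothing for the destination to replay:

```python
# Sketch of the two-compaction Tablet handoff (names are illustrative).

class SourceTabletServer:
    def __init__(self):
        self.memtable = ["m1", "m2"]    # unflushed mutations
        self.serving = True

    def minor_compaction(self):
        flushed, self.memtable = self.memtable, []
        return flushed                   # becomes a new SSTable

    def transfer(self):
        self.minor_compaction()          # step 1: flush while still serving
        self.memtable.append("m3")       # a write arrives meanwhile
        self.serving = False             # step 2: stop serving
        self.minor_compaction()          # step 3: flush the stragglers
        return self.memtable             # step 4: nothing left to replay

src = SourceTabletServer()
assert src.transfer() == []              # destination needs no log replay
```

The first compaction does the bulk of the flushing while writes are still accepted; the unavailability window covers only the (small) second flush.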