BigTable Data Model
How do you organize petabytes of structured data so that any cell is accessible with a single lookup? Traditional relational databases use a two-dimensional model -- row ID plus column name. BigTable adds two more dimensions: column families and timestamps. This four-dimensional design is what makes BigTable a sparse, distributed, persistent, multidimensional, sorted map.
Two dimensions vs. four dimensions
Traditional databases identify every cell by its row ID and column name.
BigTable's four-dimensional data model uses:
| Dimension | Purpose |
|---|---|
| Row key | Uniquely identifies a row (arbitrary string, up to 64 KB) |
| Column family | Groups related columns together |
| Column qualifier | Identifies a specific column within a family |
| Timestamp | Versions each cell value (64-bit, real time or client-assigned) |
Data is indexed by row key, column key, and timestamp (the column key combines family and qualifier, which is how four dimensions collapse into three lookup keys). Accessing a cell requires all three:

(row_key: string, column_name: string, timestamp: int64) → cell contents (string)
If no timestamp is specified, BigTable returns the most recent version.
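The map signature and the read semantics above can be sketched with a minimal in-memory model. This is a hypothetical illustration, not BigTable's actual implementation: cells are stored per (row key, column key) as a timestamp-sorted version list, so empty columns cost nothing and a read without a timestamp returns the newest version.

```python
import bisect

# Hypothetical in-memory sketch of BigTable's sorted map:
# (row_key, column_key, timestamp) -> cell contents.

class SparseTable:
    def __init__(self):
        # Only non-empty cells occupy storage: the "sparse" property.
        # (row_key, column_key) -> list of (timestamp, value), oldest first
        self.cells = {}

    def write(self, row_key, column_key, value, timestamp):
        versions = self.cells.setdefault((row_key, column_key), [])
        bisect.insort(versions, (timestamp, value))

    def read(self, row_key, column_key, timestamp=None):
        versions = self.cells.get((row_key, column_key), [])
        if not versions:
            return None
        if timestamp is None:
            return versions[-1][1]  # no timestamp: most recent version
        # with timestamp: latest version at or before the specified one
        stamps = [ts for ts, _ in versions]
        i = bisect.bisect_right(stamps, timestamp)
        return versions[i - 1][1] if i else None

t = SparseTable()
t.write("com.cnn.www", "contents:", "<html>v1</html>", timestamp=100)
t.write("com.cnn.www", "contents:", "<html>v2</html>", timestamp=200)
print(t.read("com.cnn.www", "contents:"))       # -> <html>v2</html>
print(t.read("com.cnn.www", "contents:", 150))  # -> <html>v1</html>
```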
When asked "How does BigTable's data model differ from a relational database?", emphasize three things: (1) the model is sparse -- empty columns consume no storage, (2) every cell is versioned via timestamps, and (3) there are no secondary indexes -- the row key is the only index, so row key design is critical.
Rows
- Each row is uniquely identified by its row key (an arbitrary string, typically much smaller than the 64 KB max).
- Single-row atomicity: all reads and writes under one row key are atomic. Atomicity across rows is not guaranteed -- one row update can succeed while another fails.
- The only index is the row key. There are no secondary indexes, which means row key design drives query performance.
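Because the row key is the only index and rows are kept in lexicographic key order, good row key design turns common queries into contiguous range scans. A sketch of the idea, using the reversed-hostname trick from the BigTable paper (web pages keyed as `com.cnn.www` so that all pages from one domain sit in adjacent rows):

```python
# Hypothetical sketch: rows live in lexicographic order by row key, so the
# only efficient access paths are point lookups and range/prefix scans.
# Reversing hostnames clusters pages of one domain into a contiguous range.

rows = sorted([
    "com.cnn.www/index.html",
    "com.cnn.money/markets",
    "org.example.www/about",
])

def prefix_scan(rows, prefix):
    # A real system would binary-search to the start of the range in
    # O(log n); a linear filter keeps the sketch short.
    return [r for r in rows if r.startswith(prefix)]

print(prefix_scan(rows, "com.cnn."))  # all cnn.com pages, adjacent in key order
```

If the keys were stored as `www.cnn.com/...` instead, the same pages would scatter across the table and the scan would not be contiguous.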
A column is a key-value pair: the key is the column key, the value is the cell content.
Column families
Column keys are grouped into column families. All data in a family is typically the same type. BigTable enforces access control and tracks disk/memory usage at the column-family level.
The following figure shows row 294 with two column families (personal_info and work_info) and three columns under personal_info. Column families follow these rules:
| Rule | Detail |
|---|---|
| Format | family:optional_qualifier |
| Count | Small number per table (hundreds at most); rarely change after creation |
| Uniformity | All rows share the same set of column families |
| Efficiency | BigTable retrieves data from a single column family efficiently |
| Naming | Short names are better -- family names travel with every data transfer |
Column families must be declared before writing data, and changing them in production is expensive. Treat column families as schema and column qualifiers as dynamic data.
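The "families as schema, qualifiers as data" split can be sketched as follows. This is an illustrative model, not a real client API: the family set is fixed at table creation and writes to undeclared families are rejected, while any qualifier is accepted on the fly.

```python
# Hypothetical sketch: column families act as schema (declared up front),
# while column qualifiers are open-ended data.

class Table:
    def __init__(self, families):
        self.families = set(families)  # fixed at creation, rarely changed
        self.cells = {}

    def write(self, row_key, column_key, value):
        family, _, qualifier = column_key.partition(":")
        if family not in self.families:
            raise ValueError(f"unknown column family: {family!r}")
        # Qualifiers need no declaration: new columns appear on the fly.
        self.cells[(row_key, column_key)] = value

t = Table(families=["personal_info", "work_info"])
t.write("294", "work_info:Dept", "Engineering")   # ok: declared family
t.write("294", "personal_info:nickname", "Sam")   # ok: brand-new qualifier
# t.write("294", "contact:email", "...")          # would raise ValueError
```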
Columns
- Columns exist within a column family.
- A table can have an unbounded number of columns -- new columns can appear on the fly.
- Short column names reduce transfer overhead (format: ColumnFamily:ColumnName, e.g., Work:Dept).
- Empty columns are not stored, making BigTable well-suited for sparse data.
Timestamps
Each cell can hold multiple versions of a value. A 64-bit timestamp identifies each version -- either real time or a custom value assigned by the client.
- Read without timestamp: returns the most recent version.
- Read with timestamp: returns the latest version at or before the specified timestamp.
BigTable supports two per-column-family garbage-collection policies for automatic version cleanup:
| Policy | Behavior |
|---|---|
| Keep last N | Retain only the N most recent versions |
| Keep by age | Retain only versions newer than a threshold (e.g., last 7 days) |
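Both policies reduce a cell's version list; a minimal sketch of each, assuming versions are kept oldest-to-newest as (timestamp, value) pairs (function names are illustrative, not a real API):

```python
import time

# Hypothetical sketch of the two per-column-family GC policies, applied to a
# version list sorted oldest-to-newest as (timestamp, value) pairs.

def keep_last_n(versions, n):
    # Retain only the N most recent versions.
    return versions[-n:]

def keep_by_age(versions, max_age_seconds, now=None):
    # Retain only versions newer than the age threshold.
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return [(ts, v) for ts, v in versions if ts > cutoff]

versions = [(100, "a"), (200, "b"), (300, "c")]
print(keep_last_n(versions, 2))             # -> [(200, 'b'), (300, 'c')]
print(keep_by_age(versions, 150, now=400))  # -> [(300, 'c')]
```

In a real deployment the policy is configured once per column family and the cleanup happens in the background; readers simply never see the collected versions.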
Versioning via timestamps is a BigTable concept that directly influenced Cassandra and HBase. In an interview, connect this to the broader pattern: storing multiple versions avoids the need for locks during reads, since readers always see a consistent snapshot. This is the same principle behind MVCC in traditional databases.