BigTable Data Model
How do you organize petabytes of structured data so that any cell is accessible with a single lookup? Traditional relational databases use a two-dimensional model -- row ID plus column name. BigTable adds two more dimensions: column families and timestamps. This four-dimensional design is what makes BigTable a sparse, distributed, persistent, multidimensional, sorted map.
Two dimensions vs. four dimensions
Traditional databases identify every cell by its row ID and column name.
BigTable's four-dimensional data model uses:
| Dimension | Purpose |
|---|---|
| Row key | Uniquely identifies a row (arbitrary string, up to 64 KB) |
| Column family | Groups related columns together |
| Column qualifier | Identifies a specific column within a family |
| Timestamp | Versions each cell value (64-bit, real time or client-assigned) |
Data is indexed by row key, column key, and timestamp (the column key combines family and qualifier, which is how four dimensions collapse into three lookup keys). Accessing a cell requires all three:

(row_key: string, column_name: string, timestamp: int64) → cell contents (string)
If no timestamp is specified, BigTable returns the most recent version.
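The map signature and the read semantics above can be sketched with a minimal in-memory model. This is a hypothetical illustration, not BigTable's actual implementation: cells are stored per (row key, column key) as a timestamp-sorted version list, so empty columns cost nothing and a read without a timestamp returns the newest version.

```python
import bisect

# Hypothetical in-memory sketch of BigTable's sorted map:
# (row_key, column_key, timestamp) -> cell contents.

class SparseTable:
    def __init__(self):
        # Only non-empty cells occupy storage: the "sparse" property.
        # (row_key, column_key) -> list of (timestamp, value), oldest first
        self.cells = {}

    def write(self, row_key, column_key, value, timestamp):
        versions = self.cells.setdefault((row_key, column_key), [])
        bisect.insort(versions, (timestamp, value))

    def read(self, row_key, column_key, timestamp=None):
        versions = self.cells.get((row_key, column_key), [])
        if not versions:
            return None
        if timestamp is None:
            return versions[-1][1]  # no timestamp: most recent version
        # with timestamp: latest version at or before the specified one
        stamps = [ts for ts, _ in versions]
        i = bisect.bisect_right(stamps, timestamp)
        return versions[i - 1][1] if i else None

t = SparseTable()
t.write("com.cnn.www", "contents:", "<html>v1</html>", timestamp=100)
t.write("com.cnn.www", "contents:", "<html>v2</html>", timestamp=200)
print(t.read("com.cnn.www", "contents:"))       # -> <html>v2</html>
print(t.read("com.cnn.www", "contents:", 150))  # -> <html>v1</html>
```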
When asked "How does BigTable's data model differ from a relational database?", emphasize three things: (1) the model is sparse -- empty columns consume no storage, (2) every cell is versioned via timestamps, and (3) there are no secondary indexes -- the row key is the only index, so row key design is critical.
Rows
- Each row is uniquely identified by its row key (an arbitrary string, typically much smaller than the 64 KB max).
- Single-row atomicity: all reads and writes under one row key are atomic. Atomicity across rows is not guaranteed -- one row update can succeed while another fails.
- The only index is the row key. There are no secondary indexes, which means row key design drives query performance.
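Because the row key is the only index and rows are kept in lexicographic key order, good row key design turns common queries into contiguous range scans. A sketch of the idea, using the reversed-hostname trick from the BigTable paper (web pages keyed as `com.cnn.www` so that all pages from one domain sit in adjacent rows):

```python
# Hypothetical sketch: rows live in lexicographic order by row key, so the
# only efficient access paths are point lookups and range/prefix scans.
# Reversing hostnames clusters pages of one domain into a contiguous range.

rows = sorted([
    "com.cnn.www/index.html",
    "com.cnn.money/markets",
    "org.example.www/about",
])

def prefix_scan(rows, prefix):
    # A real system would binary-search to the start of the range in
    # O(log n); a linear filter keeps the sketch short.
    return [r for r in rows if r.startswith(prefix)]

print(prefix_scan(rows, "com.cnn."))  # all cnn.com pages, adjacent in key order
```

If the keys were stored as `www.cnn.com/...` instead, the same pages would scatter across the table and the scan would not be contiguous.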
A column is a key-value pair: the key is the column key, the value is the cell content.
Column families
Column keys are grouped into column families. All data in a family is typically the same type. BigTable enforces access control and tracks disk/memory usage at the column-family level.
The following figure shows row 294 with two column families (personal_info and work_info) and three columns under personal_info. Column families follow these rules:
| Rule | Detail |
|---|---|
| Format | family:optional_qualifier |
| Count | Small number per table (hundreds at most); rarely change after creation |
| Uniformity | All rows share the same set of column families |
| Efficiency | BigTable retrieves data from a single column family efficiently |
| Naming | Short names are better -- family names travel with every data transfer |
Column families must be declared before writing data, and changing them in production is expensive. Treat column families as schema and column qualifiers as dynamic data.
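The "families as schema, qualifiers as data" split can be sketched as follows. This is an illustrative model, not a real client API: the family set is fixed at table creation and writes to undeclared families are rejected, while any qualifier is accepted on the fly.

```python
# Hypothetical sketch: column families act as schema (declared up front),
# while column qualifiers are open-ended data.

class Table:
    def __init__(self, families):
        self.families = set(families)  # fixed at creation, rarely changed
        self.cells = {}

    def write(self, row_key, column_key, value):
        family, _, qualifier = column_key.partition(":")
        if family not in self.families:
            raise ValueError(f"unknown column family: {family!r}")
        # Qualifiers need no declaration: new columns appear on the fly.
        self.cells[(row_key, column_key)] = value

t = Table(families=["personal_info", "work_info"])
t.write("294", "work_info:Dept", "Engineering")   # ok: declared family
t.write("294", "personal_info:nickname", "Sam")   # ok: brand-new qualifier
# t.write("294", "contact:email", "...")          # would raise ValueError
```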
Columns
- Columns exist within a column family.
- A table can have an unbounded number of columns -- new columns can appear on the fly.
- Short column names reduce transfer overhead (format: ColumnFamily:ColumnName, e.g., Work:Dept).
- Empty columns are not stored, making BigTable well-suited for sparse data.
Timestamps
Each cell can hold multiple versions of a value. A 64-bit timestamp identifies each version -- either real time or a custom value assigned by the client.
- Read without timestamp: returns the most recent version.
- Read with timestamp: returns the latest version at or before the specified timestamp.
BigTable supports two per-column-family garbage-collection policies for automatic version cleanup:
| Policy | Behavior |
|---|---|
| Keep last N | Retain only the N most recent versions |
| Keep by age | Retain only versions newer than a threshold (e.g., last 7 days) |
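Both policies reduce a cell's version list; a minimal sketch of each, assuming versions are kept oldest-to-newest as (timestamp, value) pairs (function names are illustrative, not a real API):

```python
import time

# Hypothetical sketch of the two per-column-family GC policies, applied to a
# version list sorted oldest-to-newest as (timestamp, value) pairs.

def keep_last_n(versions, n):
    # Retain only the N most recent versions.
    return versions[-n:]

def keep_by_age(versions, max_age_seconds, now=None):
    # Retain only versions newer than the age threshold.
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return [(ts, v) for ts, v in versions if ts > cutoff]

versions = [(100, "a"), (200, "b"), (300, "c")]
print(keep_last_n(versions, 2))             # -> [(200, 'b'), (300, 'c')]
print(keep_by_age(versions, 150, now=400))  # -> [(300, 'c')]
```

In a real deployment the policy is configured once per column family and the cleanup happens in the background; readers simply never see the collected versions.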
Versioning via timestamps is a BigTable concept that directly influenced Cassandra and HBase. In an interview, connect this to the broader pattern: storing multiple versions avoids the need for locks during reads, since readers always see a consistent snapshot. This is the same principle behind MVCC in traditional databases.