BigTable Data Model

How do you organize petabytes of structured data so that any cell is accessible with a single lookup? Traditional relational databases use a two-dimensional model -- row ID plus column name. BigTable adds two more dimensions: column families and timestamps. This four-dimensional design is what makes BigTable a sparse, distributed, persistent, multidimensional, sorted map.

Think first
A relational database identifies a cell by row ID and column name -- two dimensions. If you need to store multiple versions of the same cell and group columns for access control, what additional dimensions would you add to the data model?

Two dimensions vs. four dimensions

Traditional databases identify every cell by its row ID and column name.

BigTable's four-dimensional data model uses:

Dimension | Purpose
Row key | Uniquely identifies a row (arbitrary string, up to 64 KB)
Column family | Groups related columns together
Column qualifier | Identifies a specific column within a family
Timestamp | Versions each cell value (64-bit, real time or client-assigned)

Data is indexed by row key, column key (column family plus qualifier), and timestamp. Together, the three identify exactly one version of one cell:

(row_key: string, column_name: string, timestamp: int64) → cell contents (string)

If no timestamp is specified, BigTable returns the most recent version.
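The mapping above can be sketched as an ordinary Python dictionary keyed by a (row key, column key, timestamp) tuple. This is only an in-memory illustration, not BigTable's storage format; the `read` helper, the qualifier names, and the sample values are assumptions made up for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the (row_key, column_key, timestamp) -> value map.
CellKey = Tuple[str, str, int]

table: Dict[CellKey, bytes] = {
    ("294", "personal_info:name", 100): b"Alice",
    ("294", "personal_info:name", 200): b"Alice B.",
    ("294", "work_info:dept", 150): b"Eng",
}


def read(table: Dict[CellKey, bytes], row_key: str, column_key: str,
         timestamp: Optional[int] = None) -> Optional[bytes]:
    """Return one cell version; with no timestamp, return the most recent one."""
    if timestamp is not None:
        # Exact lookup: all three parts of the key are supplied.
        return table.get((row_key, column_key, timestamp))
    # No timestamp given: pick the newest version of this cell, if any exists.
    stamps = [ts for (r, c, ts) in table if r == row_key and c == column_key]
    if not stamps:
        return None
    return table[(row_key, column_key, max(stamps))]
```

A linear scan is used here purely for brevity; a real system would keep the keys sorted so that the newest version can be found without scanning.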

Interview angle

When asked "How does BigTable's data model differ from a relational database?", emphasize three things: (1) the model is sparse -- empty columns consume no storage, (2) every cell is versioned via timestamps, and (3) there are no secondary indexes -- the row key is the only index, so row key design is critical.

Rows

  • Each row is uniquely identified by its row key (an arbitrary string, typically much smaller than the 64 KB max).
  • Single-row atomicity: all reads and writes under one row key are atomic. Atomicity across rows is not guaranteed -- one row update can succeed while another fails.
  • The only index is the row key. There are no secondary indexes, which means row key design drives query performance.
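Because rows are kept sorted by row key and there is no other index, a query is cheap only when it maps to a single key or a contiguous key range. The sketch below illustrates that idea with a sorted Python list and `bisect`; the `user#date` key layout and the `prefix_scan` helper are hypothetical choices for illustration, not a BigTable API.

```python
import bisect
from typing import List


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return every key that starts with `prefix` from a sorted list of keys."""
    # Keys sharing a prefix are contiguous in sorted order, so jump to the
    # first candidate and stop at the first key that no longer matches.
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# Hypothetical row keys of the form "<user>#<date>".
row_keys = sorted([
    "user2#2024-01-01",
    "user1#2024-01-02",
    "user10#2024-01-01",
    "user1#2024-01-01",
])
user1_rows = prefix_scan(row_keys, "user1#")
```

With this layout, all rows for one user sit next to each other and can be read as one range. A query by date alone would have to touch every row, which is why the row key should be chosen around the most important access pattern.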

A column is a key-value pair: the key is the column key, the value is the cell content.

Column families

Column keys are grouped into column families. All data in a family is typically the same type. BigTable enforces access control and tracks disk/memory usage at the column-family level.

The following figure shows row 294 with two column families (personal_info and work_info) and three columns under personal_info:

Rule | Detail
Format | family:optional_qualifier
Count | Small number per table (hundreds at most); rarely change after creation
Uniformity | All rows share the same set of column families
Efficiency | BigTable retrieves data from a single column family efficiently
Naming | Short names are better -- family names travel with every data transfer

Warning

Column families must be declared before writing data, and changing them in production is expensive. Treat column families as schema and column qualifiers as dynamic data.
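The split between fixed families and dynamic qualifiers can be sketched as follows. `SketchTable` is a hypothetical class written for this example, not part of any BigTable client library: families are fixed at construction time, while any qualifier is accepted at write time.

```python
class SketchTable:
    """Toy table: column families are declared up front, qualifiers are not."""

    def __init__(self, families):
        self.families = frozenset(families)
        self.rows = {}  # row_key -> {column_key -> value}

    def put(self, row_key, column_key, value):
        # Column keys follow the family:optional_qualifier format.
        family, _, _qualifier = column_key.partition(":")
        if family not in self.families:
            raise KeyError(f"undeclared column family: {family!r}")
        self.rows.setdefault(row_key, {})[column_key] = value


t = SketchTable(families=["personal_info", "work_info"])
t.put("294", "personal_info:name", "Alice")      # qualifier appears on the fly
t.put("294", "personal_info:nickname", "Al")     # another new qualifier, no schema change
```

Writing to a family that was never declared, for example `t.put("294", "billing:plan", "pro")`, raises `KeyError` in this sketch.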

Columns

  • Columns exist within a column family.
  • A table can have an unbounded number of columns -- new columns can appear on the fly.
  • Short column names reduce transfer overhead (format: ColumnFamily:ColumnName, e.g., Work:Dept).
  • Empty columns are not stored, making BigTable well-suited for sparse data.
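To make the sparsity point concrete, the sketch below compares how many cells are stored when each row holds only the columns it actually has against a dense grid of rows times columns. The rows and qualifier names are made up for illustration.

```python
# Each row stores only the cells that exist; absent columns take no space.
rows = {
    "294": {"personal_info:name": "Alice", "work_info:dept": "Eng"},
    "295": {"personal_info:name": "Bob", "personal_info:nickname": "Bobby"},
}

# Every distinct column key that appears in at least one row.
all_columns = sorted({col for cells in rows.values() for col in cells})

stored_cells = sum(len(cells) for cells in rows.values())  # sparse layout
dense_cells = len(rows) * len(all_columns)                 # full grid
```

Here the sparse layout holds 4 cells while a dense grid would need 6. The gap grows quickly as more rows introduce their own qualifiers.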

Timestamps

Each cell can hold multiple versions of a value. A 64-bit timestamp identifies each version -- either real time or a custom value assigned by the client.

  • Read without timestamp: returns the most recent version.
  • Read with timestamp: returns the latest version at or before the specified timestamp.
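The two read modes can be sketched with a per-cell list of versions kept in ascending timestamp order. The `read_version` helper and the sample values are assumptions for illustration only.

```python
import bisect
from typing import List, Optional, Tuple


def read_version(versions: List[Tuple[int, str]],
                 timestamp: Optional[int] = None) -> Optional[str]:
    """Read one cell. `versions` is a list of (timestamp, value), oldest first."""
    if not versions:
        return None
    if timestamp is None:
        # No timestamp: the most recent version wins.
        return versions[-1][1]
    # With a timestamp: find the latest version at or before it.
    stamps = [ts for ts, _ in versions]
    idx = bisect.bisect_right(stamps, timestamp)
    return versions[idx - 1][1] if idx > 0 else None


cell = [(100, "v1"), (200, "v2"), (300, "v3")]
```

For this cell, `read_version(cell)` yields `"v3"`, `read_version(cell, 250)` yields `"v2"`, and a timestamp earlier than every version yields `None`.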

BigTable supports two per-column-family garbage-collection policies for automatic version cleanup:

Policy | Behavior
Keep last N | Retain only the N most recent versions
Keep by age | Retain only versions newer than a threshold (e.g., last 7 days)
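Both policies can be sketched as simple filters over a cell's version list. The function names and the integer time units are assumptions for this example; the actual cleanup runs inside BigTable, not in client code.

```python
from typing import List, Tuple

Version = Tuple[int, str]  # (timestamp, value)


def keep_last_n(versions: List[Version], n: int) -> List[Version]:
    """Keep the n most recent versions, newest first."""
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    return newest_first[:n]


def keep_newer_than(versions: List[Version], now: int, max_age: int) -> List[Version]:
    """Keep versions whose age (now - timestamp) does not exceed max_age."""
    return [(ts, val) for ts, val in versions if now - ts <= max_age]


history = [(100, "a"), (200, "b"), (300, "c")]
latest_two = keep_last_n(history, 2)
recent = keep_newer_than(history, now=350, max_age=100)
```

In this example `latest_two` keeps the versions at timestamps 300 and 200, and `recent` keeps only the version at 300.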

Interview angle

Versioning via timestamps is a BigTable concept that directly influenced Cassandra and HBase. In an interview, connect this to the broader pattern: storing multiple versions avoids the need for locks during reads, since readers always see a consistent snapshot. This is the same principle behind MVCC in traditional databases.

Quiz
What would happen if BigTable used secondary indexes on columns instead of relying solely on the row key for lookups?