BigTable Introduction

Goal

Design a distributed and scalable system that can store a huge amount of structured data, indexed by a row key, where each row can have an unbounded number of columns.

Why BigTable matters

Google Search indexes hundreds of billions of web pages. Google Earth serves satellite imagery for the entire planet. Google Analytics tracks events across millions of websites. Each of these services has different requirements -- different data sizes, different latency needs, different query patterns -- but they all share one thing: the data is massive and needs to be accessed fast.

In 2005, no commercial database could handle this. Relational databases couldn't scale to petabytes. Pure key-value stores (like Amazon's later Dynamo) lacked the ability to organize data into rows and columns. Google needed a system that combined the scalability of a distributed system with the structure of a database, without the overhead of SQL.

BigTable was the answer: a wide-column store that can handle petabytes of data across thousands of machines while providing millisecond-level reads. It trades away SQL, transactions, and joins in exchange for raw scalability and performance.

Interview insight

BigTable is the case study for CP systems that still achieve high performance. Where Dynamo chose availability, BigTable chose strong consistency -- and shows that with careful architecture (single master for metadata, distributed storage for data), you can have consistency without sacrificing practical availability.

What is BigTable?

BigTable is a distributed, massively scalable wide-column store designed to handle petabytes of structured data across thousands of commodity servers.

Property            Detail
------------------  -------------------------------------------------------------------------
CAP classification  CP -- strictly consistent reads and writes
Data model          Wide-column: rows identified by row key, data organized into column families
Architecture        Leader-based: single master coordinates metadata, tablet servers handle data
Depends on          GFS for storage, Chubby for coordination
Scale               Petabytes of data, billions of rows, thousands of machines

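The wide-column model in the table above is, logically, a sparse, sorted, multi-dimensional map: the BigTable paper describes it as a mapping from (row key, column family:qualifier, timestamp) to an uninterpreted byte string. A minimal in-memory sketch of that logical model (illustrative class and method names, not an actual BigTable API):

```python
# Sketch of BigTable's logical data model:
# (row_key, "family:qualifier", timestamp) -> bytes.
# Rows are sorted lexicographically by key; columns are sparse per row.

class MiniTable:
    def __init__(self):
        # row_key -> {"family:qualifier" -> {timestamp -> value}}
        self.rows = {}

    def put(self, row_key, column, timestamp, value):
        cols = self.rows.setdefault(row_key, {})
        cols.setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_key, end_key):
        """Yield (row_key, columns) for rows in [start_key, end_key).
        Range scans work because row keys are kept in sorted order."""
        for key in sorted(self.rows):
            if start_key <= key < end_key:
                yield key, self.rows[key]


# Webtable-style example from the paper: URLs with reversed domains
# as row keys, so pages from the same domain sort together.
t = MiniTable()
t.put("com.cnn.www", "contents:", 3, b"<html>...")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
```

Note how a row can hold any number of columns, and each cell keeps multiple timestamped versions -- this is what "sparse rows" and "unbounded columns" mean in practice.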
The Google stack

BigTable doesn't exist in isolation -- it sits on top of two other Google systems we cover in this course:

  • GFS provides the underlying distributed file storage (where BigTable's SSTables live on disk)
  • Chubby provides distributed locking and coordination (master election, schema storage, bootstrap)

Understanding all three systems together gives you a complete picture of how Google builds infrastructure in layers.

Background

BigTable was developed at Google starting in 2004 and has been in production since 2005. No commercial database could operate at Google's scale, and even if one could, licensing it across thousands of machines would have been prohibitively expensive.

The BigTable paper (2006) was enormously influential, inspiring:

  • Apache HBase -- the most direct open-source clone
  • Cassandra -- which borrowed BigTable's data model but used Dynamo's architecture
  • Hypertable -- another open-source wide-column store

BigTable vs. Cassandra: same data model, opposite architectures

Both use column families and sparse rows. But BigTable has a single master (CP, strong consistency) while Cassandra is peer-to-peer (AP, eventual consistency). This is one of the most instructive comparisons in distributed systems -- the same data model can be built on fundamentally different architectures with very different trade-offs.

BigTable use cases

Good fit:

  • Web indexing -- Billions of URLs, each with page content, links, crawl metadata, PageRank. BigTable's row-per-URL model handles this naturally.
  • Time-series data -- Naturally ordered by row key (timestamp), with columns for different metrics.
  • IoT and sensor data -- High-throughput writes of structured streams.
  • User profile data -- Hundreds of millions of users, each with sparse, varying attributes.
  • Financial data -- Time-series with strong consistency requirements.
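
For the time-series case above, the key design decision is what goes in the row key: since BigTable only sorts by row key, the key itself must encode the desired scan order. A common pattern with BigTable-style stores (the key format here is illustrative) is to embed a zero-padded inverted timestamp so that a lexicographic scan returns the newest readings first:

```python
# Illustrative row-key scheme for time-series data in a
# BigTable-style store. MAX_TS is a hypothetical upper bound
# on millisecond timestamps, chosen so inverted values fit
# in exactly 13 digits.
MAX_TS = 9_999_999_999_999


def timeseries_row_key(sensor_id: str, ts_millis: int) -> str:
    """Row key: sensor id plus zero-padded inverted timestamp.
    Lexicographic order on these keys puts newer readings first,
    so 'latest N readings for a sensor' becomes a short prefix scan."""
    inverted = MAX_TS - ts_millis
    return f"{sensor_id}#{inverted:013d}"


k_old = timeseries_row_key("sensor42", 1_600_000_000_000)
k_new = timeseries_row_key("sensor42", 1_700_000_000_000)
assert k_new < k_old  # the newer reading sorts first
```

Zero-padding matters: without it, string comparison would order "9" after "10" and break the scan order. The sensor id prefix also spreads writes across tablets instead of hammering the tablet holding the current timestamp.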

Poor fit:

  • OLTP with transactions -- BigTable guarantees atomicity only for single-row operations; there are no multi-row or cross-table transactions.
  • Unstructured data -- Images, videos, and blobs should live in a file system (like GFS), not BigTable.
  • Small datasets -- The operational overhead of BigTable isn't justified for data that fits on one machine.
  • Complex relational queries -- No joins, no SQL, no secondary indexes (in the original design).

What's next

In the following chapters, we'll explore: