BigTable Introduction
Goal
Design a distributed and scalable system that can store a huge amount of structured data, indexed by a row key, where each row can have an unbounded number of columns.
Why BigTable matters
Google Search indexes hundreds of billions of web pages. Google Earth serves satellite imagery for the entire planet. Google Analytics tracks events across millions of websites. Each of these services has different requirements -- different data sizes, different latency needs, different query patterns -- but they all share one thing: the data is massive and needs to be accessed fast.
In 2005, no commercial database could handle this. Relational databases couldn't scale to petabytes, and simple key-value stores (like Amazon's Dynamo, developed around the same time) lacked the ability to organize data into rows and columns. Google needed a system that combined the scalability of a distributed system with the structure of a database, without the overhead of SQL.
BigTable was the answer: a wide-column store that can handle petabytes of data across thousands of machines while providing millisecond-level reads. It trades away SQL, transactions, and joins in exchange for raw scalability and performance.
BigTable is the case study for CP systems that still achieve high performance. Where Dynamo chose availability, BigTable chose strong consistency -- and shows that with careful architecture (single master for metadata, distributed storage for data), you can have consistency without sacrificing practical availability.
What is BigTable?
BigTable is a distributed, massively scalable wide-column store designed to handle petabytes of structured data across thousands of commodity servers.
| Property | Detail |
|---|---|
| CAP classification | CP -- strictly consistent reads and writes |
| Data model | Wide-column: rows identified by row key, data organized into column families |
| Architecture | Leader-based: single master coordinates metadata, tablet servers handle data |
| Depends on | GFS for storage, Chubby for coordination |
| Scale | Petabytes of data, billions of rows, thousands of machines |
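To make the wide-column model in the table above concrete: the BigTable paper describes the data model as a sparse, sorted, persistent multidimensional map from (row key, column, timestamp) to a value. The sketch below is a toy, in-memory illustration of that shape, not Google's implementation; the class and method names are my own, and the reversed-domain row key `com.cnn.www` follows the example used in the original paper.

```python
# Toy sketch of BigTable's data model: a sparse, sorted map of
#   (row_key, column, timestamp) -> value
# Keys are kept in lexicographic row-key order, which is what makes
# BigTable's range scans over contiguous rows efficient.
import bisect


class ToyBigTable:
    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # (row, column, -timestamp) -> value

    def put(self, row, column, timestamp, value):
        # Negate the timestamp so the newest version sorts first.
        key = (row, column, -timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys):
            r, c, _ = self._keys[i]
            if r == row and c == column:
                return self._cells[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for rows in [start_row, end_row)."""
        i = bisect.bisect_left(self._keys, (start_row, "", float("-inf")))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, col, neg_ts = self._keys[i]
            yield row, col, -neg_ts, self._cells[self._keys[i]]
            i += 1


table = ToyBigTable()
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("com.cnn.www", "contents:", 9, "<html>v9")
```

Note how sparsity falls out for free: a row stores only the cells it actually has, so millions of rows with wildly different column sets cost nothing extra.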
BigTable doesn't exist in isolation -- it sits on top of two other Google systems we cover in this course:
- GFS provides the underlying distributed file storage (where BigTable's SSTables live on disk)
- Chubby provides distributed locking and coordination (master election, schema storage, bootstrap)
Understanding all three systems together gives you a complete picture of how Google builds infrastructure in layers.
Background
BigTable was developed at Google starting in 2004 and has been in production since 2005. No commercial database could operate at Google's scale, and licensing one would have been prohibitively expensive even if it could.
The BigTable paper (2006) was enormously influential, inspiring:
- Apache HBase -- the most direct open-source clone
- Cassandra -- which borrowed BigTable's data model but used Dynamo's architecture
- Hypertable -- another open-source wide-column store
Cassandra and BigTable both use column families and sparse rows. But BigTable has a single master (CP, strong consistency) while Cassandra is peer-to-peer (AP, eventual consistency). This is one of the most instructive comparisons in distributed systems -- the same data model can be built on fundamentally different architectures with very different trade-offs.
BigTable use cases
Good fit:
- Web indexing -- Billions of URLs, each with page content, links, crawl metadata, PageRank. BigTable's row-per-URL model handles this naturally.
- Time-series data -- Naturally ordered by row key (timestamp), with columns for different metrics.
- IoT and sensor data -- High-throughput writes of structured streams.
- User profile data -- Hundreds of millions of users, each with sparse, varying attributes.
- Financial data -- Time-series with strong consistency requirements.
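Because BigTable keeps rows sorted by row key, the time-series and IoT fits above depend entirely on row-key design. A common pattern (used by BigTable-style stores generally, not an API from the paper) is to prefix the key with an entity ID and append a reversed, zero-padded timestamp so the newest reading sorts first. The function name, `#` separator, and `MAX_TS` bound below are illustrative assumptions.

```python
# Illustrative row-key design for time-series data in a wide-column store.
# Prefixing by sensor keeps each sensor's readings contiguous on disk;
# reversing the timestamp makes the NEWEST reading sort first
# lexicographically, so "give me the latest N readings" is a cheap scan.

MAX_TS = 10**13  # assumed upper bound on epoch-millisecond timestamps


def row_key(sensor_id: str, ts_millis: int) -> str:
    reversed_ts = MAX_TS - ts_millis
    # Zero-pad to a fixed width so string order matches numeric order.
    return f"{sensor_id}#{reversed_ts:013d}"
```

With this scheme, a prefix scan over `sensor-42#` returns that sensor's readings newest-first, and readings from different sensors never interleave.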
Poor fit:
- OLTP with transactions -- BigTable guarantees atomicity only for single-row operations; there are no multi-row ACID transactions.
- Unstructured data -- Images, videos, and blobs should live in a file system (like GFS), not BigTable.
- Small datasets -- The operational overhead of BigTable isn't justified for data that fits on one machine.
- Complex relational queries -- No joins, no SQL, no secondary indexes (in the original design).
What's next
In the following chapters, we'll explore:
- BigTable's data model -- rows, column families, and timestamps
- System APIs for reading and writing data
- Partitioning and architecture -- how data is split into tablets
- SSTables -- the on-disk storage format
- How GFS and Chubby underpin BigTable
- The anatomy of read and write operations
- Fault tolerance and compaction