BigTable Characteristics
What properties make BigTable suitable for Google-scale workloads, and how does it compare to other distributed databases? Understanding these characteristics helps you decide when BigTable (or an alternative such as HBase or Cassandra) is the right tool -- and when it isn't.
BigTable performance characteristics
| Characteristic | Detail |
|---|---|
| Distributed multi-level map | Runs across thousands of machines with data partitioned into Tablets |
| Horizontally scalable | Add nodes without downtime or manual rebalancing; achieves near-linear scalability on commodity hardware |
| Fault-tolerant | Data replicated via GFS across multiple ChunkServers on different racks |
| Durable | All data persisted to GFS with Write-Ahead Log guarantees |
| Centralized coordination | Single master maintains data consistency and a global view of cluster state (Leader and Follower pattern) |
| Separated control and data planes | Clients talk to the master for metadata only; all data reads/writes go directly to Tablet servers |
The separation of control and data planes is the single most important architectural decision in BigTable. It allows the master to be a single point of coordination without becoming a single point of contention. When designing your own system in an interview, always consider: "Can I separate metadata operations from data operations so they scale independently?"
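The split can be made concrete with a small sketch. This is an illustrative simplification, not BigTable's real API: `TabletMaster`, `TabletServer`, and `Client` are hypothetical names, and a real client would also cache tablet locations so most requests skip the master entirely.

```python
import bisect

class TabletServer:
    """Data plane: holds the rows for one contiguous tablet."""
    def __init__(self):
        self.rows = {}

    def write(self, row_key, value):
        self.rows[row_key] = value

    def read(self, row_key):
        return self.rows.get(row_key)

class TabletMaster:
    """Control plane: knows only which server owns which row range."""
    def __init__(self, boundaries, servers):
        self.boundaries = boundaries   # sorted tablet start keys, e.g. ["", "m"]
        self.servers = servers         # one TabletServer per tablet

    def locate(self, row_key):
        i = bisect.bisect_right(self.boundaries, row_key) - 1
        return self.servers[i]

class Client:
    """Asks the master only for metadata; data flows client <-> tablet server."""
    def __init__(self, master):
        self.master = master

    def write(self, row_key, value):
        # Control plane: a tiny metadata lookup (cached in a real client).
        server = self.master.locate(row_key)
        # Data plane: the value itself never passes through the master.
        server.write(row_key, value)

    def read(self, row_key):
        return self.master.locate(row_key).read(row_key)
```

Because the master moves only metadata, its load grows with the number of tablets and clients, not with the volume of data -- which is why one master can coordinate thousands of tablet servers.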
Dynamo vs. BigTable
These two systems represent fundamentally different approaches to distributed storage:
| Category | Dynamo | BigTable |
|---|---|---|
| Architecture | Decentralized -- every node has equal responsibilities | Centralized -- master handles metadata, Tablet servers handle data |
| Data model | Key-value | Multidimensional sorted map (wide-column) |
| Security | No built-in fine-grained access control | Access rights at column-family level |
| Partitioning | Consistent hashing with virtual nodes | Range-based Tablets (contiguous row ranges) |
| Replication | Sloppy quorum -- each item replicated to N nodes | GFS chunk replication across ChunkServers |
| CAP stance | AP -- prioritizes availability | CP -- prioritizes consistency |
| Operations | By individual key | By key range (efficient scans) |
| Storage | Pluggable storage engine | SSTables in GFS |
| Membership | Gossip protocol | Master-initiated via Chubby |
"Dynamo vs. BigTable" is not about which is better -- it's about which trade-offs your application needs. If you need range scans and strong consistency, BigTable wins. If you need write availability during network partitions, Dynamo wins. Interviewers want you to articulate why, not pick a winner.
Systems inspired by BigTable
BigTable's design influenced an entire generation of NoSQL databases:
| System | Relationship to BigTable |
|---|---|
| HBase | Most direct open-source clone; runs on HDFS instead of GFS |
| Hypertable | Open-source C++ implementation; abstracts the file system layer to work with HDFS, GlusterFS, or CloudStore via a broker process |
| Cassandra | Hybrid architecture -- uses BigTable's data model (SSTables, MemTables, column families) on top of Dynamo's infrastructure (consistent hashing, gossip, decentralized) |
Cassandra's lineage is the ultimate interview talking point for distributed systems. It proves that data models and architectures are independent design dimensions. You can take BigTable's wide-column model and run it on a Dynamo-style ring -- or take Dynamo's key-value model and run it on a master-based architecture. Understanding this decomposition shows deep architectural thinking.
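The storage-engine half of that decomposition can be sketched in a few lines. `MiniStore` below is a hypothetical toy, not Cassandra's or BigTable's actual classes: writes land in a sorted in-memory MemTable, which is flushed as an immutable sorted SSTable once it fills, and reads check the MemTable first, then SSTables newest-first. A real engine adds a write-ahead log, binary search or Bloom filters over SSTables, and background compaction.

```python
class MiniStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}      # in-memory writes (MemTable)
        self.sstables = []      # flushed, immutable tables; newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # An SSTable is an immutable list of (key, value) sorted by key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search SSTables newest-first so the latest write wins.
        for table in reversed(self.sstables):
            for k, v in table:  # a real SSTable would use binary search
                if k == key:
                    return v
        return None
```

This engine is architecture-agnostic -- which is the decomposition point: Cassandra runs it behind a Dynamo-style ring, HBase behind a master-based layout, and the read/write path is the same in both.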