
High-level Architecture

Before diving into Cassandra's internals, let's establish the vocabulary and understand how the pieces fit together.

Cassandra's data hierarchy

| Term | What it is | Analogy |
| --- | --- | --- |
| Column | A key-value pair -- the most basic unit of data | A cell in a spreadsheet |
| Row | A container for columns, identified by a primary key | A row in a spreadsheet (but columns can vary per row) |
| Table | A container of rows | A spreadsheet |
| Keyspace | A container of tables that spans the cluster | A database in RDBMS |
| Cluster | The full set of nodes | The entire deployment |
| Node | One machine running Cassandra | One server |

Key difference from relational databases: Cassandra doesn't store null values. If a row doesn't have a particular column, that column simply doesn't exist for that row. This sparse storage model saves significant space with wide, variable-schema data.
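
To make the sparse model concrete, here is a minimal Python sketch that uses plain dictionaries as a stand-in for rows. The table contents and helper names are made up for illustration, and this is not how Cassandra lays out data on disk; it only shows the difference between storing the cells that exist and reserving a cell for every column in every row.

```python
# Illustrative only: each row is a dict holding just the columns it actually has.
rows = {
    "emp-1": {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    "emp-2": {"name": "Lin"},  # no email or phone: those columns do not exist here
}


def stored_cells(table):
    """Count the cells actually stored across all rows (sparse layout)."""
    return sum(len(columns) for columns in table.values())


def dense_cells(table):
    """Count the cells a fixed-schema layout would reserve (rows x all columns)."""
    all_columns = set()
    for columns in table.values():
        all_columns.update(columns)
    return len(table) * len(all_columns)


print(stored_cells(rows))  # 4
print(dense_cells(rows))   # 6
```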

NoSQL constraints

Cassandra has no joins and no foreign keys, and without secondary indexes a WHERE clause can filter only on primary key columns. You model your tables around your query patterns, not around entity relationships.

Think first
Cassandra must spread data across nodes without a central coordinator. What technique would you use, and what problem does it solve compared to simple modular hashing (key % N)?

Data partitioning: the ring

Like Dynamo, Cassandra uses consistent hashing to distribute data across nodes. All the consistent hashing and vnode concepts from Dynamo apply here identically.
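
As a rough illustration of why a ring is preferred over `key % N`, the following Python sketch counts how many keys change owner when a fifth node joins a four-node cluster. Everything here is an assumption made for the example: the node names, the 64 virtual nodes per node, and the use of MD5 as a stand-in hash (Cassandra's default partitioner is Murmur3).

```python
import bisect
import hashlib


def h(value):
    """Stand-in 64-bit hash (not Murmur3): first 8 bytes of MD5, unsigned."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_owner(key, nodes):
    """Simple modular hashing: owner index is hash % N."""
    return nodes[h(key) % len(nodes)]


def build_ring(nodes, vnodes=64):
    """Place `vnodes` tokens per node on the ring; return (tokens, owners) sorted by token."""
    pairs = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
    return [t for t, _ in pairs], [n for _, n in pairs]


def ring_owner(key, ring):
    """Owner is the first token at or after the key's hash, wrapping around the ring."""
    tokens, owners = ring
    idx = bisect.bisect_left(tokens, h(key)) % len(tokens)
    return owners[idx]


def moved(keys, before, after):
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
old_nodes = ["n1", "n2", "n3", "n4"]
new_nodes = old_nodes + ["n5"]
old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)

modulo_moved = moved(keys, lambda k: modulo_owner(k, old_nodes),
                     lambda k: modulo_owner(k, new_nodes))
ring_moved = moved(keys, lambda k: ring_owner(k, old_ring),
                   lambda k: ring_owner(k, new_ring))
print(f"modulo: {modulo_moved:.0%} of keys moved, ring: {ring_moved:.0%} of keys moved")
```

With modular hashing most keys are reassigned, because changing N changes the result of `hash % N` for almost every key. On the ring, only the keys that fall into the new node's token ranges move, and they all move to the new node.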

The interesting Cassandra-specific detail is how the primary key drives partitioning:

Primary key = partition key + clustering key

Consider a table with PRIMARY KEY (city_id, employee_id):

| Component | Column | Purpose |
| --- | --- | --- |
| Partition key | city_id | Determines which node stores the data. All rows with the same city_id live in the same partition, on the same node (and its replicas). |
| Clustering key | employee_id | Determines sort order within the partition: rows are sorted by employee_id. |

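
The two roles in the table above can be sketched with a small in-memory model in Python. This is a hypothetical illustration, not Cassandra's storage engine: `partitions`, `upsert`, and `employees_in_range` are invented names, and employee IDs are assumed to be integers.

```python
import bisect
from collections import defaultdict

# Hypothetical model: partition key -> rows kept sorted by clustering key.
partitions = defaultdict(list)


def upsert(city_id, employee_id, name):
    """Write a row into its partition, keeping rows ordered by employee_id."""
    rows = partitions[city_id]
    i = bisect.bisect_left(rows, (employee_id,))
    if i < len(rows) and rows[i][0] == employee_id:
        rows[i] = (employee_id, name)      # same primary key: overwrite the row
    else:
        rows.insert(i, (employee_id, name))


def employees_in_range(city_id, low, high):
    """Single-partition range scan: employee_id in [low, high], already sorted."""
    rows = partitions.get(city_id, [])
    start = bisect.bisect_left(rows, (low,))
    end = bisect.bisect_left(rows, (high + 1,))
    return rows[start:end]


upsert("NYC", 30, "Cy")
upsert("NYC", 10, "Ada")
upsert("SF", 20, "Bo")
upsert("NYC", 20, "Lin")

print(employees_in_range("NYC", 10, 20))  # [(10, 'Ada'), (20, 'Lin')]
```

The range query touches only the "NYC" partition and reads a contiguous slice, which is why queries that fix the partition key and range over the clustering key are cheap.
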
Interview insight

The partition key is the most critical modeling decision in Cassandra. A bad partition key leads to hotspots (one node gets all the traffic) or partitions that grow unboundedly. A good partition key distributes data evenly and keeps related queries hitting a single partition.
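
A quick way to see the hotspot effect is to simulate it. The Python sketch below is built entirely on assumptions: a made-up workload in which 90% of events come from one country, a four-node cluster, and a simplified placement function (MD5 hash modulo the node count) standing in for the real ring. It compares the share of traffic landing on the busiest node under two candidate partition keys.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key):
    """Simplified placement: hash the partition key and map it to one of NODES nodes."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def hottest_share(keys):
    """Fraction of all requests that land on the single busiest node."""
    load = Counter(node_for(k) for k in keys)
    return max(load.values()) / len(keys)


# Made-up workload: (country, user_id) per event, with 90% of events from "US".
events = [("US" if i % 10 else "NZ", f"user-{i}") for i in range(10_000)]

by_country = hottest_share([country for country, _ in events])
by_user = hottest_share([user for _, user in events])
print(f"partition by country: busiest node gets {by_country:.0%}")
print(f"partition by user_id: busiest node gets {by_user:.0%}")
```

A low-cardinality, skewed key such as country sends at least 90% of this workload to one node, while a high-cardinality key such as user_id spreads it close to evenly.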

The partitioner

The partitioner hashes the partition key to determine placement on the ring. Cassandra uses Murmur3 by default (a fast, well-distributed hash function), producing a 64-bit token in the range $-2^{63}$ to $2^{63} - 1$.
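
The key-to-token-to-node path can be sketched in a few lines of Python. Two assumptions to keep in mind: the hash below is a stand-in (the first 8 bytes of MD5 read as a signed 64-bit integer), not Cassandra's Murmur3, so the tokens it produces will not match a real cluster; and the four-node ring with round-number tokens is hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1


def token(partition_key):
    """Stand-in for Murmur3: a signed 64-bit token from the first 8 bytes of MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Hypothetical ring: each node owns the range that ends at its token (inclusive).
ring = sorted([
    (-6_000_000_000_000_000_000, "node-a"),
    (-1_000_000_000_000_000_000, "node-b"),
    (4_000_000_000_000_000_000, "node-c"),
    (8_000_000_000_000_000_000, "node-d"),
])
ring_tokens = [t for t, _ in ring]


def owner_for_token(t):
    """First node whose token is >= t; past the last token, wrap to the first node."""
    idx = bisect.bisect_left(ring_tokens, t) % len(ring)
    return ring[idx][1]


def owner(partition_key):
    """Hash the partition key to a token, then look up the owning node."""
    return owner_for_token(token(partition_key))


print(owner_for_token(0))                          # node-c
print(owner_for_token(9_000_000_000_000_000_000))  # node-a (wraps around)
print(owner("city-42"))
```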

Murmur3

Murmur3 is a non-cryptographic hash function optimized for speed and distribution quality. The name comes from the two operations in its inner loop: multiply (MU) and rotate (R), repeated. Once a cluster is initialized with a partitioner, the partitioner cannot be changed without reloading all of the data.

Think first
In a decentralized system like Cassandra, how can a client send a request to any node and still reach the correct data? What information must every node have?

Coordinator node

A client can connect to any Cassandra node. The contacted node becomes the coordinator for that request. Because every node knows the full ring topology (via gossip), any node can determine which nodes own a particular key and forward the request accordingly.
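
The routing idea can be sketched as a toy Python model. It is heavily simplified and the names (`Node`, `coordinate_write`, `coordinate_read`) are invented for illustration: there is one replica per key, no consistency levels, and a direct method call stands in for the network hop. The ring view is handed to each node up front, whereas in Cassandra it is learned via gossip. The hash is again an MD5 stand-in rather than Murmur3.

```python
import bisect
import hashlib


def token(key):
    """Stand-in hash (not Murmur3): signed 64-bit token from the first 8 bytes of MD5."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Node:
    def __init__(self, name, ring):
        self.name = name
        self.ring = ring   # full ring view: sorted list of (token, node name)
        self.data = {}     # rows this node owns

    def owner_of(self, key):
        """Any node can compute the owner locally from its copy of the ring."""
        tokens = [t for t, _ in self.ring]
        idx = bisect.bisect_left(tokens, token(key)) % len(self.ring)
        return self.ring[idx][1]

    def coordinate_write(self, key, value, cluster):
        """Act as coordinator: find the owner and forward the write to it."""
        owner = cluster[self.owner_of(key)]
        owner.data[key] = value   # stands in for a network hop when owner is not self
        return owner.name

    def coordinate_read(self, key, cluster):
        """Act as coordinator: find the owner and forward the read to it."""
        return cluster[self.owner_of(key)].data.get(key)


names = ["node-a", "node-b", "node-c"]
ring = sorted((token(n), n) for n in names)
cluster = {n: Node(n, ring) for n in names}

# The client writes through one node and reads through another.
stored_on = cluster["node-a"].coordinate_write("user:42", "Ada", cluster)
print(cluster["node-c"].coordinate_read("user:42", cluster))  # Ada
```

Because every node holds the same ring view, the write coordinated by node-a and the read coordinated by node-c resolve to the same owner without consulting any central router.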

This is identical to Dynamo's approach -- no central routing. Any node can coordinate any request.

Quiz
You have a table with PRIMARY KEY (user_id, timestamp). Your application queries always filter by user_id and a timestamp range. What would happen if you changed the primary key to PRIMARY KEY (timestamp, user_id)?