High-level Architecture

Before diving into Cassandra's internals, let's establish the vocabulary and understand how the pieces fit together.

Cassandra's data hierarchy

Term	What it is	Analogy
Column	A key-value pair -- the most basic unit of data	A cell in a spreadsheet
Row	A container for columns, identified by a primary key	A row in a spreadsheet (but columns can vary per row)
Table	A container of rows	A spreadsheet
Keyspace	A container of tables that spans the cluster	A database in RDBMS
Cluster	The full set of nodes	The entire deployment
Node	One machine running Cassandra	One server

Key difference from relational databases: Cassandra doesn't store a placeholder for values that were never written. If a row doesn't have a particular column, that column simply doesn't exist for that row. This sparse storage model saves significant space when tables are wide and most rows populate only a subset of the columns.
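
To make the sparse model concrete, here is a minimal in-memory sketch. It is not how Cassandra lays out data on disk; the `table` dictionary, the row keys, and the `read_column` helper are all hypothetical names chosen for illustration. It only shows that absent columns cost nothing, whereas a dense layout would reserve a cell for every column in every row.

```python
# Hypothetical in-memory model of a sparse table: each row stores only
# the columns that were actually written, so there are no null placeholders.
table = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Lin", "phone": "555-0100", "city": "Oslo"},
}


def read_column(table, row_key, column):
    """Return the stored value, or None if the row never had that column."""
    return table.get(row_key, {}).get(column)


# A dense (relational-style) layout needs one cell per column per row.
all_columns = set().union(*(row.keys() for row in table.values()))
dense_cells = len(table) * len(all_columns)

# The sparse layout stores only the cells that exist.
sparse_cells = sum(len(row) for row in table.values())

print(read_column(table, "user:1", "phone"))  # the column was never written
print(f"dense cells: {dense_cells}, sparse cells: {sparse_cells}")
```

With four distinct columns across two rows, the dense layout would hold eight cells while the sparse one holds five; the gap widens as the number of optional columns grows.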

NoSQL constraints

Cassandra has no joins and no foreign keys. By default, WHERE clauses can filter only on primary key columns; filtering on any other column requires a secondary index or ALLOW FILTERING, and both carry a performance cost. As a result, you model your tables around your query patterns, not around entity relationships.

Think first

Cassandra must spread data across nodes without a central coordinator. What technique would you use, and what problem does it solve compared to simple modular hashing (key % N)?

Data partitioning: the ring

Like Dynamo, Cassandra uses consistent hashing to distribute data across nodes. All the consistent hashing and vnode concepts from Dynamo apply here identically.
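
As a rough illustration of why consistent hashing is preferred over `key % N`, the sketch below counts how many keys change owner when a fifth node joins a four-node cluster. Everything here is an assumption made for the example: MD5 stands in for the real partitioner, each node gets 64 vnode tokens, and the node names `n1` to `n5` are made up.

```python
import bisect
import hashlib


def h(value):
    """Deterministic 64-bit hash (MD5 stand-in, not Cassandra's partitioner)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def mod_owner(key, nodes):
    """Simple modular hashing: owner index is hash % N."""
    return nodes[h(key) % len(nodes)]


def build_ring(nodes, vnodes=64):
    """Give each node `vnodes` tokens; return parallel sorted lists."""
    pairs = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
    return [t for t, _ in pairs], [n for _, n in pairs]


def ring_owner(key, ring):
    """Owner is the node holding the first token at or after the key's hash."""
    tokens, owners = ring
    idx = bisect.bisect_left(tokens, h(key)) % len(tokens)  # wrap around the ring
    return owners[idx]


keys = [f"key-{i}" for i in range(2000)]
old_nodes = ["n1", "n2", "n3", "n4"]
new_nodes = old_nodes + ["n5"]

mod_moved = sum(mod_owner(k, old_nodes) != mod_owner(k, new_nodes) for k in keys)

old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
ring_moved = [k for k in keys if ring_owner(k, old_ring) != ring_owner(k, new_ring)]

print(f"modular hashing moved {mod_moved / len(keys):.0%} of keys")
print(f"consistent hashing moved {len(ring_moved) / len(keys):.0%} of keys")
```

With modular hashing, most keys are reassigned because changing N changes almost every remainder. On the ring, only keys that fall into the new node's token ranges move, and they all move to the new node.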

The interesting Cassandra-specific detail is how the primary key drives partitioning:

Primary key = partition key + clustering key

Consider a table with PRIMARY KEY (city_id, employee_id):

Component	Column	Purpose
Partition key	`city_id`	Determines which node (and its replicas) stores the data. All rows with the same `city_id` belong to the same partition and therefore live on the same set of nodes.
Clustering key	`employee_id`	Determines sort order within the partition. Within each partition, rows are stored sorted by `employee_id`.
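
The two roles can be sketched with a small in-memory model. This is only an analogy under stated assumptions: the `partitions` structure and the `insert` and `range_query` helpers are hypothetical, and real storage involves memtables and SSTables that are not modeled here. The partition key selects a bucket, and the clustering key keeps rows ordered inside that bucket so range scans are cheap.

```python
import bisect
from collections import defaultdict

# partition key -> sorted clustering keys plus the rows they identify
partitions = defaultdict(lambda: {"keys": [], "rows": {}})


def insert(city_id, employee_id, name):
    """Place the row in its partition, keeping clustering-key order."""
    part = partitions[city_id]
    if employee_id not in part["rows"]:
        bisect.insort(part["keys"], employee_id)
    part["rows"][employee_id] = name  # same primary key overwrites (upsert)


def range_query(city_id, lo, hi):
    """Rows in one partition with lo <= employee_id <= hi, in clustering order."""
    part = partitions.get(city_id)
    if part is None:
        return []
    start = bisect.bisect_left(part["keys"], lo)
    end = bisect.bisect_right(part["keys"], hi)
    return [(k, part["rows"][k]) for k in part["keys"][start:end]]


insert("NYC", 30, "Cho")
insert("NYC", 10, "Ana")
insert("NYC", 20, "Ben")
insert("SF", 15, "Dee")

print(range_query("NYC", 10, 20))  # touches only the NYC partition
```

The range query never looks at the `SF` partition, which mirrors why a query that supplies the partition key and a clustering-key range can be served from a single partition.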

Interview insight

The partition key is the most critical modeling decision in Cassandra. A bad partition key leads to hotspots (one node gets all the traffic) or partitions that grow unboundedly. A good partition key distributes data evenly and keeps related queries hitting a single partition.
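
One way to reason about a candidate partition key is to estimate how much of the data would land in its largest partition. The sketch below uses synthetic events and a hypothetical `largest_partition_share` helper; the column names `country` and `user_id` and the 90/10 skew are invented for the example.

```python
from collections import Counter

# Synthetic workload: 1000 events, 90% from one country, spread over 200 users.
events = [
    {"country": "US" if i % 10 else "NZ", "user_id": i % 200}
    for i in range(1000)
]


def largest_partition_share(rows, partition_key):
    """Fraction of all rows that fall into the single biggest partition."""
    counts = Counter(row[partition_key] for row in rows)
    return max(counts.values()) / len(rows)


print("country as partition key:", largest_partition_share(events, "country"))
print("user_id as partition key:", largest_partition_share(events, "user_id"))
```

A low-cardinality, skewed key such as `country` concentrates most rows in one partition, while `user_id` spreads them thinly. Whether `user_id` is the right choice still depends on the queries the table must serve.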

The partitioner

The partitioner hashes the partition key to determine placement on the ring. Cassandra uses Murmur3 by default (a fast, well-distributed hash function), producing a 64-bit token in the range $-2^{63}$ to $2^{63} - 1$.
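
The mapping from partition key to token can be sketched as follows. Note the assumption: Python's standard library has no Murmur3, so SHA-256 is used as a stand-in, and the resulting token values will not match what a real cluster computes. The `token` function name is hypothetical; only the signed 64-bit range mirrors the description above.

```python
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token(partition_key):
    """Map a partition key to a signed 64-bit token.

    Assumption: SHA-256 is a stand-in; Cassandra's default partitioner
    uses Murmur3, so real token values differ from these.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a signed 64-bit integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


for key in ("NYC", "SF", "LA"):
    t = token(key)
    assert MIN_TOKEN <= t <= MAX_TOKEN
    print(f"{key!r} -> token {t}")
```

The same key always yields the same token, which is what lets every node agree on where a partition lives without consulting a central directory.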

Murmur3

Murmur3 is a non-cryptographic hash function optimized for speed and distribution quality. The name comes from the two basic operations in its inner loop: multiply (MU) and rotate (R). Once a cluster is initialized with a partitioner, the partitioner cannot be changed without reloading all of the data.

Think first

In a decentralized system like Cassandra, how can a client send a request to any node and still reach the correct data? What information must every node have?

Coordinator node

A client can connect to any Cassandra node. The contacted node becomes the coordinator for that request. Because every node knows the full ring topology (via gossip), any node can determine which nodes own a particular key and forward the request accordingly.
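
A toy model of coordination is sketched below. Several simplifications are assumed: the ring is handed to each node directly instead of being learned through gossip, each node owns a single token, replication is ignored, and SHA-256 stands in for the real partitioner. The `Node` class and its methods are hypothetical names.

```python
import bisect
import hashlib


def token(key):
    """Stand-in hash, not Cassandra's Murmur3 partitioner."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big", signed=True)


class Node:
    def __init__(self, name, ring):
        self.name = name
        self.ring = ring   # sorted (token, node name) pairs, identical on every node
        self.data = {}     # keys this node owns
        self.peers = {}    # name -> Node, filled in once all nodes exist

    def owner_of(self, key):
        """Find the node holding the first token at or after the key's token."""
        tokens = [t for t, _ in self.ring]
        idx = bisect.bisect_left(tokens, token(key)) % len(self.ring)
        return self.ring[idx][1]

    def coordinate_write(self, key, value):
        """Act as coordinator: forward the write to the owning node."""
        owner = self.owner_of(key)
        self.peers[owner].data[key] = value
        return owner

    def coordinate_read(self, key):
        """Act as coordinator: fetch the value from the owning node."""
        return self.peers[self.owner_of(key)].data.get(key)


names = ["a", "b", "c"]
ring = sorted((token(f"{n}-token"), n) for n in names)  # one token per node for brevity
cluster = {n: Node(n, ring) for n in names}
for node in cluster.values():
    node.peers = cluster  # includes the node itself

cluster["a"].coordinate_write("user:42", "Ada")
print([cluster[n].coordinate_read("user:42") for n in names])  # same answer from any node
```

Because every node holds the same ring view, each one computes the same owner for `user:42`, so the client gets the same result regardless of which node it contacts.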

This is identical to Dynamo's approach -- no central routing. Any node can coordinate any request.

Quiz

You have a table with PRIMARY KEY (user_id, timestamp). Your application queries always filter by user_id and a timestamp range. What would happen if you changed the primary key to PRIMARY KEY (timestamp, user_id)?

A. No difference -- both keys produce the same data distribution and query performance.

B. Queries by user_id would become inefficient because data for a single user would be scattered across many partitions (one per timestamp).

C. Cassandra would reject this schema because timestamp cannot be a partition key.

D. Write performance would improve because timestamps are naturally sequential.
