Cassandra Introduction

Design a distributed and scalable system that can store a huge amount of structured data, indexed by a row key, where each row can have an unbounded number of columns.

Why Cassandra matters

What if you could take the best ideas from two of the most influential distributed systems ever built and combine them? That's exactly what Facebook did in 2007.

Facebook needed a system for their inbox search -- billions of messages, constant writes, global distribution, and no tolerance for downtime. Relational databases couldn't scale. Pure key-value stores like Dynamo lacked the rich data model they needed. And BigTable-style systems required centralized coordination that created bottlenecks.

So Facebook took Dynamo's decentralized, peer-to-peer architecture and married it with BigTable's column-family data model. The result: Cassandra -- a system that scales linearly, has no single point of failure, and lets you tune consistency per-query.

Cassandra has powered some of the largest datasets in the world: Apple runs over 150,000 Cassandra instances. Netflix uses it as the backbone of their streaming infrastructure. Discord stored billions of messages in it before moving to ScyllaDB.

Interview insight

Cassandra is the best case study for understanding hybrid design. When an interviewer asks you to design a system that needs both Dynamo's availability and BigTable's data model, you're essentially being asked to reason about the same trade-offs Cassandra navigates. Understanding Cassandra means understanding both its parent systems.

What is Cassandra?

Cassandra is a distributed, decentralized, scalable, and highly available NoSQL wide-column database. Here's what sets it apart:

Property	What it means
Peer-to-peer	Every node is equal -- no leader, no single point of failure. Any node can serve any request. (From Dynamo)
Wide-column model	Data is organized by row key into column families, where each row can have different columns. (From BigTable)
Tunable consistency	You choose per-query how many replicas must agree. High consistency or high availability -- your call.
Linear scalability	Add a node, get proportional throughput increase. No downtime, no manual rebalancing.
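
To make the wide-column idea concrete, here is a minimal in-memory sketch in which each row key maps to its own set of columns. The `table` dictionary, the `put` and `columns_of` helpers, and the sample keys are all hypothetical names chosen for illustration; this shows the shape of the data model only, not how Cassandra stores data on disk.

```python
from collections import defaultdict

# Hypothetical in-memory model: row key -> {column name -> value}.
# It illustrates sparse rows only; it is not Cassandra's storage format.
table = defaultdict(dict)


def put(row_key, column, value):
    """Set one column on one row; other rows are unaffected."""
    table[row_key][column] = value


def columns_of(row_key):
    """Return the column names present on a row, in sorted order."""
    return sorted(table.get(row_key, {}))


# Two rows in the same column family with different column sets.
put("user:1", "name", "Ada")
put("user:1", "email", "ada@example.com")
put("user:2", "name", "Lin")
put("user:2", "last_login", "2024-01-01")
```

Here `columns_of("user:1")` and `columns_of("user:2")` return different column lists, which is what "each row can have different columns" means in practice.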

In CAP theorem terms, Cassandra is typically classified as an AP system -- availability and partition tolerance take priority. But where Dynamo fixes its quorum settings per deployment, Cassandra lets you shift toward CP behavior on a per-query basis by adjusting consistency levels.

Replication factor

The number of nodes that store copies of the same data. A replication factor of 3 means every piece of data exists on three different nodes.

Consistency level

The minimum number of replicas that must acknowledge a read or write before it's considered successful. Set this high for strong consistency, low for high availability.
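
As a rough illustration of how a consistency level turns into a replica count, the sketch below maps a level name to the number of acknowledgements required for a given replication factor. The level names mirror Cassandra's ONE, TWO, THREE, QUORUM, and ALL; the `required_acks` function itself is a hypothetical helper, not part of any driver API.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed for a consistency level."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    needed_by_level = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,  # strict majority of replicas
        "ALL": replication_factor,
    }
    needed = needed_by_level[level]
    if needed > replication_factor:
        # The level asks for more replicas than the data has copies.
        raise ValueError(
            f"{level} needs {needed} replicas, only {replication_factor} exist"
        )
    return needed
```

With a replication factor of 3, QUORUM needs 2 acknowledgements, so one replica can be down without failing the request; ALL needs all 3, so a single unavailable replica fails it.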

The key trade-off

With replication factor N, read consistency R, and write consistency W:

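If R + W > N → you get strong consistency: every set of R replicas read overlaps every set of W replicas written in at least one node, so a read sees the latest acknowledged write
If R + W ≤ N → you get higher availability and lower latency, but a read can miss every replica that holds the latest write and return stale data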

This is the same quorum math used in Dynamo, but Cassandra exposes it as a first-class API concept.
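
The quorum rule can be checked by brute force. The sketch below compares the arithmetic test `R + W > N` with an exhaustive check that every possible read set shares at least one replica with every possible write set. Both function names are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Arithmetic form of the rule: read and write sets must overlap."""
    return r + w > n


def always_intersect(n: int, r: int, w: int) -> bool:
    """Exhaustive form: every R-subset of replicas shares a node with
    every W-subset. Only practical for small n."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

For N = 3, quorum reads and writes (R = 2, W = 2) pass both checks, while R = 1, W = 1 fails both: a read can land on a replica that the write never reached.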

Cassandra's DNA: Dynamo + BigTable

Component	Inherited from	What Cassandra took
Partitioning	Dynamo	Consistent hashing with virtual nodes
Replication	Dynamo	Quorum-based with tunable consistency
Failure detection	Dynamo	Gossip protocol for membership and node health
Temporary failure handling	Dynamo	Hinted handoff
Data model	BigTable	Column families, sparse rows, sorted columns
Storage engine	BigTable	SSTables with Bloom filters
Write path	BigTable	Commit log append + memtable write → flush to SSTable (log-structured)
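
The partitioning and replication rows of the table can be sketched together: hash each node onto a ring several times (virtual nodes), hash the row key onto the same ring, then walk clockwise collecting distinct nodes until the replication factor is met. This is a simplified sketch under stated assumptions. The `Ring` class, the `token` function, and the `"node#i"` naming for virtual nodes are hypothetical, MD5 is used only because it ships with Python's standard library, and the walk ignores racks and datacenters.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (illustrative hash choice)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns `vnodes` positions on the ring.
        self._entries = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, key: str, replication_factor: int):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        size = len(self._entries)
        # Modulo handles wraparound past the largest token.
        start = bisect.bisect_right(self._tokens, token(key)) % size
        owners = []
        for step in range(size):
            node = self._entries[(start + step) % size][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == replication_factor:
                    break
        return owners


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", replication_factor=3)
```

Because the placement depends only on hashes, any node can compute `owners` for a key locally, which is what allows a leaderless design where any node can coordinate a request.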

What Cassandra dropped from Dynamo

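Cassandra does not use vector clocks. Instead, it uses last-write-wins (LWW) based on timestamps: when two versions of the same column conflict, the one with the higher timestamp is kept. This simplifies the API dramatically -- clients never have to resolve conflicts -- but means concurrent writes to the same column can silently lose data, because the losing write is discarded without any error. For Cassandra's typical workloads (time-series, event logs, append-heavy), which mostly insert new data rather than update existing values, this is an acceptable trade-off.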
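
A small sketch shows how last-write-wins can drop an update. The `Cell` type and `merge` function are hypothetical names, and the tie-break on value is an assumption of this sketch, included only so that the result does not depend on argument order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # writer-supplied timestamp, e.g. microseconds since epoch


def merge(a: Cell, b: Cell) -> Cell:
    """Last-write-wins: keep the cell with the larger timestamp.

    Tie-breaking on the value is an assumption of this sketch so that
    merge() is deterministic regardless of argument order.
    """
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value >= b.value else b


# Two clients update the same cell concurrently. Client B's clock is
# slightly behind, so its write loses no matter which one arrives last.
from_a = Cell(value="shipped", timestamp=1_000_005)
from_b = Cell(value="cancelled", timestamp=1_000_002)

winner = merge(from_a, from_b)
```

Client B receives a successful acknowledgement, yet its value never becomes visible. No conflict is reported to either client, which is the silent data loss described above.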

Cassandra use cases

Good fit:

Write-heavy workloads -- Cassandra's log-structured storage engine turns writes into sequential appends, which makes them extremely fast. Sensor logs, IoT telemetry, and event streaming are natural fits.
High availability with linear scaling -- Sites such as Reddit and Digg adopted Cassandra because it scales horizontally without downtime.
Time-series data -- The column model maps naturally to time-indexed data, and the write performance handles high-frequency ingestion.

Poor fit:

Strong consistency requirements -- While tunable, strong consistency in Cassandra costs latency and availability, because more replicas must respond to every request. Workloads such as financial transactions are better served by a CP system.
Complex queries with joins -- Cassandra has no join support. If your access patterns require multi-table relationships, consider a relational database.
Small datasets -- Cassandra's operational overhead isn't justified for data that fits on a single machine.

Note: The following chapters are Cassandra version-agnostic and explore the general design and architectural principles rather than specific implementation details.

What's next

In the following chapters, we'll explore:

Cassandra's high-level architecture and ring topology
How data gets replicated across nodes
Consistency levels and how to tune them
The gossiper -- Cassandra's implementation of the gossip protocol
The anatomy of write and read operations
Compaction and tombstones -- how Cassandra manages storage
