1 Bloom Filters
Imagine you have terabytes of data spread across thousands of files on disk. A request comes in: "Does record #48291 exist?" You don't want to read every file to find out -- that's painfully slow. You could build an index, but even binary-searching an index means disk I/O for every lookup.
What if you could answer "definitely not here" or "probably here" using only a tiny amount of memory, in constant time?
Definition
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you with certainty that an element is not in the set. But if it says an element is in the set, there's a small chance it's wrong (a false positive).
The trade-off: you get massive speed and memory savings in exchange for accepting a small, controllable false-positive rate. There are never false negatives -- if the Bloom filter says "no," it means no.
How it works
A Bloom filter is a bit array of m bits, all initially set to 0, combined with k independent hash functions.
Adding an element:
- Feed the element through all k hash functions
- Each hash function produces a position in the bit array
- Set all k positions to 1
Checking membership:
- Feed the element through the same k hash functions
- Check all k positions in the bit array
- If any position is 0 → the element is definitely not in the set
- If all positions are 1 → the element might be in the set (could be a false positive from other elements setting those same bits)
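The two operations above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it simulates k independent hash functions by seeding SHA-256 with an index, and the class and method names are mine, not a standard API.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # number of bits in the array
        self.k = k            # number of hash functions
        self.bits = [0] * m   # all bits start at 0

    def _positions(self, item):
        # Simulate k independent hash functions by prefixing a seed,
        # then reduce each digest to a position in [0, m).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Set all k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set";
        # True means "probably in the set" (false positives possible).
        return all(self.bits[pos] == 1 for pos in self._positions(item))
```

Note that `might_contain` can short-circuit as soon as it sees a 0 bit, which is exactly the "definitely not" case.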
Consider a Bloom filter of 20 bits with three hash functions, holding three elements P, Q, and R. Each element is mapped to three bit positions, all of which are set to 1.
- An element X that hashes to at least one bit position containing 0 is definitely not in the set.
- Adding a new element and testing for membership are both O(k) -- constant time, regardless of how many elements are in the filter.
- A filter sized for n elements with a desired false-positive rate p requires approximately m = -(n × ln p) / (ln 2)² bits.
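That sizing formula, together with the standard optimal hash count k = (m/n) × ln 2, can be computed directly. A sketch (the function name is illustrative):

```python
import math


def bloom_parameters(n, p):
    """Return (m, k): bits and hash count for n elements at false-positive rate p."""
    # m = -(n * ln p) / (ln 2)^2
    m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2
    k = max(1, round((m / n) * math.log(2)))
    return m, k


# Sizing for 1 million elements at a 1% false-positive rate:
m, k = bloom_parameters(1_000_000, 0.01)
```

For a million elements at 1%, this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions -- under 10 bits per element.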
As you add more elements, more bits get set to 1. Eventually, most bits are 1, and random elements start "matching" by coincidence -- that's when your false-positive rate climbs. The fix: allocate a larger bit array (with the matching optimal number of hash functions, k = (m/n) × ln 2) -- adding hash functions alone only fills the array faster. In practice, roughly 10 bits (a little over a byte) per element keeps the false-positive rate under 1%.
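How quickly the false-positive rate climbs with fill can be estimated with the standard approximation p ≈ (1 - e^(-kn/m))^k. A sketch (the function name is mine):

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false-positive rate after n insertions into m bits with k hashes."""
    # Probability any given bit is still 0 after n insertions: e^(-k*n/m).
    # A false positive requires all k probed bits to be 1.
    return (1 - math.exp(-k * n / m)) ** k
```

At the sizing above (about 9.6 bits per element, k = 7) this evaluates to roughly 1%; double the element count without resizing and it jumps above 15%.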
When to use Bloom filters
- Before expensive lookups: Check the Bloom filter first. If it says "no," skip the disk/network call entirely.
- Caching negative lookups: If your system frequently queries for keys that don't exist, a Bloom filter eliminates most of those wasted lookups.
- Network bandwidth savings: Instead of sending a full list of records to compare, send a compact Bloom filter.
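The "check before the expensive lookup" pattern from the first bullet looks like this in practice. A sketch: `bloom` is any filter exposing the hypothetical `might_contain`/`add` methods from earlier, and `disk_lookup` stands in for the slow call; the demo uses a set-backed stand-in filter and a dict as the "disk" so it runs on its own.

```python
def get_record(key, bloom, disk_lookup):
    """Consult the in-memory filter before paying for the slow lookup."""
    if not bloom.might_contain(key):
        return None               # "definitely not" -- skip disk entirely
    return disk_lookup(key)       # "probably" -- confirm with the real read


# Demo with stand-ins: a set plays the filter (its membership test is exact,
# so "probably" is always right here), and a dict plays the disk.
class ExactFilter:
    def __init__(self):
        self._seen = set()

    def add(self, key):
        self._seen.add(key)

    def might_contain(self, key):
        return key in self._seen


store = {"48291": {"name": "example"}}
bloom = ExactFilter()
for key in store:
    bloom.add(key)

record = get_record("48291", bloom, store.get)   # passes the filter, hits "disk"
missing = get_record("99999", bloom, store.get)  # filtered out, no disk read at all
```

The essential point is that the `None` branch never touches disk: every query for an absent key is answered from memory.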
Examples
BigTable
In BigTable, any read must check multiple SSTables on disk. Without optimization, this means multiple disk accesses per read. BigTable creates a Bloom filter for each SSTable. Before reading an SSTable from disk, it checks the Bloom filter: "Could the row I'm looking for be in this file?" If the answer is "definitely not," it skips that file entirely. This dramatically reduces disk I/O.
Cassandra
Cassandra uses the same technique -- Bloom filters on SSTables to avoid unnecessary disk reads during lookups.
CDNs and web caches
Content delivery networks use Bloom filters to track which objects are cached. A quick Bloom filter check can determine whether to serve from cache or fetch from origin, without maintaining a full index of cached content.
If an interviewer asks you to design a system where reads are slow because of scanning many files or partitions, Bloom filters should be one of the first tools you reach for. The canonical answer: "Before touching disk, check a Bloom filter in memory. If it says the data isn't there, skip the file."