1 Bloom Filters
Imagine you have terabytes of data spread across thousands of files on disk. A request comes in: "Does record #48291 exist?" You don't want to read every file to find out -- that's painfully slow. You could build an index, but even binary-searching an index means disk I/O for every lookup.
What if you could answer "definitely not here" or "probably here" using only a tiny amount of memory, in constant time?
Definition
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you with certainty that an element is not in the set. But if it says an element is in the set, there's a small chance it's wrong (a false positive).
The trade-off: you get massive speed and memory savings in exchange for accepting a small, controllable false-positive rate. There are never false negatives -- if the Bloom filter says "no," it means no.
How it works
A Bloom filter is a bit array of m bits, all initially set to 0, combined with k independent hash functions.
Adding an element:
- Feed the element through all k hash functions
- Each hash function produces a position in the bit array
- Set all k positions to 1
Checking membership:
- Feed the element through the same k hash functions
- Check all k positions in the bit array
- If any position is 0 → the element is definitely not in the set
- If all positions are 1 → the element might be in the set (could be a false positive from other elements setting those same bits)
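The two operations above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it simulates k independent hash functions by seeding SHA-256 with an index, and the class and method names are mine, not a standard API.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # number of bits in the array
        self.k = k            # number of hash functions
        self.bits = [0] * m   # all bits start at 0

    def _positions(self, item):
        # Simulate k independent hash functions by prefixing a seed,
        # then reduce each digest to a position in [0, m).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Set all k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set";
        # True means "probably in the set" (false positives possible).
        return all(self.bits[pos] == 1 for pos in self._positions(item))
```

Note that `might_contain` can short-circuit as soon as it sees a 0 bit, which is exactly the "definitely not" case.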
Consider a Bloom filter of 20 bits with three hash functions, holding three elements P, Q, and R. Each element is mapped to three bit positions, all of which are set to 1.
- An element X that hashes to at least one bit position containing 0 is definitely not in the set.
- Adding a new element and testing for membership are both O(k) -- constant time, regardless of how many elements are in the filter.
- A filter sized for n elements with a desired false-positive rate p requires approximately m = -(n × ln p) / (ln 2)² bits.
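That sizing formula, together with the standard optimal hash count k = (m/n) × ln 2, can be computed directly. A sketch (the function name is illustrative):

```python
import math


def bloom_parameters(n, p):
    """Return (m, k): bits and hash count for n elements at false-positive rate p."""
    # m = -(n * ln p) / (ln 2)^2
    m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2
    k = max(1, round((m / n) * math.log(2)))
    return m, k


# Sizing for 1 million elements at a 1% false-positive rate:
m, k = bloom_parameters(1_000_000, 0.01)
```

For a million elements at 1%, this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions -- under 10 bits per element.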
As you add more elements, more bits get set to 1. Eventually, most bits are 1, and random elements start "matching" by coincidence -- that's when your false-positive rate climbs. The fix: allocate a larger bit array (with the matching optimal number of hash functions, k = (m/n) × ln 2) -- adding hash functions alone only fills the array faster. In practice, roughly 10 bits (a little over a byte) per element keeps the false-positive rate under 1%.
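How quickly the false-positive rate climbs with fill can be estimated with the standard approximation p ≈ (1 - e^(-kn/m))^k. A sketch (the function name is mine):

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false-positive rate after n insertions into m bits with k hashes."""
    # Probability any given bit is still 0 after n insertions: e^(-k*n/m).
    # A false positive requires all k probed bits to be 1.
    return (1 - math.exp(-k * n / m)) ** k
```

At the sizing above (about 9.6 bits per element, k = 7) this evaluates to roughly 1%; double the element count without resizing and it jumps above 15%.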
When to use Bloom filters
- Before expensive lookups: Check the Bloom filter first. If it says "no," skip the disk/network call entirely.
- Caching negative lookups: If your system frequently queries for keys that don't exist, a Bloom filter eliminates most of those wasted lookups.
- Network bandwidth savings: Instead of sending a full list of records to compare, send a compact Bloom filter.
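The "check before the expensive lookup" pattern from the first bullet looks like this in practice. A sketch: `bloom` is any filter exposing the hypothetical `might_contain`/`add` methods from earlier, and `disk_lookup` stands in for the slow call; the demo uses a set-backed stand-in filter and a dict as the "disk" so it runs on its own.

```python
def get_record(key, bloom, disk_lookup):
    """Consult the in-memory filter before paying for the slow lookup."""
    if not bloom.might_contain(key):
        return None               # "definitely not" -- skip disk entirely
    return disk_lookup(key)       # "probably" -- confirm with the real read


# Demo with stand-ins: a set plays the filter (its membership test is exact,
# so "probably" is always right here), and a dict plays the disk.
class ExactFilter:
    def __init__(self):
        self._seen = set()

    def add(self, key):
        self._seen.add(key)

    def might_contain(self, key):
        return key in self._seen


store = {"48291": {"name": "example"}}
bloom = ExactFilter()
for key in store:
    bloom.add(key)

record = get_record("48291", bloom, store.get)   # passes the filter, hits "disk"
missing = get_record("99999", bloom, store.get)  # filtered out, no disk read at all
```

The essential point is that the `None` branch never touches disk: every query for an absent key is answered from memory.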
Examples
BigTable
In BigTable, any read must check multiple SSTables on disk. Without optimization, this means multiple disk accesses per read. BigTable creates a Bloom filter for each SSTable. Before reading an SSTable from disk, it checks the Bloom filter: "Could the row I'm looking for be in this file?" If the answer is "definitely not," it skips that file entirely. This dramatically reduces disk I/O.
Cassandra
Cassandra uses the same technique -- Bloom filters on SSTables to avoid unnecessary disk reads during lookups.
CDNs and web caches
Content delivery networks use Bloom filters to track which objects are cached. A quick Bloom filter check can determine whether to serve from cache or fetch from origin, without maintaining a full index of cached content.
If an interviewer asks you to design a system where reads are slow because of scanning many files or partitions, Bloom filters should be one of the first tools you reach for. The canonical answer: "Before touching disk, check a Bloom filter in memory. If it says the data isn't there, skip the file."