Kafka Introduction
Why Kafka matters
In 2010, LinkedIn had a problem that every fast-growing tech company eventually faces: data was everywhere, and nothing could keep up.
User activity events, application metrics, server logs, database changelogs -- all of this data needed to flow between dozens of internal systems. The existing solutions (traditional message queues, direct database-to-database transfers, custom ETL pipelines) were breaking under the load. They were too slow, too fragile, or couldn't handle the volume.
LinkedIn needed something fundamentally different: a system that could ingest millions of events per second, store them durably, and let multiple independent consumers read the same data stream at their own pace. Traditional message queues (like RabbitMQ) deleted messages after delivery. Databases weren't designed for high-throughput streaming. Nothing fit.
So LinkedIn built Kafka -- and in doing so, created an entirely new category of infrastructure: the distributed commit log.
Kafka appears in system design interviews constantly -- not always by name, but by pattern. Any time an interviewer asks you to design a system that needs to "decouple producers from consumers," "handle bursty traffic," or "enable real-time analytics alongside batch processing," they're describing Kafka's use case. Understanding Kafka gives you the vocabulary for all of these.
What is Kafka?
Apache Kafka is a distributed, durable, fault-tolerant streaming platform. At its core, it's a system that:
- Takes in streams of messages from applications (producers)
- Stores them reliably on a cluster of machines (brokers)
- Delivers them to applications that process them (consumers)
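The producer → broker → consumer flow above can be sketched as a minimal in-memory model. This is a toy for intuition, not the real Kafka API -- the `Broker` class and its methods are invented here:

```python
class Broker:
    """Toy broker: each topic is an append-only list of messages."""
    def __init__(self):
        self.topics = {}

    def append(self, topic, message):
        """Producer side: append a message, return its offset."""
        log = self.topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1  # offset of the newly written record

    def read(self, topic, offset):
        """Consumer side: pull everything from a given offset onward."""
        log = self.topics.get(topic, [])
        return log[offset:]

broker = Broker()
# Producers write events to a named topic.
broker.append("page-views", {"user": "alice", "page": "/home"})
broker.append("page-views", {"user": "bob", "page": "/search"})
# A consumer pulls from offset 0, at its own pace.
events = broker.read("page-views", 0)
```

Note that consumers *pull* by offset rather than having messages pushed at them -- that one design choice is what the next table contrasts with traditional queues.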
What makes Kafka different from a traditional message queue:
| Traditional message queue | Kafka |
|---|---|
| Messages deleted after consumption | Messages persisted on disk, retained for a configurable period |
| One consumer gets each message | Multiple consumer groups can independently read the same data |
| Optimized for message delivery | Optimized for throughput -- millions of messages per second |
| Push-based delivery | Pull-based -- consumers read at their own pace |
| No replay | Consumers can replay from any point in the stream |
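The fan-out and replay rows of the table follow from one mechanism: the log is shared and immutable, and each consumer group merely tracks its own read position. A sketch (group names and the `poll` helper are illustrative, not Kafka's API):

```python
log = ["e0", "e1", "e2", "e3"]             # one shared, immutable log
offsets = {"analytics": 0, "billing": 0}   # per-group read positions

def poll(group, max_records=2):
    """Each group reads from its own offset; the log itself never changes."""
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)
    return batch

a1 = poll("analytics")           # analytics reads the first two events
b1 = poll("billing", 4)          # billing independently reads all four
offsets["analytics"] = 0         # replay: rewind to any earlier offset
replayed = poll("analytics", 4)  # same data, read again from the start
```

Because consumption only moves a pointer, adding a new consumer group (or replaying history) costs the broker nothing extra -- it is just another reader scanning the same log.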
The commit log: Kafka's big idea
At a high level, Kafka is a distributed commit log -- an append-only data structure, similar in spirit to the write-ahead log a database uses, that stores a sequence of records:
- Records are always appended to the end of the log
- Once written, records are immutable -- they are never modified in place, and they are removed only when the retention period expires, not when they are consumed
- Reading always happens sequentially, from old to new
This simplicity is Kafka's superpower. Because all writes are sequential appends and all reads are sequential scans, Kafka can exploit sequential disk I/O -- an access pattern that, in some published benchmarks, even outpaces random access to memory.
Kafka stores everything on disk, yet achieves throughput measured in millions of messages per second. How? Sequential disk I/O. A modern spinning disk that manages only on the order of 100KB/s of random writes can sustain roughly 600MB/s of sequential writes. By constraining its data structure to append-only sequential access, Kafka turns disk from a bottleneck into an advantage. This is why Kafka can retain messages for days or weeks with negligible performance impact.
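To make the append-only pattern concrete, here is a toy log segment: records are written sequentially to a single file, while a small in-memory index maps each offset to its byte position. The file name, length-prefix framing, and index layout are simplifications of my own, not Kafka's actual on-disk format:

```python
import os
import struct
import tempfile

# One segment file; all writes go to the end of it.
path = os.path.join(tempfile.mkdtemp(), "00000000.log")
index = {}  # offset -> byte position within the segment file

with open(path, "ab") as segment:
    for offset, payload in enumerate([b"view:/home", b"view:/search", b"click:buy"]):
        index[offset] = segment.tell()
        segment.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
        segment.write(payload)                          # then the record bytes

def read_at(offset):
    """Jump straight to a record via the index, then read it sequentially."""
    with open(path, "rb") as segment:
        segment.seek(index[offset])
        (length,) = struct.unpack(">I", segment.read(4))
        return segment.read(length)
```

Every write is an append at the end of the file, so the disk head (or SSD write path) never jumps around -- the index only helps readers find where a given offset starts.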
Kafka use cases
| Use case | How Kafka fits | Example |
|---|---|---|
| Metrics & monitoring | Distributed services push operational metrics to Kafka topics; monitoring systems consume and aggregate | Datadog-style dashboards pulling from Kafka streams |
| Log aggregation | Collect logs from hundreds of services into a unified stream | Replacing scattered log files with a central, queryable log pipeline |
| Stream processing | Raw data flows through multiple transformation stages via topic-to-topic processing | Enriching clickstream data before loading into a data warehouse |
| Event sourcing | Use Kafka as the system of record -- every state change is an immutable event | Order lifecycle events: created → paid → shipped → delivered |
| Website activity tracking | Kafka's original use case at LinkedIn -- capture every page view, search, and click | Real-time analytics, A/B test measurement, recommendation engines |
| Change data capture | Database changes published to Kafka for downstream systems | Keeping a search index in sync with a primary database |
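The event-sourcing row deserves a concrete sketch: if every state change is an immutable event in the log, then the current state of any entity is just a fold over the stream. Using the order-lifecycle example from the table (event shapes are illustrative):

```python
events = [
    {"order": 42, "type": "created"},
    {"order": 42, "type": "paid"},
    {"order": 42, "type": "shipped"},
]

def current_state(event_stream):
    """Replay the log from the beginning; the latest event per order wins."""
    state = {}
    for event in event_stream:
        state[event["order"]] = event["type"]
    return state

state = current_state(events)  # order 42 is currently "shipped"
```

Because the log is the system of record, any downstream system -- billing, analytics, a search index -- can rebuild its own view by replaying the same events, which is exactly the multi-consumer property the earlier table highlighted.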
Kafka isn't just a "faster RabbitMQ." It's a fundamentally different abstraction. A message queue is about delivering messages. Kafka is about storing a log of events that multiple systems can independently consume. This makes it the backbone for event-driven architectures.
What's next
In the following chapters, we'll explore:
- Kafka's high-level architecture -- brokers, topics, partitions, and replication
- A deep dive into how messages are stored and indexed
- Consumer groups -- how multiple consumers coordinate
- The workflow of producing and consuming messages
- ZooKeeper's role in Kafka's coordination
- Delivery semantics -- at-most-once, at-least-once, and exactly-once