In this chapter, we will look into the Kafka Theory. Particularly we will be understanding key concepts like topics, partition, brokers & clusters, offsets, cluster discovery and Kafka producers and consumers.
Producer and Consumers
Producers are the services that publish or write a new message on a specific topic.
Consumers are the services that read messages and can subscribe to one or more topics. Consumers keep a track of consumed messages by maintaining a log of offsets.
Kafka Topic & Partition
Kafka Topic is the base of everything, all the data are stored in some topic. Topics are similar to Tables in Databases or Folders in the File system.
We can have as many topics as we want.
Topics should have a unique name, preferably the name of the data stream which we are going to put into the topic.
Topics are further split into Partitions. Partitions are basically the number of buckets distributed over different brokers. We will get back to what brokers are later, for now, you can think of it as a standalone Kafka server.
Every time a message is written in a topic in a particular partition gets an sequential id called offset. Offsets start at 0 and go to any number.
All the partitions are independent of each other and can have different numbers of offsets in them.
Let’s understand better with an example
I am currently working on Digital Signatures of Documents, sending them as envelopes, and getting back to the sender once the signature is complete.
We have many many numbers of envelopes. And every time some envelope is signed by any recipient I want to inform others and move the envelope for further signing or return the signed envelope back to the sender.
We can have the Topic name as `envelope_status`. We can have the envelope's status written in a message as Envelope ID and the Status of the envelope.
The envelope service can be one of the producers here and consumers here can be notification service or a dashboard where users can see the status of envelopes they sent.
So offset can have this message written in any partition.
Some important points about offsets are:
1. Order of offset is guaranteed only for a particular partition.
e.g. Offset 4 of Partition 3 will be written after offset 3 of Partition 3 but we can not comment on the order for offsets written in partition 2 or 1.
2. Offsets in different partitions are different regardless of their order.
e.g. Offset 3 of Partition 0 is different than offset 3 of partition 1.
3. Messages are immutable, they can not be modified
4. Data will be written in the partitions randomly unless we provide some key for partitions.
Kafka Broker & Clusters
Topics are kept in a Kafka Broker. Broker is nothing but a single Kafka server. Collections of Brokers are called Cluster.
Whenever a Producer client writes a message, there is a Kafka broker which interacts with the client to receive messages, assign them to the specified topic, store them in a partition and maintain offset. Brokers service the consumers and respond to all the fetch requests for partitions with messages.
Brokers are identified by Broker ID and contain certain topic partitions. Not all topics and partitions are stored in one broker.
Once we connect to some broker which is a part of a cluster, we get connected to the entire cluster.
Kafka automatically spread the topic across different brokers. Partition number and Broker number has no relationship, it can be assigned in any order.
Topic Replication Factor
Kafka is a distributed system, whenever there is a distributed system we need to have replication for reliability. If a machine goes down, there must be another machine that will keep it going. This needs replication of Topics.
A replication factor is the number of copies of data over multiple brokers. The replication factor value should be greater than 1 always (between 2 or 3). This helps to store a replica of the data in another broker from where the user can access it.
The replication factor of N means there are exactly N copies of data. The above diagram illustrates a replication factor of 1 as there is only one copy of data present.
If we lose one machine/broker, we still have other brokers to serve as they have the replicated data.
Just like the familiar master-slave architecture, there is a concept of leader in Kafka.
At any point in time, only one broker can be the leader for a given partition. Only the leader can serve or receive data for that partition. The other brokers will be passive replicas and will keep synchronizing data.
Hence, each partition has one leader and multiple In-sync replicas (ISR). Zookeeper decides the leader and ISRs. We will look into that separately.
Once a leader is down, an ISR will be assigned the new leader for a particular partition. After the original leader broker is up, it will try to become the leader again and set the other back to becoming an ISR.
Producer and Keys
Producers write data into the Topics. Producers know where to write the message. Kafka in the background manages it all for us. Produces tries to load the balance automatically when the messages are being written without keys.
Producer when writing the data can choose to receive acknowledgments. There is three types of acknowledgments, namely:
- acks=0: Producer won’t wait for an acknowledgment.
- acks=1: Producer will wait for leader’s acknowledgment.
- acks=all: Producer will wait for leader’s as well as replicas acknowledgment
acks=0 and acks=1 can have potential data loss as we don’t have the complete confirmations. By default Kafka has acks=1 acknowledgment.
The producer can use a key to send to a particular partition. Key can be a String or an Integer. All the messages with a specified key will go to the same partition every time. We keep a key to maintain the order of the field of data.
Keep in mind, we can not assure that messages with a particular key will go to a specific partition but the same partition. If a message with a key goes once to a specific partition, it will always go to the same partition further. This is done by Hashing.
Consumer reads the data from a topic which is identified by name and knows which partition to read data from. Similar to the producers, consumers know how recover in case of Broker failure, that’s how Kafka is designed.
Consumers can read from multiple partitions, they read data inorder for each partitions. However there is no specific order for which partition will be read first.
When multiple consumers reads data directly from the specific partitions are visioned to an application, are called consumer groups. In some cases, it might be possible that the number of consumers are greater than that of partitions, some of the consumers will be in an inactive state.
Every consumer groups has multiple consumers so which consumer reads first is decided by a GroupCoordinator and one ConsumerCoordinator, which assigns a consumer to a partition. This feature is already implemented in the Kafka.
In Kafka we can keep a check on offset value for a consumer group by storing it. Consumer offsets stores an offset value to know at which partition, the consumer group is reading the data. As soon as a consumer in a group reads data, Kafka automatically commits the offsets, or it can be programmed. These offsets are committed live in a topic known as __consumer_offsets. In case of a machine failure where a consumer dies, the consumer will be able to continue reading from where it left off.
Consumers have a choice of when the it wishes to commit the offsets. It is very similar to putting a bookmark on webpages in a browser which a user can put while browsing.
We have three delivery semantics in Kafka:
- At most once: Offsets are committed as soon as the consumer receives the message. In case of incorrect processing, the message will be lost, and the consumer will not be able to read further. Therefore, this semantic is the least preferred one.
- At least once: Offsets are committed after the message has been processed. If the processing goes wrong, then the message will be read again by the consumer. Hence, this is usually preferred to use. Also we need to keep in mind that a consumer can read the message twice, and duplicate processing of the messages can happen. Thus, it needs a system to be an idempotent system.
- Exactly once: Offsets can be achieved for Kafka to Kafka workflow only using the Kafka Streams API.
Kafka Broker Discovery
Every Kafka broker is called a bootstrap_server, that means once we connect to one Kafka broker, we will automatically get connected to the entire Kafka cluster. There is a concept of metadata, i.e. each broker knows all brokers, topics and partitons if all the brokers are a bootstrap_server.
Once a client connects to a broker, it gets metadata containing all the details of brokers, topics and partitions. That is broker discovery works and it works internally. ;)
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and managing all the brokers. Zookeeper helps in setting up leader, replacing and synching with kafka with all the events like failure, topic creation. Kafka essentially needs Zookeeper to run.
Similar to Kafka, Zookeeper works as a group of servers, which again has the concept of leaders and followers. Only difference in the operation is Zookeeper doesn’t store consumer offsets. It is stored in the Kafka topics itself.
This is pretty much everything you need to know before starting up with Kafka. In the upcoming parts of this series we will look into one more important concept called Kafka Guarantees followed by Setting up and working with Kafka.