Apache Kafka for Beginners

Anusha SP
11 min read · Dec 15, 2022


Hey Reader,

In this article, let us learn about Apache Kafka, one of the most useful and popular messaging systems, used widely across almost every platform in the IT industry. Apache Kafka skills are also in huge demand, both in and outside of India.

The contents of this Apache Kafka article include:

  1. Introduction to Apache Kafka
  2. Why Apache Kafka?
  3. What is Apache Kafka?
  4. Components of Apache Kafka
  5. What are Streams?
  6. What are Zookeepers?
  7. Core terminologies of Apache Kafka

Introduction to Apache Kafka

Kafka is an open-source distributed event streaming platform, where

  • Event streaming refers to creating a real-time stream of data and processing that real-time stream.
  • Example — consider the Paytm application: millions of people hit the Paytm server at the same time, which is known as creating (or generating) a real-time stream of data. When we then want to validate or limit those transactions as they arrive, that is known as processing the real-time stream.
  • Next, distributed refers to spreading the system across multiple computers, nodes, or regions to balance the load and avoid downtime.
  • Since Kafka is a distributed event streaming platform, we can also distribute our Kafka servers.
  • If any server goes down, another server picks up its traffic, avoiding application downtime.

Kafka was originally developed at LinkedIn and open-sourced in 2011. Since then it has evolved into a popular tool for building real-time data pipeline systems.

Apache Kafka has become a boon to the tech industry, as it is secure and scalable for real-time applications. The official Kafka documentation describes it as a distributed event streaming platform.

Why Apache Kafka?

Now, let us understand why Apache Kafka was introduced. In the older days, large enterprise applications had many backend systems; the senders and receivers were numerous, and it was difficult to identify which sender was sending data to which receiver.

Without Kafka (figure)

As you can observe in the above image, the integration is completely messed up: there are multiple source systems and multiple destination systems. If you are given the task of creating data pipelines to move data among these systems, some of those pipelines will surely keep breaking every day.

But if we use a messaging system to solve this kind of integration problem, the solution becomes crystal clear, as seen in the below image.

This idea was proposed by the LinkedIn team. The team started by evaluating the existing messaging systems, but none of them met the required criteria. Note that a central messaging system also greatly reduces the number of connections.

So they came up with a new messaging system: Kafka. Let us have a look at the official Kafka diagram.

How does it work?

  • Kafka works on the Pub-Sub model: Pub stands for Publisher and Sub stands for Subscriber.
  • There are usually three components: the publisher, the subscriber, and the messaging system (message broker).
  • The publisher is the application that publishes messages or events to a messaging system like Kafka.
  • The messages are then stored in the message broker.
  • The subscriber goes to that message broker and asks for the messages, or simply listens to the broker to receive them (see the sketch after this list).
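
To make this flow concrete, here is a minimal sketch using the third-party kafka-python library; the broker address localhost:9092 and the topic name greetings are illustrative assumptions, not from the official Kafka docs.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publisher: send an event to the messaging broker (the Kafka server).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("greetings", b"hello from the publisher")
producer.flush()  # block until the message has really been sent

# Subscriber: listen to the broker for messages on the same topic.
consumer = KafkaConsumer(
    "greetings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest stored message
)
for message in consumer:
    print(message.value)  # b'hello from the publisher'
    break
```

Notice that the publisher and subscriber never talk to each other; both only talk to the broker.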

Components of Apache Kafka

In a typical messaging system, there are mainly 3 types of components.

  1. Producer / Publisher — producers are client applications that send messages.
  2. Broker — the broker receives messages from publishers and stores them.
  3. Consumer — consumers read the messages they receive from the broker.

Here, we also use Kafka clusters — a Kafka cluster is nothing but a bunch of brokers running on a group of computers.

As you can see in the above image, producers send messages to the Kafka cluster, which takes those messages and stores them in Kafka message logs.

At the bottom of the image are the consumers, who read from the Kafka cluster and use the messages as per their requirements. They may send the data to Hadoop or HBase, or send it back into Kafka so someone else can read and modify it.

What are Streams?

A stream is a continuous flow of data. Kafka was developed so that it can handle such flows: data can be processed by a Kafka stream processor, stored back into Kafka, or sent on to other systems.

Kafka also provides stream processing APIs; these can be used with streaming frameworks such as Spark and Storm, or directly through the Kafka Streams API.
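
Kafka Streams itself is a Java library, but the consume-transform-produce pattern behind stream processing can be sketched with plain Python clients; the topic names below are illustrative assumptions.

```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Read from one Kafka topic, process each record, and write the
# result back into Kafka so other systems can pick it up.
for record in consumer:
    transformed = record.value.upper()  # stand-in for real processing
    producer.send("processed-events", transformed)
```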

Core Terminologies and Components of Apache Kafka

  1. Producer
  2. Consumer
  3. Broker
  4. Cluster
  5. Topic
  6. Partitions
  7. Offset
  8. Connectors
  9. Consumer Group
  10. Zookeeper

Producer:

Producers are applications that send data, whether messages or records, and the data can be of any size. The message a producer sends may carry a meaning or schema for the reader, but to Kafka it is simply an array of bytes.

For example, if we want to send a file to Kafka, we create a producer application and send each line of the file as a message. In this case a message is one line of text, but to Kafka it is just an array of bytes.

Similarly, if we want to send all the records from a table, we create a producer application, fire a query against the database, collect the data, and send each row as a message.

So, while working with Kafka, if you want to send some data, you have to create a Producer Application.
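
As a minimal sketch of the file example above (the file name input.txt and topic file-lines are hypothetical, and kafka-python is assumed):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send each line of the file as one message; Kafka sees only bytes.
with open("input.txt") as f:
    for line in f:
        producer.send("file-lines", line.strip().encode("utf-8"))

producer.flush()  # make sure everything is delivered before exiting
```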

Consumer:

The consumer is again an application, one that reads or receives data from Kafka. If producers send the data, consumers are the recipients.

Remember, producers do not send data directly to a recipient's address. They send it to the Kafka server, so that whoever is interested in the data can use it; whichever application requests that data becomes a consumer.
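
Continuing the same hedged sketch, a consumer application just subscribes to the topic it is interested in:

```python
from kafka import KafkaConsumer

# Any application that asks Kafka for this data becomes a consumer.
consumer = KafkaConsumer(
    "file-lines",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value.decode("utf-8"))  # one line of the original file
```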

Broker:

Broker is the name given to a Kafka server, because all that a Kafka server does is act as a broker between producers and consumers.

Producers and consumers do not interact directly; they use the Kafka server as an agent, or broker, to exchange messages.

To conclude: Producer, Consumer, Broker

Producers are the sources of data, publishing messages or events. Consumers act as receivers, responsible for receiving or consuming those messages. But the two never communicate with each other directly.

To move a message from a producer to a consumer, there is a middleman known as the Kafka server.

Cluster:

If you have some idea of distributed systems, you will know what clusters are: a cluster is simply a group of computers or servers acting together for a common purpose.

Since Kafka is a distributed system, the term has the same meaning here. A Kafka cluster is a group of computers, each running one instance of a Kafka broker.

Topic:

A topic is a unique name given to a data set or Kafka stream.

Example:

Consider a Paytm-like system, where the producer produces a huge set of data and the consumer asks the server for everything it has received. The broker checks its storage and replies, "I have many messages on my server; which ones do you need?" The consumer asks, "Give me all payment-specific messages." The broker comes back again with, "Hey consumer, I have different types of payment messages with me; which ones do you need?" At this point the consumer is lost for an answer.

So topics came into the picture to categorize the different types of messages: we create multiple topics, each storing one type of message.

If the producer is sending payment-specific messages, they go into a payment-specific topic; if it is sending booking-related messages, we store them in a booking topic. This way you can segregate different types of messages into different topics.

  • The consumer no longer needs back-and-forth communication with the broker.
  • Whatever topic the consumer wants, it simply subscribes to that topic.
  • In effect, a topic acts like a database table in the Kafka ecosystem (see the sketch after this list).
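
A sketch of that segregation, assuming kafka-python and illustrative topic names:

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

# Create one topic per message category.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="payments", num_partitions=3, replication_factor=1),
    NewTopic(name="bookings", num_partitions=3, replication_factor=1),
])

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("payments", b"payment of Rs. 500 from user 42")  # payment message
producer.send("bookings", b"booking 9001 confirmed")           # booking message
producer.flush()
```

A consumer interested only in payments now subscribes to the payments topic and never sees booking messages.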

Partitions:

As we know, producers send data to a topic. This data can be huge, in many cases larger than the storage capacity of a single machine, and the broker would then face challenges in storing it.

  • Suppose, as in the above image, Paytm is producing a huge amount of data per second.
  • Storing such a large amount of information on one machine is a serious challenge. To overcome it, since Kafka is a distributed system, we can break the topic into multiple partitions and distribute those parts across different machines or servers.
  • Also, if the machine holding one partition goes down, the others can take up the load.

So one solution is to break such huge data into partitions: the data is divided into two or more parts and distributed across multiple systems.

Now the question comes: how many partitions are required?

The answer is that Kafka does not decide how many partitions a topic requires; that decision is ours.

We make that decision during topic creation, and the Kafka broker then creates that many partitions (see the sketch below).
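
A hedged sketch of that decision (topic name and partition count are illustrative): the count is fixed at creation time, and records with the same key always land in the same partition.

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

# The partition count is our decision, made when the topic is created.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="paytm-txns", num_partitions=6, replication_factor=1),
])

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Records with the same key are routed to the same partition.
producer.send("paytm-txns", key=b"user-42", value=b"txn #1")
producer.send("paytm-txns", key=b"user-42", value=b"txn #2")
producer.flush()
```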

Offset:

An offset is a sequence number given to each message in a partition. The number is assigned as the message arrives in the partition, and once assigned it can never be changed: offsets are immutable.

The sequencing means that Kafka stores messages in their order of arrival within a partition, and that offsets are local to each partition.

  • The purpose of the offset is to keep track of which messages have already been consumed by the consumer.
  • For example, consider a consumer that goes down after reading 4 messages from a partition.
  • When the consumer comes back online after some time, the stored offset value tells it exactly where to resume consuming messages.
  • Notice that the consumer starts reading from offset 4, not from 0. This is the key role of the offset (see the sketch below).
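
A sketch of that resume behaviour, assuming kafka-python; the group id payment-readers is illustrative. Because Kafka remembers the last committed offset for the group, a restarted consumer continues from where it left off rather than from offset 0.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="payment-readers",   # committed offsets are tracked per group
    enable_auto_commit=True,      # commit progress automatically
)
for message in consumer:
    # Each record carries its immutable, partition-local sequence number.
    print(message.partition, message.offset, message.value)
```

If this process dies after reading offsets 0 to 3 and later restarts, it resumes from offset 4.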

Connectors:

Connectors are one of Kafka's powerful features, and they are ready to use. Using them, we can import data from a database into Kafka, or export data from Kafka to a database.

These are not only out-of-the-box (OOTB) connectors, but also a framework for connecting Kafka to other applications.
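
Connectors are configured rather than coded. As a hedged sketch, assuming a Kafka Connect worker running on localhost:8083 with the Confluent JDBC source connector installed (the database URL, table, and connector name below are illustrative):

```python
import requests

# Register a source connector that imports table rows into Kafka.
connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/shop",
        "table.whitelist": "orders",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "db-",
    },
}
requests.post("http://localhost:8083/connectors", json=connector)
```

Once registered, the connector streams new rows from the orders table into the db-orders topic without any custom producer code.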

Consumer Group:

A consumer group is a group of consumers: a number of consumers form a group to share the work. Think of dividing a large piece of work so that each member handles part of it individually.

  • In the above diagram, one consumer is consuming from all the partitions of a topic, which will surely impact performance.

To overcome this situation:

  • The easy solution is to increase the number of consumers in the group.
  • We cannot guarantee which partition will be assigned to which consumer; the group coordinates that internally.
  • If more consumers are added than there are partitions, the extra consumers sit idle, as all partitions are already assigned to the existing consumer instances (see the sketch after this list).
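
A sketch of sharing the work (group id and topic are illustrative): run several copies of this script with the same group_id, and Kafka automatically splits the topic's partitions among them; copies beyond the partition count stay idle.

```python
from kafka import KafkaConsumer

# Every process started with this same group_id joins one consumer
# group, and the partitions are divided among the group's members.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="payment-readers",
)
for message in consumer:
    print(f"partition {message.partition}: {message.value}")
```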

Zookeeper:

  • Zookeeper is a key component and a prerequisite of Kafka.
  • Kafka is a distributed system, and it uses Zookeeper for coordination and to track the status of Kafka cluster nodes.
  • It also keeps track of Kafka topics, partitions, offsets, etc.
  • Zookeeper acts as a manager for the Kafka cluster.

Role of Zookeeper:

  • Kafka uses Zookeeper to store and manage all the metadata information about the Kafka cluster. It also uses Zookeeper as a centralized controller that manages all the Kafka brokers/servers.
  • In newer Kafka versions, instead of storing all this metadata in Zookeeper, Kafka can store it in an internal metadata topic inside the Kafka servers themselves.
  • This is possible because of KRaft mode, which arrived in newer Kafka versions.

These are the basic concepts and components of Kafka to know; more will follow in the coming articles.

Thank you for reading the article. Please clap, share, and comment; it will encourage me to write more such articles. Do share your valuable suggestions, I appreciate your honest feedback!
