Apache Kafka Overview

May 26, 2021 Apache Kafka

Big data applications work with enormous volumes of data, which raises two main challenges. The first is how to collect large volumes of data, and the second is how to analyze the collected data. To overcome these challenges, you need a messaging system.

Kafka is designed for distributed high-throughput systems. Kafka tends to work well as a replacement for a more traditional message broker. Compared with other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message processing applications.

What is a messaging system?

A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing. Messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available: point-to-point and publish-subscribe (pub-sub). Most messaging patterns follow pub-sub.

Point-to-point messaging system

In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer. Once a consumer reads a message, it disappears from the queue. A typical example is an order processing system, where each order is processed by one order processor, but multiple order processors can work at the same time. The following image describes the structure.
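As a rough illustration of the point-to-point pattern, here is a minimal in-memory sketch in Python. The names `PointToPointQueue` and `process_orders` are hypothetical and invented for this example, and the round-robin hand-out is an assumption; a real broker would deliver to whichever consumer asks next.

```python
from collections import deque


class PointToPointQueue:
    """In-memory queue where each message is delivered to at most one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # Producers append to the tail of the queue.
        self._messages.append(message)

    def receive(self):
        # Reading removes the message, so no other consumer can see it.
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


def process_orders(queue, processors):
    """Hand out queued orders to processors in round-robin order.

    Returns a dict mapping processor name -> list of orders it handled.
    """
    handled = {name: [] for name in processors}
    turn = 0
    while True:
        order = queue.receive()
        if order is None:
            break
        handled[processors[turn % len(processors)]].append(order)
        turn += 1
    return handled
```

Several processors share the work, but every order ends up with exactly one of them, and the queue is empty once all orders have been read.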

Publish-subscribe messaging system

In a publish-subscribe system, messages are persisted in a topic. Unlike in a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In a publish-subscribe system, message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels such as sports, movies, and music; anyone can subscribe to their own set of channels and receive them whenever the subscribed channels are available.
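The publish-subscribe pattern can be sketched in a few lines of Python. This is a toy model under stated assumptions: `PubSubBroker` and its methods are hypothetical names made up for illustration, everything lives in memory, and each subscriber simply tracks how far it has read in each topic.

```python
from collections import defaultdict


class PubSubBroker:
    """Toy broker: messages are retained per topic; every subscriber sees all of them."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic -> retained messages
        self._positions = {}              # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        # Publishing never removes anything; messages stay in the topic.
        self._topics[topic].append(message)

    def subscribe(self, subscriber, topic):
        # New subscribers start from the beginning of the retained messages.
        self._positions.setdefault((subscriber, topic), 0)

    def poll(self, subscriber):
        """Return unread messages for each topic the subscriber follows."""
        unread = {}
        for (name, topic), position in list(self._positions.items()):
            if name != subscriber:
                continue
            messages = self._topics[topic][position:]
            if messages:
                unread[topic] = messages
                self._positions[(name, topic)] = position + len(messages)
        return unread
```

Unlike the point-to-point queue, reading a message here does not remove it: two subscribers to the same topic each receive every message published to it.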

What is Kafka?

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and lets you pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates well with Apache Storm and Spark for real-time streaming data analysis.
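To make the idea of replicated, retained messages more concrete, here is a deliberately simplified Python sketch. It is not Kafka's actual replication protocol: `ReplicatedLog` is a hypothetical class, the "brokers" are plain in-memory lists, and every append is copied to every replica synchronously. It only illustrates why keeping several copies lets reads survive the loss of one machine.

```python
class ReplicatedLog:
    """Toy append-only log copied to several in-memory 'brokers'."""

    def __init__(self, broker_names):
        # Each broker holds its own full copy of the log.
        self._replicas = {name: [] for name in broker_names}

    def append(self, message):
        """Append to every live replica and return the message's offset."""
        if not self._replicas:
            raise RuntimeError("no replicas left")
        offset = None
        for log in self._replicas.values():
            log.append(message)
            offset = len(log) - 1
        return offset

    def fail(self, broker_name):
        # Simulate losing a machine together with its copy of the data.
        self._replicas.pop(broker_name)

    def read(self, offset):
        """Read from any surviving replica; reading does not delete the message."""
        log = next(iter(self._replicas.values()))
        return log[offset]
```

After `fail("broker-1")`, messages appended earlier can still be read by offset from the remaining copies, and they can be read more than once.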

Benefits

Here are a few of Kafka's benefits -

Reliability - Kafka is distributed, partitioned, replicated, and fault-tolerant.
Scalability - The Kafka messaging system scales easily without downtime.
Durability - Kafka uses a distributed commit log, which means messages are persisted to disk as quickly as possible and are therefore durable.
Performance - Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.

Kafka is very fast and is designed for zero downtime and zero data loss.
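Partitioning is what allows a topic to be spread over several machines. The sketch below shows one common way to do it, assuming messages carry a key: hash the key and take the result modulo the number of partitions. `choose_partition` and `PartitionedTopic` are hypothetical names, and CRC32 is used only as a convenient deterministic stand-in hash, not as the hash Kafka itself uses.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition index deterministically (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedTopic:
    """Toy topic split into a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Messages with the same key always land in the same partition,
        # so their relative order is preserved within that partition.
        index = choose_partition(key, len(self.partitions))
        self.partitions[index].append((key, value))
        return index
```

Because the mapping depends only on the key and the partition count, all messages for a given key stay together, while different keys can be handled by different partitions in parallel.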

Use cases

Kafka can be used in many use cases. Some of them are listed below -

Metrics - Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
Log aggregation solution - Kafka can be used to collect logs from multiple services across an organization and make them available in a standard format to multiple consumers.
Streaming - Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic, where it becomes available to users and applications. Kafka's durability is also useful in the context of stream processing.
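The read-process-write loop described under Streaming can be sketched without any framework. In this hypothetical example, topics are plain Python lists held in a dict, `process_stream` is an invented helper, and the returned offset plays the role of a consumer's saved position; it is an illustration of the pattern, not of how Storm or Spark Streaming are implemented.

```python
def process_stream(topics, source, sink, transform, start_offset=0):
    """Read records from `source` starting at `start_offset`, apply `transform`,
    append the results to `sink`, and return the next offset to read."""
    records = topics.get(source, [])
    output = topics.setdefault(sink, [])
    # Slicing copies the unread records, so the source topic is left untouched.
    for record in records[start_offset:]:
        output.append(transform(record))
    return max(start_offset, len(records))
```

A caller would keep the returned offset and pass it back in on the next call, so only records that arrived in the meantime are processed. The source topic keeps its records, which is why other consumers can still read them independently.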

Need for Kafka

Kafka is a unified platform for handling all real-time data feeds. Kafka supports low-latency message delivery and guarantees fault tolerance in the presence of machine failures. It can handle a large number of diverse consumers. Kafka is very fast and can perform 2 million writes per second. Kafka persists all data to disk, which essentially means that all writes go to the operating system's page cache (RAM). This makes it very efficient to transfer data from the page cache to a network socket.
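The efficiency of moving data from the page cache to a socket comes from letting the operating system do the copy. The Python sketch below illustrates the general idea with `socket.sendfile()`, which uses `os.sendfile()` on platforms that support it and falls back to ordinary `send()` calls elsewhere. The helper names `send_log_segment` and `demo`, the file contents, and the local socket pair are all assumptions made for this example; Kafka itself is written for the JVM and does not use this code.

```python
import os
import socket
import tempfile


def send_log_segment(path, sock):
    """Send a whole file over a connected socket and return the bytes sent.

    Where os.sendfile() is available, the kernel can move the data from the
    page cache to the socket without copying it through this process.
    """
    with open(path, "rb") as segment:
        return sock.sendfile(segment)


def demo():
    """Write a small 'log segment', send it over a local socket pair, read it back."""
    payload = b"offset-0:hello\noffset-1:world\n"
    with tempfile.NamedTemporaryFile(delete=False) as segment:
        segment.write(payload)
        path = segment.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_log_segment(path, sender)
        sender.close()  # signal end of stream to the receiver
        received = b""
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
        return sent, received
    finally:
        sender.close()
        receiver.close()
        os.remove(path)
```

Calling `demo()` returns the number of bytes sent and the bytes received on the other end, which should match the file contents. The payload is kept small so it fits in the socket buffer and the single-threaded demo does not block.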