What is Apache Kafka?

Apache Kafka is a highly scalable, fast, and fault-tolerant messaging system used for streaming applications and data processing. It is written in the Java and Scala programming languages.

Introduction to Apache Kafka

Apache Kafka is a fast, scalable, fault-tolerant publish-subscribe messaging system that enables communication between producers and consumers through message-based topics. It provides a platform for new-generation distributed applications and permits a large number of permanent or ad-hoc consumers. Kafka is highly available, resilient to node failures, and supports automatic recovery. These characteristics make Kafka ideal for communication and integration between components of large-scale, real-world data systems. Incidentally, AWS now provides Kafka as a managed service.

Its higher throughput, reliability, and replication have led this technology to replace conventional message brokers built on JMS, AMQP, and the like.

A Kafka broker is a node in the Kafka cluster that persists and replicates the data. A Kafka producer pushes messages into a message container called a Kafka topic, and a Kafka consumer pulls messages from that topic. You can find more on the Kafka protocol on the Apache Software Foundation site.

Some of the other applications offering similar functionality are ActiveMQ, RabbitMQ, Apache Flume, Storm, and Spark. But why should you go for Apache Kafka instead of Flume?

Apache Kafka | Apache Flume
A general-purpose tool for multiple producers and consumers. | A special-purpose tool for specific applications.
Replicates events using ingest pipelines. | Does not replicate events.

Comparing RabbitMQ with Apache Kafka

RabbitMQ is considered one of the foremost alternatives to Apache Kafka. Let’s see how the two differ in their properties and performance.

Apache Kafka | RabbitMQ
Distributed: data is sharded and replicated, with guaranteed durability and availability. | Provides relatively little support for these features.
Throughput on the order of 100,000 messages/second. | Throughput around 20,000 messages/second.
Ships with consumer frameworks that allow reliable distributed log processing, and stream-processing semantics are built into Kafka Streams (see the sketch after this table). | The consumer is simply FIFO-based, reading from the HEAD and processing messages one by one.
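
To make the stream-processing point concrete, here is a minimal Kafka Streams sketch that uppercases the values of one topic into another. This is only a sketch under assumptions: a broker at localhost:9092, the hypothetical topics page-views and page-views-upper, and the kafka-streams library on the classpath.

```java
// Minimal Kafka Streams sketch: read "page-views", uppercase each value,
// and write the result to "page-views-upper". Topic names and broker
// address are illustrative assumptions.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("page-views");
        input.mapValues(value -> value.toUpperCase())   // per-record transformation
             .to("page-views-upper");                   // republish to an output topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Each record flowing through the topology is transformed and republished continuously, which a plain FIFO consumer cannot express as directly.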

How does Kafka differ from other queuing systems?

Kafka is a new-age message streaming service conceived at LinkedIn. Let’s see how it fares compared with traditional queuing systems.

Traditional queuing systems | Apache Kafka
Most queuing systems remove messages once they have been processed, typically from the end of the queue. | Messages persist after being processed; they are not deleted as consumers receive them.
Resemble imperative programming: the originating system decides that an action should be taken in a downstream system. | Employs reactive programming through a publisher/subscriber architecture.
Processing logic based on similar messages or events is not possible. | Such processing is possible.

Why should we use an Apache Kafka cluster?

One of the biggest challenges with big data is processing and analyzing it; a system must also be able to ingest the data and make it available to users. This is where Apache Kafka has proved its utility. It provides numerous benefits, such as:

  • Tracking web activities by storing/sending events for real-time processing
  • Alerting on and reporting operational metrics
  • Transforming data into a standard format
  • Continuous processing of streaming data in topics

Due to its wide applicability, this technology is giving tough competition to some of the most popular alternatives, such as ActiveMQ, RabbitMQ, and AWS messaging services.

A brief history of Apache Kafka

In this section of the Apache Kafka tutorial, we will elucidate the history of Kafka. LinkedIn faced the problem of low-latency ingestion of huge amounts of website data into a lambda architecture capable of processing real-time events. Since no existing solution addressed this drawback, Kafka was developed in 2010 as the answer.

Technologies were available for batch processing, but their deployment details were exposed to downstream users, and they were not suitable for real-time processing.

Kafka was subsequently made public in 2011.

Apache Kafka architecture

In this part of the Apache Kafka tutorial, we will familiarize you with its architecture. Kafka is usually integrated with Apache Storm, Apache HBase, and Apache Spark to process real-time streaming data. It can deliver massive message streams to a Hadoop cluster regardless of the industry or use case. Its process flow is best understood by taking a close look at its ecosystem:

[Figure: Apache Kafka architecture]

Kafka is deployed as a cluster implemented on one or more servers. The cluster stores ‘topics’, which consist of streams of ‘records’. Every record holds three details: a key, a value, and a timestamp. Brokers are the abstractions that manage the persistence and replication of messages.

Basically, Kafka has four core APIs; a minimal Producer API sketch follows the list:

  • Producer API – Permits applications to publish a stream of records to one or more topics.
  • Consumer API – Lets an application subscribe to one or more topics and process the stream of records produced to them.
  • Streams API – Takes input from one or more topics and produces output to one or more topics, converting the input streams into output streams.
  • Connector API – Builds and runs reusable producers and consumers that link topics to existing applications.
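
As an illustration of the Producer API, here is a minimal sketch that publishes a single record. The broker address localhost:9092, the topic page-views, and the key user-42 are assumptions for the example, and the kafka-clients library is assumed to be on the classpath.

```java
// Minimal Producer API sketch: publish one record to a topic.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Topic "page-views" and key "user-42" are hypothetical examples.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
        } // close() flushes any buffered records
    }
}
```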

Kafka topic

A Kafka topic consists of several partitions. A topic can be parallelized through these partitions by splitting its data across multiple brokers. For multiple consumers to read a topic in parallel, each partition should be placed on a separate machine. Multiple consumers can then read from multiple partitions in a topic, which allows a phenomenally high message-processing throughput. Creating a topic with a chosen partition count can be done as sketched below.
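
A minimal sketch of creating a partitioned topic with the Java AdminClient. The topic name, partition count, replication factor, and broker address are illustrative assumptions.

```java
// Sketch: create a topic with 6 partitions, replicated 3 ways.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions let up to 6 consumers in one group read in parallel;
            // a replication factor of 3 tolerates the loss of two brokers.
            NewTopic topic = new NewTopic("page-views", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get(); // block until done
        }
    }
}
```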

Every message within a partition has an identifier called its offset. The offset defines an immutable sequence in which the messages are ordered, and Kafka maintains this ordering. Consumers can read messages starting from any offset they choose, which allows them to join the cluster at any point in time (see the seek sketch below). Each message in a Kafka cluster can thus be uniquely identified by the tuple of its topic, partition, and offset within the partition.
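
Here is a hedged sketch of reading from a chosen offset with the Java consumer. The topic name, partition number, and offset 42 are arbitrary assumptions for illustration.

```java
// Sketch: attach to one partition and replay messages from a chosen offset.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "replay-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singleton(partition)); // manual assignment, no group rebalance
            consumer.seek(partition, 42L);                     // 42 is an arbitrary example offset

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```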

[Figure: Kafka topic]

Log Anatomy

One can also view a partition as a log. A data source writes messages to the log, and at any time one or more consumers read from the log at offsets they choose. In the diagram below, a data source is writing to the log while consumers read from it at different offsets.

[Figure: Log anatomy]

Data Log

Kafka retains messages for a configurable amount of time, and consumers can read them at their convenience. If Kafka is configured to keep messages for 24 hours and a consumer is down for longer than 24 hours, that consumer will lose messages; if the downtime is only 60 minutes, it can resume reading from its last known offset. Kafka does not, however, keep state on what consumers are reading from a topic. Retention is configured per topic, as sketched below.
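
As a sketch of the 24-hour scenario above, the topic-level retention.ms config can be set at creation time. The topic name, partition counts, and broker address are assumptions.

```java
// Sketch: create a topic that keeps messages for 24 hours.
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 86,400,000 ms = 24 hours; consumers down longer than this lose messages.
            NewTopic topic = new NewTopic("activity-log", 3, (short) 1)
                    .configs(Map.of("retention.ms", "86400000"));
            // For log compaction instead, use Map.of("cleanup.policy", "compact"):
            // the latest record per key is then kept indefinitely.
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```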

Messaging setup in Kafka

As with other publish-subscribe distributed messaging systems, Kafka also holds feeds of messages in topics. Producers write data to topics and consumers read from topics. Topics are partitioned and replicated across multiple nodes as Kafka is a distributed system.

Messages are simply byte arrays, so developers can store any object in any format: String, JSON, Avro, and much more. If you want all messages with the same key to arrive at the same partition, attach that key to each message; this is the producer’s job. When consuming from a topic, you can configure a consumer group with multiple consumers. Each consumer in the group reads messages from a subset of partitions that is unique within every topic it subscribes to, so each message is delivered to exactly one consumer in the group, and messages with the same key naturally go to the same consumer. A minimal consumer-group sketch follows.
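
A minimal consumer-group sketch, assuming the hypothetical topic page-views and a local broker: every consumer started with the same group.id splits the topic’s partitions among themselves.

```java
// Sketch: a consumer that joins the group "page-view-processors".
// Start several copies and Kafka assigns each one a disjoint set of partitions.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "page-view-processors");    // consumers sharing this id split partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("page-views")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```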

What makes Kafka unique is that it treats each topic partition as a log and assigns a unique offset to each message in the partition. Kafka does not track which messages were read by a particular consumer; it simply retains all messages for a set amount of time, and it is the consumers’ responsibility to track their own position in each log. This design lets Kafka support a huge number of consumers and retain large amounts of data without much difficulty. If log compaction is enabled, Kafka keeps the latest record for each key indefinitely.

Partitioning in Apache Kafka

Every broker holds a set of partitions, and each of these partitions can be either a leader or a replica for a topic. All writes and reads for a topic go through the leader, which is responsible for updating the replicas with new data. If the leader fails, a replica takes over as the new leader. The leaders and replicas of a topic can be inspected as sketched below the figure.

[Figure: Partitioning in Apache Kafka]
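
To see the leader and replicas of each partition, one can ask the AdminClient to describe the topic. This sketch carries the same assumptions as the earlier ones (local broker, hypothetical topic name).

```java
// Sketch: print the leader and replicas of every partition of a topic.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("page-views"))
                    .all().get().get("page-views"); // hypothetical topic
            for (TopicPartitionInfo p : desc.partitions()) {
                // All reads and writes go to the leader; the replicas follow it.
                System.out.printf("partition=%d leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}
```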

Producers

As a means of load balancing, producers write to a single leader, so each write can be serviced by a separate Kafka broker and machine. In the image below, the producer writes to partition 0 of the topic, and partition 0 replicates that write to the available replicas of the same partition on other brokers.

[Figure: Partitions and brokers]

In the image below, the producer writes to partition 1, and partition 1 replicates that write to its available replicas.

[Figure: Producer writing to partition 1]

System throughput increases greatly because each machine is responsible for a share of the writes.

Importance of Java in Apache Kafka

Kafka’s native API is Java, and the broker itself is written in Java and Scala. C++, Python, .NET, Go, and many other languages also have Kafka clients, but Java is the only language where no third-party library is needed; there is a little extra overhead when writing code in languages other than Java.

For high-throughput applications such as those built on Kafka, Node.js cannot be optimized enough. So, if you need the high processing rates that Kafka delivers, Java is the language to use.

There is good community support for Kafka consumer clients written in Java, so there are all the right reasons to implement Kafka in Java; various examples of Kafka deployments can be found on the web.

LinkedIn also provides the Camus API to facilitate a Kafka-to-HDFS pipeline, and support for the Kafka Java client is robust.

Use cases of Kafka

Through the use cases below, you will get to know what Apache Kafka is really used for.
 
Messaging
 
We have already covered the messaging concepts extensively in this Apache Kafka tutorial.
 
Metrics
 
Kafka finds good application in operational monitoring data: statistics from distributed applications are aggregated into centralized feeds of operational data.
 
Event Sourcing
 
In this style of application, changes in application state are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for event-sourcing applications.
 
Commit Log
 
Kafka can act as a kind of external commit log for a distributed system. The log helps replicate data between nodes and re-synchronize failed nodes so they can restore their data. This usage is supported by Kafka’s log compaction feature.
 
Tracking of website activity
 
Kafka’s original use case was to rebuild a user-activity tracking pipeline as a set of real-time publish-subscribe feeds. Activity tracking is very high volume, as a huge number of activity messages are generated for each user page view.
 

Scope of Apache Kafka

LinkedIn has deployed one of the biggest Kafka clusters and has reported: “Back in 2011, it was ingesting more than 1 billion events a day. Recently, it has reported ingestion rates of 1 trillion messages a day.”

An analysis by Redmonk reveals a surprising fact: “Kafka is increasingly in demand for usage in servicing workloads like IoT, among others.”

[Figure: Scope of Apache Kafka (source: Redmonk)]

“The partnership with popular streaming systems like Spark has resulted in the consistent growth of active users on the Kafka users mailing list, which is just over 260% since July 2014.” – Fintan Ryan, Redmonk Analyst

This powerful technology has created a lot of buzz since its emergence, thanks to the special features that distinguish it from similar tools. Its unique design makes it suitable for a variety of software architecture challenges.

Some of the tech leaders who have implemented it are:

  • Twitter
  • LinkedIn
  • Netflix
  • Mozilla
  • Oracle

Wish to grab a high-paying real-time analytics job? Start with the Apache Kafka Online Training Course!

Who is the right audience for Apache Kafka?

Apache Kafka is best suited for aspirants who want to build careers as Big Data analysts, Big Data Hadoop developers, architects, testing professionals, project managers, or messaging and queuing system professionals.

However, thorough knowledge of Java, Scala, distributed messaging systems, and Linux is recommended.

How to download Apache Kafka?

You can download Apache Kafka from the Kafka quick-start page.

How Apache Kafka will help you in career growth?

The demand for Kafka professionals is rising at such a pace that it is outperforming Apache Spark in terms of relative employer demand.

  • “The average salary for a Kafka professional is 122,000 USD per annum. This is 112% higher than the average salaries of other jobs.” – Indeed.com
  • The salary trend also indicates a steady and zooming growth from early 2015 that is still on the rise. -Indeed.com

From the aforementioned facts and figures, we can assess the extent to which tech giants are craving Kafka professionals. One thing is clear: Kafka has made a solid impact on the leading market players and is anticipated to keep growing in the near future, so mastering it will give you a real head start. Though many technologies on the market address similar issues, Kafka has created a niche for itself by delivering high-end services to companies that want to process streaming data in real time. The range of qualities it offers is broad, and hence it is being widely accepted by major technology leaders. The growing popularity of this technology has created a huge demand for its professionals, with high-paying jobs for the right candidates.

Kickstart your career by taking a keen interest in Kafka through Intellipaat’s Kafka tutorial!

 

Related Articles

Suggested Articles

  • Swapna

    intellipaat i want to know ‘what are the prerequisites for Kafka training’ ?? and is necessary to learn hadoop before Kafka.
    i am looking for your answer.. ASAP.

    • Mo

      I’ve just finished my Big Data Class and I’ve learned that none of these tools in the big data landscape is required for any others, and no way is it complete. We’ll have dozens of new technologies out before some sort of standard. If you want to build a robust big data system you need to think of layers (ie. the Lambda Architecture), and decide what works for you. For example for my use case I’m aiming to learn Cassandra, Spark and Kafka to do real-time big data machine learning with a Kappa Architecture. https://www.linkedin.com/pulse/analytics-data-pipeline-lambda-kappa-architecture-farshad-vahidpour/