Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Its storage layer is essentially a massively scalable message queue designed as a distributed transaction log, which makes it highly valuable for enterprise infrastructures that process streaming data. Kafka also connects to external systems for data import and export through Kafka Connect, and provides Kafka Streams, a Java library for stream processing.

Usage

Apache Kafka is built on a commit log. It allows users to subscribe to that log and publish data to any number of systems or real-time applications. Examples include matching passengers with drivers at Uber, real-time analytics and predictive maintenance for British Gas smart homes, and real-time services at LinkedIn.

Architecture

Kafka stores messages that come from processes called producers. The data can be split into partitions within different topics. Within a partition, messages are strictly ordered by their offset (the position of a message within the partition), and they are indexed and stored together with a timestamp. Other processes, called consumers, can read messages from partitions. For stream processing, Kafka offers the Streams API, which allows writing Java applications that consume data from Kafka and write results back to it. Apache Kafka also works with external stream-processing systems such as Apache Apex, Apache Flink, Apache Spark, and Apache Storm.
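The producer, partition, and consumer roles described above can be illustrated with a small in-memory model. This is only a sketch for intuition, not Kafka's client API: the names `Topic`, `Record`, `produce`, and `consume` are hypothetical, and `crc32` stands in for Kafka's own key hashing.

```python
import time
from dataclasses import dataclass, field
from zlib import crc32


@dataclass
class Record:
    offset: int       # position of the record within its partition
    key: str
    value: str
    timestamp: float  # time at which the record was appended


@dataclass
class Topic:
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # Each partition is modeled as an independent append-only list.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def produce(self, key, value):
        # Assumption: hash the key so that equal keys land in the same partition.
        p = crc32(key.encode("utf-8")) % self.num_partitions
        log = self.partitions[p]
        record = Record(len(log), key, value, time.time())
        log.append(record)
        return p, record.offset

    def consume(self, partition, offset, max_records=10):
        # Consumers read sequentially, starting from an offset they track themselves.
        return self.partitions[partition][offset:offset + max_records]
```

Because ordering is tracked per partition, two records with the same key keep their relative order, while records in different partitions have no defined order with respect to each other.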

Kafka runs on a cluster of one or more servers, called brokers, and the partitions of all topics are distributed across the cluster's nodes. Partitions can also be replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages, and it has therefore replaced some conventional messaging systems, such as those based on JMS or AMQP.
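One way to picture how replicated partitions are spread over brokers is a round-robin placement. The function below is a simplified sketch under that assumption; `assign_replicas` is a hypothetical helper, and the placement logic in a real Kafka cluster is more involved than this.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Return {partition: [broker, ...]} using a simple round-robin placement."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        # Start at a different broker for each partition, then take the next
        # brokers in order so that replicas of one partition never share a broker.
        assignment[p] = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
    return assignment


# Example: 4 partitions, 3 brokers, 2 copies of each partition.
layout = assign_replicas(4, ["b0", "b1", "b2"], 2)
```

In this sketch, losing any single broker still leaves at least one copy of every partition on the remaining brokers.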

Kafka supports two types of topics: regular and compacted. Regular topics can be configured with a retention period or a space limit. If records are older than the retention period, or if the space limit for a partition is exceeded, Kafka is allowed to delete the oldest data to free up storage space.
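The retention rule can be sketched as a function over a single partition. This is an illustrative model only: `apply_retention` and its parameters are hypothetical names, and it works record by record, whereas a real broker manages retention at a coarser granularity. In Kafka itself the corresponding topic settings are `retention.ms` and `retention.bytes`.

```python
def apply_retention(log, now, retention_seconds=None, retention_bytes=None):
    """log is a list of (timestamp, payload_bytes) tuples, oldest first."""
    kept = list(log)
    if retention_seconds is not None:
        # Time-based rule: drop records older than the retention period.
        kept = [(ts, payload) for ts, payload in kept
                if now - ts <= retention_seconds]
    if retention_bytes is not None:
        # Size-based rule: drop the oldest records until the partition fits.
        total = sum(len(payload) for _, payload in kept)
        while kept and total > retention_bytes:
            _, payload = kept.pop(0)
            total -= len(payload)
    return kept


sample_log = [(0, b"aaaa"), (50, b"bbbb"), (90, b"cccc")]
by_time = apply_retention(sample_log, now=100, retention_seconds=60)
by_size = apply_retention(sample_log, now=100, retention_bytes=8)
```

Both calls drop only the oldest record here: it is more than 60 seconds old in the first case, and removing it brings the partition from 12 bytes down to the 8-byte limit in the second.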

By default, topics are configured with a retention period of seven days, but data can also be stored indefinitely. In compacted topics, records do not expire based on time or space limits. Instead, Kafka treats later messages as updates to older messages with the same key, and it guarantees never to delete the latest message for a key. Users can delete a key's messages entirely by writing a message with a null value for that key.
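Compaction can be sketched as keeping only the last record for each key. The function below is a minimal model, not Kafka's implementation: `compact` is a hypothetical name, and it drops null-valued records immediately, which a real broker does not necessarily do right away.

```python
def compact(log):
    """log is a list of (key, value) pairs in offset order; value None marks a delete."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        # Later records overwrite earlier ones with the same key.
        latest_offset[key] = offset
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


events = [("a", "1"), ("b", "1"), ("a", "2"), ("b", None), ("c", "1")]
compacted = compact(events)
```

After compaction, key "a" keeps only its newest value, key "b" disappears because its latest record carries a null value, and key "c" is untouched.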

Advantages of using Apache Kafka

  • High throughput. Kafka can process large volumes of data at high speed without requiring extensive hardware. It can sustain a throughput in the order of thousands of messages per second.
  • Low latency. Kafka can handle messages with latency in the range of milliseconds, which is sufficient for most applications.
  • Fault tolerance. Kafka has a built-in ability to cope with the failure of a node or machine within the cluster.
  • Durability. Data and messages are stored on disk. Messages can also be replicated, which protects against data loss.
  • Scalability. Kafka can be scaled by adding nodes without downtime.

Disadvantages of using Apache Kafka

  • Storage cost. Data is often stored twice, which makes storage relatively expensive.
  • Consumer overhead. Adding consumers can slow the system down.