Technological evolution keeps getting faster, and every year the world generates more data. Being able to deal with this scenario and extract the right information can define a company's success.
Technological evolution has given us more connectivity and portability, which has opened up a universe of possibilities for evolving our products. Today we have social networks, IoT (Internet of Things) devices, transaction data from our software, data ingestion from external systems, and more. All of these sources produce data all the time, and to handle such a high volume, your software must be prepared to maintain, process, transform and produce new data in a way that is safe and viable for your business. Apache Kafka is a technology that makes it easier for us to deal with this complexity.
But anyway, what is Kafka?
Apache Kafka is a highly scalable, open-source distributed event streaming platform that aims to provide low-latency processing of data in real time. Like other technologies used in modern systems, such as Kubernetes and Cassandra, Apache Kafka is a distributed system: its processing and its data are distributed across nodes (brokers) that operate as a cluster, guaranteeing fault tolerance and data durability.
As a data streaming platform, Apache Kafka diverges from the batch processing that was so common years ago, bringing a set of tools and solutions that let us deal with a continuous flow of data and process each record as soon as it is generated in our system.
Apache Kafka consists of the following main elements:
Broker: Handles all interactions with clients (Producers, Consumers and Stream Apps) and persists the data;
Zookeeper: Maintains the status and settings of the nodes in your cluster;
Producer: Sends records to brokers;
Consumer: Consumes records from brokers;
Stream App: An application that reads and writes data in your cluster.
How does Kafka deal with my messages?
In Apache Kafka, we refer to messages as records. To understand how these records are sent and consumed, it is necessary to know what topics, partitions and records are.
A topic is a particular data stream in which records are published and consumed. Each topic is subdivided into partitions, and each new record received is allocated to one of these partitions and assigned a sequential code that we call an offset.
A record is composed of a key and a value; every record sent to the same topic with the same key will always be allocated to the same partition.
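That key-to-partition rule can be sketched in a few lines of Python. Kafka's default partitioner actually hashes the key bytes with murmur2; CRC32 is used here only as a stand-in deterministic hash, and the key names and partition count are made up for the illustration:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Illustrative stand-in for Kafka's partitioner: Kafka's default
    hashes key bytes with murmur2; CRC32 is used here only because it
    is a simple, deterministic hash available in the standard library."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Records with the same key always land in the same partition...
assert partition_for("user-42", 6) == partition_for("user-42", 6)

# ...while many different keys spread across the topic's partitions.
used = {partition_for(f"user-{i}", 6) for i in range(100)}
print(sorted(used))
```

The takeaway is that the mapping is a pure function of the key and the partition count, which is what guarantees per-key ordering.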
For each different consumer in our system, Apache Kafka tracks the last offset consumed for that topic/partition, so when that consumer fetches new records, only the records it has not yet consumed are delivered.
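The offset bookkeeping described above can be sketched as a toy in-memory model. None of these class or method names come from Kafka's API; the partition is just an append-only list where a record's index is its offset, and the tracker remembers the next offset to read per consumer:

```python
class PartitionLog:
    """A single topic partition: an append-only list where the index
    of each record is its offset."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

class OffsetTracker:
    """Remembers, per consumer, the next offset to read from a
    partition, so a poll only returns records not yet consumed."""
    def __init__(self):
        self.committed = {}  # (consumer, partition_id) -> next offset

    def poll(self, consumer, partition_id, log):
        start = self.committed.get((consumer, partition_id), 0)
        new_records = log.records[start:]
        self.committed[(consumer, partition_id)] = len(log.records)
        return new_records

log = PartitionLog()
tracker = OffsetTracker()
log.append("payment-created")
log.append("payment-approved")

print(tracker.poll("billing-app", 0, log))  # first poll: both records
log.append("payment-settled")
print(tracker.poll("billing-app", 0, log))  # only the record not yet seen
print(tracker.poll("audit-app", 0, log))    # a different consumer starts at 0
```

Note how each consumer has its own cursor into the same log, so new consumers can replay everything while existing ones only see what is new.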
Partitions are what allow parallelism when an application consumes records from a topic: the maximum number of instances that the same application can have processing the topic's records in parallel is equal to the number of partitions in the topic.
In order for Apache Kafka to understand that different instances are part of the same application, they must be labeled with the same consumer group name. This is how the cluster knows it must balance the processing of partitions between the instances, as shown in the figure below:
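That balancing can be sketched with a simple round-robin-style assignment. Kafka's real assignor strategies (range, round robin, sticky) are more elaborate; this is only a simplification showing partitions being spread as evenly as possible over a group's instances, with invented instance names:

```python
def assign_partitions(partitions, instances):
    """Spread a topic's partitions as evenly as possible across the
    instances of one consumer group (simplified round-robin sketch,
    not Kafka's actual assignor)."""
    assignment = {instance: [] for instance in instances}
    for i, partition in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(partition)
    return assignment

# 6 partitions, 2 instances of the same application (same group name):
print(assign_partitions(list(range(6)), ["instance-a", "instance-b"]))

# More instances than partitions: the extra instance sits idle,
# matching the limit described above.
print(assign_partitions(list(range(3)), ["a", "b", "c", "d"]))
```

The second call illustrates why running more instances than partitions buys nothing: there is no partition left to hand to the fourth instance.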
As a consumer always handles the offsets of a partition sequentially, and records with the same key are always sent to the same partition, Apache Kafka enables parallel processing while still ensuring sequential processing of records that share the same key.
Why should I consider using Kafka?
It is very difficult for our system to operate alone; it needs to integrate with other systems, and often what we face is an integration scenario similar to the following figure:
This diagram illustrates how confusing the integration between systems in companies often is. Using several implementations with different technologies for these integrations was the most common path, which led us to a chaotic scenario that made it impossible to evolve the systems. The situation was made even worse by the scalability limitations of traditional solutions: real-time integration between all of these systems was practically impossible.
Apache Kafka allows us to decouple data processing from the systems themselves, enabling much more efficient integration. The idea is to centralize all asynchronous communication through Kafka, so that you no longer need so many different solutions. With Kafka, our scenario would look like this figure:
That way, once your data is inside Kafka, you can share it with any other system you want. Since Apache Kafka follows the publish/subscribe model, we can publish an event from a given system, and whoever is subscribed to handle that message can respond in real time. Thus, we can integrate our systems in a much more efficient way by communicating through Apache Kafka.
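The decoupling at the heart of publish/subscribe can be sketched as a tiny in-memory bus: producers publish to a topic without knowing who listens, and every subscribed system reacts to the same event. The topic name and the subscriber labels below are invented for the example, and this toy omits everything that makes Kafka valuable (persistence, partitions, distribution):

```python
from collections import defaultdict

class PubSubBus:
    """Toy publish/subscribe hub: publishers and subscribers only
    know topic names, never each other."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

bus = PubSubBus()
received = []
bus.subscribe("orders", lambda e: received.append(("billing", e)))
bus.subscribe("orders", lambda e: received.append(("shipping", e)))
bus.publish("orders", "order-123-created")
print(received)  # both subscribed systems reacted to one published event
```

Adding a third consumer of the "orders" event requires no change to the producer, which is exactly the property that untangles the integration diagram above.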
Why Apache Kafka and not another solution?
Apache Kafka is an open-source project maintained by the Apache Software Foundation, with a large, active community; companies such as Confluent leverage the power of Kafka and extend its functionality.
Apache Kafka has been used at large scale around the world, with hundreds of applications and thousands of software engineers operating independently against the same cluster, transmitting trillions of messages.
This market adoption is proof of Apache Kafka's efficiency: it can handle complex scenarios with high data throughput and deliver real-time processing.
More than 80% of all Fortune 100 companies trust and use Kafka. Among the ten largest companies in each industry, the numbers using Kafka are:
- MANUFACTURING: 10 out of 10;
- BANKS: 7 out of 10;
- INSURANCE: 10 out of 10;
- TELECOM: 8 out of 10.