The Importance of Monitoring Kafka Performance


Overview

As the amount of data flowing through modern systems increases, Kafka serves as the real-time data transfer backbone. But in real-time use cases, issues such as latency and replication failures can jeopardize Kafka's fault tolerance and reliability. Kafka exposes many metrics, of which the four major groups are Kafka server (broker) metrics, Producer metrics, Consumer metrics, and ZooKeeper metrics; these help in monitoring Kafka and handling issues before they can cause major problems.

Introduction

Apache Kafka is a preferred choice of infrastructure whenever users need to work with massive amounts of flowing data. Kafka's reliability, high throughput, speed, and ability to scale have captured the attention of many organizations.

While working with Kafka, it has been observed that no infrastructure is perfect: dependencies on various internal and external factors can delay data transfers and sometimes overwhelm the subscribers. Instances of insufficient replica copies, or of partitions failing to replicate, have also been observed. All of this undermines the fault-tolerance properties offered by Kafka and risks data loss in the event of a server breakdown.

Consumer lag is another major issue. When consumers fail to keep up with the rate at which the publisher pushes messages, latency builds up on the consumer side. An increasing lag between producer and consumer defeats the purpose for companies dealing with real-time data systems.

Hence, monitoring Kafka becomes important. Constant monitoring and regular health checks of the Kafka cluster support the overall performance of the brokers, producers, and consumers.

Key Metrics for Monitoring Kafka

Kafka works on the publish-subscribe messaging model: only after the publisher pushes a message can the subscriber pull it. It therefore becomes crucial to understand how we, as users, can monitor Kafka efficiently.

It is recommended to have a clear picture of what Kafka is before you jump into this article; it helps to understand Kafka Topics, Kafka Partitions, and Kafka Offsets at a deeper level. You can visit What is Kafka to know more.

When a Kafka cluster is functioning properly, it can easily handle vast amounts of data. Hence, it becomes crucial for users to constantly monitor the health of their Kafka deployment. Monitoring Kafka helps maintain reliability and obtain consistent performance from the system.

The Kafka metrics can be segmented into four categories:

  • Kafka server (broker) metrics
  • Producer metrics
  • Consumer metrics
  • ZooKeeper metrics

Kafka can seamlessly deliver data records from publisher to subscriber in real time, which makes any system into which Kafka is integrated highly interactive. As users depend entirely on that system, it becomes important to capture and fix problems before they impact the system and cause more damage. If a Kafka server slows down, we should be able to catch the issue immediately and come up with a solution. Monitoring Kafka makes this possible and helps us stay aligned with business requirements.

In this article, we shall cover all four major metrics needed for monitoring Kafka.

  • Broker metrics help to understand how the requests from clients are handled and how the data is stored. There is usually more than one broker per cluster.
  • Zookeeper metrics help to understand the health and state of the cluster.
  • Producer metrics help to understand the health while the data records are being sent to the broker.
  • Consumer metrics help to understand the health of Kafka when data is received from the broker.

Kafka Server (broker) Metrics

Every message pushed by the publisher travels via a broker before being consumed by the subscriber. Hence, monitoring the performance and health of the broker becomes very important. To understand the Kafka server (broker) metrics, we shall take a deep dive into four major metrics under it:

  • BytesInPerSec/BytesOutPerSec
  • Under-replicated Partitions
  • Leader Election Rate
  • Network Request Rate

BytesInPerSec/BytesOutPerSec

We can define 'BytesInPerSec/BytesOutPerSec' as the rate at which brokers receive data from producers (bytes in) and the rate at which consumers read data from those brokers (bytes out). 'BytesInPerSec/BytesOutPerSec' helps to understand the overall workload of the Kafka cluster.

It is very important to keep 'BytesInPerSec/BytesOutPerSec' at an optimal level, as this directly affects the disk and network throughput of the Kafka cluster. If the broker and its health are correctly monitored, users can mitigate major risks that might not have been evident before. When your system delivers data records across multiple data centers, when topics have a large number of subscribers, or when replicas are catching up with their leaders, good network throughput becomes significant for Kafka performance.

kafka.server:type=BrokerTopicMetrics,name={BytesInPerSec|BytesOutPerSec}
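
As an illustrative sketch, assuming a broker started with JMX enabled (for example, with JMX_PORT=9999), a minimal Java program can read the one-minute rate of this meter over JMX. The host, port, and attribute name below reflect assumptions about the setup, not a prescribed configuration:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BytesInRateProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX on localhost:9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName bytesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");
            // The metric is a meter; OneMinuteRate is the bytes/sec rate
            // averaged over the last minute.
            Object rate = conn.getAttribute(bytesIn, "OneMinuteRate");
            System.out.println("BytesInPerSec (1-min rate): " + rate);
        }
    }
}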

Under-replicated Partitions

We can define this as the number of under-replicated partitions spread across all the topics on the broker. This indicator helps to know when one or more brokers are unavailable. To make sure data stays healthy and easily accessible, each topic is replicated as many times as needed. This also ensures data durability: when one of the brokers fails, the data duplicated across the other brokers remains available for processing.

By tracking the 'UnderReplicatedPartitions' metric, users can detect when the number of in-sync replicas for a defined topic drops below the desired number.

It is recommended that a running Kafka deployment have no under-replicated partitions; this metric's value should be zero.

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
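
Outside of JMX, the kafka-topics tool that ships with Kafka can list under-replicated partitions directly; the broker address below is a placeholder:

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions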

Leader Election Rate

A partition leader election takes place when ZooKeeper is unable to connect with the leader. This metric therefore indicates a broker's unavailability.

When a partition leader dies, a replacement leader is selected via election. A partition leader is declared dead when it fails to maintain its ZooKeeper session. The 'LeaderElectionRateAndTimeMs' metric reflects the rate of leader elections per second, along with the total amount of time, in milliseconds, that the cluster was without a leader.

A leader election is necessary when contact with the existing leader is lost, which indicates that the broker is offline.

kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

Network Request Rate

The network request rate determines the frequency of requests from producers, consumers, and followers. As the Kafka brokers are the main source of network traffic, the main focus is on gathering and transporting the data so that it reaches the subscriber's end for consumption and processing. Users can analyze the number of network requests being made per second, which helps to monitor and compare the network throughput achieved across each cloud provider and data center per server.

Users might need to increase the number of brokers when they see that a Kafka broker's network request rate has exceeded, or its network bandwidth has shrunk below, the specified threshold. If this is not done, subscribers can expect delays. Sometimes this also signals that the subscriber's back-off protocol should be modified to manage data rates more effectively.
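
The per-request-type rate is exposed through the 'RequestsPerSec' metric, for example:

kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}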

Producer metrics

In this section of the article, we shall be learning about the Producer metrics that need to be checked while monitoring Kafka.

We know that producers are the processes in Kafka that push messages into the Kafka commit log to be consumed by the subscribers. When the producers stop working, it is evident that no new messages can be consumed by the subscribers.

Below are a few of the key producer metrics:

  • compression-rate-avg: The average compression rate of the record batches sent.
  • response-rate: The average number of responses received per second, per producer.
  • request-rate: The average number of requests sent per second, per producer.
  • request-latency-avg: The average request latency in milliseconds.
  • outgoing-byte-rate: The average number of outgoing bytes per second.
  • io-wait-time-ns-avg: The average time, in nanoseconds, that the I/O thread spends waiting for a socket.
  • batch-size-avg: The average number of bytes sent per request for each partition.
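
Many of these metrics can also be read directly from a running client, since KafkaProducer exposes them through its metrics() method. A minimal sketch, assuming an already-constructed producer and picking out the producer-level response rate:

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerMetricsProbe {
    // Prints the producer-level response-rate metric, if present.
    static void printResponseRate(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry :
                producer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if (name.group().equals("producer-metrics")
                    && name.name().equals("response-rate")) {
                System.out.println("response-rate = "
                        + entry.getValue().metricValue());
            }
        }
    }
}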

Let us understand the response rate and I/O wait time in detail.

Response Rate

We can define the response rate as the rate at which responses from brokers are received by the producer. The broker sends an acknowledgment to the producer to notify it that the data has been received.

Now, when we say that the data is “received”, we can mean any one of the following three scenarios:

  • The leader received confirmation that the data was written to disk by all the available replicas (request.required.acks == all).
  • The leader has written the data record to its own disk (request.required.acks == 1).
  • The data record was sent, but the producer does not wait for any acknowledgment (request.required.acks == 0).

kafka.producer:type=producer-metrics,client-id=([-.\w]+)
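
Note that request.required.acks is the legacy name for this setting; in the modern Java client it is simply acks. As a minimal, self-contained sketch of a producer configured for the strongest guarantee (the broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader responds only after all in-sync replicas
        // have acknowledged the write.
        props.put("acks", "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}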

I/O Wait Time

We know that it is the producer that sends the data for consumers to subscribe to. The producer has two states: it is either waiting or sending data. When producers generate more data than can be pushed into the commit log, they must wait until sufficient network resources become available.

It can be difficult to tell whether producers are performing well or not; hence, it is best to validate whether a producer is using all of the available bandwidth. With knowledge of disk access, users can understand how processing is going. Validating the I/O wait times on the producer is therefore a wise decision.

The I/O wait tells what percentage of time is spent performing I/O while the CPU is otherwise idle. If a high wait time is seen, it indicates that the producers are unable to get the data they need in a timely manner.

kafka.producer:type=producer-metrics,client-id=([-.\w]+)

Consumer Metrics

Consumer metrics are another important group to consider when monitoring Kafka. They help users understand how efficiently data is being retrieved by consumers, and they help validate and identify any emerging system performance issues.

Given below are the various sub-metrics under the consumer metrics which can be utilized for monitoring Kafka.

  • records-consumed-rate: The average number of records consumed per second, across all topics or for a defined topic.
  • bytes-consumed-rate: The average number of bytes consumed per second for each consumer, across all topics or for a defined topic.
  • records-lag: The number of messages by which the consumer lags behind the producer for a specific partition.
  • records-lag-max: The maximum recorded lag. When the value of records-lag-max increases, it means the consumer is not able to keep up with the producers.
  • fetch-rate: The number of fetch requests the subscriber makes per second.

Let us take a detailed look at one of the most commonly utilized consumer sub-metrics: records-lag.

Records Lag: This is the number of messages by which a consumer lags behind the producer, while records-lag-max is the maximum such difference recorded. The significance of these values depends on the consumers' usage pattern: you might see a larger records lag in scenarios where consumers deliberately read back older messages kept in storage over a longer period. In short, records-lag is the difference between the consumer's current log offset and the producer's current log offset. For consumers processing real-time data, continuously high lag values indicate an overloaded consumer. To enhance throughput and minimize latency, adding consumers and distributing topics across additional partitions can help.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+),partition=([-.\w]+)
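
In practice, consumer lag is often checked with the kafka-consumer-groups tool that ships with Kafka; its output includes CURRENT-OFFSET, LOG-END-OFFSET, and LAG columns per partition. The group name and broker address here are placeholders:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group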

ZooKeeper metrics

ZooKeeper is a crucial component of any Kafka deployment and an important target when monitoring Kafka. ZooKeeper stores important information about brokers and Kafka topics, and taking it down can stop Kafka as well. ZooKeeper helps its users control the amount of traffic passing through the cluster and stores information about the replicas.

Given below are the various sub-metrics under the ZooKeeper metrics which could be utilized for monitoring Kafka.

  • outstanding-requests: The number of requests waiting in the queue.
  • avg-latency: The response time, in milliseconds, to a client request.
  • num-alive-connections: The number of clients connected to ZooKeeper.
  • followers: The number of active followers.
  • pending-syncs: The number of pending syncs with followers.
  • open-file-descriptor-count: The number of file descriptors in use.
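
These values correspond to the counters reported by ZooKeeper's 'mntr' four-letter command (reported with a zk_ prefix, e.g., zk_avg_latency, zk_outstanding_requests). Note that recent ZooKeeper versions require the command to be whitelisted via '4lw.commands.whitelist'. A quick check against a ZooKeeper node:

echo mntr | nc localhost 2181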

Log Flush Latency

In Kafka, data is persisted by appending new records to the log files. To maximize performance and durability, cache-based writes are asynchronously flushed to physical storage, depending on various internal Kafka parameters. The write can be forced with the 'fsync' system function, and with the 'log.flush.interval.messages' and 'log.flush.interval.ms' settings you can tell Kafka when to flush.
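
For illustration, these settings live in the broker's server.properties file; the values below are arbitrary examples, not recommendations:

# Flush to disk after every 10,000 messages...
log.flush.interval.messages=10000
# ...or after 1,000 ms have passed since the last flush, whichever comes first.
log.flush.interval.ms=1000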

However, it is recommended to use the operating system's background flush capabilities rather than Kafka's flush settings. The latency can be calculated by comparing the scheduled flush time with the actual flush time. This can help users recognize the need for more replication, faster storage, or more scale, or reveal a hardware issue.

kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs

Offline Partition Count

We can define offline partitions as data stores that become unavailable to your systems due to an issue such as a system failure or a system restart. In such cases, one of the Kafka brokers takes command as the controller, which maintains the statuses of replicas and partitions and reassigns partitions as required.

Kafka partition replication may need to be increased in scenarios where the number of available partitions in a single cluster fluctuates; this can be caused by temporary factors such as cloud connectivity or network events. When the offline partition count rises above zero, it indicates that the number of brokers should be increased, and that fetches are not able to keep up with the number of messages received.

kafka.controller:type=KafkaController,name=OfflinePartitionsCount

Total Time to Service a Request

To measure the total time taken to service a request, Kafka uses the 'TotalTimeMs' metric.

Below are a few of the service requests:

  • fetch-follower: requests from brokers that are followers of a partition, made to gather new data.
  • produce: requests from producers to send data.
  • fetch-consumer: requests from consumers to gather new data.

The 'TotalTimeMs' metric is the sum of the following four sub-metrics:

  • remote: the time spent waiting for a response from a follower.
  • queue: the amount of time the request spends waiting to be processed.
  • response: the time taken to send the response.
  • local: the time spent being processed by the leader.

Under normal settings, only minor variations should be observed in these values. Whenever unusual behavior is seen, it is recommended that users investigate the individual queue, remote, response, and local values to find the specific request segment that caused the problem.

kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
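
Each of the four components is also exposed as its own MBean under the same 'RequestMetrics' type; for example, for Produce requests (FetchConsumer and FetchFollower follow the same pattern):

kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce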

Conclusion

  • As the amount of data increases, Kafka serves as the real-time data transfer system. But with real-time use cases, issues such as latency and replication failures can jeopardize the fault tolerance and reliability that Kafka offers.

  • outstanding-requests is defined as the number of requests waiting in the queue.

  • Kafka offers various metrics, of which the four major groups are Kafka server (broker) metrics, Producer metrics, Consumer metrics, and ZooKeeper metrics; these help in monitoring Kafka and handling issues before they cause major problems.

  • request-rate is described as the average number of requests sent per second, per producer.

  • Broker metrics help to understand how the requests from clients are handled and how the data is stored. There is usually more than one broker per cluster.

  • Zookeeper metrics help to understand the health and state of the cluster.

  • Producer metrics help to understand the health while the data records are being sent to the broker.

  • Consumer metrics help to understand the health of the Kafka when the data is received from the broker.