What is ZooKeeper & why is it needed for Kafka?
ZooKeeper is a distributed coordination service used in Kafka's cluster architecture. It maintains information about the state of the Kafka cluster, such as the location of each broker, the topics and partitions in the cluster, and the configuration settings for each component.
ZooKeeper helps ensure that brokers, producers, and consumers stay coordinated and work together efficiently to provide reliable and scalable data processing and messaging.
Overall, ZooKeeper is a crucial component of the Kafka cluster architecture: it provides a reliable and scalable way to manage the distributed state of the cluster.
What Does ZooKeeper Do?
In the Kafka cluster architecture, ZooKeeper manages the cluster's state by coordinating actions such as:
- Leader Election: In Kafka, each topic partition has a leader broker that handles all reads and writes for that partition. When the current leader fails, a new leader must be chosen. ZooKeeper coordinates this by electing the controller broker, which in turn selects the new partition leader.
- Topic Configuration: ZooKeeper stores each topic's configuration, including its number of partitions, replication factor, and retention policy.
- Broker Registration: Each Kafka broker registers with ZooKeeper and provides details such as its hostname, IP address, and the topics and partitions it is responsible for.
- Management of Consumer Groups: When a consumer joins or leaves a group, partitions are reassigned so that each partition is assigned to only one member of the group at a time. In older Kafka versions, ZooKeeper tracked this group membership.
- Monitoring: ZooKeeper tracks whether each broker is alive and detects when a broker fails or goes down. As the cluster state changes, it notifies the other brokers that are watching that state.
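The registration and election behaviour described above can be illustrated with a toy, in-memory model. This is only a sketch and not the ZooKeeper client API: `ToyCoordinator` and `ToyBroker` are hypothetical classes invented for illustration. The paths `/brokers/ids/<id>` and `/controller` mirror the znodes Kafka uses. The idea shown is that each broker owns ephemeral nodes tied to its session, the first broker to create `/controller` wins, and when a session expires the remaining brokers are notified and race to take over.

```python
class ToyCoordinator:
    """Hypothetical stand-in for ZooKeeper: stores ephemeral nodes per session."""

    def __init__(self):
        self.nodes = {}     # path -> (owning session, data)
        self.watchers = []  # callbacks called with the path of each deleted node

    def create_ephemeral(self, path, session, data):
        """Create the node if it is absent; only the first creator gets True."""
        if path in self.nodes:
            return False
        self.nodes[path] = (session, data)
        return True

    def expire_session(self, session):
        """Simulate a failed broker: delete its nodes, then notify watchers."""
        removed = [p for p, (owner, _) in self.nodes.items() if owner == session]
        for path in removed:
            del self.nodes[path]
        for path in removed:
            for callback in list(self.watchers):
                callback(path)


class ToyBroker:
    """Hypothetical broker that registers itself and competes to be controller."""

    def __init__(self, broker_id, coordinator):
        self.broker_id = broker_id
        self.zk = coordinator
        self.session = f"session-{broker_id}"
        self.alive = True
        self.is_controller = False

    def start(self):
        self.zk.create_ephemeral(f"/brokers/ids/{self.broker_id}", self.session, "")
        self.zk.watchers.append(self.on_node_deleted)
        self.try_become_controller()

    def try_become_controller(self):
        self.is_controller = self.zk.create_ephemeral(
            "/controller", self.session, str(self.broker_id)
        )

    def on_node_deleted(self, path):
        # Surviving brokers race for the controller role when it becomes vacant.
        if self.alive and path == "/controller":
            self.try_become_controller()

    def crash(self):
        self.alive = False
        self.is_controller = False
        self.zk.expire_session(self.session)


def controller_of(coordinator):
    """Return the broker id stored in /controller, or None if it is vacant."""
    entry = coordinator.nodes.get("/controller")
    return int(entry[1]) if entry else None


def live_brokers(coordinator):
    """Return the sorted ids of all currently registered brokers."""
    prefix = "/brokers/ids/"
    return sorted(int(p[len(prefix):]) for p in coordinator.nodes if p.startswith(prefix))


if __name__ == "__main__":
    zk = ToyCoordinator()
    brokers = [ToyBroker(i, zk) for i in (1, 2, 3)]
    for broker in brokers:
        broker.start()
    print(controller_of(zk), live_brokers(zk))  # 1 [1, 2, 3]
    brokers[0].crash()
    print(controller_of(zk), live_brokers(zk))  # 2 [2, 3]
```

In this toy model the next broker in notification order wins the race. In a real cluster the winner is simply whichever broker's create request reaches ZooKeeper first.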
Optimizing ZooKeeper
Optimizing ZooKeeper for a Kafka cluster is essential to ensure that the system runs properly and can withstand the load placed on it.
- Dedicated ensemble: Running ZooKeeper on a dedicated group of servers improves performance. Using a separate ensemble for the Kafka cluster ensures that ZooKeeper does not compete for resources with other applications running on the same servers.
- Proper hardware: ZooKeeper performs best with fast disks and high-speed networks. High-performance hardware makes it easier to ensure that the system runs properly and can handle the load.
- JVM settings: Tuning the Java Virtual Machine (JVM) parameters can improve ZooKeeper's performance. For instance, increasing the heap size may improve performance by lowering the frequency of garbage collection.
- Network settings: Performance can also be improved by tuning the network settings on the ZooKeeper servers. For instance, raising the size of the TCP backlog and turning off IPv6 can reduce the number of dropped connections and network interruptions.
Monitoring ZooKeeper with Elasticsearch and Kibana
- Apache Kafka is a distributed publish/subscribe messaging system with streaming capability.
- Elasticsearch is a distributed search and analytics engine.
- Kibana is an open-source analytics and visualization platform.
The diagram shows the flow of data between Producers, Kafka, ZooKeeper, Consumers, Elasticsearch, and Kibana when Kafka is used as intended to transfer incoming data to the proper database(s) or target systems.
It demonstrates how data is sent directly from Kafka to Elasticsearch, while Elasticsearch is also used to monitor Kafka's status.
Key ZooKeeper Metrics for Performance Monitoring
In addition to ZooKeeper-specific metrics, it is essential to monitor the operating system and the Java Virtual Machine (JVM) that runs ZooKeeper.
Example metrics to monitor include the following:
1. ZooKeeper Metrics
- Latency: How long it takes ZooKeeper to respond to a client request.
- Requests: The number of outstanding requests, which indicates how many requests ZooKeeper is currently handling.
- Connections: The number of client connections to the ZooKeeper ensemble.
- Number of Watches: ZooKeeper provides watch metrics that track the number of watches being set and triggered. These metrics can be used to monitor the usage of watches and detect potential issues with watch performance or behavior.
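One way to collect these values is ZooKeeper's `mntr` four-letter command, which returns tab-separated key/value lines over the client port. The sketch below parses such output and picks out the four metrics listed above. The helper names (`parse_mntr`, `fetch_mntr`, `headline_metrics`) and the sample values are illustrative assumptions. Depending on the ZooKeeper version, `mntr` may need to be enabled through the `4lw.commands.whitelist` setting, so check the administrator's guide for your release.

```python
import socket

# Illustrative sample of `mntr` output; the values are made up.
SAMPLE_MNTR = (
    "zk_version\t3.6.3\n"
    "zk_avg_latency\t0.4\n"
    "zk_outstanding_requests\t0\n"
    "zk_num_alive_connections\t12\n"
    "zk_watch_count\t57\n"
    "zk_server_state\tleader\n"
)

# Maps the metric names used in this article to `mntr` keys.
ARTICLE_METRICS = {
    "latency_avg": "zk_avg_latency",
    "outstanding_requests": "zk_outstanding_requests",
    "connections": "zk_num_alive_connections",
    "watches": "zk_watch_count",
}


def _to_number(value):
    """Convert to int or float when possible, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_mntr(text):
    """Parse tab-separated `key<TAB>value` lines into a dictionary."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        metrics[key.strip()] = _to_number(value.strip())
    return metrics


def fetch_mntr(host="localhost", port=2181, timeout=5.0):
    """Send `mntr` to a ZooKeeper server and return the parsed metrics.

    The host and port defaults are assumptions; adjust them for your ensemble.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"mntr")
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_mntr(b"".join(chunks).decode("utf-8", errors="replace"))


def headline_metrics(metrics):
    """Pick out the latency, request, connection, and watch metrics."""
    return {name: metrics.get(key) for name, key in ARTICLE_METRICS.items()}


if __name__ == "__main__":
    print(headline_metrics(parse_mntr(SAMPLE_MNTR)))
```

The parsed dictionary can then be shipped to whichever monitoring backend is in use, for example by indexing one document per poll.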
2. Operating System Metrics
- CPU utilization: The share of total CPU time consumed by the ZooKeeper process.
- Memory usage: How much memory the ZooKeeper process is using.
- Disk I/O: How frequently the ZooKeeper process performs disk I/O operations.
3. JVM Metrics
- Heap memory usage: How much of the JVM heap is in use. This metric is crucial for detecting memory leaks or other memory-related issues that might cause the JVM to crash or become unstable.
- Garbage collection time: How long the JVM spends performing garbage collection.
- Thread count: The number of threads the JVM is using.
Running Kafka Without ZooKeeper
Apache ZooKeeper is a distributed coordination service that provides a set of primitives for building reliable and scalable distributed systems. Kafka has relied on it for the following:
Controller election:
ZooKeeper elects the controller. When a node goes down, the controller makes sure that other replicas take over leadership of the partitions that the failed node was leading.
Cluster membership:
ZooKeeper maintains a list of all active brokers in the cluster.
Topic configuration:
ZooKeeper stores the configuration of every topic, including the list of available topics, the number of partitions for each topic, the location of the replicas, the preferred leader node, configuration overrides for topics, and more.
Access control lists (ACLs):
ZooKeeper also stores the ACLs for all topics, which specify who or what is permitted to read from and write to each topic. For older ZooKeeper-based consumers, it additionally holds the list of consumer groups, the members of each group, and the most recent offset each consumer group has received from each partition.
Quotas:
ZooKeeper stores the quotas that limit how much data each client is permitted to read and write.
Since the release of version 2.8.0 in April 2021, it has been possible to run Kafka without ZooKeeper by using a built-in quorum controller (KRaft mode). However, this version lacks certain essential functions and is not yet ready for production use.
With the quorum controller, the Kafka cluster itself can perform all of the metadata tasks that were previously handled by ZooKeeper and the Kafka controller.
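As a rough illustration of what a ZooKeeper-less setup involves, the sketch below renders a minimal server configuration for a node acting as both broker and quorum controller in Kafka's ZooKeeper-less (KRaft) mode. The property names (`process.roles`, `node.id`, `controller.quorum.voters`, `controller.listener.names`) follow Kafka's sample KRaft configuration as I understand it. The function `kraft_properties`, the ports, and the log directory are assumptions for illustration, so verify everything against the documentation for your Kafka version.

```python
def kraft_properties(node_id, voters, host="localhost",
                     broker_port=9092, controller_port=9093,
                     log_dir="/tmp/kraft-combined-logs"):
    """Render properties for a combined broker + controller node.

    `voters` is a list of (node_id, host, controller_port) tuples describing
    every member of the controller quorum. All defaults are illustrative.
    """
    quorum = ",".join(f"{vid}@{vhost}:{vport}" for vid, vhost, vport in voters)
    lines = [
        # This node serves client traffic and takes part in the metadata quorum.
        "process.roles=broker,controller",
        f"node.id={node_id}",
        # Replaces the ZooKeeper connection string: the controller quorum members.
        f"controller.quorum.voters={quorum}",
        f"listeners=PLAINTEXT://{host}:{broker_port},CONTROLLER://{host}:{controller_port}",
        "controller.listener.names=CONTROLLER",
        "inter.broker.listener.name=PLAINTEXT",
        "listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT",
        f"log.dirs={log_dir}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Single-node example: node 1 is the only quorum voter.
    print(kraft_properties(1, [(1, "localhost", 9093)]), end="")
```

Note that no `zookeeper.connect` property appears: the quorum voters list takes over the role the ZooKeeper connection string plays in a ZooKeeper-based deployment.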
FAQ
Q: Should you use ZooKeeper with Kafka brokers?
A: Yes. Because running Kafka without ZooKeeper is not yet suitable for production in the 2.8.0 release, use ZooKeeper in production Apache Kafka installations.
Q: Should you use ZooKeeper with Kafka clients?
A: No. Kafka clients and the CLI tools have migrated to using the Kafka brokers as their connection endpoint instead of ZooKeeper, which makes the move away from ZooKeeper invisible to clients. ZooKeeper is less secure and should only be open to traffic from Kafka brokers, not from Kafka clients.
Conclusion
- When it comes to distributed coordination and management of the Kafka cluster, ZooKeeper is an important component of the Kafka architecture.
- Kafka uses ZooKeeper to choose a controller node, govern broker membership, and keep track of topic partitions and their leaders.
- ACLs (access control lists) that specify who may access and alter various components of the Kafka cluster are also stored in ZooKeeper.
- For fault tolerance and high availability, Kafka depends on ZooKeeper. Without ZooKeeper, the Kafka cluster will stop functioning, and in the worst case data may be lost.
- To provide a stable and dependable Kafka cluster, using ZooKeeper with Kafka brokers is strongly advised.