ZooKeeper – Applications

ZooKeeper provides a flexible coordination infrastructure for distributed environments, and many large-scale industrial applications rely on it. This chapter discusses some of the most notable applications of ZooKeeper.

Yahoo!

The ZooKeeper framework was originally built at Yahoo!, where engineers needed distributed applications to meet requirements such as data transparency, good performance, robustness, centralized configuration, and coordination. The ZooKeeper framework was designed to meet these requirements.

Apache Hadoop

Apache Hadoop is a major driving force behind the growth of the Big Data industry. Hadoop relies on ZooKeeper for configuration management and coordination. Consider the following scenario to understand the role of ZooKeeper in Hadoop.

Assume that a Hadoop cluster spans 100 or more commodity servers. Because a computation involves a large number of nodes, the nodes need to synchronize with one another, know where to access services, and know how they should be configured. In other words, the cluster needs cross-node coordination and naming services. ZooKeeper provides the facilities for cross-node synchronization and ensures that tasks across Hadoop projects are serialized and synchronized.
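
As a rough illustration of the naming and configuration services described above, the following minimal Python sketch models a shared hierarchical namespace in memory. It is not the ZooKeeper client API: the ToyZnodeStore class, the paths, and the values are all hypothetical and chosen only for illustration.

```python
class ToyZnodeStore:
    """In-memory stand-in for a ZooKeeper-style namespace (illustration only)."""

    def __init__(self):
        # Maps an absolute path to the bytes stored at that node.
        self._data = {"/": b""}

    def create(self, path, value=b""):
        # A node can only be created under an existing parent.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent node missing: {parent}")
        if path in self._data:
            raise ValueError(f"node already exists: {path}")
        self._data[path] = value

    def get(self, path):
        return self._data[path]

    def get_children(self, path):
        # Return the names of the direct children of `path`, sorted.
        prefix = path.rstrip("/") + "/"
        children = []
        for p in self._data:
            rest = p[len(prefix):]
            if p.startswith(prefix) and rest and "/" not in rest:
                children.append(rest)
        return sorted(children)


store = ToyZnodeStore()

# Centralized configuration: one value that every node reads.
store.create("/config")
store.create("/config/replication", b"3")

# Naming service: a well-known path that tells nodes where a service lives.
store.create("/services")
store.create("/services/namenode", b"10.0.0.5:8020")

# Any worker in the cluster resolves the same settings from the same paths.
replication = int(store.get("/config/replication"))
namenode_addr = store.get("/services/namenode").decode()
top_level = store.get_children("/")
```

In a real deployment the namespace lives on the ZooKeeper servers rather than in one process, so every node in the cluster sees the same values.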

Multiple ZooKeeper servers support large Hadoop clusters. Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. Some real-world examples of Hadoop in use are −

  • Human Genome Project − The Human Genome Project contains terabytes of data. The Hadoop MapReduce framework can be used to analyze the dataset and find interesting facts for human development.
  • Healthcare − Hospitals can store, retrieve, and analyze huge sets of patient medical records, which normally run to terabytes.

Apache HBase

Apache HBase is an open-source, distributed NoSQL database that provides real-time read/write access to large datasets and runs on top of HDFS. HBase follows a master-slave architecture in which the HBase Master governs all the slaves, which are referred to as region servers.

A distributed HBase installation depends on a running ZooKeeper cluster. Apache HBase uses ZooKeeper to track the status of distributed data across the master and region servers, relying on centralized configuration management and distributed mutex mechanisms. Here are some of the use-cases of HBase −

  • Telecom − The telecom industry stores billions of mobile call records (around 30 TB per month), and accessing these call records in real time becomes a huge task. HBase can be used to process all the records in real time, easily and efficiently.
  • Social network − Similar to the telecom industry, sites like Twitter, LinkedIn, and Facebook receive huge volumes of data through the posts created by users. HBase can be used to find recent trends and other interesting facts.
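
The status tracking between the HBase master and the region servers mentioned above can be pictured with ZooKeeper's ephemeral nodes, which the service removes when the session that created them ends. The sketch below is a simplified in-memory model in Python, not the ZooKeeper or HBase API; the ToyEnsemble class and the /hbase/rs/... paths are assumptions made for illustration.

```python
class ToyEnsemble:
    """In-memory model of sessions and ephemeral nodes (illustration only)."""

    def __init__(self):
        self._ephemerals = {}   # path -> id of the session that owns it
        self._next_session = 1

    def open_session(self):
        session_id = self._next_session
        self._next_session += 1
        return session_id

    def create_ephemeral(self, session_id, path):
        if path in self._ephemerals:
            raise ValueError(f"node already exists: {path}")
        self._ephemerals[path] = session_id

    def close_session(self, session_id):
        # Ephemeral nodes disappear together with their owning session.
        self._ephemerals = {
            path: owner
            for path, owner in self._ephemerals.items()
            if owner != session_id
        }

    def children(self, parent):
        prefix = parent.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self._ephemerals
                      if p.startswith(prefix))


ensemble = ToyEnsemble()

# Each region server opens a session and registers itself.
rs1_session = ensemble.open_session()
rs2_session = ensemble.open_session()
ensemble.create_ephemeral(rs1_session, "/hbase/rs/rs1")
ensemble.create_ephemeral(rs2_session, "/hbase/rs/rs2")

# The master lists the children to learn which servers are alive.
live_before = ensemble.children("/hbase/rs")

# rs1 crashes or disconnects: its session ends and its node vanishes.
ensemble.close_session(rs1_session)
live_after = ensemble.children("/hbase/rs")
```

Because a failed server's registration vanishes with its session, the master only needs to read the list of children to find out which servers are still available.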

Apache Solr

Apache Solr is a fast, open-source search platform written in Java. Built on top of Lucene, it is a high-performance, full-featured, fault-tolerant distributed text search engine.

Solr makes extensive use of ZooKeeper features such as configuration management, leader election, node management, locking, and data synchronization.
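
Leader election, one of the features listed above, is commonly built on ZooKeeper by having each candidate create a sequentially numbered ephemeral node; the candidate holding the lowest number acts as leader, and the next-lowest takes over when it leaves. The Python sketch below models only that rule in memory. It is not Solr's implementation or the ZooKeeper API, and the ToyElection class and node names are hypothetical.

```python
class ToyElection:
    """In-memory model of lowest-sequence-number leader election."""

    def __init__(self):
        self._counter = 0        # next sequence number to hand out
        self._candidates = {}    # sequence number -> candidate name

    def join(self, name):
        # Each candidate receives a unique, increasing sequence number.
        seq = self._counter
        self._counter += 1
        self._candidates[seq] = name
        return seq

    def leave(self, seq):
        # Models the candidate's node being removed (crash or shutdown).
        self._candidates.pop(seq, None)

    def leader(self):
        # The candidate with the smallest sequence number is the leader.
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ToyElection()
seq_a = election.join("solr-node-a")
seq_b = election.join("solr-node-b")
seq_c = election.join("solr-node-c")

first_leader = election.leader()

# The current leader goes away; the next-lowest candidate takes over.
election.leave(seq_a)
second_leader = election.leader()
```

Sequence numbers are never reused in this model, so a node that rejoins goes to the back of the queue rather than reclaiming leadership.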

Solr has two distinct parts: indexing and searching. Indexing is the process of storing data in a suitable format so that it can be searched later. Solr uses ZooKeeper both for indexing data across multiple nodes and for searching across multiple nodes. ZooKeeper contributes the following features −

  • Addition and removal of nodes as and when needed
  • Replication of data between nodes, which minimizes data loss
  • Sharing of data between multiple nodes, so that searches can run across multiple nodes and return results faster

Some of the use-cases of Apache Solr include e-commerce and job search.
