Processing of Hadoop using Highly Available NameNode

1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar
Department of Computer Engineering, Smt. Kashibai Navale College of Engineering, Pune, India
1 deshpandeakash1@gmail.com, 2 shrikant.badwaik@gmail.com, 3 saileenalawade29@gmail.com, 4 anjalibt456@gmail.com, 5 shyamkosbatwar@gmail.com

Abstract: Hadoop is an open source programming framework that enables the processing of large amounts of data in a distributed computing environment. Instead of depending on expensive hardware and separate systems for storing and processing data, Hadoop enables parallel processing of Big Data. Organizations need to go through terabytes and petabytes of data to answer particular queries and demands of users, and existing tools were becoming inadequate to process such large data sets. Hadoop solved the problem of handling such huge amounts of data. Hadoop has two core components: 1. the Hadoop Distributed File System (HDFS), which stores data across the nodes in a cluster; 2. MapReduce, a software framework used for distributed computing and processing of large amounts of data in the Hadoop cluster. In the past few years, HDFS has been used by many organizations with gigantic data sets and streams of operations on them. HDFS provides distinct features such as high fault tolerance and scalability. However, the NameNode machine is a single point of failure (SPOF) for an HDFS cluster: if the NameNode machine fails, the system needs to be restarted manually, making the system less available. This paper proposes a highly available architecture, and its working principle, for the HDFS NameNode to address this SPOF.

Keywords: HDFS, MapReduce, NameNode, SPOF

I. INTRODUCTION

There are many organizations in the world, such as Facebook, Google and Yahoo, that own very large amounts of data on the internet. The main challenge for such organizations is to store and analyze these huge quantities of data. Storing and processing such large amounts of data requires special hardware machines, which is not a good deal in terms of money, and the processing capability of a single machine cannot scale to handle such huge data. Every single day, huge amounts of data are added from various fields such as science, economics, engineering and commerce. The need to analyze such huge amounts of data for a better understanding of users' needs has led to the development of data-intensive applications and storage clusters. The basic requirement of these applications is a highly available, highly scalable and reliable cluster-based storage system. Hadoop is an open source framework used for processing such very large data sets. Hadoop provides a software framework for reliable and scalable distributed computing. Hadoop does this job mainly with the help of the Hadoop Distributed File System (HDFS) and MapReduce, which perform storage and computation on a cluster of commodity computers.

II. EXISTING SYSTEM

HDFS has done its part of the work efficiently in Hadoop since it was developed. HDFS works with two types of machines: the DataNode (slave machine), on which the application's data is stored, and the NameNode (master/server machine), which stores the metadata of the file system. The application's data is stored on DataNodes in small blocks of data.

The NameNode is the only machine that stores the metadata of the file system, and it is therefore the Single Point of Failure (SPOF) for HDFS. The SPOF of the NameNode machine affects the overall availability of Hadoop. Hadoop provides high-bandwidth access to application data and suits applications having large data sets. The core idea is to move the applications closer to where the data is located and perform the computations there. HDFS can scale up to tens of thousands of low-cost hardware machines on which data is stored in blocks and across which computation can be partitioned using MapReduce. MapReduce is a programming model used for processing very large data sets stored on HDFS clusters. The NameNode is the central point of focus in the HDFS architecture, as it is the only server machine that stores the metadata of the file system; it manages the namespace and grants access to files at the request of clients, whereas the DataNode is the machine that stores the application's data. The data on a DataNode is stored in small blocks which are produced by splitting large files.

A. NameNode

The NameNode is the master of HDFS that directs the slave DataNodes to perform the low-level I/O tasks. The NameNode keeps track of how files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system. The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically does not store any user data or perform any computations for a MapReduce program, to lower the workload on the machine. This means that the NameNode server does not double as a DataNode or a TaskTracker.

B. DataNode

Each slave machine in a cluster hosts a DataNode to perform the grunt work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system. The DataNodes can communicate with each other to rebalance data, move and copy data around, and keep the replication high. A client can communicate directly with a DataNode to process the local files corresponding to the blocks.

[Fig 1.1 HDFS Architecture: the NameNode holds the metadata (name, replicas, ...) and handles metadata and block operations, while clients read from DataNodes on Rack 1 and Rack 2, across which blocks are replicated.]

C. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS has a high degree of fault tolerance and is usually intended for deployment on low-cost hardware. As shown in Fig 1.1, HDFS has a master-slave architecture, with a single master called the NameNode and multiple slaves called DataNodes. The NameNode manages and stores the metadata of the file system. The metadata is maintained in the main memory of the NameNode to ensure fast responses to clients on read/write requests. DataNodes store files and service read/write requests on them, as directed by the NameNode. The files stored in HDFS are replicated onto a configurable number of DataNodes to ensure reliability and data availability. These replicas are distributed across the cluster to enable rapid computations. HDFS provides efficient access to data and is appropriate for applications having big data sets.
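To make the client/NameNode/DataNode interaction concrete, the following is a minimal sketch of an HDFS read using the standard Hadoop Java FileSystem API. The NameNode address, port and file path are illustrative placeholders only, not values prescribed in this paper.

    // Minimal HDFS read client (sketch). The NameNode is contacted only for
    // metadata; the block contents are streamed directly from the DataNodes.
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative address; in practice this is taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);
            // open() asks the NameNode which DataNodes hold the blocks of the file.
            try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {
                // The stream then reads those blocks directly from the DataNodes.
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }

The client never ships data through the NameNode; the NameNode only answers the metadata lookup, which is why keeping it available is so critical.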

Files in HDFS are divided into smaller blocks, typically of 64 MB, and each block is replicated and stored on multiple DataNodes. The NameNode maintains the metadata for each file stored in HDFS in its main memory. This includes a mapping between stored file names, the corresponding blocks of each file, and the DataNodes that host these blocks. Hence, every request by a client to create, write, read or delete a file passes through the NameNode. Using the stored metadata, the NameNode directs every client request to the appropriate set of DataNodes. The client then communicates directly with those DataNodes to perform the file operations.

D. MapReduce

MapReduce is a programming model whose open-source implementation is provided by Hadoop. MapReduce programs comprise a Map phase and a Reduce phase, each of which defines a mapping from one set of key-value pairs to another. MapReduce programs can be written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel, with Map tasks running concurrently, followed by their corresponding Reduce tasks that operate on data stored in the HDFS file system. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once an application is written in the MapReduce form, scaling it to run over hundreds or thousands of machines in a cluster is merely a configuration change. This scalability has attracted many programmers to the MapReduce model (a minimal word-count example in this style is sketched at the end of Section III). The Hadoop MapReduce framework comprises two types of components that control the job execution process: a central component called the JobTracker and a number of distributed components called TaskTrackers. Based on the location of a job's input data, the JobTracker schedules tasks to run on some TaskTrackers and coordinates their execution of the job. TaskTrackers run the tasks assigned to them and send progress reports to the JobTracker.

[Fig 1.2 HDFS Architecture: Secondary NameNode, NameNode and JobTracker]

III. AVAILABILITY OF HDFS ARCHITECTURE

The architecture of HDFS is designed to deal with frequent hardware failure, as the core theme of Hadoop and cluster computing is that they run on low-cost commodity hardware machines, which are more susceptible to failure. Hence, the architecture of Hadoop was designed to deal with these failures. The other important feature of Hadoop is its scalability: it can scale up to tens of thousands of machines. If a single hardware machine has some probability of failure, then with tens of thousands of such machines the probability of a hardware failure somewhere in the cluster is tens of thousands of times greater than with that single machine. Hadoop was designed to have features like reliability, availability, fault tolerance, affordability and scalability. Hadoop exhibits all of these features very well except availability, which it has compromised a bit. The availability of Hadoop is very good when it comes to DataNodes, as blocks of data are replicated on three different DataNodes. But when it comes to the NameNode, Hadoop's availability is always at stake. If the NameNode machine goes down, the whole system becomes offline, as the NameNode is the only entity that handles client queries and responds accordingly; it also keeps the precious metadata of the file system. Hence, with a single NameNode machine, the availability of Hadoop is always at stake, because that machine can go down at any time.
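As a concrete illustration of the mapper/reducer structure described in Section II.D, the following is a minimal word-count sketch against the standard Hadoop Java MapReduce API; the class names and input/output paths are illustrative and are not part of the proposed system.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: one input line -> (word, 1) pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: (word, [1, 1, ...]) -> (word, count).
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Only the mapper and reducer are problem-specific; everything else is job configuration, which is why scaling the same program from a handful of machines to thousands is essentially a configuration change.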

IV. NAMENODE'S SINGLE POINT OF FAILURE

A single point of failure (SPOF) is a part of a system that, if it fails, stops the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, a software application, or another industrial system. The single point of failure in a Hadoop cluster is the NameNode. While the loss of any other machine (intermittently or permanently) does not result in data loss, NameNode loss results in cluster unavailability, and the permanent loss of the NameNode data would render the cluster's HDFS inoperable. Therefore, an additional step should be taken in this configuration to back up the NameNode metadata. Large Hadoop clusters have thousands of DataNodes and one NameNode. The probability of failure goes up linearly with the machine count, so if Hadoop did not cope with DataNode failures it would not scale. Since there is still only one NameNode, the Single Point of Failure (SPOF) remains.

To eliminate this problem of the HDFS NameNode, we have devised an architecture for HDFS and its working principle. This architecture eliminates the SPOF of the NameNode by initially replicating the NameNode onto separate machines (in our case two machines) only once, after which the system works normally with its DataNodes. The core idea of this architecture is that once the NameNode has been replicated, it performs all operations on its metadata using the well-known Two-Phase Commit (2PC) Protocol. Initially one NameNode acts as the Coordinator NameNode, and its replicas are the Participant NameNode machines. The Coordinator and Participant machines perform every update to their metadata according to the 2PC protocol. Because of the 2PC protocol, both replicas of the Coordinator NameNode are kept fully updated at all times; hence, in case the Coordinator NameNode fails or even goes down completely, any of its replicas can become the Coordinator and host the DataNodes. The new Coordinator can be elected using the Bully election algorithm.

V. PROPOSED SYSTEM

To eliminate the Single Point of Failure (SPOF) of the Hadoop HDFS NameNode, we have devised an architecture for the NameNode and its working principle under which the problem of the NameNode SPOF is eliminated and the availability of the NameNode is ensured.

A. NameNode Replication

To increase the availability of any system, the simplest technique that can be used is redundancy. As the initial step, we have done the same thing by replicating the NameNode onto two different machines, which are called Participant NameNodes. In our case we have taken a replication factor of two, but it can be varied according to the degree of availability required. Hence, when the NameNode starts for the first time, it is replicated only once onto the Participant NameNode machines. Once replication is complete, the system works under the principle of the Two-Phase Commit Protocol with some modifications. Before discussing our proposed modified form of the 2PC protocol, the classic 2PC protocol is sketched below.
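The following is a minimal, in-process sketch of the classic protocol's decision rule: the coordinator sends a vote-request to every participant, decides global-commit only if all participants vote to commit, and otherwise decides global-abort. The class and message names are illustrative only; this is not code from the proposed system, and real deployments add timeouts, logging and recovery.

    import java.util.List;

    // Classic two-phase commit, simulated in-process (no network, no failure handling).
    public class TwoPhaseCommitDemo {
        enum Vote { VOTE_COMMIT, VOTE_ABORT }
        enum Decision { GLOBAL_COMMIT, GLOBAL_ABORT }

        interface Participant {
            Vote onVoteRequest(String update);   // Phase 1: participant votes on the update
            void onDecision(Decision decision);  // Phase 2: participant applies the outcome
        }

        static Decision coordinate(String update, List<Participant> participants) {
            // Phase 1: collect a vote from every participant before deciding.
            boolean allCommit = true;
            for (Participant p : participants) {
                if (p.onVoteRequest(update) == Vote.VOTE_ABORT) {
                    allCommit = false;
                }
            }
            // Phase 2: broadcast a single global decision to all participants.
            Decision decision = allCommit ? Decision.GLOBAL_COMMIT : Decision.GLOBAL_ABORT;
            for (Participant p : participants) {
                p.onDecision(decision);
            }
            return decision;
        }

        public static void main(String[] args) {
            Participant alwaysYes = new Participant() {
                public Vote onVoteRequest(String u) { return Vote.VOTE_COMMIT; }
                public void onDecision(Decision d) { System.out.println("participant sees " + d); }
            };
            System.out.println(coordinate("create /foo", List.of(alwaysYes, alwaysYes)));
        }
    }

Note the blocking nature of the classic scheme: the coordinator waits for every participant's vote before deciding. It is exactly this waiting time that the modified protocol below relaxes.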
B. Two Phase NameNode Commit (2PNNC) Protocol

In our proposed solution, once the NameNode is replicated, the NameNode and its replicas work under the principle of the Two-Phase Commit (2PC) Protocol. We have not used exactly the same 2PC protocol, but have made some variations to it according to our needs. Here the NameNode is called the Coordinator NameNode and its replicas are known as Participant NameNode machines. Every Participant NameNode is associated with an update-queue that can hold multiple vote-requests from the Coordinator. Any of the Participants can send a vote-commit to the Coordinator as soon as it receives a vote-request and its update-queue is empty. In this case the Coordinator need not wait for all of the vote-commit replies from the Participants; rather, it continues by sending global-commit against a single vote-commit from any of the Participants. The remaining Participants can perform their commit operations later, but there must be at least one Participant that updates itself with the Coordinator at the same time. After a vote-request from the Coordinator, if the update-queue is partially filled then the Participant can accept the vote-request, but it will not send vote-commit to the Coordinator until it has performed all the pending operations in its update-queue. The Participant sends vote-abort in the case when its update-queue is full.
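The participant-side rule just described can be summarized in a small sketch. This is only an illustration of the queueing behaviour under assumed names (UpdateQueue capacity, a DEFERRED reply standing in for "vote-commit will be sent later"); the paper does not publish code for the protocol.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative participant-side decision in the 2PNNC variant described above:
    // vote-commit only once the update-queue has been drained, vote-abort when it is full.
    public class ParticipantNameNode {
        enum Vote { VOTE_COMMIT, VOTE_ABORT, DEFERRED }

        private final int capacity;
        private final Deque<String> updateQueue = new ArrayDeque<>();

        ParticipantNameNode(int capacity) { this.capacity = capacity; }

        // Phase 1b: react to a vote-request carrying a metadata update.
        Vote onVoteRequest(String update) {
            if (updateQueue.size() >= capacity) {
                return Vote.VOTE_ABORT;          // queue full: refuse the request
            }
            updateQueue.addLast(update);         // accept the request into the queue
            if (updateQueue.size() == 1) {
                applyAllPending();               // queue was empty: catch up immediately
                return Vote.VOTE_COMMIT;         // and reply vote-commit right away
            }
            return Vote.DEFERRED;                // vote-commit follows once the queue drains
        }

        // Apply queued metadata updates in order; a deferred vote-commit can then be sent.
        private void applyAllPending() {
            while (!updateQueue.isEmpty()) {
                updateQueue.removeFirst();       // apply this update to the local metadata copy (omitted)
            }
        }
    }

Because only one Participant needs to answer vote-commit for the Coordinator to proceed, the Coordinator's waiting time is reduced relative to classic 2PC, while lagging Participants catch up asynchronously from their queues.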

The 2PNNC protocol for our proposed solution is explained below:

Phase 1a. The Coordinator NameNode sends a vote-request to all Participant NameNodes.

Phase 1b. When any Participant NameNode receives a vote-request from the Coordinator, it replies with either vote-commit or vote-abort. It sends vote-abort if its update-queue is full of previous vote-requests; otherwise it keeps the vote-request in its update-queue and sends vote-commit to the Coordinator once its update-queue has become empty by performing all previously requested updates.

Phase 2a. The Coordinator NameNode collects the votes from the Participants. If it gets vote-commit from any one of its Participant NameNodes, it commits its update operation and sends global-commit to all Participant NameNodes. If it receives vote-abort from any of its Participant NameNodes, it aborts its update operation and sends global-abort to all Participant NameNodes.

Phase 2b. Every Participant NameNode machine waits for the Coordinator's global-commit or global-abort and responds accordingly.

[Fig 1.3 Participant NameNode working principle under the 2PNNC protocol: state transitions from the initial state through the ready state to the commit or abort state, driven by vote-request, vote-commit/vote-abort, global-commit/global-abort and ACK messages.]

VI. CONCLUSION

In this paper we have proposed an implementation of Hadoop that deals with the SPOF problem of the NameNode machine. This will greatly increase the availability of Hadoop by eliminating the NameNode's SPOF. We have utilized two well-known techniques from distributed systems, the 2PC protocol and the Bully election algorithm, with some variations to these techniques that reduce their waiting time and increase their throughput. Both of these techniques are widely deployed and have been adopted successfully.
