Fault Tolerance in Hadoop for Work Migration

Shivaraman Janakiraman
Indiana University Bloomington

ABSTRACT

Hadoop is a framework that runs applications on large clusters built from commodity hardware. The framework transparently provides applications with both data motion and fault tolerance. It implements the computational paradigm called MapReduce, which splits an application into many tasks that can be executed or re-executed on any node in the cluster. The MapReduce framework and the Hadoop Distributed File System (HDFS) run on the same set of nodes and therefore provide very high aggregate bandwidth across the cluster. Both are designed so that the framework automatically handles node failures.

Keywords: Hadoop, MapReduce, HDFS

INTRODUCTION

Hadoop MapReduce is a framework for executing applications that process vast amounts of data (terabytes) in parallel on large clusters of nodes in a reliable, fault-tolerant manner. Although it can run on a single machine, its true power lies in its ability to scale to thousands of systems, each with several processor cores. Hadoop is designed to distribute data efficiently across the nodes in the cluster, and it includes a distributed file system that handles the distribution of these huge data sets across the cluster.

The MapReduce framework splits the input of a job into a number of chunks which the Map tasks process in parallel. The outputs of the Map tasks are sorted by the framework and given to the Reduce tasks as input. Both the input and the output of a job are stored in a file system. The framework takes care of scheduling the tasks, monitoring them, and re-executing any tasks that fail.

The MapReduce framework and the Hadoop Distributed File System run on the same set of nodes; that is, the compute nodes and the storage nodes are the same. This setup allows computation to be performed on the nodes where the data already resides, resulting in efficient utilization of bandwidth across the cluster.

Each cluster has exactly one JobTracker, a daemon service for submitting and tracking MapReduce jobs in Hadoop. It is therefore a single point of failure for the MapReduce service: if it goes down, all running jobs are halted. The slaves are configured with the node location of the JobTracker and perform tasks as directed by it. Each slave node runs a single TaskTracker, which keeps track of its task instances and notifies the JobTracker of their status.

Applications specify their input and output locations and supply the Map and Reduce functions by implementing the appropriate interfaces and abstract classes. The job configuration comprises these and other parameters. The Hadoop job client submits the job and its configuration to the JobTracker, which distributes the configuration to the slaves, schedules the tasks, and monitors them. It then provides a job report to the job client, consisting of status and diagnostic information about the tasks.
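As a concrete illustration of this submission flow, the sketch below shows a minimal job driver written against Hadoop's Java MapReduce API. It is a hedged example, not taken from the paper: the class name WordCountDriver, the Mapper and Reducer class names (TokenizerMapper and IntSumReducer, sketched in the next section), and the command-line paths are all illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // Job configuration: Map/Reduce classes, output key/value types, input/output paths.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");       // newer releases use Job.getInstance(conf, name)
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // illustrative Mapper, sketched later
    job.setReducerClass(IntSumReducer.class);    // illustrative Reducer, sketched later
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submission: the job client hands the job and its configuration to the cluster,
    // which schedules the tasks and reports their status back to the client.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```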
THE HADOOP APPROACH

In a Hadoop cluster, data is distributed to all the nodes as it is loaded in. The Hadoop Distributed File System (HDFS) splits large data sets into chunks which are managed independently by the nodes in the cluster. Each chunk is replicated across nodes so that a failure at one point does not halt the job; the affected work is executed or re-executed on another node in the cluster. An active monitoring system tracks the status of these chunks so that a failure in processing any chunk can be detected and reported. Although the file chunks are replicated and distributed across several nodes, they form a single namespace and are universally accessible.

In the Hadoop programming framework, data is conceptually record-oriented. The input files are broken into lines or into some other format depending on the application logic. Each process on a node then handles a subset of these records. The Hadoop framework schedules these processes in proximity to the location of the data, using knowledge gained from the distributed file system. Since the files are distributed as chunks across the nodes, each process operates on a subset of the data, and which subset a node processes depends on the locality of the data to that node. Most of the data is read from the local disk into the CPU, which relieves the network by moving computation to where the data resides. This movement of computation to the data is one of the primary features of Hadoop; it utilizes bandwidth effectively and thus yields high performance.

Figure 1: Data distributed across various nodes at load time.

MAPREDUCE

One of the primary features of Hadoop is that it limits the amount of communication involved. In Hadoop, programs written to distribute such large amounts of data conform to a programming model called MapReduce. In MapReduce, records are processed in isolation by tasks called Mappers. The output of the Mappers is given to a second set of tasks called Reducers, which produce the final output of the job. The following diagram illustrates how Mappers and Reducers work:

Figure 2: Mappers and Reducers

As Figure 2 suggests, the Mappers read their input from the Hadoop Distributed File System (HDFS) and perform their computation. The output of the Mappers is partitioned by key and sent to the Reducers. The Reducers sort the input from the Mappers by key, and the reduce output is written back to HDFS.
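To make the Mapper and Reducer roles concrete, the sketch below shows a word-count Mapper and Reducer written against the standard Hadoop Java API. The class and field names are illustrative; the paper does not prescribe a particular example.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each record (here, a line of text) is processed in isolation; the Mapper
// emits (word, 1) pairs, and the framework partitions its output by key.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reducer: receives all values for a given key, already sorted by the framework,
// sums them, and writes the final (word, count) pair; the output goes to HDFS.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```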
COMMUNICATION

As mentioned earlier, one important advantage of Hadoop is that it limits the amount of communication involved. Still, the nodes in the cluster have to communicate with each other at some point. Unlike other programming models such as MPI, where the application developer must explicitly specify the bytes to be streamed to other nodes, in Hadoop this is done implicitly. Each piece of data is tagged with a key name, which Hadoop uses to route related bits of information to the destination node. Hadoop internally manages the data transfer and all cluster topology issues.

This limiting of communication between nodes makes the system more reliable. Individual node failures can be handled by restarting tasks on some other node. Since user-level tasks do not communicate with each other, no messages need to be exchanged between the user programs. Even if there is a failure at one node, the other nodes continue to work as if nothing went wrong; the failure is taken care of by the underlying Hadoop layer.

HADOOP ARCHITECTURE

HDFS has a master/slave architecture. An HDFS cluster has a master called the NameNode, which manages the filespace naming and regulates access to data files. Each node in the cluster contains at least one DataNode, which manages the storage of data on that node. A file is split internally into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode performs the filesystem namespace operations, such as opening, closing, and renaming files and directories, and it maps the data blocks to the DataNodes. The following figure explains the HDFS architecture [6]:

Figure 3: HDFS architecture with the NameNode and DataNode

A client using HDFS supplies a file, which is split into blocks; the NameNode assigns a block id to each block, records the DataNodes to which the blocks are mapped, and returns both the block ids and the DataNode locations to the client. The client then accesses the data directly from the DataNodes that hold the blocks.

HDFS supports a traditional hierarchical file organization. The HDFS filesystem namespace is similar to other existing file systems: one can create, move, or remove files and directories. However, HDFS does not implement access permissions and does not support hard links or soft links. The NameNode maintains the file system namespace and records any change to the namespace or its properties. The NameNode also stores the number of replicas of each file, called the replication factor of that file; an application can specify the replication factor for its files.
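The client-side interaction just described can be sketched with Hadoop's FileSystem API. This is a minimal, assumed example: the path and the written content are illustrative, and the block and DataNode mechanics are handled entirely by the library behind these calls.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // namespace calls go to the NameNode
    Path file = new Path("/user/hadoop/example.txt");  // illustrative path

    // Write: the NameNode allocates blocks and chooses DataNodes;
    // the client streams the data to those DataNodes.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the NameNode returns block ids and DataNode locations;
    // the client reads the blocks directly from the DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}
```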
FAULT TOLERANCE

HDFS is designed so that it can store very large files reliably. Each file is stored as a sequence of blocks; all blocks have the same size except for the last one. These blocks are replicated for fault tolerance. The block size and replication factor are configurable per file. As mentioned earlier, an application can specify the number of replicas for each file; the replication factor can be specified at file creation time and changed at any point afterwards.

The NameNode makes all decisions regarding the replication of blocks. It periodically receives two kinds of reports from each DataNode: a Heartbeat, which indicates that the DataNode is functioning properly, and a BlockReport, which contains the list of blocks stored on that DataNode. The NameNode makes its replication decisions based on these reports.

The placement of replicas is crucial to performance, and the optimized placement of replicas is what distinguishes HDFS from other filesystems. The following figure illustrates data replication [6]:

Figure 4: Data Replication

The NameNode determines the rack id to which each DataNode belongs. The replicas are placed in such a way that even if an entire rack fails there is no loss of data. This policy also distributes data evenly, which simplifies load balancing.
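The per-file block size and replication factor described above can be set through the same FileSystem API. The sketch below is an assumed illustration: the path, buffer size, block size, and replica counts are made-up values, not figures from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hadoop/big.dat");   // illustrative path

    short replication = 3;                  // number of replicas kept for each block
    long blockSize = 64L * 1024 * 1024;     // 64 MB blocks (illustrative value)

    // Block size and replication factor chosen at file creation time.
    try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
      out.writeBytes("payload\n");
    }

    // The replication factor of an existing file can be changed at any time;
    // the NameNode then schedules extra copies or removes excess replicas.
    fs.setReplication(file, (short) 5);
  }
}
```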
PERFORMANCE

One of the major performance benefits of Hadoop compared to other distributed systems is its flat scalability curve. Hadoop does not perform very well on a small number of nodes because of the high overhead of starting Hadoop programs. Distributed systems such as MPI perform well on two, four, or even a dozen machines, but the price paid in performance and engineering effort increases nonlinearly as the number of machines grows. Programs written in other distributed frameworks require a great deal of refactoring to scale from ten machines to one hundred or one thousand; this can involve rewriting the program several times and may even impose a cap on the scale to which an application can grow.

Hadoop, by contrast, is designed to provide a flat scalability curve. Very little work on the program is required to scale it up on commodity hardware; orders of magnitude of growth can be handled with very little re-work on the application. The underlying Hadoop platform manages the data and the hardware resources and provides dependable performance growth proportionate to the number of machines available. The following graph illustrates the flat scalability curve achieved by Hadoop.

Figure 5: Flat scalability curve achieved by Hadoop

RELATED WORK

Some research has been directed at the implementation and evaluation of performance in Hadoop [4][12][7]. Ranger et al. implemented MapReduce for shared-memory systems; their system, Phoenix, provides scalable performance on both multi-core and conventional symmetric multiprocessors [12]. Bingsheng et al. developed Mars, a MapReduce framework for graphics processors [4]; the goal of Mars was to hide the programming complexity of the GPU behind a simple MapReduce interface. Zaharia et al. implemented a new scheduler, LATE, in Hadoop to improve MapReduce performance by speculatively executing the tasks that hurt response time the most [11]. Asymmetric multi-core processors (AMPs) address the I/O bottleneck issue, using double-buffering and asynchronous I/O to support MapReduce functions in clusters with asymmetric components [10]. Chao et al. classified MapReduce workloads into three categories based on CPU and I/O utilization [13]; they designed the Triple-Queue Scheduler in light of a dynamic MapReduce workload prediction mechanism called MR-Predict. Although the above techniques can improve MapReduce performance on heterogeneous clusters, they do not take into account data locality and data movement overhead.

There are two types of file systems that handle large files for clusters, namely parallel file systems and Internet service file systems [3]. The Hadoop Distributed File System (HDFS) [2] is a popular Internet service file system that provides the right abstraction for data processing in MapReduce frameworks.

SUMMARY

The following table summarizes the features of Hadoop discussed in this paper.

Feature            Description
MapReduce          The programming model to which programs written to distribute large amounts of data in the Hadoop framework conform.
HDFS               The distributed file system that Hadoop uses for filespace naming and for handling files: reading, writing, and deleting.
Communication      Hadoop limits the amount of communication involved by moving computation to the nodes where the data already exists.
Fault Tolerance    Hadoop achieves fault tolerance by means of data replication; an application can specify the number of replicas (the replication factor) for each file.
Flat Scalability   As the number of machines increases, Hadoop achieves a flat scalability curve, requiring very little re-work on the application.

CONCLUSION

In this paper, a survey of Hadoop has been presented with regard to its performance, its scalability, and its advantages over other distributed systems.

REFERENCES

[1] Apache Hadoop. http://lucene.apache.org/hadoop.
[2] Parallel Virtual File System, version 2. http://www.pvfs2.org.
[3] A scalable, high performance file system. http://lustre.org.
[4] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. ACM, 2008.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.
[6] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. The Apache Software Foundation, 2007.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, pages 137-150, 2004.
[9] H. Yang, A. Dasdan, R. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM, 2007.
[10] M. Rafique, B. Rose, A. Butt, and D. Nikolopoulos. Supporting MapReduce on large-scale asymmetric multi-core clusters. SIGOPS Operating Systems Review, 43(2):25-34, 2009.
[11] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI '08: 8th USENIX Symposium on Operating Systems Design and Implementation, October 2008.
[12] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 13-24, 2007.
[13] T. Chao, H. Zhou, Y. He, and L. Zha. A Dynamic MapReduce Scheduler for Heterogeneous Workloads. IEEE Computer Society, 2009.
[14] W. Tantisiriroj, S. Patil, and G. Gibson. Data-intensive file systems for Internet services: A rose by any other name. Technical Report CMU-PDL-08-114, Carnegie Mellon University Parallel Data Lab, October 2008.