Fault Tolerance in Hadoop for Work Migration




Shivaraman Janakiraman, Indiana University Bloomington

ABSTRACT

Hadoop is a framework that runs applications on large clusters built from large numbers of commodity machines. The framework provides applications with data motion and fault tolerance transparently. It implements the computational paradigm called MapReduce, which splits an application into tasks that can be executed or re-executed on any node in the cluster. The MapReduce framework and the Hadoop Distributed File System (HDFS) run on the same set of nodes and hence provide very high aggregate bandwidth across the cluster. Both are designed so that the framework automatically handles node failures.

Keywords: Hadoop, MapReduce, HDFS

INTRODUCTION

Hadoop MapReduce is a framework for executing applications over vast amounts of data (terabytes) in parallel, on large clusters with many nodes, in a reliable and fault-tolerant manner. Though it can run on a single machine, its true power lies in its ability to scale to thousands of systems, each with several processor cores. Hadoop is designed to distribute data efficiently across the nodes in the cluster: it includes a distributed file system that spreads huge data sets across those nodes. The MapReduce framework splits a job into chunks that the Map tasks process in parallel. The outputs of the map tasks are sorted by the framework and given to the Reduce tasks as input. Both the input and the output of the tasks are stored in a file system. The framework takes care of scheduling the tasks, monitoring them, and re-executing failed tasks.
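The split-sort-reduce flow just described can be sketched in miniature. The following Python sketch is illustrative only (Hadoop's real API is Java, and the function names here are invented for the example):

```python
from itertools import groupby

# Miniature sketch of the flow described above: map tasks process input
# records independently, the framework sorts map output by key, and
# reduce tasks consume the grouped values.

def map_fn(line):
    # Emit (word, 1) for every word in one input record (a line).
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all intermediate values for one key.
    return (key, sum(values))

def run_job(records):
    # Map phase: each record is processed in isolation.
    intermediate = [kv for record in records for kv in map_fn(record)]
    # Shuffle/sort phase: the framework sorts map output by key.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce call per key with its grouped values.
    return [reduce_fn(k, (v for _, v in grp))
            for k, grp in groupby(intermediate, key=lambda kv: kv[0])]

results = run_job(["big data", "big clusters"])
# results == [("big", 2), ("clusters", 1), ("data", 1)]
```

Because each map call sees only its own record, the framework is free to re-execute any map or reduce task on another node after a failure, which is what makes this model fault-tolerant.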
The MapReduce framework and the Hadoop Distributed File System run on the same set of nodes, that is, the compute nodes and the storage nodes are the same. This setup lets computation run on the nodes where the data already resides, which makes efficient use of the bandwidth across the cluster. Each cluster has exactly one JobTracker, a daemon service for submitting and tracking MapReduce jobs in Hadoop. It is therefore a single point of failure for the MapReduce service: if it goes down, all running jobs are halted. The slaves are configured with the node location of the JobTracker and perform tasks as directed by it. Each slave node runs one TaskTracker, which keeps track of task instances and notifies the JobTracker of their status. Applications specify their input and output and supply their Map and Reduce functions by implementing the appropriate interfaces and abstract classes; these and other parameters make up the job configuration. The Hadoop job client submits the job and its configuration to the JobTracker, which distributes the configuration to the slaves, schedules the tasks, and monitors them. The JobTracker then returns a job report, consisting of status and diagnostic information about the tasks, to the job client.
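The submission flow above can be modeled as a toy simulation. This is purely illustrative Python; the class names mimic, but do not reproduce, Hadoop's actual daemons or their Java API:

```python
# Toy model of the flow described above: a client hands a job
# configuration to the JobTracker, which farms tasks out to
# TaskTrackers and collects their status into a job report.

class TaskTracker:
    def __init__(self, name):
        self.name = name

    def run(self, task):
        # A real TaskTracker launches the task and heartbeats its status.
        return {"task": task, "node": self.name, "status": "SUCCEEDED"}

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, job_conf):
        # Distribute the configuration and schedule one task per chunk,
        # round-robin across the slave nodes.
        report = []
        for i, chunk in enumerate(job_conf["chunks"]):
            tracker = self.trackers[i % len(self.trackers)]
            report.append(tracker.run(chunk))
        return report  # status and diagnostic info for the job client

jt = JobTracker([TaskTracker("slave1"), TaskTracker("slave2")])
report = jt.submit({"chunks": ["split0", "split1", "split2"]})
# report holds one status entry per task, tagged with the slave that ran it.
```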

THE HADOOP APPROACH

In a Hadoop cluster, data is distributed to all the nodes as it is loaded in. The Hadoop Distributed File System (HDFS) splits large data sets into chunks that are managed independently by the nodes in the cluster. Each chunk is replicated across nodes so that a failure at one point does not halt the job: the work can be executed or re-executed on another node in the cluster. An active monitoring system tracks the status of these chunks and reports when the execution of any chunk fails. Though the file chunks are replicated and distributed across several nodes, they form a single namespace and are universally accessible.

In the Hadoop programming framework, data is conceptually record-oriented. Input files are broken into lines or some other format appropriate to the application logic, and each process on a node executes a subset of these records. Using knowledge gained from the distributed file system, the Hadoop framework schedules these processes in proximity to the data. Since files are distributed in chunks across various nodes, each process works on the subset of the data that is local to its node. Most data is read from the local disk straight into the CPU, which relieves the burden on the network by moving the computation to where the data exists. This movement of computation to the data is one of Hadoop's primary features, helping it use bandwidth effectively and achieve high performance.

Figure 1: Data distributed across various nodes at load time.

MAPREDUCE

One of the primary features of Hadoop is that it limits the amount of communication involved. In Hadoop, programs written to process such large amounts of data conform to a programming model called MapReduce.
In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is passed to a second set of tasks called Reducers, which produce the final output of the job. The following diagram illustrates how Mappers and Reducers work:

Figure 2: Mappers and Reducers

As Figure 2 suggests, the Mappers read their input from the Hadoop Distributed File System (HDFS) and perform their computation. The output from the Mappers is partitioned by key and sent to the Reducers. Each Reducer sorts its input from the Mappers by key, and the reduce output is written back to HDFS.
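The key-based routing just described can be sketched as follows. This mirrors the idea behind Hadoop's default hash partitioner, but the code is an illustration, not Hadoop's implementation:

```python
# Every (key, value) pair from a Mapper is assigned to a Reducer by
# hashing the key, so all pairs sharing a key reach the same Reducer.

def partition(key, num_reducers):
    # Within one process, equal keys always hash to the same bucket.
    return hash(key) % num_reducers

num_reducers = 4
map_output = [("apple", 1), ("banana", 1), ("apple", 1)]

buckets = {r: [] for r in range(num_reducers)}
for key, value in map_output:
    buckets[partition(key, num_reducers)].append((key, value))

# Both ("apple", 1) pairs land in the same bucket, whichever one that is.
```

Because the mapping from key to reducer is deterministic, no coordination between mappers is needed to agree on where a key's values should be gathered.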

COMMUNICATION

As mentioned earlier, one important advantage of Hadoop is that it limits the amount of communication involved. Still, the nodes in the cluster have to communicate with each other at some point. Unlike programming models such as MPI, where the application developer has to explicitly specify the bytes to be streamed between nodes, in Hadoop this is done implicitly: each piece of data is tagged with a key name, which Hadoop uses to send related bits of information to the destination node. Hadoop internally manages the data transfer and all cluster topology issues. Limiting the communication between nodes makes the system more reliable, because individual node failures can be handled by restarting tasks on other nodes. Since user-level tasks do not communicate with each other, no messages are exchanged between user programs. Even if one node fails, the other nodes work as if nothing went wrong, and the failure is taken care of by the underlying Hadoop layer.

HADOOP ARCHITECTURE

HDFS has a master/slave architecture. An HDFS cluster has a master called the NameNode, which manages the filesystem namespace and regulates access to data files. Each node in the cluster runs a DataNode that manages the storage of data on that node. A file is internally split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode performs the filesystem namespace operations, such as opening, closing, and renaming files, and it maps the data blocks to the DataNodes. The following figure shows the HDFS architecture [6]:

Figure 3: HDFS architecture with the NameNode and DataNode

When a client stores a file in HDFS, the file is split into blocks, the NameNode tags each block with a block id, and the client is given both the block ids and the locations of the DataNodes to which the blocks are mapped.
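The NameNode bookkeeping described above can be sketched minimally: it stores only metadata, mapping a file name to block ids and each block id to the DataNodes holding a replica, while clients fetch the block data from the DataNodes directly. Class and field names here are illustrative, not HDFS internals:

```python
# Toy NameNode holding file-to-block and block-to-DataNode metadata.

class NameNode:
    def __init__(self):
        self.file_to_blocks = {}   # file name -> [block id, ...]
        self.block_locations = {}  # block id  -> [DataNode, ...]

    def add_file(self, name, block_ids, placements):
        self.file_to_blocks[name] = list(block_ids)
        for bid, nodes in zip(block_ids, placements):
            self.block_locations[bid] = list(nodes)

    def locate(self, name):
        # What a client receives: each block id with its DataNode
        # locations; the client then reads the data from the DataNodes.
        return [(bid, self.block_locations[bid])
                for bid in self.file_to_blocks[name]]

nn = NameNode()
nn.add_file("logs.txt", ["blk_1", "blk_2"],
            [["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]])
# nn.locate("logs.txt") lists both blocks with their replica locations.
```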
The client then accesses the data from the DataNodes containing the blocks. HDFS supports a traditional hierarchical file organization: its namespace is similar to that of existing file systems in that one can create, edit, or remove files and directories. However, HDFS does not implement access permissions and does not support hard links or soft links. The NameNode maintains the file system namespace and records any change to the namespace or its properties. It also stores the number of replicas of each file, called the replication factor of that file; an application can specify the replication factor for each file.

FAULT TOLERANCE

HDFS is designed so that it can distribute very large files reliably. Each file is stored as a sequence of blocks, all the same size except for the last one. These blocks are replicated for fault tolerance, and both the block size and the replication factor are configurable per file. As mentioned earlier, an application can specify the number of replicas for each file; the replication factor can be set at file creation time and changed at any later point. The NameNode makes all decisions regarding the replication of blocks. It periodically receives two reports from each DataNode: a HeartBeat, indicating that the DataNode is functioning properly, and a BlockReport, listing the blocks held by that DataNode. The NameNode bases its replication decisions on these reports. The placement of replicas is crucial for performance, and optimized replica placement is what distinguishes HDFS from other filesystems. The following figure illustrates data replication [6]:

Figure 4: Data Replication

The NameNode determines the rack id to which each DataNode belongs. Replicas are placed so that even if an entire rack fails, no data is lost. This policy also distributes data evenly, which simplifies load balancing.

PERFORMANCE

One of the major benefits of Hadoop over other distributed systems is its flat scalability curve. Hadoop does not perform very well on a small number of nodes, because starting Hadoop programs carries a high overhead compared to other distributed systems. Distributed systems such as MPI perform well on two, four, or even a dozen machines, but the price paid in performance and engineering effort grows nonlinearly as the number of machines increases. Programs written in other distributed frameworks require a great deal of refactoring to scale from ten machines to hundreds or thousands; this can involve rewriting the programs several times, and may even cap the scale to which an application can grow. Hadoop, by contrast, is designed to provide a flat scalability curve: very little rework on the application program is needed to scale it up on commodity hardware, so orders of magnitude of growth can be handled with very little effort. The underlying Hadoop platform manages the data and hardware resources and provides dependable performance growth proportionate to the number of machines available. Figure 5 illustrates the flat scalability curve achieved by Hadoop.
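The repair cycle described in the fault-tolerance section above (HeartBeats, BlockReports, and rack-aware re-replication) can be sketched as follows. All names, the topology, and the selection policy are invented for illustration; HDFS's actual placement policy is more involved:

```python
# Sketch: the NameNode notices via missed HeartBeats that a DataNode is
# dead, finds blocks whose live replica count fell below the replication
# factor, and picks a new target node, preferring a rack with no replica.

def find_under_replicated(block_replicas, live_nodes, target=3):
    # Blocks whose surviving replicas number fewer than the target.
    under = {}
    for blk, nodes in block_replicas.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target:
            under[blk] = live
    return under

def pick_target(live_replicas, live_nodes, racks):
    # Prefer a live node on a rack that has no replica of this block,
    # so a whole-rack failure cannot destroy every copy.
    used_racks = {racks[n] for n in live_replicas}
    for node in sorted(live_nodes):
        if node not in live_replicas and racks[node] not in used_racks:
            return node
    # Otherwise fall back to any live node without a replica.
    for node in sorted(live_nodes):
        if node not in live_replicas:
            return node
    return None

racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
block_replicas = {"blk_1": ["dn1", "dn2", "dn3"]}
live_nodes = {"dn1", "dn2", "dn4"}  # dn3 stopped sending HeartBeats

repairs = {blk: pick_target(live, live_nodes, racks)
           for blk, live in
           find_under_replicated(block_replicas, live_nodes).items()}
# repairs == {"blk_1": "dn4"}: the new replica goes to the surviving rack r2.
```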

Figure 5: Flat scalability curve achieved by Hadoop

RELATED WORK

Some research has been directed at the implementation and evaluation of Hadoop's performance [4][12][7]. Ranger et al. implemented MapReduce for shared-memory systems; their system, Phoenix, provides scalable performance on both multi-core and conventional symmetric multiprocessors [12]. Bingsheng et al. developed Mars, a MapReduce framework for graphics processors [4]; the goal of Mars is to hide the programming complexity of the GPU behind a simple MapReduce interface. Zaharia et al. implemented a new scheduler for Hadoop, LATE, which improves MapReduce performance by speculatively executing the tasks that hurt response time the most [11]. Asymmetric multi-core processors (AMPs) address the I/O bottleneck, using double buffering and asynchronous I/O to support MapReduce functions in clusters with asymmetric components [10]. Chao et al. classified MapReduce workloads into three categories based on CPU and I/O utilization [13]; they designed the Triple-Queue Scheduler around a dynamic MapReduce workload prediction mechanism called MR-Predict. Although these techniques can improve MapReduce performance on heterogeneous clusters, they do not take data locality and data movement overhead into account.

There are two types of file systems that handle large files for clusters: parallel file systems and Internet service file systems [3]. The Hadoop Distributed File System (HDFS) [2] is a popular Internet service file system that provides the right abstraction for data processing in MapReduce frameworks.

CONCLUSION

In this paper, a survey has been made of Hadoop with regard to its performance, its scalability, and its advantages over other distributed systems.

SUMMARY
MapReduce: The programming model to which programs written to distribute large amounts of data in the Hadoop framework conform.

HDFS: The distributed file system Hadoop uses for filespace naming and for handling files, such as reading, writing, and deleting them.

Communication: Hadoop limits the amount of communication involved by moving computation to the nodes where the data exists.

Fault Tolerance: Hadoop achieves fault tolerance through data replication; an application can specify the number of replicas, called the replication factor, for each file.

Flat Scalability: As the number of machines grows, Hadoop maintains a flat scalability curve, requiring very little rework to scale an application up.

REFERENCES

[1] http://lucene.apache.org/hadoop.

[2] Parallel Virtual File System, version 2. http://www.pvfs2.org.

[3] Lustre: a scalable, high-performance file system. http://lustre.org.

[4] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. ACM, 2008.

[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.

[6] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. The Apache Software Foundation, 2007.

[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04, pages 137-150, 2004.

[9] H. Yang, A. Dasdan, R. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM, 2007.

[10] M. Rafique, B. Rose, A. Butt, and D. Nikolopoulos. Supporting MapReduce on large-scale asymmetric multi-core clusters. SIGOPS Oper. Syst. Rev., 43(2):25-34, 2009.

[11] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI '08: 8th USENIX Symposium on Operating Systems Design and Implementation, October 2008.

[12] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In International Symposium on High-Performance Computer Architecture, pages 13-24, 2007.

[13] T. Chao, H. Zhou, Y. He, and L. Zha. A dynamic MapReduce scheduler for heterogeneous workloads. IEEE Computer Society, 2009.

[14] W. Tantisiriroj, S. Patil, and G. Gibson. Data-intensive file systems for Internet services: a rose by any other name. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-114, October 2008.