Hadoop Scheduler with Deadline Constraint



Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3, Neha Sniha 4

1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, Karnataka 560054, India
2 Department of Computer Science, Government College (UG & PG), Anantapur, A.P., India
3 Department of Computer Science and Engineering, JNTU College of Engineering, Pulivendula, Kadapa district, A.P., India

DOI: 10.5121/ijccsa.2014.4501

Abstract

MapReduce is a popular programming model for running data-intensive applications on the cloud. In Hadoop, jobs are scheduled in FIFO order by default, yet many MapReduce applications require strict deadlines, and the Hadoop framework does not include a scheduler with deadline constraints. Existing schedulers do not guarantee that a job will be completed by a specific deadline; some address the issue of deadlines but focus more on improving system utilization. We propose an algorithm that lets the user specify a job's deadline and evaluates whether the job can be finished before that deadline. The resulting deadline scheduler for Hadoop ensures that only jobs whose deadlines can be met are scheduled for execution. If a submitted job cannot satisfy the specified deadline, physical or virtual nodes can be added dynamically to complete the job within the deadline [8].

Keywords: Hadoop, MapReduce, Hadoop schedulers, HDFS, NameNode, DataNode, JobTracker, TaskTracker, deadline

1. INTRODUCTION

Apache Hadoop is an open source implementation of Google's MapReduce [1] parallel processing framework. The details of parallel processing, including the distribution of data to processing nodes, the restarting of failed subtasks, and the consolidation of results after computation, are hidden by the Hadoop framework, which allows developers to write parallel programs that focus on the computation itself. Hadoop includes: 1) the Hadoop Distributed File System (HDFS) [2], a distributed file system that stores large amounts of data on clusters with high-throughput access; and 2) Hadoop MapReduce, a software framework for processing distributed data on clusters.
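To make the programming model concrete, the listing below is the canonical WordCount program for the Hadoop 1.x mapreduce API; WordCount is also the benchmark job used in the evaluation later in this paper. This is the stock example distributed with Hadoop, reproduced here as an illustration, not code from the paper.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count"); // Hadoop 1.x style
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }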

Hadoop is a reliable, scalable, distributed computing platform developed by Apache [5]. It is an open source implementation of Google's MapReduce framework that allows the distributed processing of huge data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. At the application layer, the library is designed to detect and handle failures rather than relying on hardware to deliver high availability, so a highly available service can be delivered on top of a cluster of computers, each of which may be prone to failure.

Hadoop has an underlying storage system called HDFS, the Hadoop Distributed File System. To process the data in HDFS, Hadoop provides a MapReduce engine that runs on top of HDFS. This engine has a master-slave architecture: the master node is called the JobTracker and the slaves are called TaskTrackers. MapReduce jobs are automatically parallelized across a large set of TaskTrackers. The JobTracker splits a job into several map and reduce tasks, and Hadoop divides the input to the job into fixed-size pieces called input splits. The output of each map task is stored on the local disk; this intermediate output serves as input to the reduce tasks. The consolidated output of the reduce tasks is the output of the MapReduce job and is stored in HDFS.

As Hadoop jobs have to share the cluster resources, a scheduling strategy is used to determine when a job can execute its tasks. Since the pluggable scheduler was introduced, several scheduling algorithms have been developed for it, such as the FIFO Scheduler, the Fair Scheduler [3], and the Capacity Scheduler [4].

2. FOUNDATION

A. Problem Definition

Can a given task that translates to a MapReduce job J, which has to process data of size N, be completed within a deadline D when run in a MapReduce cluster having M nodes with a fixed number of map and reduce task slots and possibly k jobs executing at the time [8]?

B. Deadline Estimation Model

We develop an initial estimation model based on a set of assumptions:
1) The cluster consists of homogeneous nodes, so that the unit processing cost for each map or reduce node is equal.
2) The key distribution of the input data is uniform, so that each reduce node gets an equal amount of reduce data to process.
3) Reduce tasks start after all map tasks have completed.
4) The input data is already available in HDFS.

To derive the expressions for the minimum number of map tasks and reduce tasks, we extend the model used in [8] for the Equal Load Partitioning technique. To estimate the duration of a job we consider the map completion time, the reduce completion time, and the data transfer during the reduce copy phase. Hadoop supports pluggable schedulers, and we implemented the Constraint Scheduler using the minimum task scheduling criteria. It is developed as a contrib module using the Hadoop version 1.1.2 source code, and the Hadoop configuration file needs to be modified to use the scheduler. We also implemented a web-based interface that allows the user to specify the deadline for a given job.

C. Working

The job submission process implemented by JobSubmitter does the following: the user submits the job (step 1); the scheduler computes the minimum numbers of map and reduce slots required for the job to complete by its deadline (step 2) and runs the schedulability test (step 3); if the test fails, the user is notified to enter another deadline value, and if it passes, the job is scheduled (step 4).
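The slot computations in step 2 can be made explicit. The equations themselves were lost in this transcription, so the following is a reconstruction of the Equal Load Partitioning expressions from [8], with notation assumed rather than quoted. Let A be the job arrival time, D the deadline relative to arrival, \sigma the input data size, c_m, c_r and c_d the costs of map processing, reduce processing and data transfer per unit of data, f the filter ratio (map output size over map input size), and s_r the reduce start time:

    \[
    s_r^{\max} = (A + D) - f\sigma c_d - f\sigma c_r, \qquad
    n_m^{\min} = \left\lceil \frac{\sigma c_m}{s_r - A} \right\rceil, \qquad
    n_r^{\min} = \left\lceil \frac{f\sigma c_r}{(A + D) - s_r - f\sigma c_d} \right\rceil.
    \]

A job is schedulable only if s_r^{\max} \ge A and the minimum slot counts n_m^{\min} and n_r^{\min} do not exceed the free map and reduce slots in the cluster; these correspond to the two failure conditions in the pseudocode of Section 3.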

Fig. 1 Working of Deadline Scheduler

3. DESIGN AND IMPLEMENTATION

A. Design Goals

The design goals for the Constraint Scheduler were:
1) To give users immediate feedback on whether a job can be completed within the given deadline, and to proceed with execution if the deadline can be met; otherwise, users have the option to resubmit with modified deadline requirements.
2) To maximize the number of jobs that can be run in the cluster while satisfying the time requirements of all jobs.

B. Sequence Diagram

Fig. 2 Working of Deadline Scheduler
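The scheduler pseudocode below reads the job's deadline with getFloat(), which suggests the deadline travels to the JobTracker inside the job configuration. The paper provides a web interface for this; the snippet below is only a hypothetical client-side equivalent, and the property name constraint.scheduler.deadline is invented for illustration.

    // Hypothetical: attach a deadline (in seconds) to a job before submission.
    // Configuration.setFloat is standard Hadoop API; the property name is not.
    org.apache.hadoop.conf.Configuration conf = job.getConfiguration();
    conf.setFloat("constraint.scheduler.deadline", 45.0f);
    job.submit(); // the Constraint Scheduler then runs its schedulability test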

C. Deadline Estimation Model: Constraint Fair Scheduler

    ScheduleJob(Cluster c)
        freeMapSlots <- c.freeMapSlots
        freeReduceSlots <- c.freeReduceSlots
        REPEAT
            j <- nextJob(pool)
            arrivalTime <- getStartTime(j)
            mapCostPerUnit <- getMapCostPerUnit()
            reduceCostPerUnit <- getReduceCostPerUnit()
            shuffleCostPerUnit <- getShuffleCostPerUnit()
            mapInputOutputFraction <- getMapInputOutputCostPerUnit()
            reduceStartTime <- getReduceStartTime()
            deadline <- getFloat()
            inputSize <- 0
            REPEAT
                inputSize <- inputSize + length(inputSplit)
            UNTIL endOf(inputSplits)
            compute maxReduceStartTime
            IF maxReduceStartTime < arrivalTime
                THROW EXCEPTION
            compute minMapSlots
            compute minReduceSlots
            IF minMapSlots > freeMapSlots OR minReduceSlots > freeReduceSlots
                THROW ConstraintFailException
        UNTIL endOf(jobs in pool)

    ConstraintFairSchedulerParamsEstimator(job wordcount)
        Run the randomgenerator job to produce a random output data set
        REPEAT n times
            Run the wordcount job, taking the random generator output as its input
            Compute the parameters: total map cost, total reduce cost, total IO fraction,
                total time to start reduce, total time for job completion
        Compute the average of each parameter
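The heart of ScheduleJob is the schedulability test. The sketch below shows how the "compute minMapSlots / minReduceSlots" steps could look in Java under the model of Section 2.B; the class and method names are hypothetical, and the formulas are our reconstruction of [8], not the paper's published code. The main method replays the 500 MB WordCount case from the evaluation, assuming costs in seconds per KB and an illustrative reduce start time of 16 s.

    public class DeadlineEstimator {

        public static class ConstraintFailException extends Exception {
            public ConstraintFailException(String message) { super(message); }
        }

        // Schedulability test for one job. Times are in seconds; sigma is the input
        // size expressed in the same data unit as the per-unit costs.
        public static int[] minSlots(double sigma, double cm, double cr, double cd,
                                     double f, double arrival, double deadline,
                                     double reduceStart,
                                     int freeMapSlots, int freeReduceSlots)
                throws ConstraintFailException {
            // Latest feasible reduce start: shuffle plus reduce work must fit before the deadline.
            double maxReduceStart = deadline - f * sigma * (cd + cr);
            if (maxReduceStart < arrival) {
                throw new ConstraintFailException("deadline infeasible for any slot count");
            }
            // All map work (sigma * cm seconds in total) must finish by the reduce start time.
            int minMapSlots = (int) Math.ceil(sigma * cm / (reduceStart - arrival));
            // All reduce work must fit between reduce start plus shuffle and the deadline.
            int minReduceSlots = (int) Math.ceil(
                    f * sigma * cr / (deadline - reduceStart - f * sigma * cd));
            if (minMapSlots > freeMapSlots || minReduceSlots > freeReduceSlots) {
                throw new ConstraintFailException("not enough free slots to meet the deadline");
            }
            return new int[] { minMapSlots, minReduceSlots };
        }

        public static void main(String[] args) throws Exception {
            // WordCount parameters from the evaluation; 500 MB = 512000 KB, deadline 20 s.
            int[] slots = minSlots(512000, 0.00018423391, 0.000019585172, 0.00000186264,
                                   1.3711414, 0.0, 20.0, 16.0, 6, 6);
            System.out.println("min map slots = " + slots[0]
                    + ", min reduce slots = " + slots[1]); // prints 6 and 6
        }
    }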

A. Set-up

Experiments were conducted on a physical cluster of four nodes, with one JobTracker and three TaskTrackers. Each node had 4 GB of memory and a 64-bit Intel i5 processor and ran Ubuntu Server. We also had a virtual node, installed in Oracle VirtualBox on one of the physical machines, which could be added to the existing cluster dynamically. In both environments, every machine had two map slots and two reduce slots.

B. Result

After the set-up, we estimated the values of the parameters (map cost, reduce cost, shuffle cost, and filter ratio), submitted the job with a deadline, and took observations for different deadline values while keeping the input size constant, which shows the variation in the required map and reduce slots (Fig. 3). We also submitted the job with different input sizes and took observations. Our observations show that for a particular data size, resource demand decreases as the deadline increases.

Estimated parameter values for WordCount:
1) Deadline = 45 s
2) Map cost = 0.00018423391
3) Reduce cost = 0.000019585172
4) Shuffle cost = 0.00000186264
5) IO fraction = 1.3711414

Fig. 3 shows the observations for an input data size of 500 MB, with deadline values on the x-axis and the minimum numbers of map and reduce slots required on the y-axis. The set-up had 6 free map slots and 6 free reduce slots, and the figure shows that when a deadline value of 20 seconds or more is given, the minimum requirement of map and reduce slots is met and the job is scheduled.

Fig. 3 Graph for 500 MB input data size
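These estimates are consistent with the observed thresholds if the unit costs are interpreted as seconds per KB of input; the unit is our assumption, since the paper does not state it. For the 1.2 GB run (\sigma \approx 1258291 KB) with all six map and six reduce slots:

    \[
    \frac{\sigma c_m}{6} \approx \frac{1258291 \times 0.00018423}{6} \approx 38.6\,\mathrm{s}, \qquad
    f\sigma c_d \approx 3.2\,\mathrm{s}, \qquad
    \frac{f\sigma c_r}{6} \approx 5.6\,\mathrm{s},
    \]

for a total of about 47.4 s, which matches the observed 48-second feasibility threshold reported below.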

Fig. 4 shows the observations for an input data size of 1.2 GB, again with deadline values on the x-axis and the minimum numbers of map and reduce slots required on the y-axis. With 6 free map slots and 6 free reduce slots available, the figure shows that when a deadline value of 48 seconds or more is given, the minimum requirement of map and reduce slots is met and the job is scheduled.

Fig. 4 Graph for 1.2 GB input data size

Fig. 5 shows the observations for an input data size of 5 GB. With the same 6 free map slots and 6 free reduce slots available, a deadline value of 190 seconds or more is needed for the minimum requirement of map and reduce slots to be met and the job to be scheduled.

Fig. 5 Graph for 5 GB input data size

4. CONCLUSION

We extended the approach of real-time cluster scheduling to derive minimum map and reduce task count criteria for performing task scheduling with deadline constraints in Hadoop, and we computed the amount of resources required to complete a job by a particular deadline. For this, we proposed estimating the values of four parameters: the filter ratio, the cost of processing a unit of data in a map task, the cost of processing a unit of data in a reduce task, and the communication cost of transferring a unit of data. Our observations show that for a particular data size, resource demand decreases as the deadline increases; conversely, if the data size increases while the deadline is kept constant, resource demand increases. When resource demand exceeds the available slots, we can either meet the demand by dynamically adding physical or virtual nodes to the existing cluster or provide a feasible deadline. We have implemented this as well.

REFERENCES

[1] Ricky Ho, "How MapReduce Works", DZone, retrieved from http://architects.dzone.com/articles/how-hadoop-mapreduce-works
[2] Robert C., Hirang K., and Sanjay R., "HDFS", in The Architecture of Open Source Applications.
[3] Apache Hadoop, "Hadoop MapReduce Next Generation: Fair Scheduler", December 2013, Version 2.0.4-alpha.
[4] Arun M., "Understanding Capacity Scheduler", July 2012, unpublished.
[5] Shafer J., Rixner S., and Cox A., "Balancing Portability and Performance", in Performance Analysis of Systems and Software (ISPASS), 2010 IEEE International Symposium on, March 2010.

[6] Deadline: http://www4.ncsu.edu/~kkc/papers/rev2.pdf
[7] Multi-node cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[8] Kamal Kc and Kemafor Anyanwu, "Scheduling Hadoop Jobs to Meet Deadlines", IEEE CloudCom 2010.