A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS



Suma R 1, Vinay T R 2, Byre Gowda B K 3
1 Post Graduate Student, CSE, SVCE, Bangalore
2 Assistant Professor, CSE, SVCE, Bangalore
3 Assistant Professor, CSE, SIR.MVIT, Bangalore

ABSTRACT

For large-scale data processing in the cloud, MapReduce splits the data into multiple parts, assigns each part to a slot, and then carries out the map and reduce processing. The slot-based MapReduce model is not very effective and gives poor performance because of unoptimized resource allocation, and it faces various challenges. MapReduce job execution has two unique features: map slots are allocated only to map tasks and reduce slots only to reduce tasks, and map tasks are processed before reduce tasks. Maximizing data locality is required to improve the efficiency and utilization of the system, and several challenges must be addressed to solve this problem. DynamicMR is a dynamic slot allocation framework for improving the performance of MapReduce [1]. DynamicMR focuses on the Hadoop Fair Scheduler (HFS) and consists of three optimization techniques: Dynamic Hadoop Slot Allocation (DHSA), Speculative Execution Performance Balancing (SEPB), and Slot Prescheduling.

1. INTRODUCTION

Big data is a collection of both structured and unstructured data that is too large, fast, and distinct to be managed by traditional database management tools or traditional data processing applications. Hadoop is an open-source software framework from Apache that supports scalable distributed applications. Hadoop runs applications on large clusters of commodity hardware and provides fast and reliable analysis of both structured and unstructured data using a simple programming model. Hadoop can scale from a single server to thousands of machines, each offering local computation and storage.
Despite many studies on optimizing MapReduce/Hadoop, key challenges remain for improving the utilization and performance of a Hadoop cluster. First, resources (e.g., CPU cores) are abstracted as map and reduce slots, and a MapReduce job execution has two unique features: (1) the slot allocation constraint, under which map slots are allocated only to map tasks and reduce slots only to reduce tasks, and (2) map tasks are executed before reduce tasks. Two observations follow: (1) different slot configurations yield different system utilization and performance for a MapReduce workload, and (2) idle reduce slots hurt both performance and system utilization. Second, the straggler problem among map or reduce tasks can delay the whole job. Third, maximizing data locality improves both performance and slot utilization efficiency for a MapReduce workload. DynamicMR comprises three techniques: Dynamic Hadoop Slot Allocation (DHSA), Speculative Execution Performance Balancing (SEPB), and Slot Prescheduling (SP). DHSA works as follows: (1) slots can be used for either map or reduce tasks; if there are insufficient map slots for the map tasks, DHSA borrows unused reduce slots, and similarly reduce tasks can borrow unused map slots when reduce tasks outnumber reduce slots; (2) otherwise, map slots serve map tasks and reduce slots serve reduce tasks.
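To make the second observation concrete, here is a minimal sketch (our own illustration, not code from the paper) of how a static map/reduce slot split caps cluster utilization while only one phase is running:

```python
def phase_utilization(running_tasks, slots_for_phase, total_slots):
    """Fraction of all cluster slots doing useful work during one phase,
    when only that phase's own slots may run tasks (the slot allocation
    constraint described above)."""
    busy = min(running_tasks, slots_for_phase)
    return busy / total_slots

# A node with 4 map slots and 4 reduce slots: during the map phase,
# even with 10 runnable map tasks, at most half the slots are busy.
print(phase_utilization(running_tasks=10, slots_for_phase=4, total_slots=8))
```

Even a fully loaded map phase cannot exceed 50% utilization here, which is exactly the idle-reduce-slot problem DHSA targets.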

SEPB is used to identify slow-running tasks. We propose Speculative Execution Performance Balancing to manage these speculative tasks; it balances the performance tradeoff between the execution time of a single job and that of a batch of jobs. Slot Prescheduling is an approach for improving data locality in MapReduce, but it comes at the cost of fairness. DynamicMR improves the performance and utilization of MapReduce workloads by 46%-115% for a single job and 49%-112% for multiple jobs.

Fig-1: Overview of the DynamicMR framework: a slot utilization optimization layer (DHSA, with its PI-DHSA and PD-DHSA variants) and a utilization efficiency optimization layer (SEPB and Slot Prescheduling), acting on idle slots, map tasks, and reduce tasks.

MapReduce is popular in industry, bioinformatics, and machine learning, and Hadoop is an implementation of MapReduce [6]. Multiple tasks run on each node, and each node hosts a configured number of map and reduce slots. A slot is occupied when a task is assigned to it and released when the task completes. Resource underutilization can be overcome by resource stealing. Speculative execution in MapReduce supports fault tolerance: the master node tracks the progress of all scheduled tasks, and when it finds a slow-running task, it launches a speculative task to finish the work faster. Processing huge data requires both a powerful framework and powerful hardware; Google proposed MapReduce for parallel data processing, and it takes a dynamic and aggressive approach to scheduling. Sometimes fairness and data locality conflict with each other: strict fairness degrades data locality, while pure data locality results in unfair resource usage. MapReduce is a programming model for large-scale data processing [2], processing some 20 petabytes of data per day at Google. Hadoop is the open-source implementation of MapReduce, used by companies such as Facebook and Google. MapReduce uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS).
A user submits a job comprising a map function and a reduce function, which are transformed into map and reduce tasks respectively. HDFS splits the data into equal-sized blocks and distributes them across the cluster nodes, where the mapping is performed. The intermediate output is partitioned among one or more reduce tasks. The Locality-Aware Reduce Task Scheduler (LARTS) sizes the partitions so as to preserve data locality.

2. EXISTING SYSTEM

Scheduling and resource allocation optimization. Prior work schedules and allocates resources for MapReduce jobs. In MapReduce, one job consists of multiple tasks. When all jobs arrive at the same time, the objective is to minimize job completion time. To achieve this, a computation model is developed to solve large-scale data problems through graph analysis, and MapReduce is modeled as a two-stage hybrid flow shop. Ordering job submission improves system performance and utilization, but it requires the map and reduce task execution times to be known in advance, which is not possible in real-world applications. DHSA, in contrast, can be used for any MapReduce workload. Even with an optimal Hadoop configuration (e.g., an optimal map/reduce slot configuration), there is room for improving the performance of a MapReduce workload. Guo et al. propose resource stealing [3] to reclaim resources reserved for idle slots, adopting a multi-threading technique for tasks running on multiple CPU cores. Polo et al. propose a resource-aware scheduling technique for MapReduce workloads that improves resource utilization. In DHSA, we improve system utilization by allocating unused map and reduce slots. YARN is the new version of Hadoop, which overcomes Hadoop's inefficiency problems by managing resources such as memory and network bandwidth. However, for multiple jobs, DynamicMR performs better than YARN because YARN has no concept of slots.

Speculative execution optimization. The straggler problem is addressed with LATE (Longest Approximate Time to End), a speculative execution algorithm that targets heterogeneous environments and caps the number of speculative tasks. Guo et al. improve LATE's performance with Benefit Aware Speculative Execution (BASE), which launches only speculative tasks that are expected to be beneficial. We propose SEPB to balance the tradeoff between a single job and a batch of jobs.

Data locality optimization. Data locality optimization improves the efficiency and performance of cluster utilization [4]. MapReduce has a map side and a reduce side. Map-side data locality optimization moves map tasks close to their input data blocks. MapReduce jobs are classified into map-input heavy, map-and-reduce-input heavy, and reduce-input heavy. Reduce-side data locality places reduce tasks on the machines that hold the intermediate data generated by the map tasks. Map-side data locality is the concern of Slot Prescheduling, which uses extra idle slots to maximize data locality while maintaining fairness. Delay scheduling and Slot Prescheduling together achieve both fairness and data locality. SEPB and Slot Prescheduling are the two slot-efficiency optimizers built on top of DHSA.

MapReduce optimization in cloud computing. DynamicMR is a fine-grained optimization for Hadoop. Combining existing systems with DynamicMR yields a framework that also accounts for budget in cloud computing.

3. PROPOSED SYSTEM

MapReduce performance can be improved from two perspectives. First, slots are classified into busy slots and idle slots; one approach is to increase slot utilization by maximizing busy slots and minimizing idle slots. Second, not every busy slot is efficiently utilized, so our approach also improves the utilization of busy slots. DHSA increases slot utilization while maintaining fairness [4]. SEPB handles slow-running tasks. Slot Prescheduling [10] improves performance through data locality while preserving fairness.
DynamicMR proceeds step by step:
1. When there is an idle slot, DynamicMR improves slot utilization with DHSA, which decides whether or not to allocate the slot (e.g., on fairness grounds).
2. If allocation is approved, DynamicMR improves the efficiency of the slot with SEPB, whose speculative execution achieves a performance tradeoff between a single job and a batch of jobs.
3. When idle slots are allocated to pending map tasks, DynamicMR improves slot utilization efficiency with Slot Prescheduling.

3.1 Dynamic Hadoop Slot Allocation

The current MapReduce design suffers from slot underutilization because the number of map and reduce tasks varies over time, so the number of map/reduce tasks is sometimes greater than the number of map/reduce slots. When reduce tasks are overloaded, we can use unused map slots for them, which improves MapReduce performance; conversely, when all the workload lies on the map side, we use idle reduce slots for map tasks. Map and reduce tasks can thus run in either map slots or reduce slots. Two considerations apply:
1. In HFS, fairness is important: allocation is fair when all pools are allocated equal amounts of resources.
2. Map slots and reduce slots have different resource requirements [9]; reduce tasks consume memory and network bandwidth.
DHSA has two alternatives, PI-DHSA and PD-DHSA.

Pool-Independent DHSA: the PI-DHSA process consists of two parts (Fig-2: Pool-Independent DHSA).
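The three-step decision above can be sketched as a single pass over one idle slot. All structures and names here are our own toy illustration of the decision order, not the DynamicMR implementation:

```python
def schedule_idle_slot(node, pools):
    """Decide what one idle slot on `node` should run, in DynamicMR's order:
    DHSA fairness check -> SEPB task-type priority -> locality preference."""
    # Step 1 (DHSA): allocate only if some pool is still below its fair share.
    eligible = [p for p in pools if p["running"] < p["fair_share"]]
    if not eligible:
        return None
    pool = max(eligible, key=lambda p: len(p["pending"]))
    # Step 2 (SEPB): failed tasks first, then pending tasks, and only then
    # speculative copies of stragglers.
    for queue in ("failed", "pending", "stragglers"):
        if pool[queue]:
            # Step 3 (Slot Prescheduling): prefer a task whose input is local.
            local = [t for t in pool[queue] if t["host"] == node]
            task = (local or pool[queue])[0]
            pool[queue].remove(task)
            pool["running"] += 1
            return task
    return None
```

The ordering matters: fairness gates the allocation, straggler handling is consulted before fresh work only through its priority queues, and locality breaks ties last.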

1. Intra-phase dynamic slot allocation: the pools are divided into two sub-pools, a map-phase pool and a reduce-phase pool. A pool that is overloaded and has slot demand can borrow unused slots from other pools of the same phase. For example, map-phase pool 2 can borrow map slots from map-phase pools 1 and 3.
2. Inter-phase dynamic slot allocation: when the reduce phase contains unused reduce slots and there are insufficient map slots for the map tasks, the map phase borrows the idle reduce slots.

Let Nm be the total number of map tasks, Nr the total number of reduce tasks, Sm the total number of map slots, and Sr the total number of reduce slots.
Case 1: Nm <= Sm and Nr <= Sr. Map tasks run in map slots and reduce tasks run in reduce slots; no slot borrowing takes place.
Case 2: Nm > Sm and Nr < Sr. Reduce slots serve the reduce tasks, and the idle reduce slots are used to run map tasks.
Case 3: Nm < Sm and Nr > Sr. Unused map slots are used to run reduce tasks.
Case 4: Nm > Sm and Nr > Sr. The system is busy; no slots move between the map and reduce phases.
Two variables, PercentageOfBorrowedMapSlots and PercentageOfBorrowedReduceSlots, control the borrowing.

PD-DHSA (Fig-3: Pool-Dependent DHSA): each pool's map-phase and reduce-phase sub-pools are selfish: they satisfy their own demand from their own shared map and reduce slots before offering slots to other pools. There are two processes:
1. Intra-pool dynamic slot allocation: within a pool there are four cases:
Case a: MapSlotsDemand < MapShare and ReduceSlotsDemand > ReduceShare. The reduce phase first borrows the pool's own unused map slots for its overloaded reduce tasks.
Case b: MapSlotsDemand > MapShare and ReduceSlotsDemand < ReduceShare. The reduce phase lends its unused slots to the pool's map tasks.
Case c: MapSlotsDemand <= MapShare and ReduceSlotsDemand <= ReduceShare. Neither phase needs to borrow, and the pool can give slots to other pools.
Case d: MapSlotsDemand > MapShare and ReduceSlotsDemand > ReduceShare. Both map slots and reduce slots are insufficient.
In this last case, the pool must borrow slots from other pools.
2. Inter-pool dynamic slot allocation: if MapSlotsDemand + ReduceSlotsDemand <= MapShare + ReduceShare, there is no need to borrow slots from other pools. If MapSlotsDemand + ReduceSlotsDemand > MapShare + ReduceShare, then even after intra-pool dynamic slot allocation the slots are not enough, so the pool borrows unused slots from other pools. A TaskTracker then has four possible slot allocations.
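Returning to the PI-DHSA inter-phase cases (Cases 1-4 above), the borrowing logic can be written as one small allocation function. This is a toy model under our own naming, not the paper's implementation:

```python
def dhsa_assign(nm, nr, sm, sr):
    """Slots actually put to work under inter-phase borrowing.

    nm, nr: runnable map and reduce tasks; sm, sr: configured map and
    reduce slots. Returns (slots running map tasks, slots running
    reduce tasks)."""
    map_on_map = min(nm, sm)    # map tasks in their own slots
    red_on_red = min(nr, sr)    # reduce tasks in their own slots
    idle_map = sm - map_on_map
    idle_red = sr - red_on_red
    map_on_red = min(nm - map_on_map, idle_red)  # Case 2: borrow reduce slots
    red_on_map = min(nr - red_on_red, idle_map)  # Case 3: borrow map slots
    return map_on_map + map_on_red, red_on_red + red_on_map
```

Case 4 falls out naturally: when both sides are overloaded there are no idle slots on either side, so no borrowing occurs.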

Fig-4: Slot allocation for PD-DHSA.
Case 1: If the TaskTracker has idle map slots and the pool has pending map tasks, it performs map task allocation.
Case 2: If Case 1 fails, and the TaskTracker has idle reduce slots and the pool has pending reduce tasks, it performs reduce task allocation.
Case 3: If Cases 1 and 2 both fail, we try reduce slots for the map tasks.
Case 4: Otherwise, we allocate map slots to the reduce tasks.

3.2 Speculative Execution Performance Balancing

MapReduce job execution time is very sensitive to slow-running tasks (stragglers), which arise from faulty hardware and software misconfiguration. Stragglers are of two types:
Hard straggler: a task that goes into deadlock due to endless waiting for certain resources. It will never stop on its own, so we must kill it and run a backup task in its place.
Soft straggler: a task that takes much longer than the common tasks but does complete successfully.
The straggler problem is detected by the LATE algorithm, and speculative execution reduces a job's execution time.

Fig-5: Total number of pending map tasks and total number of pending reduce tasks.

In SEPB, failed tasks are given the highest priority, and pending tasks are considered second. LATE handles straggling tasks by launching a backup task and allocating it a slot. Consider an example with six jobs, a speculative cap of 4 for LATE, a maximum of 4 jobs checked for pending tasks, and 4 idle slots. SEPB allocates all 4 idle slots to pending tasks, because the pending task counts for j1-j6 are 0, 0, 10, 10, 15, and 20 respectively. SEPB works on top of LATE and is an enhancement of it.

3.3 Slot Prescheduling

Slot Prescheduling improves data locality [5] without negatively impacting the fairness of MapReduce jobs.
Definition 1: The available idle map slots are those that can be allocated to the TaskTracker.
Definition 2: The extra idle map slots are obtained by subtracting the used map slots from the allowed map slots.
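The two definitions can be sketched directly. This is a minimal illustration with hypothetical helper names (nothing here is the paper's code) of how extra idle map slots would be counted and spent only on node-local map tasks:

```python
def extra_idle_map_slots(allowed, used):
    """Definition 2 above: extra idle map slots = allowed map slots on the
    TaskTracker minus map slots in use (floored at zero)."""
    return max(allowed - used, 0)

def preschedule(extra_slots, pending_map_tasks, node):
    """Spend extra idle slots only on map tasks whose input block lives on
    `node`, so data locality improves without touching slots a pool's
    fair share needs."""
    local = [t for t in pending_map_tasks if t["input_host"] == node]
    return local[:extra_slots]

tasks = [{"id": "m1", "input_host": "n1"}, {"id": "m2", "input_host": "n3"}]
print(preschedule(extra_idle_map_slots(allowed=4, used=3), tasks, node="n1"))
```

Because only the extra slots are spent, and only on local tasks, the fairness of the regular allocation is untouched, which is the point of the technique.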
Technique | Fairness | Slot Utilization | Performance
----------|----------|------------------|------------
DHSA      | +        | +                | +
SEPB      |          | +                | +
DS        | -        | %(+)             | +
SPS       | +        | %(+)             | +

Table 1: +, -, and % denote benefit, cost, and efficiency respectively (DS = delay scheduling, SPS = Slot Prescheduling).

4. CONCLUSION

The DynamicMR framework improves the performance of MapReduce workloads while maintaining fairness. Its three techniques, DHSA, SEPB, and Slot Prescheduling, all focus on slot utilization in a MapReduce cluster. DHSA maximizes slot utilization; SEPB identifies slot inefficiency; Slot Prescheduling improves slot utilization efficiency. Combining these techniques improves the Hadoop system.

REFERENCES
[1] Q. Chen, C. Liu, Z. Xiao. Improving MapReduce Performance Using Smart Speculative Execution Strategy. IEEE Transactions on Computers, 2013.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, pp. 107-113, 2004.
[3] Z. H. Guo, G. Fox, M. Zhou, Y. Ruan. Improving Resource Utilization in MapReduce. In IEEE Cluster '12, pp. 402-410, 2012.
[4] Z. H. Guo, G. Fox, and M. Zhou. Investigation of Data Locality and Fairness in MapReduce. In MapReduce '12, pp. 25-32, 2012.
[5] Z. H. Guo, G. Fox, and M. Zhou. Investigation of Data Locality in MapReduce. In IEEE/ACM CCGrid '12, pp. 419-426, 2012.
[6] Hadoop. http://hadoop.apache.org.
[7] M. Hammoud and M. F. Sakr. Locality-Aware Reduce Task Scheduling for MapReduce. In IEEE CLOUDCOM '11, pp. 570-576, 2011.
[8] M. Hammoud, M. S. Rehman, M. F. Sakr. Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic. In IEEE CLOUD '12, pp. 49-58, 2012.
[9] B. Palanisamy, A. Singh, L. Liu and B. Jain. Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud. In SC '11, pp. 1-11, 2011.
[10] J. Polo, C. Castillo, D. Carrera, et al. Resource-aware Adaptive Scheduling for MapReduce Clusters. In Middleware '11, pp. 187-207,