A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS



Suma R 1, Vinay T R 2, Byre Gowda B K 3
1 Post Graduate Student, CSE, SVCE, Bangalore
2 Assistant Professor, CSE, SVCE, Bangalore
3 Assistant Professor, CSE, SIR.MVIT, Bangalore

ABSTRACT

For large-scale data processing in the cloud, MapReduce splits the data into multiple parts, assigns each part to a slot, and then carries out the map and reduce processing. The slot-based MapReduce model is not very effective and gives poor performance because of unoptimized resource allocation, and it faces various challenges. MapReduce job execution has two unique features: map slots are allocated only to map tasks and reduce slots only to reduce tasks, and map tasks are processed before reduce tasks. Maximizing data locality is required to improve the efficiency and utilization of the system, and several challenges must be addressed to solve this problem. DynamicMR is a dynamic slot allocation framework for improving the performance of MapReduce [1]. DynamicMR focuses on the Hadoop Fair Scheduler (HFS) and consists of three optimization techniques: Dynamic Hadoop Slot Allocation (DHSA), Speculative Execution Performance Balancing (SEPB), and Slot Prescheduling.

1. INTRODUCTION

Big data is a collection of both structured and unstructured data that is too large, fast, and distinct to be managed by traditional database management tools or traditional data processing applications. Hadoop is an open-source software framework from Apache that supports scalable distributed applications. Hadoop runs applications on large clusters of commodity hardware and provides fast and reliable analysis of both structured and unstructured data using a simple programming model. Hadoop can scale from a single server to thousands of machines, each offering local computation and storage.
Despite many studies on optimizing MapReduce/Hadoop, key challenges remain for improving the utilization and performance of a Hadoop cluster. First, resources (e.g., CPU cores) are abstracted as map and reduce slots, and a MapReduce job execution has two unique features: (1) the slot allocation constraint, under which map slots are allocated only to map tasks and reduce slots only to reduce tasks, and (2) map tasks are executed before reduce tasks. Two observations follow: (1) different slot configurations yield different system utilization and performance for a MapReduce workload, and (2) idle reduce slots hurt both performance and system utilization. Second, the straggler problem among map or reduce tasks can delay the whole job. Third, maximizing data locality improves both performance and slot utilization efficiency for a MapReduce workload. DynamicMR comprises three techniques: Dynamic Hadoop Slot Allocation (DHSA), Speculative Execution Performance Balancing (SEPB), and Slot Prescheduling (SP). DHSA works as follows: (1) slots can be used for either map or reduce tasks; if there are insufficient map slots for the map tasks, DHSA borrows unused reduce slots, and similarly reduce tasks can borrow unused map slots when reduce tasks outnumber reduce slots; (2) otherwise, map slots serve map tasks and reduce slots serve reduce tasks.
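To make the second observation concrete, here is a minimal sketch (our own illustration, not code from the paper) of how a static map/reduce slot split caps cluster utilization while only one phase is running:

```python
def phase_utilization(running_tasks, slots_for_phase, total_slots):
    """Fraction of all cluster slots doing useful work during one phase,
    when only that phase's own slots may run tasks (the slot allocation
    constraint described above)."""
    busy = min(running_tasks, slots_for_phase)
    return busy / total_slots

# A node with 4 map slots and 4 reduce slots: during the map phase,
# even with 10 runnable map tasks, at most half the slots are busy.
print(phase_utilization(running_tasks=10, slots_for_phase=4, total_slots=8))
```

Even a fully loaded map phase cannot exceed 50% utilization here, which is exactly the idle-reduce-slot problem DHSA targets.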

SEPB is used to identify slow-running tasks. We propose Speculative Execution Performance Balancing to manage these speculative tasks; it balances the performance tradeoff between the execution time of a single job and that of a batch of jobs. Slot Prescheduling is an approach for improving data locality in MapReduce, but it comes at the cost of fairness. DynamicMR improves the performance and utilization of MapReduce workloads by 46%-115% for a single job and 49%-112% for multiple jobs.

Fig-1: Overview of the DynamicMR framework: a slot utilization optimization layer (DHSA, with its PI-DHSA and PD-DHSA variants) and a utilization efficiency optimization layer (SEPB and Slot Prescheduling), acting on idle slots, map tasks, and reduce tasks.

MapReduce is popular in industry, bioinformatics, and machine learning, and Hadoop is an implementation of MapReduce [6]. Multiple tasks run on each node, and each node hosts a configured number of map and reduce slots. A slot is occupied when a task is assigned to it and released when the task completes. Resource underutilization can be overcome by resource stealing. Speculative execution in MapReduce supports fault tolerance: the master node tracks the progress of all scheduled tasks, and when it finds a slow-running task, it launches a speculative task to finish the work faster. Processing huge data requires both a powerful framework and powerful hardware; Google proposed MapReduce for parallel data processing, and it takes a dynamic and aggressive approach to scheduling. Sometimes fairness and data locality conflict with each other: strict fairness degrades data locality, while pure data locality results in unfair resource usage. MapReduce is a programming model for large-scale data processing [2], processing some 20 petabytes of data per day at Google. Hadoop is the open-source implementation of MapReduce, used by companies such as Facebook and Google. MapReduce uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS).
A user submits a job comprising a map function and a reduce function, which are transformed into map and reduce tasks respectively. HDFS splits the data into equal-sized blocks and distributes them across the cluster nodes, where the mapping is performed. The intermediate output is partitioned among one or more reduce tasks. The Locality-Aware Reduce Task Scheduler (LARTS) sizes the partitions so as to preserve data locality.

2. EXISTING SYSTEM

Scheduling and resource allocation optimization. Prior work schedules and allocates resources for MapReduce jobs. In MapReduce, one job consists of multiple tasks. When all jobs arrive at the same time, the objective is to minimize job completion time. To achieve this, a computation model is developed to solve large-scale data problems through graph analysis, and MapReduce is modeled as a two-stage hybrid flow shop. Ordering job submission improves system performance and utilization, but it requires the map and reduce task execution times to be known in advance, which is not possible in real-world applications. DHSA, in contrast, can be used for any MapReduce workload. Even with an optimal Hadoop configuration (e.g., an optimal map/reduce slot configuration), there is room for improving the performance of a MapReduce workload. Guo et al. propose resource stealing [3] to reclaim resources reserved for idle slots, adopting a multi-threading technique for tasks running on multiple CPU cores. Polo et al. propose a resource-aware scheduling technique for MapReduce workloads that improves resource utilization. In DHSA, we improve system utilization by allocating unused map and reduce slots. YARN is the new version of Hadoop, which overcomes Hadoop's inefficiency problems by managing resources such as memory and network bandwidth. However, for multiple jobs, DynamicMR performs better than YARN because YARN has no concept of slots.

Speculative execution optimization. The straggler problem is addressed with LATE (Longest Approximate Time to End), a speculative execution algorithm that targets heterogeneous environments and caps the number of speculative tasks. Guo et al. improve LATE's performance with Benefit Aware Speculative Execution (BASE), which launches only speculative tasks that are expected to be beneficial. We propose SEPB to balance the tradeoff between a single job and a batch of jobs.

Data locality optimization. Data locality optimization improves the efficiency and performance of cluster utilization [4]. MapReduce has a map side and a reduce side. Map-side data locality optimization moves map tasks close to their input data blocks. MapReduce jobs are classified into map-input heavy, map-and-reduce-input heavy, and reduce-input heavy. Reduce-side data locality places reduce tasks on the machines that hold the intermediate data generated by the map tasks. Map-side data locality is the concern of Slot Prescheduling, which uses extra idle slots to maximize data locality while maintaining fairness. Delay scheduling and Slot Prescheduling together achieve both fairness and data locality. SEPB and Slot Prescheduling are the two slot-efficiency optimizers built on top of DHSA.

MapReduce optimization in cloud computing. DynamicMR is a fine-grained optimization for Hadoop. Combining existing systems with DynamicMR yields a framework that also accounts for budget in cloud computing.

3. PROPOSED SYSTEM

MapReduce performance can be improved from two perspectives. First, slots are classified into busy slots and idle slots; one approach is to increase slot utilization by maximizing busy slots and minimizing idle slots. Second, not every busy slot is efficiently utilized, so our approach also improves the utilization of busy slots. DHSA increases slot utilization while maintaining fairness [4]. SEPB handles slow-running tasks. Slot Prescheduling [10] improves performance through data locality while preserving fairness.
DynamicMR proceeds step by step:
1. When there is an idle slot, DynamicMR improves slot utilization with DHSA, which decides whether or not to allocate the slot (e.g., on fairness grounds).
2. If allocation is approved, DynamicMR improves the efficiency of the slot with SEPB, whose speculative execution achieves a performance tradeoff between a single job and a batch of jobs.
3. When idle slots are allocated to pending map tasks, DynamicMR improves slot utilization efficiency with Slot Prescheduling.

3.1 Dynamic Hadoop Slot Allocation

The current MapReduce design suffers from slot underutilization because the number of map and reduce tasks varies over time, so the number of map/reduce tasks is sometimes greater than the number of map/reduce slots. When reduce tasks are overloaded, we can use unused map slots for them, which improves MapReduce performance; conversely, when all the workload lies on the map side, we use idle reduce slots for map tasks. Map and reduce tasks can thus run in either map slots or reduce slots. Two considerations apply:
1. In HFS, fairness is important: allocation is fair when all pools are allocated equal amounts of resources.
2. Map slots and reduce slots have different resource requirements [9]; reduce tasks consume memory and network bandwidth.
DHSA has two alternatives, PI-DHSA and PD-DHSA.

Pool-Independent DHSA: the PI-DHSA process consists of two parts (Fig-2: Pool-Independent DHSA).
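The three-step decision above can be sketched as a single pass over one idle slot. All structures and names here are our own toy illustration of the decision order, not the DynamicMR implementation:

```python
def schedule_idle_slot(node, pools):
    """Decide what one idle slot on `node` should run, in DynamicMR's order:
    DHSA fairness check -> SEPB task-type priority -> locality preference."""
    # Step 1 (DHSA): allocate only if some pool is still below its fair share.
    eligible = [p for p in pools if p["running"] < p["fair_share"]]
    if not eligible:
        return None
    pool = max(eligible, key=lambda p: len(p["pending"]))
    # Step 2 (SEPB): failed tasks first, then pending tasks, and only then
    # speculative copies of stragglers.
    for queue in ("failed", "pending", "stragglers"):
        if pool[queue]:
            # Step 3 (Slot Prescheduling): prefer a task whose input is local.
            local = [t for t in pool[queue] if t["host"] == node]
            task = (local or pool[queue])[0]
            pool[queue].remove(task)
            pool["running"] += 1
            return task
    return None
```

The ordering matters: fairness gates the allocation, straggler handling is consulted before fresh work only through its priority queues, and locality breaks ties last.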

1. Intra-phase dynamic slot allocation: the pools are divided into two sub-pools, a map-phase pool and a reduce-phase pool. A pool that is overloaded and has slot demand can borrow unused slots from other pools of the same phase. For example, map-phase pool 2 can borrow map slots from map-phase pools 1 and 3.
2. Inter-phase dynamic slot allocation: when the reduce phase contains unused reduce slots and there are insufficient map slots for the map tasks, the map phase borrows the idle reduce slots.

Let Nm be the total number of map tasks, Nr the total number of reduce tasks, Sm the total number of map slots, and Sr the total number of reduce slots.
Case 1: Nm <= Sm and Nr <= Sr. Map tasks run in map slots and reduce tasks run in reduce slots; no slot borrowing takes place.
Case 2: Nm > Sm and Nr < Sr. Reduce slots serve the reduce tasks, and the idle reduce slots are used to run map tasks.
Case 3: Nm < Sm and Nr > Sr. Unused map slots are used to run reduce tasks.
Case 4: Nm > Sm and Nr > Sr. The system is busy; no slots move between the map and reduce phases.
Two variables, PercentageOfBorrowedMapSlots and PercentageOfBorrowedReduceSlots, control the borrowing.

PD-DHSA (Fig-3: Pool-Dependent DHSA): each pool's map-phase and reduce-phase sub-pools are selfish: they satisfy their own demand from their own shared map and reduce slots before offering slots to other pools. There are two processes:
1. Intra-pool dynamic slot allocation: within a pool there are four cases:
Case a: MapSlotsDemand < MapShare and ReduceSlotsDemand > ReduceShare. The reduce phase first borrows the pool's own unused map slots for its overloaded reduce tasks.
Case b: MapSlotsDemand > MapShare and ReduceSlotsDemand < ReduceShare. The reduce phase lends its unused slots to the pool's map tasks.
Case c: MapSlotsDemand <= MapShare and ReduceSlotsDemand <= ReduceShare. Neither phase needs to borrow, and the pool can give slots to other pools.
Case d: MapSlotsDemand > MapShare and ReduceSlotsDemand > ReduceShare. Both map slots and reduce slots are insufficient.
In this last case, the pool must borrow slots from other pools.
2. Inter-pool dynamic slot allocation: if MapSlotsDemand + ReduceSlotsDemand <= MapShare + ReduceShare, there is no need to borrow slots from other pools. If MapSlotsDemand + ReduceSlotsDemand > MapShare + ReduceShare, then even after intra-pool dynamic slot allocation the slots are not enough, so the pool borrows unused slots from other pools. A TaskTracker then has four possible slot allocations.
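Returning to the PI-DHSA inter-phase cases (Cases 1-4 above), the borrowing logic can be written as one small allocation function. This is a toy model under our own naming, not the paper's implementation:

```python
def dhsa_assign(nm, nr, sm, sr):
    """Slots actually put to work under inter-phase borrowing.

    nm, nr: runnable map and reduce tasks; sm, sr: configured map and
    reduce slots. Returns (slots running map tasks, slots running
    reduce tasks)."""
    map_on_map = min(nm, sm)    # map tasks in their own slots
    red_on_red = min(nr, sr)    # reduce tasks in their own slots
    idle_map = sm - map_on_map
    idle_red = sr - red_on_red
    map_on_red = min(nm - map_on_map, idle_red)  # Case 2: borrow reduce slots
    red_on_map = min(nr - red_on_red, idle_map)  # Case 3: borrow map slots
    return map_on_map + map_on_red, red_on_red + red_on_map
```

Case 4 falls out naturally: when both sides are overloaded there are no idle slots on either side, so no borrowing occurs.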

Fig-4: Slot allocation for PD-DHSA.
Case 1: If the TaskTracker has idle map slots and the pool has pending map tasks, it performs map task allocation.
Case 2: If Case 1 fails, and the TaskTracker has idle reduce slots and the pool has pending reduce tasks, it performs reduce task allocation.
Case 3: If Cases 1 and 2 both fail, we try reduce slots for the map tasks.
Case 4: Otherwise, we allocate map slots to the reduce tasks.

3.2 Speculative Execution Performance Balancing

MapReduce job execution time is very sensitive to slow-running tasks (stragglers), which arise from faulty hardware and software misconfiguration. Stragglers are of two types:
Hard straggler: a task that goes into deadlock due to endless waiting for certain resources. It will never stop on its own, so we must kill it and run a backup task in its place.
Soft straggler: a task that takes much longer than the common tasks but does complete successfully.
The straggler problem is detected by the LATE algorithm, and speculative execution reduces a job's execution time.

Fig-5: Total number of pending map tasks and total number of pending reduce tasks.

In SEPB, failed tasks are given the highest priority, and pending tasks are considered second. LATE handles straggling tasks by launching a backup task and allocating it a slot. Consider an example with six jobs, a speculative cap of 4 for LATE, a maximum of 4 jobs checked for pending tasks, and 4 idle slots. SEPB allocates all 4 idle slots to pending tasks, because the pending task counts for j1-j6 are 0, 0, 10, 10, 15, and 20 respectively. SEPB works on top of LATE and is an enhancement of it.

3.3 Slot Prescheduling

Slot Prescheduling improves data locality [5] without negatively impacting the fairness of MapReduce jobs.
Definition 1: The available idle map slots are those that can be allocated to the TaskTracker.
Definition 2: The extra idle map slots are obtained by subtracting the used map slots from the allowed map slots.
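The two definitions can be sketched directly. This is a minimal illustration with hypothetical helper names (nothing here is the paper's code) of how extra idle map slots would be counted and spent only on node-local map tasks:

```python
def extra_idle_map_slots(allowed, used):
    """Definition 2 above: extra idle map slots = allowed map slots on the
    TaskTracker minus map slots in use (floored at zero)."""
    return max(allowed - used, 0)

def preschedule(extra_slots, pending_map_tasks, node):
    """Spend extra idle slots only on map tasks whose input block lives on
    `node`, so data locality improves without touching slots a pool's
    fair share needs."""
    local = [t for t in pending_map_tasks if t["input_host"] == node]
    return local[:extra_slots]

tasks = [{"id": "m1", "input_host": "n1"}, {"id": "m2", "input_host": "n3"}]
print(preschedule(extra_idle_map_slots(allowed=4, used=3), tasks, node="n1"))
```

Because only the extra slots are spent, and only on local tasks, the fairness of the regular allocation is untouched, which is the point of the technique.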
Technique | Fairness | Slot Utilization | Performance
----------|----------|------------------|------------
DHSA      | +        | +                | +
SEPB      |          | +                | +
DS        | -        | %(+)             | +
SPS       | +        | %(+)             | +

Table 1: +, -, and % denote benefit, cost, and efficiency respectively (DS = delay scheduling, SPS = Slot Prescheduling).

4. CONCLUSION

The DynamicMR framework improves the performance of MapReduce workloads while maintaining fairness. Its three techniques, DHSA, SEPB, and Slot Prescheduling, all focus on slot utilization in a MapReduce cluster. DHSA maximizes slot utilization; SEPB identifies slot inefficiency; Slot Prescheduling improves slot utilization efficiency. Combining these techniques improves the Hadoop system.

REFERENCES
[1] Q. Chen, C. Liu, Z. Xiao. Improving MapReduce Performance Using Smart Speculative Execution Strategy. IEEE Transactions on Computers, 2013.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, pp. 107-113, 2004.
[3] Z. H. Guo, G. Fox, M. Zhou, Y. Ruan. Improving Resource Utilization in MapReduce. In IEEE Cluster '12, pp. 402-410, 2012.
[4] Z. H. Guo, G. Fox, and M. Zhou. Investigation of Data Locality and Fairness in MapReduce. In MapReduce '12, pp. 25-32, 2012.
[5] Z. H. Guo, G. Fox, and M. Zhou. Investigation of Data Locality in MapReduce. In IEEE/ACM CCGrid '12, pp. 419-426, 2012.
[6] Hadoop. http://hadoop.apache.org.
[7] M. Hammoud and M. F. Sakr. Locality-Aware Reduce Task Scheduling for MapReduce. In IEEE CLOUDCOM '11, pp. 570-576, 2011.
[8] M. Hammoud, M. S. Rehman, M. F. Sakr. Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic. In IEEE CLOUD '12, pp. 49-58, 2012.
[9] B. Palanisamy, A. Singh, L. Liu and B. Jain. Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud. In SC '11, pp. 1-11, 2011.
[10] J. Polo, C. Castillo, D. Carrera, et al. Resource-aware Adaptive Scheduling for MapReduce Clusters. In Middleware '11, pp. 187-207,