Special Issue on Performance and Resource Management in Big Data Applications




Danilo Ardagna
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano
20133 Milano, Italy
ardagna@elet.polimi.it

Mark S. Squillante
Mathematical Sciences Department
IBM Research
Yorktown Heights, NY 10598, USA
mss@us.ibm.com

Today, many sectors of the global economy are guided by data-driven decision processes and applications, which have caused the analysis of large amounts of data to become a high-priority task for many companies. The expectation is that the knowledge obtained from new large datasets, now readily available, can enhance the efficiency of many industry and service sectors and thus improve the quality of our lives. For these reasons, many big data applications are under development to support data analyses whose resource requirements exceed the processing capacity of conventional computing systems. The development of massively parallel and scalable systems has therefore raised a considerable amount of interest in both industry and academia. This in turn has exacerbated the many challenges in the areas of performance evaluation, capacity planning, dynamic resource management, and scheduling for large-scale parallel computing environments. Open issues in these areas of research have motivated this special issue, comprising 7 papers that span different aspects of these research areas. We next provide a brief summary of these papers.

The study of Tan and Xia considers the general problem of resource provisioning in cloud computing services that provide the underlying support for allocating resources to big data application workloads. An online adaptive learning approach is presented, based on a stochastic multiclass loss model for cloud services and a stochastic gradient-based learning algorithm that adaptively adjusts the provisioning solution as observations of the demand are continuously made over time.

In terms of general cluster computing environments, Rosà et al. perform a deep study of Google cluster traces. The authors analyze how task priority and machine utilization impact cluster performance, in terms of wasted machine time and resources, and identify statistical patterns of task preemptions and their dependency on other types of unsuccessful events.

From a technological perspective, the MapReduce programming model has been recognized as one of the most prominent solutions for big data applications. Hadoop, the open-source MapReduce implementation, is expected to touch half of the world's data by the end of this year. As such, the remaining papers in this special issue focus on problems that involve MapReduce in one way or another.

The study of Ying et al. considers an ever-challenging problem associated with the energy-performance optimization of data centers that process different types of workloads, namely batch and interactive MapReduce workloads together with (non-MapReduce) web-application workloads. An energy minimization framework is presented that attempts to maximize the spare capacity of MapReduce servers to execute non-MapReduce workloads, while controlling delays on the various workloads, particularly batch MapReduce workloads.

Tan et al. seek to characterize fairness for multi-resource sharing among concurrent workflows, where each MapReduce job can be viewed as a workflow, and to optimize in general the allocation of multiple resources to workflows that consist of different classes of jobs with heterogeneous demands. Two different sets of assumptions about the jobs are considered, one in which the jobs are infinitely divisible and one in which the jobs are non-preemptive and indivisible.

The study of Z. Zhang et al. encompasses MapReduce application performance profiling, simulation and capacity allocation on heterogeneous Amazon EC2 virtual machines. The authors show that for the same price users can get a variety of Hadoop clusters based on different virtual machine types, and develop a general solution able to provide job deadline guarantees while minimizing the infrastructure cost.

In a related work, Malekimajd et al. similarly face the joint capacity allocation and admission control of Hadoop cloud clusters. The authors first provide upper and lower bounds for MapReduce job execution time, and then propose a very fast and scalable solution that is able to minimize cloud costs and rejection penalties for multiple classes of jobs with soft deadline guarantees.

Finally, the problem of run-time resource management for virtualized cloud data centers is considered in the study of W. Zhang et al., where a two-layer scheduler and admission control policies are proposed to optimize MapReduce application execution. One scheduler, at the virtualization layer, minimizes the interference of high-priority interactive services; the second scheduler acts at the Hadoop layer and helps batch processing jobs to meet their performance deadlines; the admission control mechanism guarantees that deadlines are met when resources are overloaded.

We sincerely thank each of the authors for their work and contribution to this special issue. We also sincerely thank G. Casale for his kind invitation and opportunity to guest edit an issue of PER, and for all of his assistance and support with this special issue.

Copyright is held by author/owner(s).

An Adaptive Learning Approach for Efficient Resource Provisioning in Cloud Services

Yue Tan, Cathy H. Xia
Dept. of Integrated Systems Engineering
The Ohio State University
Columbus, Ohio 43210
{tan.268, xia.52}@osu.edu

ABSTRACT

The emerging cloud computing service market aims at delivering computing resources as a utility over the Internet with high quality. It has evolving, unknown demand that is typically highly uncertain. Traditional provisioning methods either make idealized assumptions about the demand distribution or rely on extensive offline statistical analysis of historical data. In this paper, we present an online adaptive learning approach to address the optimal resource provisioning problem. Based on a stochastic loss model of the cloud services, we formulate the provisioning problem from a revenue management perspective, and present a stochastic gradient-based learning algorithm that adaptively adjusts the provisioning solution as observations of the demand are continuously made. We show that our adaptive learning algorithm guarantees optimality, and demonstrate through simulation that it can adapt quickly to non-stationary demand.

1. INTRODUCTION

Big data has flown into every sector of the global economy, ranging from social networks to online business to finance to medicine to logistics. With the rapid growth of data in many applications in society, the need to quickly and efficiently manage and analyze these data in a scalable and reliable way is unprecedentedly high. Distributed cloud computing platforms that have a large number of networked computing nodes with massively parallel structures have emerged as an attractive solution for handling big data analytics. Various programming models such as MapReduce [14], Dryad [19], etc., combined with distributed storage systems such as HDFS [3], HBase [4] and Cassandra [2], have been successfully implemented on top of the distributed cloud computing framework. The cloud computing framework is attractive because it offers the illusion of infinite computing resources available on demand [5].
Typically the service offering is bounded by service level agreements (SLAs), which specify the requirements on certain performance metrics such as reliability, availability, etc. For instance, Amazon targets to make Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Store (EBS) each available with a 99.95% uptime during any monthly billing cycle. Failure to meet this target will cause a penalty of up to 30% of the total service charges [1].

To satisfy such SLAs, the service providers must plan a sufficient amount of resources for each application, a task often referred to as provisioning. Existing solutions are mostly based on workload profiling and statistical analysis methods conducted offline on historical data. The amount of resources provisioned is typically estimated based on the peak load for each application. As today's cloud providers support a large number of clients with a diverse set of highly dynamic usage patterns, such solutions can result in significant over-provisioning and underutilization of the resources. In addition, historical data may not be available, or may not be representative for emerging applications whose demand is highly uncertain or continuously evolving. It is therefore crucial for the service providers to be able to make timely provisioning decisions that are adaptive to the changing environment. In this paper, we present an adaptive learning approach to address the resource provisioning problem in cloud services with the objective of maximizing the provider's revenue subject to service level agreements.

Existing data-driven decision making approaches can be broadly classified as exploratory and non-exploratory. Non-exploratory approaches are mostly based on parametric methods, which assume a particular family of demand distributions and estimate the parameters using historical data. Such approaches are more suitable for mature products with predictable demand. When the demand is highly volatile, non-exploratory methods may fail to guarantee good performance.
Even with full knowledge of the true underlying demand distribution, some parametric methods may still lead to sub-optimal solutions ([27, 40]). Exploratory learning approaches, on the other hand, are typically non-parametric and make no assumptions on the demand distribution, and are thus more suitable for the emerging cloud computing markets. We adopt a non-parametric learning approach to solve the resource provisioning problem, where the provisioning solution is continuously updated as more knowledge is gained on the uncertain demand. Different from traditional provisioning solutions employing classical demand estimation, our approach does not require explicit knowledge of the true demand distribution, and furthermore, it can adapt to incoming data as more demand observations become available. Specifically, the contributions of this paper are summarized as follows.

1) We present a multi-class stochastic loss model for cloud services where the demand comes from a renewal process with unknown distribution. This model allows us to approximate the service availability in closed form via an extension of the Erlang loss formula using normal approximations and square-root staffing. Noting that the service availability depends on both the uncertain demand and the provisioning solution, we take a revenue management perspective and present a stochastic optimization formulation for the provisioning problem with the objective of maximizing the expected profit.

2) To solve the above stochastic optimization problem, we present a non-parametric stochastic gradient-based learning algorithm, which adaptively adjusts the provisioning solution as observations of the demand are continuously made. We establish various important properties of the loss probability approximation and the objective function, and show that our stochastic gradient-based learning algorithm has guaranteed convergence to the optimum.

3) Through numerical experiments, we show that the proposed stochastic gradient-based learning algorithm gives a sequence of adaptive provisioning decisions that eventually converge to the true (full information) optimal provisioning solution. Moreover, the algorithm can also adapt quickly to non-stationary demand.

Although this paper is presented in the context of cloud computing, the methodologies are readily applicable to a broad range of other applications in which loss models are suitable. A wide variety of examples include, but are not limited to, applications in e-commerce, inventory control and manufacturing systems, etc.

The rest of the paper is organized as follows. In §2, we present the mathematical model of the profit maximization problem, and discuss necessary background about a multi-class stochastic loss model. In §3, we provide details of our non-parametric learning approach: the stochastic gradient descent algorithm. Numerical experiments are given in §4. Related work and concluding remarks are finally presented in §5 and §6, respectively.

2. THE PROVISIONING PROBLEM

Suppose the service provider supplies R different classes of service templates. For simplicity, we assume each class is defined as multiple units of a base instance.
For example, {one Compute unit, 1.7GB Memory, 160GB Hard-drive} from Amazon EC2. That is, a service template of class r consists of b_r units of the base instance. Denote b = (b_1, b_2, ..., b_R). We assume that it takes the service provider a cost γ_u per unit time to maintain a unit of base instance over a planned time period. Thus the cost of maintaining a unit of class r service template is given by b_r γ_u. We assume that the demand for the R classes of services is random with unknown distribution. Each class r customer, once admitted into the system, will stay for a random amount of time, and then either downgrade or upgrade his service by changing into a different service class, or simply leave the system. Each class r customer will be charged a price p_r per unit time during its sojourn. Let C be the total number of instances provisioned, which are shared by customers of all service classes. An incoming customer of class r will be blocked (and thus lost) if the amount of remaining resources is less than b_r. Let P_L^(r) denote this blocking probability for class r. Clearly, when C is set too low, the blocking probabilities for all service classes will be high; on the other hand, if C is set too high, it could result in significant cost from over-provisioning. Therefore, the service provider needs to decide on the optimal total number of instances C so as to maximize the overall profit.

2.1 Stochastic Modeling: Loss Queue and the Blocking Probability

The dynamics of the above provisioning problem can be modeled as a stochastic multi-class loss queue. As illustrated in Figure 1, there is a common resource pool consisting of C servers that are shared by all R classes of customers. A class r customer requires the service of b_r servers simultaneously. Upon arrival, if the number of available servers is less than b_r, the customer is lost. We assume the arrival process for class r customers is a general point process A_r(t) having unknown distribution F_r(t) with mean λ_r⁻¹ and variance σ²_{A_r} < ∞, r = 1, ..., R.
The service times for class r customers, denoted by S_r, are independent and identically distributed (i.i.d.) random variables with finite mean τ_r and finite variance σ²_{S_r} < ∞, independent of the arrival process. Let ν_r = λ_r τ_r denote the traffic intensity of class r customers, r = 1, ..., R. Denote ν = (ν_1, ..., ν_R).

Figure 1: A multi-class stochastic loss model for cloud services.

In our earlier work [35], we have shown that when the arrival processes are Poisson, the system can be viewed as an M/G/C/C queue, and the blocking probability for class r customers P_L^(r) can be derived as

P_L^(r)(ν, C) = P(C − b_r < Y ≤ C) / P(Y ≤ C),   (1)

where Y = Σ_r b_r Y_r, and Y_r represents the offered load of class r, namely the total number of class r customers in the steady state of an M/G/∞ queue, should there be an infinite number of resources available. It can be shown that Y_r has a Poisson distribution with mean ν_r. Thus Y, a weighted sum of independent Poisson random variables, has mean μ := Σ_{r=1}^R b_r ν_r and variance σ² := Σ_{r=1}^R b_r² ν_r. Since the demand in cloud computing is typically large, using the normal approximation for Poisson random variables with large means, we can further approximate P_L^(r)(ν, C) by assuming Y is normally distributed with mean μ and variance σ².

The above results can be further extended to the more general setting where the arrival process is a renewal process. Lu et al. [28] show that in this case, the blocking probability for class r customers P_L^(r) can also be approximated by (1), where Y is normally distributed with mean μ := Σ_{r=1}^R b_r ν_r and variance σ² := Σ_{r=1}^R b_r² σ_r². Similarly, we can show that Y = Σ_r b_r Y_r, where Y_r represents the offered load of class r, namely the total number of class r customers in the steady state of a GI/G/∞ queue. It can be shown that for large demand ν_r, Y_r is asymptotically normal with mean ν_r and variance σ_r², where σ_r² is the variance of the number of busy servers in a GI/G/∞ system with arrival process A_r(t) and i.i.d. services S_r. Therefore, in the most general setting (when the arrival process is a general renewal process), the blocking probability can be approximated as

P_L^(r)(ν, C) = P(C − b_r < Y ≤ C) / P(Y ≤ C),   (2)

with

Y ~ N(μ, σ²).   (3)

Using the normal approximation (3) for the blocking probability, and under so-called square-root staffing, C(β) := μ + βσ, we can further approximate the blocking probability in closed form as follows. Let Z denote the standard normal random variable N(0, 1). Let φ(β) = (1/√(2π)) e^{−β²/2} and Φ(β) = ∫_{−∞}^β (1/√(2π)) e^{−u²/2} du denote its probability density function and cumulative distribution function, respectively. Then

P(Y ≤ C(β)) = P((Y − μ)/σ ≤ (C(β) − μ)/σ) ≈ P(Z ≤ β) = Φ(β),

and

P(C(β) − b_r < Y ≤ C(β)) = P(Y ≤ C(β)) − P(Y ≤ C(β) − b_r) ≈ Φ(β) − Φ(β − b_r/σ) ≈ (b_r/σ) φ(β), for large ν.

We then have the following approximation for the blocking probability of class r:

P_L^(r)(ν, C) ≈ (b_r/σ) · φ(β)/Φ(β), for large ν,   (4)

where β = (C − μ)/σ. The above blocking probability approximation can be viewed as an extension of the well-known Erlang B loss formula from a single class of service with unit demand to multiple classes with multiple demands. It was first observed by Erlang in [10] that in the single-class, unit-demand case, under square-root staffing C = ν + β√ν, for large values of C (and ν), the Erlang B loss formula is well approximated by (4) with b_r = 1 and σ = √ν.

2.2 Mathematical Formulation of the Provisioning Problem

Motivated by the applications and discussion in the introduction, we now extend the above stochastic loss queueing model and present a mathematical formulation of the provisioning problem from a revenue management perspective.
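The closed-form approximation (4) is straightforward to evaluate numerically. A minimal sketch for the Poisson-arrival case (the class sizes and traffic intensities below are illustrative assumptions of ours, not values from the paper):

```python
import math

def blocking_prob(b, nu, C):
    """Approximate per-class blocking probabilities via (4):
    P_L^(r) ~= (b_r/sigma) * phi(beta)/Phi(beta), beta = (C - mu)/sigma,
    with mu = sum_r b_r nu_r and sigma^2 = sum_r b_r^2 nu_r (Poisson arrivals)."""
    mu = sum(br * nr for br, nr in zip(b, nu))
    sigma = math.sqrt(sum(br ** 2 * nr for br, nr in zip(b, nu)))
    beta = (C - mu) / sigma
    phi = math.exp(-beta ** 2 / 2) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(beta / math.sqrt(2)))           # standard normal cdf
    return [br / sigma * phi / Phi for br in b]

# Illustrative: 3 classes of sizes b = (1, 3, 5) and traffic intensities
# nu = (10, 10, 10), so mu = 90 and sigma^2 = 350; provision C = 110 servers.
probs = blocking_prob((1, 3, 5), (10.0, 10.0, 10.0), C=110)
# Bigger requests are blocked more often, and every class's blocking
# probability falls as C grows, as established in Lemma 1 below.
assert probs[0] < probs[1] < probs[2]
assert all(hi > lo for hi, lo in
           zip(probs, blocking_prob((1, 3, 5), (10.0, 10.0, 10.0), C=130)))
```

With these illustrative numbers the approximation gives blocking probabilities of roughly (0.014, 0.042, 0.070) for the three classes.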
Consider a fixed time period of length W. Let X_r denote the total workload of class r customers arriving during this period, r = 1, ..., R. Denote X = (X_1, ..., X_R). We assume the time period is long enough for the stochastic loss queue model to reach stationarity, so that the blocking probability under steady state can be approximated by (4). Let γ := γ_u W denote the total cost of maintaining one base instance over the entire time window W. Suppose C is the total number of base instances provisioned for this period; then the total profit of the system during this time period is given by

R(C, X) = Σ_r p_r τ_r X_r (1 − P_L^(r)(ν, C)) − γC
        = Σ_r p_r τ_r X_r (1 − (b_r/σ) · φ(β)/Φ(β)) − γC,   (5)

where X_r (1 − P_L^(r)) can be viewed as the effective workload of class r, since the actual demand is thinned due to blocking. Note that β = (C − μ)/σ, where both μ and σ depend on X. Our objective is to find the optimal provisioning solution so that the expected total system profit is maximized. That is,

max_C E[R(C, X)].   (6)

The problem defined by (6) can be viewed as a stochastic optimization problem. When the distribution of the workload X is known, the optimal solution can be found relatively easily. In [7], where the workload is assumed to arrive according to a Poisson process, the authors solve the capacity planning problem using a dynamic programming approach for multiple time periods with time-varying workloads. When the distribution of the workload X is unknown, one needs to make observations and learn the distribution over time, while the provisioning decision has to be made along the process of learning. This can be done by considering a sequence of independent time periods, where observations of the random workload are made for each period. Typical approaches either use a parametric approach, assuming that the demand distribution belongs to a certain parametric distribution family, or a non-parametric approach, where no assumptions on the parametric form of the demand distribution are made. A number of papers in the literature (e.g. [40, 27]) have recognized that a parametric approach could fail to find the optimum or lead to a suboptimal solution (see §5 for further discussion of the related literature). Recent advances in non-parametric learning methods such as the stochastic gradient method (e.g. [18]) have demonstrated that non-parametric learning algorithms can not only converge to the optimum but also adapt to non-stationary demand. This is very attractive for our provisioning problem, since the demand in an emerging market such as cloud computing can be volatile and non-stationary. This motivates us to consider the stochastic gradient descent algorithm, which is presented next.

3. A STOCHASTIC GRADIENT METHOD FOR LEARNING-BASED PROVISIONING
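To make (5)–(6) concrete, the per-period profit can be sketched as follows. The parameter values are illustrative assumptions of ours (in particular the unit cost γ = 0.5), and treating ν_r = X_r τ_r as the offered load of a period is our simplification:

```python
import math

def profit(C, X, b, p, tau, gamma):
    """Profit R(C, X) per (5): revenue from the effective (non-blocked)
    workload of each class minus the capacity cost gamma * C."""
    nu = [x * t for x, t in zip(X, tau)]        # nu_r = X_r * tau_r (our simplification)
    mu = sum(br * nr for br, nr in zip(b, nu))
    sigma = math.sqrt(sum(br ** 2 * nr for br, nr in zip(b, nu)))
    beta = (C - mu) / sigma
    phi = math.exp(-beta ** 2 / 2) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1 + math.erf(beta / math.sqrt(2)))
    loss = [br / sigma * phi / Phi for br in b]  # P_L^(r) from (4)
    revenue = sum(pr * tr * xr * (1 - lr)
                  for pr, tr, xr, lr in zip(p, tau, X, loss))
    return revenue - gamma * C

# Illustrative parameters: class sizes, prices, mean service times, unit cost.
b, p, tau, gamma = (1, 3, 5), (15, 5, 2), (1, 2, 1), 0.5
X = (10.0, 5.0, 10.0)   # one period's observed workloads
# Concavity in C (Lemma 2 below): an intermediate capacity beats both
# heavy blocking (C too small) and heavy over-provisioning (C too large).
assert profit(105, X, b, p, tau, gamma) > profit(60, X, b, p, tau, gamma)
assert profit(105, X, b, p, tau, gamma) > profit(200, X, b, p, tau, gamma)
```

Scanning C for a fixed X locates the full-information optimum; the learning algorithm of §3 reaches the same point without ever knowing the distribution of X.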

In this section, we present the details of our adaptive learning algorithm, which is based on the stochastic gradient descent algorithm.

3.1 Stochastic Gradient Descent Algorithm

The stochastic gradient descent algorithm aims to solve a problem of the form

max_{c ∈ ℝ} E[f(c, X)]

where f(c, X) : ℝ × 𝒳 → ℝ is continuous but not necessarily differentiable. The following assumptions have to hold [34]:

- f(c, X) is concave with respect to c.
- If X is fixed at x, the derivative of f(c, x) with respect to c can be found; this derivative ∇_c f(c, x) is called the stochastic gradient.
- The stochastic gradient ∇_c f(c, x) is bounded.

Then the solution can be found by the following iterative algorithm: at each time period t,

c^t = c^{t−1} + α_{t−1} ∇_c f(c^{t−1}, X^t),   (7)

where ∇_c f(c^{t−1}, X^t) depends on the previous solution c^{t−1} and the most recent observation X^t, and α_{t−1} is called the step size, a function of the observations X^1, ..., X^{t−1}. For the algorithm to converge as t → ∞, the step size α_t must satisfy the following conditions:

1. α_t > 0, a.s.;
2. α_t → 0, as t → ∞;
3. Σ_{t=0}^∞ α_t = ∞ a.s.;
4. E[Σ_{t=0}^∞ α_t²] < ∞.

The stochastic gradient algorithm does not require any prior knowledge of the underlying demand distribution, and at the end of each time period t, the solution c^t adapts to the historical observations of X. The decision chain is guaranteed to converge to the true (full information) optimal solution in the long run. As a result, it has been a common non-parametric approach for multi-period revenue management in many applications such as inventory control ([12, 18, 23, 24, 38]).

3.2 A Stochastic Gradient Algorithm for Provisioning

In this section, we present a stochastic gradient algorithm for provisioning under the rule of square-root staffing, C = μ + βσ. First, we need the following lemmas to verify that our provisioning problem (6) satisfies all the assumptions of the stochastic gradient descent algorithm.

Lemma 1. P_L^(r)(ν, C) is a continuous function that is monotonically decreasing in C. Furthermore, P_L^(r)(ν, C) is a convex function with respect to C.

Proof.
Recall from (4) that the normal approximation of the blocking probability P_L^(r)(ν, C) under square-root staffing, β = (C − μ)/σ, can be viewed as a function of C, whose first-order derivative with respect to C is

∂P_L^(r)(ν, C)/∂C = (∂P_L^(r)(ν, C)/∂β) · (∂β/∂C)
  = (b_r/σ) · (−β φ(β)/Φ(β) − φ²(β)/Φ²(β)) · (1/σ)
  = −(φ(β) b_r/(Φ²(β) σ²)) · (β Φ(β) + φ(β)) ≤ 0,

where the last inequality holds because β Φ(β) + φ(β) = ∫_{−∞}^β Φ(u) du > 0. Hence, P_L^(r)(ν, C) must be monotonically decreasing in C.

To show that P_L^(r)(ν, C) is convex in C, we need to show that the second derivative of P_L^(r)(ν, C) with respect to C,

∂²P_L^(r)(ν, C)/∂C² = (b_r φ(β)/(σ³ Φ(β))) · (β² + 3β φ(β)/Φ(β) + 2 φ²(β)/Φ²(β) − 1),

is non-negative. Let A(β) = β² + 3β φ(β)/Φ(β) + 2 φ²(β)/Φ²(β) − 1. It can be verified that A(β) is monotonically increasing and A(β) → 0 as β → −∞; therefore ∂²P_L^(r)(ν, C)/∂C² ≥ 0. □

The following lemma follows immediately from Lemma 1.

Lemma 2. R(C, X) is continuous and concave with respect to C.

This verifies the first assumption of the stochastic gradient descent algorithm. If the workload X is fixed at x = (x_r, r = 1, ..., R), and the mean service time τ_r is known, the stochastic gradient is given by

∇_C R(C, x) = Σ_r p_r τ_r x_r · (b_r φ(β)/(σ² Φ(β))) · (β + φ(β)/Φ(β)) − γ,   (8)

where β = (C − μ)/σ. This verifies the second assumption, namely that the stochastic gradient can be calculated explicitly. Finally, Lemma 3 verifies the last assumption.

Lemma 3. ∇_C R(C, x) is bounded.

Proof. It can be verified that (φ(β)/Φ(β)) · (β + φ(β)/Φ(β)) in (8) has range [0, 1]. For given values of p_r and γ, ∇_C R(C, x) is therefore bounded. □

Now that all the assumptions are met, we are able to apply the stochastic gradient descent algorithm to our provisioning problem. The corresponding algorithm is outlined as follows. At the end of each time period t ≤ T, the workload X^t and mean service time τ^t can be observed. The new observation updates the stochastic gradient (8), and a new decision can be made by

C^t = C^{t−1} + α_{t−1} ∇_C R(C^{t−1}, X^t),

Data: (X 1 r, τ 1 r ),..., (X t r, τ t r),... Intalzaton: t = 1, ntal pont C 1 ; whle t T do Collect demand data X t and τ t ; Set step sze α t 1; C t = C t 1 + α t 1 CR(C t 1, X t ); t = t + 1; end Algorthm 1: Stochastc gradent algorthm for provsonng problem. then move to the end next tme perod t + 1. By updatng stochastc gradent and decson teratvely, a chan of C t, t = 1,..., T can be generated that eventually converges to the optmal soluton as T. 3.3 Convergence Rate In order to show the stochastc gradent algorthm for provsonng problem converges at the rate up to O(1/ T ), we frst need to show the followng theorem. Theorem 1. Let Q : C R be a concave functon defned on a compact convex set C. For any C C, let g(c) denote any subgradent of Q at C, and let H(C) be a random varable defned on C such that E[H(C) C] = g(c). Suppose there exsts B such that H(C) B wth probablty 1 for all c C. Let C 1 be any pont n C. For any t 1, teratvely defne C t+1 = C t + α th(c t ) where α t = t a,.5 < a 1. Then, for all T 1 [ ] Q(C 1 T ( ) E Q(C t ) = O T max( a,a 1)). T t=1 where C = argmax C C Q(C). And Proof. For any t 1, E C C t+1 2 =E C C t α th(c t ) 2 =E C C t 2 + α 2 t E H(C t ) 2 2α te[h(c t )(C C t )] (9) E[H(C t )(C C t )] =E[E[H(C t )(C C t ) C t ]] =E[g(C t )(C C t )] (1) Snce Q s concave, t must have therefore, Q(C ) Q(C t ) g(c t )(C C t ), E[Q(C ) Q(C t )] E[g(C t )(C C t )]. (11) Plug (1) nto (9) and by nequalty (11), we have E[Q(C ) Q(C t )] E Ct C 2 2α t E Ct+1 C 2 2α t + αt 2 E H(Ct ) 2. (12) Take the summaton of (12) over t = 1,..., T, T Q(C ) T t=1 + 1 2 T E[Q(C t )] t=1 { } E C t C 2 E Ct+1 C 2 2α t 2α t T α te H(C t ) 2 t=1 dam(c )2 + B2 2α T +1 2 As a result, Q(C ) E T t=1 α t dam(c )2 T a + B2 2 2(1 a) T 1 a. [ 1 T ] T Q(C t ) t=1 ( = O T max( a,a 1)). If we replace Q(C), H(C) n the above Theorem 1 by E[R(C, X)], CR(C, X) from our provsonng problem, respectvely, we have the followng corollary. Corollary 1. 
The stochastic gradient algorithm for the provisioning problem max_C E[R(C, X)] converges to the optimal provisioning solution at the rate of O(T^{max(−a, a−1)}) when the step size is α_t = t^{−a}, a ∈ [0.5, 1].

Proof. By Lemma 2, R(C, X) is concave with respect to C. The stochastic gradient ∇_C R(C, X) is bounded by Lemma 3. Replacing Q(C) and H(C) in Theorem 1 by E[R(C, X)] and ∇_C R(C, X), respectively, the corollary follows immediately from Theorem 1.

Moreover, the convergence rate is O(1/√T) when α_t = 1/√t. As mentioned above, the provisioning decisions generated by the stochastic gradient algorithm adapt to observations of workload and service times, without knowledge of the true distributions. In the next section, we show that the algorithm works for both stationary and non-stationary demand through numerical experiments.

4. NUMERICAL EXPERIMENTS

In this section, we illustrate the convergence speed and performance of the stochastic gradient algorithm through three numerical experiments: 1) stationary demand, 2) non-stationary demand, and 3) different initial values and step sizes. For simplicity, we assume there are 3 different service classes, where each class requires b = (1, 3, 5) base instances, respectively. Prices are set to p = (150, 50, 20), and the unit capacity cost is γ = 3. Demands (workloads) X = (X_1, X_2, X_3) are sampled by a random number generator as if they were real data. The mean service times are τ = (10, 2, 10) for all time periods. The algorithm, however, does not know the underlying demand distributions. In each experiment, we examine the provisioning solutions generated by the algorithm and the corresponding time-average profit in the long run.
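Algorithm 1's update loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: since the closed-form gradient (8) depends on the loss-model quantities, the sketch substitutes a toy unbiased noisy gradient of a hypothetical concave profit with optimum C* = 4200; all numbers are illustrative.

```python
import numpy as np

def noisy_gradient(C, rng, C_star=4200.0):
    """Unbiased estimate of the toy profit gradient (stand-in for Eq. (8))."""
    return -2.0 * (C - C_star) + rng.normal(0.0, 5.0)

def provision(C1=4000.0, T=2000, a=0.6, seed=0):
    """Run Algorithm 1: observe demand, update the gradient, update C."""
    rng = np.random.default_rng(seed)
    C = C1
    for t in range(1, T + 1):
        alpha = t ** (-a)                 # step size alpha_t = t^(-a)
        C += alpha * noisy_gradient(C, rng)
    return C
```

Because the step sizes decay, the iterates settle near the optimum regardless of the starting point, which is the behavior Corollary 1 guarantees for the concave profit objective.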

4.1 Stationary Demand Case

We generate samples from three typical distributions with the same means: 1) Uniform: X_1 ~ U(80, 120), X_2 ~ U(40, 60), X_3 ~ U(60, 140); 2) Gaussian: X_1 ~ N(100, 10²), X_2 ~ N(50, 5²), X_3 ~ N(100, 20²); 3) Poisson: X_1 ~ Poisson(100), X_2 ~ Poisson(50), X_3 ~ Poisson(100). The step size is set to α_t = t^{−0.6}, and the initial value is C^1 = 4000. We set the total number of time epochs to T = 15000.

Figure 2 plots the capacity solutions generated by Algorithm 1 and the time-average profits over all time periods when demand samples are drawn from the above three distributions. In all three cases, the stochastic gradient algorithm converges, which confirms the statement in Corollary 1. Note that even with the same mean workload, the provisioning solutions converge to very different optimal solutions when the underlying distributions differ. Recall that traditional parametric learning algorithms depend on knowledge of the true distribution type: when the wrong assumption is made, the learning algorithm can deviate from the true optimum. This issue is resolved by our online adaptive learning based on the stochastic gradient algorithm.

4.2 Non-stationary Demand Case

Next, we consider an experiment in which the underlying demand distribution shifts after the first 10000 time periods. Table 1 summarizes the parameters used for the test scenarios with non-stationary demand. We consider 2 scenarios: 1) demand shifts from Poisson distributions to uniform; 2) demand shifts from Gaussian distributions to Poisson. The step size is kept at α_t = t^{−0.6}, the initial value is C^1 = 4600, and we set the total number of time epochs to T = 30000.

Table 1: Summary of demand distributions (distr.).

    Scenario 1    Distr. before shift: Poisson     Distr. after shift: Uniform
      Class 1       λ = 105                          U(80, 120)
      Class 2       λ = 52.5                         U(40, 60)
      Class 3       λ = 105                          U(60, 140)
    Scenario 2    Distr. before shift: Gaussian    Distr. after shift: Poisson
      Class 1       N(100, 10²)                      λ = 105
      Class 2       N(50, 5²)                        λ = 52.5
      Class 3       N(100, 20²)                      λ = 105

As shown in Figure 3, the stochastic gradient algorithm adapts to the shift in the demand distribution, and the capacity solutions quickly converge to the new optimum.
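The three stationary demand generators can be sketched as follows, assuming class means of (100, 50, 100) shared across distributions; the spread parameters mirror the setup above, and the code itself is an illustration, not the authors' experiment harness. The algorithm only ever consumes the samples, never the distribution's identity.

```python
import numpy as np

def sample_demand(dist, rng):
    """Draw one demand vector (X1, X2, X3); all three families share means (100, 50, 100)."""
    if dist == "uniform":
        return rng.uniform([80, 40, 60], [120, 60, 140])
    if dist == "gaussian":
        return rng.normal([100, 50, 100], [10, 5, 20])
    if dist == "poisson":
        return rng.poisson([100, 50, 100]).astype(float)
    raise ValueError(f"unknown distribution: {dist}")
```

Feeding each generator to the same gradient loop reproduces the qualitative finding of Figure 2: equal means, yet different converged capacities, because the gradient (8) depends on the whole demand distribution through the blocking term.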
Before the shift at t = 10⁴ (the red vertical dashed line), the solution sequences converge similarly as in the first experiment with stationary demand. After t = 10⁴, the solutions move immediately towards a new convergence limit, which shows that the algorithm is adapting to a new optimum. Unlike many manufactured products that have relatively stable and predictable demand, cloud computing services usually face non-stationary demand. Compared to existing traditional learning algorithms, our adaptive learning approach is more suitable for cloud computing provisioning problems because it adapts to incoming non-stationary demand observations.

4.3 Choices of Initial Value and Step Size

During exploration of the numerical experiments, we observe that the choices of the initial value C^1 and the step size α_t can affect the convergence performance. This is demonstrated via the following experiment. The demand is sampled from Gaussian distributions (X_1 ~ N(100, 10²), X_2 ~ N(50, 5²), X_3 ~ N(100, 20²)). We set the total number of time periods to T = 30000. Figure 4 plots the capacity solutions and the time-average profit over all time periods. For the dashed line, we set the step size to α_t = t^{−0.5} and the initial value to C^1 = 4200; for the solid line, α_t = t^{−0.5} and C^1 = 5000; for the dotted line, α_t = t^{−0.7} and C^1 = 5000. By observation, C^1 = 4200 appears to be a better guess than 5000, and according to Corollary 1, α_t = t^{−0.5} should yield a faster convergence rate than α_t = t^{−0.7}.

When we fix the step size, a better guess of the initial value clearly benefits the overall performance of the algorithm: in this experiment, when C^1 = 4200, which is a lucky guess, the time-average profit curve is the highest among the three. But after t = 10⁴, the difference between the dashed line (C^1 = 4200) and the solid line (C^1 = 5000) is almost negligible. When we fix the initial value and vary the step size, as stated in Theorem 1 and Corollary 1, the dashed line (α_t = t^{−0.5}) converges faster than the dotted line (α_t = t^{−0.7}). However, due to the larger step size, the dashed line is also fuzzier.
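The step-size trade-off just described can be checked on a toy objective. This sketch is not the paper's experiment: it assumes a hypothetical concave profit with optimum C* = 4200 and simply measures how much the iterates wiggle late in the run under a = 0.5 versus a = 0.7.

```python
import numpy as np

def run(a, C1=5000.0, T=5000, seed=1):
    """Gradient iterates for step size alpha_t = t^(-a) on a toy concave objective."""
    rng = np.random.default_rng(seed)
    C, path = C1, []
    for t in range(1, T + 1):
        C += t ** (-a) * (-2.0 * (C - 4200.0) + rng.normal(0.0, 20.0))
        path.append(C)
    return np.array(path)

coarse, fine = run(0.5), run(0.7)

def late_noise(path):
    """Spread of the iterates near the optimum (the 'fuzziness' of the curve)."""
    return float(np.std(path[-1000:]))
```

Both schedules reach the neighborhood of C*, but the a = 0.5 path keeps larger late-stage steps and therefore stays visibly noisier, mirroring the dashed versus dotted lines of Figure 4.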
In practice, service providers usually prefer a smooth flow of decisions over time. Inspired by this observation, we suggest deliberately using a larger step size during the initial periods to speed up the algorithm; once the solution gets close to the optimum, it may be preferable to decrease the step size so as to keep the solution curve smooth.

5. RELATED WORKS

5.1 Resource Provisioning for Cloud Services

Resource provisioning for cloud services has received increasing attention in recent years. A number of existing works study long-term resource provisioning problems for data centers using profiling and statistical modeling of different traffic patterns for individual applications/VMs or groups of them (see, e.g., [16, 17, 6, 29]). Other studies focus on short-term provisioning solutions by exploiting dynamic control techniques ([6, 30, 25]). A few recent works consider cloud resource provisioning using constrained programming and robust optimization approaches ([37, 13]). Most of these existing methods either make idealized assumptions about the demand distribution or rely on extensive offline statistical analysis of historical data. To the best of our knowledge, we are among the first to adopt a non-parametric adaptive learning approach to the resource provisioning problem that requires neither extensive historical data nor any prior knowledge of the demand distribution.

In our mathematical formulation of the problem, we adopt a stochastic loss model and study the service availability of cloud services via a normal approximation of the loss probability under square-root staffing. The literature in

[Figure 2: Numerical results of the stochastic gradient algorithm with stationary demand. Left: convergence of the capacity solutions; right: convergence of the time-average profit.]

[Figure 3: Numerical results of the stochastic gradient algorithm with non-stationary demand. Left: capacity solutions; right: time-average profit.]

loss models is vast, and various forms of loss models and loss networks have been studied extensively in the context of circuit-switched networks in the 80s ([21, 39], etc.), and recently in the context of workforce management ([28, 7]) and cloud computing ([35, 36]). The well-known Erlang blocking formula [10], initially established by Erlang for M/M/C/C queues, has been extended to multi-link, multi-route models (see, e.g., [10], [11], [21]). Various insensitivity results have also been established ([8, 11]), suggesting that the blocking probability formula (1) can be applied to a broad class of traffic characteristics and is thus effective in modeling the loss behavior of networks with finite resources.

5.2 Statistical Methods for Stochastic Optimization

The literature on stochastic optimization methodologies is also vast. When information on the demand distribution is not available, a common approach in the literature is the Bayesian approach: typically, a particular family of distributions is picked, while the associated parameters are iteratively updated as more observations are collected. See, for example, [33, 20, 40] in the context of revenue management.
A drawback of this approach is that the expected profit in a period actually depends on the manager's belief about the demand distribution in that period (see [18, 15]), which is not ideal when the demand distribution is volatile and hard to predict. Liyanage and Shanthikumar [27] and Lim et al. [26] demonstrate that the subjective Bayesian approach depends heavily on a lucky guess of the initial parameter values; bad choices can lead to poor performance. Xia and Dube [40] show that under Bayesian adaptive control for pricing, the resulting solution can exhibit price dispersion, where with positive probability a subset of sample paths leads to suboptimal solutions. When the demand is non-stationary, traditional approaches often use moving averages or exponential smoothing for forecasting. By separating the estimation and optimization tasks, even with full information about the true demand distribution, these traditional approaches sometimes lead to suboptimal solutions [27].

[Figure 4: Numerical results of the stochastic gradient algorithm with different step sizes and initial values. Left: convergence of the capacity solutions; right: convergence of the time-average profit.]

The stochastic gradient descent algorithm, one of the most commonly used non-parametric learning algorithms for stochastic optimization, has been studied extensively in the literature. The first and prototypical gradient-based algorithms are the Robbins-Monro [31] and Kiefer-Wolfowitz [22] algorithms. Bottou and LeCun [9] proved that this gradient-based learning gives a sublinear convergence rate. Huh and Rusmevichientong [18] showed that with a proper choice of step size and a bounded stochastic gradient, the convergence rate can be O(1/√T). Le Roux et al. [32] introduced a new variant of the algorithm, called stochastic average gradient, which can provide a linear convergence rate. To the best of our knowledge, we are among the first to employ the stochastic gradient method to develop an adaptive learning approach for the provisioning problem of cloud services.

6. CONCLUSION AND FUTURE WORKS

In summary, we present an online non-parametric adaptive learning approach to the optimal resource provisioning problem for cloud services. We provide a stochastic optimization formulation of the provisioning problem that integrates a stochastic loss model with revenue management perspectives. We develop a stochastic gradient-based learning algorithm that does not require any prior knowledge of the demand distribution and can adaptively adjust the provisioning solution as observations of the demand are continuously made.
We show that our adaptive learning algorithm guarantees convergence to the optimum and demonstrate through simulation that it can adapt quickly to non-stationary demand. Through arguments similar to those presented in Sections 2 and 3, it is possible to generalize the approach to handle various capacity planning problems in a more general stochastic loss network setting; this is part of our ongoing investigation. We also intend to explore various advanced methods to accelerate the convergence of stochastic gradient algorithms.

7. ACKNOWLEDGEMENTS

This work has been supported in part by NSF grants ECCS-1232118 and SES-149214.

8. REFERENCES

[1] Amazon web services. http://aws.amazon.com/ec2/sla/.
[2] The Apache Cassandra project. http://cassandra.apache.org.
[3] Hadoop distributed file system. http://hadoop.apache.org/hdfs.
[4] HBase. http://hbase.apache.org.
[5] Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. A view of cloud computing, April 2010.
[6] Bennani, M. N., and Menasce, D. A. Resource allocation for autonomic data centers using analytic performance models. International Conference on Autonomic Computing (2005), 229-240.
[7] Bhadra, S., Lu, Y., and Squillante, M. S. Optimal capacity planning in stochastic loss networks with time-varying workloads. Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2007), 227-238.
[8] Bonald, T. The Erlang model with non-Poisson call arrivals. SIGMETRICS Perform. Eval. Rev. 34 (June 2006), 276-286.
[9] Bottou, L., and LeCun, Y. Large scale online learning. Advances in Neural Information Processing Systems 16, Proceedings of the 2003 Conference (2003), 217-224.
[10] Brockmeyer, E., Halstrom, H. L., and Jensen, A. The Life and Works of A. K. Erlang. Academy of Technical Sciences, Copenhagen, 1948.
[11] Burman, D. Y., Lehoczky, J. P., and Lim, Y. Insensitivity of blocking probabilities in a circuit-switching network. Journal of Applied Probability 21 (1984), 850-859.

[12] Burnetas, A., and Smith, C. Adaptive ordering and pricing for perishable products. Operations Research 48 (2000), 436-443.
[13] Chaisiri, S., Lee, B., and Niyato, D. Robust cloud resource provisioning for cloud computing environments. In 2010 IEEE International Conference on Service-Oriented Computing and Applications (SOCA) (2010), pp. 1-8.
[14] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Proc. USENIX OSDI, San Francisco, CA, Dec. 6-8, 2004, pp. 137-149.
[15] Ding, X., Puterman, M., and Bisi, A. The censored newsvendor and the optimal acquisition of information. Operations Research 50 (2002), 517-527.
[16] Gmach, D., Rolia, J., Cherkasova, L., and Kemper, A. Capacity management and demand prediction for next generation data centers. IEEE International Conference on Web Services (2007), 43-50.
[17] Govindan, S., Choi, J., Urgaonkar, B., Sivasubramaniam, A., and Baldini, A. Statistical profiling-based techniques for effective power provisioning in data centers. In Proceedings of the 4th ACM European Conference on Computer Systems (New York, NY, USA, 2009), EuroSys '09, ACM, pp. 317-330.
[18] Huh, T., and Rusmevichientong, P. A nonparametric asymptotic analysis of inventory planning with censored demand. Mathematics of Operations Research 1 (2009), 103-123.
[19] Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. ACM SIGOPS/EuroSys, Lisboa, Portugal, Mar. 21-23, 2007, 57-72.
[20] Karlin, S. Dynamic inventory policy with varying stochastic demands. Management Science 6 (1960), 231-258.
[21] Kelly, F. Reversibility and Stochastic Networks. Wiley, Chichester, 1979.
[22] Kiefer, J., and Wolfowitz, J. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics 23, 3 (1952), 462-466.
[23] Kunnumkal, S., and Topaloglu, H. Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Operations Research 56 (2008), 646-664.
[24] Kunnumkal, S., and Topaloglu, H.
A stochastic approximation method for the single-leg revenue management problem with discrete demand distributions. Mathematical Methods of Operations Research 70 (2009), 477-504.
[25] Kusic, D., and Kandasamy, N. Risk-aware limited lookahead control for dynamic resource provisioning in enterprise computing systems. In IEEE International Conference on Autonomic Computing 2006 (ICAC '06) (2006), pp. 74-83.
[26] Lim, A. E., Shanthikumar, J. G., and Shen, Z. M. Model uncertainty, robust optimization and learning. Tutorials in Operations Research: Models, Methods, and Applications for Innovative Decision Making (2006), 66-94.
[27] Liyanage, L. H., and Shanthikumar, J. G. A practical inventory control policy using operational statistics. Operations Research Letters 33 (2005), 341-348.
[28] Lu, Y., Radovanovic, A., and Squillante, M. S. Optimal capacity planning in stochastic loss networks. ACM SIGMETRICS Performance Evaluation Review (2007), 39-41.
[29] Meng, X., Isci, C., Kephart, J., Zhang, L., Bouillet, E., and Pendarakis, D. Efficient resource provisioning in compute clouds via VM multiplexing. In Proceedings of the 7th International Conference on Autonomic Computing (New York, NY, USA, 2010), ICAC '10, ACM, pp. 11-20.
[30] Padala, P., Shin, K., Zhu, X., Uysal, M., Wang, Z., Singhal, S., Merchant, A., and Salem, K. Adaptive control of virtualized resources in utility computing environments. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 289-302.
[31] Robbins, H., and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 22, 3 (1951), 400-407.
[32] Roux, N. L., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258 (2012).
[33] Scarf, H. A min-max solution of an inventory problem. Studies in The Mathematical Theory of Inventory and Production. Stanford University Press, Stanford, CA, 1958.
[34] Spall, J. C. Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.
[35] Tan, Y., Lu, Y., and Xia, C. H. Provisioning for large scale cloud computing services. Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (2012), 47-48.
[36] Tan, Y., Lu, Y., and Xia, C. H. Provisioning for large scale loss network systems with applications in cloud computing. Performance Evaluation Review (2012), 83-85.
[37] Van, H. N., Tran, F. D., and Menaud, J.-M. SLA-aware virtual resource management for cloud infrastructures. Computer and Information Technology, International Conference on 1 (2009), 357-362.
[38] van Ryzin, G., and McGill, J. Revenue management without forecasting or optimization: An adaptive algorithm for determining airline seat protection levels. Management Science 46 (2000), 760-775.
[39] Whitt, W. Blocking when service is required from several facilities simultaneously. AT&T Tech. J. 64 (1985), 1807-1856.
[40] Xia, C. H., and Dube, P. Dynamic pricing in e-services under demand uncertainty. Production and Operations Management 16, 6 (2007), 701-712.

Demystifying Casualties of Evictions in Big Data Priority Scheduling

Andrea Rosà
Università della Svizzera Italiana
Lugano, Switzerland
andrea.rosa@usi.ch

Robert Birke
IBM Research Lab Zurich
Rüschlikon, Switzerland
bir@zurich.ibm.com

ABSTRACT

The ever increasing size and complexity of large-scale datacenters increase the difficulty of developing efficient scheduling policies for big data systems, where priority scheduling is often employed to guarantee the allocation of system resources to high-priority tasks, at the cost of task preemption and the resulting resource waste. A large number of related studies focuses on understanding workloads and their performance impact on such systems; nevertheless, existing works pay little attention to evicted tasks, their characteristics, and the resulting impairment of system performance. In this paper, we base our analysis on Google cluster traces, where tasks can experience three different types of unsuccessful events, namely eviction, kill, and fail. We particularly focus on eviction events, i.e., the preemption of task execution due to higher priority tasks, and rigorously quantify their performance drawbacks in terms of wasted machine time and resources, with particular focus on priority. Motivated by the high dependency of eviction on the underlying scheduling policies, we also study its statistical patterns and its dependency on other types of unsuccessful events. Moreover, by considering co-executed tasks and system load, we deepen the knowledge of priority scheduling, showing how priority and machine utilization affect the eviction process and related tasks.

1. INTRODUCTION

In today's big data systems, which feature high heterogeneity and dynamicity [15], scheduling policies have become a key factor towards ensuring high performance and reliability. Among those, priority scheduling is especially used nowadays to dictate different strategies for resource allocation [2, 9], execution queue management [22, 25] and preemption [2] based on task priority.
Lydia Y. Chen
IBM Research Lab Zurich
Rüschlikon, Switzerland
yic@zurich.ibm.com

Walter Binder
Università della Svizzera Italiana
Lugano, Switzerland
walter.binder@usi.ch

Copyright is held by author/owner(s).

Existing policies for priority scheduling focus on different objectives: from fulfilling the latency/throughput requirements of various priorities, improving resource efficiency and minimizing cost [2], to scalability across different system sizes [14]. As failures are very frequent in large-scale datacenters [3], unsuccessful task executions emerge as another critical criterion for scheduling policies, i.e., the scheduler's responsiveness and effectiveness in mitigating unsuccessful terminations of tasks. The multitude of jobs co-executing in large datacenters [24] and the impressive amount of resources needed for their computations [7] push system administrators to include preemption strategies in schedulers for big data systems. In the context of priority scheduling, one or more tasks of low priority are evicted from the cluster in case of a shortage of resources, meaning that their execution is stopped, the resources allocated to them are freed, and a higher priority task is scheduled in their place. Evicted tasks may be restarted at a later time, if the cluster provides enough resources for their execution. As evicted tasks do not complete their execution, the amount of resources used by them before preemption is essentially wasted, as it leads to no useful computation. Despite the large presence of priority scheduling and task eviction in today's big data schedulers [2], prior studies [4, 8, 15, 17] on datacenter workloads mainly focus on understanding resource demand and characteristics, using operational data collected by cloud operators [6, 10, 24]. Task preemption due to priority scheduling and the resulting resource waste are typically overlooked by related work. Existing dependability studies mainly focus on physical components [18, 19] or software errors [13, 26]. As a result, there is no extensive study of the preemption of tasks in large-scale datacenters due to big data priority scheduling.
In this paper, we shed light on the performance impact of evicted tasks, their characteristics, and their relationship with the underlying scheduling policy in today's big data systems. We base our analysis on traces collected from schedulers and machines at a sizeable datacenter [24], where tasks can experience three types of unsuccessful events¹, i.e., eviction, fail, and kill.

¹Throughout this paper, we also term unsuccessful events as unsuccessful executions, simply meaning that the scheduling of tasks on machines does not lead to a successful termination.

We focus especially on eviction events, by providing an exhaustive characterization study of task preemption. Throughout this paper, we adopt a priority-aware approach, showing whether different task priorities impact the system differently, and to what extent. Combining several huge tables available in the traces, our analysis consists of three parts: 1) the performance impact of

eviction on system resources, 2) the study of the dependency of eviction on other unsuccessful events, and 3) an in-depth analysis of priority scheduling mechanisms related to system resources and task priorities. We highlight the large amount of time and physical resources wasted due to evicted tasks by comparing eviction and successful events with respect to task priorities. Moreover, motivated by the high dependency of eviction on the scheduling policy and the repetitive occurrences of eviction on low-priority tasks, we conduct an in-depth analysis of eviction events and find their temporal dependency on other types of unsuccessful executions. Finally, to shed light on the characteristics of priority scheduling in big data systems, we provide an extensive study of the underlying mechanisms of the eviction process, showing how priority and system load impact scheduled and evicted tasks during an eviction event. The building blocks of our analysis are the statistical properties and the resource demands of evicted tasks, besides the extensive study of the underlying mechanisms of priority scheduling. Due to the limited description in the traces document and no access to the actual system, our characterization study resorts to a black-box approach that may fall short of providing a concise analysis.

The contributions of this paper are twofold: to the best of our knowledge, this is the first-of-its-kind field analysis of task preemption in big data priority scheduling. Moreover, we reveal the complex interdependency among different types of unsuccessful events, providing observations for modeling eviction events in large-scale datacenters.

The outline of this work is as follows. Section 2 describes the dataset and the data collection methodology. We provide a general overview of unsuccessful executions and their main characteristics in Section 3. The performance impact of eviction events and their time dependency are detailed in Section 4. An in-depth analysis of the eviction process is given in Section 5. Section 6 presents related work, followed by our conclusions in Section 7.
2. DATA COLLECTION

Our work is based on production traces from Google datacenters [24], which represent a rich mix of heterogeneous and dynamic workloads running on a large cluster for 29 days. The trace format and specification can be found in the traces document [16]. The dataset contains multiple tables, each of which serves a different purpose and covers different workload information. Tables are split into several files by time stamps. In particular, we focus on the task Events table, the task Usages table, and the Machines events table. In our work, we look at different portions of the traces, i.e., 29 days, 7 days, and 1 day, depending on the ongoing analysis. In particular, we select the second week of the workload when looking at 7 days, and the 23rd day of the traces when looking at a single-day window.

The Events table stores all the events, e.g., submissions, schedulings, and terminations, experienced by all tasks during their life cycle in the cluster, along with the resources requested by them at submission time and their priority. All events are recorded at microsecond granularity. The Usages table reports several measurements related to task execution, collected by the cluster profilers at run time. Specifically, it records, for each task, the average usage of CPU, RAM, and DISK for every measurement period, whose default length is 300 seconds. The Machines table reports the events occurring to machines, along with their equipped CPU and RAM.

[Figure 1: States (uppercase) and events (lowercase) diagram for tasks and machines in the cluster. (a) Tasks: states UNSUB, PEND, RUN, END, with events submit, schedule, evict, fail, finish, kill. (b) Machines: states UP, DOWN, with events add, remove, update.]

2.1 Tasks States and Events

The aim of this cluster is to execute jobs submitted by users. A single job is composed of one or more tasks, the minimum running entities. A single task represents a Linux program, running on a single machine.
Combining the aforementioned three tables in the traces, one can know tasks' requested and used resources, the events in tasks' life cycles, and their interaction with co-executed tasks. Users are required to specify requested resources for their tasks, i.e., the amount of CPU, RAM, and DISK that is allocated to tasks during their execution. CPU expresses the maximum number of cores that tasks can use, while RAM and DISK specify the amount of volatile and mass memory, respectively. The scheduler can allocate to tasks more resources than the ones equipped on machines; we name such a situation resource overcommitment. Tasks also have a priority, ranging over [0, 11], where high values represent important tasks.

During their life cycle in the cluster, tasks pass through four states: unsub (unsubmitted), pend (pending), run (running), and end. Figure 1(a) shows the state diagram for tasks in this cluster. As soon as tasks enter the cluster, they leave the unsub state and wait for a scheduling decision in the pend state. Once tasks are placed on machines, they enter the run state until their execution completes, either successfully or not, when they reach the end state. Transitions between states are triggered by different types

of events. Apart from the successful finish of a task, there are three types of events that lead tasks from run to end, namely eviction, kill, and fail: according to the traces document [16], tasks can be evicted by the scheduler due to task congestion, can be manually killed by datacenter users, or can fail after an internal error. Evicted, killed, and failed tasks can be rescheduled on a different machine, provided that they remain runnable and that a given maximum rescheduling limit is not reached. Although the presence of such a limit is mentioned in the traces document, no value is provided.

2.2 Machines States and Events

Machines have a simpler life cycle, as shown in Figure 1(b): they can be added to or removed from the cluster, or their attributes can be updated during their uptime. The traces specify the amount of equipped resources, i.e., CPU and RAM, in each machine. For confidentiality reasons, equipped resources have been normalized to [0, 1] by the trace producer. Table 1 summarizes the heterogeneous cluster configuration at the beginning of the trace. The cluster is composed of a total of 12585 machines.

Table 1: Cluster configuration at the beginning of the trace: distribution of machines [%] and equipped resources (normalized to [0, 1]).

    %       CPU     RAM
    53.47   0.50    0.4995
    30.74   0.50    0.2493
    7.95    0.50    0.749
    6.32    1.00    1.00
    0.99    0.25    0.2498
    0.43    0.50    0.1241
    0.04    0.50    0.0309
    0.03    0.50    0.9678
    0.02    1.00    0.50
    0.01    0.50    0.0616

2.3 Challenges in Data Collection

We face several challenges in gathering and sanitizing the measurements of interest. All trace tables have different logging granularities, structures, and even missing information. Our analysis spans very large volumes of tables due to the massive amount of jobs, tasks, and events in the cluster, leading to complex joins that require a burdensome amount of time. To better sanitize the data, we ignore the following types of tasks in our analysis: (1) tasks that have only the starting or ending time stamp during the observation window, and (2) tasks that are recorded in the run state in the Events table but have no records in the Usages table.
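The two sanitization rules can be sketched as follows on a toy event log. The tuple layout and the table stand-ins are illustrative, not the exact trace schema; in the real traces the same filtering is done by joining the Events and Usages tables on the task identifier.

```python
# Toy stand-in for the Events table: (task_id, event) records.
events = [
    (1, "schedule"), (1, "finish"),
    (2, "schedule"),               # end event falls outside the window
    (3, "schedule"), (3, "evict"),
    (4, "finish"),                 # start event falls outside the window
]
# Toy stand-in for the Usages table: task ids with at least one usage record.
usage_records = {1, 3}

END_EVENTS = {"finish", "evict", "fail", "kill"}
starts = {tid for tid, ev in events if ev == "schedule"}
ends = {tid for tid, ev in events if ev in END_EVENTS}

# Rule (1): both a start and an end must be observed in the window;
# rule (2): usage measurements must exist for the task.
valid_tasks = (starts & ends) & usage_records
```

Tasks 2 and 4 are dropped by rule (1) and would leave the running time undefined; any task surviving rule (1) but missing from the Usages table is dropped by rule (2).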
In fact, it is impossible to compute for these tasks their running time and used resources, which are essential metrics in our study. We further ignore records for which no clear event is specified according to the traces documentation.

3. UNSUCCESSFUL EXECUTIONS

Before deepening our knowledge of eviction and the underlying priority scheduling, we provide a general overview of all types of events that jobs and tasks may experience in the system, i.e., finish, kill, eviction, and fail. About 34.6% of tasks have at least one unsuccessful execution, i.e., they do not terminate with a finish event. We summarize in Table 2 the number and percentage of each possible outcome, considering scheduling events and unique tasks. A unique task can result in multiple scheduling events due to multiple resubmissions: for example, a task can be evicted (and scheduled subsequently) multiple times before it successfully completes. We count every task eviction in the second column of Table 2 ("Scheduling events"), while we account only for the first eviction of each task in the third column ("Unique tasks").

Table 2: Summary of ending events in the system: by scheduling events and by unique tasks.

    Ending      Scheduling events       Unique tasks
    event       Count    Percentage     Count    Percentage
    Finish      3.99M    25.35%         3.88M    68.5%
    Kill        2.35M    14.92%         1.76M    3.92%
    Eviction    1.83M    11.63%         0.33M    5.85%
    Fail        7.57M    48.10%         0.34M    5.93%

[Figure 2: Occurrences of ending events (eviction, fail, kill, finish), breakdown by priority. The y axis counts event occurrences (×10⁶); the x axis spans priorities 0-11.]

The negative difference between the number of scheduling events and unique tasks in Table 2 indicates that tasks experience the same type of event multiple times. This is particularly noticeable for fail and eviction: a subset of tasks tends to be evicted repetitively or fails continuously. Indeed, unique tasks which fail or are evicted more than once experience on average 22.4 fail and 5.5 eviction events during their life, respectively. To our surprise, less than 26% of task schedulings complete successfully.
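The per-priority breakdown behind Table 2 and Figure 2 amounts to counting (priority, ending event) pairs over the Events table. The sketch below uses a synthetic record list as a stand-in for the real table; the counts are illustrative only.

```python
from collections import Counter

# Synthetic (priority, ending_event) records standing in for the Events table.
ending = [
    (0, "evict"), (0, "evict"), (0, "fail"), (0, "kill"),
    (1, "finish"), (1, "fail"),
    (4, "finish"), (4, "finish"), (4, "kill"),
]

# (priority, event) -> number of occurrences, i.e., the bars of Figure 2.
by_priority = Counter(ending)

# Eviction counts per priority; in the trace these concentrate on priority 0.
evictions = Counter(p for p, ev in ending if ev == "evict")
```

In this toy data, as in the trace, evictions appear only at the lowest priority, while finish events cluster at priority 4.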
A non-negligible percentage of scheduled tasks is terminated due to fail and eviction. In contrast to the root cause of fail, i.e., some unknown task-internal error, evictions are highly subject to the scheduler's decisions: evicted tasks could potentially complete their execution successfully, but are descheduled according to a given policy. We also observe that tasks experiencing unsuccessful events have lower chances to eventually terminate their execution correctly, i.e., only 38.9%, 1.85% and 1.78% of tasks with at least one eviction, fail and kill, respectively, also experience a finish event during their life. When considering unique tasks, roughly 68% of tasks are terminated successfully. To better highlight the impact of priority on task termination, we break down all scheduling and ending events by priority and summarize them in Figure 2. One can see how scheduled tasks are distributed across different priorities, i.e., most scheduled tasks have priority 0, 1, or 4. Moreover, priority 0 has the highest number of kill, eviction, and fail events, while priority 4 has the highest number of successfully completed tasks. When comparing the distributions of the four ending events, one can observe that evictions concentrate on a specific priority, i.e., the lowest

one, whereas the other events do not show such a trend. In combination with the previous result, we conclude that eviction happens repetitively on a subset of tasks of the lowest priority.

Figure 3: Cumulative distribution function (CDF) of eviction and fail events for evicted and failed tasks, respectively. For the sake of clarity, the x axis is truncated.

Figure 4: Distribution of time consumed by eviction and finish events, breakdown by priority. [(a) Eviction; (b) Finish. Stacked resubmission, queue and running times per priority 0-11.]

3.1 Follow-up Eviction and Fail

Having recognized the repetitive trend of eviction and fail on single tasks, we are interested in discovering the distribution of follow-up eviction and fail within single tasks. To this end, we select tasks experiencing one or more eviction (resp. fail) events, and compute in Figure 3 the cumulative distribution function (CDF) of the number of evictions (resp. failures) that occur to tasks before leaving the system. Note that, according to the official documentation, the maximum number of reschedulings of the same task varies between jobs; however, this value is not publicly known. As a result, after multiple eviction and fail events, tasks can be terminated without success. From the starting point of the CDFs, we can observe that more than 53% of evicted tasks have subsequent evictions, i.e., are evicted at least twice, and more than 55% of failed tasks experience multiple failures during their life. The remarkable presence of follow-up evictions and failures shows a very strong dependence of eviction and fail on a small subset of tasks. Since these tasks experience successful termination only rarely (as shown previously in this section), the imposition of a low maximum rescheduling threshold could avoid wasting resources and save time.
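The starting point of each CDF, i.e., the fraction of affected tasks that experience the event at least twice, reduces to a per-task count; a minimal sketch on toy task ids:

```python
from collections import Counter

def followup_fraction(event_task_ids):
    """event_task_ids: one entry per eviction (or fail) event, holding the id
    of the affected task. Returns the fraction of affected tasks that
    experience the event at least twice."""
    per_task = Counter(event_task_ids)
    repeated = sum(1 for c in per_task.values() if c >= 2)
    return repeated / len(per_task)

# tasks 'a' and 'b' are re-evicted, 'c' is evicted only once
print(followup_fraction(['a', 'a', 'b', 'b', 'b', 'c']))  # -> 2/3 = 0.666...
```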
While failures are not dependent on the scheduler, the eviction trend can be explained by a deeper understanding of the eviction policy adopted by the underlying priority scheduling. We deepen this topic in Sections 4 and 5.

4. ANALYSIS OF EVICTION

In this section, we focus our study on eviction events occurring in the system. Our objectives in this analysis are to: 1) quantify the performance impact of eviction events, in terms of wasted time and resources, and 2) determine whether eviction events show temporal dependences, either among themselves or with other types of unsuccessful events. First, we compare resources consumed by eviction and finish events in Section 4.1. Secondly, we compute temporal correlation functions and analyze dependences at different levels of granularity, i.e., cluster and task level, in Section 4.2.

4.1 Resource Waste

Here, we quantitatively characterize the resources wasted due to eviction events, in terms of execution time and physical resources, in contrast to resources consumed by finish events. Particularly, we are interested in how resources are distributed across different task priorities. We first present results for wasted time in Section 4.1.1, followed by wasted physical resources, i.e., CPU, memory and disk, in Section 4.1.2.

4.1.1 Wasted Time

Tasks change several states during their life, according to Section 2.1. Based on them, we can divide the time consumed by tasks in the system into three different intervals, according to the following categorization: 1) resubmission time: the time interval from the end of the previous task execution to the current submission, i.e., the time the task is in the unsub state; 2) queue time: the time interval from submission to scheduling, i.e., the time the task is in the pend state; 3) running time: the time interval from scheduling to the ending event, i.e., the time the task is in the run state. Time consumed by an evicted task is essentially wasted, as it leads to no useful results. We summarize the total amount of wasted time for eviction, and the amount of time consumed by finish events, in Figure 4(a) and (b), respectively.
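The three intervals can be computed directly from a task's transition timestamps; a minimal sketch with hypothetical field names (timestamps in seconds):

```python
def split_times(prev_end, submit, schedule, end):
    """Decompose one task execution into the three intervals defined above:
    resubmission (unsub state), queue (pend state) and running (run state)."""
    return {'resubmission': submit - prev_end,
            'queue': schedule - submit,
            'running': end - schedule}

t = split_times(prev_end=100, submit=130, schedule=150, end=600)
print(t)  # -> {'resubmission': 30, 'queue': 20, 'running': 450}
```

For a task's first submission, the resubmission interval is simply absent.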
In each subfigure, we separate time following the previous categorization and by task priority. Scheduling events terminating with eviction spend most of their time (91.48%) in the run state, while queue time and resubmission time account for only 7.89% and 0.63% of the total wasted

Figure 5: Resource demand for eviction and finish events, breakdown by priority. [(a) Eviction; (b) Finish. CPU, RAM and DISK demand (RES·s) per priority 0-11.]

time, respectively. Similarly, the running time for finish events is considerably high, i.e., 80.86%, while queue and resubmission times amount to 16.12% and 3.02%, respectively. Overall, tasks do not spend significant time in queue. As shown in Figure 4(a), priority 0 contributes 82.56% of the wasted time for eviction, while other priorities do not experience such a high rate of eviction and result in far lower wasted times. On the contrary, successful tasks show a more balanced distribution of time across priorities 0, 1, and 4, with the latter being the dominant one. The mean running time for eviction events is 68.92 minutes, while that for finish is equal to 25.55 minutes. An overall look at Figure 4 easily discloses the predominance of resources wasted by eviction events with respect to the time consumed by successful tasks: the time used by eviction events is 1.23x higher than that spent by finished task executions.

4.1.2 Wasted Resource Demand

Users of the cluster are required to specify a given amount of requested resources for every submitted task. These resources are allocated to tasks when they are scheduled on machines. To quantify the physical resources wasted by eviction events, we introduce the metric of resource demand, defined as the product of requested resources and running time (defined in Section 4.1.1). We look at three types of resources, namely CPU, RAM, and DISK. As resources are normalized between [0, 1], the absolute value of resource demand can also be interpreted as the amount of resources supplied by the biggest machines in the cluster. We define the unit of measurement of resource demand as RES·s. For example, 5 RAM·s means: the amount of RAM supplied for 5 seconds by the machine equipped with the largest amount of memory in the system.
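Aggregating resource demand per priority is then a single pass over the task records; a sketch under an assumed (priority, requested resource, running time) layout:

```python
def resource_demand(tasks):
    """tasks: iterable of (priority, requested, running_seconds) records for
    one resource type. Sums requested * running time per priority, in RES*s."""
    demand = {}
    for prio, requested, seconds in tasks:
        demand[prio] = demand.get(prio, 0.0) + requested * seconds
    return demand

# e.g. 0.5 RAM requested for 100 s contributes 50 RAM*s to priority 0
print(resource_demand([(0, 0.5, 100), (0, 0.2, 50), (4, 1.0, 30)]))
# -> {0: 60.0, 4: 30.0}
```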
Figure 5 summarizes the resource demand for eviction and finish events, further categorized by priority. Overall, CPU is the most demanded resource in all priorities, followed by RAM, while the DISK impact is negligible for all the events. Resource demand for successful tasks is mostly focused on priority 4, in accordance with the high number of finish events in this priority. On the contrary, resources wasted by eviction events focus mostly on priority 0. In terms of absolute values, the resource demand of eviction and finish events is almost the same. This means that the scheduler allocates a high amount of resources to tasks that will soon be terminated without reaching completion. Allocating resources to evicted tasks not only wastes machine resources, but can also cause the descheduling of other tasks, due to task congestion. In summary, eviction events request about the same quantity of resources as successful tasks, and consume even more machine time for useless computations. Our analysis highlights the large amount of resources wasted by eviction events, resulting in the impairment of the entire cluster performance.

4.2 Time Dependency

Throughout this section, we determine whether eviction events show some dependences, either among themselves or on other types of unsuccessful executions. Particularly, we aim to identify temporal correlations between eviction, fail, and kill events. We first study the temporal dependency among eviction events through the analysis of their time series and the corresponding autocorrelation function (ACF) in Section 4.2.1. Then, we focus on the dependences between different types of unsuccessful executions, by considering two different granularities, i.e., cluster and task levels, in Section 4.2.2.

4.2.1 Dependency among Eviction Events

In this section, we focus on the time dependency among eviction events. In particular, we aim to find out whether an eviction event increases or decreases the chances of experiencing a subsequent eviction in the near future.
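For reference, the sample ACF over hourly counts used in this section can be computed as follows; a pure-Python sketch on toy data, not the trace, with the 95% bound taken as the usual ±1.96/√n approximation:

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation of an hourly count series, lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [sum((series[t] - mean) * (series[t + lag] - mean)
                for t in range(n - lag)) / var
            for lag in range(max_lag + 1)]

counts = [3, 5, 4, 6, 3, 5, 4, 6, 3, 5, 4, 6]  # toy hourly counts, period 4
rho = acf(counts, max_lag=4)
bound = 1.96 / math.sqrt(len(counts))          # 95% confidence bound
print(round(rho[0], 2), rho[4] > bound)        # -> 1.0 True
```

A lag whose coefficient exceeds the bound is significant at the 5% level, which is the criterion applied to Figure 6(b).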
To such an end, we present the time series of eviction events in Figure 6(a), and study its corresponding autocorrelation function (ACF) in Figure 6(b), based on the number of events per hour. We add the 95% confidence intervals to contrast the significance of the ACF values in Figure 6(b). From the ACF, we can observe that there are strong time dependences in the first few hours, indicating that the frequencies of events tend to be similar in adjacent hours. Eviction events show a very strong ACF pattern, indicated by the large number of lags exceeding the 95% confidence intervals. Judging from the shapes of the ACF, we propose that Moving Average models are appropriate to describe the time dependency of eviction. Overall, eviction shows a strong repetitive pattern in the order of a few hours after an eviction event.

4.2.2 Dependency between Unsuccessful Events

Here, we study the temporal dependency of eviction on fail and kill events. During our analysis, we focus on

Figure 6: Time series and autocorrelation function (ACF) of eviction events. [(a) Time series plot of eviction events; the horizontal line marks the mean number of events over the whole traces and is equal to 8426 evictions/h. (b) ACF of eviction events; horizontal lines indicate 95% confidence bounds; we show only time lags ranging from 0 to 18 hours.]

Figure 7: Cross-correlation functions (CCFs) between task ending events: (a) eviction and fail; (b) eviction and kill. Horizontal lines indicate 95% confidence bounds.

two different levels, i.e., cluster and task. At the cluster level, we analyze all events of the same type through aggregate statistics, such as the cross-correlation function (CCF). At the task level, we zoom into events of the same task, studying how unsuccessful events are distributed within single tasks.

Cluster Level Analysis. In the effort to find out the temporal dependency between unsuccessful executions at the cluster level, we compute the cross-correlation functions (CCFs) of eviction with fail and kill events, and show them in Figure 7(a) and (b), respectively, with hourly time lags. Through them, we aim to study how fluctuations in the termination rate of one type of event impact the other ending event. When considering only significant correlations, i.e., values beyond the 5% significance limits, the behavior is very different among CCFs. Figure 7(a) exhibits almost only positive values of significant correlations, indicating that there exists an increase (resp. decrease) in the termination rate of one event after the increase (resp. decrease) of the other one. As Figure 7(a) shows positive correlations mostly in the fourth quadrant, i.e., where time lags are negative, fluctuations in the eviction rate are followed by corresponding fluctuations in the fail workload.
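The cluster-level CCF is the sample correlation between two hourly series at a (possibly negative) lag; a self-contained sketch on toy counts, not trace data:

```python
import math

def ccf(x, y, lag):
    """Sample cross-correlation corr(x[t], y[t+lag]) of two equal-length
    hourly count series; negative lags test the reverse direction."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    ts = range(max(0, -lag), min(n, n - lag))
    return sum((x[t] - mx) * (y[t + lag] - my) for t in ts) / (sx * sy)

evic = [1, 4, 2, 8, 3, 9, 2, 7]
fail = [0, 1, 4, 2, 8, 3, 9, 2]      # same fluctuations, one hour later
print(round(ccf(evic, fail, 1), 2))  # strongest correlation at lag +1
```

In this toy example the fail series lags the eviction series by one hour, so the CCF peaks at lag +1, the same kind of asymmetry read off Figure 7(a).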
On the contrary, the balanced presence at different time lags of both positive and negative correlations in Figure 7(b) suggests independence between eviction and kill termination rates, unlike the previous case.

Task Level Analysis. When considering task-level dependences among different types of events, eviction and fail do not appear to be correlated, i.e., only 13.72% of evicted tasks experience at least one failure during their life. This percentage shows that eviction and fail affect different subsets of tasks. Differently, 57.89% of evicted tasks are also killed at least once, resulting in a more evident correlation. We also consider the ordering among unsuccessful executions of the same task: among all executions, 99.42% of evictions are followed by at least one kill, while the opposite relationship, where kill events are followed by at least one eviction, takes place in only 4.6% of killed executions. In the computation of the previous values, we consider only tasks showing both types of events at least once. On the contrary, the ordering between fail and eviction affecting the same task is less evident: 70.26% of eviction events are preceded by at least one fail, and the opposite holds for 88.87% of cases. These values, both high and similar, confirm our previous intuition that these two events are independent at the task level.

5. KICK-IN VS KICKED-OUT ANALYSIS

In this section, we deepen our knowledge of the underlying mechanisms of priority scheduling, analyzing the characteristics of tasks and machines involved in the eviction process. An eviction event consists of one or more tasks being removed from execution in order to allow other tasks of higher

priority to run. To better mark the different roles of tasks in an eviction event, we name kick-in the task that preempts lower priority tasks, while we use the term kicked-out for the preempted ones.

Figure 8: Frequency and intensity of kick-in and kicked-out tasks. [(a) Percentage of kick-in and kicked-out tasks per priority. (b) Average number of kicked-out tasks per kick-in, by kick-in priority; vertical bars mark the standard deviation for each priority.]

In order to understand the basic principles behind priority scheduling in this cluster, we analyze the impact of priority and system load on kick-in and kicked-out tasks, in terms of frequency and intensity. Moreover, we look at machine status before and after eviction events, to determine to what extent kick-in tasks benefit from priority scheduling, and whether machines suffer from overcommitment of resources after an eviction event.

5.1 Task Priority

Our analysis of task priority focuses on two different aspects: how frequently kick-ins and kicked-outs happen, and how many tasks are kicked out for each kick-in across priorities. Due to the format of the trace, we adopt a two-step approach, i.e., first we identify all the kick-ins for every kicked-out task, then we reversely map kicked-outs to every kick-in. We conduct the first step as follows: for every kicked-out task, we consider as a corresponding kick-in every task that 1) has higher priority, and 2) is scheduled on the same machine within 3 seconds of the eviction timestamp. We summarize the percentages of kick-in/kicked-out events occurring at different priorities and the average number of kick-outs due to a single kick-in task across priorities in Figure 8(a) and 8(b), respectively. From Figure 8(a), one can straightforwardly observe that priority 4 has the highest percentage of kick-ins and priority 0 has the highest percentage of kicked-outs, i.e., roughly 96.3% of evicted tasks have priority 0.
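The first step of this matching can be sketched as follows (hypothetical record layout; the 3-second window is the one stated above):

```python
def find_kickins(evicted, schedules, window=3):
    """evicted: (machine, priority, eviction_ts) of one kicked-out task.
    schedules: iterable of (task_id, machine, priority, schedule_ts).
    Returns ids of candidate kick-ins: higher priority, same machine,
    scheduled within `window` seconds of the eviction timestamp."""
    machine, prio, ts = evicted
    return [tid for tid, m, p, s in schedules
            if m == machine and p > prio and abs(s - ts) <= window]

# eviction of a priority-0 task on machine 7 at t=1000
print(find_kickins((7, 0, 1000),
                   [('a', 7, 4, 1001),     # match
                    ('b', 7, 0, 1002),     # same priority -> no
                    ('c', 9, 4, 1001),     # other machine -> no
                    ('d', 7, 4, 1020)]))   # outside window -> no
# -> ['a']
```

The second step, mapping kicked-outs back to each kick-in, is the inverse grouping of these pairs.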
This implies that arrivals of priority 4 tasks preempt priority 0 ones. Indeed, the percentage of priority 4 tasks kicking out priority 0 ones is roughly 70% over all kick-in/kicked-out pairs. Moreover, nearly 100% (99.9%) of kicked-out tasks belong to the lowest two priorities. These findings highlight how priority plays a key role in task eviction, as tasks of low priority clearly have the highest chance of being evicted when preemption is necessary. Our results explain the huge impact of eviction on wasted time and resource demand observed at low priorities, as detailed in Section 4.1. Figure 8(b) depicts the average number of kicked-outs per kick-in, broken down by kick-in priority. We separate kicked-out tasks into two categories: all priorities and priority 0. In addition to the average values, we also include the standard deviations.

Figure 9: Probability distribution function (PDF) of resource utilization and reservation level at eviction time. [Curves for used CPU, used RAM, requested CPU and requested RAM over the utilization/reservation level [%].]

In conjunction with the previous observation, the figure shows not only that priority 0 is heavily evicted, but also that this happens independently of the kick-in priority. Overall, each kick-in task causes on average about 1.5 kicked-outs. Priority seems to affect this number, as high priorities can evict up to 3.1 tasks per eviction. This is supported by the coefficient of correlation between priority and average kicked-outs per kick-in being equal to 0.5347.

5.2 System Load

Given the strong influence of task priority on the eviction process, as shown by our previous analysis, here we verify that eviction happens mainly to claim resources for higher priority tasks, by showing that there exists a positive dependency between the frequency of kicked-out tasks and system load.
To such an end, we define and use two metrics to quantify system load, i.e., (1) the resource utilization level, defined as the sum of the average amount of resources used by kicked-out and co-executed tasks, as tracked by the traces at eviction time, divided by the equipped resources of the machine, and (2) the resource reservation level, defined as the sum of the resources requested by kicked-out and co-executed tasks at eviction time divided by the machine's equipped resources. Figure 9 shows the probability distribution function (PDF) of resource utilization and reservation level at eviction time. Note that resource overcommitment is allowed by the scheduler, i.e., resources may be allocated beyond the machine capacity. The figure highlights how the majority of eviction

takes place at a high level of resource reservation, most of them even with values above 100%. This is true for 49% of eviction events when looking at RAM, and 68% of them when looking at CPU. The tails of the PDFs show that the reservation level of co-executed tasks can reach up to 238% for memory and 160% for CPU. As for the actual resource utilization, no strong trend is observed between eviction frequency and machine usage. Our analysis confirms that priority-driven eviction tends to happen on machines with a high CPU and memory reservation level.

5.3 Resource Demand and Supply

Motivated by the observations in the previous subsection, here we try to find out whether the scheduler evicts a sufficient number of kicked-outs to accommodate the resources requested by kick-ins. For each kick-in, we compare its requested resources with the resources available on the machine where it was scheduled. To denote the relationship between the supply and demand of resources for kick-ins, we introduce the metric of Resource Equilibrium Index (REI), defined as follows:

    REI = (RES_free + Σ_{i=1}^{N} RES_kicked-out,i) / RES_kick-in    (1)

where RES_free denotes the amount of unused resources on the machine, i.e., not allocated to any tasks, RES_kick-in refers to the amount of resources requested by the kick-in task, and RES_kicked-out,i is the amount of resources requested by the i-th kicked-out task out of N. We compute a value of REI for each kick-in task. When REI is greater than one, a kick-in task acquires sufficient resources; on the contrary, kick-ins with REI ranging in [0, 1) do not obtain all the resources they requested.

Figure 10: Cumulative distribution function (CDF) of the Resource Equilibrium Index (REI). The two vertical lines mark values of REI equal to 0 and 1. For the sake of clarity, the x axis is truncated.
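Equation (1) translates directly into code; a minimal sketch with illustrative numbers, not trace values:

```python
def rei(res_free, res_kicked_out, res_kickin):
    """Resource Equilibrium Index of Eq. (1): supply over demand for one
    kick-in task. res_free may be negative when the machine is already
    overcommitted before the eviction, so REI can be negative too."""
    return (res_free + sum(res_kicked_out)) / res_kickin

print(round(rei(0.1, [0.2, 0.3], 0.4), 2))  # -> 1.5: kick-in is fully served
print(round(rei(-0.2, [0.1], 0.4), 2))      # -> -0.25: machine stays overcommitted
```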
Due to the fact that RES_free can be negative, meaning that all the machine resources have already been allocated to tasks before the eviction, the value of REI can be negative as well, indicating that the supply of resources from the kicked-outs is not enough to leave the overcommitment state. Figure 10 reports the CDF of REI for CPU and RAM. A similar analysis on mass storage is not possible, as the traces do not report the amount of DISK equipped in machines. On the one hand, there is a non-negligible percentage of kick-ins that experience a shortage of CPU and memory. One can observe that 28% of kick-in tasks experience CPU shortage (REI < 1), causing resource overcommitment, while 12% of them are scheduled on machines where resources are already overcommitted (REI < 0). As for memory, a stronger trend can be observed, i.e., 36% of kick-in tasks have a REI value lower than one and 20% of kick-ins are scheduled on machines that suffer from memory overcommitment. On the other hand, the long tails of both CDFs show that the amount of available resources can be very high for kick-ins. For memory, more than 46% of kick-ins are scheduled on machines supplying more than twice the requested RAM, and in almost 11% of cases the surplus is up to 10 times. CPU follows a similar trend, but with a shorter tail, meaning that the surplus of available CPU for kick-ins is lower than that of memory. Our analysis shows that resource overcommitment plays an important role in the underlying eviction policies of priority scheduling, and that an eviction event does not necessarily remove overcommitment of resources from a machine.

6. RELATED STUDIES

Workload characterization is a fundamental step towards designing new scheduling strategies for big data clusters, aiming to improve system efficiency and response times. The main focus of existing characterization studies is on capturing the resource utilization and statistical properties of workloads, using several publicly available traces from major datacenter operators [6, 10, 24].
Those traces provide not only a good understanding of the underlying systems, but also a vehicle to further develop various scheduling policies, via simulation [5, 11, 20, 21] or prototyping [6]. Given the plethora of field studies on hardware [18, 19] and software failures [13, 26], task eviction and the resulting performance impact have been scarcely discussed, with no quantitative analysis provided. We summarize the studies related to our analysis in two directions, i.e., the characterization studies using the same traces, and the priority scheduling policies in distributed systems.

Google Trace Analysis. These Google traces [24] have been analyzed by several prior studies [2, 28], each of which highlights different workload properties, like heterogeneity and dynamicity [15], scheduling classes of jobs [12], and task classification [8]. We underline their similarities and differences with our analysis. In our previous works, we identify priority scheduling as the main cause of eviction, and conduct a characterization analysis on different task priorities [5] and system load [17]. While our prior works focus on workload patterns and average resource demands, in this study we are particularly interested in the resources wasted due to evicted tasks, the dependences among eviction events, and the underlying mechanisms of priority scheduling. Reiss et al. [15] provide one of the first characterization studies on these traces, highlighting the high degree of heterogeneity and dynamicity of the workload. They hint at the correlation between eviction and priority, by checking whether eviction happens within half a second from the scheduling of higher priority tasks on the same machine. However, no precise statistics are presented to back up their statement. Di et al. [7, 8] apply statistical learning techniques to classify tasks, and present a small discussion on successful and unsuccessful events on a single machine only. Liu et al. [12] study the wasted CPU due to unsuccessful events by computing the used CPU per scheduling class on a single day of

the traces. In contrast, our analysis is based on a longer period, considers the whole set of machines, and also includes RAM and DISK. Overall, the aforementioned studies pay little attention to the characteristics of the eviction process and the underlying priority scheduling.

Priority Scheduling. Priority scheduling has been well explored by the commercial [2, 22, 25] and research community [9, 20, 23, 27], due to its prominence in guaranteeing differentiated services. Scheduling policies take priority as input to determine the allocation of resources, the execution order of jobs, and the preemption of lower priority tasks. In most of the related work, priorities are given and remain constant, except in [2]. We highlight the design features of selected systems for various applications. Omega [20] is an ensemble of schedulers for the entire Google cluster that has complete freedom to allocate resources given appropriate priorities. Priority preemption is applied on the entire cluster. Mesos [9] is a two-level scheduler, consisting of multiple local cluster schedulers and one central coordinator, allocating resources to computing frameworks based on max-min fairness or priorities. Moab and Maui [2] apply similar scheduling policies, i.e., higher priority jobs are privileged to request higher amounts of resources and have shorter queueing times, with preemption enabled. Some schedulers for Hadoop implement priority scheduling, for example the Hadoop Capacity Scheduler [25] and the Hadoop Fair Scheduler [22]. The common policy employed by both of them is that higher priorities have advantages in scheduling over lower priorities. Although priority scheduling is widely adopted in today's distributed systems, the theoretical results regarding the impact of priority on response time center on non-distributed environments, in particular on single machines [1], and consider a rather low number of priorities [23].

7. CONCLUSIONS

Our findings can be summarized as follows. The amount of unsuccessful executions is very high, accounting for roughly 74% of scheduling events and affecting 32% of tasks.
Eviction events happen repetitively on a small subset of tasks, impairing their chances of completing their executions correctly. When compared to successful tasks, eviction events consume more machine time (by a factor of 1.23) and the same amount of requested resources, resulting in a non-negligible waste of system resources. Eviction events show strong time dependences within a few hours, and are correlated with other types of unsuccessful events at both cluster and task levels. Task priority has a central role in the distribution of kick-in and kicked-out tasks, resulting in a dominant concentration of eviction events in low priorities. The priority of kick-in tasks affects the number of kicked-outs per kick-in, but does not affect the priority of the evicted tasks. On average, a kick-in task results in 1.5 kicked-out tasks. Priority 4 shows a strong dominance of kick-in tasks, while kicked-out tasks are prevalent in priority 0. The probability of experiencing an eviction event is higher when machine resource reservation is high, while it is not dependent on resource utilization. Finally, kick-in tasks might not obtain enough resources to run after an eviction, leading machines to resource overcommitment. In summary, we conduct an analysis of evicted tasks and priority scheduling in big data systems. Our analysis highlights the huge impact of eviction events on cluster resources and the central role of priority in the eviction mechanisms, and sheds light on the underlying preemption policies in big data priority scheduling.

Acknowledgments

This work has been supported by the Swiss National Science Foundation (project 221 1412) and the EU commission under the FP7 GENiC project (608826).

8. REFERENCES

[1] J. Abate and W. Whitt. Asymptotics for M/G/1 Low-priority Waiting-time Tail Probabilities. Queueing Syst. Theory Appl., 25(1/4):173–233, Jan. 1997.
[2] Adaptive Computing. Ten Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite. http://www.adaptivecomputing.com/wp-content/uploads/collateral/TenReasonsToSwitchFromMauiToMoab212-1-5.pdf.
[3] L. Barroso, J. Dean, and U. Hölzle.
Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 23(2):22–28, Mar. 2003.
[4] R. Birke, L. Y. Chen, and E. Smirni. Data Centers in the Cloud: A Large Scale Performance Study. In IEEE CLOUD, pages 336–343, 2012.
[5] D. Çavdar, A. Rosà, L. Y. Chen, W. Binder, and F. Alagöz. Quantifying the Brown Side of Priority Schedulers: Lessons from Big Clusters. SIGMETRICS Perform. Eval. Rev., 42(3):76–81, Dec. 2014.
[6] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz. Energy Efficiency for Large-scale MapReduce Workloads with Significant Interactive Analysis. In ACM EuroSys, pages 43–56, 2012.
[7] S. Di, D. Kondo, and W. Cirne. Characterization and Comparison of Cloud versus Grid Workloads. In IEEE CLUSTER, pages 230–238, 2012.
[8] S. Di, D. Kondo, and C. Franck. Characterizing Cloud Applications on a Google Data Center. In ICPP, pages 468–473, 2013.
[9] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In USENIX NSDI, pages 295–308, 2011.
[10] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An Analysis of Traces from a Production MapReduce Cluster. In IEEE/ACM CCGrid, pages 94–103, 2010.
[11] M. Lin, L. Zhang, A. Wierman, and J. Tan. Joint optimization of overlapping phases in MapReduce. Perform. Eval., 70(10):720–735, 2013.
[12] Z. Liu and S. Cho. Characterizing Machines and Workloads on a Google Cluster. In SRMPDS, pages 397–403, 2012.
[13] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS, pages 329–339, 2008.
[14] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, Low Latency Scheduling. In ACM SOSP, pages 69–84, 2013.
[15] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In ACM SoCC, pages 7:1–7:13, 2012.

[16] C. Reiss, J. Wilkes, and J. L. Hellerstein. Google cluster-usage traces: format + schema. Technical report, Google Inc. Revised 2012.3.2. http://code.google.com/p/googleclusterdata/wiki/TraceVersion2.
[17] A. Rosà, L. Y. Chen, and W. Binder. Predicting and Mitigating Jobs Failures in Big Data Clusters. In IEEE/ACM CCGrid, 2015.
[18] B. Schroeder and G. A. Gibson. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? In USENIX FAST, pages 1–16, 2007.
[19] B. Schroeder and G. A. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. IEEE Trans. Dependable Sec. Comput., 7(4):337–351, 2010.
[20] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In ACM EuroSys, pages 351–364, 2013.
[21] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters. In ACM SoCC, pages 3:1–3:14, 2011.
[22] The Apache Software Foundation. Hadoop MapReduce Next Generation - Fair Scheduler. http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/FairScheduler.html, Feb 2014.
[23] A. Wierman, T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf. How Many Servers Are Best in a Dual-priority M/PH/K System? Perform. Eval., 63(12):1253–1272, 2006.
[24] J. Wilkes. More Google cluster data. Google research blog. https://code.google.com/p/googleclusterdata/wiki/ClusterData2011_1, Nov 2011.
[25] Yahoo! Capacity Scheduler Guide. http://archive.cloudera.com/cdh/3/hadoop/capacity_scheduler.html, Mar 2013.
[26] D. Yuan, D. Park, P. Huang, Y. Liu, M. Lee, Y. Zhou, and S. Savage. Be Conservative: Enhancing Failure Diagnosis with Proactive Logging. In USENIX OSDI, pages 293–306, 2012.
[27] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. In ACM EuroSys, pages 265–278, 2010.
[28] Q. Zhang, M. F. Zhani, S. Zhang, Q. Zhu, R. Boutaba, and J. L.
Hellerstein. Dynamic Energy-aware Capacity Provisioning for Cloud Computing Environments. In USENIX ICAC, pages 145–154, 2012.

On Energy-aware Allocation and Execution for Batch and Interactive MapReduce

Yijun Ying, EPFL, Station 14, Lausanne, Switzerland, yijun.ying@epfl.ch
Lydia Y. Chen, IBM Research-Zurich Lab, Säumerstrasse 4, Rüschlikon, Switzerland, yic@zurich.ibm.com
Robert Birke, IBM Research-Zurich Lab, Säumerstrasse 4, Rüschlikon, Switzerland, bir@zurich.ibm.com
Gautam Natarajan, Texas A&M University, College Station, TX 77843, USA, gautam@tamu.edu
Cheng Wang, Penn State University, University Park, PA 16802, USA, cxw967@cse.psu.edu

ABSTRACT

The energy-performance optimization of datacenters becomes ever more challenging, due to heterogeneous workloads featuring different performance constraints. In addition to conventional web services, MapReduce presents another important workload class, whose performance highly depends on data availability/locality and shows different degrees of delay sensitivity, such as batch vs. interactive MapReduce. However, current energy optimization solutions are mainly designed for a subset of these workloads and their key features. Here, we present an energy minimization framework, in particular a concave minimization problem, that specifically considers the time variability, data locality, and delay sensitivity of web, batch-, and interactive-MapReduce. We aim to maximize the usage of MapReduce servers by using their spare capacity to run non-MapReduce workloads, while controlling the workload delays through the execution of MapReduce tasks, in particular batch ones. We develop an optimal algorithm with complexity O(T^2) in the case of perfect workload information, T being the length of the time horizon in number of control windows, and derive the structure of the optimal policy for the case of uncertain workload information. Using extensive simulation results, we show that the proposed methodology can efficiently minimize the datacenter energy cost while fulfilling the delay constraints of workloads.

1. INTRODUCTION

Datacenters are standard IT (Information Technology) platforms, which consume a significant amount of energy to host a wide variety of conventional and emerging workloads, such as web services vs.
MapReduce, featuring different performance requirements and workload characteristics. Typically, web services interact with clients, who require stringent response times and thus real-time processing. To guarantee the throughput of large-scale data processing, MapReduce/Hadoop [1] is a simple yet powerful framework to process large amounts of data chunks organized in distributed file systems, e.g., the Hadoop Distributed File System (HDFS). Moreover, with the recent adoption of MapReduce alongside real-time queries [6, 9], MapReduce workloads evolve from traditional throughput-sensitive batch jobs to increasingly delay-sensitive interactive jobs, such as Spark [20] on stream processing, Natjam [8] on batch, and BlinkDB [4] on interactive queries. Energy-aware resource allocation for such a diverse set of workloads is thus ever challenging, and unfortunately existing solutions are often designed for a subset of these three workload types. As web and service workloads show strong time-varying behaviors, dynamic sizing of the datacenter [13], i.e., controlling the number of active servers, has been shown effective in minimizing energy. To better utilize the equipped capacity and benefit from time-varying power supply, consolidating and delaying the execution of data-intensive applications during low-load periods of the web workload can harvest further significant cost savings for datacenters [14, 17]. However, the issues related to the data locality of MapReduce workloads are unfortunately often overlooked or oversimplified. To simultaneously address data availability and energy minimization, dedicated MapReduce clusters leverage two types of control knobs independently, i.e., the fraction of on-time of the entire cluster and the fraction of on-servers at each time period. On one hand, clusters delay the execution of batch MapReduce jobs, such as Google index computation [3] or bank interest calculation, to process them on the entire cluster [11], depending on the energy price or other workload demands.

Copyright is held by author/owner(s).
As such, the maximum degree of data locality is guaranteed, minimizing execution time and energy consumption. On the other hand, motivated by the practice of duplicating data chunks (usually by a factor of three [2]), a few solutions modify the underlying file system so that a fraction of the servers, e.g., one third of the servers [12], are always available and contain one copy of all the data chunks. Nonetheless, unavoidable delay can be incurred in both solutions, and this is not acceptable for interactive MapReduce. BEEMR [6], specially designed for interactive MapReduce systems, shows promising energy savings by serving interactive MapReduce on a subset of servers all the time and batching MapReduce on the entire cluster once a day. However, it is not clear how one can dynamically execute (or delay) the MapReduce workloads on the allocated servers so as to achieve the optimal tradeoff between data locality and energy efficiency. In this paper, we address the question of how to minimize the energy consumption of executing web applications, batch MapReduce, and interactive MapReduce, considering their distinct workload characteristics, i.e., time variability and data locality, and different performance requirements, i.e., throughput vs. delay. The system of interest is composed of web servers and dedicated MapReduce

servers where a distributed file system is deployed. As our aim is to design a non-intrusive solution, i.e., not to modify the underlying file system, we propose to keep all MapReduce servers on at all times so that data availability is ensured. To minimize the energy consumption of the entire system, i.e., the total number of on servers, we try to execute all three types of workloads on MapReduce servers only, so as to size the web cluster as small as possible. In particular, we consider two control variables over discrete windows: delaying the execution of batch MapReduce and allocating a fraction of MapReduce servers to batch and interactive workloads. We employ dynamic programming to derive the optimal decisions, and simulation to evaluate the proposed solutions under various workload scenarios. Formally, we formulate an energy minimization problem over a discrete horizon, constrained on the different degrees of delay of batch and interactive MapReduce: a concave minimization problem. The specific control variables are the number of servers for MapReduce workloads and the amount of batch MapReduce per control window, which thus specifies the amount of batch jobs to be delayed. Assuming perfect knowledge of all three types of workloads, we develop an algorithm which can efficiently achieve the optimal solution with a complexity of O(T^2), where T is the number of control windows. Finally, we build an event-driven simulator to evaluate the proposed algorithm under different workload scenarios, in comparison with simple algorithms that overlook the data locality and delay sensitivity of batch and interactive MapReduce. Our specific contributions can be summarized as follows. To minimize the energy cost of executing delay- and throughput-sensitive applications, we consider important tradeoffs among crucial parameters, i.e., data locality, time variability, and delay sensitivity, of web applications, batch MapReduce, and interactive MapReduce.
We are able to dynamically and optimally determine the fraction of batch MapReduce workloads to be delayed through the allocation of MapReduce servers, via analytical constructions as well as event-driven simulations under various workload scenarios. The outline of this work is as follows. Section 2 provides an overview of the system and a formal definition of the problem statement. The algorithm for perfect workload information is detailed in Section 3. Section 4 presents the experimental setup and simulation results. Section 5 compares related work, followed by summaries and conclusions in Section 6.

2. SYSTEM AND PROBLEM STATEMENT
In this section, we first describe the system and assumptions considered by this study and formally present the problem statement.

2.1 System
The system hosts three types of workloads: web applications, batch MapReduce, and interactive MapReduce, characterized by different degrees of data locality, time variability, and delay sensitivity. In terms of data locality, both types of MapReduce workloads require data access to the distributed file system hosted across the MapReduce servers. Web applications and interactive MapReduce are very sensitive to delay and have priority to be executed immediately with sufficient server capacity. On the contrary, batch MapReduce is latency insensitive and its execution can be delayed. As for time variability, web service workloads are known to have regular time-varying patterns. In this paper we consider a time-slotted system model with control windows of length τ. Decisions are made at the beginning of each control window. We use λ_{i,t} to denote the task arrival rate of type-i workloads, i ∈ {w, b, c}, at time period t.

Figure 1: System schematic.
On the high level, the controller takes as input the workload characteristics and outputs dynamic server allocation decisions for the three types of workloads; the scheduler then dispatches tasks according to different scheduling policies onto the lower level.

Here w, b, and c represent web, batch MapReduce, and interactive MapReduce, respectively. We assume the task arrival rates of both MapReduce types are drawn from geometric distributions. Moreover, we assume the average execution rate per task is µ_i, i ∈ {w, b, c}. As for the performance constraints, web workloads need to satisfy certain response time targets, whereas batch and interactive MapReduce tasks just need to be completed within certain periods of time, such as a day vs. a control window. As web workloads have no dependency on the file system, we assume web workloads can be executed on both web and MapReduce servers. We further assume that the interference between CPU-intensive web applications and IO-intensive HDFS is negligible. Therefore, the number of active web servers can be dynamically sized depending on the workload demands. On the contrary, to ensure 100% data availability and achieve energy savings, we propose that all MapReduce servers are kept on and allocated to the three types of workloads, when deemed appropriate. Moreover, due to the concern of interference between web and MapReduce workloads [21, 22], we do not co-execute them on the same server. Essentially, there are m_t and m'_t servers dedicated to MapReduce and web workloads, respectively. Among the m_t servers, interactive MapReduce tasks have higher priority than batch MapReduce tasks. The decisions on m_t and m'_t are based on the workload demands and performance constraints. Consequently, the total number of active servers is max{m_t + m'_t, M}. One can thus write the energy cost for the entire cluster over the control windows {1, ..., t, ..., T} as

    Σ_{t=1}^{T} K · max{m_t + m'_t, M},    (1)

where M is the total number of servers in the MapReduce cluster, T the time horizon, and K the unit energy cost per on/active server.
Note that we assume the energy cost of off/inactive servers to be zero.

Scheduler and Controller. When different tasks arrive at the system, they are immediately sent to the scheduler, which can employ different scheduling policies using different queue structures. Tasks are then scheduled on servers according to their types. Another important system component is the controller, which implements the server allocation algorithm across the three types of workloads in discrete control windows. The high-level schematic is summarized in Figure 1.
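As a concrete reading of Equation (1), the sketch below evaluates the energy cost of a candidate allocation over the horizon. The function name and the numeric defaults (M = 100 servers, a unit cost K) are illustrative assumptions, not part of the paper.

```python
def energy_cost(m_mr, m_web, M=100, K=0.25):
    """Eq. (1): per-window cost K * max(m_t + m'_t, M), summed over the horizon.
    The max term reflects that all M MapReduce servers stay on even when idle;
    M=100 and K=0.25 (energy units per server-window) are illustrative values."""
    return sum(K * max(mt + wt, M) for mt, wt in zip(m_mr, m_web))
```

A window whose total demand fits inside the MapReduce cluster (m_t + m'_t ≤ M) is billed at the floor K·M, which is exactly why consolidating work onto the always-on cluster makes the web cluster shrinkable.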

2.2 Assumption on Data Locality
Data locality means that a task can find a copy of its data on the local execution machine instead of fetching the data from a remote machine. Denote the average execution time of a task using a local copy by 1/µ. The execution time when using a remote copy is much higher; here, we assume it is inflated by a slowdown factor α, i.e., it equals α/µ. Note that we assume batch and interactive MapReduce tasks have the same average execution time, i.e., µ_b = µ_c = µ, since in MapReduce large jobs are divided into small tasks and each task deals with a constant amount of data, e.g., 64 MB by default in Hadoop. To estimate the throughput of MapReduce, it is very important to know the probability of tasks being executed with local data, denoted as P_l. Following the common practice of data replication in MapReduce clusters, we assume each data chunk has γ replicas, 1 ≤ γ ≤ M. Given an allocation of m MapReduce servers, the probability of finding at least one local copy within the m servers can be computed as one minus the probability that no local copy is found, i.e.:

    P_l(m) = 1 − C(M − m, γ) / C(M, γ),

where C(n, k) denotes the binomial coefficient. P_l(m) is always equal to 1 when γ > M − m, because then one can find at least one replica among any m servers. As a function of m, P_l(m) can change in every control window, as the number of allocated MapReduce servers, m_t, changes in time with the workload demands. We further note that P_l(m) is an upper bound, assuming the underlying scheduler is always able to schedule a task on a server holding a local copy. Assuming each control window has length τ, one can estimate the MapReduce throughput (in units of tasks per control window) of a single server as

    X(m) = τ / ( P_l(m)/µ + (1 − P_l(m)) · α/µ ).    (2)

Note that as P_l(m) is an upper bound, the throughput presented here is also the optimal case, assuming optimal scheduling. Putting it all together, the optimal MapReduce throughput of the entire system is m · X(m).
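The locality probability and the per-server throughput of Equation (2) can be evaluated directly; a minimal sketch, with M = 100, γ = 3, α = 4, and a 30-minute window (τ = 1800 s, µ = 1 task/s) as illustrative defaults:

```python
from math import comb

def p_local(m, M=100, gamma=3):
    """P_l(m) = 1 - C(M-m, gamma)/C(M, gamma): probability that at least one
    of the gamma replicas of a chunk lies on the m allocated servers."""
    if gamma > M - m:  # any m servers must then contain a replica
        return 1.0
    return 1.0 - comb(M - m, gamma) / comb(M, gamma)

def x_throughput(m, tau=1800.0, mu=1.0, alpha=4.0, M=100, gamma=3):
    """Eq. (2): per-server tasks per window; the expected service time mixes
    local (1/mu) and remote (alpha/mu) execution weighted by P_l(m)."""
    pl = p_local(m, M, gamma)
    return tau / (pl / mu + (1.0 - pl) * alpha / mu)

def f_system(m, **kw):
    """Optimal system throughput f(m) = m * X(m) of m allocated servers."""
    return m * x_throughput(m, **kw)
```

Both X(m) and f(m) grow with m because locality improves with larger allocations, which is the source of the shape exploited by the problem transformation later.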
We provide a numerical example to illustrate how the MapReduce throughput changes with the number of allocated MapReduce servers under different replication factors γ in Figure 2, where the cluster has 100 nodes. From the figure one can see that the throughput of one server, i.e., X(m), is increasing in m, since the data locality improves as m increases. However, the real throughput is smaller than the optimal case and depends on the scheduler. In Section 4, we use simulations to show that the optimal throughput is achievable. Thus, in the analytical models of this paper, we denote the throughput of m servers as f(m) = m · X(m).

Figure 2: Normalized optimal throughput vs. number of allocated MapReduce servers, with different replication factors γ = 1, 3, 5, combined with remote slowdown α = 4 and no remote slowdown α = 1.

2.3 Problem Statement
Our objective is to minimize the energy cost of the entire system, which consists of all MapReduce servers and a fraction of the web servers, as shown in Equation (1). The two variables are m_t and m'_t, which need to fulfill the MapReduce deadlines and the web target response times. To capture the web target response times, we resort to a simple M/M/m'_t queueing model [5], i.e., we find the minimum m'_t such that the response time R_{w,t} under arrival rate λ_{w,t} and service rate µ_w is below the target R̄:

    R_{w,t} = C(m'_t, λ_{w,t}/µ_w) / (m'_t · µ_w − λ_{w,t}) + 1/µ_w ≤ R̄,    (3)

where C is the Erlang C formula. Since we can derive the threshold for m'_t from the analytical model, we henceforth consider m'_t as a function of λ_{w,t}. As for interactive MapReduce, we require that the MapReduce servers allocated in each control window are enough to serve the incoming interactive MapReduce workload, i.e., f(m_t) ≥ τ · λ_{c,t}. Batch MapReduce tasks, in turn, can be delayed by multiple control windows.
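The minimum web allocation m'_t implied by constraint (3) can be found by sweeping m upward from the smallest stable pool; a sketch, where the target R̄, λ_{w,t}, and µ_w are the quantities defined above:

```python
from math import factorial

def erlang_c(m, a):
    """Erlang C: probability that an arrival waits in an M/M/m queue
    with offered load a = lambda/mu (requires a < m for stability)."""
    if a >= m:
        return 1.0
    rho = a / m
    tail = a**m / (factorial(m) * (1.0 - rho))
    return tail / (sum(a**k / factorial(k) for k in range(m)) + tail)

def min_web_servers(lam, mu, r_target):
    """Smallest m' satisfying constraint (3):
    C(m', lam/mu)/(m'*mu - lam) + 1/mu <= r_target."""
    m = int(lam / mu) + 1  # smallest pool with lam < m*mu
    while erlang_c(m, lam / mu) / (m * mu - lam) + 1.0 / mu > r_target:
        m += 1
    return m
```

This is the sense in which m'_t becomes a function of λ_{w,t}: for a fixed target R̄, the constraint pins down one server count per arrival rate.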
Denote by r_t the aggregate residual batch tasks at the beginning of control window t. We have r_1 = 0 and

    r_{t+1} = ( r_t + τ·λ_{b,t} − ( f(m_t) − τ·λ_{c,t} )^+ )^+ ,  t = 1, ..., T,    (4)

which means that interactive tasks (τ·λ_{c,t}) have higher priority to be served, and batch tasks (r_t + τ·λ_{b,t}) take the remaining capacity of f(m_t). In our problem, we consider 30 minutes as one control window and one day as the horizon of our problem, and we set the end of the day as the deadline for all batch MapReduce tasks. The formal presentation of our problem is:

    min_{m_t}  Σ_{t=1}^{T} K · max{M, m_t + m'_t},
    s.t.  f(m_t) ≥ τ·λ_{c,t},  ∀t,
          r_{T+1} = 0,
          m_t ≤ M,  ∀t,
          constraints (3), (4).    (5)

In each control window, the interactive MapReduce workload determines the minimum m_t, while the flexibility of m_t comes from the delayable batch MapReduce tasks.

3. PERFECT WORKLOAD INFORMATION
We first solve the problem assuming we have perfect knowledge of the future, namely, we know all the parameters (λ_{b,t}, λ_{c,t}, λ_{w,t}) at the beginning of time. Since the data locality improves as m increases, we assume f(m) to be a convex and strictly increasing function.¹ Further, we assume the MapReduce cluster has enough servers to finish all the batch tasks within one day. Otherwise we should simply provision more servers to serve the workload, which is not the main focus of our work. This optimization problem can be transformed into a concave minimization problem [10, 16], that is, the minimization of a concave objective function. In concave minimization problems, the local minima lie on the boundary of the feasible region. Since there can

¹Notice that with γ ≥ 3, the optimal throughput function is nonconvex. However, it can still be approximated by a convex function quite well, since its second-order derivative keeps increasing over most of the feasible region when m is not close to M.
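Recursion (4) also gives a direct feasibility check for a candidate allocation m_1, ..., m_T; a sketch with a generic throughput function f passed in:

```python
def residual_batch(m_alloc, lam_b, lam_c, f, tau=1.0):
    """Eq. (4): backlog r_{t+1} = (r_t + tau*lam_b - (f(m) - tau*lam_c)^+)^+.
    Interactive tasks are served first; batch takes the leftover capacity.
    Returns r_{T+1}; the horizon is feasible iff this is 0."""
    r = 0.0
    for m, lb, lc in zip(m_alloc, lam_b, lam_c):
        spare = max(f(m) - tau * lc, 0.0)  # capacity left after interactive
        r = max(r + tau * lb - spare, 0.0)
    return r
```

For instance, with a toy linear throughput f(m) = 10·m, an allocation whose spare capacity matches the batch load window by window drives the backlog to zero, while an overload leaves a positive residual.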

exist an exponential number of local minima as the dimension grows, traditional deterministic or randomized non-linear programming solvers cannot handle this kind of problem, even if we only want a sub-optimal numerical solution. Concave minimization problems belong to the hard global optimization problems; it has been proved that most concave minimization problems are NP-hard. However, our problem has special structure, such that we can solve it optimally using a greedy algorithm with complexity O(T^2).

3.1 Problem Transformation
Since no server in the MapReduce cluster can be switched off at any time, the number of active servers cannot be smaller than M. Thus, the following lemma holds.

LEMMA 1. In Problem (5), if (m*_1, ..., m*_T) is an optimal solution, then the solution (max{m*_1, M − m'_1}, ..., max{m*_T, M − m'_T}) is feasible, and also optimal.²

The problem can be transformed into the following problem:

    min_{m_t}  Σ_{t=1}^{T} K · (m_t + m'_t),
    s.t.  f(m_t) ≥ τ·λ_{c,t},  ∀t,
          r_{T+1} = 0,
          (M − m'_t)^+ ≤ m_t ≤ M,  ∀t,
          constraints (3), (4).    (6)

The following lemma follows directly from Lemma 1.

LEMMA 2. Any optimal solution of Problem (6) is also an optimal solution of Problem (5).

We define the inverse function of f(m) as g, which is a concave and strictly increasing function. By replacing m_t, f(m_t), (M − m'_t)^+, and M with g(a_t), a_t, g(l_t), and g(A), respectively, we rephrase Problem (6) as the following concave minimization problem:

    min_{a_t}  Σ_{t=1}^{T} K · (g(a_t) + m'_t),
    s.t.  a_t ≥ τ·λ_{c,t},  ∀t,
          r_{T+1} = 0,
          l_t ≤ a_t ≤ A,  ∀t,
          constraints (3), (4).    (7)

3.2 Algorithm
We propose Algorithm 1, a two-stage greedy algorithm, to solve the optimization problem. In the first stage, it greedily allocates batch workload b_u into some later window v to achieve the throughput lower bound l_v. In the second stage, it goes backwards over time while greedily allocating the remaining batch workload. The following theorem states the complexity of this algorithm.

THEOREM 3. If the feasible set is non-empty, the algorithm always finishes in O(T^2) time.
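The inverse g = f^{-1} used in Problem (7) need not be available in closed form; since f is strictly increasing, a bisection over integer server counts suffices. A sketch (the function name and bounds are ours):

```python
def g_inverse(a, f, M=100):
    """Smallest integer m in [0, M] with f(m) >= a, i.e. the number of
    servers needed to sustain throughput target a (f strictly increasing)."""
    lo, hi = 0, M
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid) >= a:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

With a toy linear f(m) = 10·m, a target of 25 tasks per window needs 3 servers; the same routine works unchanged with the locality-aware f(m) = m·X(m) of Section 2.2.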
3.3 Optimality of the Algorithm
We denote by L(b, c, l, A) Problem (7) with parameters A, l_t, λ_{b,t} = b_t/τ, and λ_{c,t} = c_t/τ.⁴

²Details of the proofs can be found in the Appendices.
³For argmin/argmax, we always break ties arbitrarily.
⁴The m'_t terms in the objective function do not affect the optimal solution.

Algorithm 1 The Optimal Algorithm
1:  for t from 1 to T do
2:    b_t ← τ·λ_{b,t}
3:    c_t ← τ·λ_{c,t}
4:  end for
5:  for v from 1 to T do             ▷ Stage One
6:    if c_v < l_v then
7:      for u from v to 1 do
8:        Δ ← min{b_u, l_v − c_v}
9:        if Δ > 0 then
10:         b_u ← b_u − Δ, c_v ← c_v + Δ
11:       end if
12:     end for
13:   end if
14:   c_v ← l_v
15: end for                          ▷ End of Stage One
16: for u from T to 1 do             ▷ Stage Two
17:   while b_u > 0 do
18:     v ← argmax_t {c_t | t ≥ u, c_t < A}³
19:     Δ ← min{b_u, A − c_v}
20:     b_u ← b_u − Δ, c_v ← c_v + Δ
21:   end while
22: end for                          ▷ End of Stage Two
23: for t from 1 to T do
24:   m_t ← c_t
25: end for

In each step of either stage, the algorithm reduces the current problem to another problem by decreasing b_u and increasing c_v. At the end of stage two, we obtain a problem such that b_t = 0 and c_t ≤ A for each t, whose optimal solution is trivially m_t = c_t. By Theorem 3 and Theorem 4, it holds that the algorithm always returns the optimal solution.

THEOREM 4. For any L(b, c, l, A) with a non-empty feasible set, the algorithm keeps the optimal solution feasible in both stage one and stage two.

Theorem 4 follows directly from Lemma 5, Lemma 6 and Lemma 7, which show that Line 10, Line 14 and Line 20 of Algorithm 1, respectively, keep the optimal solution feasible.

LEMMA 5. For any problem L(b, c, l, A), if v = min{t | c_t < l_t} and u = max{t | t ≤ v, b_t > 0} exist, setting c' and b' by c'_v = c_v + Δ and b'_u = b_u − Δ for Δ = min{b_u, l_v − c_v}, while keeping all other elements of b and c unchanged, it holds that any optimal solution of L(b', c', l, A) is also an optimal solution of L(b, c, l, A).

LEMMA 6. For any problem L(b, c, l, A), if v = min{t | c_t < l_t} exists and b_t = 0 holds for any t ≤ v, setting c' by c'_v = l_v while keeping all other elements the same, the feasible sets of L(b, c', l, A) and L(b, c, l, A) are equivalent.
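A runnable sketch of the two-stage greedy follows; it works on copies of the per-window batch and interactive loads, and Line 14's reset is written as a max so that interactive demand already above l_v is never discarded (our reading of the listing, not a statement from the paper):

```python
def optimal_allocate(b, c, l, A):
    """Algorithm 1 sketch: returns the per-window throughput targets c_t;
    server counts then follow as m_t = g(c_t) via the inverse of f."""
    b, c = list(b), list(c)
    T = len(b)
    # Stage one: pull batch work arriving at or before window v forward
    # until the throughput lower bound l[v] is met.
    for v in range(T):
        if c[v] < l[v]:
            for u in range(v, -1, -1):
                d = min(b[u], l[v] - c[v])
                if d > 0:
                    b[u] -= d
                    c[v] += d
        c[v] = max(c[v], l[v])
    # Stage two: push the remaining batch work onto the already-fullest
    # later window (concavity of g favors piling work up).
    for u in range(T - 1, -1, -1):
        while b[u] > 0:
            cand = [t for t in range(u, T) if c[t] < A]
            if not cand:
                raise ValueError("infeasible instance")
            v = max(cand, key=lambda t: c[t])
            d = min(b[u], A - c[v])
            b[u] -= d
            c[v] += d
    return c
```

On a two-window toy instance, batch work is first pulled forward to meet the lower bound, and any leftover is stacked onto the fullest window up to the capacity A.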
LEMMA 7. For any problem L(b, c, l, A) with a non-empty feasible set, if c_t ≥ l_t holds for any t, and u = max{t | b_t > 0} and v = argmax_t {c_t | t ≥ u, c_t < A} exist, setting c' and b' by c'_v = c_v + Δ and b'_u = b_u − Δ for Δ = min{b_u, A − c_v}, while keeping all other elements the same, it holds that any optimal solution of L(b', c', l, A) is also an optimal solution of L(b, c, l, A).

Here we present the intuitions behind these lemmas. First, Lemma 5 tells us that if at some time the interactive MapReduce and web application workloads are not enough to make use of the whole MapReduce cluster, we should try to look for the batch tasks arriving no later

Figure 3: Energy consumption [MWh] for the MapReduce cluster under PK, D1, D2, and the always-on baseline, split into idle and active energy for γ = 1, 3, 5: (a) geometric arrivals and (b) day-night arrivals.

than that time to fill the residual cluster capacity, since MapReduce servers cannot be turned off. Then, Lemma 7 tells us that the remaining batch workload should be greedily assigned to the control window which has already been assigned the largest MapReduce throughput, until the MapReduce cluster capacity is reached. Since the cost function g is concave in the amount of admitted MapReduce workload a_t, the more we admit, the more energy-efficient we are.

4. EVALUATION
In this section, we use simulation to first verify our assumption on data locality. Second, we compare our optimal perfect-knowledge (PK) controller with two dummy allocation schemes referred to as D1 and D2. In each control period, D1 allocates enough servers to complete the expected MapReduce workload, while D2 uses a constant allocation based on the average workload over the whole time horizon.

4.1 Experimental Setup
We consider a system comprising a MapReduce cluster and a web farm of size 100 and 1800, respectively, over a 10-day time horizon. At the beginning of every 30 minutes, the control window size τ, the controller decides how many servers m_t to allocate to the batch and interactive MapReduce jobs, while the remaining MapReduce servers are used as web servers. The web workload cannot be controlled by the system, and we assume the required number of web servers m'_t to be known. Both batch and interactive MapReduce tasks are handled by a priority scheduler which dispatches them to the currently available servers. To satisfy the different delay requirements of batch and interactive MapReduce, the scheduler gives strict priority to interactive MapReduce. The task service rate is homogeneous across all servers, but each server checks whether the task being served is local or remote.
For the latter, the service rate is decreased by a slowdown factor α = 4. Moreover, we set the unit energy cost per on/active server to K = 250 Watt [18]. We consider two workload scenarios. The first scenario uses geometrically distributed arrival rates for batch MapReduce (λ_{b,t}) and interactive MapReduce (λ_{c,t}). The second scenario uses a day-night pattern where λ_{b,t} and λ_{c,t} follow a sinusoidal pattern across the day. This pattern better represents the workloads monitored in real datacenters. In all cases inter-arrival times and task execution times are exponentially distributed with mean 1/λ and 1/µ = 1 s, respectively. Each task is randomly assigned to a data chunk which is uniformly distributed across the MapReduce servers with replication factor γ = [1, 3, 5]. Once a task is scheduled, we use the probability P_l(m) to decide whether it is local or remote, so as to achieve the optimal throughput function f(m). We repeat each experiment 50 times and present the mean values over all repetitions. The resulting confidence intervals are quite narrow, between +0.25% and −0.25% of the mean values in all cases, except for the batch MapReduce response times, which have a maximum confidence interval between +7.8% and −7.8% of the mean.

4.2 Energy saving
The objective of the controller is to minimize the system energy consumption through the number of allocated MapReduce servers m_t. We present the impact of the allocation decisions taken by each controller on the energy consumption in Figure 3 (a) and (b). As baseline we also present the always-on energy consumption of the whole MapReduce cluster. Note that the consumed energy refers only to the MapReduce cluster, since the number of web servers is not controlled and adds 23.9 MWh to all scenarios. The figures distinguish between power used for busy servers and power used for idle servers. Idle servers are possible because during low-load periods the total workload of MapReduce and web does not suffice to saturate the capacity of the MapReduce cluster.
One can observe that in both scenarios our PK scheduler outperforms the baseline by up to 59.3% and the dummy controllers by up to 30.7%. Moreover, one can observe higher energy savings when increasing the replication factor. A higher replication factor raises the locality probability and the achievable throughput, and hence diminishes the number of servers required to satisfy the same demand.

4.3 Un-/Finished tasks/locality
While energy minimization is the optimization objective, we still want to satisfy all the requests coming to the system. Considering again all three controllers and the same scenarios as in the previous section, we depict in Figure 4 (a) and (b) the amount of finished tasks over the whole time horizon, split per data locality and type, and compare it to the input load. One can observe that the D1 controller is able to almost always complete all the tasks, while PK and, definitely, D2 lag behind. Here the PK controller suffers from its open-loop operation, which prevents it from reacting to errors in each control period; these errors accumulate over time. Even if the D1 controller succeeds in finishing all jobs, D1 lacks the energy saving possibilities given by the PK controller. Hence, PK achieves the trade-off between energy and finished jobs. In a practical scenario, the problem of unfinished tasks can easily be treated as additional workload in the first control period of the new time horizon.

4.4 Response times
We conclude our evaluation by presenting the effect of each controller's allocation on the mean response time. These are shown in Figure 5 (a) and (b), split by interactive and batch MapReduce. One can

Figure 4: Finished tasks [M#] over the time horizon vs. input load, split into local and remote (B)atch and (I)nteractive tasks for PK, D1, and D2 with γ = 1, 3, 5: (a) geometric arrivals and (b) day-night arrivals.

Figure 5: Response time for (I)nteractive and (B)atch MapReduce (MR) tasks under PK, D1, and D2 with γ = 1, 3, 5: (a) geometric arrivals and (b) day-night arrivals.

observe how the PK controller delays batch MapReduce by looking at its significant delay increment over the D1 and D2 cases. More generally, the best results are obtained by D1, followed by D2. Overall, the energy-minimizing allocation of the PK controller pays the price that, by reducing the number of MapReduce servers, it increases the MapReduce cluster load, which negatively affects the response times. Even so, PK still manages to keep the increase in interactive MapReduce delay limited.

5. RELATED WORK
There is a plethora of studies aiming to improve energy efficiency for different types of systems, ranging from conventional web service systems [7, 13] to modern big data systems [6, 11, 12, 15, 19], such as MapReduce. To minimize energy consumption, server resources are dynamically provisioned and workloads are scheduled correspondingly. Thereby, we summarize the related work in the area of dynamic sizing with a special focus on MapReduce systems. Dynamic sizing is proven effective for workloads that show strong time-varying patterns, e.g., switching servers on/off for web services [7, 13]. Due to the issues of data accessibility of the underlying file systems [2] and the performance dependency on data locality [9], dynamic sizing is applied only in a partial manner. Current works either control the fraction of on-time of the entire cluster [11] or the fraction of servers that are on all the time [12].
On the one hand, to harvest the maximum data locality, data-intensive workloads are batched and executed together on the MapReduce cluster [6, 11] for a certain duration, often at midnight. On the other hand, another set of studies [6, 12] leverages the data replication factor (e.g., 3 replicas per data chunk [3]) and proposes the concept of a covering set, which keeps only a fraction of the servers on all the time. Certain modifications of the file system are usually needed. To address the emerging class of interactive MapReduce workloads, BEEMR [6] combines the merits of both types of approaches by partitioning the MapReduce cluster into two zones, namely interactive and batch. However, how to optimally (and dynamically) allocate and execute interactive and batch MapReduce is yet to be discussed. Motivated by the complementary performance requirements of service and data-intensive workloads, another host of studies [14, 17, 22] tries to schedule MapReduce workloads according to the dynamics of web workloads, considering only batch MapReduce and (often) overlooking the locality issues. All in all, the related work falls short in addressing how to dynamically size the server resources for executing web, interactive and batch MapReduce in an energy-optimal fashion. Overall, it is not clear how to design a scheduler that can consider multiple types of workloads with different performance and system requirements, i.e., delay tolerances and data locality, in conjunction with dynamic sizing.

6. CONCLUSION AND SUMMARY
This study considers both control knobs, dynamic sizing and scheduling policies, simultaneously, for three types of workloads: web service, batch MapReduce and interactive MapReduce. We achieve minimum energy consumption by optimally allocating servers across the three types of workloads, factoring in their data locality, delay constraints, and workload uncertainties, while dynamically scheduling/delaying batch MapReduce. We developed an optimal algorithm under perfect knowledge with complexity O(T^2).
By means of extensive simulation, we show energy savings of up to 30% and 59% over dummy and no-allocation strategies, respectively, without significantly affecting the interactive MapReduce delay.

7. ACKNOWLEDGEMENTS

This work has been partly funded by the EU Commission under the FP7 GENiC project (Grant Agreement No 608826).

8. REFERENCES
[1] Hadoop. http://hadoop.apache.org. Accessed: 2014-09-30.
[2] Hadoop HDFS. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html. Accessed: 2014-09-30.
[3] How Google Works. http://www.baselinemag.com/c/a/infrastructure/how-google-works-1/. Accessed: 2014-09-30.
[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys, pages 29–42, 2013.
[5] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Wiley-Interscience, New York, NY, USA, 1998.
[6] Y. Chen, S. Alspaugh, D. Borthakur, and R. H. Katz. Energy efficiency for large-scale MapReduce workloads with significant interactive analysis. In EuroSys, pages 43–56, 2012.
[7] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam. Managing server energy and operational costs in hosting centers. In SIGMETRICS, pages 303–314, 2005.
[8] B. Cho, M. Rahman, T. Chajed, I. Gupta, C. Abad, N. Roberts, and P. Lin. Natjam: Design and Evaluation of Eviction Policies for Supporting Priorities and Deadlines in MapReduce Clusters. In ACM SoCC, pages 6:1–6:17, 2013.
[9] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In NSDI, pages 313–328, 2010.
[10] R. Horst. On the global minimization of concave functions. Operations Research Spektrum, 6(4):195–205, 1984.
[11] W. Lang and J. M. Patel. Energy Management for MapReduce Clusters. PVLDB, 3(1):129–139, 2010.
[12] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. Operating Systems Review, 44(1):61–65, 2010.
[13] M. Lin, A. Wierman, L. L. H. Andrew, and E. Thereska. Dynamic right-sizing for power-proportional data centers. In INFOCOM, pages 1098–1106, 2011.
[14] Z. Liu, Y. Chen, C. Bash, A. Wierman, D. Gmach, Z. Wang, M. Marwah, and C. Hyser.
Renewable and cooling aware workload management for sustainable data centers. In SIGMETRICS, pages 175–186, 2012.
[15] N. Maheshwari, R. Nanduri, and V. Varma. Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework. Future Generation Comp. Syst., 28(1):119–127, 2012.
[16] P. M. Pardalos and J. B. Rosen. Methods for global concave minimization: A bibliographic survey. SIAM Review, 28(3):367–379, 1986.
[17] B. Sharma, T. Wood, and C. R. Das. HybridMR: A Hierarchical MapReduce Scheduler for Hybrid Data Centers. In ICDCS, pages 102–111, 2013.
[18] D. Wang, S. Govindan, A. Sivasubramaniam, A. Kansal, J. Liu, and B. Khessib. Underprovisioning backup power infrastructure for datacenters. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 177–192. ACM, 2014.
[19] T. Wirtz and R. Ge. Improving MapReduce energy efficiency for computation intensive workloads. In IGCC, pages 1–8, 2011.
[20] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In USENIX NSDI, pages 2–2, 2012.
[21] W. Zhang, S. Rajasekaran, and T. Wood. Big data in the background: Maximizing productivity while minimizing virtual machine interference. In ASDB, 2013.
[22] W. Zhang, S. Rajasekaran, T. Wood, and M. Zhu. MIMP: Deadline and Interference Aware Scheduling of Hadoop Virtual Machines. In CCGRID, pages 394–403, 2014.

APPENDIX
A. PROOF FOR LEMMA 1
PROOF. Since (m*_1, ..., m*_T) is an optimal solution, it meets all the constraints and achieves the minimal value of the objective function. Consider the solution (max{m*_1, M − m'_1}, ..., max{m*_T, M − m'_T}). In each control window, max{m*_t, M − m'_t} is no smaller than m*_t, which means it can also finish all the tasks by the end of the control horizon. Thus, it is also a feasible solution.
Moreover, since this solution has exactly the same value of the objective function as the original one, it is also an optimal solution.

B. PROOF FOR THEOREM 3
PROOF. The runtime of stage one is trivially O(T^2). It suffices to show that stage two finishes in O(T^2) time; more precisely, that each run of the WHILE loop (Lines 17-20) finishes in linear time. In each iteration of the WHILE loop, we have either Δ = b_u or Δ = A − c_v. The algorithm either finishes the WHILE loop or removes c_v from the sorted list. Using the trick of sorting {c_1, ..., c_T} before stage two, if in stage two the set {c_t | t ≥ u, c_t < A} is never empty, then the WHILE loop terminates in linear time. If at some time during stage two the set {c_t | t ≥ u, c_t < A} is empty, then the problem with the b, c, l, A values at that time has an empty feasible set. Thus, it suffices to show that if the original problem has a non-empty feasible set, each step in stage one or stage two keeps the feasible set non-empty. This holds by Lemmas 8, 9 and 6.

LEMMA 8. In Algorithm 1, Line 10 keeps the feasible set of L(b, c, l, A) non-empty. Namely, if before this operation L(b, c, l, A) has a non-empty feasible set, then after this operation its feasible set is still non-empty.

PROOF. The feasible set of problem L(b, c, l, A) is non-empty if and only if, for any k, c_k ≤ A and

Σ_{t=k}^{T} (b_t + c_t) ≤ (T − k + 1) A.   (8)

It suffices to show that Line 10 keeps these properties. First, since Line 10 ensures c_v ≤ l_v = f(M − m_t) ≤ A, the first property c_v ≤ A is kept. Then, we show that Line 10 also keeps the second property. One can verify that in Line 10, if u < v, then b_{u+1}, ..., b_v = 0, so that b_t + c_t ≤ A holds for t ∈ {u+1, ..., v}, which shows the second property is kept; otherwise u = v, and b_v + c_v is unchanged.

LEMMA 9. In Algorithm 1, Line 20 keeps the feasible set of L(b, c, l, A) non-empty. Namely, if before this operation L(b, c, l, A) has a non-empty feasible set, then after this operation its feasible set is still non-empty.

PROOF. The proof is similar. It suffices to show that Line 20 keeps (8). Since in Line 20, b_{u+1}, ..., b_T = 0, it holds that b_t + c_t = c_t ≤ A for t ∈ {u + 1, ..., T}. Thus, if we increase c_v for some v > u, (8) still holds. Otherwise u = v, and b_v + c_v is unchanged.

C. PROOFS FOR LEMMAS 5, 6 AND 7
Before showing the formal proofs, we need the following lemmas.

LEMMA 10. For the function f(x, y) = (x − y)^+, for any y and Δ ≥ 0, it holds that x' ≤ x + Δ implies f(x', y) ≤ f(x, y) + Δ.

LEMMA 11. For the function f(x, y) = (x − y)^+, for any x and Δ ≥ 0, it holds that y' ≥ y − Δ implies f(x, y') ≤ f(x, y) + Δ.

We omit the proofs of Lemmas 10 and 11, since they hold trivially.

LEMMA 12. For any problem L(b, c, l, A), if we set c' and b' the same as c and b except that c'_v = c_v + Δ and b'_u = b_u − Δ, for some u, v such that u ≤ v and some non-negative Δ, while still keeping c'_v ≤ A and b'_u ≥ 0, then the feasible region of L(b', c', l, A) is a subset of the feasible region of L(b, c, l, A).

PROOF. Consider an arbitrary feasible solution a = (a_1, ..., a_T) of L(b', c', l, A). Denote r_1 = 0 and r_{t+1} = (r_t + b_t + c_t − a_t)^+ for problem L(b, c, l, A), and r'_1 = 0 and r'_{t+1} = (r'_t + b'_t + c'_t − a_t)^+ for problem L(b', c', l, A). If u = v, the lemma holds trivially since r'_t = r_t for any t. So now we consider the case that u < v. Notice that for any t ≤ u, r'_t = r_t. Since b'_u = b_u − Δ, by Lemma 10, we have r'_{u+1} ≥ r_{u+1} − Δ. Then by induction we have r'_t ≥ r_t − Δ for any t ∈ {u + 1, ..., v}. Since c'_v = c_v + Δ, we have r'_{v+1} ≥ r_{v+1}. By induction we have r'_{T+1} ≥ r_{T+1}, which means r'_{T+1} = 0 implies r_{T+1} = 0. Thus, if a is a feasible solution of L(b', c', l, A), then it is also a feasible solution of L(b, c, l, A).

Now we present the formal proofs of Lemmas 5, 6 and 7.

C.1 Proof for Lemma 5
Lemma 5 follows directly from the following lemma.

LEMMA 13.
For problem L(b, c, l, A), assume there exists c_v which has the smallest index v such that c_v < l_v, and there exists b_u which has the largest index u such that u ≤ v and b_u > 0. Denote Δ = min{b_u, l_v − c_v}. Set c' and b' the same as c and b, except that c'_v = c_v + Δ and b'_u = b_u − Δ. It holds that any optimal solution for L(b', c', l, A) is also an optimal solution for the original problem.

PROOF. For notational simplicity, in this proof we use L to denote L(b, c, l, A), and use L' to denote L(b', c', l, A). It follows by Lemma 12 that the feasible region of L' is a subset of the feasible region of L. Moreover, these two problems have the same objective function. Thus, it suffices to show that any optimal solution of L, a = (a_1, ..., a_T), is also a feasible solution of L'. We show this in two cases: u = v and u < v. In the former case, the two problems are equivalent since c_v < c'_v ≤ l_v and c_v + b_v = c'_v + b'_v. For the latter case that u < v, assume (r_1, ..., r_{T+1}) and (r'_1, ..., r'_{T+1}) are the values of the residual batch tasks corresponding to the solution a in L and L', respectively. With the same argument as in the proof of Lemma 12, we get that

r'_t = r_t for t ≤ u, and r'_t = (r_t − Δ)^+ for u < t ≤ v.   (9)

Now we show the following claim:

CLAIM 1. If a is an optimal solution for L, it holds that r_t ≥ Δ for u < t ≤ v.

We argue this by contradiction. Assume the statement does not hold; then there must exist some k, such that u ≤ k < v and r_{k+1} = (r_k + b_k + c_k − a_k)^+ < Δ. Without loss of generality, we assume k is the smallest of all the possible values and that r_{k+1} = Δ − δ. Then, it holds that

1. a_k ≥ c_k + δ,
2. r_v + b_v + c_v − a_v ≤ −δ.

The first statement holds because r_k + b_k ≥ Δ (if k = u, it is also true because b_u ≥ Δ), and (r_k + b_k + c_k − a_k)^+ = Δ − δ. The second statement holds because r_v ≤ Δ − δ, b_v = 0 and a_v ≥ c_v + Δ, where r_v ≤ Δ − δ holds because r_{k+1} ≥ ... ≥ r_v, following from the fact that b_{k+1} = ... = b_{v−1} = 0 and a_t ≥ c_t for any t, and a_v ≥ c_v + Δ holds because a_v ≥ l_v and Δ = min{b_u, l_v − c_v}. Then we get a better solution a' for L, such that

a'_t = a_t − δ if t = k, and a'_t = a_t otherwise.
This solution still meets all the constraints of L, because 1) a_k − δ ≥ c_k ≥ l_k, and 2) r''_{v+1} = r_{v+1}, ..., r''_{T+1} = r_{T+1}, where the r''_t are the residual batch tasks corresponding to a'. To show the latter statement, notice that by Lemma 11, a'_k = a_k − δ implies r''_{k+1} ≤ r_{k+1} + δ; by Lemma 10, this further implies r''_v ≤ r_v + δ. Thus, we have r''_v + b_v + c_v − a_v ≤ (r_v + δ) + b_v + c_v − a_v ≤ 0, which implies r''_{v+1} = r_{v+1} = 0. And r''_{v+2} = r_{v+2}, ..., r''_{T+1} = r_{T+1} follows since a'_t = a_t for t = v + 1, ..., T. We have constructed the contradiction. Thus, Claim 1 holds.

Combining Claim 1 with (9), we get that r'_v = r_v − Δ, which implies r'_{v+1} = r_{v+1} (recall c'_v = c_v + Δ). Since from time slot v + 1 the two problems L and L' are exactly the same, we have r'_{T+1} = r_{T+1} = 0, which means a is also a feasible solution of L'.

C.2 Proof for Lemma 6
Lemma 6 follows directly from the following lemma.

LEMMA 14. For problem L(b, c, l, A), if there exists c_v, v ∈ {1, ..., T}, such that c_v < l_v, and for any t ≤ v we have b_t = 0, then by setting c' the same as c except that c'_v = l_v, we get a new problem L(b, c', l, A) whose feasible set is equivalent to the feasible set of L(b, c, l, A).

PROOF. We can clearly see that for any feasible solution a of L(b, c, l, A), since a_t ≥ max{c_t, l_t} and b_t = 0 for t ∈ {1, ..., v}, we have that r_{v+1} = r'_{v+1} = 0. By induction, it holds that r_t = r'_t for any t > v. Thus a is also a feasible solution for L(b, c', l, A).

C.3 Proof for Lemma 7
We need the following non-slackness lemma and concave-minimum lemma, which are the two key insights behind Lemma 7.

LEMMA 15 (NON-SLACKNESS). For any problem L(b, c, l, A) such that c_t ≥ l_t for any t, the optimal solution a = (a_1, ..., a_T) will make r_t + b_t + c_t − a_t ≥ 0 for any t.

PROOF. The lemma holds, since otherwise we can decrease a_t.

LEMMA 16 (CONCAVE-MINIMUM). For a concave function g(x), given any x_1, x_2, x_3, x_4 such that x_1 ≤ x_2 ≤ x_3 ≤ x_4 and x_1 + x_4 = x_2 + x_3, it holds that g(x_1) + g(x_4) ≤ g(x_2) + g(x_3).

PROOF. This lemma follows directly from the definition of a concave function.

Lemma 7 follows directly from the following lemma.

LEMMA 17. For any problem L(b, c, l, A) such that c_t ≥ l_t for any t, assume u is the largest index such that b_u > 0, and v is the index such that v = argmax_t {c_t | t ≥ u, c_t < A} (breaking ties arbitrarily). Denote Δ = min{b_u, A − c_v}. Set c' and b' the same as c and b, except that c'_v = c_v + Δ and b'_u = b_u − Δ. It holds that any optimal solution for L(b', c', l, A) is also an optimal solution for L(b, c, l, A).

PROOF. For notational simplicity, we use L to denote L(b, c, l, A), and use L' to denote L(b', c', l, A). First we show that an optimal solution of L exists such that a_v ≥ c_v + Δ. Assume we have an optimal solution a for L such that a_v < c_v + Δ. Then there must exist v' ≥ u (v' ≠ v) such that a_{v'} > c_{v'}, otherwise we cannot finish all batch tasks at the end. We set a'_v = a_v + δ and a'_{v'} = a_{v'} − δ, where δ = min{a_{v'} − c_{v'}, A − a_v}, and keep a'_t = a_t for the other t's. Then a' is also a feasible solution, since it meets the constraints and can also complete all tasks at the end^5. Observe that δ = a_{v'} − c_{v'} implies c_{v'} = a'_{v'} ≤ a_v ≤ a'_v, since by definition c_{v'} ≤ c_v; while δ = A − a_v implies a'_{v'} ≤ a_{v'} ≤ a'_v = A. Either case implies g(a'_v) + g(a'_{v'}) ≤ g(a_v) + g(a_{v'}) (see Lemma 16), which shows that a' is at least as good as a. We can repeat this kind of adjustment again and again, without increasing the value of the objective function, until either a_v = A (saturated) or a_t = c_t for every t ∈ {u, ..., T} \ {v}, which indicates that we can eventually get an optimal solution such that a_v ≥ c_v + Δ.
Then we show that if a solution a of L satisfies a_v ≥ c_v + Δ, then it is also a feasible solution of L', because 1) the constraints c'_t ≤ a_t ≤ A still hold (we only need to check the case t = v); and 2) to show r'_{T+1} = 0, we have

r'_{T+1} = (r'_u + Σ_{t=u}^{T} (b'_t + c'_t − a_t))^+
        (a)= (r'_u + b'_u + Σ_{t=u}^{T} (c'_t − a_t))^+
        (b)= (r_u + b_u + Σ_{t=u}^{T} (c_t − a_t))^+
        = r_{T+1} = 0,

where (a) holds since b'_{u+1} = ... = b'_T = 0, and (b) holds since b'_u = b_u − Δ, c'_v = c_v + Δ and r'_u = r_u (we can use induction to show r'_t = r_t for all t ≤ u).

^5 Recall that by definition we have b_t = 0 for any t ∈ {u + 1, ..., T}, so as long as the constraints c_t ≤ a_t ≤ A are met and we also have r_u + b_u = (a_u − c_u) + ... + (a_T − c_T), the solution can finish all the tasks at the end, so it is a feasible solution.
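The concave-minimum lemma is easy to sanity-check numerically. A minimal sketch (our own illustration, not part of the paper's algorithm), using g(x) = sqrt(x) as a stand-in concave cost and hypothetical values satisfying x_1 ≤ x_2 ≤ x_3 ≤ x_4 and x_1 + x_4 = x_2 + x_3:

```python
import math

def concave_min_gap(g, x1, x2, x3, x4):
    """Return (g(x2) + g(x3)) - (g(x1) + g(x4)); non-negative for concave g
    whenever x1 <= x2 <= x3 <= x4 and x1 + x4 == x2 + x3 (Lemma 16)."""
    assert x1 <= x2 <= x3 <= x4 and x1 + x4 == x2 + x3
    return (g(x2) + g(x3)) - (g(x1) + g(x4))

# sqrt is concave, so pushing a pair of values toward the extremes
# never increases the total cost -- the key step in the swap argument.
gap = concave_min_gap(math.sqrt, 1.0, 2.0, 4.0, 5.0)
print(gap >= 0)  # True: g(1) + g(5) <= g(2) + g(4)
```

This is the property exploited when moving δ units of service between time slots: the more extreme allocation is at least as good for a concave objective.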

Multi-resource fair sharing for multiclass workflows

Jian Tan, Li Zhang, Min Li, Yandong Wang
IBM T. J. Watson Research Center, NY 10598, USA
tanji@us.ibm.com, zhangli@us.ibm.com, minli@us.ibm.com, yandong@us.ibm.com

ABSTRACT
Multi-resource sharing for concurrent workflows necessitates a fairness criterion to allocate multiple resources to workflows with heterogeneous demands. Recently, this problem has attracted increasing attention and has been investigated by assuming that each workflow has a single class of jobs and that each class contains jobs of the same demand profile. The demand profile of a class represents the multiple resources required by a job. However, for typical applications in cloud computing and distributed data processing systems, a workflow usually needs to process multiple classes of jobs. Relying on the concept of slowdown, we characterize fairness for multi-resource sharing and address scheduling for multiclass workflows. We optimize the mixture of different classes of jobs for a workflow, treating these mixtures as operation points, to achieve the least slowdown, and we discuss desirable properties of these optimal operation points. These studies assume that the jobs are infinitely divisible. When jobs are non-preemptive and indivisible, any fairness criterion that relies only on the instantaneous resource allocation cannot be strictly maintained at every time point. To this end, we relax instantaneous fairness to an average metric over a time interval. This relaxation introduces a time average to fairness and allows occasional, but not too frequent, violations of instantaneous fairness. In addition, it brings flexibility and opportunities for further optimization of resource utilization, e.g., using bin-packing, within the constraint on fairness.

1. INTRODUCTION
Multi-resource fair allocation is a fundamental problem in designing computing systems shared by multiple workflows. A workflow consists of multiple classes of jobs, and each class contains multiple jobs. This general definition describes many applications. For example, a MapReduce [1] job can be viewed as a workflow.
It contains two classes of jobs: map tasks and reduce tasks. Dryad [12] and Spark [2] can support more than two classes of jobs.

Copyright is held by author/owner(s).

These applications in cloud computing and distributed data processing systems rely on the efficient allocation of multiple resources, e.g., CPU time, memory space, I/O bandwidth, network transmission, and software licenses, to name just a few. These practical requirements foster the research on fair sharing of multiple resource types [8, 13, 17, 14, 5, 4]. In a degenerate case when all workflows consume a single type of resource, fairness has been well studied [21, 11], primarily due to the complete preference order induced by the fraction of the resource used by each running workflow. Recently, the problem of fair sharing of multiple resources among multiple workflows has attracted much attention [8, 13, 17, 14, 4]. These studies assume that a workflow contains multiple identical jobs and that each job demands a certain amount of each of the multiple types of resources. Therefore, a job class can be characterized by a demand profile, a vector described by the required units of each type of resource for a unit of job.

Figure 1: Multi-resource requirements of 2 workflows

As illustrated in Fig. 1, a job of workflow 1 uses 8 CPUs and 6 units of memory, represented by a profile (8, 6). Similarly, the job profile of workflow 2 is (4, 13). Because of heterogeneous demands, it is quite common that some jobs require more CPU than memory and others require more I/O bandwidth than CPU. If each job class only consumes a single type of resource, then the workflow's resource usage can be represented by a scalar, which implies a complete preference order. However, with multiple resources, the resource requirements are vectors according to the demand profiles, allowing only a partial order. Therefore, a fairness criterion is needed to schedule multiple workflows on a shared computing cluster.
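The partial order on demand vectors can be made concrete with the Fig. 1 profiles; a minimal sketch (the `dominates` helper is ours):

```python
def dominates(p, q):
    """Componentwise comparison: p dominates q if p >= q in every resource."""
    return all(a >= b for a, b in zip(p, q))

w1 = (8, 6)   # workflow 1: 8 CPUs, 6 memory units per job
w2 = (4, 13)  # workflow 2: 4 CPUs, 13 memory units per job

# Neither profile dominates the other, so demands admit only a partial order.
print(dominates(w1, w2), dominates(w2, w1))  # False False
```

Since neither vector is componentwise larger, no scalar comparison exists, which is exactly why a fairness criterion must be imposed.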
Specifically, in a single-resource setting, most scheduling policies satisfy the work-conserving property, in the sense that all resources are fully utilized whenever there are unfinished workloads. For the multi-resource setting, due to job profiles that have proportional

resource requirements for different types, it is possible that some resources have leftovers that cannot be utilized at all. For example, only two jobs of profile (8, 6) can be served on a cluster with capacity (16, 30); this leaves 30 − 2 · 6 = 18 units of the second type of resource unused. Therefore, fairness can impact system efficiency. To this purpose, a fairness criterion called dominant resource fairness [8] has been proposed, which allocates resources on the dominant shares according to max-min fairness. It has been shown [8, 17, 13, 14] that this fairness notion has a number of properties, such as sharing incentive, envy-freeness, Pareto optimality and strategy-proofness. However, for many real applications, e.g., MapReduce [1], Dryad [12], Spark [2], a workflow usually contains multiple classes of jobs, where each class has a different demand profile for resources. For example, a MapReduce job consists of both map tasks and reduce tasks. It is possible that a map task is CPU intensive and a reduce task is I/O intensive. In addition, in some applications different classes of jobs of the same workflow can be executed in parallel in arbitrary combinations, while in other cases there are restrictions caused by job dependency constraints, so that only specific combinations of different job classes are possible. For all of these applications, it is important to study the fairness and the optimal mixing of different classes of jobs, since a workflow cannot simply be characterized by a single demand profile as assumed in [8, 13, 14]. The existing frameworks, e.g., dominant resource fairness [8, 13] and proportional fairness [4], are thus not directly applicable to this setting. In this study, we distinguish two different cases: 1) jobs are infinitely divisible, and 2) jobs are indivisible and non-preemptive. For the first case, if the allocated resources are not enough for an entire job, a fraction of the job can be launched. This is the assumption made in [8, 13, 14, 4]. For the second case, every job has to be launched in its entirety, and the job keeps running until it finishes.
During the execution of a job, the occupied resources will not be used by any other workflows. For the first case, we consider instantaneous fairness that needs to be satisfied at each time point. This criterion only depends on the resource allocation at that specific time. Thus, we do not need to consider workflow arrivals and departures. However, for the second case, since jobs are non-preemptive and indivisible, instantaneous fairness cannot be guaranteed. We relax instantaneous fairness to an average measurement over a time interval. This relaxation introduces a time average to fairness and allows occasional, but not too frequent, violations of instantaneous fairness. In addition, it brings flexibility and opportunities for further optimization of resource utilization, e.g., using bin-packing, within the constraint on fairness.

2. ILLUSTRATIVE EXAMPLE
To illustrate the problem, consider a system of 100 CPUs and 1000GB of memory with two workflows A and B. A has two classes of jobs, and each class can have multiple jobs. A job of class A1 requires 10 CPUs and 125GB of memory, and a job of class A2 requires 5 CPUs and 8GB of memory. Thus, A has two job profiles (10, 125) and (5, 8). B has three classes of jobs, with B1 requiring 20 CPUs and 50GB of memory for each job, B2 requiring 4 CPUs and 25GB of memory for each job, and B3 requiring 6 CPUs and 8GB of memory for each job. It has three job profiles (20, 50), (4, 25) and (6, 8). We first review dominant resource fairness [8] by assuming that workflows A and B each have only a single class, i.e., A1 and B1 with job profiles (10, 125) and (20, 50), respectively. In addition, we assume that the jobs are infinitely divisible. After calculation, dominant resource fairness allocates the limited resources by launching 40/9 jobs for A and 25/9 jobs for B. We use a notion of slowdown of a workflow to evaluate the allocation process from a slightly different perspective. With this notion, we interpret dominant resource fairness in a way that can be generalized to the setting where jobs are non-preemptive and indivisible. Suppose that the whole cluster is dedicated to serving A.
Then min{100/10, 1000/125} = 8 jobs of type A1 can be launched. By the same reasoning, min{100/20, 1000/50} = 5 jobs of type B1 can be launched for workflow B. Because of sharing, only N_A ≤ 8 and N_B ≤ 5 jobs can be launched for A and B, respectively. The slowdown of workflow A is equal to N_A/8 and the slowdown of workflow B is equal to N_B/5. A resource allocation is said to be fair for A and B if their slowdowns are equal, i.e., η = N_A/8 = N_B/5. The goal is to fully utilize the available resources; thus,

max η subject to
10 N_A + 20 N_B ≤ 100, for CPU
125 N_A + 50 N_B ≤ 1000, for memory
η = N_A/8 = N_B/5, for fairness,

which yields

η = min{100 / (8 · 10 + 5 · 20), 1000 / (8 · 125 + 5 · 50)} = 5/9.

This result implies that N_A = 8 · 5/9 = 40/9 and N_B = 5 · 5/9 = 25/9. In this case, the CPU is fully utilized. The problem becomes more complicated when A and B have multiple classes of jobs with different profiles. In order to fairly allocate resources to multiclass workflows, we choose to generalize the notion of slowdown to this setting. Then, equalizing the slowdowns of different workflows yields a fair allocation. If we specify the slowdowns of all jobs to be equal, it is possible that there are still leftover resources of some types while other types are saturated. This problem has been discussed in [17]. In this case, if a class does not need these saturated types of resources, we can further increase the number of running jobs of this class, which will eventually make the slowdowns unequal. To address this issue, we can either assume that every job class requires all types of resources, or allow multiple rounds of job allocation. In each round, we only consider those classes that can be further increased, constrained by fairness and the available leftover resources.

Consider the situation when the whole cluster is dedicated to serving workflow A, with job profiles A1 and A2 as described earlier.
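The max-η computation above can be reproduced in a few lines; a sketch with the capacities and single-class profiles of the example (variable names ours):

```python
# Single-class fair allocation via equalized slowdowns:
# cluster (100 CPUs, 1000 GB), A1 = (10, 125), B1 = (20, 50).
C = (100, 1000)
A1, B1 = (10, 125), (20, 50)

# Maximum jobs each workflow could run if it had the cluster to itself.
NA_max = min(C[r] // A1[r] for r in range(2))  # 8
NB_max = min(C[r] // B1[r] for r in range(2))  # 5

# With slowdown eta = N_A/NA_max = N_B/NB_max, each resource r constrains
# eta <= C[r] / (NA_max*A1[r] + NB_max*B1[r]).
eta = min(C[r] / (NA_max * A1[r] + NB_max * B1[r]) for r in range(2))
print(eta, NA_max * eta, NB_max * eta)  # eta = 5/9, N_A = 40/9, N_B = 25/9
```

The binding resource is the CPU (100/180 = 5/9 < 1000/1250), matching the full CPU utilization noted in the example.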
The numbers of jobs that can be supported for serving workflow A form a feasible set

S_A = {(x, y) : x(10, 125) + y(5, 8) ≤ (100, 1000), x ≥ 0, y ≥ 0},

where x and y denote the numbers of running jobs of types A1 and A2, respectively. The boundary subset

P_A = {(x, y) : (x, y) ∈ S_A and there is no w ∈ S_A with w > (x, y)}

defines a Pareto curve on which the capacity constraint is tight. The execution speed of workflow A increases by x_c/x = y_c/y times when the operation point (x, y) proportionally increases to a point (x_c, y_c) ∈ P_A. The slowdown of operation point (x, y) can be defined by x/x_c, which is equal to y/y_c. The same

arguments can be applied to the other workflows, e.g., B, which has three types of job profiles, B1, B2, B3.

Figure 2: Slowdown with 2 classes of jobs

When jobs are indivisible, the number of running jobs of every class has to take integer values. In this case, we can still map an operation point (x, y), x, y ∈ N, to a point (x_c, y_c), x_c, y_c ∈ N, within the feasible set, such that x/x_c = y/y_c and there exists no other feasible point (x'_c, y'_c) with x'_c > x_c, y'_c > y_c. Thus, the point (x_c, y_c) may not lie on the Pareto curve. The slowdown can be defined by the ratio between these two points, similar to the divisible case.

3. MODEL DESCRIPTION
Assume that a job of class j, 1 ≤ j ≤ J_i, belonging to workflow i, 1 ≤ i ≤ I, requires R_ij(r) units of resource type r. In the rest of the paper, we use A^T to denote the transpose of a matrix A. Thus, the demand profile of class j of workflow i can be described by a vector R_ij = (R_ij(1), ..., R_ij(R))^T. Let N_ij denote the number of jobs allocated to class j of workflow i, which is determined by the scheduling algorithm at run time. We call {N_ij} the operation point of workflow i. This operation point moves dynamically as the other workflows sharing the cluster with i change. Denote by C_r the capacity of resource r. We have the following constraints: for all r,

Σ_{i,j} R_ij(r) N_ij ≤ C_r.

Let R_i = (R_i1, ..., R_iJ_i) be an R × J_i matrix, and let N_i = (N_i1, ..., N_iJ_i)^T and C = (C_1, ..., C_R)^T be two vectors. We have

Σ_{i=1}^{I} R_i N_i ≤ C.   (1)

We consider two different assumptions. The first assumes that jobs are like fluid and infinitely divisible: any fraction of a job can be served. The second studies the situation when jobs are indivisible and non-preemptive: every job has to be served as a whole unit, and the occupied resources will not be used by any other jobs during the execution of the job. For the first case, we can define an instantaneous fairness at every time point. Therefore, we do not need information on when the workflows arrive and depart. Instead, we assume that at a generic time point, all workflows are processed simultaneously in the shared cluster.
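Capacity constraint (1) is a componentwise check; a minimal sketch (job counts hypothetical, helper name ours):

```python
def feasible(workflows, capacity):
    """Check constraint (1): summed demand*count over all workflows and
    classes stays within capacity, componentwise."""
    used = [0] * len(capacity)
    for classes in workflows:
        for profile, count in classes:
            for r, demand in enumerate(profile):
                used[r] += demand * count
    return all(u <= c for u, c in zip(used, capacity))

# Hypothetical operation points on a (100 CPU, 1000 GB) cluster.
w1 = [((10, 125), 3)]                 # 3 jobs of a (10, 125) class
w2 = [((20, 50), 1), ((40, 250), 1)]  # one job from each of two classes
print(feasible([w1, w2], (100, 1000)))  # True: total usage (90, 675)
```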
Since N_ij can take any positive real value, in view of (1), it is clear that the set

N_i^S = {N_i ∈ R^{J_i} : R_i N_i ≤ C, N_i ≥ 0}

forms a convex set. For each workflow i, we can characterize a Pareto curve for the vector N_i by assuming that the whole cluster with resources C is dedicated to serving i. Specifically, let N_i^P be the maximal subset of N_i^S such that

N_i^P = {z ∈ N_i^S : there is no w ∈ N_i^S such that w > z},   (2)

which represents the set of optimal operation points when workflow i is served exclusively in the cluster. As multiple workflows are processed simultaneously on the same cluster, the operation point of each workflow will move below the optimal curve and lie within the set N_i^S. To be fair for all workflows with divisible jobs, we can specify that the slowdowns of all workflows be equal. However, it is possible that there are still leftovers of some types of resources while the other types are saturated, as already addressed in [17]. If a class of jobs does not need these saturated types of resources, then it can utilize the still-available resources, which can further decrease its slowdown. Therefore, we will eventually observe uneven slowdowns. This problem can be avoided if every job requires all types of resources. Otherwise, we need to run multiple rounds of job allocation. In each round, we only consider those classes that can still be increased using the available resources.

The second case is when jobs are non-preemptive and indivisible. Unlike the previous case, which only needs to focus on a specific time point, we assume that workflow i arrives at time s_i and departs the cluster at time f_i. Every job, once started, will take some execution time, during which the occupied resources will not be used by others. We require that N_ij take only integer values. The set

N_i^S = {N_i ∈ Z^{J_i} : R_i N_i ≤ C, N_i ≥ 0}   (3)

contains discrete elements. We can still define the set of optimal operation points the same as in (2). Since the N_ij only take integer values, it is possible that no further jobs can be launched while at the same time no type of resource is saturated.
For example, consider available resources (5, 10) with a job profile (4, 8). Only one job can be launched, and (1, 2) will be the leftover resources. Though both types of resources have surpluses, they are not enough to serve a whole unit of a job.

4. DIVISIBLE JOBS
In this section, we assume that the jobs are infinitely divisible, N_ij ∈ R_+. Because of this assumption, we only need to investigate fairness at a generic time, since the following argument applies at every time point. As explained earlier, we need to introduce a definition of slowdown for multiclass workflows. We say an allocation is fair for concurrent workflows if they experience the same slowdown. For an operation point p ∈ N_i^S of workflow i, we define its slowdown η_i by relating p to a corresponding point p_C ∈ N_i^P.

Definition 1. The slowdown of workflow i at operation point p ∈ N_i^S is defined as

η_i(p) = inf{x : p/x ∈ N_i^S, x ∈ R_+}.

We can specify that all concurrent workflows have the same slowdown at any time, under the assumption that R_ij > 0

for all, j. Wthout ths assumpton, the slowdowns do not need to be equal, as explaned n Secton 3. There could be a constrant on the combnatons of dfferent classes of jobs that can run smultaneously for a workflow. For some applcatons, the percentages of dfferent job classes of a workflow reman fxed,.e., the ratos between N j beng fxed for dfferent j. In ths stuaton, multple jobs of dfferent classes are combned proportonally, whch essentally s equvalent to a sngle class workflow. For other scenaros, a workflow has the flexblty to change the percentages of runnng jobs of dfferent classes. Dependng on how dfferent jobs are mxed together, the workflow slowdown can also change accordngly. Therefore, a decson on the optmal combnaton of dfferent classes of jobs s crtcal. Our objectve s to maxmze the slowdowns of all concurrent workflows. In addton, we want to fully utlze all avalable resources and maxmze the resource usages. In ths secton, we consder a specal case wth only two types of resources r 1, r 2, e.g., CPU and memory. We normalze the demand profle such that the largest element s equal to 1. Because of ths normalzaton, the demand profles are on the set {(r 1, r 2) : r 1 1, r 2 = 1} {(r 1, r 2) : r 2 1, r 1 = 1}. Denote by P1, P2,, PJ the profles of workflow wth Pj = (p j1, p j2), and by x 1, x 2,, x J the percentages for mxng these job profles together, J j=1 x j = 1. Then, the equvalent profle of mxng these jobs s equal to J P = (p 1, p j=1 2) = x jpj J j=1 x j P, (4) j where (y 1, y 2) = max(y 1, y 2). For workflow, suppose that N j/ j Nj has a constrant that t s wthn an nterval [ξ l, ξ r]. Then, the feasble combned profles can be shown to le n a close set. Usng the mappng defned by (4), we can compute the set S of normalzed profles for workflow. Fgure 3: Equvalent profle for mxng 4 profles Frst, we maxmze the slowdown by the followng procedure. 
In the set S̄_i, we can always find an edge subset ∂S̄_i containing at least one but at most two points of the form (1, y*) or (x*, 1), where x*, y* are minimal. We find the solution to the following optimization:

ζ_i = min_{P̄_i ∈ ∂S̄_i} max_{j=1,2} p̄_ij,   (5)

and the maximal slowdown η is equal to 1/ζ_i. This solution also defines a corresponding operation point, at which at least one of the resources is saturated. If both resources are saturated, it gives the optimal operation point. Otherwise, there is a type of resource r_last that still has leftover. The next step is to increase the resource usage of r_last. To this end, we adjust the mixing of the classes of jobs. For all jobs with profiles that do not have a dominant share for r_last, we can increase the value of the profile for r_last until either r_last is saturated or this value cannot be increased any more. In this step, it is possible to obtain multiple optimal operation points; thus, the result is not always unique. Using the aforementioned procedure, we can obtain the optimal operation points. It can be shown that these points all satisfy the sharing-incentive property, in the sense that each workflow gets no less than a 1/n share of its dominant resource when n workflows run concurrently. In addition, these allocations also satisfy the envy-free property, which means that no workflow can get a better slowdown by using another workflow's allocation. An interesting result is that the envy-free property is not satisfied when the system works at an operation point that is fair but not efficient, i.e., all workflows having the same, but not the maximal, slowdown. Furthermore, it can be shown that the system is strategy-proof, in the sense that a workflow cannot improve its slowdown by modifying its edge subset ∂S̄_i. It is possible that a workflow has the same set S̄_i after changing its job profiles.

5. INDIVISIBLE JOBS
The previous study assumes that jobs are infinitely divisible: if the allocated resources are not enough for an entire job, a fraction of the job can be launched. This is the assumption made in [8, 13, 14, 4], under which fairness can be guaranteed instantaneously at each time.
However, for many applications, jobs are non-preemptive and each can only be launched as a whole unit. Therefore, instantaneous fairness is not always possible. For example, when a new workflow arrives and sees no available resources, it has to wait until some running jobs finish before it can launch some of its own jobs. During the waiting period, the new workflow obtains no service at all, which is considered unfair if the fairness criterion only depends on the allocated resources at each time. Therefore, instantaneous fairness cannot be used, and an average slowdown over a time interval may be a better option. In order to deal with this difficulty, we introduce an average fairness measured over a time interval, instead of the strict definition depending only on the allocated resources at a specific time point. The job profile of a workflow defines the required amount of each type of resource for every job of this user. Once a job is launched, it takes some time to execute. We assume that the job execution times of the different classes are all upper bounded by a finite constant C. Since the jobs are non-preemptive, the occupied resources cannot be used by others during the execution. After a job finishes, it returns the resources to the cluster, which can be further allocated to other jobs. Similar to the divisible case, using the definition in (3), the slowdown of workflow i at operation point p ∈ N_i^S is defined as

η_i(p) = inf{x : p/x ∈ N_i^S, x ∈ R_+}.

In this section, we assume that each workflow has a single class, and thus we replace "workflow" by "user" to emphasize the difference. Suppose that user i arrives at the cluster at time s_i, and spends some time in execution before it leaves the cluster at time f_i.
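For small instances, the discrete feasible set (3) and the maximal operation points defined as in (2) can be enumerated by brute force; a sketch for a two-class workflow (profiles and capacity hypothetical, helper name ours):

```python
from itertools import product

def optimal_points(profiles, capacity):
    """Enumerate integer points (x, y) with x*P1 + y*P2 <= C componentwise,
    keeping the maximal ones, i.e. the discrete analogue of (2)."""
    p1, p2 = profiles
    xmax = min(c // d for c, d in zip(capacity, p1) if d)
    ymax = min(c // d for c, d in zip(capacity, p2) if d)
    feas = [(x, y) for x, y in product(range(xmax + 1), range(ymax + 1))
            if all(x * a + y * b <= c for a, b, c in zip(p1, p2, capacity))]
    return [(x, y) for x, y in feas
            if not any(u >= x and v >= y and (u, v) != (x, y) for u, v in feas)]

# Hypothetical two-class workflow on a (10, 10) cluster.
pts = optimal_points([(4, 8), (2, 1)], (10, 10))
print(pts)  # [(0, 5), (1, 2)]
```

Note that neither maximal point dominates the other, which is exactly why an operation point must be chosen rather than a single maximum.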

First, we use an example to illustrate that instantaneous fairness relying only on the allocated resources at the current time point can lead to unfair scheduling decisions when the jobs are non-preemptive and indivisible. Consider a cluster with 100 units of a single type of resource shared by users A and B. Both A and B have many jobs to finish. Each job of user A requires 30 units of resource, and each job of user B requires 10 units. In addition, we assume that a job of user A takes 10 units of time to finish, while a job of user B needs only 1 unit of time. Suppose that at time 0, two jobs of user A and 4 jobs of user B run simultaneously in the cluster, which consumes 2 · 30 + 4 · 10 = 100 units of resource. At time 10, the two jobs of user A finish. Since user B has only received 40% of the resources, due to fairness, the scheduler will increase the number of running jobs for user B. Therefore, user A can only run one job at time 10. During the time period 10 to 11, user A runs one job and user B runs 7 jobs. When the jobs of user B finish at time 11, the scheduler notices that user A only receives 30% of the resources, and it will increase the number of running jobs for A to 2. Roughly speaking, the allocation of 2 jobs for user A and 4 jobs for user B running simultaneously will occur for most of the time during the course of execution. As the process continues, user A will get more computation resources from the cluster than user B over the long run; the shares are approximately 60% and 40% for users A and B, respectively. This unfairness can be even more pronounced when considering the fact that a cluster consists of many small computer nodes of different sizes. The resources allocated to jobs come from these separate, discrete computer nodes, and a running job needs to use the resources of a single node. Therefore, if the resources on one computer node are not enough for a job, the job still cannot be served, even though the total accumulated available resources on the whole cluster are abundant.
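The node-level effect described above reduces to a per-node fits check; a minimal sketch (node sizes hypothetical) showing a job that fits on no single node although the aggregate leftover would suffice:

```python
def fits(job, node):
    """A job can run on a node only if that single node covers its profile."""
    return all(d <= avail for d, avail in zip(job, node))

# Hypothetical: four nodes each with (4 CPUs, 5 GB) free -> aggregate (16, 20),
# yet a (5, 5) job fits on none of them.
nodes = [(4, 5)] * 4
job = (5, 5)
aggregate = tuple(map(sum, zip(*nodes)))
print(aggregate, any(fits(job, n) for n in nodes))  # (16, 20) False
```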
We say that a set of resources is eligible for a user if it can be used to serve the user's jobs. Consider a cluster that has two subsets: S_1 with 5 computer nodes of size (10, 10) and S_2 with 5 nodes of size (4, 5). If user A has the job profile (5, 5), then the only eligible set of nodes is S_1 with the 5 nodes of size (10, 10). Though the user can use the whole cluster, it is as if the user has been deprived of access to S_2. Therefore, when we compare the fairness of user A with any other user, say B, we should not consider the usage of user B on S_2. Instead, we can consider the maximal set of resources that both A and B can use. They are considered to have fair shares if their allocations on that set of resources are fair.

[Figure 4: Illustration of fair allocation]

Let N_i(t) denote the number of running jobs of user i at time t, with N_i being the maximum number of jobs that user i could launch in the cluster if no other users were present. The slowdown of user i at time t is equal to η_i(t) = N_i(t)/N_i. The average slowdown over a time interval [a, b], with b − a = l, is defined as η_i([a, b]) = (1/l) ∫_a^b η_i(t) dt.

Definition 2. For two users, u_1 and u_2, who can also share resources with other users, an allocation is (l, ε)-fair for u_1 and u_2 if, during any time interval [a, b] of length l = b − a when u_1 and u_2 are served simultaneously in the cluster, the average slowdowns of u_1 and u_2 on [a, b] differ by a factor of at most 1 + ε, ε > 0, i.e., 1/(1 + ε) ≤ η_{u1}([a, b]) / η_{u2}([a, b]) ≤ 1 + ε.

Next, we propose a scheduling algorithm so that all sharing users are treated (l, ε)-fairly, under the assumption that the execution times of all jobs are bounded by a constant C. We want the following property: for any given ε > 0, there exists a finite l > 0 such that the scheduling algorithm guarantees all workflows to be (l, ε)-fair. Let M(t) be the number of users running simultaneously in the cluster at time t. The slowdown of user i is equal to η_i(t).

[Figure 5: Dynamics of M(t) and Z(t)]
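Definition 2 can be checked mechanically from slowdown traces. This is a minimal sketch, assuming each user's slowdown η_i(t) is piecewise constant between scheduling events (the trace representation and function names are ours):

```python
def average_slowdown(trace, a, b):
    """eta_i([a, b]) = (1/(b - a)) * integral of eta_i(t) over [a, b],
    for a piecewise-constant trace given as [(t_start, t_end, eta)] segments."""
    total = 0.0
    for start, end, eta in trace:
        lo, hi = max(start, a), min(end, b)
        if hi > lo:
            total += (hi - lo) * eta
    return total / (b - a)

def is_eps_fair(trace1, trace2, a, b, eps):
    """(l, eps)-fairness test on one interval [a, b] of length l = b - a:
    the ratio of average slowdowns must lie in [1/(1 + eps), 1 + eps]."""
    r = average_slowdown(trace1, a, b) / average_slowdown(trace2, a, b)
    return 1.0 / (1.0 + eps) <= r <= 1.0 + eps

# u1 is under-served early and over-served late; on average it matches u2:
u1 = [(0, 10, 0.4), (10, 20, 0.6)]
u2 = [(0, 20, 0.5)]
print(average_slowdown(u1, 0, 20), is_eps_fair(u1, u2, 0, 20, 0.1))  # 0.5 True
```

The example shows why averaging over an interval helps: instantaneously, u1 is treated unfairly during [0, 10), yet over the whole window of length l = 20 the allocation satisfies the definition for any ε > 0.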
For a user i that still has unfinished workload at time t, define Z_i(t) = η_i(t) / Σ_{n=1}^{M(t)} η_n(t), as illustrated in Fig. 5. The following function for user i at time t > 0 on the time interval [max(t − l, s_i), t] is defined by C_i(t) = ∫_{max(t−l, s_i)}^{t} Z_i(x) M(x) dx. We can use C_i(t) as a performance indicator function, since ideally C_i(t) should be very close to l for all users and any time t. Based on C_i(t), a straightforward algorithm is to sort all users that are still running in the cluster according to C_i(t) in ascending order at each time t, and launch a job from the user with the smallest C_i(t) whenever possible. This indicator function captures time dependence, which is missing in any approach that only specifies an instantaneous fairness requirement. To see this point, consider an extreme case in which each user can only run one job in the cluster. Without using the past information about the already accumulated service, we cannot fairly allocate the resources to multiple users. This becomes important when multiple users with different profiles share the cluster together. For example, it is possible that some users have jobs requiring a lot of resources while other users have small jobs that each only need a small

amount of resources. These small jobs can be squeezed into the resource fragments but large jobs cannot. We need to track the accumulated resource allocation over time. When the jobs are infinitely divisible and have multi-dimensional profiles, it can be shown that the previous algorithm is equivalent to dominant resource fairness as l → 0. Furthermore, if all job profiles are identical, this algorithm is the same as processor sharing when l → 0. In multi-resource scenarios, carefully packing jobs of various profiles can help to fully utilize the resources of different types. However, if the fairness criterion is too strict, there will not be enough room for the scheduling algorithm to optimize the packing of jobs.

[Figure 6: Relax the candidates for launching jobs]

In order to make room for further optimization, we change the algorithm as follows. As before, we sort all users that are still running in the cluster according to C_i(t) in ascending order at each time t. For the user u with the smallest C_u(t), we define a set of users S(t, p) = {i : C_i(t) ≤ (1 + p) C_u(t)} for a fixed p > 0. Every user in the set S(t, p) is eligible to launch a job immediately after t. The decision on which user to select can be determined by other objective functions, e.g., improving resource utilization via bin-packing algorithms [9]. The reason why this algorithm maintains fairness can be argued heuristically as follows. If a user v always obtains less than its desired fraction of resources, its C_v(t) value will keep decreasing. This process eventually leads the set S(t, p) to contain only the single user v. It will thus be allocated resources to catch up with other users that may temporarily have been treated more advantageously.

6. RELATED WORK

As the first attempt at solving the multi-resource allocation problem for distributed computing, Ghodsi et al. [8] propose the dominant resource fairness (DRF) mechanism for multi-resource allocation in large-scale cloud computing clusters.
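A minimal sketch of the bookkeeping behind this relaxed rule, under the assumption that Z_i(t) and M(t) are piecewise constant between scheduling events (the data layout and function names are ours):

```python
def fairness_index(segments, s_i, t, l):
    """C_i(t) = integral of Z_i(x) * M(x) over [max(t - l, s_i), t], for a
    piecewise-constant history given as [(start, end, Z_i, M)] segments."""
    lo = max(t - l, s_i)
    total = 0.0
    for start, end, z, m in segments:
        a, b = max(start, lo), min(end, t)
        if b > a:
            total += (b - a) * z * m
    return total

def eligible_users(C, p):
    """Relaxed candidate set S(t, p) = {i : C_i(t) <= (1 + p) * C_u(t)},
    where u is the user with the smallest C value; any member may launch
    the next job, leaving room for, e.g., bin-packing objectives."""
    c_min = min(C.values())
    return {i for i, c in C.items() if c <= (1 + p) * c_min}

# A user served exactly its fair share (Z = 1/M) accumulates C_i(t) = l:
print(fairness_index([(0, 10, 0.5, 2.0)], 0, 10, 10))  # 10.0
print(sorted(eligible_users({"A": 9.0, "B": 10.0, "C": 14.0}, 0.2)))  # ['A', 'B']
```

With p = 0.2, user C (whose index 14.0 exceeds 1.2 × 9.0 = 10.8) is excluded, while A and B remain candidates among which a bin-packing objective may choose.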
DRF generalizes max-min fair sharing of a single resource type to multiple resource types with the goal of maximizing the minimum dominant share over all users. It also satisfies a number of useful game-theoretic properties such as sharing incentive (SI), Pareto efficiency (PE), strategy-proofness (SP) and envy-freeness (EF). DRF assumes that the resource pools are homogeneous and infinitely divisible, which is essentially equivalent to providing a single gigantic machine to meet the resource requests from all users. After DRF, multi-resource allocation has gained significant attention [17, 5, 10, 13, 14] from both the computer science and economics communities. Dolev et al. [5] introduce a new way of defining fairness based on bottleneck resources. They propose bottleneck-based fairness (BBF) allocation methods in which users are assigned at least their entitlement of their bottleneck resources and cannot raise justified complaints. They then theoretically prove the existence of fair allocations that admit no justified complaints. Later on, a number of techniques generalize DRF either to consider a larger class of fairness functions or to account for more general cases [10, 13, 17]. Gutman and Nisan [10] extend and generalize BBF [5] and DRF [8] to a larger family of utilities under Leontief preferences from an economic perspective. A new class of fairness notions, named generalized resource fairness (GRF), is defined, based on which the authors design two polynomial-time algorithms for fair allocation. Note that the second polynomial algorithm targets an open problem of computing a BBF allocation [5], relying on a corollary of finding competitive market equilibria in a Fisher market. Joe-Wong et al. [13] generalize DRF and present two classes of fairness functions: fairness on dominant shares (FDS) and generalized fairness on jobs (GFJ). Through these functions, they explore the trade-off between efficiency and fairness, study the impact of user requests with multiple resource types, and discuss conditions under which these fairness functions satisfy the game-theoretic properties.
In these techniques, the resources are all assumed to be continuously divisible. Parkes et al. [17] present another research effort that generalizes the DRF definition to account for per-job sharing weights and zero demands, and they compare with related work in the economics community using Leontief preferences. They also investigate the social welfare properties of fair allocation algorithms. Under the assumption that user requests are indivisible, they design a polynomial sequential min-max algorithm which allows job arrivals and departures as well as task profiles that change over time. They discuss the possibility and impossibility of fairness properties when requests are indivisible and provide results with a relaxed notion of fairness under certain settings. Our problem setting differs from the above literature in that we consider multiple different task profiles for a single user request, instead of assuming that each user request is associated with a single task profile. Two independent approaches [4, 14] investigate the impact of dynamic job arrivals and departures on the fairness and efficiency of the multi-resource allocation problem. Inspired by the proportional fairness of bandwidth sharing in networks, Bonald et al. [4] argue that proportional fairness can obtain a better efficiency-fairness trade-off. Kash et al. [14], on the other hand, introduce the new notion of dynamic Pareto optimality (DPO) and show that DPO and envy-freeness are incompatible. To tackle the incompatibility, they propose two relaxations: 1) dynamic envy-freeness (DEF), based on which a dynamic DRF allocation method is proposed that satisfies SI, DEF, SP and DPO; 2) cautious DPO (CDPO), based on which cautious LP allocations are designed that satisfy SI, EF, SP and CDPO. Both models assume divisible jobs. While a large portion of related prior art has concentrated on fair allocation under the assumption that the resource requests of jobs are divisible, Friedman et al. [6] study multi-resource fair sharing under the more realistic assumption that resource requests of jobs are indivisible and there are a number of machines which provide the available resources.
Their approach also considers environments where containers are used to achieve performance isolation between tenants. Under this setting, the authors show that a randomized resource allocation that is approximately ex-post fair, efficient and strategy-proof can be achieved by computing a weighted max-min over the convex hull of the feasible

region. Another research work, by Psomas and Schwartz [19], considers the case where task resources are indivisible and applies bin-packing algorithms for allocation. They demonstrate that the designed mergerDRF algorithm can increase utilization. In addition, DRF has been extended and applied in other fields such as network routing [7, 20], cluster resource allocation for hierarchical organizations [3] and IaaS clouds [16]. Egalitarian division [15] under Leontief preferences and the cake-cutting problem [18] discuss the fair resource allocation problem in an economic context under the assumption that resources are divisible. These efforts differ from our work in that we handle the problem in a more realistic setting where a single user submits multiple different task profiles concurrently.

7. CONCLUSION

We generalize the notion of slowdown to address fairness for multi-class workflows that demand multiple different resources. Two cases are considered, when jobs are divisible and indivisible, respectively. For the first case, where the operating point of a workflow moves dynamically as the scheduling decisions change, we compare the instantaneous operating point with one that is mapped onto a Pareto curve by assuming that the workflow has the whole cluster at its disposal. We investigate fairness among concurrent workflows using slowdown and optimize the mixture of different classes of jobs for a workflow for better slowdown. Then, for the second case, in order to handle the situation when jobs are non-preemptive and indivisible, we relax the strict fairness definition that only depends on the consumed resources at a particular time to an average slowdown measured over a time interval. This relaxation allows us to address fairness in more realistic application scenarios.

References

[1] Hadoop. http://hadoop.apache.org/.
[2] Spark. http://spark.apache.org/.
[3] A. A. Bhattacharya, D. Culler, E. Friedman, A. Ghodsi, S. Shenker, and I. Stoica. Hierarchical scheduling for diverse datacenter workloads.
In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 4:1–4:15, New York, NY, USA, 2013. ACM.
[4] T. Bonald and J. Roberts. Enhanced cluster computing performance through proportional fairness. pages 2–23, Turin, Italy, October 2014.
[5] D. Dolev, D. G. Feitelson, J. Y. Halpern, R. Kupferman, and N. Linial. No justified complaints: On fair sharing of multiple resources. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 68–75, New York, NY, USA, 2012. ACM.
[6] E. Friedman, A. Ghodsi, and C.-A. Psomas. Strategyproof allocation of discrete jobs on multiple machines. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC '14, pages 529–546, New York, NY, USA, 2014. ACM.
[7] A. Ghodsi, V. Sekar, M. Zaharia, and I. Stoica. Multi-resource fair queueing for packet processing. SIGCOMM Comput. Commun. Rev., 42(4):1–12, Aug. 2012.
[8] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, pages 323–336, Berkeley, CA, USA, 2011. USENIX Association.
[9] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM '14, pages 455–466, New York, NY, USA, 2014. ACM.
[10] A. Gutman and N. Nisan. Fair allocation without trade. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 719–728. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
[11] M. Harchol-Balter, K. Sigman, and A. Wierman. Asymptotic convergence of scheduling policies with respect to slowdown. October 2002.
[12] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks.
In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59–72, New York, NY, USA, 2007. ACM.
[13] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang. Multi-resource allocation: Fairness-efficiency tradeoffs in a unifying framework. In INFOCOM, 2012 Proceedings IEEE, pages 1206–1214, March 2012.
[14] I. Kash, A. D. Procaccia, and N. Shah. No agent left behind: Dynamic fair division of multiple resources. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '13, pages 351–358, Richland, SC, 2013. International Foundation for Autonomous Agents and Multiagent Systems.
[15] J. Li and J. Xue. Egalitarian division under Leontief preferences. Economic Theory, 54(3):597–622, 2013.
[16] H. Liu and B. He. Reciprocal resource fairness: Towards cooperative multiple resource fair sharing in IaaS clouds. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 970–981. IEEE Press, 2014.
[17] D. C. Parkes, A. D. Procaccia, and N. Shah. Beyond dominant resource fairness: Extensions, limitations, and indivisibilities. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC '12, pages 808–825, New York, NY, USA, 2012. ACM.
[18] A. D. Procaccia. Cake cutting: Not just child's play. Commun. ACM, 56(7):78–87, July 2013.
[19] C.-A. Psomas and J. Schwartz. Beyond beyond dominant resource fairness: Indivisible resource allocation in clusters. Technical report, Tech Report Berkeley, 2013.
[20] W. Wang, B. Li, and B. Liang. Multi-resource round robin: A low complexity packet scheduler with dominant resource fairness. In ICNP, pages 1–10, 2013.
[21] A. Wierman and M. Harchol-Balter. Classifying scheduling policies with respect to unfairness in an M/GI/1. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '03, pages 238–249, New York, NY, USA, 2003. ACM.

Exploiting Cloud Heterogeneity to Optimize Performance and Cost of MapReduce Processing

Zhuoyao Zhang, Google Inc., Mountain View, CA 94043, USA, zhuoyao@google.com
Ludmila Cherkasova, Hewlett-Packard Labs, Palo Alto, CA 94303, USA, lucy.cherkasova@hp.com
Boon Thau Loo, University of Pennsylvania, Philadelphia, PA 19104, USA, boonloo@cs.upenn.edu

ABSTRACT

Cloud computing offers a new, attractive option to customers for quickly provisioning a Hadoop cluster of any size, consuming resources as a service, executing their MapReduce workload, and then paying for the time these resources were used. One of the open questions in such environments is the right choice of resources (and their amount) a user should lease from the service provider. Typically, there is a variety of different types of VM instances in the Cloud (e.g., small, medium, or large EC2 instances). The capacity differences of the offered VMs are reflected in the VMs' pricing. Therefore, for the same price a user can get a variety of Hadoop clusters based on different VM instance types. We observe that the performance of MapReduce applications may vary significantly on different platforms. This makes the selection of the best cost/performance platform for a given workload a non-trivial problem, especially when it contains multiple jobs with different platform preferences. We aim to solve the following problem: given a completion time target for a set of MapReduce jobs, determine a homogeneous or heterogeneous Hadoop cluster configuration (i.e., the number and types of VMs, and the job schedule) for processing these jobs within a given deadline while minimizing the rented infrastructure cost. In this work,¹ we design an efficient and fast simulation-based framework for evaluating and selecting the right underlying platform for achieving the desired Service Level Objectives (SLOs). Our evaluation study with the Amazon EC2 platform reveals that for different workload mixes, an optimized platform choice may result in 45–68% cost savings for achieving the same performance objectives when using different (but seemingly equivalent) choices.
Moreover, depending on the workload, the heterogeneous solution may outperform the homogeneous cluster solution by 26–42%. We provide additional insights explaining the obtained results by profiling the performance characteristics of the used applications and the underlying EC2 platforms. The results of our simulation study are validated through experiments with Hadoop clusters deployed on different Amazon EC2 instances.

1. INTRODUCTION

Cloud computing offers a new delivery model with virtually unlimited computing and storage resources. This is an attractive option for many users because acquiring, setting up, and maintaining a complex, large-scale infrastructure such as a Hadoop cluster requires a significant up-front investment in the new infrastructure, training of new personnel, and then continuous maintenance and management support, which can be difficult to justify. Cloud computing offers a compelling, cost-efficient approach that allows users to rent resources in a pay-per-use manner. For many users this creates an attractive and affordable alternative compared to acquiring and maintaining their own infrastructure. A typical cloud environment provides a selection of Virtual Machines of different capacities for deployment at different prices per time unit. For example, the Amazon EC2 platform offers a choice of small, medium, and large VM instances (among other choices), where the CPU and RAM capacity of a medium VM instance is two times larger than the capacity of a small VM instance, and the CPU and RAM capacity of a large VM instance is two times larger than the capacity of a medium VM instance. This resource difference is also reflected in the price: the large instance is twice (four times) more expensive compared with the medium (small) VM instance.

¹ This paper is an extended version of our earlier workshop paper [2]. This work originated during Z. Zhang's internship at HP Labs. Prof. B. T. Loo is supported in part by NSF grants CNS-1117185 and CNS-0845552. Copyright is held by author/owner(s).
Therefore, a user is facing a variety of platform and configuration choices that can be obtained for the same cost. To demonstrate the challenges in making an optimized platform choice, we performed a set of experiments with two popular applications, TeraSort and KMeans,² on three Hadoop clusters³ deployed with different VM instance types: 40 small VMs, each configured with 1 map and 1 reduce slot; 20 medium VMs, each configured with 2 map and 2 reduce slots; and 10 large VMs, each configured with 4 map and 4 reduce slots. Therefore, the three Hadoop clusters can be obtained for the same price per time unit, and they have the same number of map and reduce slots for processing (where each slot is provisioned with the same CPU and RAM capacities). Figure 1 shows the summary of our experiments with TeraSort and KMeans.

[Figure 1: Normalized completion time of two applications executed on different types of EC2 instances. (a) TeraSort; (b) KMeans.]

Apparently, the Hadoop cluster with 40 small VMs provides the best completion time for the TeraSort application, as shown in Figure 1(a). The completion time of TeraSort on the cluster with small VMs is 5.5 (2.3) times better, i.e., smaller, than on the cluster with large (medium) VMs. Since the cost of all three clusters per time unit is the same, the shortest completion time results in the lowest monetary cost the customer should pay. Therefore, the Hadoop cluster with 40 small VMs offers the best solution for TeraSort. By contrast, the Hadoop cluster with 10 large VMs is the best option for KMeans, as shown in Figure 1(b). It outperforms the Hadoop cluster with small VMs by 2.6 times when processing KMeans. This experiment demonstrates that seemingly equivalent platform choices for a Hadoop cluster in the Cloud might result in different application performance, which could lead to a different provisioning cost. The problem of optimized platform choice becomes even more complex when a given workload contains multiple jobs with different performance preferences. Intuitively, if the performance of all jobs in the set would benefit from the small VMs (or large VMs), then the platform choice for a corresponding Hadoop cluster is relatively straightforward. However, if a given set of jobs contains applications with different performance preferences, then the platform choice becomes non-trivial. Figure 2 shows completion times (absolute, not normalized) of TeraSort and KMeans on the three Hadoop clusters deployed with different VM instance types (these graphs resemble the normalized results shown in Figure 1). Apparently, when making a decision on the best platform for a Hadoop cluster to execute both of these applications (as a set) in the most cost-effective way, one needs to look at the reduction of absolute execution times due to the choice of a common underlying platform.

[Figure 2: Completion time of two applications executed on different types of EC2 instances. (a) TeraSort; (b) KMeans.]

² In this work, we use a set of 13 applications released by the Tarazu project [2], with TeraSort and KMeans among them. Table 2 in Section 6 provides application details and corresponding job settings (the number of map/reduce tasks, dataset sizes, etc.).
³ We use Hadoop 1.0.0 in all experiments in the paper.
Apparently, the absolute time benefits of processing KMeans on the large VMs significantly outweigh the benefits of processing TeraSort on the small VMs. In this work, we aim to solve the problem of the platform choice to provide the best cost/performance trade-offs for a given MapReduce workload and Hadoop cluster. As shown in Figures 1 and 2, this choice is non-trivial and depends on the application characteristics. The problem is even more difficult when the performance objective is to minimize the makespan (the overall completion time) of a given job set. In this work, we first offer a framework for solving the following two problems. Given a workload, select the type and size of the underlying platform for a homogeneous Hadoop cluster that provides the best cost/performance trade-offs: i) minimizing the cost (budget) while achieving a given makespan target, or ii) minimizing the achievable jobs' makespan for a given budget. We also observe that a user might have additional considerations for the case of node failure(s). Hadoop is designed to support fault tolerance, i.e., it will finish job processing even in the case of a node failure by using the remaining resources and restarting/recomputing failed tasks. However, if the cluster is based on 40 small VM instances, then a single node failure leads to a loss of 2.5% of the overall resources, and it impacts only a limited number of map and reduce tasks. In the cluster based on 10 large VM instances, however, a single node failure leads to a loss of 10% of the overall resources and a much higher number of impacted map and reduce tasks. We provide an extension of the proposed framework for selecting and sizing a Hadoop cluster to support the job performance objectives in case of node failure(s) in the cluster. In our earlier work [22], we discussed a framework for the optimized platform selection of a single homogeneous Hadoop cluster. However, a homogeneous cluster might not always present the best solution.
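The first of the two homogeneous-cluster problems can be read as a brute-force search over candidate configurations. This is a sketch under our own simplifying assumptions: the trace-driven simulator described later is replaced by a caller-supplied `estimate_makespan` stub, and cost is price per node per time unit times nodes times makespan (all names and the toy estimator are ours):

```python
def cheapest_config(platforms, sizes, deadline, estimate_makespan):
    """Brute-force search: among (platform, cluster size) candidates, return
    (cost, platform, size) for the cheapest configuration whose estimated
    makespan meets the deadline. `platforms` maps platform name to price per
    node per time unit; `estimate_makespan(name, n)` stands in for the
    simulator."""
    best = None
    for name, price in platforms.items():
        for n in sizes:
            makespan = estimate_makespan(name, n)
            if makespan <= deadline:
                cost = price * n * makespan
                if best is None or cost < best[0]:
                    best = (cost, name, n)
    return best

# Toy estimator: 4000 task-seconds of work, perfectly divisible across slots.
est = lambda name, n: 4000 / ({"small": 1, "medium": 2, "large": 4}[name] * n)
print(cheapest_config({"small": 1, "medium": 2, "large": 4}, [10, 20, 40], 100, est))
# (4000.0, 'small', 40)
```

With this deliberately linear toy estimator all deadline-feasible options cost the same, which is exactly why the real framework needs measured, application-specific makespan estimates: the interesting savings come from applications that scale differently on different VM types.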
Intuitively, if a given set of jobs contains applications with different platform preferences, then a heterogeneous solution (one that combines Hadoop clusters deployed with different instance types) might be a better choice. To support the choice of a heterogeneous solution, we introduce an application preference ranking to reflect the strength of an application's preference between different VM types and the possible impact on the provisioning cost (see our discussion related to the absolute completion times of KMeans and TeraSort shown in Figure 2). This preference ranking guides the construction of a heterogeneous solution. In the designed simulation-based framework, we collect job profiles from a given set, create an optimized job schedule that minimizes the jobs' makespan (as a function of the job profiles and the cluster size), and then obtain accurate estimates of the achievable makespan by replaying the job traces in the simulator. Based on the cost of the best homogeneous Hadoop cluster, we provide a quick walk through a set of heterogeneous solutions (and the corresponding partitioning of jobs into different pools) to see whether there is a heterogeneous solution that can process the given jobs within the deadline but at a smaller cost. In our performance study, we use a set of 13 diverse MapReduce applications for creating three different workloads. Our experiments with the Amazon EC2 platform reveal that for different workloads, an optimized platform choice may result in up to 45–68% cost savings for achieving the same performance objectives when using different (but seemingly equivalent) choices. Moreover, depending on the workload, the heterogeneous solution may outperform the homogeneous one by 26–42%. The results of our simulation study are validated through experiments with Hadoop clusters deployed on different Amazon EC2 instances. The rest of the paper is organized as follows. Section 2 outlines our approach and explains the details of the building blocks used in our solution.
Section 3 describes the general problem definition (two separate cases) for the homogeneous cluster case and outlines both solutions. Section 4 outlines the extension of the proposed framework for the case with node failure(s). Section 5 motivates the heterogeneous cluster solution and provides the corresponding provisioning algorithm. Section 6 presents the evaluation study, comparing the effectiveness of the proposed algorithms and their outcomes for different workloads. Section 7 outlines related work. Section 8 summarizes our contributions and gives directions for future work.

2. BUILDING BLOCKS

In this section, we outline our approach and explain the details of the following building blocks used in our solution: i) collected job traces and job profiles; ii) an optimized job schedule to minimize the jobs' execution makespan; iii) the MapReduce simulator that replays the job traces according to the generated job schedule to obtain accurate estimates of job performance and cost values.

1) Job Traces and Profiles: In summary, a MapReduce job execution is comprised of two stages: a map stage and a reduce stage. The map stage is partitioned into map tasks and the reduce stage is partitioned into reduce tasks, and they are distributed and executed across multiple machines. Each map task processes a logical split of the input data that generally resides on a distributed file system. The map task applies the user-defined map function on each record and buffers the resulting output. This intermediate data is hash-partitioned for the different

reduce tasks and written to the local hard disk of the worker executing the map task. We use the past job run(s) for creating the job traces that contain the recorded durations of all processed map and reduce tasks.⁴ A similar job trace can be extracted from the Hadoop job tracker logs using tools such as Rumen [1]. The obtained map/reduce task distributions can be used for extracting the distribution parameters and generating scaled traces, i.e., generating the replayable traces of the job execution on a large dataset from a sample job execution on a smaller dataset, as described in [13]. These job traces can be replayed using a MapReduce simulator [12] and used for creating the compact job profile for analytic models. For predicting the job completion time we use a compact job profile that characterizes the job execution during the map, shuffle, and reduce phases via average and maximum task durations. The proposed MapReduce performance model [14] evaluates lower bounds T_J^low and upper bounds T_J^up on the job completion time. It is based on the Makespan Theorem [13] for computing performance bounds on the completion time of a given set of n tasks that are processed by k servers (e.g., n map tasks processed by k map slots in a MapReduce environment): the completion time of the entire set of n tasks, with average task duration avg and maximum task duration max, is proven to be at least

T^low = avg · n / k

and at most

T^up = avg · (n − 1) / k + max.

The difference between the lower and upper bounds represents the range of possible completion times due to task scheduling non-determinism. As was shown in [14], the average of the lower and upper bounds (T_J^avg) is a good approximation of the job completion time (typically, it is within 10%). Using this approach, we can estimate the duration of the map and reduce stages of a given job as a function of the allocated resources (i.e., on Hadoop clusters of different sizes). In particular, we apply this analytic model in the process of building an optimized job schedule to minimize the overall jobs' execution time.
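The two bounds translate directly into code. A small sketch (function and variable names are ours), where `durations` are the measured task durations from a job trace and `k` is the number of slots:

```python
def completion_bounds(durations, k):
    """Makespan Theorem bounds for n tasks on k slots:
    T_low = avg * n / k  and  T_up = avg * (n - 1) / k + max,
    where avg and max are the average and maximum task durations.
    Returns (T_low, T_up, (T_low + T_up) / 2)."""
    n = len(durations)
    avg = sum(durations) / n
    low = avg * n / k
    up = avg * (n - 1) / k + max(durations)
    return low, up, (low + up) / 2

# 100 identical 10-second tasks on 10 slots:
low, up, approx = completion_bounds([10.0] * 100, 10)
print(low, up, approx)  # 100.0 109.0 104.5
```

For identical tasks the true makespan is exactly 100 s, so the bounds bracket it tightly; the gap widens as the task durations become more skewed, which is precisely the scheduling non-determinism the bounds capture.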
2) An Optimized Job Schedule: It was observed [15, 21] that for a set of MapReduce jobs (with no data dependencies between them) the order in which the jobs are executed might have a significant impact on the jobs' makespan, i.e., the jobs' overall completion time, and therefore on the cost of the rented Hadoop cluster. For data-independent jobs, once the first job completes its map stage and begins the reduce stage, the next job can start executing its map stage with the released map resources in a pipelined fashion. There is an overlap in the executions of the map stage of the next job and the reduce stage of the previous one. As an illustration, let us consider two MapReduce jobs with the following map and reduce stage durations: job J_1 has a map stage duration of J_1^M = 100s and a reduce stage duration of J_1^R = 10s; job J_2 has a map stage duration of J_2^M = 10s and a reduce stage duration of J_2^R = 100s. There are two possible executions, shown in Figure 3:

- J_1 is followed by J_2, shown in Figure 3(a). The reduce stage of J_1 overlaps with the map stage of J_2, leading to an overlap of only 10s. The total completion time of processing the two jobs is 100s + 10s + 100s = 210s.
- J_2 is followed by J_1, shown in Figure 3(b). The reduce stage of J_2 overlaps with the map stage of J_1, leading to a much better pipelined execution and a larger overlap of 100s. The total makespan is 10s + 100s + 10s = 120s.

⁴ The shuffle stage is included in the reduce task. For the first shuffle phase, which overlaps with the entire map phase, only the complementary (non-overlapping) portion is included in the reduce task.

[Figure 3: Impact of the job schedule on the jobs' completion time. (a) J_1 is followed by J_2. (b) J_2 is followed by J_1.]

There is a significant difference in the jobs' makespan (75% in the example above) depending on the execution order of the jobs.
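The two orderings above can be checked with a simplified stage-level model (our own abstraction, not the paper's simulator): map stages run back to back, and each job's reduce stage starts once its own map stage is done and the previous reduce stage has released the reduce slots.

```python
def pipelined_makespan(jobs):
    """Makespan of data-independent two-stage jobs run in the given order,
    in a simplified model: job i+1's map stage starts when job i's map
    stage ends, and reduce stages are serialized behind one another."""
    map_free, reduce_free = 0.0, 0.0
    for m, r in jobs:
        map_free += m                                # map stages pipeline
        reduce_free = max(reduce_free, map_free) + r # reduce waits for both
    return reduce_free

J1, J2 = (100, 10), (10, 100)
print(pipelined_makespan([J1, J2]))  # 210.0
print(pipelined_makespan([J2, J1]))  # 120.0
```

The model reproduces the 210s vs. 120s makespans of the example, and hence the 75% gap between the two orders.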
Since in this work we consider the problem of minimizing the cost of the rented Hadoop cluster, and the jobs' completion time directly impacts this cost, we aim to generate the job execution order that minimizes the jobs' makespan. Thus, let J = {J_1, J_2, ..., J_n} be a set of n MapReduce jobs with no data dependencies between them. For minimizing the makespan of a given set of MapReduce jobs, we apply the classic Johnson algorithm [6], which was proposed for building an optimal job schedule in two-stage production systems. Johnson's schedule can be efficiently applied to minimizing the makespan of MapReduce jobs, as was shown in [15]. Let us consider a collection J of n jobs, where each job J_i is represented by the pair (m_i, r_i) of map and reduce stage durations, respectively. Note that we can estimate m_i and r_i by using the bounds-based model. Let us augment each job J_i = (m_i, r_i) with an attribute D_i that is defined as follows:

D_i = (m_i, m) if min(m_i, r_i) = m_i, and D_i = (r_i, r) otherwise.

The first argument in D_i is called the stage duration and denoted D_i^1. The second argument is called the stage type (map or reduce) and denoted D_i^2. Algorithm 1 shows how an optimal schedule can be constructed using Johnson's algorithm.

Algorithm 1 Johnson's Algorithm
Input: A set J of n MapReduce jobs; D_i is the attribute of job J_i as defined above.
Output: Schedule σ (order of job execution).
1: Sort the original set J of jobs into the ordered list L using their stage duration attribute D_i^1
2: head ← 1, tail ← n
3: for each job J_i in L do
4:   if D_i^2 = m then
5:     // Put job J_i at the front
6:     σ_head ← J_i, head ← head + 1
7:   else
8:     // Put job J_i at the end
9:     σ_tail ← J_i, tail ← tail − 1
10:  end if
11: end for

First, we sort all n jobs from the original set J into the ordered list L in such a way that job J_i precedes job J_{i+1} if and only if min(m_i, r_i) ≤ min(m_{i+1}, r_{i+1}). In other words, we sort the jobs using the stage duration attribute D_i^1 in D_i (it represents the smaller duration of the two stages).
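Algorithm 1 is short enough to transcribe directly; this is a sketch of Johnson's rule in Python (names are ours), returning job indices in schedule order:

```python
def johnson_schedule(jobs):
    """Johnson's rule for two-stage jobs given as (map_duration,
    reduce_duration) pairs. Jobs are sorted by min(m, r); map-limited jobs
    fill the schedule from the head, reduce-limited jobs from the tail."""
    order = sorted(range(len(jobs)), key=lambda i: min(jobs[i]))
    schedule = [None] * len(jobs)
    head, tail = 0, len(jobs) - 1
    for i in order:
        m, r = jobs[i]
        if m <= r:               # stage type "map": place from the front
            schedule[head] = i
            head += 1
        else:                    # stage type "reduce": place from the end
            schedule[tail] = i
            tail -= 1
    return schedule

# The two jobs from the earlier example, J1 = (100, 10) and J2 = (10, 100):
print(johnson_schedule([(100, 10), (10, 100)]))  # [1, 0]: map-light J2 first
```

On the earlier example the rule recovers the better order (J_2 before J_1), which achieved the 120s makespan instead of 210s.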
Then the algorithm works by taking jobs from the list L and placing them into the schedule σ from both ends (head and tail), proceeding towards the middle. If the stage type in Di is m, i.e., it represents the map stage, then job Ji is placed from the head of the schedule; otherwise, it is placed from the tail. The complexity of Johnson's algorithm is dominated by the sorting operation and thus is O(n log n).
3) MapReduce Simulator: Since users rent Cloud resources in a pay-per-use fashion, it is important to accurately estimate the execution time of a given set of jobs according to a generated Johnson schedule on a Hadoop cluster of a given size. In this work, we use an enhanced version of the MapReduce simulator SimMR [12]. This simulator can accurately replay job traces and reproduce the original job processing: the completion times of the simulated jobs are within 5% of the original ones, as shown in [12]. Moreover, SimMR is a very fast simulator: it can process over one million events per second. Therefore, we can quickly explore the entire solution space (in a brute-force search manner). The main structure of SimMR is shown in Figure 4.
Figure 4: MapReduce Simulator SimMR (inputs: workloads with job profiles and a cluster description; components: trace generator, simulator engine, scheduler; output: makespan and costs).
The basic blocks of the simulator are the following:
1. Trace Generator: a module that generates a replayable workload trace. This trace is generated either from the detailed job profile (provided by the Job Profiler) or by feeding in the distribution parameters for generating a synthetic trace (this path is taken when we need to generate job execution traces from the sampled executions on smaller datasets).
2. Simulator Engine: a discrete event simulator that takes the cluster configuration information and accurately emulates the Hadoop job master decisions for map/reduce slot allocations across multiple jobs.
3. Scheduling Policy: a scheduling module that dictates the ordering of the jobs and the amount of resources allocated to different jobs over time.
Thus, for a given Hadoop cluster size, a given set of jobs, and a generated Johnson schedule, the simulator can accurately estimate the jobs' completion time (makespan) by replaying the job traces according to the generated schedule.
3. HOMOGENEOUS CLUSTER SOLUTION
In this work, we consider the following two problems for the homogeneous cluster case:
- For a given workload defined as a set of jobs W = {J1, J2, ..., Jn} to be processed within deadline D, determine a Hadoop cluster configuration (i.e., the number and types of VM instances, and the job schedule) for processing these jobs within the given deadline while minimizing the monetary cost for the rented infrastructure.
- For a given workload W = {J1, J2, ..., Jn} and a given customer budget B, determine a Hadoop cluster configuration (i.e., the number and types of VM instances, and the job schedule) for processing these jobs within the allocated monetary cost for the rented infrastructure while minimizing the jobs' processing time.
Our solution is based on a simulation framework: in a brute-force manner, it searches through the entire solution space by exhaustively enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's requirements. Figure 5 shows the diagram of the framework execution in the decision making process per selected platform type.
Figure 5: Outline of the homogeneous cluster solution.
For example, if the platforms of interest are small, medium, and large EC2 VM instances, then the framework will generate three trade-off curves. For each platform and a given Hadoop cluster size, the Job Scheduler component generates the optimized MapReduce job schedule. Then the jobs' makespan is obtained by replaying the job traces in the simulator according to the generated schedule. After that, the size of the cluster is increased by one instance (in the cloud environment, this is equivalent to adding a node to the Hadoop cluster) and the iteration is repeated: a new job schedule is generated and its makespan is evaluated with the simulator, and so on. We have a choice of stop conditions for the iterations: either a user can set a range of values for the cluster size N^type_max (driven by the budget B which the customer intends to spend), or the iterations stop when, at some point, an increased cluster size does not improve the achievable makespan.
The latter condition typically happens when the Hadoop cluster is large enough to accommodate all the jobs executed concurrently, and therefore an increased cluster size cannot improve the jobs' makespan. Assume that the given set of jobs should be processed within deadline D, and let Price_type be the price of a type VM instance per time unit. Then a customer with budget B can rent at most N^type_max VM instances of a given type:

    N^type_max = B / (D · Price_type)        (1)

Algorithm 2 shows the pseudo-code to determine the size of a cluster based on type VM instances for processing W within deadline D at the minimal monetary cost. The algorithm iterates through an increasing number of instances for the Hadoop cluster. It simulates the completion time of workload W processed with Johnson's schedule on a cluster of the given size and computes the corresponding cost (lines 2-6). Note that k defines the number of worker nodes in the cluster; the overall Hadoop cluster size is k + 1 nodes (we add a dedicated node for the JobTracker and NameNode, which is included in the cost). The variable min_cost_type keeps track of the minimal cost found so far (lines 7-8) for a Hadoop cluster that can process W within deadline D. A reader might ask why we continue iterating through an increasing number of instances once we have found a solution that can process W within deadline D. At first glance, as we keep increasing the Hadoop cluster, the solution should only become more expensive. In reality, this is not always true: there could be a situation when a few additional nodes and a different Johnson schedule significantly improve the makespan of a given workload, and as a result the cost of this larger cluster is smaller (due to the significantly improved workload completion time). Later, in the evaluation section, we will show examples of such situations.

Algorithm 2 Provisioning solution for a homogeneous cluster to process W with deadline D while minimizing the cluster cost
Input: W = {J1, J2, ..., Jn}: workload with traces and profiles for each job; type: VM instance type, e.g., type ∈ {small, medium, large}; N^type_max: the maximum number of instances to rent; Price_type: unit price of a type VM instance; D: a given time deadline for processing W.
Output: N_type: an optimized number of type VM instances for a cluster; min_cost_type: the minimal monetary cost for processing W.
1: min_cost_type ← ∞
2: for k ← 1 to N^type_max do
3:   // Simulate the completion time for processing workload W with k VMs
4:   Cur_CT = Simulate(type, k, W)
5:   // Calculate the corresponding monetary cost
6:   cost = Price_type × (k + 1) × Cur_CT
7:   if Cur_CT ≤ D & cost < min_cost_type then
8:     min_cost_type ← cost, N_type ← k
9:   end if
10: end for

We apply Algorithm 2 to the different types of VM instances, e.g., small, medium, and large, respectively. After that, we compare the produced outcomes and make the final provisioning decision. In a similar way, we can solve the related problem, when for a given customer budget B we need to determine a Hadoop cluster configuration for processing a given workload W within the allocated monetary cost for the rented infrastructure while minimizing the jobs' processing time. Algorithm 3 shows the pseudo-code to determine the size of a cluster based on type VM instances for processing W within a monetary budget B that results in the minimal workload processing time.

Algorithm 3 Provisioning solution for a homogeneous cluster to process W with a budget B while minimizing the processing time
Input: W = {J1, J2, ..., Jn}: workload with traces and profiles for each job; type: VM instance type, e.g., type ∈ {small, medium, large}; N^type_max: the maximum number of instances to rent; Price_type: unit price of a type VM instance; B: a given monetary budget for processing W.
Output: N_type: an optimized number of type VM instances for a cluster; min_CT_type: the minimal completion time for processing W.
1: min_CT_type ← ∞
2: for k ← 1 to N^type_max do
3:   // Simulate the completion time for processing workload W with k VMs
4:   Cur_CT = Simulate(type, k, W)
5:   // Calculate the corresponding monetary cost
6:   cost = Price_type × (k + 1) × Cur_CT
7:   if Cur_CT < min_CT_type & cost ≤ B then
8:     min_CT_type ← Cur_CT, N_type ← k
9:   end if
10: end for

Algorithms 2 and 3 follow a similar structure: in a brute-force manner, they search through the entire solution space by exhaustively enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's requirements.

4. GENERAL CASE WITH NODE FAILURES
The application performance of a customer workload may vary significantly on different platforms. Seemingly equivalent platform choices for a Hadoop cluster in the Cloud might result in different application performance and a different provisioning cost, which leads to the problem of an optimized platform choice that can be obtained for the same budget. Moreover, there could be an additional issue for the user to consider: the impact of node failures on the choice of the underlying platform for a Hadoop cluster. Hadoop is designed to support fault tolerance, i.e., it finishes job processing even in the presence of node failures. It uses the remaining resources for restarting and recomputing failed tasks. Let us consider the motivating example described in Section 1, where a user may deploy three different clusters with different types of VM instances for the same budget: 40 small VMs, each configured with 1 map and 1 reduce slot; 20 medium VMs, each configured with 2 map and 2 reduce slots; and 10 large VMs, each configured with 4 map and 4 reduce slots. Now, let us see how a node failure may impact cluster performance when the Hadoop nodes are based on different VM types. If a Hadoop cluster is based on 40 small instances, then a single node failure leads to a loss of 2.5% of the overall resources, and only a limited number of map and reduce tasks might be impacted.
In contrast, in a cluster based on 10 large instances, a single node failure leads to a loss of 10% of the overall resources, and a much higher number of map and reduce tasks might be impacted. For a business-critical, production workload W, a user may consider generalized service level objectives (SLOs) that include two separate conditions:
- a desirable completion time D for the entire set of jobs in the workload W under normal conditions;
- an acceptable degraded completion time D_deg for processing W in the case of a 1-node failure.
So, the problem is to determine a Hadoop cluster configuration (i.e., the number and types of VM instances, and the job schedule) for processing workload W with the makespan target D while minimizing the cost, such that the chosen solution also supports the degraded makespan D_deg in the case of a 1-node failure during the jobs' processing. The approach proposed in the previous Section 3 can be generalized for the case with node failures. Figure 6 shows the extended diagram of the framework execution and decision making process per selected platform type with node failures. For example, if the platforms of interest are small, medium, and large EC2 VM instances, then the framework will generate three different trade-off sets. For each platform and a given Hadoop cluster size N, the Job Scheduler component generates an optimized MapReduce job schedule based on Johnson's algorithm. Then the jobs' makespan (in the normal mode) is obtained by replaying the job traces in the simulator according to the generated schedule. In parallel (see the lower branch in Figure 6, which represents the case of a 1-node failure), the jobs' makespan is obtained by replaying the job traces according to the same generated job schedule on the decreased cluster size N − 1. After both branches are finished, the size of the cluster is increased by one instance (in the cloud environment, this is equivalent to adding a node to the Hadoop cluster) and the iteration is repeated: a new job schedule is generated and its makespan is evaluated with the simulator for both modes, normal and 1-node failure, and so on.
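The per-cluster-size iteration just described (simulate the normal mode on k workers and the degraded mode on k − 1 workers with the same schedule, then keep the cheapest size meeting both deadlines) can be sketched as follows; `simulate` is a stand-in for the SimMR replay, and all names are illustrative assumptions:

```python
def provision(sizes, simulate, price, D, D_deg):
    """Pick the cheapest worker count k (from `sizes`, all >= 2) whose
    simulated makespan meets deadline D on k workers and deadline D_deg
    on k - 1 workers (1-node failure), with cost = price * (k + 1) * CT
    (the extra node hosts the JobTracker and NameNode)."""
    best_k, best_cost = None, float("inf")
    for k in sizes:
        ct = simulate(k)              # normal mode
        ct_deg = simulate(k - 1)      # degraded mode, same job schedule
        cost = price * (k + 1) * ct
        if ct <= D and ct_deg <= D_deg and cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Toy example: makespan shrinks as 120 / k, so a larger cluster can be
# both faster and cheaper (cost is proportional to (k + 1) * CT).
k, cost = provision(range(2, 6), lambda k: 120 / k, price=1.0, D=30, D_deg=60)
print(k, cost)  # 5 144.0
```

The toy run also illustrates the earlier observation that a larger cluster can be the cheaper feasible solution: k = 4 is feasible at cost 150, but k = 5 finishes faster and costs 144.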

Figure 6: Solution Outline.
The cluster size (for each instance type) is selected based on the validity of both SLO conditions: for normal execution, to satisfy D, and in the case of a 1-node failure, to support the degraded makespan D_deg. Algorithm 4 shows the pseudo-code to determine the size of a cluster based on type VM instances for processing W with the generalized SLOs at the minimal monetary cost (footnote 5).

Algorithm 4 Provisioning solution for a homogeneous cluster to process W with a deadline D and a degraded deadline D_deg in case of a 1-node failure while minimizing the cluster cost
Input: W = {J1, J2, ..., Jn}: workload with traces and profiles for each job; type: VM instance type, e.g., type ∈ {small, medium, large}; N^type_max: the maximum number of instances to rent; Price_type: unit price of a type VM instance; D: a given time deadline for processing W; D_deg: a given degraded deadline for processing W with a 1-node failure.
Output: N_type: an optimized number of type VM instances for a cluster; min_cost_type: the minimal monetary cost for processing W.
1: min_cost_type ← ∞
2: for k ← 1 to N^type_max do
3:   // Simulate the completion time for processing workload W with k VMs
4:   Cur_CT = Simulate(type, k, W)
5:   // Simulate processing W in the degraded mode with (k − 1) VMs
6:   Cur_CT_deg = Simulate_deg(type, k − 1, W)
7:   // Calculate the corresponding monetary cost
8:   cost = Price_type × (k + 1) × Cur_CT
9:   if Cur_CT ≤ D & Cur_CT_deg ≤ D_deg & cost < min_cost_type then
10:    min_cost_type ← cost, N_type ← k
11:  end if
12: end for

The algorithm iterates through an increasing number of instances for the Hadoop cluster. It simulates the completion time of workload W processed with Johnson's schedule on a cluster of the given size k. Note that the same job schedule is used for processing W in the case of a node failure, i.e., on a cluster of size k − 1. The overall Hadoop cluster size is k + 1 nodes (k defines the number of worker nodes in the cluster, and we add a dedicated node for the JobTracker and NameNode, which is included in the cost).
The variable min_cost_type keeps track of the minimal cost found so far (lines 9-10) for a Hadoop cluster that can process W within deadline D under normal conditions and within deadline D_deg in the case of a 1-node failure.
Footnote 5: The proposed solution can be generalized for the case with multiple node failures.

5. HETEROGENEOUS SOLUTION
In Section 1, we discussed a motivating example by analyzing TeraSort and KMeans performance on Hadoop clusters formed with different EC2 instances, and observed that these applications benefit from different types of VMs as their preferred choice. Therefore, a single homogeneous cluster might not always be the best choice for a workload mix with different applications, and a heterogeneous solution might offer a better cost/performance outcome. However, an individual application's preference often depends on the size of the Hadoop cluster and the given performance goals. Continuing the motivating example from Section 1, Figure 7 shows the trade-off curves for three representative applications, TeraSort, KMeans, and AdjList (footnote 6), obtained as the result of exhaustive simulation of application completion times on Hadoop clusters of different sizes. The Y-axis represents the job completion time, while the X-axis shows the corresponding monetary cost. Each figure shows three curves for application processing by a homogeneous Hadoop cluster based on small, medium, and large VM instances, respectively. First of all, the same application can result in different completion times when processed on the same platform at the same cost. This reflects an interesting phenomenon of the pay-per-use model. There are situations when a cluster of size N processes a job in T time units, while a cluster of size 2N may process the same job in T/2 time units. Interestingly, these two different-size clusters have the same cost, and if the purpose is meeting a deadline D where T ≤ D, then both clusters meet the performance objective. Second, we can make an orthogonal observation: in many cases, the same completion time can be achieved at a different cost (on the same platform type).
Typically, this corresponds to the case when an increased Hadoop cluster size does not further improve the job processing time. Finally, according to Figure 7, we can see that for TeraSort the small instances are the best choice, while for KMeans the large instances represent the most cost-efficient platform. However, the optimal choice for AdjList is not as clear: it depends on the deadline requirements, and its trade-off curves are much closer to each other than those for TeraSort and KMeans. Another important point is that the cost savings vary across different applications; e.g., the execution of KMeans on large VM instances leads to higher cost savings than the execution of TeraSort on small VMs.
Footnote 6: Table 2 in Section 6 provides details about these applications and their job settings.
Thus, if we would like to partition a given workload W = {J1, J2, ..., Jn} into two groups of applications, each to be executed by a Hadoop cluster based on a different type of VM instances,

we need to be able to rank (order) these applications with respect to the strength of their preference between the two considered platforms.
Figure 7: Performance versus cost trade-offs for different applications: (a) TeraSort, (b) KMeans, (c) AdjList (makespan versus monetary cost, with curves for small, medium, and large instances).
In this work, we consider a heterogeneous solution that consists of two homogeneous Hadoop sub-clusters deployed with different types of VM instances (footnote 7). As an example, we consider a heterogeneous solution formed by small (S) and large (L) VM instances. To measure the strength of an application's preference between the two VM types, we introduce an application preference score PScore_{S,L}, defined as the difference between the normalized costs of the simulated cost-performance curves (such as those shown in Figure 7):

    PScore_{S,L} = (1 / N^S_max) · Σ_{i=1..N^S_max} Cost^S_i − (1 / N^L_max) · Σ_{i=1..N^L_max} Cost^L_i        (2)

where N^S_max and N^L_max are defined by Eq. 1 for Hadoop clusters with small and large VM instances, respectively. The value of PScore_{S,L} indicates the possible impact on the provisioning cost: a large negative (positive) value indicates a stronger preference for small (large) VM instances, while values closer to 0 reflect less sensitivity to the platform choice. For an optimized heterogeneous solution, we need to determine the following parameters:
- the number of instances for each sub-cluster (i.e., the number of worker nodes plus a dedicated node to host the JobTracker and NameNode of each sub-cluster);
- the subset of applications to be executed on each cluster.
Algorithm 5 shows the pseudo-code of our heterogeneous solution. For presentation simplicity, we show the code for a heterogeneous solution with small and large VM instances. First, we sort the jobs in ascending order according to their preference ranking PScore_{S,L}. Thus, the jobs at the beginning of the list have a performance preference for executing on the small instances.
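This ranking step can be sketched as follows. The exact normalization in Eq. (2) is reconstructed here as a difference of mean per-cluster-size costs; the helper names and cost-curve inputs are illustrative assumptions, not the paper's code:

```python
def preference_score(costs_small, costs_large):
    """Difference of the mean simulated provisioning costs on the two
    platforms: negative favors small VMs, positive favors large VMs.
    Inputs are the per-cluster-size cost curves from the simulator."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(costs_small) - mean(costs_large)

def rank_jobs(job_curves):
    """Sort job names ascending by preference score, as in the first
    step of Algorithm 5; `job_curves` maps a job name to its pair of
    simulated cost curves (small, large)."""
    return sorted(job_curves, key=lambda j: preference_score(*job_curves[j]))

# A job cheaper on large VMs gets a positive score and sorts last:
print(rank_jobs({"a": ([10.0], [1.0]), "b": ([1.0], [10.0])}))  # ['b', 'a']
```

Jobs at the front of the returned order are the candidates for the small-instance sub-cluster, matching the split performed next.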
Then we split the ordered job list into two subsets: the first one to be executed on the cluster with small instances, and the other one to be executed on the cluster with large instances (lines 4-5). For each group, we use Algorithm 2 for homogeneous cluster provisioning to determine the optimized size of each sub-cluster for processing the assigned workload within deadline D at the minimal monetary cost (lines 6-7). We consider all possible splits by iterating through the split point from 1 to the total number of jobs n, and use a variable min_cost_{S+L} to keep track of the minimal total cost found, i.e., the sum of the costs of both sub-clusters (lines 9-12).
Footnote 7: The designed framework can be generalized for a larger number of clusters. However, this might significantly increase the algorithm complexity without adding new performance benefits.

Algorithm 5 Provisioning solution for heterogeneous clusters to process W with a deadline D while minimizing the clusters' cost
Input: W = {J1, J2, ..., Jn}: workload with traces and profiles, where jobs are sorted in ascending order by their preference score PScore_{S,L}; D: a given time deadline for processing W.
Output: N_S: number of small instances; N_L: number of large instances; W_S: list of jobs to be executed on the small-instance-based cluster; W_L: list of jobs to be executed on the large-instance-based cluster; min_cost_{S+L}: the minimal monetary cost of the heterogeneous clusters.
1: min_cost_{S+L} ← ∞
2: for split ← 1 to n − 1 do
3:   // Partition workload W into 2 groups
4:   Jobs_S ← J1, ..., J_split
5:   Jobs_L ← J_{split+1}, ..., Jn
6:   (N'_S, min_cost_S) = Algorithm2(Jobs_S, small, D)
7:   (N'_L, min_cost_L) = Algorithm2(Jobs_L, large, D)
8:   total_cost ← min_cost_S + min_cost_L
9:   if total_cost < min_cost_{S+L} then
10:    min_cost_{S+L} ← total_cost
11:    W_S ← Jobs_S, W_L ← Jobs_L
12:    N_S ← N'_S, N_L ← N'_L
13:  end if
14: end for

6. EVALUATION
In this section, we describe the experimental testbeds and MapReduce workloads used in our study.
We analyze the application performance and the job profiles when these applications are executed on different platforms of choice, e.g., small, medium, and large Amazon EC2 instances. The study aims to evaluate the effectiveness of the proposed algorithms for selecting the optimized platform for a Hadoop cluster, and to compare the outcomes for different workloads.

6.1 Experimental testbeds and workloads
In our experiments, we use the Amazon EC2 platform. It offers Virtual Machines (VMs) of different capacities for deployment at different prices. Table 1 provides descriptions of the VM instance types used in our experiments. As it shows, the compute and memory capacity of a medium VM instance (m1.medium) is double that of a small VM instance (m1.small), and similarly, a large VM instance (m1.large) has double the capacity of the medium VM. These differences are similarly reflected in the pricing. We deployed Hadoop clusters that are configured with different numbers of map and reduce slots per VM instance type (according to its capacity), as shown in Table 1. Each VM instance is deployed with 100 GB of Elastic Block Storage (EBS). We use Hadoop 1.0.0 in all the experiments. The file system blocksize is set to 64 MB and the replication level is set to 3.

Table 1: EC2 Testbed description.
Instance  Price       CPU capacity (relative)                    RAM (GB)  #map, reduce slots
Small     $0.06 p/h   1 EC2 Compute Unit (1 virtual core         1.7       1, 1
                      with 1 EC2 Compute Unit)
Medium    $0.12 p/h   2 EC2 Compute Units (1 virtual core        3.75      2, 2
                      with 2 EC2 Compute Units)
Large     $0.24 p/h   4 EC2 Compute Units (2 virtual cores       7.5       4, 4
                      with 2 EC2 Compute Units)

In the performance study, we use a set of 13 applications released by the Tarazu project [2]. Table 2 provides a high-level summary of the applications with the corresponding job settings (e.g., the number of map/reduce tasks). Applications 1, 8, and 9 process synthetically generated data. Applications 2 to 7 use the Wikipedia articles dataset as input. Applications 10 to 13 use the Netflix movie ratings dataset. These applications perform very different data manipulations, which result in different resource requirements. To provide some additional insight into the amounts of data flowing through the MapReduce processing pipeline, we also show the overall size of the input data, the intermediate data (i.e., the data generated between the map and reduce stages), and the output data (i.e., the data written by the reduce stage).

Table 2: Application characteristics.
Application         Input data  Input   Interm    Output    #map, reduce
                    (type)      (GB)    (GB)      (GB)      tasks
1. TeraSort         Synthetic   31      31        31        495, 240
2. WordCount        Wikipedia   50      9.8       5.6       788, 240
3. Grep             Wikipedia   50      1         1x10^-8   788, 1
4. InvIndex         Wikipedia   50      10.5      8.6       788, 240
5. RankInvIndex     Wikipedia   46      48        45        745, 240
6. TermVector       Wikipedia   50      4.1       0.2       788, 240
7. SeqCount         Wikipedia   50      45        39        788, 240
8. SelfJoin         Synthetic   28      25        0.14      448, 240
9. AdjList          Synthetic   28      29        29        508, 240
10. HistMovies      Netflix     27      3x10^-5   7x10^-8   428, 1
11. HistRatings     Netflix     27      2x10^-5   6x10^-8   428, 1
12. Classification  Netflix     27      0.8       0.6       428, 50
13. KMeans          Netflix     27      27        27        428, 50
6.2 Application performance analysis
We execute the set of 13 applications shown in Table 2 on three Hadoop clusters (footnote 8) deployed with different types of EC2 VM instances (they can be obtained for the same price per time unit): (i) 40 small VMs, (ii) 20 medium VMs, and (iii) 10 large VM instances. We configure these Hadoop clusters according to their nodes' capacities, as shown in Table 1, with 1 additional instance deployed as the NameNode and JobTracker. These experiments pursue the following goals: 1) to demonstrate the performance impact of executing these applications on Hadoop clusters deployed with different EC2 instances; and 2) to collect the detailed job profiles for creating the job traces used for replay by the simulator and for the trade-off analysis in determining the optimal platform choice.
Footnote 8: All the experiments are performed five times, and the measurement results are averaged. This comment applies to all the results.
Figure 8(a) presents the completion times (CT) of the 13 applications executed on the three different EC2-based clusters. The results show that the platform choice may significantly impact the application processing time. Note that we break the Y-axis, as the KMeans and Classification executions take much longer to finish than the other applications. Figure 8(b) shows the normalized results with respect to the execution time of the same job on the Hadoop cluster formed with small VM instances. For 7 out of the 13 applications, the Hadoop cluster formed with small instances leads to the best completion time (and the smallest cost). However, for CPU-intensive applications such as Classification and KMeans, the Hadoop cluster formed with large instances shows better performance. Tables 3-5 summarize the job profiles collected for these applications. They show the average and maximum durations of the map, shuffle, and reduce phase processing, as well as the standard deviation for these phases.
The analysis of the job profiles shows that the shuffle phase durations on the Hadoop cluster formed with large instances are much longer than on the clusters formed with small instances. The reason is that Amazon EC2 instance scaling is done with respect to the CPU and RAM capacity, while the storage and network bandwidth is only fractionally improved. As we configure a higher number of slots on the large instances, this increases the I/O and network contention among the tasks running on the same instance, and it leads to significantly increased durations of the shuffle phase. At the same time, the map task durations of most applications executed on the Hadoop cluster with large instances are significantly improved; e.g., the map task durations of the Classification and KMeans applications improved almost three times. The presented analysis of the job profiles shows that the platform choice for a Hadoop cluster may have a significant impact on the application performance. This analysis further demonstrates the importance of an effective mechanism and algorithms for helping to make the right provisioning decisions based on the workload characteristics.

6.3 Comparison of homogeneous and heterogeneous solutions
In this section, we use workloads created from the applications shown in Table 2 to compare the results of the homogeneous and heterogeneous provisioning solutions. Table 6 provides an additional application characterization by reflecting the application preference score PScore_{S,L}. A positive value (e.g., KMeans, Classification) indicates that the application is more cost-efficient on large VMs, while a negative value (e.g., TeraSort, WordCount) means that the application favors small VM instances. The absolute score value is indicative of the preference strength. When the preference score is close to 0 (e.g., AdjList), the application does not have a clear preference between the instance types.

Table 6: Application Preference Score.
Application         PScore_{S,L}
1. TeraSort         -3.74
2. WordCount        -5.96
3. Grep             -3.3
4. InvIndex         -7.9
5. RankInvIndex     -5.13
6. TermVector       3.11
7. SeqCount         -4.23
8. SelfJoin         -5.41
9. AdjList          -0.7
10. HistMovies      -1.64
11. HistRatings     -2.53
12. Classification  19.59
13. KMeans          18.6

We perform our case studies with three workloads, W1, W2, and W3, described as follows:
- W1: contains all 13 applications shown in Table 2.

Figure 8: Job completion times (CT) on the different EC2-based clusters (small, medium, large): (a) absolute CT, (b) CT normalized to the small-instance cluster.

Table 3: Job profiles on the EC2 cluster with small instances (time in sec).
Application     avg map  max map  avg shuffle  max shuffle  avg reduce  max reduce  map STDEV  shuffle STDEV  reduce STDEV
TeraSort        29.1     46.7     248.5        317.5        31.2        41.3        0.82%      4.51%          0.97%
WordCount       71.5     147.0    218.7        272.0        12.1        22.4        1.16%      5.83%          3.68%
Grep            19.0     51.4     125.7        125.7        4.5         4.5         1.19%      26.43%         1.53%
InvIndex        83.9     17.0     196.8        265.2        18.2        27.6        1.33%      8.3%           3.96%
RankInvIndex    35.4     68.8     376.0        479.0        81.9        12.0        1.5%       3.79%          0.81%
TermVector      98.9     16.3     36.0         1239.5       137.2       111.7       0.78%      2.45%          2.45%
SeqCount        11.2     23.0     256.8        454.2        54.1        82.3        1.1%       3.63%          6.62%
SelfJoin        11.9     2.4      217.9        246.5        12.3        21.4        0.7%       4.87%          3.12%
AdjList         265.9    415.9    72.7         121.8        291.1       398.1       1.53%      6.57%          0.84%
HistMovies      17.9     49.2     138.9        138.9        3.4         3.4         1.49%      4.85%          34.84%
HistRatings     58.9     19.8     111.8        111.8        4.8         4.8         2.1%       35.58%         22.41%
Classification  3147.3   447.2    58.5         61.5         4.0         6.9         1.21%      12.76%         3.13%
KMeans          3155.9   3618.5   8.4          21.9         87.5        48.9        0.32%      3.9%           11.43%

Table 4: Job profiles on the EC2 cluster with medium instances (time in sec).
Application     avg map  max map  avg shuffle  max shuffle  avg reduce  max reduce  map STDEV  shuffle STDEV  reduce STDEV
TeraSort        36.9     46.1     466.3        553.1        26.5        34.3        1.6%       14.7%          1.21%
WordCount       83.0     127.4    562.4        771.6        11.6        23.0        0.48%      7.1%           9.9%
Grep            23.8     56.7     256.6        256.6        3.2         3.2         4.95%      24.13%         9.48%
InvIndex        11.0     15.4     449.5        536.3        13.6        2.2         0.52%      8.65%          1.62%
RankInvIndex    45.7     81.1     741.6        876.5        64.0        77.2        0.63%      9.4%           2.77%
TermVector      128.1    189.3    432.4        1451.4       71.9        576.4       0.23%      7.8%           2.81%
SeqCount        126.8    251.0    482.1        557.1        35.0        43.2        0.52%      21.7%          14.98%
SelfJoin        11.1     18.2     48.1         475.1        11.2        19.9        0.92%      13.86%         1.65%
AdjList         27.1     42.0     163.2        221.6        26.4        281.8       2.74%      8.7%           1.16%
HistMovies      2.1      47.7     246.7        246.7        3.7         3.7         3.14%      26.39%         17.4%
HistRatings     71.7     13.7     24.4         24.4         5.0         5.0         0.23%      31.39%         14.22%
Classification  313.8    474.3    177.2        211.8        3.9         6.1         0.82%      44.3%          4.33%
KMeans          2994.0   3681.2   189.7        392.1        51.7        28.4        3.93%      8.84%          6.96%

Table 5: Job profiles on the EC2 cluster with large instances (time in sec).
Application     avg map  max map  avg shuffle  max shuffle  avg reduce  max reduce  map STDEV  shuffle STDEV  reduce STDEV
TeraSort        27.3     55.7     86.4         1128.4       2.0         7.6         0.66%      7.78%          16.14%
WordCount       54.7     126.3    128.6        1163.9       12.9        59.2        4.33%      1.24%          9.15%
Grep            18.3     59.7     791.8        791.8        4.3         4.3         3.5%       16.48%         22.81%
InvIndex        61.8     18.4     1152.6       1374.5       14.9        61.7        6.47%      5.1%           8.68%
RankInvIndex    28.3     71.5     1155.8       138.6        4.5         88.5        1.49%      9.2%           8.19%
TermVector      85.3     194.7    17.6         1573.9       3.2         259.2       3.88%      5.98%          1.4%
SeqCount        62.0     117.5    146.1        1283.2       37.6        9.9         1.51%      6.7%           2.1%
SelfJoin        16.4     32.4     115.7        1235.9       18.5        88.7        1.93%      4.86%          19.11%
AdjList         149.0    311.3    436.9        531.5        149.1       348.1       0.56%      13.34%         2.78%
HistMovies      22.3     8.2      724.2        724.2        5.2         5.2         6.97%      22.46%         17.25%
HistRatings     51.4     187.8    628.6        628.6        3.6         3.6         1.59%      21.1%          4.83%
Classification  14.6     1946.5   711.2        1113.3       3.9         9.4         0.87%      37.15%         27.74%
KMeans          124.6    244.7    716.9        866.9        58.5        364.3       1.31%      1.75%          5.25%

- W2: contains 11 applications (1-11), i.e., excluding KMeans and Classification from the application set.
- W3: contains 12 applications (1-12), i.e., excluding KMeans from the application set.
Intuitively, each workload has a different number of applications that strongly favor large VM instances: W1 has both KMeans and Classification, workload W2 has neither of them, and workload W3 has only Classification.

Figure 9: Performance versus cost trade-offs for different workloads: (a) W1, (b) W2, (c) W3 (heterogeneous solution points and small/medium/large homogeneous curves).

Figure 9 shows the simulated cost/performance trade-off curves for the three workloads executed on homogeneous and heterogeneous Hadoop cluster(s). These trade-off curves are the result of the brute-force algorithm design: it searches through the entire solution space by exhaustively enumerating all possible candidates for the solution. So, these trade-off curves show all the solutions that our algorithms iterate through. For homogeneous provisioning, we show the three trade-off curves of Algorithm 2 for Hadoop clusters based on small, medium, and large VM instances, respectively. Figure 9(a) shows that workload W1 is most cost-efficient when executed on the Hadoop cluster with large VMs (among the homogeneous clusters). Such a result is expected, because W1 contains both KMeans and Classification, which have a very strong preference for large VM instances (see their high positive PScore_{S,L}). In comparison, W2 contains applications that mostly favor small VM instances, and as a result, the most efficient trade-off curve belongs to the Hadoop cluster based on small VM instances. Finally, W3 represents a mixed case: it has the Classification application, which strongly favors large VM instances, while most of the remaining applications prefer small VM instances.
Figure 9(c) shows that the choice of the best homogeneous platform depends on the workload performance objectives (i.e., deadline D). The yellow dots in Figure 9 represent the completion time and monetary cost when we exploit heterogeneous provisioning with Algorithm 5. Each point corresponds to a workload split into two subsets that are executed on Hadoop clusters formed with small and large VM instances respectively. This is why, instead of the explicit trade-off curves of the homogeneous case, the simulation results for the heterogeneous case look much more scattered across the space. To evaluate the efficiency of our provisioning algorithms, we consider different performance objectives for each workload: D = 20000 seconds for workload W1; D = 10000 seconds for workload W2; D = 15000 seconds for workload W3. Tables 7-9 present the provisioning results for each workload with the homogeneous and heterogeneous Hadoop clusters that have minimal monetary costs while meeting the given workload deadlines. Among the homogeneous Hadoop clusters for W1, the cluster with large VM instances has the lowest monetary cost of $32.86, which provides a 41% cost saving compared to a cluster with small VMs.

Cluster type                  Number of Instances   Completion Time (sec)   Monetary Cost ($)
small (homogeneous)           21                    15763                   55.43
medium (homogeneous)          15                    15137                   53.48
large (homogeneous)           39                    12323                   32.86
small+large (heterogeneous)   48 small + 2 large    14988                   24.21

Table 7: Cluster provisioning results for workload W1.

Cluster type                  Number of Instances   Completion Time (sec)   Monetary Cost ($)
small (homogeneous)           87                    7283                    10.68
medium (homogeneous)          43                    9630                    14.8
large (homogeneous)           49                    9893                    32.98
small+large (heterogeneous)   76 small + 21 large   6763                    14.71

Table 8: Cluster provisioning results for workload W2.

Cluster type                  Number of Instances   Completion Time (sec)   Monetary Cost ($)
small (homogeneous)           14                    13775                   32.37
medium (homogeneous)          7                     13118                   31.5
large (homogeneous)           36                    13265                   32.72
small+large (heterogeneous)   74 small + 15 large   113                     18.0

Table 9: Cluster provisioning results for workload W3.
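The provisioning decision itself reduces to picking, among all enumerated candidates, the cheapest one whose completion time meets the deadline. A minimal sketch, using the workload W2 numbers as they appear in Table 8 (with D = 10000 seconds):

```python
# Select the minimal-cost provisioning meeting a deadline. Each row is
# (cluster type, completion time in sec, cost in $), transcribed from
# Table 8 for workload W2.
ROWS = [
    ("small (homogeneous)", 7283, 10.68),
    ("medium (homogeneous)", 9630, 14.8),
    ("large (homogeneous)", 9893, 32.98),
    ("small+large (heterogeneous)", 6763, 14.71),
]

def cheapest_meeting_deadline(rows, deadline):
    """Keep only deadline-feasible candidates, then take the cheapest."""
    feasible = [r for r in rows if r[1] <= deadline]
    return min(feasible, key=lambda r: r[2], default=None)

best = cheapest_meeting_deadline(ROWS, deadline=10000)
```

For W2 all four candidates are feasible, so the small homogeneous cluster wins on cost alone, matching the observation that the heterogeneous solution brings no additional benefit for this workload.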
By contrast, for workload W2, the homogeneous Hadoop cluster with small VMs provides the lowest cost of $10.68, a 68% cost saving compared to a cluster with large VM instances. For W3, all three homogeneous solutions lead to a similar minimal cost, and the Hadoop cluster based on medium VMs has a slightly better cost than the other two alternatives. Intuitively, these performance results are expected from the trade-off curves for the three workloads shown in Figure 9. The best heterogeneous solution for each workload is shown in the last row of Tables 7-9. For W1, the minimal cost of the heterogeneous solution is $24.21, a 26% improvement over the minimal cost of the homogeneous solution based on the large VM instances. In this heterogeneous solution, the applications SelfJoin, WordCount, and InvIndex are executed on the cluster with small VMs, and the applications Classif, KMeans, TermVector, AdjList, HistMovies, HistRating, Grep, TeraSort, SeqCount, and RankInvInd are executed on the cluster with large VM instances. The cost benefits of the heterogeneous solution are even more significant for W3, as shown in Table 9. The minimal cost of the heterogeneous cluster is $18.0, compared with the minimal homogeneous provisioning cost of $31.5, i.e., cost savings of 42%. In this heterogeneous solution, the applications HistMovies, HistRating, Grep, TeraSort, SeqCount, RankInvInd, SelfJoin, WordCount, and InvIndex are executed on the cluster with small VMs and the applications Classif,

TermVector, AdjList are executed on the cluster with large VMs. However, for workload W2, the heterogeneous solution does not provide additional cost benefits, as shown in Table 8. One important reason is that for a heterogeneous solution we need to maintain additional nodes deployed as the JobTracker and NameNode of each sub-cluster. This increases the total provisioning cost compared to the homogeneous solution, which requires only a single additional node for the entire cluster. The workload properties also play an important role here. As workload W2 does not have any applications with a strong preference for large VM instances, the introduction of a special sub-cluster with large VM instances is not justified.

6.4 Impact of node failures on the cluster platform selection

In this section, we show how the selection of the cluster platform may be impacted when a user additionally considers the possibility of node failures and is interested in achieving generalized service level objectives (SLOs), which include two different performance goals for workload execution under a normal scenario and under a 1-node failure: a desirable completion time D for the entire set of jobs in the workload W under normal conditions; an acceptable degraded completion time D_deg for processing W in case of a 1-node failure. Intuitively, a node failure in a Hadoop cluster formed with small EC2 instances may have a smaller impact than a node failure in a cluster formed with large EC2 instances. Let us demonstrate the decision-making process for two applications from our set (see Table 2): TermVector and AdjList. The completion time versus cost curves for TermVector and AdjList are shown in Figures 10 and 7(c) respectively.

Figure 10: Performance versus cost trade-offs for TermVector.
From these figures and the preference score PScore_{S,L} shown in Table 6, we can see that TermVector slightly favors large VM instances, while AdjList is practically neutral to the choice of small, medium, or large EC2 instances. For these two applications, we apply our approach for selecting the underlying platform (a choice between small, medium, and large EC2 instances) to achieve the following performance objectives:

TermVector: D = 2900 seconds (regular case, no node failures); D_deg = 2930 seconds (in case of a 1-node failure).
AdjList: D = 1940 seconds (regular case, no node failures); D_deg = 1945 seconds (in case of a 1-node failure).

Table 10 summarizes the cluster provisioning results for the regular case and the 1-node failure scenario for TermVector. We use the abbreviations CT_reg and CT_fail to denote the completion time in the regular and 1-node failure cases respectively. Table 10 shows that in the regular case scenario, a Hadoop cluster with large EC2 instances offers the best solution. However, if a user has concerns about a possible node failure and aims to meet stringent performance objectives, then a platform based on small VM instances is a better choice.

VM type   Regular, No-Failure Case (D = 2900 sec)   1-Node Failure Scenario (D = 2900 sec, D_deg = 2930 sec)
          CT_reg (sec)   #VMs   Cost ($)            CT_reg (sec)   CT_fail (sec)   #VMs   Cost ($)
Small     2898           139    6.76                2898           2930            139    6.76
Large     2877           34     6.71                2842           2877            35     6.83

Table 10: TermVector: cluster provisioning results for a regular case and a scenario with 1-node failure.

Table 11 summarizes the cluster provisioning results for the regular case and the 1-node failure scenario for the AdjList application.

VM type   Regular, No-Failure Case (D = 1940 sec)   1-Node Failure Scenario (D = 1940 sec, D_deg = 1945 sec)
          CT_reg (sec)   #VMs   Cost ($)            CT_reg (sec)   CT_fail (sec)   #VMs   Cost ($)
Small     1939           139    4.52                1924           1939            140    4.52
Medium    1935           69     4.51                1931           1935            70     4.57

Table 11: AdjList: cluster provisioning results for a regular case and a scenario with 1-node failure.
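The failure-aware decision procedure amounts to filtering out platforms that violate either deadline and taking the cheapest survivor. A minimal sketch, fed with the TermVector failure-scenario numbers as they appear in Table 10:

```python
# Platform selection under generalized SLOs: a candidate is feasible
# only if its regular completion time meets D and its 1-node-failure
# completion time meets D_deg; among feasible candidates, pick the
# cheapest. Rows: (platform, CT_reg, CT_fail, #VMs, cost $), from the
# failure-scenario columns of Table 10 (TermVector).
CANDIDATES = [
    ("small", 2898, 2930, 139, 6.76),
    ("large", 2842, 2877, 35, 6.83),
]

def provision(candidates, d, d_deg):
    feasible = [c for c in candidates if c[1] <= d and c[2] <= d_deg]
    return min(feasible, key=lambda c: c[4], default=None)

best = provision(CANDIDATES, d=2900, d_deg=2930)
```

Both platforms satisfy D = 2900 and D_deg = 2930 here, so cost decides: the small-VM cluster ($6.76) wins over the large-VM one ($6.83), in line with the discussion above.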
In the regular case scenario, a Hadoop cluster with medium EC2 instances offers the best solution for AdjList. However, in the scenario with a 1-node failure, a platform based on small VM instances is a better choice. The achievable cost and performance advantages are more significant for workloads that require small-size Hadoop clusters to achieve their performance objectives. In large Hadoop clusters, the loss of one node has a less pronounced performance impact.

6.5 Validation of the simulation results

To validate the accuracy of the simulation results, we chose workload W2 and selected a makespan target of 20000 seconds. We use our simulation results (shown in Figure 9(b)) and identify the four closest points that represent the corresponding four solutions. The selected points correspond to simulated homogeneous Hadoop clusters with 28, 20, and 24 nodes formed by small, medium, and large EC2 instances respectively, and to a heterogeneous solution with two Hadoop sub-clusters based on 26 small nodes and 2 large nodes. We deployed Hadoop clusters with the required number of instances and executed workload W2 (with the corresponding Johnson job schedule) on the deployed clusters. Figure 11 shows the comparison between the simulated and the actual measured makespan (we repeated the measurements 5 times).

Figure 11: Validation of the simulation results.

Table 12 summarizes the validation results shown in Figure 11. The simulated results with small and large EC2 instances, as well as the heterogeneous solution, show a 2-8% error compared to the measured results.

                       small   medium   large   heterogeneous
Simulated time (sec)   19327   20130    19224   19612
Measured time (sec)    19625   23537    18521   21368

Table 12: Summary of the validation results.

We can see a higher prediction error (17%) for medium instances. Partially, it is due to a higher variance in the job profile measurements collected on medium instances.

6.6 Discussion

Towards a better understanding of what causes the application performance to be so different when executed by Hadoop clusters based on different VM instances, Figure 12 shows a detailed analysis of the execution time breakdown for TeraSort and KMeans on the small, medium, and large EC2 instances (we use the same Hadoop cluster configurations as described in our motivating example in Section 1).

Figure 12: Analysis of TeraSort and KMeans on different EC2 instances: execution time breakdown into map, shuffle, and reduce phases. (a) TeraSort; (b) KMeans.

TeraSort performance is dominated by the shuffle phase. Moreover, the shuffle duration increases when the job is executed by Hadoop on medium and large instances compared to execution on the small EC2 instances. The significantly longer shuffle time leads to an increased overall job completion time, as shown in Figure 12(a). One explanation is that the larger EC2 instances are provided with scaled capacities of CPU and RAM, but not of network bandwidth. As we configure more slots on the large EC2 instances, the amount of I/O and network traffic (as well as the contention) per VM increases, and this leads to the increased duration of the shuffle phase. On the contrary, for KMeans, shown in Figure 12(b), the map stage duration dominates the application execution time, and the map phase execution is significantly improved when executed on large EC2 instances.
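The prediction errors quoted above follow directly from Table 12; a minimal sketch of the computation (relative to the simulated value):

```python
# Prediction error between simulated and measured makespans, using the
# values of Table 12, computed relative to the simulated prediction.
SIMULATED = {"small": 19327, "medium": 20130, "large": 19224, "hetero": 19612}
MEASURED  = {"small": 19625, "medium": 23537, "large": 18521, "hetero": 21368}

def prediction_error(sim, meas):
    """Relative error per configuration: |measured - simulated| / simulated."""
    return {k: abs(meas[k] - sim[k]) / sim[k] for k in sim}

errors = prediction_error(SIMULATED, MEASURED)
```

This yields errors of a few percent for the small, large, and heterogeneous configurations, and roughly 17% for the medium one.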
This can be explained by checking the CPU models of the underlying server hardware used to host the different types of EC2 instances. Over a month, every day we reserved 2 instances of each of the small, medium, and large EC2 instance types to gather the CPU information of the servers used for hosting these instances. Table 13 below summarizes the CPU model statistics accumulated during these sampling experiments. The majority of large EC2 instances (75%) are hosted on a later-generation, more powerful, and faster CPU model compared to the small and medium EC2 instances. Also, practically the same CPU models are used for hosting the small and medium EC2 instances, which explains why the performance difference between small and medium EC2 instances is significantly smaller than between them and the large ones, e.g., see the KMeans performance shown in Figure 12(b).

Instance type   CPU type
Small           90% Intel(R) Xeon(R) CPU E5-2650 @ 2.0GHz
                 9% Intel(R) Xeon(R) CPU E5645 @ 2.4GHz
                 1% Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
Medium          83% Intel(R) Xeon(R) CPU E5-2650 @ 2.0GHz
                 8% Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
                 7% Intel(R) Xeon(R) CPU E5645 @ 2.4GHz
                 2% Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
Large           75% Intel(R) Xeon(R) CPU E5-2651 v2 @ 1.8GHz
                12% Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
                 8% Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
                 3% Intel(R) Xeon(R) CPU E5645 @ 2.4GHz
                 2% Intel(R) Xeon(R) CPU E5-2650 @ 2.0GHz

Table 13: CPU types used by different EC2 instances.

7. RELATED WORK

In the past few years, performance modeling and simulation of MapReduce environments have received much attention, and different approaches [5, 4, 14, 13] were offered for predicting the performance of MapReduce applications, as well as for optimizing resource provisioning in the Cloud [7, 11]. A few MapReduce simulators were introduced for the analysis and exploration of Hadoop cluster configurations and optimized job scheduling decisions. The designers of MRPerf [16] aim to provide a fine-grained simulation of MapReduce setups.
To accurately model inter- and intra-rack task communications over the network, MRPerf uses the well-known ns-2 network simulator. The authors are interested in modeling different cluster topologies and their impact on MapReduce job performance. In our work, we follow the directions of the SimMR simulator [12] and focus on simulating the job master decisions and the task/slot allocations across multiple jobs. We do not simulate details of the TaskTrackers (their hard disks or network packet transfers) as done by MRPerf. In spite of this, our approach accurately reflects the job processing because of our profiling technique, which represents the job latencies during the different phases of MapReduce processing in the cluster. SimMR is very fast compared to MRPerf, which deals with network-packet-level simulations. Mumak [3] is an open source Apache MapReduce simulator. It replays traces collected with a log processing tool called Rumen [1]. The main difference between Mumak and SimMR is that Mumak omits modeling the shuffle/sort phase, which could significantly affect the accuracy. There is a body of work focusing on the performance optimization of MapReduce executions in heterogeneous environments. Zaharia et al. [19] focus on eliminating the negative effect of stragglers on job completion time by improving the scheduling strategy with speculative tasks. The Tarazu project [2] provides communication-aware scheduling of map computation, which aims at decreasing the communication overload when faster nodes process map tasks with input data stored on slow nodes. It also proposes a load-balancing approach for reduce computation by assigning different amounts of reduce work according to node capacity. Xie et al. [18] try to improve MapReduce performance through a heterogeneity-aware data placement strategy: faster nodes store a larger amount of input data. In this way, more tasks can be executed by faster nodes without a data transfer for the map execution. Polo et al. [9] show that some MapReduce applications can be accelerated by using special hardware.
The authors design an adaptive Hadoop scheduler that assigns such jobs to the nodes with the corresponding hardware. Another group of related work is based on resource management that considers monetary cost and budget constraints. In [10], the authors provide a heuristic to optimize the number of machines for a bag of jobs while minimizing the overall completion time under a given budget. This work assumes the user does not have any knowledge about the job completion times. It starts with a single machine

and gradually adds more nodes to the cluster based on the average job completion time, updated every time a job finishes. In our approach, we use job profiles for optimizing the job schedule and provisioning the cluster. In [17], the authors design a budget-driven scheduling algorithm for MapReduce applications in the heterogeneous cloud. They consider iterative MapReduce jobs that take multiple stages to complete, where each stage contains a set of map or reduce tasks. The optimization goal is to select a machine from a fixed pool of heterogeneous machines for each task to minimize the job completion time or monetary cost. The proposed approach relies on a priori knowledge of the completion time and cost of a task i executed on a machine j in the candidate set. In our paper, we aim at minimizing the makespan of a set of jobs and design an ensemble of methods and tools to evaluate the job completion times as well as their makespan as a function of the allocated resources. In [8], Kllapi et al. propose scheduling strategies to optimize performance/cost trade-offs for general data processing workflows in the Cloud. Different machines are modeled as containers with different CPU, memory, and network capacities. The computation workflow contains a set of nodes as operators and edges as data flows. The authors provide both greedy and local search algorithms to schedule operators on different containers so that the optimal performance (cost) is achieved without violating budget or deadline constraints. Compared to our profiling approach, they estimate the operator execution time using the CPU container requirements. This approach does not apply to estimating the durations of map/reduce tasks, since their performance depends on multiple additional factors, e.g., the amount of RAM allocated to the JVM, the I/O performance of the executing node, etc. The authors present only simulation results without validating the simulator accuracy. 8.
CONCLUSION

In this work, we designed a novel simulation-based framework for evaluating both homogeneous and heterogeneous Hadoop solutions to enhance private and public cloud offerings with cost-efficient, SLO-driven resource provisioning. We demonstrated that seemingly equivalent platform choices for a Hadoop cluster might result in very different application performance, and thus lead to different costs. Our case study with the Amazon EC2 platform reveals that for different workloads an optimized platform choice may result in 45-68% cost savings for achieving the same performance objectives. In our future work, we plan to use a set of additional microbenchmarks to profile and compare the generic phases of the MapReduce processing pipeline across Cloud offerings, e.g., comparing the performance of the shuffle phase across different EC2 instances to predict the general performance impact of different platforms on user workloads.

9. REFERENCES

[1] Apache Rumen: a tool to extract job characterization data from job tracker logs. https://issues.apache.org/jira/browse/MAPREDUCE-728.
[2] F. Ahmad et al. Tarazu: Optimizing MapReduce on Heterogeneous Clusters. In Proc. of ASPLOS, 2012.
[3] Apache. Mumak: Map-Reduce Simulator. https://issues.apache.org/jira/browse/MAPREDUCE-751.
[4] H. Herodotou, F. Dong, and S. Babu. No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-Intensive Analytics. In Proc. of the ACM Symposium on Cloud Computing, 2011.
[5] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In Proc. of the 5th Conf. on Innovative Data Systems Research (CIDR), 2011.
[6] S. Johnson. Optimal Two- and Three-Stage Production Schedules with Setup Times Included. Naval Res. Log. Quart., 1954.
[7] K. Kambatla, A. Pathak, and H. Pucha. Towards optimizing hadoop provisioning in the cloud. In Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009.
[8] H. Kllapi et al. Schedule Optimization for Data Processing Flows on the Cloud. In Proc. of ACM SIGMOD, 2011.
[9] J. Polo et al.
Performance management of accelerated mapreduce workloads in heterogeneous clusters. In Proc. of the 41st Intl. Conf. on Parallel Processing, 2010.
[10] J. N. Silva et al. Heuristic for Resources Allocation on Utility Computing Infrastructures. In Proc. of the MGC 2008 workshop.
[11] F. Tian and K. Chen. Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds. In Proc. of the IEEE Conference on Cloud Computing (CLOUD 2011).
[12] A. Verma, L. Cherkasova, and R. H. Campbell. Play It Again, SimMR! In Proc. of Intl. IEEE Cluster, 2011.
[13] A. Verma, L. Cherkasova, and R. H. Campbell. Resource Provisioning Framework for MapReduce Jobs with Performance Goals. In Proc. of the 12th Middleware Conf., 2011.
[14] A. Verma et al. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In Proc. of ICAC, 2011.
[15] A. Verma et al. Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve Cluster Performance. In Proc. of MASCOTS, 2012.
[16] G. Wang, A. Butt, P. Pandey, and K. Gupta. A Simulation Approach to Evaluating Design Decisions in MapReduce Setups. In Intl. Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
[17] Y. Wang and W. Shi. On Optimal Budget-Driven Scheduling Algorithms for MapReduce Jobs in the Heterogeneous Cloud. Technical Report TR-13-2, Carleton Univ., 2013.
[18] J. Xie et al. Improving mapreduce performance through data placement in heterogeneous hadoop clusters. In Proc. of the IPDPS Workshops: Heterogeneity in Computing, 2010.
[19] M. Zaharia et al. Improving mapreduce performance in heterogeneous environments. In Proc. of OSDI, 2008.
[20] Z. Zhang, L. Cherkasova, and B. T. Loo. Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing. In Proc. of the 4th Intl. Workshop on Cloud Data and Platforms (CloudDP 2014), 2014.
[21] Z. Zhang et al. Automated Profiling and Resource Management of Pig Programs for Meeting Service Level Objectives. In Proc. of IEEE/ACM ICAC, 2012.
[22] Z. Zhang et al.
Optimizing Cost and Performance Trade-Offs for MapReduce Job Processing in the Cloud. In Proc. of IEEE/IFIP NOMS, May 2014.

Optimal Map Reduce Job Capacity Allocation in Cloud Systems

Marzieh Malekimajd, Sharif University of Technology, Iran, malekimajd@ce.sharif.edu
Danilo Ardagna, Politecnico di Milano, Italy, danilo.ardagna@polimi.it
Michele Ciavotta, Politecnico di Milano, Italy, michele.ciavotta@polimi.it
Alessandro Maria Rizzi, Politecnico di Milano, Italy, alessandromaria.rizzi@polimi.it
Mauro Passacantando, Università di Pisa, Italy, mauro.passacantando@unipi.it

ABSTRACT

We are entering a Big Data world. Many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructural layer, cloud computing provides flexible and cost-effective solutions for allocating on demand large clusters. Capacity allocation in such systems is a key challenge to providing performance for MapReduce jobs and minimizing cloud resource costs. The contribution of this paper is twofold: (i) we provide new upper and lower bounds for MapReduce job execution time in shared Hadoop clusters; (ii) we formulate a linear programming model able to minimize cloud resource costs and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees. Simulation results show how the execution time of MapReduce jobs falls within 14% of our upper bound on average. Moreover, numerical analyses demonstrate that our method is able to determine the global optimal solution of the linear problem for systems including up to 1,000 user classes in less than 0.5 seconds.

1. INTRODUCTION

Nowadays, many sectors of our economy are guided by data-driven decision processes [14]. In complex systems that do not lend themselves to intuitive models (e.g., natural sciences, social and engineered systems [11]), data-driven modeling and hypothesis generation have a key role in understanding system behavior and interactions. The adoption of data-intensive applications is well recognized as able to enhance the efficiency of enterprises and the quality of our lives.
A recent McKinsey analysis [19] has shown, for instance, that Big Data could produce $300 billion potential annual value to US health care. The analysis has also shown how the European public sector could potentially reduce the expenditure of administrative activities by 15-20%, with an increase of value ranging between $223 and $446 billion [11, 19]. From the technological perspective, the MapReduce programming model is recognized to be the most prominent solution for Big Data applications [16]. Its open source implementation, Hadoop, is able to manage large datasets over either commodity clusters or high performance distributed topologies [29].

Copyright is held by author/owner(s).

MapReduce has attracted the interest of both industry and academia, since analyzing large amounts of unstructured data is a high priority task for many companies and overtakes the scalability level that can be achieved by traditional data warehouse and business intelligence technologies [16]. Likewise, cloud computing is becoming a mainstream solution to provide very large clusters on a pay-per-use basis. Cloud storage provides an effective and cheap solution for storing Big Data, as modern NoSQL databases have demonstrated good extensibility and scalability in storing and accessing data [15]. Moreover, the pay-per-use approach and the almost infinite capacity of cloud infrastructures can be used efficiently in supporting data-intensive computation. Many cloud providers already include in their offering MapReduce-based platforms such as the Google MapReduce framework, Microsoft HDInsight, and Amazon Elastic MapReduce [2, 4, 5]. IDC estimates that by 2020, nearly 40% of Big Data analyses will be supported by public cloud [6], while Hadoop is expected to touch half of the world's data by 2015 [15]. A MapReduce job consists of two main phases, Map and Reduce; each phase performs a user-defined function on input data. MapReduce jobs were meant to run on dedicated clusters to support batch analyses.
Nevertheless, MapReduce applications have evolved, and it is not uncommon that large queries, submitted by different user classes, need to be performed on shared clusters, possibly with some guarantees on their execution time. In this context, the main drawback [17, 26] is that the execution time of a MapReduce job is generally unknown in advance. In such systems, capacity allocation becomes one of the most important aspects. Determining the optimal number of nodes in a cluster, shared among multiple users performing heterogeneous tasks, is an important and challenging problem [26]. Moreover, capacity allocation policies need to decide job execution and rejection rates in such a way that user workloads meet their deadlines and the overall cost is minimized. The Capacity and Fair schedulers have been introduced in the new versions of Hadoop to address capacity allocation challenges and effective resource management [1, 3]. The main goal of Hadoop 2.x [25] is maximizing cluster utilization, while avoiding the starvation of short (i.e., interactive) jobs. Our focus in this paper is on dynamic capacity allocation. First, we determine new upper and lower bounds for MapReduce job execution times in shared Hadoop clusters adopting

capacity and fair schedulers. Next, we formulate the capacity allocation problem as an optimization problem, with the aim of minimizing the cost of cloud resources and the penalties for job rejections. We then reduce our minimization problem to a Linear Programming (LP) problem, which can be solved very efficiently by state of the art solvers. We validate the accuracy of our bounds through the YARN Scheduler Load Simulator (SLS) [7]. The scalability of our optimization approach is demonstrated by considering a very large set of experiments. The largest instance we consider, including 1,000 user classes, can be solved to optimality in less than 0.5 seconds. Moreover, simulation results show that the average job execution time is around 14% lower than our upper bound. To the best of our knowledge, the only work providing upper and lower bounds for MapReduce job execution times is [26], where only dedicated clusters and FIFO scheduling are considered (which are not able to fulfill the job concurrency and resource sharing requirements of current MapReduce applications). This paper is organized as follows. MapReduce job execution time lower and upper bounds are presented in Section 2. In Section 3 the Capacity Allocation (CA) problem is introduced, and its linear formulation is presented in Section 4. The accuracy of the bounds and the scalability of the solution are evaluated in Section 5. Section 6 describes the related work. Conclusions are finally drawn in Section 7.

2. ESTIMATING JOB EXECUTION TIMES IN SHARED CLUSTERS

In large clusters, multiple classes of MapReduce jobs can be executed concurrently (1). In such systems we need to estimate job execution times to determine the configuration of minimum cost, while providing service level agreement (SLA) guarantees. Previous works, e.g., [26], provided theoretical bounds to design performance models for Hadoop 1.0, considering in particular the FIFO scheduler. Those bounds can be used to predict job completion times only for dedicated clusters. Nowadays, large shared clusters are ruled by newer schedulers, i.e., Capacity and Fair [1, 3].
In the following, we derive new bounds for such systems. In particular, Section 2.1 introduces preliminaries and provides a tighter bound with respect to [26] for a single-phase (either Map or Reduce) job. Section 2.2 extends the analysis to the case of two single-phase jobs, while Section 2.3 provides bounds for the case of multiple (single-phase) jobs. Ultimately, we complete our analysis in Section 2.4 by using these bounds to derive execution time bounds for multiple classes of complete MapReduce jobs. Such results are used in the remaining sections to define the constraints of the CA problem that guarantee job deadlines are met. For space limitations, some proofs are omitted and reported in [18].

2.1 Single job bounds

Let us consider the execution of a single-phase MapReduce job J, and let us denote by k, n, µ, and λ the number of available slots, the number of tasks in the Map or Reduce phase of J, and the mean and maximum task duration, respectively. In the following, we suppose that the assignment of tasks to slots is done using an on-line greedy algorithm that assigns each task to the slot with the earliest finishing time.

(1) A job class is a set of jobs characterized by the same profile in terms of map, reduce and shuffle duration.

Figure 1: Worst case of one job execution time.

Proposition 2.1. The execution time of a Map or Reduce phase of J under a greedy task assignment is at most

U = (nµ − λ)/k + λ.

Proof. By contradiction, we assume the execution time is U + ε with ε > 0. Note that nµ is the phase's total workload, that is, the duration of the considered phase in the case of only one available slot. Let the last processed task have duration t. All slots are busy before the start of the last task (otherwise it would have started earlier). The time that has elapsed before the start of the last task is (U + ε − t). Since all slots are busy for (U + ε − t) time, the total workload until that point is (U + ε − t)·k.
At the end of the execution, the whole phase workload must be unchanged, hence

(U + ε − t)·k + t = nµ
((nµ − λ)/k + λ + ε − t)·k + t = nµ
(k − 1)·λ + ε·k + t·(1 − k) = 0
ε·k = (t − λ)·(k − 1).

Since t ≤ λ, we get ε·k ≤ 0, which is a contradiction because we assumed ε > 0 and k ≥ 1.

The worst case scenario is illustrated in Figure 1, where job J starts with k slots such that for (nµ − λ)/k time units all k slots are busy. After that time only one task, with duration λ, is left to be executed. One slot performs the last task while all other slots are free. Finally, after (nµ − λ)/k + λ time units, all tasks are executed and the phase is completed. Note that a similar upper bound has been proposed in [26]. Our contribution improves the previous result by (λ − µ)/k.

2.2 Two job bounds

In order to provide fast response times to small jobs and to maximize the throughput and utilization of Hadoop clusters, the Fair and Capacity schedulers have been devised. The Fair scheduler organizes jobs in pools such that every job gets, on average, an equal amount of resources over time. A single running job uses the entire cluster; however, if other jobs are submitted, the slots that are progressively released are assigned to the new jobs. In addition, the Fair scheduler can guarantee minimum shares, enables preemption, and limits the number of concurrently running jobs/tasks. The Capacity scheduler has similar functionalities. The feature set of the Capacity scheduler includes minimum share guarantees, security, elasticity, multi-tenancy, preemption, and job priorities.
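Proposition 2.1 can also be checked numerically with a small simulation of the on-line greedy assignment (an illustrative sketch, not the paper's simulator; the task durations below are randomly generated):

```python
# Simulate the greedy assignment of Proposition 2.1: each task goes to
# the slot with the earliest finishing time, and the resulting phase
# duration is compared against U = (n*mu - lambda)/k + lambda.
import random

def greedy_makespan(durations, k):
    """Phase duration under the on-line greedy task assignment."""
    finish = [0.0] * k
    for d in durations:
        j = min(range(k), key=lambda s: finish[s])  # earliest-free slot
        finish[j] += d
    return max(finish)

def upper_bound(durations, k):
    """U = (n*mu - lambda)/k + lambda from Proposition 2.1."""
    n = len(durations)
    mu = sum(durations) / n   # mean task duration
    lam = max(durations)      # maximum task duration
    return (n * mu - lam) / k + lam

random.seed(42)
tasks = [random.uniform(5.0, 60.0) for _ in range(40)]
```

For any task list and slot count, `greedy_makespan(tasks, k)` never exceeds `upper_bound(tasks, k)`, which is exactly what the proof by contradiction establishes.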

Figure 2: Lower bound of two jobs in work-conserving mode.

A scheduler is defined to be work-conserving if it never lets a processor idle while there are runnable tasks in the system. Both the Fair and Capacity schedulers can be configured in work-conserving or non-work-conserving mode (which, vice versa, lets available resources idle). Let us consider the execution of two jobs J_i and J_j. If the system is configured in non-work-conserving mode, the available slots are divided statically and J_i's idle slots are not allocated to J_j. Note that the upper bound defined in Proposition 2.1 and the lower bound provided in [26] are still valid, since resources are partitioned. Vice versa, if the system is configured in work-conserving mode, when J_i finishes, its slots are allocated to J_j if it still has tasks waiting to start. In this situation, the bounds proposed in Propositions 2.2 and 2.3 hold. We assume both jobs start at the same time and J_i has α_i percent of all the available k slots, whereas α_j percent of the slots are reserved to J_j, i.e., α_i, α_j ∈ (0, 1) and α_i + α_j = 1.

Proposition 2.2. The execution times of a greedy task assignment of two jobs (J_i, J_j) in work-conserving mode are at least min{n_iµ_i/(α_i k), n_jµ_j/(α_j k)} and (n_iµ_i + n_jµ_j)/k, respectively.

Proof. The analysis of the execution of the job that finishes first is equivalent to the case of a single job in the system (the best lower and upper bounds known in the literature are given by [26] and Proposition 2.1). As regards the second job, the number of slots changes at some point of its execution; in other words, when the first job finishes, the second job gets all the slots of the system. Let us suppose that J_i terminates first; hence J_j receives all slots after at least n_iµ_i/(α_i k) time units (i.e., after J_i's lower bound [26]). Let us denote by t_f the lower bound for J_j's execution time. First J_j has α_j k slots until time instant n_iµ_i/(α_i k) (see the dotted area in Figure 2), then J_j receives all k slots for a period of time equal to t_f − n_iµ_i/(α_i k). The maximum
The maximum workload that can be executed according to the number of slots is greater than or equal to the workload of job J_j:

  (n_i µ_i/(α_i k)) α_j k + (t_f − n_i µ_i/(α_i k)) k ≥ n_j µ_j.

Thus, by replacing α_j with 1 − α_i we get

  (n_i µ_i/(α_i k)) (1 − α_i) k + (t_f − n_i µ_i/(α_i k)) k ≥ n_j µ_j,

which is equivalent to t_f ≥ (n_i µ_i + n_j µ_j)/k.

Figure 3: Upper bound of two jobs in work-conserving mode for the job that ends the earliest.

Figure 4: Upper bound of two jobs in work-conserving mode for the job that ends the latest.

Proposition 2.3. In a system with two jobs J_i and J_j in work-conserving mode, the upper bound T_i of the execution time of job J_i is

  T_i = (n_i µ_i − λ_i)/(k α_i) + λ_i,        if n_j µ_j/(k α_j) ≥ (n_i µ_i − λ_i)/(k α_i),
  T_i = (n_j µ_j + n_i µ_i − λ_i)/k + λ_i,    otherwise.

Proof. Here we want the upper bound for a job when the work-conserving policy allows using idle slots. Hence, the upper bound is achieved when the minimum number of idle slots becomes available, which happens when the other job keeps its slots busy. If n_j µ_j/(k α_j) ≥ (n_i µ_i − λ_i)/(k α_i) holds (see Figure 3), then the slots of the other job can be kept busy long enough that the upper bound of this job does not change. If the inequality does not hold, then the slots of the other job become available before this job finishes (see Figure 4). As in the previous proof, in the worst case the last task (with maximum duration) can only start after a period of time in which all slots have been busy, that is, (n_j µ_j + n_i µ_i − λ_i)/k.

2.3 Multiple class bounds

In a shared system, let k be the number of slots and U the set of job classes. In each class i ∈ U, h_i concurrent jobs are executed by using α_i percent of the system slots. Each job J_i in class i has n_i tasks with mean task duration µ_i and maximum task duration λ_i.

Proposition 2.4. The lower bound for the execution time of job J_i in presence of multiple classes of jobs is n_i µ_i h_i/(k α_i).

Proof. Each class has k α_i slots and h_i concurrent jobs, so each job has overall k α_i/h_i slots and, using the bound provided in [26], we get as lower bound n_i µ_i/(k α_i/h_i) = n_i µ_i h_i/(k α_i).

Proposition 2.5.
The upper bound for the execution time of job J_i in presence of multiple classes of jobs is (n_i µ_i − 2λ_i) h_i/(k α_i) + 2λ_i.
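As a quick numeric illustration of Propositions 2.4 and 2.5 (the function and variable names below are our own, not the paper's), both bounds can be evaluated in a couple of lines; note that the upper bound dominates the lower bound whenever each job holds at least one slot, i.e., k α_i/h_i ≥ 1:

```python
def multi_class_bounds(n, mu, lam, h, k, alpha):
    """Execution-time bounds for a job with n tasks (mean duration mu,
    maximum duration lam) in a class that holds an alpha share of the
    k slots with concurrency degree h (Propositions 2.4 and 2.5)."""
    low = n * mu * h / (k * alpha)                       # Proposition 2.4
    up = (n * mu - 2 * lam) * h / (k * alpha) + 2 * lam  # Proposition 2.5
    return low, up
```

For instance, 100 tasks of mean duration 10 s (maximum 30 s), with a half share of 50 slots and 2 concurrent jobs, give a lower bound of 80 s and an upper bound of 135.2 s.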

Figure 5: Slot sharing in a system with several classes of jobs.

Proof. Figure 5 shows a system where slots are shared among several classes of jobs. The maximum number of slots dedicated to a single job of class i is k α_i/h_i. Let us illustrate the worst-case scenario for job J_i. We assume that another job J is executed before J_i and that each slot freed up by J is dedicated to J_i. We also assume that k α_i/h_i − 1 slots in the last wave of job J start performing a task with maximum duration, and the first slot freed up by job J is dedicated to J_i. In the worst case, this slot also performs a task with maximum duration. After duration λ_i, the remaining slots of J are freed up and are dedicated to job J_i. Then k α_i/h_i slots perform tasks of J_i for (n_i µ_i − 2λ_i) h_i/(k α_i) time and, after that, there is just one task with maximum duration λ_i. So time (n_i µ_i − 2λ_i) h_i/(k α_i) + 2λ_i is spent performing job J_i.

To prove there is no larger upper bound we use contradiction. Let us assume that job J_i in Figure 6 is executed in time ɛ + (n_i µ_i − 2λ_i) h_i/(k α_i) + 2λ_i, with ɛ > 0. Let t_1 ≤ λ_i be the time after the start of the considered job at which all possible k α_i/h_i slots (fair share) are allocated to J_i, and let t_2 ≤ λ_i be the duration of the last task. The duration ɛ + (n_i µ_i − 2λ_i) h_i/(k α_i) + 2λ_i − (t_1 + t_2) is the minimum amount of time during which the assumed job has k α_i/h_i slots. We calculate a bound by computing the minimum amount of workload that can be done, W_1, and the amount of workload that has to be done, W_2 = n_i µ_i. The minimum amount is

  W_1 = (ɛ + (n_i µ_i − 2λ_i) h_i/(k α_i) + 2λ_i − t_1 − t_2) k α_i/h_i + t_1 + t_2,

as shown by the dotted area in Figure 6. Note that the first term is the workload performed when k α_i/h_i slots are available, while t_1 and t_2 are the workloads performed when there is at least one single slot. The following relation between W_1 and W_2 holds:

  (ɛ + (n_i µ_i − 2λ_i) h_i/(k α_i) + 2λ_i − t_1 − t_2) k α_i/h_i + t_1 + t_2 ≤ n_i µ_i.

Since k α_i/h_i ≥ 1
and t_1 + t_2 ≤ 2λ_i, we get

  ɛ k α_i/h_i + n_i µ_i − 2λ_i + (2λ_i − t_1 − t_2) k α_i/h_i + t_1 + t_2 ≤ n_i µ_i,

i.e., ɛ ≤ t_1 + t_2 − 2λ_i ≤ 0, which is impossible since ɛ > 0.

2.4 Bounds for MapReduce Jobs Execution

In this section, we extend the results presented in [26] to a MapReduce system with S_M Map slots and S_R Reduce slots using the Fair/Capacity scheduler. Similar jobs are grouped together in a job class i ∈ U, α_M^i and α_R^i are the percentages of all Map and Reduce slots dedicated to class i, and h_i jobs run concurrently.

Figure 6: Execution of a single job considered in the proof by contradiction.

Let us denote with M_avg^i, M_max^i, R_avg^i, R_max^i, Sh1_avg^i, Sh1_max^i, Sh_avg^i and Sh_max^i the average and maximum durations of Map, Reduce, first Shuffle and typical Shuffle tasks, respectively. These values define an empirical performance profile for each job class, while N_M^i and N_R^i are the numbers of Map and Reduce tasks of the job J_i profile. By using the bounds defined in the previous sections, a lower and an upper bound on the duration of the entire Map phase can be estimated as follows:

  T_M^low = N_M^i M_avg^i h_i / (S_M α_M^i),
  T_M^up = (N_M^i M_avg^i − 2 M_max^i) h_i / (S_M α_M^i) + 2 M_max^i.

Similar results can be obtained for the Reduce stage, which consists of the Reduce phase and part of the Shuffle phase. In fact, according also to the results discussed in [26], we distinguish the non-overlapping portion of the first shuffle wave from the duration of the remaining tasks in the typical shuffle. The time of the typical shuffle phase can be estimated as:

  T_Sh^low = (N_R^i h_i / (S_R α_R^i) − 1) Sh_avg^i,
  T_Sh^up = (N_R^i Sh_avg^i − 2 Sh_max^i) h_i / (S_R α_R^i) + 2 Sh_max^i.

Finally, by putting all parts together, we get:

  T_i^low = A_i^low h_i / (S_M α_M^i) + B_i^low h_i / (S_R α_R^i) + C_i^low,   (1)

where A_i^low = N_M^i M_avg^i, B_i^low = N_R^i (Sh_avg^i + R_avg^i) and C_i^low = Sh1_avg^i − Sh_avg^i. In the same way, the execution time of job J_i is at most:

  T_i^up = A_i^up h_i / (S_M α_M^i) + B_i^up h_i / (S_R α_R^i) + C_i^up,   (2)

where A_i^up = N_M^i M_avg^i − 2 M_max^i, B_i^up = N_R^i Sh_avg^i − 2 Sh_max^i + N_R^i R_avg^i − 2 R_max^i, and C_i^up = 2 Sh_max^i + Sh1_max^i + 2 M_max^i + 2 R_max^i.
According to the guarantees to be provided to the end users, we can use the upper bound T_i^up (being conservative) or the approximated formula

  T_i^avg = (T_i^low + T_i^up)/2   (3)

to bound the execution time of class i jobs in the Capacity Allocation problem described in the next section.
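For concreteness, formulas (1)–(3) can be evaluated with a few lines of Python; the function name, the dictionary keys and the example profile are our own illustrative choices, not part of the paper:

```python
def job_time_bounds(profile, S_M, S_R, alpha_M, alpha_R, h):
    """Lower bound (1), upper bound (2) and approximation (3) for the
    execution time of a class-i job, given its empirical profile, the
    Map/Reduce slot pools S_M and S_R, the class shares alpha_M and
    alpha_R, and the concurrency degree h."""
    p = profile
    # Aggregate constants as defined right after equations (1) and (2).
    A_low = p["N_M"] * p["M_avg"]
    B_low = p["N_R"] * (p["Sh_avg"] + p["R_avg"])
    C_low = p["Sh1_avg"] - p["Sh_avg"]
    A_up = p["N_M"] * p["M_avg"] - 2 * p["M_max"]
    B_up = (p["N_R"] * p["Sh_avg"] - 2 * p["Sh_max"]
            + p["N_R"] * p["R_avg"] - 2 * p["R_max"])
    C_up = (2 * p["Sh_max"] + p["Sh1_max"]
            + 2 * p["M_max"] + 2 * p["R_max"])
    T_low = A_low * h / (S_M * alpha_M) + B_low * h / (S_R * alpha_R) + C_low
    T_up = A_up * h / (S_M * alpha_M) + B_up * h / (S_R * alpha_R) + C_up
    return T_low, (T_low + T_up) / 2, T_up
```

The returned triple (T^low, T^avg, T^up) is exactly what the Capacity Allocation model of the next section consumes.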

3. CAPACITY ALLOCATION PROBLEM

In this section, we consider the joint Capacity Allocation and Admission Control problem for a cloud-based shared Hadoop 2.x system. We assume that the system runs the Fair or Capacity scheduler, serving a set of user classes requesting the concurrent execution of jobs with similar execution profiles. Each class i is executed with s_M^i = α_M^i S_M Map slots and s_R^i = α_R^i S_R Reduce slots, with a concurrency degree of h_i (i.e., h_i jobs with the same profile are executed concurrently). We also assume that the system implements an admission control mechanism bounding the number of concurrent jobs h_i executed by the system, i.e., some jobs can be rejected. H_i^up denotes a prediction of the number of jobs of class i to be executed, and we have h_i ≤ H_i^up. Furthermore, in order to avoid job starvation, we also impose h_i to be greater than a given lower bound H_i^low. Finally, a (soft) deadline D_i is associated with each class i. Note that, given s_M^i, s_R^i and h_i, the execution time of a class i job can be approximated by:

  T_i = A_i h_i / s_M^i + B_i h_i / s_R^i + C_i,   (4)

where A_i, B_i and C_i are positive constants computed as discussed in the previous section. We can use equation (2) to derive (4), considering conservative upper bounds; in this latter case D_i can be considered a hard deadline. In alternative, as in [26], (4) can be obtained from (1), (2), and (3). In that case, (4) is not a bound but an approximated formula, and D_i becomes a soft deadline. In this work, we follow this latter, more flexible approach.

We assume that our MapReduce implementation is hosted in a cloud environment that provides on-demand and reserved (see, e.g., the Amazon EC2 pricing model [2]) homogeneous virtual machines (VMs). Moreover, we denote with c_M^i and c_R^i the numbers of Map and Reduce slots hosted in each VM, i.e., each instance supports c_M^i Map and c_R^i Reduce concurrent tasks for each job J_i in class i. As a consequence, let x_m and x_r be the numbers of Map and Reduce slots required by a certain job J_i; then the number of VMs to be provisioned has to be equal to x_m/c_M^i + x_r/c_R^i.
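Since whole VMs must be provisioned in practice, the slot-to-VM conversion can be sketched as follows (the ceiling is our assumption for indivisible VMs; the relaxed model introduced in Section 4 treats these quantities as continuous):

```python
import math

def vms_needed(x_m, x_r, c_M, c_R):
    """VMs required to host x_m Map and x_r Reduce slots when each VM
    offers c_M Map and c_R Reduce slots."""
    return math.ceil(x_m / c_M + x_r / c_R)
```

For example, 10 Map and 4 Reduce slots on VMs with c_M = 4 and c_R = 2 require ceil(2.5 + 2) = 5 VMs.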
Let us denote with δ and ρ < δ the costs of on-demand and reserved VMs, respectively, and with r̄ the number of reserved VMs available (i.e., the number of VMs subscribed with a long-term contract). Let d and r be the numbers of on-demand and reserved VMs, respectively, used to serve end-user requests. The aim of the Capacity Allocation (CA) problem we consider here is to minimize the overall execution cost while meeting, at the same time, all deadlines. The execution cost includes both the VM allocation cost and the penalty cost for job rejection. Given p_i, the penalty cost for the rejection of a class i job, the overall execution cost can be calculated as follows:

  δ d + ρ r + Σ_{i∈U} p_i (H_i^up − h_i),   (5)

where the decision variables are d, r, h_i, s_M^i and s_R^i, for any i ∈ U, i.e., we have to decide the number of on-demand and reserved VMs, the concurrency degree, and the number of Map and Reduce slots for each job class. The notation adopted in this paper is summarized in Table 1.

4. OPTIMIZATION PROBLEM

In this section, we formulate the CA optimization problem and propose a suitable and fast solution technique for the execution of MapReduce jobs in cloud environments.

System parameters:
  c_M^i: number of Map slots hosted in a VM of class i
  c_R^i: number of Reduce slots hosted in a VM of class i
  U: set of job classes
  p_i: penalty for rejecting jobs of class i
  D_i: makespan deadline of jobs of class i
  A_i: CPU requirement for the Map phase, derived from the input data and job class i
  B_i: CPU requirement for the Reduce phase, derived from the input data and job class i
  C_i: time constant factor depending on the Map, Copy, Shuffle and Reduce phases, derived from the input data and job class i
  r̄: number of available reserved VMs
  δ: cost of on-demand VMs
  ρ: cost of reserved VMs
  H_i^up: upper bound on the number of class i jobs to be executed concurrently
  H_i^low: lower bound on the number of class i jobs to be executed concurrently

Decision variables:
  s_M^i: number of slots to be allocated to class i for executing Map tasks
  s_R^i: number of slots to be allocated to class i for executing Reduce tasks
  h_i: number of jobs of class i to be executed concurrently
  r: number of reserved
VMs to be allocated for job execution
  d: number of on-demand VMs to be allocated for job execution

Table 1: Optimization model: parameters and decision variables.

The objective is to minimize the execution cost while meeting job (soft) deadlines. The total cost includes the VM provisioning costs and a penalty due to job rejection. In equation (5) the term Σ_{i∈U} p_i H_i^up is a constant independent of the decision variables and can be dropped. The optimization problem can then be defined as follows:

  (P)  min δ d + ρ r − Σ_{i∈U} p_i h_i

subject to:

  A_i h_i / s_M^i + B_i h_i / s_R^i + E_i ≤ 0,   ∀i ∈ U,   (6)
  r ≤ r̄,   (7)
  Σ_{i∈U} (s_M^i/c_M^i + s_R^i/c_R^i) ≤ r + d,   (8)
  H_i^low ≤ h_i ≤ H_i^up,   ∀i ∈ U,   (9)
  r ≥ 0,   (10)
  d ≥ 0,   (11)
  s_M^i ≥ 0,   ∀i ∈ U,   (12)
  s_R^i ≥ 0,   ∀i ∈ U,   (13)

where constraints (6) are derived from equation (4) by imposing the execution of each job to end before its deadline (i.e., E_i = C_i − D_i < 0). Constraint (7) ensures that no more than the available reserved VMs can be allocated. Constraint (8) guarantees that enough VMs are allocated to execute the submitted jobs within their deadlines. Constraints (9) bound the job concurrency level for each user class. We remark that, in the above problem formulation, the variables r, d, s_M^i, s_R^i, h_i are not integer as in reality they should be. In fact, requiring the variables to be integer makes

the problem much more difficult to solve. However, this approximation is widely used in the literature (see, e.g., [9, 3]), since relaxed variables can be rounded to the closest integer at the expense of a generally very small increment of the overall cost (this is intuitive for large-scale MapReduce systems that require tens or hundreds of relatively cheap VMs), justifying the use of a relaxed model. Therefore, we decided to deal with continuous variables, considering a relaxation of the real problem. This restriction will be removed in the numerical analyses reported in Section 5.

Problem (P) has a linear objective function, but constraints (6) are non-linear and non-convex (the proof is reported in [18]). To overcome the non-convexity of the constraints, we introduce new decision variables Ψ_i = 1/h_i, for any i ∈ U, to replace h_i. Then, problem (P) is equivalent to problem (P1) defined as follows:

  (P1)  min δ d + ρ r − Σ_{i∈U} p_i/Ψ_i

subject to:

  A_i/(s_M^i Ψ_i) + B_i/(s_R^i Ψ_i) + E_i ≤ 0,   ∀i ∈ U,   (14)
  r ≤ r̄,   (15)
  Σ_{i∈U} (s_M^i/c_M^i + s_R^i/c_R^i) ≤ r + d,   (16)
  Ψ_i^low ≤ Ψ_i ≤ Ψ_i^up,   ∀i ∈ U,   (17)
  r ≥ 0,   (18)
  d ≥ 0,   (19)
  s_M^i ≥ 0,   ∀i ∈ U,   (20)
  s_R^i ≥ 0,   ∀i ∈ U,   (21)

where Ψ_i^low = 1/H_i^up and Ψ_i^up = 1/H_i^low. We remark that now constraints (14) are convex (the proof is reported in [18]). The convexity of all the constraints of problem (P1) allows to prove the following result.

Theorem 4.1. In any optimal solution of problem (P1), constraints (14) hold as equalities and the numbers of slots to be allocated to job class i, s_M^i and s_R^i, can be evaluated as follows:

  s_M^i = −(1/(E_i Ψ_i)) (√(A_i B_i c_M^i/c_R^i) + A_i),   (22)
  s_R^i = −(1/(E_i Ψ_i)) (√(A_i B_i c_R^i/c_M^i) + B_i).   (23)

The proof of Theorem 4.1 is reported in [18]. The results of Theorem 4.1 allow to transform (P1) into an equivalent linear programming problem, which can be solved very quickly by state-of-the-art solvers.

Theorem 4.2.
(P1) is equivalent to the following problem:

  (P2)  min δ d + ρ r − Σ_{i∈U} p_i h_i

subject to:

  r ≤ r̄,   (24)
  Σ_{i∈U} γ_i h_i ≤ r + d,   (25)
  H_i^low ≤ h_i ≤ H_i^up,   ∀i ∈ U,   (26)
  r ≥ 0,   (27)
  d ≥ 0,   (28)

where γ_i = γ_i^1 + γ_i^2 with:

  γ_i^1 = −(1/(E_i c_R^i)) (√(A_i B_i c_R^i/c_M^i) + B_i),   (29)
  γ_i^2 = −(1/(E_i c_M^i)) (√(A_i B_i c_M^i/c_R^i) + A_i),   (30)

and the decision variables are r, d and h_i = 1/Ψ_i, for any i ∈ U.

The proof of Theorem 4.2 is reported in [18]. Since (P2) is a linear problem, currently available commercial and open-source solvers are able to solve very large instances efficiently. A scalability analysis is reported in the following section. The Karush-Kuhn-Tucker (KKT) conditions corresponding to problem (P2) guarantee that any optimal solution of (P2) has the following important properties.

Theorem 4.3. If (r*, d*, h*) is an optimal solution of problem (P2), then the following statements hold:

a) r* > 0, i.e., reserved instances are always used.

b) Σ_{i∈U} γ_i h_i* = r* + d*, i.e., γ_i can be considered a computing-capacity conversion ratio that allows to translate the class i concurrency level into VM capacity resource requirements.

c) If p_i/γ_i > δ, then h_i* = H_i^up, i.e., class i jobs are never rejected.

d) If p_i/γ_i < ρ, then h_i* = H_i^low, i.e., the class i concurrency level is set to the lower bound.

e) If r̄ > Σ_{i∈U} γ_i H_i^up, then d* = 0, i.e., by property b), if the total capacity requirement can be satisfied through reserved instances, on-demand VMs are never used.

f) If r̄ < Σ_{i∈U} γ_i H_i^low, then r* = r̄ and d* > 0, i.e., by property b), if the minimum job requirements exceed the reserved instance capacity, then on-demand VMs are needed.

Proof. The KKT conditions associated with (P2) are:

  ρ − ν + µ_r − λ_r = 0,   (31)
  δ − ν − λ_d = 0,   (32)
  −p_i + γ_i ν + µ_i − λ_i = 0,   ∀i ∈ U,   (33)
  ν (Σ_{i∈U} γ_i h_i − r − d) = 0,   (34)
  λ_r r = 0,   (35)
  µ_r (r − r̄) = 0,   (36)
  λ_d d = 0,   (37)
  λ_i (h_i − H_i^low) = 0,   ∀i ∈ U,   (38)
  µ_i (h_i − H_i^up) = 0,   ∀i ∈ U,   (39)
  ν, λ_r, µ_r, λ_d ≥ 0,   (40)
  λ_i, µ_i ≥ 0,   ∀i ∈ U.   (41)

a) Assume, by contradiction, that r* = 0. Then d* ≥ Σ_{i∈U} γ_i h_i* ≥ Σ_{i∈U} γ_i H_i^low > 0, thus λ_d = 0 and ν = δ. On the other hand, (36) implies that µ_r = 0 and λ_r = ρ − ν = ρ − δ < 0, which is impossible.

b) Since r* > 0, we have λ_r = 0; hence (31) implies ν = ρ + µ_r ≥ ρ > 0, and thus, by (34), constraint (25) is active at (r*, d*, h*).

c) It follows from (32) that ν = δ − λ_d ≤ δ; hence we have µ_i = λ_i + p_i − γ_i ν ≥ p_i − γ_i ν ≥ p_i − γ_i δ > 0. Therefore h_i* = H_i^up.

d) Since ν ≥ ρ, we get λ_i = µ_i + γ_i ν − p_i ≥ γ_i ν − p_i ≥ γ_i ρ − p_i > 0; hence h_i* = H_i^low.

e) We have r* = Σ_{i∈U} γ_i h_i* − d* ≤ Σ_{i∈U} γ_i H_i^up < r̄, thus µ_r = 0 and ν = ρ. Therefore, λ_d = δ − ρ > 0 implies d* = 0.

f) We have d* = Σ_{i∈U} γ_i h_i* − r* ≥ Σ_{i∈U} γ_i H_i^low − r̄ > 0, hence λ_d = 0 and ν = δ. Therefore, µ_r = δ − ρ > 0 implies r* = r̄.

Property a) is obvious, since reserved instances are the cheapest ones. Property b) and Theorem 4.2 lead to an important theoretical result. Indeed, the γ_i parameters can be interpreted as computing-capacity conversion ratios that allow to estimate VM capacity requirements in terms of class concurrency levels. Accordingly, properties c) and d) also become intuitive. The product γ_i δ is the unit cost for class i job execution with on-demand instances. If γ_i δ is lower than the penalty cost, then class i jobs will always be executed. Vice versa, if γ_i ρ, i.e., the class i per-unit reserved cost, is larger than the penalty, class i jobs will always be rejected. Finally, properties e) and f) relate the overall minimum Σ_{i∈U} γ_i H_i^low and maximum Σ_{i∈U} γ_i H_i^up capacity requirements to the reserved instance capacity and allow to establish a priori whether on-demand VMs will be used or not.

5. EXPERIMENTAL RESULTS

In this section we: (i) validate the job execution time bounds, (ii) evaluate the scalability of the CA problem solution, and (iii) investigate how different (P2) problem settings impact the cloud cluster cost. Our analyses are based on a very large set of randomly generated instances. Bound accuracy is evaluated through the YARN Scheduler Load Simulator (SLS) [7]. In the following section, the design of experiments is presented. Bound accuracy and scalability analyses are reported in Sections 5.2 and 5.3. Finally, the analysis of how the (P2) problem parameters impact the cost is reported in Section 5.4.

5.1 Design of experiments

The analyses in this section intend to be representative of real Hadoop systems.
Instances have been randomly generated by picking parameters according to values observed in real systems and logs of MapReduce applications, using uniform distributions within the ranges reported in Table 2. In our model, the cloud cluster consists of on-demand and reserved VMs. We considered Amazon EC2 prices for the VM hourly costs [2]. On-demand and reserved instance prices varied in the range ($0.05, $0.4), to consider the adoption of different VM configurations. Regarding the MapReduce application parameters, we used the values reported in [27], which consider real log traces obtained from four MapReduce applications: Twitter, Sort, WikiTrends, and WordCount. Moreover, as in [27], we assume that deadlines are uniformly distributed in the range (10, 20) minutes. We use the job profiles from [27] to calculate reasonable values for the penalties. First, the minimum cost for running a single job (let it be c_j) is evaluated by setting H^up = H^low and solving problem (P2) with the admission control mechanism disabled. Then, we set the penalty value for job rejections to p = 10 c_j, as in [8]. We varied H^up in the range (10, 30), and we set H^low = 0.9 H^up.

Job profile:
  N_M ∈ (70, 700)          M_max (s) ∈ (16, 120)
  N_R ∈ (32, 64)           Sh_max (s) ∈ (30, 150)
  R_max (s) ∈ (15, 75)     Sh1_max (s) ∈ (10, 30)
  D (s) ∈ (600, 1200)
Cluster scale:
  H^up ∈ (10, 30)
Job rejection penalty:
  p ($) ∈ (25, 250)
Cloud instance price:
  c_M, c_R ∈ (1, 4)        ρ ($) ∈ (0.05, 0.2)      δ ($) ∈ (0.05, 0.4)

Table 2: Cluster characteristics and job profiles.

5.2 Accuracy of Execution Time Bounds

The aim of this section is to compare our time bounds (1) and (2) against the execution times obtained through YARN SLS [7], the official simulator provided within the Hadoop 2.3 framework. YARN SLS requires a Hadoop deployment and interacts with it by means of mocked NodeManagers and ApplicationMasters, with the purpose of simulating both a set of cluster nodes and the relative workload.
Those entities interact directly with Hadoop YARN, simulating a whole running environment with a one-to-one mapping between simulated and real times (i.e., the simulation of 1 second of the Hadoop cluster requires 1 second of simulation). SLS requires as input a cluster configuration file and an execution trace. This trace can be provided either in the Apache Rumen^2 format or in the SLS proprietary format (the one we adopted), which is a simplified version containing only the data strictly needed for simulation. In particular, among other information, it provides for each job and each task the start and end times.

^2 A tool for extracting traces from Hadoop logs: http://hadoop.apache.org/docs/r1.2.1/rumen.html

In our evaluation we consider the MapReduce job profiles extracted from the log traces available from Twitter, Sort, WikiTrends, and WordCount reported in [27]. In order to use the SLS tool, we generated synthetic job traces representing these workloads. First of all, since SLS does not provide the shuffle phase execution time, we have to use a simplified version of equations (1) and (2). Therefore, we partially removed the shuffle phase, by ignoring the first shuffle wave (to a certain extent overlapped with the last Map wave, though) and by including the remaining part (e.g., Sh_avg) in the Reduce phase. We also consider the total number of available slots as shared between the Map and Reduce tasks, being

unable to assign them to a specific phase. In particular, we used a number of slots equal to the number of virtual cores allocated in the simulator. These slots have been used in both phases, so we set S_M and S_R equal to the available cores. Then, we set the per-job ratios α_R^i/h_i = α_M^i/h_i equal to 1/Σ_{k∈U} h_k. This is because the available resources are equally shared among the different users, so each class will have a ratio of resources proportional to its h_i users: α_R^i = α_M^i = h_i/Σ_{k∈U} h_k.

In order to validate our bounds, we must compute job durations, i.e., for each job, the difference between its submission and completion time. Since SLS is a trace-based simulator, we must generate a trace that interleaves, for each user, the submission of jobs by their average duration. However, we do not know this duration (that is the goal of this simulation), but we can obtain it by relying on a fixed-point iteration method. We consider a closed model in which, for each class i, h_i users can concurrently submit multiple jobs. We approximate the average job duration T_i with an initial guess A_{i,0}, for each class i ∈ U, and run the simulation of the generated trace.

  Twitter                        Sort
  Users  T_up gap   m gap        Users  T_up gap   m gap
  4      7.78%      1.24%        10     6.4%       0.61%
  6      6.35%      −0.9%        8      8.83%      3.26%
  5      18.53%     7.68%        4      19.79%     1.42%
  4      16.1%      6.43%        6      12.79%     4.79%
  8      6.7%       −11.98%      7      2.85%      −7.48%
  3      17.4%      7.12%        7      14.24%     5.64%
  6      4.65%      −9.8%        10     6.35%      −1.8%
  6      2.26%      −5.7%        6      5.7%       −1.58%
  9      0.49%      −4.94%       7      2.43%      −2.44%
  4      8.24%      1.56%        10     5.28%      −0.45%

Table 3: Two job classes analysis (Twitter and Sort)

  WordCount                      WikiTrends
  Users  T_up gap   m gap        Users  T_up gap   m gap
  2      23.12%     5.27%        4      37.46%     23.93%
  4      8.35%      −4.2%        4      26.46%     16.78%
  3      28.3%      7.8%         2      57.48%     39.47%
  2      14.4%      −0.65%       3      23.28%     12.65%
  4      19.15%     3.8%         3      48.68%     35.86%
  5      17.32%     5.6%         4      34.95%     25.61%
  3      21.97%     4.34%        3      35.58%     22.15%
  3      37.22%     14.59%       2      62.11%     43.47%
  5      15.89%     2.52%        3      37.19%     26.62%
  2      17.5%      2.41%        5      26.1%      15.8%

Table 4: Two job classes analysis (WordCount and WikiTrends)
Then, we can refine our guess of T_i iteratively with the value A_{i,n}, computed as follows:

  A_{i,n} = β T_{i,n−1} + (1 − β) A_{i,n−1},   (42)

for each class i ∈ U (we experimentally set β = 0.7), where T_{i,n−1} is the average job duration obtained by SLS for class i at the previous run n − 1. We iterate this procedure until A_{i,n} and T_{i,n} are close enough for each class i ∈ U; at that point A_{i,n} ≈ T_{i,n} ≈ T_i for each job class. We stop the fixed-point iteration method when the ratio max_{i∈U} |A_{i,n} − T_{i,n}|/T_{i,n} falls below a given threshold τ (set experimentally to 0.1). We then evaluate how far our bounds are from this value by comparing T_{i,n} with the upper bound T_i^up and with the average of the two bounds, m_i = (T_i^low + T_i^up)/2.

Each simulation trace has been built by considering different user classes (drawn from the WordCount, Sort, Twitter and WikiTrends traces), setting A_{i,0} = T_i^up for any i ∈ U.

  Twitter                   Sort                      WordCount
  Users  T_up gap  m gap    Users  T_up gap  m gap    Users  T_up gap  m gap
  5      16.99%    7.24%    3      17.82%    9.46%    2      16.2%     5.7%
  4      10.25%    1.6%     3      15.1%     6.85%    3      1.88%     0.26%
  5      6.21%     −1.26%   3      2.84%     −3.3%    4      5.26%     −3.3%
  5      8.71%     −7.89%   4      1.24%     −4.6%    2      8.84%     −1.32%
  5      3.92%     −3.39%   5      2.66%     −3.46%   2      3.82%     −4.62%
  5      14.32%    5.43%    4      14.31%    6.44%    2      14.37%    4.34%
  3      10.1%     2.21%    4      13.58%    6.38%    5      8.24%     −0.53%
  4      21.64%    11.33%   2      17.84%    8.97%    4      18.58%    7.26%
  2      11.51%    2.6%     3      8.37%     0.21%    5      9.74%     −0.74%
  4      9.53%     1.68%    4      10.46%    3.46%    4      7.37%     −1.32%

Table 5: Three job classes analysis (Twitter, Sort and WordCount)

  Sort                      WordCount                 WikiTrends
  Users  T_up gap  m gap    Users  T_up gap  m gap    Users  T_up gap  m gap
  4      5.15%     −1.51%   4      6.28%     −2.33%   4      15.77%    9.73%
  4      8.1%      −0.12%   3      9.48%     −0.97%   3      23.12%    15.46%
  5      2.56%     −4.51%   2      2.48%     −6.51%   4      14.4%     7.9%
  4      8.79%     −0.22%   2      9.2%      −2.38%   3      16.17%    8.19%
  2      3.54%     −4.25%   3      7.3%      −3.18%   5      11.32%    4.39%
  2      13.98%    5.7%     3      13.5%     1.18%    4      19.55%    11.28%
  4      14.74%    6.6%     2      14.79%    3.8%     4      21.14%    13.55%
  5      8.64%     1.6%     4      6.9%      −2.51%   2      14.52%    7.97%
  5      0.91%     −5.64%   2      2.26%     −6.75%   4      9.64%     3.37%
  4      11.2%     4.4%     4      7.84%     −0.93%   4      14.43%    8.42%

Table 6: Three job classes analysis (Sort, WordCount and WikiTrends)
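The damped refinement (42), together with the stopping rule based on τ, can be sketched as follows; `simulate` is a hypothetical callback standing in for a full SLS run, and all names are ours:

```python
def calibrate_durations(simulate, A0, beta=0.7, tau=0.1, max_iter=50):
    """Fixed-point iteration (42): refine the guessed average job
    durations A (one entry per class) until they agree with the average
    durations T measured by the simulator within a relative gap tau."""
    A = list(A0)
    for _ in range(max_iter):
        T = simulate(A)  # average job duration per class for a trace built from A
        if max(abs(a - t) / t for a, t in zip(A, T)) < tau:
            break
        A = [beta * t + (1 - beta) * a for a, t in zip(A, T)]
    return T
```

With β = 0.7 the update is a damped average, which avoids the oscillations that a plain substitution A_{i,n} = T_{i,n−1} could produce.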
In order to avoid that jobs start simultaneously (which would be unrealistic in real systems), we delay each job submission by a random exponentially-distributed time value (i.e., the user think time, set equal to a tenth of the estimated job execution time). Finally, we scaled down the original execution times by a factor of 10 in order to achieve a simulation speedup. We considered different test configurations with two and three job classes and with a random number of users in the range [2, 10]. Those scenarios represent light load conditions, which correspond to the worst case for the evaluation of our bounds. Indeed, under light load conditions the probability that any user class is temporarily idle can be significant, and the Fair and Capacity schedulers would assign the idle user class slots to other classes to boost their performance. Vice versa, under heavy loads our upper bounds become tighter.

Tables 3–6 report the results we achieved. For each run, the number of users and the gaps between T_i and both T_i^up and m_i are reported (a negative m_i gap means that T_i > m_i). All the simulations have been performed considering a cluster with 128 cores and using the YARN Fair scheduler. Overall, for the two job classes, the gap between the upper bound and the jobs' mean execution time is around 19% on average, while the gap with respect to m_i is only 1% on average. For three classes, the average gap between the upper bound and the jobs' mean execution time is 11%, while the gap with respect to m_i is 5%. Over the whole set of experiments, the average gap between the upper bound and the jobs' mean execution time is 14%. Simulations ran on Microsoft Azure Linux small instances (i.e., single-core, 1.75 GB VMs). The fixed-point iteration procedure converges in 4.4 iterations on average. The simulation time of each fixed-point procedure iteration was around 31 minutes.

5.3 Scalability analysis

In this section, we evaluate the scalability of our optimization solution. We performed our experiments on a VirtualBox virtual machine based on Ubuntu 12.04 server running on an

Intel Xeon Nehalem dual-socket quad-core system with 32 GB of RAM. The optimal solution to problem (P2) was obtained by running CPLEX 12.0, where we also restricted the decision variables r, d and h_i to be integer, i.e., we considered the Mixed Integer Linear Programming (MILP) version of (P2). We performed experiments considering different numbers of user classes. We varied the cardinality of the set U between 20 and 1,000, with a step of 20, and ran each experiment ten times. The results show that the time required to determine the global optimal solution of the MILP problem is, on average, less than 0.08 seconds. The instances of maximum size, including 1,000 user classes, can be solved in less than 0.5 seconds in the worst case.

5.4 Case Studies

In this section, we investigate how different (P2) problem settings impact the cloud cluster cost. In particular, we analyze three case studies to address the following research questions: (1) Is it better to consider a shared cluster or to devote a dedicated cluster to individual user classes? (2) What is the effect of job concurrency on the cluster cost? (3) What is the cost impact of stricter deadlines? (Is there a linear relation between the cost and the job deadlines?) Instances have been generated according to Sections 5.1 and 5.3. Furthermore, to ease the interpretation of the results, we excluded reserved instances and assumed there is a single type of VM available from the cloud provider.

5.4.1 Effect of sharing the cluster

In this case study, we want to examine the effect of cluster resource sharing. In particular, we consider two scenarios. The first one is our baseline, which corresponds to the (P2) problem setting. The second one considers the same resource demand (in terms of job profiles, deadlines, etc.), but |U| (P2) problems are solved independently, i.e., assuming a dedicated cluster is devoted to each user class. To perform the comparisons, we consider different numbers of user classes. We vary the cardinality of the set U between 20 and 1,000, with a step of 20, and randomly generate ten instances for each cardinality value.
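The case studies below repeatedly solve (P2). While our experiments rely on CPLEX, the structure established by Theorem 4.3 also suggests a simple greedy procedure for the continuous relaxation, which we sketch here for small-scale reproduction (all names are ours; this is not the solver used in the experiments). A class concurrency level is raised while its penalty p_i exceeds the marginal capacity cost: γ_i ρ on reserved capacity and γ_i δ on on-demand capacity:

```python
def solve_p2_greedy(classes, rho, delta, r_bar):
    """Greedy solution of the continuous relaxation of (P2).

    classes: list of dicts with keys "gamma", "p", "H_low", "H_up".
    Returns (r, d, h, cost); cost omits the constant sum of p_i*H_up_i
    dropped from (5), so it can be negative."""
    h = [c["H_low"] for c in classes]
    x = sum(c["gamma"] * hi for c, hi in zip(classes, h))  # capacity in use
    # Raise concurrency in order of decreasing value density p_i/gamma_i.
    for i in sorted(range(len(classes)),
                    key=lambda j: classes[j]["p"] / classes[j]["gamma"],
                    reverse=True):
        c = classes[i]
        ratio = c["p"] / c["gamma"]
        room = c["H_up"] - h[i]
        if ratio > delta:                  # worth running even on on-demand VMs
            h[i] += room
            x += c["gamma"] * room
        elif ratio > rho and x < r_bar:    # worth running only on reserved VMs
            extra = min(room, (r_bar - x) / c["gamma"])
            h[i] += extra
            x += c["gamma"] * extra
    r = min(x, r_bar)
    d = max(0.0, x - r_bar)
    cost = rho * r + delta * d - sum(c["p"] * hi for c, hi in zip(classes, h))
    return r, d, h, cost
```

On a toy two-class instance this reproduces properties c)–f): the high-penalty class is pushed to its upper concurrency bound and served partly by on-demand VMs, while the low-penalty class stays at its lower bound.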
For each instance we calculate two values: the first one is the objective function of the baseline scenario, which we refer to as the dependent objective function; the second value, which we call the independent objective function, is evaluated by summing up the |U| objective functions of the individual problems. The comparison is performed by considering the ratio between the dependent and the independent objective function. Figure 7 reports the average of these ratios for different numbers of user classes. Overall, the cluster cost marginally decreases by considering all user classes together, and on average we observe a 0.48% variation of the overall cluster cost. We can conclude that, thanks to cloud elasticity, the adoption of shared or dedicated clusters leads to the same cost. Note that a shared cluster can lead to benefits thanks to HDFS (e.g., better disk performance and node load balancing), but this cannot be captured by our cost model.

Figure 7: Effect of assuming all user classes together.

5.4.2 Effect of the job concurrency degree

In this case study we want to analyze the effect of the job concurrency degree on the cost of one single job. To perform the experiment, we assume there is just one user class in the cluster. We vary the job concurrency degree h from 10 to 30 and, for each value, we randomly generate 10 instances of problem (P2). For each instance we disable the admission control by setting H^low = H^up, and we solve the optimization problem. We calculate the cost of one single job for each instance by dividing the objective function by the job concurrency degree. Figure 8 shows how the per-job cost varies with different job concurrency degrees for a representative example.

Figure 8: Effect of the job concurrency degree on the single job cost.
Overall, the analysis demonstrates that the cost variance for different job concurrency degrees is negligible, i.e., different job concurrency degrees lead to less than a 0.2% variation of the cost of one job. Hence, in a cloud setting, elasticity allows to obtain a constant per-job execution cost independently of the number of users in a class. This result is in line with Theorem 4.3 b).

5.4.3 Effect of tightening the deadlines

Here we want to examine the relation between cost and deadlines. In particular, we check the effect of reducing the deadlines on the cluster cost. We vary the cardinality of the set U between 20 and 1,000, and for each cardinality we generate several random instances as described in Section 5.3. For each instance, we iteratively tighten the deadlines of every user class to observe how the changes are reflected on the cost. In each step, we decrease the deadlines by 5% of the initial value. The reduction process continues until the instance with the new deadlines no longer has a feasible solution. After each reduction, we calculate the increased cost ratio, i.e., the ratio between the objective function of the problem with the new deadlines and the objective function of the problem with the initial deadlines. Figure 9 illustrates the trend of the increased cost ratio for a representative instance with 20 user classes: the growth is not linear, and the cost to pay for reducing the deadlines by 60% is more than three times that of the base case.
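The super-linear growth is consistent with the structure of γ_i in Theorem 4.2: since E_i = C_i − D_i, the per-concurrency capacity requirement, and hence the cluster cost with on-demand VMs only, scales as 1/(D_i − C_i). The toy calculation below (function name and numbers are ours) reproduces the effect:

```python
def cost_blowup(C, D, reduction):
    """Ratio between the cluster cost at deadline D*(1 - reduction) and
    the cost at the original deadline D, assuming cost proportional to
    1/(D - C).  The instance becomes infeasible as the tightened
    deadline approaches the constant term C."""
    return (D - C) / (D * (1 - reduction) - C)
```

With C = 40 s and D = 100 s, a 40% deadline reduction already roughly triples the cost, and the ratio diverges as the deadline approaches C, explaining the knee observed in Figure 9.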

Figure 9: Effect of reducing deadlines on the cluster cost.

6. RELATED WORK

Capacity management and optimal scheduling of Hadoop clusters have received a lot of interest from the research community. The authors of [13] propose Starfish, a self-tuning system for analytics on Hadoop. Indeed, Hadoop rarely exhibits its best performance as-is, without a specific tuning phase. Starfish collects at runtime some key information about the job execution, generating a profile that is eventually exploited to automatically configure Hadoop without human intervention. The same tool has been successfully employed to solve cluster sizing problems [12]. Tian and Chen [24] face the problem of resource provisioning optimization, minimizing the cost associated with the execution of a job. This work presents a cost model that depends on the amount of input data and on the considered job characteristics. A profiling regression-based analysis is carried out to estimate the model parameters. A different approach, based on closed queueing networks, is proposed in [20], which considers also contention and parallelism on compute nodes to evaluate the completion time of a MapReduce job. Unfortunately, this approach concerns the execution time of the Map phase only. Vianna et al. [28] propose a similar solution, which, however, has been validated for clusters exclusively dedicated to the execution of a single job. The work in [17] models the execution of Map tasks through a tandem queue with overlapping phases and provides very efficient run-time scheduling solutions for the joint optimization of the Map and copy/shuffle phases. The authors show how their runtime scheduling algorithms closely match the performance of the offline optimal version. The work in [10] introduces a novel modeling approach based on mean-field analysis and provides very fast approximate methods to predict the performance of Big Data systems. Deadlines for MapReduce jobs are considered also in [23].
The authors recognize the inability of Hadoop schedulers to properly handle jobs with deadlines and propose adapting some well-known multiprocessor scheduling policies to the problem. They present two versions of the Earliest Deadline First heuristic and demonstrate that they outperform the classical Hadoop schedulers. The problem of progress estimation of parallel queries is addressed in [22]. The authors present Parallax, a progress estimator able to predict the completion time of queries representing MapReduce jobs. The estimator is implemented on Pig and evaluated with the PigMix benchmark. ParaTimer [21], an extension of Parallax, is a progress estimator that can predict the completion of parallel queries expressed as a Directed Acyclic Graph (DAG) of MapReduce jobs. The main improvement with respect to the previous work is the support for queries where multiple jobs work in parallel, i.e., have different paths in the DAG. The authors of [31] investigate the performance of MapReduce applications on homogeneous and heterogeneous Hadoop cloud-based clusters. They consider a problem similar to the one we faced in our work and provide a simulation-based framework for minimizing infrastructural costs. However, admission control is not considered and a single type of workload (i.e., user class) is optimized. In [26] the ARIA framework is presented. This work is the closest to our contribution and focuses on clusters dedicated to single user classes running on top of a first-in-first-out scheduler. The framework addresses the problem of calculating the most suitable amount of resources (slots) to allocate to Map and Reduce tasks in order to meet a user-defined soft deadline for a certain job and to reduce the costs associated with resource over-provisioning. A MapReduce performance model relying on a compact job profile definition to calculate a lower bound, an upper bound, and an estimate of job execution time is presented. Finally, this model, improved in [32], is validated through a simulation study and an experimental campaign on a 66-node Hadoop cluster.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we provided an optimization model able to minimize the execution costs of heterogeneous tasks in cloud-based shared Hadoop clusters. Our model is based on novel upper and lower bounds for MapReduce job execution time. Our solution has been validated by a large set of experiments. Results have shown that our method is able to determine globally minimal solutions for systems including up to 1,000 user classes in less than 0.5 seconds. Moreover, the average execution time of MapReduce jobs obtained through simulations is within 14% of our bounds on average. Future work will validate the considered time bounds in real cloud clusters. Moreover, a distributed implementation of the optimization solver able to exploit the YARN hierarchical architecture will be developed.

Acknowledgement

The work of Marzieh Malekimajd has been supported by the European Commission grant no. FP7-ICT-2011-8-318484 (MODAClouds). Danilo Ardagna and Michele Ciavotta's work has been partially supported by the European Commission grant no. H2020-644869 (DICE). The simulations and numerical analyses have been performed under the Windows Azure Research Pass 2013 grant.

8. REFERENCES

[1] Capacity Scheduler. http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.
[2] Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2.
[3] Fair Scheduler. http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
[4] MapReduce: Simplified Data Processing on Large Clusters. http://research.google.com/archive/mapreduce.html.
[5] Microsoft Azure. http://azure.microsoft.com/en-us/services/hdinsight/.
[6] The digital universe in 2020. http://idcdocserv.com/1414.
[7] YARN Scheduler Load Simulator (SLS). http://hadoop.apache.org/docs/r2.3.0/hadoop-sls/SchedulerLoadSimulator.html.
[8] J. Anselmi, D. Ardagna, and M. Passacantando. Generalized Nash Equilibria for SaaS/PaaS Clouds. European Journal of Operational Research, 236(1):326–339, 2014.
[9] D. Ardagna, B. Panicucci, and M. Passacantando. Generalized Nash Equilibria for the Service Provisioning Problem in Cloud Systems. IEEE Transactions on Services Computing, 6(4):429–442, 2013.
[10] A. Castiglione, M. Gribaudo, M. Iacono, and F. Palmieri. Exploiting mean field analysis to model performances of big data architectures. Future Generation Computer Systems, 37:203–211, 2014.
[11] C. P. Chen and C.-Y. Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275:314–347, 2014.
[12] H. Herodotou, F. Dong, and S. Babu. No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics. In SOCC '11, pages 18:1–18:14, 2011.
[13] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR '11, pages 261–272, 2011.
[14] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi. Big data and its technical challenges. Commun. ACM, 57(7):86–94, 2014.
[15] K. Kambatla, G. Kollias, V. Kumar, and A. Grama. Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7):2561–2573, 2014.
[16] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon. Parallel data processing with MapReduce: A survey. SIGMOD Rec., 40(4):11–20, 2012.
[17] M. Lin, L. Zhang, A. Wierman, and J. Tan. Joint optimization of overlapping phases in MapReduce. SIGMETRICS Performance Evaluation Review, 41(3):16–18, 2013.
[18] M. Malekimajd, A. M. Rizzi, D. Ardagna, M. Ciavotta, M. Passacantando, and A. Movaghar. Optimal Capacity Allocation for executing MapReduce Jobs in Cloud Systems. Technical Report no. 2014.11, Politecnico di Milano, http://home.deib.polimi.it/ardagna/MapReduceTechReport2014-11.pdf.
[19] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2012.
[20] D. A. Menascé and S. Bardhan. Queuing Network Models to Predict the Completion Time of the Map Phase of MapReduce Jobs. In 38th International Computer Measurement Group Conference, 2012.
[21] K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In SIGMOD '10, pages 507–518, 2010.
[22] K. Morton, A. Friesen, M. Balazinska, and D. Grossman. Estimating the progress of MapReduce pipelines. In ICDE '10, pages 681–684, 2010.
[23] L. T. X. Phan, Z. Zhang, Q. Zheng, B. T. Loo, and I. Lee. An empirical analysis of scheduling techniques for real-time cloud-based data processing. In SOCA '11, pages 1–8, 2011.
[24] F. Tian and K. Chen. Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds. In CLOUD '11, pages 155–162, 2011.
[25] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In SOCC '13, pages 5:1–5:16, 2013.
[26] A. Verma, L. Cherkasova, and R. H. Campbell. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In ICAC '11, pages 235–244, 2011.
[27] A. Verma, L. Cherkasova, and R. H. Campbell. Resource Provisioning Framework for MapReduce Jobs with Performance Goals. In Middleware '11, pages 165–186, 2011.
[28] E. Vianna, G. Comarela, T. Pontes, J. M. Almeida, V. A. F. Almeida, K. Wilkinson, H. A. Kuno, and U. Dayal. Analytical Performance Models for MapReduce Workloads. International Journal of Parallel Programming, 41(4):495–525, 2013.
[29] F. Yan, L. Cherkasova, Z. Zhang, and E. Smirni. Heterogeneous cores for MapReduce processing: Opportunity or challenge? In NOMS '14, pages 1–4, 2014.
[30] Q. Zhang, Q. Zhu, M. F. Zhani, and R. Boutaba. Dynamic Service Placement in Geographically Distributed Clouds. In ICDCS '12, pages 526–535, 2012.
[31] Z. Zhang, L. Cherkasova, and B. T. Loo. Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing. In CloudDP '14, pages 1:1–1:6, 2014.
[32] Z. Zhang, L. Cherkasova, A. Verma, and B. T. Loo. Automated Profiling and Resource Management of Pig Programs for Meeting Service Level Objectives. In ICAC '12, pages 53–62, 2012.

Minimizing Interference and Maximizing Progress for Hadoop Virtual Machines

Wei Zhang*, Sundaresan Rajasekaran*, Shaohua Duan*, Timothy Wood* and Mingfa Zhu†
* Department of Computer Science, The George Washington University, Washington, D.C., USA
† Beihang University, Beijing, China

ABSTRACT

Virtualization promised to dramatically increase server utilization levels, yet many data centers are still only lightly loaded. In some ways, big data applications are an ideal fit for using this residual capacity to perform meaningful work, but the high level of interference between interactive and batch processing workloads currently prevents this from being a practical solution in virtualized environments. Further, the variable nature of spare capacity may make it difficult to meet big data application deadlines. In this work we propose two schedulers: one in the virtualization layer designed to minimize interference on high-priority interactive services, and one in the Hadoop framework that helps batch processing jobs meet their own performance deadlines. Our approach uses performance models to match Hadoop tasks to the servers that will benefit them the most, and deadline-aware scheduling to effectively order incoming jobs. We use admission control to meet deadlines even when resources are overloaded. The combination of these schedulers allows data center administrators to safely mix resource-intensive Hadoop jobs with latency-sensitive web applications, and still achieve predictable performance for both. We have implemented our system using Xen and Hadoop, and our evaluation shows that our schedulers allow a mixed cluster to reduce web response times by more than ten fold compared to the existing Xen Credit Scheduler, while meeting more Hadoop deadlines and lowering total task execution times by 6.5%.

Keywords: scheduling, virtualization, Map Reduce, interference, deadlines, admission control

1. INTRODUCTION

Virtualization has facilitated the growth of infrastructure cloud services by allowing a single server to be shared by multiple customers.
Dividing a server into multiple virtual machines (VMs) provides both a convenient management abstraction and resource boundaries between users. However, the performance isolation provided by virtualization software is not perfect, and interference between guest VMs remains a challenge. If the hypervisor does not enforce proper priorities among guests, it is easy for one virtual machine's performance to suffer due to another guest.

This article is an extended version of [24], which appeared in CCGrid 2014. Copyright is held by author/owner(s).

Despite the danger of interference, resource sharing through virtualization has been crucial for lowering the cost of cloud computing services. Multiplexing servers allows for higher average utilization of each machine, giving more profit for a given level of hardware expense. Yet the reality is that many data centers, even those employing virtualization, are still unable to fully utilize each server. This is due in part to fears that if a data center is kept fully utilized there will be no spare capacity if workloads rise, and in part due to the risk of VM interference hurting performance even if servers are left underloaded. In this paper, we first study the causes of interference through virtualization scheduler profiling. We observe that even when set to the lowest possible priority, big data VMs (e.g., Hadoop jobs) interrupt interactive VMs (e.g., web servers), increasing their time spent in the runnable queue, which hurts response times. We control and reduce VM CPU interference by introducing a new scheduling priority for background batch processing VMs, allowing them to run only when other VMs are not actively utilizing the CPU. Our changes in the VM scheduler improve the performance of interactive VMs, but at the cost of unpredictable Hadoop performance. To resolve this challenge, we implement a second scheduler within the Hadoop framework designed for hybrid clusters of dedicated and shared VMs that only use residual resources.
We find that when given the same available resources, different tasks will progress at different rates, motivating the need to intelligently match each Hadoop task to the appropriate dedicated or shared server. Our scheduler combines performance models that predict task affinity with knowledge of job deadlines to allow Hadoop to meet SLAs, despite variability in the amount of available resources. Together, these schedulers form the Minimal Interference Maximal Productivity (MIMP) system, which enhances both the hypervisor scheduler and the Hadoop job scheduler to better manage their performance. Our primary contributions include:

- A new priority level built into Xen's Credit Scheduler that prevents batch processing VMs from hurting interactive VM performance.
- Task affinity models that match each Hadoop task to the dedicated or shared VM that will provide it the most benefit.
- A deadline- and progress-aware Hadoop job scheduler that allocates resources to jobs in order to meet performance goals and maximize the efficiency of a hybrid cluster.
- An admission control mechanism which ensures high-priority jobs meet deadlines, even when a cluster is heavily overloaded.

We have implemented the proposed schedulers by modifying the Xen hypervisor and Hadoop scheduler.

Figure 1: Colocated Hadoop jobs degrade web application performance, despite using the Xen scheduler priority mechanisms. (CDF of TPC-W response time: alone, and with Pi or WordCount VMs at minimum or default weight.)

Our evaluation shows that MIMP can prevent nearly all interference on a web application, doubling its maximum throughput and providing nearly identical response times to when it is run alone. For a set of batch jobs, the algorithm can meet more deadlines than EDF (Earliest Deadline First), and reduces the total execution time by over four and a half CPU hours, with minimal impact on interactive VM performance. Our paper is structured as follows: Section 2 provides the motivation for our paper, and Section 3 provides the problem and system overview for our work. In Sections 4 and 5, we describe interactive VM scheduling in Xen and progress-aware deadline scheduling in Hadoop. Section 6 provides our evaluation using different benchmarks. We discuss related work in Section 7, and conclude in Section 8.

2. MAP REDUCE IN VIRTUALIZED CLUSTERS

Map Reduce is a popular framework for distributing data-intensive computation [6]. Hadoop is an open source implementation of Map Reduce developed by Yahoo. Users write a program that divides the work that must be performed into two main phases: Map and Reduce. The Map phase processes each piece of input data and generates some kind of intermediate value, which is in turn aggregated by the Reduce phase. In this paper we investigate how to run Map Reduce jobs in a hybrid cluster consisting of both dedicated and shared (also known as volunteer) nodes. This problem was first tackled by Clay et al., who described how to pick the appropriate number of shared nodes in order to maximize performance and minimize overall energy costs [5]. Like their work, we focus on scheduling and modeling the Map phase since this is generally the larger portion of the program, and is less prone to performance problems due to slow nodes. Our work extends their ideas both within the virtualization layer to prevent interference, and at the Map Reduce job scheduling level to ensure that multiple jobs can make the best use of a hybrid cluster and effectively meet deadlines. A key issue that has not yet been fully explored is how to prevent batch processing jobs such as Map Reduce from interfering with foreground workloads. Our results suggest that interference can be quite severe if the important performance metric is interactive latency as opposed to coarse-grained timing measures (e.g., the time to compile a Linux kernel). As a motivating experiment, we have measured the achieved throughput and response time when running the TPC-W online book store benchmark both alone and alongside a VM running Hadoop jobs. Our results in Figure 1 show that the response time of the web application can be dramatically increased when run with a Pi or WordCount (WC) job. This happens even when the Xen scheduler's parameters are tuned to give Hadoop the lowest possible weight (i.e., the lowest priority). However, the throughput of TPC-W remains similar, as does the amount of CPU that it consumes. Further, we find that if Hadoop is given a separate CPU from TPC-W, there is no interference at all. This suggests that the performance interference is due to poor CPU scheduling decisions, not IO interference.

Figure 2: TCT varies by job, and increases non-linearly as the web service consumes a larger quantity of CPU (out of 2 cores). (Normalized TCT of Pi and Sort vs. CPU utilization.)

A second major challenge when running in shared environments is that different Hadoop jobs are affected by limitations on available resources in different ways. Figure 2 shows that as the amount of resources consumed by a foreground interactive VM rises, the normalized task completion time (relative to Hadoop running alone) can increase significantly for some jobs. For example, Pi, a very CPU-intensive job, suffers more than Sort, which is IO-intensive.
As a result, the best performance will be achieved by carefully matching a Hadoop job to the servers that will allow it to make the most efficient progress.

3. PROBLEM AND SYSTEM OVERVIEW

This section presents the formal problem MIMP targets, and then gives an overview of the system.

3.1 Problem Statement

The scenario where we believe MIMP will provide the most benefit is in a hybrid cluster containing a mix of dedicated nodes (virtual or physical) and volunteer or shared nodes that use virtualization to run both interactive applications and Hadoop tasks. We assume that the interactive applications are higher priority than the Hadoop tasks, which is generally the case since users are directly impacted by slowdown of interactive services, but may be willing to wait for long-running batch processes. While we focus on web applications, the interactive applications could represent any latency-sensitive service such as a streaming video server or remote desktop application. Although we treat Hadoop jobs as lower priority, we still take into account their performance by assuming they arrive with a deadline by which time they must be complete. As discussed previously, we focus on the Map phase of Map Reduce, as this is generally more parallelizable and is less prone to straggler performance problems (i.e., a single slow reduce task can substantially hurt the total completion time). As in [5], we use dedicated servers to run both the shared Hadoop file system and all reduce tasks. We assume that the interactive applications running in the high-priority VMs have relatively low disk workloads, meaning that sharing the IO path with Hadoop tasks does not cause a resource bottleneck. While this is not true for some disk-intensive applications such as databases, for others it can be acceptable, particularly due to the increasing use of networked storage (e.g., Amazon's Elastic Block Store) rather than local disks. Given this type of cluster, a key question is how best to allocate the available capacity in order to maximize Hadoop job performance (i.e., minimize the number of deadline misses and the total job completion times) while minimizing the interference on the interactive services (i.e., minimizing the change in response time compared to running the web VMs alone).

Figure 3: The MI CPU scheduler only runs Hadoop VMs when others are blocked, so VM-2 immediately preempts VM-3 once it becomes ready. The MP Job Scheduler gathers resource availability information from all nodes and schedules jobs based on performance model results.

3.2 MIMP Overview

We have developed MIMP to tackle this pair of challenges. The system is composed of two scheduling components, as illustrated in Figure 3.

Minimal Interference CPU Scheduler: The MI CPU Scheduler tries to prevent lower-priority virtual machines from taking CPU time away from interactive VMs. We do this by modifying the Xen CPU scheduler to define a new priority level that will always be preempted if an interactive VM becomes runnable.

Maximal Productivity Job Scheduler: Next we modify the Hadoop Job scheduler to be aware of how available resources affect task completion time. The MP Scheduling system is composed of a training module that builds performance models, a monitoring system that measures residual capacity throughout the data center, and a scheduling algorithm. Our MP Scheduler combines this information to decide which available resources to assign to each incoming Hadoop Job to ensure it meets its deadline while making the most productive use of all available capacity.

4. VM SCHEDULING IN XEN

This section diagnoses the performance issues in the current Xen Credit scheduler when mixing latency-sensitive and computationally intensive virtual machines.
We then describe how we have enhanced the Xen scheduler to help minimize this interference.

4.1 Performance with the Xen Credit Scheduler

The Xen Credit scheduler is a non-preemptive weighted fair-share scheduler. As a VM runs, its VCPUs are dynamically assigned one of three priorities - over, under, or boost, ordered from lowest to highest. Each physical CPU has a local run queue for runnable VCPUs, and VMs are selected by their priority class. Every 30ms, a system-wide accounting thread updates the credits for each VCPU according to its weight share and resorts the queue if needed. If the credits for a VCPU are negative, Xen assigns over priority to this VCPU since it has consumed more than its share. If the credits are positive, it is assigned under priority. Every 10ms, Xen updates the credits of the currently running VCPU based on its running time. In order to improve a virtual machine's I/O performance, if a VCPU is woken up (e.g., because an IO request completes) and it has credits left, it will be given boost priority and immediately scheduled. After the boosted VCPU consumes a non-negligible amount of CPU resources, Xen resets its priority to under. As this is a weight-based scheduler, it primarily focuses on allocating coarse-grained shares of CPU to each virtual machine. The Boost mechanism is relied upon to improve the performance of interactive applications, but as shown previously, it has limited effect.

TPC-W (600 clients)   | Alone    | +WC (min weight)
Avg. Resp. Time       | 25 msec  | 175.5 msec
Avg. CPU Utilization  | 84.8%    | 91.1%
Running (sec)         | 650.9    | 656.4
Runnable (sec)        | 10.9     | 524.4
Blocked (sec)         | 192.4    | 52.2
Table 1: Xen Credit Scheduler statistics when running a web application alone or with a Word Count VM.

Table 1 shows how much time was spent in each scheduler state when a TPC-W VM is run either alone or with a VM running the Word Count Hadoop job that has been given the lowest possible scheduler weight. As was shown in Figure 1, this significantly affects TPC-W performance, raising average response time by seven times.
We find that the Credit Scheduler weight system does do a good job of ensuring that TPC-W gets the overall CPU time that it needs: the CPU utilization (out of 200% since it is a 2-core machine) and the time spent in the Running state are similar whether TPC-W is run alone or with Word Count. In fact, TPC-W actually gets more CPU time when run with Word Count, although the performance is substantially worse. While the overall CPU share is similar, the timeliness with which TPC-W is given the CPU becomes very poor when Word Count is also running. The time spent in the Runnable state (i.e., when TPC-W could be servicing requests) rises substantially, causing the delays that increase response time. This happens because Credit uses coarse-grained time accounting, which means that 1) at times TPC-W may be woken up to handle IO, but it is not able to interrupt Hadoop; and 2) at times Hadoop may obtain boost priority and interrupt TPC-W if it is at the beginning of an accounting interval and has not yet used up its quantum.

4.2 Minimal Interference CPU Scheduler

Our goal is to run processor- or data-intensive virtual machines in the background, without affecting the more important interactive services. Therefore, we have modified the Xen scheduler to define a new extra low priority class. Virtual machines of this class are always placed at the end of the Runnable queue, after any higher-priority VMs. We also adjust the Boost priority mechanism so that background VMs can never be boosted, and so that if a regular VM is woken up due to an I/O interrupt, it will always be able to preempt a background VM, regardless of its current priority (i.e., under or over). This scheduling algorithm minimizes the potential CPU interference between interactive and Hadoop virtual machines, but it can cause starvation for background VMs. To prevent this, we allow a period, p, and execution time, e, to be specified. If over p seconds the VM has not been in the Running state for e milliseconds, then its priority is raised from background to over. After it is scheduled for the specified time slice, it reverts back to background mode. We use this to ensure that Hadoop VMs do not become completely inaccessible via SSH, and so that they can continue to send heartbeat messages to the Hadoop job scheduler. While this mechanism is not necessary when running interactive VMs that typically leave the CPU idle part of the time, it can be important if MIMP is run either with CPU-intensive foreground tasks, or with a very large number of interactive VMs.

5. PROGRESS AWARE DEADLINE SCHEDULING IN HADOOP

A Hadoop Job is broken down into multiple tasks, which each perform processing on a small part of the total data set. When run on dedicated servers, the total job completion time can be reliably predicted based on the input data size and previously trained models [25, 17]. The challenge in MIMP is to understand how job completion times will change when map tasks are run on servers with variable amounts of spare capacity. Using this information, MIMP then instructs the Hadoop Job Tracker on how to allocate slots (i.e., available shared or dedicated workers) to each job.

Monitoring Cluster Resource Availability: MIMP monitors resource usage information on each node to help guide task placement and prevent overload. MIMP runs a monitoring agent on each dedicated and shared node, and sends periodic resource measurements to the centralized MP Job Scheduler component. MIMP tracks the CPU utilization and disk read and write rates of each virtual machine on each host. These resource measurements are then passed on to the modeling and task scheduling components as described in the following sections.

5.1 Modeling Background Hadoop Jobs

MIMP uses Task Completion Time models to predict the progress rate of different job types on a shared node with a given level of resources. As shown previously in Figure 2, each job needs its own task completion time model. The model is trained by running map tasks on shared nodes with different available CPU capacities.
This can either be done offline in advance, or the first set of tasks for a new job can be distributed to different nodes for measurement, and then a model can be generated and updated as tasks complete. Our current implementation assumes that all jobs have been trained in advance on nodes with a range of utilization levels. Once a job has been trained for one data input size, it can generally be easily scaled to accurately predict other data sizes [17].

Job Progress: The progress model for a job of type j is a function that predicts the task completion time on a shared node with residual capacity r. From Figure 2 we see that this relationship is highly non-linear, so we use a double exponential formula, exp2, provided by MATLAB's Non-linear Least Squares functionality:

    TCT_j(r) = a * e^(b*r) + c * e^(d*r)    (1)

where a, b, c, and d are the coefficients of the regression model trained for each job. The coefficients b and d represent the rate at which TCT_j(r) exponentially grows. In order to compare the progress that will be made by a job on an available slot, we use the normalized TCT:

    NormTCT_j(r) = TCT_j(r) / TCT_j(r_dedicated)    (2)

where the denominator represents the task completion time when running on a dedicated node. This allows MIMP to compare the relative speeds of different jobs.

Checking Deadlines: The task completion time model can then be used to determine if a job will be able to meet its deadline given its current slot allocation. MIMP tracks a resource vector, R, for each active job. The entry R_i represents the amount of resources available on worker slot i that this job has been allocated for use: 100% for an available dedicated slot, 0% for a slot assigned to a different job, or something in between for a shared slot allocated to this job. If there are currently t_remaining seconds until the job's deadline, then MIMP can check if it will meet its deadline using:

    CompletableTasks(j, R) = sum_{i=1}^{n_slot} t_remaining / TCT_j(R_i)    (3)

If CompletableTasks(j, R) is greater than n_tasks, the number of remaining tasks for the job, then it is on track to meet its deadline.
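Equations 1-3 can be transcribed directly as follows (a sketch; the coefficient values below are hypothetical illustrations, not fitted values from the paper):

```python
import math

def tct(r, a, b, c, d):
    """Equation 1: predicted task completion time on a slot whose
    residual capacity is r (1.0 = fully dedicated)."""
    return a * math.exp(b * r) + c * math.exp(d * r)

def norm_tct(r, coeffs, r_dedicated=1.0):
    """Equation 2: slowdown relative to a dedicated slot."""
    return tct(r, *coeffs) / tct(r_dedicated, *coeffs)

def completable_tasks(t_remaining, R, coeffs):
    """Equation 3: tasks finishable before the deadline, summed over
    the residual capacity R_i of each slot allocated to the job."""
    return sum(t_remaining / tct(r_i, *coeffs) for r_i in R)

# Hypothetical coefficients: negative b and d make TCT blow up as the
# residual capacity r shrinks, matching the trend of Figure 2.
coeffs = (120.0, -3.0, 30.0, -0.5)

# One dedicated slot (1.0) and two shared slots; 600s to the deadline.
on_track = completable_tasks(600.0, [1.0, 0.6, 0.3], coeffs) >= 40
```

A job with 40 remaining tasks and this slot allocation would be judged on track, since the sum of per-slot completion rates covers the remaining work before the deadline.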
Map Phase Completion Time: We can also obtain a direct prediction of the map phase completion time using:

    CompletionTime(j, R) = n_tasks * ( sum_{i=1}^{n_slot} TCT_j(R_i) ) / n_slot^2    (4)

which estimates the total map phase completion time based on the average TCT of each slot and the number of remaining tasks.

Data Node I/O: Hybrid clusters like the ones considered in MIMP are particularly prone to disk I/O bottlenecks since there may be a relatively small number of dedicated nodes acting as the data store. If too many I/O-intensive tasks are run simultaneously, task completion times may begin to rise [5]. To prevent this, we use a model to predict the I/O load incurred by starting a new map task. During MIMP's model training phase, we measure the read request rate sent to the data nodes by a dedicated worker. Since I/O accesses can be erratic during map tasks, we use the 90th percentile of the measured read rates to represent the I/O required by a single worker, per data node available. In order to calculate the read I/O load incurred by a new task on a shared worker, we use the normalized TCT from Equation 2 as a scaling factor:

    IO_j(r) = read90th_j / NormTCT_j(r)    (5)

to predict its I/O requirement. This can then be used to determine whether running the task will cause the data nodes to become overloaded, as described in the following section.

5.2 Progress Aware Earliest Deadline First

We now present two standard Hadoop job schedulers, and then discuss how we enhance these in MIMP so that it accounts for both deadlines and the relative benefit of assigning a worker to each job.

FIFO Scheduler: The simplest approach to scheduling Hadoop jobs is to service them in the order they arrive: all tasks for the first job are run until it finishes, then all tasks of the second job, and so on. Not surprisingly, this can lead to many missed deadlines since it has no concept of more or less urgent tasks to perform.
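The completion-time and I/O estimates of Equations 4 and 5 can be transcribed as follows. This is a sketch: the reading of Equation 4 as n_tasks spread over n_slot slots, each running at the average per-slot TCT, is reconstructed from the surrounding text, and all numbers below are hypothetical:

```python
def completion_time(n_tasks, R, tct_of):
    """Equation 4: map-phase completion time estimate -- n_tasks tasks
    spread over len(R) slots, each running at the average TCT."""
    n_slot = len(R)
    avg_tct = sum(tct_of(r) for r in R) / n_slot
    return n_tasks * avg_tct / n_slot

def io_load(read_90th, norm_tct_r):
    """Equation 5: read rate a new task imposes on the data nodes; a
    slower task (normalized TCT > 1) issues reads proportionally slower."""
    return read_90th / norm_tct_r

# Hypothetical numbers: 100 remaining tasks on two equal slots where a
# task takes 20s, and a 50 MB/s dedicated read rate scaled by a 2x slowdown.
eta = completion_time(100, [1.0, 1.0], lambda r: 20.0)
load = io_load(50.0, 2.0)
```

With two dedicated-speed slots the 100 tasks finish in 100 * 20 / 2 = 1000 seconds, and a task slowed by 2x imposes half the dedicated read rate on the data nodes.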
EDF Scheduler: Earliest Deadline First (EDF) is a well-known scheduling algorithm that always picks the job with the earliest deadline when a worker becomes available; that job will continue to utilize all workers until it finishes or a new job arrives with a smaller deadline. EDF is known to be optimal in terms of preventing deadline misses as long as the system is preemptive and the cluster is not over-utilized. In practice, Hadoop has somewhat coarse-grained preemption: each task runs to completion, but a job can be preempted between tasks. It is also difficult to predict whether a Hadoop cluster is over-utilized since tasks do not arrive

Figure 4: Admission Control Architecture.

on strict schedules as is typically assumed in Real-Time Operating Systems work. Despite this, we still expect EDF to perform well when scheduling jobs since it will organize them to ensure they will not miss deadlines.

MP Scheduler: Our Maximum Progress (MP) Scheduler uses the models described in the previous section to enhance the EDF scheduler. Whenever a worker slot becomes free, we determine which job to assign to it based on the following criteria. First, MP examines each job in the queue to determine whether it can meet its deadline with its currently allocated set of slots using Equation 3. If one or more jobs are predicted to miss their deadline, then MP allocates the slot to whichever of those jobs has the closest deadline and returns. If all jobs are currently able to meet their deadlines, MP considers each job in the queue and uses Equation 2 to calculate its normalized task completion time if assigned the resources of the free slot. It finds the job with the smallest NormTCT value, since that job is best matched for the available resources. Before assigning the slot, MP calculates the IO cost of running the selected job using Equation 5. If starting a new task of this type will cause any of the data nodes to become overloaded, then the job with the next highest NormTCT is considered, and so on. This algorithm ensures that the selected job is either currently unable to meet its deadline, or the job that will make the most progress with the slot, without causing the data nodes to become overloaded.

5.3 Admission Control

A high number of incoming jobs can cause the cluster resources to be overloaded. When cluster resources are over-utilized, naturally, due to resource contention, more jobs will miss their deadlines than usual.
The EDF scheduler has no mechanism to control resource overload; instead, it tries to schedule as many jobs as it can fit into the incoming job queue. This aggressive approach can not only make a new job miss its deadline, but can also cause other jobs in the queue to miss their deadlines as well. To prevent a cluster from becoming overloaded, we add an Admission Control mechanism to the MIMP scheduler. Admission Control maximizes the number of jobs that can meet their deadlines by accepting or rejecting each new incoming job based on whether or not executing it would overload the cluster's resources. We assume that jobs with earlier deadlines always have the highest priority. This means that if a new job arrives with an earlier deadline than some job already in the queue, it may be accepted even though this could cause the job in the queue to miss its deadline. A job will only be accepted into the queue if MIMP predicts it can meet its deadline with the available resources. We design the admission controller based on the following criteria: when a new job J_new with deadline Deadline_new is submitted, the controller finds the jobs J_1, J_2, ..., J_N in the job processing queue that have an earlier deadline than J_new. For instance, if J_1 and J_2 are the only jobs that have an earlier deadline than J_new, then the controller will evaluate the jobs J_1, J_2, and J_new. Recall that we assume the remaining jobs, those with later deadlines than J_new, will meet their deadlines. In order to accept J_new, MIMP must determine whether doing so will cause any higher priority job (i.e., one with an earlier deadline) to miss its deadline. The Admission Controller does this by estimating the processing time required by each higher priority job. To make a conservative estimate, we calculate each job's estimated completion time assuming it runs by itself on the cluster, using Equation 6.
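The admission decision can be sketched as below. This assumes our reading of Equations 6 and 7: a job's conservative completion time is the number of remaining tasks times the average per-slot TCT divided by the slot count, and a new job is admitted only if the summed completion times of all earlier-deadline jobs, plus its own, fit before its deadline. The dict-based job records and the `now` offset are illustrative assumptions.

```python
def jct(tct_per_slot, n_tasks_remaining):
    """Conservative job completion time, assuming the job runs alone
    (sketch of Equation 6): the average task time across slots, times
    the number of task 'waves' remaining:
        JCT = n_remaining * sum(TCT_j(r_i)) / n_slots**2
    """
    n_slots = len(tct_per_slot)
    return n_tasks_remaining * sum(tct_per_slot) / n_slots ** 2

def admit(new_job, queue, now):
    """Sketch of the Equation 7 acceptance test: accept new_job only if
    every job with an earlier deadline, plus new_job itself, can still
    complete before Deadline_new when their conservative JCTs are
    summed sequentially."""
    earlier = [j for j in queue if j["deadline"] <= new_job["deadline"]]
    total = sum(jct(j["tct_per_slot"], j["remaining"]) for j in earlier)
    total += jct(new_job["tct_per_slot"], new_job["remaining"])
    return now + total <= new_job["deadline"]
```

Because each JCT assumes the job runs alone, the summed estimate overstates the true load when jobs overlap, which is exactly the conservative bias the admission controller wants.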
In practice, MIMP will be progress-aware and will attempt to assign each job to its most efficient slots, so this is likely to be an overestimate; this makes the admission controller more conservative, reducing the likelihood of a missed deadline.

JCT(j, R) = n_{task remaining} · ( Σ_{i=1}^{n_{slot}} TCT_j(r_i) ) / n_{slot}²    (6)

Once we estimate the processing time, the Admission Controller accepts the new job if and only if all the jobs whose deadlines are earlier than Deadline_new can complete successfully. The acceptance condition for a new job is shown in Equation 7. If this condition is not met, there is a high probability that some jobs will miss their deadlines, and we thus reject J_new.

Deadline_new ≥ Σ_{j=1}^{new job} JCT(j, R)    (7)

Figure 4 shows the decision making process of our MIMP scheduler with Admission Control. First, when a new job is submitted, the admission controller predicts the new job's completion time based on the above criteria and then decides whether to accept or reject the job. If a job is accepted, it is placed into the job processing queue, where it waits to be scheduled. When the scheduler finds task slots that free up, it allocates those free slots to the jobs in the queue based on deadline-aware or progress-aware scheduling. Finally, once the free slots are allocated to the jobs, they are run on their assigned task trackers.

6. EVALUATION

In this section, we present evaluation results to validate the effectiveness of reducing interference using our Minimizing Interference scheduler. We then evaluate the accuracy and prediction

error of the TCT models and show the performance improvement as we progressively saturate the data nodes with I/O. We then show how the admission controller improves performance when the cluster is overloaded. We also describe details of our testbed and the three different job scheduling algorithms in the case study.

6.1 Setup - machines, benchmarks

For our tests we use Xen 4.2.1 with Linux 3.7, running on a heterogeneous cluster of Dell servers with Intel E5-2420 and Xeon X3450 CPUs, each with 16GB of RAM. The E5-2420 has six physical cores at 1.9GHz with 64KB of L1 cache and 256KB of L2 cache per core, and a shared 15MB L3 cache; the X3450 has four physical cores at 2.67GHz with 128KB of L1 cache and 1MB of L2 cache per core, and a shared 8MB L3 cache. The Hadoop version is 1.0.4. Our virtual cluster contains 13 physical servers: 7 servers run 4 VMs each, two for web servers with 1GB of RAM and 2 VCPUs each and another two for Hadoop with 4GB of RAM and 2 VCPUs each. Four servers run 6 dedicated Hadoop VMs (each with their own disk). Two more servers run web clients. The web server VMs are always pinned to shared CPU cores, and Hadoop VMs are pinned to either two dedicated or two shared CPU cores depending on the server they run on. Xen's Domain-0, which hosts drivers used by all VMs, is given the servers' remaining cores.

Benchmarks: We use both interactive and batch workloads. For transactional workloads, we use two applications: TPC-W, which models a three-tier online book store, and Micro Web App, a PHP/MySQL application that emulates a multi-tier application and allows the user to adjust the rate and type of requests to control the CPU computation and I/O activities performed on the test system. For batch workloads, we choose the following Hadoop jobs.
PiEstimator: estimates the value of Pi using 1 million points; WordCount: computes the frequencies of words in 15GB of data; Sort: sorts 18GB of data; Grep: finds matches of a randomly chosen regular expression in 6GB of data; TeraSort: samples the 1GB input data and sorts the data into a total order; Kmeans: clusters 6GB of numeric data. Both Kmeans and Grep are divided into two types of jobs.

6.2 Minimizing Interference Scheduler

We start our evaluation by studying how our Minimal Interference scheduler is able to provide greater performance isolation when mixing web and processor intensive tasks. We repeat a variation of our original motivating experiment, and adjust the number of TPC-W clients when running either Pi or Word Count Hadoop jobs on a shared server. As expected, Figure 5(a) shows that the response time when using Xen's default scheduler quickly becomes unmanageable, only supporting about 500 clients before interference causes the response time to rise over 100ms. In contrast, our MI scheduler provides performance almost equivalent to running TPC-W alone, allowing it to support twice the throughput before response time starts to rise. A closer look at the response time CDF in Figure 5(b) illustrates that MIMP incurs only a small overhead when there are 700 clients.

Figure 5: The MIMP scheduler provides response time almost equivalent to running a web application alone.

6.3 Task Affinity Models

In this section, we illustrate the accuracy of our task completion time models and how they guide slot allocation.

6.3.1 TCT models

Figure 6 shows the training data and model curves generated by MIMP (the green curve is obtained from our model). Each Hadoop VM has one core that is shared with a Micro Web App VM. We run
Figure 7: MIMP accurately predicts Map-phase completion time (MCT) and appropriately allocates slots to meet deadlines.

a set of Hadoop jobs across our cluster using a randomly generated web workload ranging from 10 to 100% CPU utilization for each shared node. The x-axis represents the CPU utilization of the web VM before each task is started; we normalize the measured task completion time by the average TCT when running the same type of task on a node with no web workload. These figures show the wide range of TCTs that are possible even for a fixed level of CPU availability. This variation occurs because the MI scheduler can give an unpredictable amount of CPU to the low priority Hadoop VM depending on fluctuations in the web workload.¹ Thus, it is quite difficult to make accurate predictions, although our models do still capture the overall trends. When we apply these models to our case study workload described in Section 6.6, we find that 57% of the time our models over-predict task completion time, and that the average over-prediction is 35%. The average under-prediction is 29%. This is good since we would prefer our model to over-predict task completion times, causing it to be more conservative, and thus less likely to miss deadlines.

¹ The variation is not simply related to differences in data node I/O levels, since even the Pi job (which does not make any storage accesses) sees a particularly high variation in TCT.

Figure 6: MIMP trains different models for each Hadoop job type. Each panel plots normalized TCT against web VM CPU utilization: (a) grep-search, (b) grep-sort, (c) Kmeans-Classification, (d) Kmeans-Iterator, (e) Pi, (f) Sort, (g) Terasort, (h) Word Count.

Table 2: NRMSE
Jobs        NRMSE    Jobs            NRMSE
Pi          7.74%    KmeansClass     8.61%
Sort        8.2%     KmeansIterator  9.53%
Terasort    8.17%    Grepsearch      8.26%
Wordcount   7.25%    Grepsort        9.64%

6.3.2 Total Map phase time prediction

The TCT models of MIMP are used to predict whether a job will meet its deadline given its current slot allocation. Figure 7 shows how the predictions change as slots are allocated to and removed from a job. We first start a Pi job at time 0 with a deadline of 700 seconds. Within 10 seconds, Pi has been allocated all of the available slots, so its predicted map phase completion time (MCT) quickly drops to about 370 seconds. At time 80 sec, a Sort job is started, causing the MIMP scheduler to divide the available slots between the two jobs. It takes slots away from Pi, but only enough to ensure that Sort will finish before its deadline. The predicted MCT of each job fluctuates as the number of slots it is given varies, but it remains accurate throughout the run.

6.3.3 Prediction Error

To evaluate the accuracy of the TCT models, we compare the error between the predicted task completion times and the actual task completion times. The accuracy is measured by the normalized root mean square error (NRMSE). We use Equation 8 to calculate the RMSE. Figure 8 shows the RMSE for the different Hadoop jobs.
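Both error metrics are simple to compute directly from the definitions of RMSE and its range-normalized variant; a self-contained sketch, with made-up sample values:

```python
import math

def rmse(obs, pred):
    """Root mean square error between observed and predicted task
    completion times (Equation 8)."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)

def nrmse(obs, pred):
    """RMSE normalized by the observed range (Equation 9), so errors
    are comparable across job types with very different TCT scales."""
    return rmse(obs, pred) / (max(obs) - min(obs))

# Illustrative observed/predicted TCTs (seconds); not measured data.
obs = [100.0, 140.0, 120.0, 160.0]
pred = [105.0, 138.0, 125.0, 150.0]
print(round(nrmse(obs, pred), 4))  # ≈ 0.1034
```

Normalizing by the observed range is what makes the per-job percentages in Table 2 comparable even though, for example, Sort tasks run far longer than Pi tasks.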
The solid bars represent the difference between the fastest and slowest task completion time, while the error bars represent the RMSE; this shows that the error of our predictor is small compared to the natural variation in completion times on different servers.

RMSE = sqrt( Σ_{i=1}^{n} (X_{obs,i} − X_{predict,i})² / n )    (8)

where X_{obs,i} is the observed value and X_{predict,i} is the predicted value at time/place i. We use Equation 9 to calculate the NRMSE. Table 2 shows the NRMSE values for the different Hadoop jobs.

NRMSE = RMSE / (X_{obs,max} − X_{obs,min})    (9)

To illustrate how prediction error affects the MIMP scheduler, we injected errors into the predictions. We ran a trace of Hadoop jobs (Pi, Terasort) on 4 shared nodes for 1 hour. Figure 9 shows that the total task completion time fluctuates only slightly when the inserted error is within 10%. The inserted error does not noticeably influence the scheduler, since Figure 8 shows that our TCT model prediction error is around 10%. However, when the error increases from 15% to 20%, the total task completion time increases significantly, from 1210 seconds to 1379 seconds.

Figure 8: TCT prediction error

Figure 9: Impact of TCT error

6.4 Data Node I/O Saturation

Allocating more shared worker nodes to a Hadoop job will only increase performance if the data nodes that serve that job are not overloaded. To evaluate this behavior, we measure the map phase completion time (MCT) when the number of available shared nodes is increased from 0 to 10; there are also three dedicated nodes that run map tasks and act as data nodes. Each shared node runs a web server that leaves 164% of the CPU available for the Hadoop VM. Figure 10 shows the impact of the data node IO bottleneck on the Pi and Grep jobs. We normalize the map phase completion time by the measured time without any shared nodes.
Initially, both job types perform similarly, but the Grep job soon sees diminishing returns as the data nodes become saturated. In contrast, Pi is an entirely CPU bound job, so it is able to scale optimally, i.e., ten

Figure 10: Data intensive Hadoop tasks see diminishing returns once the data nodes become saturated.

Figure 11: Meet, Miss, Reject jobs vs. Utilization

shared servers lower the completion time by a factor of nearly ten.

6.5 Admission Control

To evaluate how our admission controller affects performance when the cluster is overloaded, we varied the level of cluster overload and then compared the number of jobs meeting deadlines, missing deadlines, or being rejected among MIMP with admission control (MIMP w/AC), MIMP without admission control, and the EDF scheduler. For each cluster utilization level, we generate 30 minute traces using Pi and Terasort jobs. Figure 11 shows that when the cluster is slightly overloaded, the number of met deadlines is almost the same for MIMP w/AC, MIMP, and the EDF scheduler. However, when the cluster is highly overloaded, most of the jobs miss their deadlines under EDF and MIMP. Since MIMP w/AC can detect the overload and reject some jobs, it ensures that the other jobs in the queue will meet their deadlines. To show how our progress-aware MIMP scheduler affects TCT at varied cluster utilization, Figure 12 compares the average TCT for MIMP w/AC and the EDF scheduler while changing the cluster utilization. We can see that if the cluster is lightly loaded, for example when the utilization is 1.0, the average TCT of Terasort tasks with MIMP w/AC is less than with the EDF scheduler. This is because, with MIMP w/AC, all the jobs can meet their deadlines, so our progress-aware rule is used to select which jobs to run. For the same utilization, the average TCT of Pi tasks with MIMP w/AC is greater than with the EDF scheduler.
One reasonable explanation is that the TCT model for Pi predicts lower completion times than the TCT model for Terasort. At the start of the experiment, almost all slots in the scheduler are running Pi tasks. However, in order to keep the Terasort job from missing its deadline, the MIMP scheduler also chooses Terasort tasks to run. In addition, a Hadoop slot with low CPU utilization has a shorter scheduling time period than one with high CPU utilization. Therefore, a Terasort task has a greater chance than the Pi job of gaining a slot with low CPU utilization. Furthermore, as the utilization increases, the behavior of the MIMP scheduler becomes more similar to the EDF scheduler, and the deviation in TCT between the two types of scheduler decreases toward zero.

Figure 12: Average TCT vs. Utilization

Figure 13: TeraSort TCT CDF

Figures 13 and 14 show CDFs of the TCT for Terasort and Pi at 0.8 utilization. One can see that when the utilization is light, the MIMP scheduler tends to be progress-aware, and the TCTs of Terasort and Pi under the MIMP scheduler are shorter than under the EDF scheduler. In this case, the progress-aware scheduler provides only a modest benefit since the experimental cluster has only four slots, and only one of those slots provides a significant boost when Terasort jobs are scheduled there (as shown by the larger number of jobs finishing within 100 seconds for Terasort). We expect that a larger cluster with a more diverse set of servers and jobs would provide an even greater benefit.

6.6 Case study

To evaluate our overall system, we perform a large scale experiment where we run a trace of Hadoop jobs on a shared cluster and evaluate the performance of three different job scheduling algorithms. We use a total of 20 Hadoop VMs, each with two cores: 6 dedicated hosts, 6 with a light web workload (20-35% CPU utilization), 6 with a medium load (85-95%), and 2 that are highly loaded (130-170%).
Figure 14: PiEstimator TCT CDF

We generate these workloads based on our observations of the DIT (Division of Information Technology) and Wikipedia data traces, although we use a higher overall utilization level than was found in those traces since this puts more stress on

making intelligent scheduling decisions. The web workloads are generated using httperf clients connected to our Micro Web App benchmark. We generate a random Hadoop trace composed of our six representative job types. Jobs of each type arrive following a Poisson distribution; the mean inter-arrival period is used as the deadline for that type of job, with the exception of KMeans jobs, which we set to have no deadline. The job trace lasts 2.5 hours and contains 174 jobs in total.

Figure 15: Slot allocation for the first 10 minutes - MIMP vs. EDF

Table 3: Workload Performance Statistics
                   FIFO    EDF     MIMP
#Miss Deadline     67      2       1
Total TCT (h)      72.17   72.61   67.95
#Failed Jobs       17      0       0
#Failed Tasks      18      1       0
Avg Lateness (s)   217.8   6.2     7.9

Scheduler Comparison: Table 3 shows the performance statistics of each job scheduler when processing this trace. Unsurprisingly, FIFO performs very poorly, missing deadlines for 67 out of 174 jobs, with an additional 17 jobs failing to complete at all. The EDF scheduler performs much better, but still misses two jobs, with an average lateness of 6.2 seconds. The total task completion time (i.e., the sum of all successful task execution times) is 72.61 hours for EDF; FIFO's is slightly lower only because it has failed tasks which do not add to the total. MIMP provides the best performance, missing only one job deadline, by 7.9 seconds. Most importantly, it achieves this while using 4.66 hours less total execution time than EDF. This is possible because MIMP makes smarter decisions about which tasks to run on which nodes, better matching them to the available resources.

Job Distribution: To understand why MIMP improves performance, we now examine how each scheduler assigns workers to jobs. Figure 15(b) shows the number of slots assigned to each job during a 10 minute portion of the trace.
This shows that EDF assigns all task slots to whatever job has the earliest deadline, even though some slots may be better suited for a different job type. In contrast, MIMP tends to run multiple jobs at the same time, allocating the best fitting slots to each one. While this can result in longer job completion times, MIMP still ensures that all jobs will meet their deadlines, and it improves overall efficiency due to resource affinity.

Figure 16: Job Distribution on high vs. low load server

Figure 16 breaks down the percentage of tasks completed by the highly loaded and lightly loaded servers when using MIMP. Since each job has a different arrival period, some jobs (such as Pi) have more total tasks than infrequent jobs (such as K-means). However, the results still follow our expectations based on the models shown in Figure 6. For example, Pi and grep-search have particularly high normalized TCT when the web server utilization rises, so MIMP runs relatively fewer of those tasks on the highly loaded servers. In contrast, the completion time of Sort does not change much, so MIMP runs more of those tasks on the highly loaded servers.

7. RELATED WORK

Virtual Machine Scheduling: Several previous works propose to improve the efficiency of the Xen Credit CPU scheduler. For example, [3] proposes a modification to the scheduler that asynchronously assigns each virtual CPU to a physical CPU in order to reduce CPU sleep time. Lin et al. [12] developed VSched, which schedules batch workloads within interactive VMs without compromising the usability of interactive applications. The drawback of this system is that it was not designed to run on a cluster. Xi et al. use techniques from Real-Time scheduling to give stricter deadline guarantees to each virtual machine [19]. Other work has looked at avoiding interference between these tasks by careful VM placement [7, 21] or by dedicating resources [10].
Paragon [7] proposes a heterogeneous and interference-aware data center scheduler. The system prefers to assign applications to the heterogeneous hardware platform that the application can benefit from while incurring less interference with co-scheduled applications. MIMP extends our preliminary study [23], reducing interference through minor changes to the Xen scheduler and then using the residual resources for big data applications. To improve I/O performance, Xu et al. [20] propose the use of vTurbo cores that have a much smaller time-slice compared to normal cores, reducing the overall IRQ processing latency. Cheng et al. [2] improve I/O performance for Symmetric MultiProcessing VMs by dynamically migrating interrupts from a preempted VCPU to a running VCPU, thereby avoiding interrupt processing delays. Our current focus is on shared environments where disk I/O is not the bottleneck for interactive applications, but we view this as important future work.

Hadoop Scheduling & Modeling: Job scheduling in MapReduce environments has focused on topics like fairness [9, 22] and dynamic cluster sharing among users [16]. HybridMR [17] considered running Hadoop across mixed clusters composed of dedicated and virtual servers, but does not consider VM interference. Bu et al. [1] propose a new Hadoop scheduler based on the existing fair scheduler. They present an interference and locality-aware task scheduler for MapReduce in virtual clusters and design a task performance prediction model for an interference-aware policy. Morton et al. [14] provide a time-based progress indicator for a series

of MapReduce jobs, which can be used to predict job completion times. Polo et al. provide a system to dynamically adjust the number of slots provided for MapReduce jobs on each host to maximize the resource utilization of a cluster and to meet the deadlines of the jobs [15]. [18] decides the appropriate number of slots allocated to Map and Reduce based on the upper and lower bounds of batch workload completion time obtained from the history of job profiling. Our work is distinct from prior work in that we estimate the job completion time of batch jobs that are running in clusters with unpredictable resource availability due to other foreground applications. Previous works [5, 8, 11, 13] have shown heterogeneous cluster designs wherein a core set of dedicated nodes running batch jobs are complemented by residual resources from volunteer nodes, or in some cases by spot instances from EC2 [4]. The closest work to ours is by Clay et al. [5]. They present a system that determines the appropriate cluster size to harness the residual resources of under-utilized interactive nodes to meet user-specified deadlines and minimize cost and energy. Our work extends this by focusing on how groups of jobs should be scheduled across a shared cluster in order to both minimize interference and meet job deadlines.

8. CONCLUSIONS

Virtualization allows servers to be partitioned, but resource multiplexing can still lead to high levels of performance interference. This is especially true when mixing latency sensitive applications with data analytic tasks such as Hadoop jobs. We have designed MIMP, a Minimal Interference, Maximal Progress scheduling system that manages both VM CPU scheduling and Hadoop job scheduling to reduce interference and increase overall efficiency. MIMP works by exposing more information to both the Hadoop job scheduler and the Xen CPU scheduler. By giving these systems information about the priority of different VMs and the resources available on different servers, MIMP allows cluster utilization to be safely increased.
MIMP allows high priority web applications to achieve twice the throughput compared to the default Xen scheduler, with response times nearly identical to running the web application alone. Despite the increased variability this causes in Hadoop task completion times, MIMP is still able to meet more deadlines than an Earliest Deadline First scheduler, and it lowers the total execution time by nearly five hours in one of our experiments.

Acknowledgments: We thank the reviewers for their helpful suggestions. This work was supported in part by NSF grant CNS-1253575, National Natural Science Foundation of China Grants No. 613759 and 612329, and Beijing Natural Science Foundation Grant No. 412242.

9. REFERENCES

[1] X. Bu, J. Rao, and C.-z. Xu. Interference and Locality-aware Task Scheduling for MapReduce Applications in Virtual Clusters. In Proc. of HPDC, 2013.
[2] L. Cheng and C.-L. Wang. vBalance: Using interrupt load balance to improve I/O performance for SMP virtual machines. In Proc. of SOCC, 2012.
[3] T. Chia-Ying and L. Kang-Yuan. A Modified Priority Based CPU Scheduling Scheme for Virtualized Environment. Int. Journal of Hybrid Information Technology, 2013.
[4] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz. See spot run: Using spot instances for MapReduce workflows. In Proc. of HotCloud, 2010.
[5] R. B. Clay, Z. Shen, and X. Ma. Accelerating Batch Analytics with Residual Resources from Interactive Clouds. In Proc. of MASCOTS, 2013.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51, 2008.
[7] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware Scheduling for Heterogeneous Datacenters. In Proc. of ASPLOS, 2013.
[8] H. Herodotou, F. Dong, and S. Babu. No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics. In Proc. of SOCC, 2011.
[9] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proc. of SOSP, 2009.
[10] E. Keller, J. Szefer, J. Rexford, and R. B. Lee.
NoHype: Virtualized cloud infrastructure without the virtualization. In Proc. of ISCA, 2010.
[11] G. Lee, B.-G. Chun, and H. Katz. Heterogeneity-aware resource allocation and scheduling in the cloud. In Proc. of HotCloud, 2011.
[12] B. Lin and P. A. Dinda. VSched: Mixing batch and interactive virtual machines using periodic real-time scheduling. In Proc. of Supercomputing, 2005.
[13] H. Lin, X. Ma, J. Archuleta, W.-c. Feng, M. Gardner, and Z. Zhang. MOON: MapReduce On Opportunistic eNvironments. In Proc. of HPDC, 2010.
[14] K. Morton, A. Friesen, M. Balazinska, and D. Grossman. Estimating the progress of MapReduce pipelines. In Proc. of ICDE, 2010.
[15] J. Polo, C. Castillo, D. Carrera, Y. Becerra, I. Whalley, M. Steinder, J. Torres, and E. Ayguadé. Resource-aware adaptive scheduling for MapReduce clusters. In Proc. of Middleware, 2011.
[16] T. Sandholm and K. Lai. Dynamic proportional share scheduling in Hadoop. In Proc. of JSSPP, 2010.
[17] B. Sharma, T. Wood, and C. R. Das. HybridMR: A Hierarchical MapReduce Scheduler for Hybrid Data Centers. In Proc. of ICDCS, 2013.
[18] A. Verma, L. Cherkasova, V. Kumar, and R. Campbell. Deadline-based workload management for MapReduce environments: Pieces of the performance puzzle. In Proc. of NOMS, 2012.
[19] S. Xi, J. Wilson, C. Lu, and C. Gill. RT-Xen: Towards real-time hypervisor scheduling in Xen. In Proc. of EMSOFT, 2011.
[20] C. Xu, S. Gamage, H. Lu, R. Kompella, and D. Xu. vTurbo: Accelerating virtual machine I/O processing using designated turbo-sliced cores. In Proc. of USENIX ATC, 2013.
[21] Y. Xu, Z. Musgrave, B. Noble, and M. Bailey. Bobtail: Avoiding Long Tails in the Cloud. In Proc. of NSDI, 2013.
[22] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proc. of EuroSys, 2010.
[23] W. Zhang, S. Rajasekaran, and T. Wood. Big Data in the Background: Maximizing Productivity while Minimizing Virtual Machine Interference. In Proc. of Workshop on Architectures and Systems for Big Data, 2013.
[24] W. Zhang, S.
Rajasekaran, T. Wood, and M. Zhu. MIMP: Deadline and interference aware scheduling of Hadoop virtual machines. In Proc. of CCGrid, 2014.
[25] Z. Zhang, L. Cherkasova, A. Verma, and B. T. Loo. Performance Modeling and Optimization of Deadline-Driven Pig Programs. ACM TAAS, 8, 2013.