Virtualizing Power Distribution in Datacenters


Di Wang, Chuangang Ren, Anand Sivasubramaniam
Department of Computer Science and Engineering
The Pennsylvania State University, University Park, PA
{dw518,cyr516,anand}@cse.psu.edu

ABSTRACT

Power infrastructure contributes to a significant portion of datacenter expenditures. Overbooking this infrastructure, by provisioning it for a high percentile of the needs rather than for occasional peaks, is becoming more attractive. There exist several computing knobs to cap the power draw within such under-provisioned capacity. Recently, batteries and other energy storage devices have been proposed as a complementary alternative to these knobs; when decentralized (or hierarchically placed), they can temporarily take the load to suppress power peaks propagating up the hierarchy. With aggressive under-provisioning, the power hierarchy becomes as central a datacenter resource as other computing resources, making it imperative to carefully allocate, isolate and manage this resource (including batteries) across applications. Towards this goal, we present vpower, a software system to virtualize power distribution. vpower includes mechanisms and policies to provide a virtual power hierarchy for each application. It leverages traditional computing knobs as well as batteries to apportion and manage the infrastructure between co-existing applications in the hierarchy. vpower allows applications to specify their power needs, performs admission control and placement, dynamically monitors power usage, and enforces allocations for fairness and system efficiency. Using several datacenter applications, and a 2-level power hierarchy prototype containing batteries at both levels, we demonstrate the effectiveness of vpower when working in an under-provisioned power infrastructure, using the right computing knobs and the right batteries at the right time. Results show over 5% improved system utilization and scaleout for vpower's over-booking, and between 1-8% better application performance than traditional power-capping control knobs. It also ensures isolation between applications competing for power.

Categories and Subject Descriptors
C. [Computer Systems Organization]: General

Keywords
Datacenters, Power Management, Batteries

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA '13, Tel-Aviv, Israel. Copyright 2013 ACM.

1. INTRODUCTION

Power consumption of datacenters has tremendous implications on both operating (op-ex) and capital (cap-ex) expenditures. While energy consumption has been the target of many prior studies to reduce op-ex, there is growing interest (e.g. [17, 1, 7, 3, 11]) in reducing cap-ex of power provisioning. Power infrastructure/distribution costs can contribute to over a third of a large datacenter's amortized monthly expenditures, with each provisioned watt costing between $1- [11, 18]. Consequently, it has become increasingly attractive to under-provision this resource. With aggressive under-provisioning, the power hierarchy (from incoming utility lines, switchgear and UPS units, down to Power Distribution Units (PDUs) and even server power supplies) becomes a precious resource that needs to be carefully allocated and managed across applications to ensure safe operation. As with any other hardware resource, virtualizing this infrastructure can hide the under-provisioning (scarcity) from applications, providing each with the illusion of a dedicated/insulated hierarchy, while extracting the maximum value from each provisioned watt. This paper presents vpower, a software system to create virtual power hierarchies on this valuable physical resource.

We draw an analogy with physical memory, a valuable and under-provisioned resource. Even single applications need capacities larger than the available physical memory. Further, co-existing applications that are time and/or space multiplexed on that system need to share this limited capacity.
Virtual memory was introduced to overbook its capacity while insulating applications from each other. It provides applications with the illusion of a large address space without their having to deal with physical memory management explicitly. Dynamic memory management (using malloc()/free()) is at best employed at the virtual memory interface by the application. The underlying system/OS treats these calls more as hints in managing physical memory. We believe that as power infrastructure is aggressively under-provisioned, it becomes as central as other computing resources, mandating a similar management strategy. However, virtualizing power infrastructure poses unique challenges and opportunities:

Power distribution uses a hierarchical network spanning several servers, and the levels in the hierarchy that are under-provisioned play a key role in how the resource has to be managed. A power over-draw in one part of the hierarchy can imply a reduction in power capacity for another part. Further, with many datacenter applications spanning multiple servers, we need the ability to create, manage and police virtual sub-hierarchies (unlike virtualizing physical memory of a single machine).

Power demand management is system controlled by modulating other resources, mainly the CPU through scheduling, migration, consolidation and power state (DVFS) modulation. Application participation/hints, such as a palloc(), to guide such power management have not been explored in depth, particularly for applications spanning several servers. In many conventional applications, since memory is an explicit abstraction, their direct involvement in giving such hints seems more natural. On the other hand, conventional applications are oblivious of power as a resource, making the ability to give such hints less apparent. However, we believe that with many datacenter applications going through extensive profiling, debugging and tuning before they go into production [3], there may be adequate opportunities to gauge their power needs. For instance, [3] details the pre-production profiling of Google datacenter applications on diverse server hardware before they are deployed on production systems, and one could envision collecting their power demands during such pre-production profiling phases. These can subsequently be used as hints in production mode. More importantly, since these are hints, over or under stating these needs, or even stating them at a coarse temporal resolution, may still be better than not specifying them at all.

Demand modulation using computing knobs (scheduling, DVFS, migration, etc.) is no longer the only (indirect) means for controlling this resource. Supply side knobs that leverage batteries and other energy storage devices placed in one or more layers of the hierarchy have recently been proposed [17, 1, 39] to facilitate aggressive power provisioning. A battery in a given level can temporarily boost the power capacity of the sub-hierarchy under it, even if the capacity of the line above this sub-hierarchy cannot sustain this power draw.
With potentially several hierarchically distributed batteries, the choice of which to use to temporarily boost a sub-hierarchy's capacity, and a fair allocation of this capacity across applications, poses interesting trade-offs.

Provisioning virtual power hierarchies requires (i) understanding an application's power needs, (ii) admitting and co-locating the right set of applications, (iii) accounting for their power usage, and (iv) policing their execution to ensure fairness and effectiveness of power usage. This paper presents the design, implementation and evaluation of vpower, incorporating mechanisms and policies for these functionalities. It provides an optional and flexible interface (palloc()) for applications to give hints of their power needs, at different resolutions and accuracies. vpower has three main components: the admission control/placement manager, the accounting manager and the enforcer. The admission control/placement manager regulates admission of applications, places them on appropriate servers and allocates power hierarchies. The accounting manager continuously tracks the power draw of different applications, which can span several servers, with respect to their allocations. It is the enforcer's job to (i) ensure no emergencies/violations during runtime (i.e., power on a line does not exceed provisioned capacity), (ii) maximize resource utilization, (iii) insulate the usage across applications, and (iv) ensure fairness. It uses a combination of computing knobs, such as DVFS and migration, and battery-based power boosts for these purposes. Simultaneously, it also incorporates intelligence to source power from the right batteries in the hierarchy at the right time.

We prototype vpower on a 2-level hierarchy, with batteries in both layers, containing 8 servers. Using representative datacenter benchmarks, we study the impact of different policies on system consolidation, performance and scaleout, fairness, and effectiveness in suppressing power emergencies. vpower provides 1-8% better performance than conventional approaches that use only computing knobs, on a rack that is 5% under-provisioned.
It improves utilization by over 5% compared to an approach which conservatively schedules based on peak load requirements. It also effectively isolates misbehaving applications, ensuring other applications are not penalized. To our knowledge, this is the first paper that attempts to virtualize the power hierarchy of a datacenter spanning several servers, incorporating a combination of computing knobs and multi-level batteries.

2. BACKGROUND AND RELATED WORK

Power Hierarchy: Figure 1 shows a typical 4-level datacenter power hierarchy. At the highest layer, switchgear scales down the voltage of utility power, which then passes through centralized UPS units. The UPS units get another feed from backup diesel generators. Power then goes through Power Distribution Units (PDUs) to different racks, and is then distributed through power strips to individual servers, which all have their own power supplies.

Figure 1: Power Hierarchy

Power Capping: When under-provisioning a layer, we need to ensure the resulting power limits (caps) are obeyed to ensure safety. Demand Response for power capping [] has primarily leveraged three broad computing knobs: (i) Spatial Allocation/Consolidation [5, 31, 36, 4, 9] of workloads when they arrive, exploiting statistical multiplexing of the power profiles of different applications (enabling more aggressive power provisioning than would have been possible if we considered the peaks of all in unison); (ii) Migration [, 34] of these workloads spatially to a different part of the datacenter (with more headroom), or even to a different datacenter; and (iii) Temporally Shaping the demand by power state modulation (throttling) of hardware devices [3, 4, 13, 3, 5, ] and/or workload scheduling [8, 6].

Leveraging Energy Storage/Batteries: Some datacenters use distributed UPS units (server level in Google [14], rack level in Microsoft [7] and Facebook [1]), though their motivation is to avoid double-conversion losses (and not power capping). There are recent proposals [17, 15, 1, 39] to provision extra battery capacities in different layers of the hierarchy.
Whenever the power needs in the sub-hierarchy hit safety limits (termed emergencies in this paper), the batteries can step in to temporarily source/supplement the power, so that the layers higher up do not notice the emergency. These studies have demonstrated that power availability is not compromised by keeping a reserve capacity that is sufficient to fail-over to diesel generators upon power outages. Computing knobs are still needed to handle long emergency durations that are beyond the capacities of the provisioned storage. These studies have shown that provisioning additional battery capacities, whether placed centrally [15] or distributed at server/rack levels [17, 1], is economically attractive, with the consequent cap-ex savings of under-provisioning the power infrastructure more than offsetting additional battery capacity costs. Further, a recent work [39] has also shown that placing energy storage devices (not just batteries) in multiple layers of the hierarchy, rather than single-level placement, can be even more economically attractive. Battery capacity (over-)provisioning/placement is orthogonal to this paper, and we leverage all this prior work, considering these energy storage devices, at different layers, as part of the power infrastructure, towards creating virtual power sub-hierarchies. In fact, lower capacities at one or more levels make it even more important to manage these devices intelligently, further motivating this work. The tradeoff between sharing and interference in using these batteries across applications is an important consideration for building virtual hierarchies.

Power Virtualization: Unlike other resources, virtualization of power has not been studied extensively. Relevant work in this area is on accounting and control of power/energy usage of workloads in a co-hosted environment (e.g. [1, 35]) or on a multicore server [33], and coordinating (possibly conflicting) power management decisions of individual virtual machines co-hosted on a physical server [9]. However, our work (i) virtualizes entire power hierarchies and not just a single server, and (ii) leverages multi-level energy storage (batteries) to achieve this goal (not just computing knobs). As we will see, sharing and isolation of battery usage across applications, especially in a hierarchical setting, introduces several new considerations in the power hierarchy management. Virtualizing and managing batteries across applications has been mainly studied (e.g. [4, 41]) in the context of embedded and mobile devices, where the primary goal is to prolong battery runtime rather than power capping.

3. VPOWER SYSTEM ARCHITECTURE

Goals and Problem Statement: We work with an abstraction of the power hierarchy which has an imposed power cap $P^{i}_{cap}$ at one or more levels $L_i$, where $L_i$ ranges from $L_1$ (server level) to $L_L$ (datacenter level). Our focus here is only on mechanisms and policies to adhere to these imposed $P^{i}_{cap}$ values, and a detailed evaluation of the trade-offs with different $P^{i}_{cap}$ values (examined in prior work [15, 1, 17, 39]) is orthogonal to this paper. Specifically, the implementation and evaluation platform in this paper uses a 2-level hierarchy, $L_1$ (server) and $L_2$ (rack), with associated power caps $P^{1}_{cap}$ and $P^{2}_{cap}$, but the discussions and results can generalize to more levels. At each level, we assume the presence of energy storage; our experimental platform uses lead-acid batteries, which are currently the most common in datacenter UPS units. The goal is to ensure that the power draw never exceeds $P^{i}_{cap}$ at all corresponding levels $L_i$ in the hierarchy, using a combination of (i) demand-response computing knobs (workload placement, consolidation, migration, scheduling, power state modulation (DVFS states), etc.), and (ii) batteries with maximum power $P^{i}_{bmax}$ and energy capacity $E^{i}_{bmax}$ available at each level $L_i$ in the hierarchy, which can temporarily step in to provide the extra power needed to handle the needs beyond $P^{i}_{cap}$. Table 1 summarizes these notations. This paper explores application interfaces, software mechanisms and policies to attain this goal when hosting multiple applications that share this power infrastructure. In the process, we need to adhere to application SLAs, meet these SLAs with the minimum amount of resources (i.e., maximize consolidation and utilization), and ensure fairness in how this power is allocated and shared between applications.

Notation | Description
$L_i$ | Power sub-hierarchy
$P^{i}_{cap}$ | Power cap of $L_i$
$P^{i}_{t}$ | Power draw of $L_i$ at time t
$P^{req}_{t}$ | Requested power from an application for time t
$P^{i}_{bmax}$ | Power capacity of a battery at $L_i$
$E^{i}_{bmax}$ | Energy capacity of a battery at $L_i$
Table 1: Some Common Notations

Overview of System Architecture: Towards this goal, we implement a software-based virtual power hierarchy (vpower) (from the datacenter at the root, to the servers) for each application, within which it can safely execute, insulated from any power-related emergencies arising from co-existing applications, even when their aggregated peak demand can exceed the power cap in any level. The power hierarchy that is being virtualized includes the capacities of all the power equipment (switchgear, transformers, UPS units, PDUs, and individual power outlets) from the root, all the way down to individual servers running this application. In addition, it also virtualizes the battery capacities of each layer. While one could conservatively book/reserve for the peak power demands of applications, and admit only those whose collective peak demand fits within the power caps $P^{i}_{cap}$, overbooking can improve system utilization, since simultaneous and sustained peak demands from all applications may be less common. vpower has to: admit the right set of applications (admission control described in section 4.2.1); place/co-locate them with the right/synergistic mix of applications in the sub-hierarchy that would lead to fewer power emergencies (placement described in section 4.2.1); and suppress any power violations, if and when they occur, from propagating higher up the hierarchy using both computing knobs and the right batteries (enforcer described in section 4.2.3). An accounting manager needs to track the dynamic power consumption of different applications, as described in section 4.2.2, to provide up-to-date information for the enforcer to ensure fairness. As with any other shared resource management, vpower can leverage application-level power related information that can be explicitly provided using a palloc() interface described in section 4.1.
The interfaces to the power infrastructure hardware that can be exploited by our system are shown in Table 2. Apart from interfaces to monitor the instantaneous power draw on any line in the hierarchy (power()), there are also interfaces to batteries in each level to monitor their state-of-charge (soc()), and control their charge and discharge rates. Programmable power electronics circuitry can control the latter, but since our experimental setup does not provide this functionality, our evaluations account for such control capability in the management decisions. In addition, the system also uses hardware/kernel/VM interfaces to change DVFS states, modulate scheduling and migration decisions, etc.

Interface | Description
power(...) | returns instantaneous power draw of given line
soc(...) | returns state of charge of a battery
charge(...) | charges battery at specified rate
discharge(...) | discharges battery at specified rate
Table 2: Hardware Interfaces to Power Infrastructure

Our system architecture is pictorially depicted in Figure 2 for a 2-level hierarchy with server and rack level batteries, and power caps to be enforced at each of these two levels. There are two main software components: (i) one running as a driver within each server to initiate server level control knobs (DVFS, scheduling, etc.); and (ii) another running as middleware on a dedicated server, which implements the management/control policies and appropriately interfaces with the server drivers. The latter also interfaces with the power distribution network to monitor (measure power draw, battery SoC, etc.) and control their operation (sourcing from batteries, switching off outlets, etc.). The next section gives details on the design choices and implementation details for vpower.

Figure 2: vpower Architecture (interfaces shown: palloc(), DVFS(), migrate(), soc(), power(), charge(), discharge())

4. VPOWER MECHANISMS AND POLICIES

4.1 The palloc() Interface

As with any other resource, providing detailed and accurate information about an application's power profile can help manage it better, especially in aggressively under-provisioned settings. There are two dimensions to providing this information (how much, and over what duration), and our application interface is accordingly specified as

palloc(<power,interval>,<power,interval>,...)

where the arguments are tuples specifying anticipated application power needs during different time intervals over its execution. Ideally one would like to provide accurate power needs at fine temporal resolution, as depicted in the power profile graphs of Figure 3. Since many datacenter applications are long running services (e.g., web searching, memcached, etc.) and/or periodic/repetitive workloads (e.g., web crawling), and many of them go through profiling and fine-tuning phases before going into production mode [3], one may be able to obtain detailed and accurate power needs through profiling, barring execution and data dependency vagaries. Further, prior work on load predictability (time-of-day behavior in web services, flash crowd behavior for media services, etc.) may be useful for these specifications, as is prior work on phase characterization [19] to make such requests ahead of their need. However, getting accurate and fine-grained information (depicted as Exact) may not always be feasible, and we accommodate several loose specifications (Table 3) both in requesting power and in the temporal durations (possibly not even needing to specify durations), as below.

 | No Time | Entire Execn. | Every 6 secs. | Every Instant
Max. Power | PMax | PMax-entire | PMax-6 | Exact
9th Per. Power | P9 | P9-entire | P9-6 | -
Avg. Power | PAvg | | |
Min. Power | PMin | | |
Table 3: Example palloc() Specifications considered in evaluations.

Along the temporal dimension, one need not even give any time information (as in PMax, PMin, PAvg, P9) if jobs are long running without much variance. It is also possible to estimate broadly defined execution phases (as in the MapReduce profile shown in Figure 3, where the power draws are quite different between map and reduce phases) and accordingly specify the intervals. In the interest of a uniform comparison across workloads, we introduce a range of pre-determined interval sizes between the Every Instant and No Time extremes, and consider representative intervals of entire application duration (PMax-entire, P9-entire) and 6 seconds (PMax-6, P9-6) in our evaluations. Along the power dimension, one could consider a wide range of values starting from the peak (PMax) (which can be conservative depending on the time interval it is specified for), a 9th percentile (P9) of this peak (less conservative), an average over the interval (PAvg) or even as low as the minimum power (PMin). Note that our interface does not preclude an application from specifying any power value, over any particular duration. However, our system ensures that regardless of what the application specifies (which is treated more as a hint), it will insulate applications from each other. By giving bad hints, either intentionally or unintentionally, an application can only hurt itself.
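
The loose specifications of Table 3 could be derived from a profiled power trace along the lines of the sketch below. This is only an illustration, not the paper's implementation: the function names `percentile` and `palloc_spec`, the one-sample-per-second trace, and the use of `None` for "no time information" are all assumptions, and the percentile is left as a parameter.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty sequence."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def palloc_spec(trace, level="max", interval=None):
    """Build a list of (power, interval) tuples, the argument shape of palloc().

    trace    : profiled power samples in watts, one per second (assumption)
    level    : "max", "avg", "min", or a number read as a percentile
    interval : None for a single tuple without time information,
               otherwise the number of samples covered by each tuple
    """
    if level == "max":
        summarize = max
    elif level == "avg":
        summarize = mean
    elif level == "min":
        summarize = min
    else:
        summarize = lambda chunk: percentile(chunk, float(level))

    if interval is None:
        return [(summarize(trace), None)]
    chunks = [trace[i:i + interval] for i in range(0, len(trace), interval)]
    return [(summarize(chunk), len(chunk)) for chunk in chunks]
```

Coarser intervals give fewer tuples and a more conservative request for the "max" level, which mirrors the trade-off between the columns of Table 3.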
As an example of a bad hint, when an application specifies values higher than PMax, this comes at the cost of our system possibly not admitting this application when power budgets are tight. At the other end, when an application requests very low power needs, and then misbehaves by consuming more power than it initially specified, our enforcer will step in and penalize it.

4.2 The Middleware

The middleware is the core software component for allocating and controlling power across applications. It consists of an admission control and placement manager, an accounting manager and a power enforcer. We next explain their goals, design tradeoffs and implementation details.

4.2.1 Admission Control and Placement Manager

Goal: Based on an application's palloc() hints, this manager evaluates whether it can accommodate these needs without violating existing allocations, and if so where in the hierarchy it should be placed (i.e., co-location with others). The shared power infrastructure (even if the computing load is spread across different physical servers), and shared energy storage (batteries) at multiple layers in the hierarchy, introduce additional considerations to this problem compared to approaches that only consider computing resources.

Design Tradeoffs and Discussions: Note that the net power draw of a hierarchy can be temporarily boosted beyond its provisioned capacity by as much as the sum of the power draws that can be sustained by all the batteries (distributed and hierarchical) within that hierarchy. Our admission control policies are therefore based on our reliance on these batteries to achieve power capping:

Conservative Policy: In this policy, batteries are not taken into consideration for admitting applications. Batteries may still step in during execution to suppress peaks if they arise, which may be rare because of the conservative admission control. Hence, it may be better to use less stringent power needs in palloc() when using this policy.

Moderate Policy: This allows a certain percentage (e.g., %) of battery power ($P_{battad}$) and energy ($E_{battad}$) capacity to be available in addition to normal line capacity when admitting applications. It can admit more, with possible subsequent emergencies when batteries run out.

Aggressive Policy: This is the same as Moderate, with 100% of the capacity used for admission control (this still leaves reserve capacity to ensure power availability mandates upon outages [16, 15]). It is better to specify more stringent needs in palloc() (e.g., PMax or P9) in conjunction with this policy, to lessen emergency occurrences.

Once admitted, we need to decide where to locate/co-locate this application. With distributed batteries across the hierarchy, the degree of balancing/unbalancing their usage serves as the main criterion when exploring this question. This can be captured by the correlations (High, Low and Anti) of the power profile of this application with those in the sub-hierarchy/server where it is being considered. Co-location effectiveness is also a function of the admission control policy. Co-location of anti-correlated workloads may be a better option when conservative admission control is used, since the latter is less reliant on batteries, and the anti-correlation would lessen the possibility of power emergencies. At the other end, co-location of correlated workloads may be better for an aggressive admission control policy (which may have more emergencies) since it aggregates (batches together) the discharging of batteries, giving longer time windows for charging, similar to how unbalancing of load creates more opportunities for server shutdown in [31].
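
The three admission policies above differ only in how much battery power they count on top of the line capacity. The sketch below illustrates that idea for a single, time-invariant power request; it is a simplification under stated assumptions, not the paper's algorithm. The `Node` class, the function names, the leaf-to-root traversal, and in particular the 0.5 fraction used for the Moderate policy are hypothetical, and the energy constraint is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Fraction of battery power counted at admission time.
# 0.0 and 1.0 follow the Conservative and Aggressive policies;
# the Moderate value is a placeholder assumption.
BATTERY_FRACTION = {"conservative": 0.0, "moderate": 0.5, "aggressive": 1.0}

@dataclass
class Node:
    p_cap: float                     # provisioned power cap of this level (W)
    p_bmax: float                    # battery power capacity at this level (W)
    allocated: float = 0.0           # power already promised to applications (W)
    parent: Optional["Node"] = None  # next level up; None at the root

def can_admit(leaf: Node, p_req: float, policy: str) -> bool:
    """Check the power condition on every level from the leaf up to the root."""
    fraction = BATTERY_FRACTION[policy]
    node = leaf
    while node is not None:
        budget = node.p_cap + fraction * node.p_bmax
        if node.allocated + p_req > budget:
            return False
        node = node.parent
    return True

def admit(leaf: Node, p_req: float, policy: str) -> bool:
    """Admit the request and record the allocation if every level has room."""
    if not can_admit(leaf, p_req, policy):
        return False
    node = leaf
    while node is not None:
        node.allocated += p_req
        node = node.parent
    return True
```

With a 250 W server cap and a 50 W server battery, a second request that pushes the total to 270 W is rejected under the conservative policy but accepted under the aggressive one.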
The design choices along these two dimensions are qualitatively summarized in Table 4, which indicates the priority order of correlation degrees for placement under a given admission control policy. We will show experimental results to corroborate these choices.

Implementation: When an application specifies its power needs via palloc($P^{req}_{t}$, t), the admission control algorithm checks two aspects of this requirement, power ($P^{req}_{t}$) and energy ($P^{req}_{t} \cdot t$), and the application is admitted only if both can be met. For applications which only specify power (e.g., PMax, P9, PAvg, PMin) due to the lack of time information, we deal with the energy constraint on a best effort basis and simply assume it is met for admission. These checks are recursively made starting from the root. For each sub-hierarchy $L_i$, a check is made to ensure: (i) for every time interval t, the total allocated power ($P^{i}_{alloc,t}$) for existing applications plus the new power needs ($P^{req}_{t}$) from the requesting application is less than or equal to the sum of the provisioned power cap of this sub-hierarchy ($P^{i}_{cap}$) and the amount of power from batteries ($P^{i}_{battad}$) dedicated for admission control (as per the above 3 policies), i.e., $P^{i}_{alloc,t} + P^{req}_{t} \le P^{i}_{cap} + P^{i}_{battad}$; (ii) the total allocated energy for existing applications ($E^{i}_{alloc}$) plus the new application's energy needs ($\sum_{t} (P^{req}_{t} \cdot t)$) is less than or equal to the sum of the maximum energy from the outlet ($E^{i}_{outlet}$) and the energy from batteries ($E^{i}_{battad}$) committed to admission control, i.e., $E^{i}_{alloc} + \sum_{t} (P^{req}_{t} \cdot t) \le E^{i}_{outlet} + E^{i}_{battad}$. Different amounts of power/energy from batteries represent the aggressiveness of admission control policies as described above. Once admitted to a sub-hierarchy, we place an application based on the choices from Table 4.

 | Conservative | Moderate | Aggressive
Battery power | $P_{battad} = 0$ | $P_{battad}$ = a fraction of $P_{bmax}$ | $P_{battad} = P_{bmax}$
Battery energy | $E_{battad} = 0$ | $E_{battad}$ = a fraction of $E_{bmax}$ | $E_{battad} = E_{bmax}$
High Corr. | (3) | (3) | (1)
Low Corr. | (2) | (2) | (2)
Anti-Corr. | (1) | (1) | (3)
Table 4: Placement choice for a given admission control policy. Priority order is (1), (2) and then (3).

4.2.2 Accounting Manager

Goal: It dynamically tracks an application's power with respect to its allocation across its servers.

Design Tradeoffs and Discussions: We adopt a simple credit based accounting mechanism [4] to track the difference between power reservation and actual consumption for each application. An application is given positive credits if its consumption ($P_{t}$) is below its reservation ($P^{req}_{t}$), while credits are subtracted if its power goes above the reservation. However, to prevent misbehaving applications from continuously banking credits without using them, a bound is enforced on the amount of credits that an application can bank (e.g., the bound can be the battery capacity allocated to an application). Similarly, to prevent misbehaving applications from continuously discharging the battery, a bound is enforced on the amount of credits that can be in debt.

Implementation: For each application, we maintain a data structure called PCB (Power Control Block), which contains its power needs, accumulated credits, assigned server(s), etc., and the credits are updated periodically. Outlet power is sampled for each server at a granularity of a second, which suffices if only one application is running on that server. Within a server, one could use apportioning techniques between co-existing applications using system metrics for power accounting as in [, 35, 33], though outlet level server metering suffices in our evaluations, which place applications on distinct servers.

4.2.3 Power Enforcer

Goal: Despite admission control, there may be power emergencies during runtime due to under-estimates of application needs and/or aggressive over-booking as explained earlier. The enforcer uses different computing demand-response knobs (DVFS/clock throttling and migration) and batteries in different layers of the hierarchy to handle these emergencies. Specifically, which knob(s) to use for different applications and which batteries to employ in a hierarchical setting are some considerations in the design of the enforcer.
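
The credit-based accounting described for the accounting manager can be sketched as below. This is an illustrative reading rather than the paper's PCB implementation: the class name `CreditAccount`, the watt-second credit unit, the one-second default sampling period, and the `at_debt_limit` flag are assumptions.

```python
class CreditAccount:
    """Tracks reservation minus consumption for one application."""

    def __init__(self, reserved_watts, max_credit, max_debt):
        self.reserved = reserved_watts  # power requested via palloc() (W)
        self.max_credit = max_credit    # bound on banked credits (W*s)
        self.max_debt = max_debt        # bound on credits in debt (W*s)
        self.credits = 0.0

    def update(self, measured_watts, dt=1.0):
        """Apply one power sample covering dt seconds and return the balance."""
        self.credits += (self.reserved - measured_watts) * dt
        # Clamp so that credits can be neither banked nor owed without limit.
        self.credits = max(-self.max_debt, min(self.max_credit, self.credits))
        return self.credits

    @property
    def at_debt_limit(self):
        """True once the application has exhausted the debt it may run up."""
        return self.credits <= -self.max_debt
```

An enforcer could consult `at_debt_limit`, or simply compare balances across applications, when deciding which application to throttle or migrate first.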
The enforcer's goal is to meet application performance SLAs, as well as maximize system utilization, while ensuring fairness between applications in the choice of knobs (performance detrimental computing knobs versus use of batteries).

Design Tradeoffs and Discussions: Consider any two hierarchy levels $L_i$ and $L_{i-1,j}$, with the latter having j = 1..n components (and hence its 2-dimensional representation) directly connected to $L_i$. We use the following notations: the provisioned peak power (power cap) of the outlet that level $L_i$ can draw from is $P^{i}_{cap}$; the required aggregate power draw by level $L_i$ is $P^{i}_{t} = \sum_{j=1}^{n} P^{i-1,j}_{t}$ at time t; each component $L_{i-1,j}$ can draw power from its own battery with energy capacity $E^{i-1,j}_{bmax}$ and maximum discharge/charge power $P^{i-1,j}_{bmax}$, and some or all of the parent battery at $L_i$ depending on the capacity of the provisioned line between $L_{i-1,j}$ and $L_i$. One or more of the n + 1 (n at $L_{i-1}$ and 1 at $L_i$) batteries can be used to shave all or part of the power violation at $L_i$.

Lemma. No matter which strategy is taken to discharge batteries (i.e., which battery to use), a given peak $P^{i}_{t}$ exceeding $P^{i}_{cap}$ at $L_i$ can be shaved by batteries in the hierarchy if the following three conditions are satisfied: (A) $(P^{i}_{t} - P^{i}_{cap}) \le \min(P^{i-1,j}_{t}, P^{i-1,j}_{bmax}), \forall j$; (B) a battery is discharged only when $P^{i}_{t} > P^{i}_{cap}$; and (C) discharging decisions of all batteries are coordinated (i.e., the aggregate power drawn from all n + 1 batteries at instant t is at most $(P^{i}_{t} - P^{i}_{cap})$).

This lemma implies that all batteries in the hierarchy can be treated as one large battery placed at $L_i$ provided the three conditions are obeyed. Condition (A) says that the violation at $L_i$ can be shaved by removing the contribution from any child component $L_{i-1,j}$ by sourcing that component from its local battery. Hence, any one of the batteries at $L_{i-1}$ can be used to suppress the violation. Condition (B) says that the battery at $L_{i-1,j}$ is used only to address the violation at the parent $L_i$, and not to suppress a peak violation that happens at $L_{i-1,j}$ itself (i.e., only the parent is under-provisioned and not each child). These, together with perfectly coordinated control of all child batteries to shave this peak (Condition C), give the simplistic illusion of a centralized larger battery of energy capacity $E_{bmax} = \sum_{j=1}^{n} E^{i-1,j}_{bmax} + E^{i}_{bmax}$ and power capacity $P_{bmax} = \sum_{j=1}^{n} P^{i-1,j}_{bmax} + P^{i}_{bmax}$ at $L_i$. The proof is given in our technical report [38]. However, in practice, these conditions may not hold. We thus need to carefully choose the right battery to handle the emergencies, as described below, to lessen the probability of violating these conditions. For clarity, we discuss these issues assuming $L_i$ (parent) is a rack, and each child $L_{i-1,j}$ is an individual server.
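
The lemma's conditions can be checked numerically for one parent and its children, as in the sketch below. It is a hedged illustration only: the function names are hypothetical, condition (A) is read as "violation at most the minimum of each child's draw and battery power", and condition (B) is folded into (C), since with no violation the allowed aggregate discharge is zero.

```python
def violation(child_draws, parent_cap):
    """Power by which the parent's aggregate draw exceeds its cap (0 if none)."""
    return max(0, sum(child_draws) - parent_cap)

def condition_a(child_draws, child_bmax, parent_cap):
    """(A): any single child battery alone could absorb the whole violation."""
    v = violation(child_draws, parent_cap)
    return all(v <= min(draw, bmax) for draw, bmax in zip(child_draws, child_bmax))

def condition_c(discharges, child_draws, parent_cap):
    """(C), which also covers (B): total discharge never exceeds the violation."""
    return sum(discharges) <= violation(child_draws, parent_cap)

def virtual_battery(child_bmax, child_ebmax, parent_bmax, parent_ebmax):
    """Power and energy capacity of the single equivalent battery at the parent."""
    return sum(child_bmax) + parent_bmax, sum(child_ebmax) + parent_ebmax
```

For example, four servers drawing 300 W each under a 1100 W rack cap give a 100 W violation, which each 150 W server battery could cover alone; under an 800 W cap the 400 W violation breaks condition (A).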
Parent's violation is larger than power consumption of m children (i.e., $(P^{i}_{t} - P^{i}_{cap}) > \sum_{j=1}^{m} P^{i-1,j}_{t}$): If some policy schedules an imbalanced, lower workload on m servers than on the other (n - m) servers, and has drained the batteries of these n - m other servers and the rack battery because of prior usage, then the power violation at the parent cannot be handled by batteries alone. Even if these m servers draw power completely from their local batteries, the reduction will not suffice to address the emergency at the parent. The problem gets accentuated when m decreases, especially as we move towards the root. To address this concern, it is better for policies to (i) balance the load across servers to increase $P^{i-1,j}_{t}$ for under-utilized servers, and (ii) in cases of load imbalance, use up the batteries at the servers which have less power demanding applications first before going to others.

Parent's violation is larger than power suppliable by m children batteries (i.e., $(P^{i}_{t} - P^{i}_{cap}) > \sum_{j=1}^{m} P^{i-1,j}_{bmax}$): Consider a case where a policy leaves residual capacities in only these m batteries after prior usage, and has already drained the other n - m + 1 batteries (including the rack battery). In this case, the residual capacities of these m server batteries are not sufficient to handle the rack violation (again the problem accentuates when m decreases as we move to the root). To address this concern, it is better to (i) use batteries evenly at the servers, and (ii) save the rack battery for such power violations, since it is typically provisioned with larger power/energy capacity to handle a potentially larger load.

Child is itself under-provisioned (i.e., $P^{i-1,j}_{t} > P^{i-1,j}_{cap}$ for some j): With aggressive under-provisioning deeper in the hierarchy, a child battery may have been used to shave its own peak, rather than the peak of a parent, thereby violating Condition B. Subsequently, when called upon to suppress the peak of a parent, there may not be residual capacity.
Since a battery can be useful to suppress peaks from propagating higher in the hierarchy and not in the reverse direction, we use the following general guidelines in our policies: (i) reserve some capacity for suppressing emergencies at its own level, and (ii) use higher level batteries for the higher level violations as much as possible (provided lower levels stay within their budget).

Implementation: Note that palloc() from applications, in combination with our accounting manager, helps track the positive/negative credits accumulated by each application. Our runtime enforcer takes these credits into consideration when employing the computing and battery knobs in apportioning the emergency suppression mechanisms across applications. Such proportional power allocation (whether it be in the computing knobs or in the power from batteries) is similar to proportional fairness studied in other resources [37]. The enforcer prioritizes the order of employing knobs based on the impact on performance (e.g., migration can be relatively costly as studied in [17]), as well as the duration and stringency of the emergency. When the estimated duration of violations exceeds a threshold, the power enforcer will start to migrate applications with the least accumulated credits to destination hierarchies with sufficient slack. For smaller emergency durations, it uses local DVFS and battery knobs. Based on the guidelines of hierarchical battery usage discussed above, we use the hints coming from palloc() to implement the enforcer as follows: (i) reserve certain local battery capacity at each server if any local power violations are anticipated; (ii) use local batteries for smaller power violations, saving the shared higher level batteries for larger ones, and supplement them with DVFS if batteries alone cannot handle the need; and (iii) as far as possible, source from batteries of servers with low anticipated future demand.
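The knob ordering described above can be sketched as a small planning function. This is a minimal sketch under stated assumptions, not the enforcer's actual code: the function name, the two thresholds (`migrate_after_s`, `small_violation_w`), and the fallback from the rack battery to local batteries for large violations are all hypothetical. `local_avail_w` is assumed to already exclude any capacity reserved for anticipated local violations.

```python
from typing import List, Tuple


def choose_knobs(violation_w: float, est_duration_s: float,
                 local_avail_w: float, rack_avail_w: float,
                 migrate_after_s: float = 60.0,
                 small_violation_w: float = 50.0) -> List[Tuple[str, float]]:
    """Return an ordered plan of (knob, watts) covering the violation."""
    if violation_w <= 0:
        return []
    # Long emergencies: migration is costly but sustainable.
    if est_duration_s > migrate_after_s:
        return [("migrate", violation_w)]
    # Short emergencies: local batteries for small violations,
    # the shared rack battery for larger ones.
    if violation_w <= small_violation_w:
        order = ["local_battery", "rack_battery"]
    else:
        order = ["rack_battery", "local_battery"]
    avail = {"local_battery": local_avail_w, "rack_battery": rack_avail_w}
    plan, remaining = [], violation_w
    for knob in order:
        take = min(remaining, avail[knob])
        if take > 0:
            plan.append((knob, take))
            remaining -= take
    # DVFS supplements the batteries only when they cannot cover the need.
    if remaining > 0:
        plan.append(("dvfs", remaining))
    return plan


plan_small = choose_knobs(30, 10, 40, 100)
plan_large = choose_knobs(200, 10, 40, 100)
plan_long = choose_knobs(30, 600, 40, 100)
```

In a fuller version, the migration branch would pick the applications with the least accumulated credits, and the local battery share would be split across servers according to their anticipated future demand.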
When there is slack in power usage, batteries are re-charged in the following order: (i) batteries without sufficient charge on servers estimated to have local power violations are given first priority; (ii) batteries with lower state of charge are given higher priority; (iii) batteries on servers estimated to have low power demand (hence unlikely to incur power violations) are given the least priority.

5. EVALUATION

5.1 Experimental Setup

We evaluate vpower on a scaled-down prototype using a cluster (rack) of n = 8 servers. The face-plate rating of these servers is 45W, idle power is around 1W and the peak power that we can push them to across our workloads is 3W. The dynamic power consumption can be modulated with several DVFS states and clock throttling states. The Power Driver at each server changes power states using the MSR registers. Each server is equipped with a UPS unit which serves as the server level battery. A larger capacity UPS is used as the rack-level battery. We consider a 4-minute battery per server, of which we leave a residual capacity of 3 minutes for availability guarantees. The UPS is able to report its load and remaining battery runtime. By switching off the incoming power to a UPS unit over the network, we create the illusion of a power outage so that the UPS sources power from its batteries. This is a conservative approach compared to a more elaborate one that has power electronics circuitry to instantaneously source only the additional power needs (excess over the cap) from batteries, with the remainder coming from outlet power. Since our experimental setup does not provide this functionality, our evaluations account for such control capability in the management decisions. vpower can leverage the server battery for enforcing server level caps, and can use one or more server batteries (perhaps taking turns to extend the duration) and/or the rack battery for enforcing rack level caps. We use another cluster as the destination for migrating workloads. All our applications are hosted as VMs, with Red-Hat Linux 5.5, under Xen. A separate machine runs our middleware, which implements the algorithms and sends appropriate throttling and VM migration commands to the server power driver.

5.2 Workloads

We consider a suite of 8 representative datacenter applications (Table 5), covering both user-facing (interactive) and batch workloads.

Table 5: Workloads
Workload      Type          Runtime (Secs)
YCSB          user-facing   587
MediaServer   user-facing   37
Memcached     user-facing   1
WebSearch     user-facing   69
MapReduce     batch         365
WebCrawling   batch         37
GPU           batch         1
VirusScan     batch         888

Interactive applications include the Yahoo! Cloud Serving Benchmark (YCSB) [8], an in-memory key-value store Memcached benchmark [26], and two other applications (MediaServer, WebSearch) from Cloudsuite [12].
Batch applications include a WebCrawling benchmark from Cloudsuite, a Hadoop MapReduce application (word count, used in several analytics applications), a GPU application in CUDA implementing the Black-Scholes financial model using an NVIDIA card, and a virus scanner (VirusScan). The last two are not amenable to migration from the server where they are executing. Figure 3 shows the power profiles on a representative server running these applications, and their runtime is given in Table 5.

[Figure 3: Application power profiles for YCSB, MediaServer, Memcached, WebSearch, MapReduce, GPU, VirusScan and WebCrawling. For clarity, only first 35 seconds of execution is shown.]

5.3 Evaluation Metrics

We consider the following metrics from the system and application perspectives:

Power violations: the number of times that a power cap is violated at any of the two levels. Note that in our complete system, the enforcer will ensure no violations. We use this metric to compare the pros and cons of the admission control/placement policies in isolation, by removing the power enforcer in those experiments. In practice, the extent (magnitude and duration) of the violations should be tracked, but in the interest of clarity when presenting results we find that the number of power violations is a reasonable proxy for comparisons.

Degree of consolidation: the number of applications that can be co-located on the same server/rack without exceeding the specified power caps. Consolidation can extract more value from existing compute and power infrastructure, and is also attractive from the viewpoint of lowering energy consumption/costs.

Scale-out: the number of instances to which an application can be replicated to improve its throughput. This metric is more useful when the goal is to accelerate the performance of a single application (e.g., Memcached) as opposed to consolidation of disparate applications.

Performance Degradation: the percentage degradation (of response time, throughput, completion time, etc., depending on the application) of running an application under a power cap with respect to the same experiment running without a power cap.
Fairness: a measure of how the system treats co-existing applications from the performance viewpoint when meeting the power budgets. Rather than a single numerical metric, our results will clearly depict the differences between policies in penalizing applications.

5.4 Impact of Admission Control Policies

We begin by evaluating the three admission control policies - Conservative (II), Moderate (III) and Aggressive (IV) - and compare them with two baseline schemes: Baseline-power (I), which admits applications as long as adding their peak power consumption (PMax) does not violate the caps (i.e., it assumes no statistical multiplexing or dynamic modulation knobs and is thus extremely conservative), and Baseline-utilization (V), which admits applications considering only their average utilization (which is somewhat representative of how consolidation is currently conducted, without regard to any power caps in the hierarchy, and is consequently very aggressive). As explained in Section 4.2.1, the effectiveness of these policies depends on the specifications/hints that an application can give, and we consider the design space for the specifications given earlier in Table 3. In the following experiments, we set a rack level power cap of 1W (approximately half of the maximum power to which we can push our rack). To isolate the impact of admission control, we remove the power enforcer in these experiments (which can lead to power violations depending on the aggressiveness of admission control), and study the resulting consolidation and scale-out effectiveness.

Consolidation. Admission control restricts the consolidation degree (the number of applications in a rack) with the possible benefit of fewer power emergencies. To study these trade-offs, we conduct a stress-test where we try to place

as many of the eight applications (as allowed by the admission control policy) in the rack, and examine the resulting violations at the rack level. In reality, the choice for co-locating applications on a rack or a server would depend on additional issues such as performance interference, storage locality, security, etc., over and beyond power caps. However, we ignore such considerations, mainly to isolate the impact of the power cap at the rack level, and allow up to 8 applications to co-exist in the same rack, but place each application on a separate server, i.e., there is no server level interference, but there could be rack level interference.

[Figure 4: Consolidation Results for different Admission Control policies and Power Need specifications. Bars depict consolidation degree (left y-axis) and points on line depict number of power violations (right y-axis) of rack power cap set at 1W. I: Baseline-power, II: Conservative, III: Moderate, IV: Aggressive, V: Baseline-utilization. Horizontal line represents consolidation degree (= 7) possible with Exact power specification without any power violations.]

Figure 4 shows the trade-offs between consolidation degree and its consequences on power violations for each of the three admission control policies, comparing them with the two baseline extremes. Results are shown for the application power specifications outlined earlier in Table 3. In these graphs, we also show, as a horizontal line, the maximum possible consolidation that can be attained (which is 7) without having any violations (and without requiring either batteries or computing knobs at runtime). From these results, we make the following observations:

A conservative admission control policy (II), which does not consider batteries, is comparable to Baseline-power (I), which provisions for the peak, and degenerates to the latter when the applications use the PMax specification (Fig. 4 (d)).
Unless applications grossly under-estimate their power (PMin in Figure 4 (a)), this policy is not preferable. At the other end, the aggressive policy (IV) achieves the same consolidation degree as Baseline-utilization (in the cases where a time interval is not specified).

As long as the consolidation degree is less than or equal to 7, there are no violations, even when there is no dynamic enforcement, i.e., statistical multiplexing of workload profiles suffices to keep the power draw within the cap. This is the case for all 3 policies in all the specifications which have a duration component (either entire or 6 seconds in Figure 4 (e-h)).

It is only when durations (even the execution time of the application) are not specified that the moderate and aggressive policies start overbooking the power infrastructure, which can potentially lead to power violations - these would be suppressed at runtime by the enforcer with either battery or computing knobs. While this may appear counter-intuitive (i.e., no time component should lead to less flexibility in multiplexing needs and thereby lower consolidation), recall that there are two criteria for admission control - power and energy. Even though power multiplexing may be better with temporal information, note that when time durations are not specified, the energy criterion is always assumed to be met (Section 4.2.1) from batteries, allowing more applications to be co-located.

Scale-Out. Instead of co-locating disparate applications within a given infrastructure capacity, a datacenter may only be interested in accelerating the performance of a single application with additional instances for scale-out. We conduct similar experiments by creating more instances of the Memcached server workload within the rack as allowed by the admission control policies. Figure 5 (a) shows a representative result for the P9 specification, which reiterates the observations made in the consolidation studies. With more aggressive admission control, we can add more instances, with the number of memcached servers supported by the rack going to 8 for the Aggressive (IV) policy.
Violations also increase, but we will shortly show that these can be effectively managed without significant performance degradation when we introduce the enforcer. Without any runtime enforcement, one can achieve a scale-out to at most 5 servers (horizontal line) if we are constrained by the power cap.

[Figure 5: Memcached Scaleout with P9. I: Baseline-power, II: Conservative, III: Moderate, IV: Aggressive, V: Baseline-utilization. In (a), the horizontal line represents scale-out degree (= 5) using Exact power specification without any power violations, bars depict scale-out degree, and points on line represent number of power violations. Panel (b) plots throughput (MB/s) for No Power Cap, DVFS and vpower.]

5.5 Impact of Placement Policies

We conduct experiments to study the impact of the policies when placing two applications - WebCrawling and MediaServer - in the same rack, with each application placed on its own server. These two exhibit more sinusoidal behavior in their power profiles, helping us to study the influence of cross-correlations. To capture different cross-correlations, we simply vary the starting time of these applications, which lets us capture a wide range of multiplexing possibilities (correlations). The P9 interface is used to specify the power needs of these applications, and the enforcer is in place to avoid violations at the potential cost of performance degradation. As explained in Section 4.2.1, placement choices go hand-in-hand with admission control,

and we consider the interactions of the correlation factor with the aggressiveness of admission control, as depicted in Table 6. Note that the power caps need to be changed in order to ensure the two applications are admitted in all the admission control schemes, i.e., the power cap is increased by the amount of battery capacity as you move from aggressive (right column, where the power cap is 3W) to conservative (left column, where the power cap is 5W) in Table 6. Consequently, one should not compare the performance degradations across columns. Rather, what is important is the trend within each column, and how that trend changes when we move from aggressive to conservative. As we can see, the columns to the left prefer an anti-correlated co-location of workloads, while the aggressive admission control prefers a more correlated placement. These observations validate our choice of placement priorities based on the admission control policies, as discussed earlier.

Table 6: Performance degradation of placing WebCrawling + MediaServer in the same rack with different cross-correlations and admission control policies. Tuples (WebCrawling, MediaServer) show % degradation w.r.t. running the same experiment without a power cap.
              Conservative   Moderate   Aggressive
Rack Cap      5W             46W        3W
High corr.    (1,1)          (8,6)      (15,14)
Low corr.     (,)            (,)        (,18)
Anti-corr.    (,)            (,)        (5,)

5.6 Effectiveness of Enforcer

Performance. Results in Figures 4 and 5 (a) showed the impact of the admission control policies without the enforcer in place, which can lead to power violations in some of the cases. We now reinstate the enforcer to suppress these violations, and show the resulting application degradation for those two sets of experiments with the P9 specification in Figure 6 (a) and Figure 5 (b) respectively.

[Figure 6: P9 specification in the Figure 4 experiment with Enforcer. (a) Performance degradation (%) per application under DVFS and vpower. (b) Power profile (sourcing and capping): aggregate power demand, rack power cap, and the shares met by utility, rack battery, server battery and throttling.]

As can be seen, while the degradation is non-zero in most applications for the consolidation experiment (Figure 6 (a)), it is still significantly better than a DVFS-only approach (in which Oracle-based best DVFS states are chosen to adhere to the power caps). Degradation with vpower is between 15% and 69% lower than in a DVFS-only approach (1%-8% performance improvement) for the more power hungry applications (VirusScan draws relatively low power). The effectiveness of the enforcer can be explained with Figure 6 (b), which shows how it meets the aggregate power demand of all 8 applications - through normal power (utility), one or more batteries, and DVFS throttling when necessary. With vpower, DVFS is employed mainly when batteries do not suffice (from 7 seconds onwards). There are periods in-between (e.g., 1- seconds) when some applications reach their credit limits, and DVFS is used to enforce their debt, as seen in a small top portion of the curve in this region. Server batteries are the first choice for small amplitude violations, while the rack battery is used for most large amplitude violations. The choice of which battery to use, and the fairness in meeting application demands differently (through batteries or throttling), is explained in more detail later in this section. In the shown zoomed-in 35 second time window, there is little opportunity for re-charging the batteries, though such re-charging does happen with power slack from 37 seconds (not shown in Figure 6 (b)).

Similarly, the scale-out experiment for Memcached in Figure 5 (b) shows vpower giving throughput close to the uncapped case, and a value that is 5%-75% higher than the throughput of a DVFS-only option. Note that for aggressive policies (IV and V), the throughput of DVFS-only enforcement actually drops below the throughput of less aggressive policies (I, II and III), while the throughput with vpower actually increases despite the power cap.
This shows that vpower can help applications such as Memcached achieve better scale-out capabilities when hosted in aggressively under-provisioned power infrastructure.

Battery Management. We now show two examples depicting the importance of picking the right batteries at the right time towards maintaining the power caps, and show how the enforcer in vpower chooses a better option than always opting to first use the rack battery or the server batteries.

Rack (Shared) Battery First being the better option: We run an experiment with a rack of two servers running MapReduce and MediaServer respectively, and the aggregate rack level power draw is shown in Figure 7 (a). Both applications specify their power needs using P9-6, and we set both a server-level power cap of 4W and a rack level power cap of 34W (we proportionally reduce the amount of power that can be drawn from the rack level battery, which was originally provisioned for 8 servers). As per the heuristics described earlier in Section 4.2.3, when the enforcer anticipates local power violations, it first discharges the rack (shared) battery for rack power violations and uses server level (local) batteries for such violations only when the shared battery reaches its lower threshold. This leaves more charge in local batteries for later use. Figures 7 (e) and (f) show, respectively, the amount of power that is capped for each application by throttling and the state of charge (SoC) of batteries (at the two servers and at the rack) with vpower. On the other hand, a scheme (local battery first) that discharges local batteries greedily and goes to the rack battery only when the former run out of charge will not be able to handle the higher load that may come later and exceed the server level power cap. This can be seen in the throttling power and SoC graphs in Figures 7 (c) and (d) respectively. Using up the battery of MediaServer's server in the first seconds to handle rack level violations leads to its inability to handle its own power violations later on, resulting in a 9% performance penalty (Figure 7 (b)).
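The choice between the two discharge orders can be sketched as follows. This is a minimal illustration, assuming the enforcer has per-server estimates of future demand and a flag for anticipated local violations; the function name, the input dictionaries, and the wattage values are hypothetical and not taken from the vpower code.

```python
from typing import Dict, List


def discharge_order(anticipated_demand_w: Dict[str, float],
                    local_violation_expected: Dict[str, bool],
                    rack: str = "rack") -> List[str]:
    """Order in which batteries are tapped for a rack level violation."""
    # Among server batteries, prefer servers with low anticipated demand.
    servers = sorted(anticipated_demand_w, key=lambda s: anticipated_demand_w[s])
    if any(local_violation_expected.get(s, False) for s in servers):
        # Local violations anticipated: spend the shared rack battery
        # first, keeping local charge for the servers' own peaks.
        return [rack] + servers
    # No local violations anticipated: drain local batteries first and
    # keep the larger rack battery for bigger violations.
    return servers + [rack]


# Hypothetical inputs loosely mirroring the two experiments in this section.
rack_first = discharge_order({"mapreduce": 200.0, "mediaserver": 220.0},
                             {"mediaserver": True})
local_first = discharge_order({"memcached": 250.0, "gpu": 280.0,
                               "virusscan": 120.0}, {})
```

A fuller version would also fall back from the rack battery to local batteries once the rack battery reaches its lower threshold, as the text describes.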
Note that we intentionally leave a residual capacity of % of the usable battery capacity (which already excludes residual capacity for availability guarantees) due to battery lifetime issues [39]. Interestingly, MapReduce is throttled by vpower at around time s, despite having capacity in its local battery. This is because it is in debt, and its local battery is being conserved for possible subsequent rack level violations, in the interest of fairness that is covered later in this section.

[Figure 7: Shared Battery First is Better Option (MediaServer+MapReduce). (a) Aggregate Rack Power (total power vs. rack power cap); (b) Performance Degradation (%) for Local-First and vpower; (c) Throttling (Local First); (d) SoC (Local First); (e) Throttling (vpower); (f) SoC (vpower). SoC panels show normalized energy for the MapReduce, MediaServer and rack batteries.]

[Figure 8: Local Battery First is Better Option (Memcached+GPU+VirusScan). (a) Aggregate Rack Power (total power vs. rack power cap); (b) Performance Degradation (%) for Shared-First and vpower; (c) Throttling (Shared First); (d) SoC (Shared First); (e) Throttling (vpower); (f) SoC (vpower). SoC panels show normalized energy for the VirusScan, Memcached, GPU and rack batteries.]

Server (Local) Battery First being the better option: On the other hand, Figure 8 (a) shows the aggregate power of three applications (Memcached, GPU and VirusScan), each running on its own server in the rack and specifying its power needs using P9-6. We intentionally reduce the runtime of VirusScan to s and show results for a duration of 45s. Only a rack level power cap is set, at 45W, and we proportionally reduce the amount of power that can be drawn from the rack battery as in the previous experiment. vpower anticipates no local power violations, and also anticipates an earlier completion of VirusScan with the P9-6 specification. Based on its heuristics, it uses local batteries first to handle rack level power violations in this case, and in fact prioritizes the draw from the battery of VirusScan's server before it finishes (Figure 8 (f)), removing any necessity for throttling (Figure 8 (e)).
However, obliviously using a shared-battery-first approach in this case mandates subsequently throttling Memcached and GPU, causing performance degradation of 15% and 8% in these applications, respectively.

Fairness. We finally evaluate vpower's ability to enforce isolation between the power draws of different applications for fairness. As noted earlier, there are pros and cons in the stringency of power demands made by an application. Asking for very stringent power may lessen its chances of being admitted, making it wait longer for power allocation. At the other end, even though grossly under-specifying the power may get an application admitted, vpower will ensure that such applications do not misbehave and misappropriate much more power at runtime at the expense of others. The accounting mechanism in vpower enforces a bound on the amount of credits that can be banked, as well as a bound on the credits that can be in debt. To illustrate these issues, we conduct an experiment with WebCrawling and MediaServer, each running on its own server. WebCrawling uses PMin=18W for its power specification (a significant under-statement), while MediaServer uses PAvg=18W. The rack power cap is set to 35W, with no individual server power caps. We show the benefits of vpower's accounting mechanism by comparing its execution with that of a scheme which uses DVFS and batteries as vpower does, but without the accounting mechanisms, as shown in Figure 9. Without accounting in place, the power needs of WebCrawling continue to be met through batteries (there is no throttling for it for the first 5 seconds), and both applications are penalized subsequently (1-13% performance degradation). MediaServer is being penalized in this case for WebCrawling's fault, making it unfair. On the other hand, with our accounting mechanism in place, WebCrawling's credits deplete rapidly, while MediaServer saves/accumulates its credits. Consequently, the penalization for WebCrawling (which is misbehaving) steps in a lot sooner, and MediaServer is not affected at all.
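The bounded credit scheme can be illustrated with a small sketch. The following Python code is a hypothetical model, not vpower's accounting manager: the class name, the per-interval update rule, and the numeric bounds are assumptions made only to show how a bank limit and a debt limit interact.

```python
from dataclasses import dataclass


@dataclass
class PowerAccount:
    allocated_w: float   # power the application asked for via palloc()
    max_bank: float      # upper bound on banked credits (W*s)
    max_debt: float      # upper bound on credits in debt (W*s)
    credit: float = 0.0  # current balance (W*s), negative when in debt

    def step(self, demand_w: float, dt_s: float = 1.0) -> float:
        """Grant power for one interval and update the credit balance."""
        # Energy the application may still overdraw before hitting the bound.
        headroom = self.credit + self.max_debt
        excess = max(0.0, demand_w - self.allocated_w)
        allowed_excess = min(excess, headroom / dt_s)
        granted = min(demand_w, self.allocated_w + allowed_excess)
        # Using less than the allocation banks credit, up to max_bank;
        # using more spends it, never below -max_debt.
        self.credit = min(self.max_bank,
                          self.credit + (self.allocated_w - granted) * dt_s)
        return granted


# Hypothetical: one application overdraws, the other stays under its request.
greedy = PowerAccount(allocated_w=180.0, max_bank=100.0, max_debt=50.0)
first = greedy.step(230.0)    # overdraw is allowed while debt headroom remains
second = greedy.step(230.0)   # debt bound reached: held to the allocation

saver = PowerAccount(allocated_w=180.0, max_bank=100.0, max_debt=50.0)
saver.step(100.0)
saver.step(100.0)             # banked credit saturates at max_bank
```

Under these assumptions, the overdrawing application is cut back to its allocation as soon as its debt bound is reached, while the other application's balance is unaffected, which mirrors the isolation behavior described above.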
[Figure 9: Insulation from Power-hungry Applications (MediaServer+WebCrawling). (a) Aggregate Power (total power vs. rack power cap); (b) Throttling (vpower w/o Account.); (c) vpower Account. (normalized credit of MediaServer and WebCrawling); (d) Throttling (vpower w/ Account.); (e) Performance Degradation (%) per application, with and without accounting.]

6. CONCLUDING REMARKS AND FUTURE WORK

With aggressive under-provisioning of the power infrastructure, this resource becomes as valuable as any other

computing resource that needs to be appropriately rationed and managed between competing applications. Virtualizing this infrastructure gives the illusion of a potentially larger and insulated infrastructure for each application, avoiding the need to expose its physical capacity and management to the applications. Until now, apportioning power has at best used modulation of other computing resources (demand-side knobs) - CPUs in particular through DVFS, scheduling, migration, etc. However, power distribution can also benefit tremendously from supply-side resources such as energy storage devices (batteries), at potentially multiple layers in the hierarchy. Hence, explicit virtualization of power hierarchies is essential, rather than just controlling power draws through demand-side knobs as has been the case until now. Towards this goal, this paper has presented the design and implementation of vpower, to create, allocate and manage virtual power hierarchies for co-existing datacenter applications.

vpower allows (but does not require) applications to explicitly request their power needs at different resolutions using a palloc() interface. We have shown that such information can considerably help system performance, while still shielding individual applications from explicitly managing this resource, similar to the malloc() analogy. Specifying needs as a high percentile (e.g., P9) rather than the exact maximum, and at fairly coarse time resolutions (a minute or coarser, at application phases such as Map/Reduce), seems to offer a good trade-off point. vpower uses these specifications in determining whether to admit applications and, if so, where to place them. We have shown that ignoring the battery capacities, and conservatively allocating for the potential peaks to avoid any power emergencies, results in over 3% reduction in system utilization. This also highly limits the scale-out capabilities of applications such as memcached. Consequently, an aggressive admission control policy that places correlated workloads together - to offer more opportunities for batteries to recharge - is a better policy.
Despite an aggressive admission control policy, vpower's enforcer is able to effectively manage the power violations. We have shown that in our experimental rack of 8 servers, vpower shows 1-8% better performance than a scheme that uses purely demand-side computing knobs, on a 5% under-provisioned power infrastructure. It also provides at least 5% higher throughput than the latter for a memcached scale-out workload in this aggressively under-provisioned system. We have also demonstrated that the choice of batteries to draw upon at runtime in a hierarchical 2-layer setting is very important. Based on a theoretical framework that identifies the conditions under which this decision making becomes important, we have incorporated heuristics into vpower for battery sourcing. We have shown that vpower does much better than greedily using up local batteries or shared batteries first. Finally, we have demonstrated the effectiveness of vpower in fairly treating applications, by insulating the power needs of one application from the misbehavior of another. Our contributions are applicable regardless of where batteries are placed, the choice of energy storage technologies, and battery capacities. In fact, more stringent battery capacities make it even more important to manage hierarchies intelligently.

There are several interesting directions for future work. We have only been able to prototype a small 2-level hierarchy on our available experimental testbed, and we would like to conduct similar evaluations on a larger and deeper hierarchy, and examine a more hierarchical/distributed implementation of vpower for its scalability. Although our software has the flexibility to modulate the draw/charge rates of batteries, and possibly use those to source only the portion of the power exceeding the cap, our current hardware platform does not have those facilities. We are looking to build such capabilities in the future. Having shown the utility of a palloc() interface, even at coarse resolutions, in managing the under-provisioned infrastructure, there is considerable scope for leveraging this interface.
First, as described in [3], many current datacenter applications undergo extensive pre-production profiling on different hardware platforms, making them well-suited for exploiting this interface (even as a command line argument). Second, long-running services (e.g., mail, search, etc.) that may have predictability in load patterns (e.g., time-of-day behavior) could also use prior history for these purposes. Finally, a challenging area for future work is targeting cloud datacenters with sporadic jobs dynamically arriving, where pre-profiling may be difficult. While the infrastructure could assume average/worst case scenarios for such applications, one could envision a new set of runtime abstractions for power. Analogous to data structures that are explicitly (by the program) or implicitly (by the runtime system) allocated/deleted in (virtual) memory, we propose to investigate such power abstractions for this aggressively under-provisioned resource, to make palloc() a more natural feature, just as malloc() is in today's programs/runtime-systems.

7. ACKNOWLEDGMENTS

This work was supported, in part, by NSF grants 81167, , 15618, 1135, , CAREER award , and a research award from Google.

8. REFERENCES

[1] F. Bellosa, A. Weissel, M. Waitz, and S. Kellner. Event-Driven Energy Accounting for Dynamic Thermal Management. In Workshop on Compilers and Operating Systems for Low Power (COLP), 3.
[2] R. Bianchini and R. Rajamony. Power and Energy Management for Server Systems. IEEE Computer, 37(11), 4.
[3] O. Bilgir, M. Martonosi, and Q. Wu. Exploring the Potential of CMP Core Count Management on Data Center Energy Savings. In Workshop on Energy Efficient Design, 11.
[4] Q. Cao, D. Kassa, N. Pham, Y. Sarwar, and T. Abdelzaher. Virtual Battery: An Energy Reservation Abstraction for Embedded Sensor Networks. In Proceedings of RTSS, 8.
[5] J. Chase, D. Anderson, P. Thakur, and A. Vahdat. Managing Energy and Server Resources in Hosting Centers. In Proceedings of SOSP, 1.
[6] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao.
Energy-aware Server Provisioning and Load Dispatching for Connection-intensive Internet Services. In Proceedings of NSDI, 8.
[7] J. Choi, S. Govindan, B. Urgaonkar, and A. Sivasubramaniam. Profiling, Prediction, and Capping of Power Consumption in Consolidated Environments. In Proceedings of MASCOTS, 8.
[8] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking Cloud

Serving Systems with YCSB. In Proceedings of SoCC, 1.
[9] G. Dhiman, G. Marchetti, and T. Rosing. vgreen: A System for Energy-Efficient Management of Virtual Machines. ACM TODAES, 16(1):6:1 6:7, 1.
[10] Facebook Rack-level UPS for Improved Efficiency. 11/4/7/.
[11] X. Fan, W.-D. Weber, and L. A. Barroso. Power Provisioning for a Warehouse-Sized Computer. In Proceedings of ISCA, 7.
[12] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of ASPLOS, 1.
[13] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy. Optimal Power Allocation in Server Farms. In Proceedings of SIGMETRICS, 9.
[14] Google Server-level UPS for Improved Efficiency. http: //news.cnet.com/831-11_ html.
[15] S. Govindan, A. Sivasubramaniam, and B. Urgaonkar. Benefits and Limitations of Tapping into Stored Energy For Datacenters. In Proceedings of ISCA, 11.
[16] S. Govindan, D. Wang, L. Y. Chen, A. Sivasubramaniam, and B. Urgaonkar. Towards Realizing a Low Cost and Highly Available Datacenter Power Infrastructure. In Workshop on HotPower, 11.
[17] S. Govindan, D. Wang, A. Sivasubramaniam, and B. Urgaonkar. Leveraging Stored Energy for Handling Power Emergencies in Aggressively Provisioned Datacenters. In Proceedings of ASPLOS, 1.
[18] J. Hamilton. Internet-scale Service Infrastructure Efficiency, ISCA Keynote, 9.
[19] C. Isci, G. Contreras, and M. Martonosi. Live, runtime phase monitoring and prediction on real systems with application to dynamic power management. In Proceedings of MICRO, 6.
[20] A. Kansal, F. Zhao, J. Liu, N. Kothari, and A. Bhattacharya. Virtual Machine Power Metering and Provisioning. In Proceedings of SOCC, 1.
[21] V. Kontorinis, L. Zhang, B. Aksanli, J. Sampson, H. Homayoun, E. Pettis, T. Rosing, and D. Tullsen. Managing Distributed UPS Energy for Effective Power Capping in Data Centers. In Proceedings of ISCA, 1.
[22] J. Leverich, M. Monchiero, V. Talwar, P. Ranganathan, and C. Kozyrakis.
Power Management of Datacenter Workloads Usng Per-Core Power Gatng. IEEE Computer Archtecture Letters, 8():48 51, 9. [3] J. Mars, L. Tang, and R. Hundt. Heterogenety n Homogeneous Warehouse-Scale Computers: A Performance Opportunty. IEEE Computer Archtecture Letters, 1():9 3, 11. [4] M. R. Marty and M. D. Hll. Vrtual Herarches to Support Server Consoldaton. In Proceedngs of ISCA, 7. [5] D. Mesner, B. T. Gold, and T. F. Wensch. PowerNap: Elmnatng Server Idle Power. In Proceedngs of ASPLOS, 9. [6] Memslap: Load Testng and Benchmarkng a Server, 1. [7] Mcrosoft Reveals ts Specalty Servers, Racks, Apr [8] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Makng Schedulng Cool: Temperature-Aware Workload Placement n Data Centers. In Proceedngs of USENIX, 5. [9] R. Nathuj and K. Schwan. VrtualPower: Coordnated Power Management n Vrtualzed Enterprse Systems. In Proceedngs of SOSP, 7. [3] S. Pelley, D. Mesner, P. Zandevakl, T. F. Wensch, and J. Underwood. Power routng: Dynamc power provsonng n the data center. In Proceedngs of ASPLOS, 1. [31] E. Pnhero, R. Banchn, E.Carrera, and T. Heath. Load Balancng and Unbalancng for Power and Performance n Cluster-Based Systems. In Workshop on COLP, 1. [3] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu. No Power Struggles: Coordnated Mult-level Power Management for the Data Center. In Proceedngs of ASPLOS, 8. [33] K. Shen, A. Shrraman, S. Dwarkadas, X. Zhang, and Z. Chen. Power Contaners: An OS Faclty for Fne-Graned Power and Energy Management on Multcore Servers. In Proceedngs of ASPLOS, 13. [34] J. Stoess, C. Klee, S. Domthera, and F. Bellosa. Transparent, power-aware mgraton n vrtualzed systems. In GI/ITG Fachgruppentreffen Betrebssysteme, number 7-3, pages 3 8, 7. [35] J. Stoess, C. Lang, and F. Bellosa. Energy management for hypervsor-based vrtual machnes. In Proceedngs of USENIX, 7. [36] A. Verma, G. Dasgupta, T. Kumar, N. Pradpta, and R. Kothar. Server Workload Analyss for Power Mnmzaton Usng Consoldaton. 
In Proceedngs of USENIX, 9. [37] C. A. Waldspurger and W. E. Wehl. Lottery schedulng: Flexble Proportonal-share Resource Management. In Proceedngs of OSDI, [38] D. Wang, C. Ren, and A. Svasubramanam. Vrtualzng Power Dstrbuton n the Datacenter. Techncal Report CSE-13-4, The Pennsylvana State Unversty, 13. [39] D. Wang, C. Ren, A. Svasubramanam, B. Urgaonkar, and H. Fathy. Energy Storage n Datacenters: What, Where, and How Much? In Proceedngs of SIGMETRICS, 1. [4] X. Wang and M. Chen. Cluster-level Feedback Power Control for Performance Optmzaton. In Proceedngs of HPCA, 8. [41] H. Zeng, C. S. Ells, A. R. Lebeck, and A. Vahdat. ECOSystem: Managng Energy as a Frst Class Operatng System Resource. In Proceedngs of ASPLOS,.