Stochastic Approximation Control of Power and Tardiness in a Three-Tier Web-Hosting Cluster

Julius C.B. Leite, Instituto de Computação, Universidade Federal Fluminense, Rio de Janeiro, Brasil, julius@ic.uff.br
Dara Kusic, Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, dmk64@pitt.edu
Luciano Bertini, Petrobras, Rio de Janeiro, Brasil
Daniel Mossé, Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, mosse@cs.pitt.edu

ABSTRACT

Large-scale web-hosting and data centers are increasingly challenged to reduce power consumption while maintaining a minimum quality of service. Dynamic voltage and frequency scaling provides one technique to curb power consumption by limiting the power supply and/or frequency of the CPU at the expense of lower execution speed. Model-based approaches often require tedious offline profiling, and generating an accurate model under all conditions may be infeasible. This paper develops a stochastic feedback control algorithm, and couples it with a method of stochastic optimization to minimize power consumption while maintaining tardiness in a three-tier system. Our approach assumes nothing about the system and the application, treating each as a black box. The scheme is effective under limited dynamic workload conditions that can alter the response times and power consumption to be approximated. With little overhead, the control scheme is able to maintain a specified quantile of tardiness under a desired threshold, while suppressing power consumption to within 1% of its theoretical minima.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Design Studies, modeling techniques, performance attributes

General Terms: Algorithms, Performance, Management, Reliability, Efficiency

Keywords: Power management, performance management, online control, feedback control, stochastic approximation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICAC'10, June 7-11, 2010, Washington, DC, USA. Copyright 2010 ACM 978-1-4503-0074-2/10/06...$10.00.

1. INTRODUCTION

Large-scale web-hosting and data centers are increasingly challenged to reduce power consumption and lower cooling costs, while still expected to maintain a baseline quality of service. In a web hosting system, a single request will typically process on several servers comprising a three-tier path for HTTP receipt and response, application logic for dynamic content generation, and backend data retrieval. The front-end performs low-latency functions such as load balancing and maintaining network presence of the hosting system, and thus will typically be on at all times, running at full capacity. One front-end tier can typically route requests to several sub-clusters of application and database tiers. It is the application and database tiers that dominate performance, thus they must be approached with careful performance control when seeking energy savings from the system.

Dynamic voltage scaling (DVS) is one technique to achieve energy savings by reducing the power supply to the CPU and throttling its operating frequency. Minimizing the power consumption for the entire hosting environment presents the problem of coordinating DVS control, for each stage in the execution path, to minimize the overall power consumption and maintain the end-to-end delay within a specified Quality of Service (QoS). The control problem is most often characterized by variability within processing times, dynamically changing workloads, and noise within instantaneous power measurements. Developing models of application behavior under all possible workload characteristics is tedious and often infeasible. By contrast, stochastic approximation methods require no knowledge of the underlying system, and the black-box approach can offer simplified solutions to complex control problems.
In this paper, we develop a coordinated technique for controlling tardiness and minimizing power consumption using DVS for two tiers in a three-tier execution environment for web hosting. We use the Robbins-Monro (RM) stochastic approximation method to estimate the tardiness quantile, where tardiness is defined as the ratio of the end-to-end response time achieved to a deadline [1, 2]. To the RM algorithm we couple a proportional-integral-derivative (PID) feedback controller to obtain the CPU frequency for a single tier that will maintain performance within the specified QoS. Next, this technique is integrated with the Kiefer-Wolfowitz (KW) method of stochastic approximation that explores CPU frequency for a second tier to guide the system to an operating point near its minimum power consumption [3]. We measure the performance of the system in quantiles of tardiness, as proposed in [4], to guarantee
that a certain percentage of requests will meet their deadline. Our approach avoids the development time and inaccuracies confronted by model-based control; in contrast, our scheme requires only measurements of end-to-end tardiness and total power consumption. We lastly show that the approach is effective under stochastic application behaviors and limited dynamic workload characteristics.

We evaluate our RM/PID approach against a simple heuristic scheme for managing performance, and then combine each with the KW approximation to determine the efficacy of integrating the techniques in the two tiers. We show, through simulation results, that the coordinated approximation scheme operates with a lower steady-state error than the heuristic scheme, can control performance within specified QoS, and can lower power consumption to within 1% of its theoretical minima, without any knowledge of the workload, application, and underlying system components.

The paper is organized as follows. Section 2 discusses related work on DVS and PID control in computing systems. Section 3 presents the system model, and Section 4 discusses the controller design. Section 5 presents simulation results, and Section 6 concludes the paper.

2. RELATED WORK

Reducing power consumption in server clusters has been a well-studied problem recently; for example, see [5-8]. One approach is to combine CPU-clock throttling and dynamic voltage scaling, for server-level energy control, with an on/off scheme, for cluster-level energy control, based on the incoming workload [9]. In most cases, the energy-saving scheme is combined with a control technique to maintain performance around a setpoint [8] or under a specified threshold [10, 11]. The combination of DVS and feedback control around a setpoint has been addressed in [12], using a model-based approach derived from queueing theory.
The server-level technique of DVS has been shown to reduce energy consumption by about 10%-30% over a machine running at its full capacity [5, 13, 14]; adding a cluster-level scheme to turn off unneeded servers such as in [15] can increase power savings to 40%-80% [16-18]. The power and performance framework developed in [13] uses dynamic voltage scaling on host machines to demonstrate a 10% savings in power consumption, with a small sacrifice in performance, while [14] shows 30% energy savings is possible in a multi-tier environment. The work in [14] is similar to ours, in that they control the end-to-end delays in a multi-tier path via DVS, however, without optimizing for power.

The use of DVS for controlling real-time task sets, using a model for continuous CPU frequencies, as assumed in our work, is developed in [19]. The continuous model of CPU frequency is achieved in [6] and [20], using a technique to dither in a time-sharing mode between two adjacent frequency settings. In [6], the authors apply the technique to distribute a power budget among the available system resources. The notion of power budgets, limiting the total power a computer cluster can consume, has been applied as a constraint to the resource control problem [5, 6, 21]. In [21], for a given cluster power budget, the resource allocation for individual nodes can be assigned to optimize for the convexity in various power-performance efficiency curves. Up to 20% savings for the overall system power is shown in [5], with a near-zero sacrifice in performance.

Control theory has been applied to the performance of computing systems [22, 23], including data centers [24, 25], and web hosting environments [14, 26, 27]. Feedback control to desired delay is shown in [26], and although the approach does not consider DVS or energy savings, it uses process reallocation to achieve service differentiation that is not considered in our work.

[Figure 1: A clustered model for a three-tier web hosting system. A load balancer (front end) routes requests to Level A machines $A_1, A_2, \ldots, A_m$ (2nd tier) and Level B machines $B_1, B_2, \ldots, B_n$ (3rd tier).]

3. SYSTEM MODEL

Figure 1 shows a typical web hosting system consisting of three tiers. The front-end, or presentation stage, handles HTTP receipt and response, load-balancing, and higher-level system configurations such as turning additional tiers on and off [18]. The front-end typically has short delays and small resource requirements per request, and thus one machine can direct requests to several application servers and database servers, which incur greater resource demands per request. Requests from the front-end are directed to machines in the second tier, or Level A in the diagram, in which application servers generate dynamic page content and query-language calls to the databases in the third tier, or Level B. Level A and Level B may each have clustered configurations in which a cluster in Level A has $A_m$ machines, $m \geq 1$, and its corresponding cluster in Level B has $B_n$ machines, $n \geq 1$. Each level may have several sub-clusters, in which we assume homogeneity and equal load-balancing within a sub-cluster, but heterogeneity of computing resources may exist between sub-clusters.

We assume that CPU frequency has a significant impact on performance, and can be tuned to achieve measurable power savings. We further assume that DVS can be performed at each level, such that all machines within a sub-cluster will be assigned the same CPU frequency, that there is a latency of about 1-3 milliseconds to apply DVS between adjacent states, as measured in [28], and that the brief period of about 5 microseconds during which a processor is unavailable while transitioning states is an orthogonal issue to the control scheme.

3.1 Modeling Assumptions

To assess the viability of the control scheme, we compose a system of equations to model the response times and power consumption of the system in Figure 1. We assume that the power consumption of the CPU is proportional to the cubic power of the operating frequency, as in [20] and [9], plus a static quantity that estimates the power consumption of components such as the fan, memory, and disk.
In an actual system, there will be other dynamic phenomena affecting power by other than CPU frequency, such as variable fan speeds and memory access rates, but we do not account for them. We assume that the CPU frequency is continuous, within a bounded range, which can be implemented by dithering in time between two adjacent processor states, as in [6] and [19]. We further assume that dynamic characteristics of the workload such as arrival rate and transaction mix cause changes in the system response times and CPU utilizations. We are concerned only with achieving quantiles of performance, such that for a given distribution of response times, a percentage of requests will be guaranteed to have completed before deadline. Initially, we assume that the CPU utilization, an indicator of workload, is 100%, but show that the controller performs well when the workload causes CPU utilizations other than 100%.

First, let us consider one machine in Level A and a corresponding machine in Level B, forming a single execution path. The measured end-to-end response time of the path can be expressed as the sum of the response times for each stage in the path, a measure of end-to-end delay as defined in [14]. Thus, for a single request that is submitted at time $t$, the response time $z(t)$ achieved by the path can be expressed by

$$r_a(t) = a_1 f_a^{-1}(t) + a_2 \tag{1}$$
$$r_b(t) = b_1 f_b^{-1}(t) + b_2 \tag{2}$$
$$z(t) = r_a(t) + r_b(t), \tag{3}$$

where $r_a$ and $r_b$ are response times, and $f_a$ and $f_b$ are the CPU frequencies of Level A and Level B, respectively. Constants $\{a_1, b_1\}$ represent the number of cycles in CPU for each request, and constants $\{a_2, b_2\}$ represent time spent in waiting for I/O, for example. The end-to-end response time of the two tiers is captured in $z(t)$. The total power $P(t)$ can be expressed by

$$P_a(t) = a_3 f_a^3(t) + a_4 \tag{4}$$
$$P_b(t) = b_3 f_b^3(t) + b_4 \tag{5}$$
$$P(t) = P_a(t) + P_b(t), \tag{6}$$

where $P_a$ and $P_b$ are the power consumed by Level A and Level B, respectively, $a_3$ and $b_3$ are coefficients of the CPU operating frequency, and $a_4$ and $b_4$ represent the static power consumption of other components in the system.

If we wish to obtain values of $f_a$ and $f_b$ that will satisfy our QoS, we assign $z(t) = t_{ref}$, where $t_{ref}$ is the desired deadline for requests returning from Levels A and B. From Equations (1)-(3), we can solve for $f_b$ in terms of $f_a$ and $t_{ref}$ as follows.

$$f_b(t) = \frac{b_1 f_a(t)}{(t_{ref} - a_2 - b_2) f_a(t) - a_1} \tag{7}$$

Substituting the values of $P_a$ and $P_b$ from Equations (4) and (5), respectively, and $f_b$ from Equation (7), $P(t)$ can then be expressed as a function of $f_a$.
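To make the substitution concrete, the following is a minimal Python sketch of Equations (1)-(7), using the constants that the paper lists in Figure 2 (frequencies in Hz). The function names and the brute-force scan over $f_a$ are illustrative assumptions, not part of the paper's controller; the scan only shows that power along the QoS constraint has a single minimum.

```python
# Constants as listed in Figure 2 of the paper (frequencies in Hz).
A1, A2, A3, A4 = 5e7, 0.01, 1e-28, 20.0   # Level A: cycles, s, W*s^3, W
B1, B2, B3, B4 = 2e8, 0.02, 2e-28, 25.0   # Level B: cycles, s, W*s^3, W
T_REF = 0.22                               # QoS deadline in seconds


def response_time(f_a, f_b):
    """End-to-end response time z = r_a + r_b, Equations (1)-(3)."""
    return (A1 / f_a + A2) + (B1 / f_b + B2)


def total_power(f_a, f_b):
    """Total power P = P_a + P_b, Equations (4)-(6)."""
    return (A3 * f_a ** 3 + A4) + (B3 * f_b ** 3 + B4)


def f_b_for_deadline(f_a, t_ref=T_REF):
    """Equation (7): the f_b that makes z equal to t_ref for a given f_a."""
    denom = (t_ref - A2 - B2) * f_a - A1
    if denom <= 0:
        raise ValueError("f_a is too low to meet the deadline at any f_b")
    return B1 * f_a / denom


def scan_minimum(lo_mhz=500, hi_mhz=2200):
    """Hypothetical helper: scan f_a in 1 MHz steps along z = t_ref.

    Returns (f_a, f_b, power) at the lowest total power found, skipping
    points where the required f_b exceeds the 2.2 GHz maximum.
    """
    best = None
    for mhz in range(lo_mhz, hi_mhz + 1):
        f_a = mhz * 1e6
        f_b = f_b_for_deadline(f_a)
        if f_b > hi_mhz * 1e6:
            continue
        power = total_power(f_a, f_b)
        if best is None or power < best[2]:
            best = (f_a, f_b, power)
    return best
```

With these constants the scan settles near $f_a \approx 1.15$ GHz and $f_b \approx 1.37$ GHz, which agrees with the optimal $f_b$ of about 1.3 GHz quoted in the experimental parameters.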
The convex function of the power-performance relationship, composed of values of $f_a$ and $f_b$ that satisfy our QoS constraint, $z(t) \leq t_{ref}$, will have a minimum power consumption that can be obtained by taking the first derivative of $P(t)$, setting it equal to zero, and solving for $f_a$, as expressed by the following

$$\frac{dP}{df_a} = 0 \tag{8}$$
$$f_a^{Min} = \frac{a_1 + \sqrt[4]{\dfrac{b_3 b_1^3 a_1}{a_3}}}{t_{ref} - a_2 - b_2} \tag{9}$$
$$\text{subject to: } (t_{ref} - a_2 - b_2) f_a(t) - a_1 > 0, \tag{10}$$

where the constraint in (10) indicates the second derivative will be positive to assure a minima. For CMOS circuits, the power consumption of the CPU is related to frequency raised to anywhere between the 2nd and 3rd power, thus a generalized form of Equation (9) can be derived when $P(t)$ relates to $f^n$, as in the footnote (footnote 1).

Footnote 1: A generalized form of the expression for the value of $f_a^{Min}$ given QoS deadline $t_{ref}$ and power as a function of the $n$th power of CPU frequency is
$$f_a^{Min} = \frac{a_1 + \sqrt[n+1]{\dfrac{b_3 b_1^n a_1}{a_3}}}{t_{ref} - a_2 - b_2}.$$

Equations (1)-(6) represent the simulated system, where the outputs of end-to-end response time $z(t)$ and total power consumption $P(t)$ are inputs to the controllers. The development in (8)-(10) is applied to obtain theoretical optimal values of $f_a^{Min}$ and $f_b^{Min}$ against which we can compare the performance of our control implementation to an optimal system configuration. We assign values for the constants $a_{1 \ldots 4}$ and $b_{1 \ldots 4}$ and the QoS threshold $t_{ref}$ as follows.

Param.      Value                 Unit
$a_1$       $5 \times 10^{7}$     cycles
$b_1$       $2 \times 10^{8}$     cycles
$a_2$       0.01                  sec.
$b_2$       0.02                  sec.
$a_3$       $10^{-28}$            W sec.$^3$
$b_3$       $2 \times 10^{-28}$   W sec.$^3$
$a_4$       20                    W
$b_4$       25                    W
$t_{ref}$   0.22                  sec.

Figure 2: Constants of the system model and QoS deadline.

Figure 3 plots the power consumption versus $f_a$ and $f_b$ satisfying several values of QoS $t_{ref}$. Note that only a certain range of values of $f_a$ and $f_b$ are able to satisfy QoS such that $z(t) = t_{ref}$, which will satisfy the inequality in (10). If $f_a$ were to violate (10), then the corresponding value of $f_b$ would be negative.

[Figure 3: Power consumption versus $f_a$ and $f_b$ for six target QoS. Total power (in Watts) is plotted against frequency $f_a$ in GHz and frequency $f_b$ in GHz for $t_{ref}$ = 0.14, 0.16, 0.18, 0.20, 0.22 and 0.24 sec.]

4. CONTROL SCHEME

Figure 4(a) shows the introduction of the controller into the system, where the end-to-end response time $z(t)$ and total power consumption $P(t)$ are the only inputs to the controller, which returns $f_a(t+1)$ and $f_b(t+1)$ to the system to be actuated before the next interval of measurements. We can expand the controller diagram in Figure 4(a) into its control components as shown in Figures 4(b) and 4(c) for Tier A and Tier B, respectively. The basic idea is to control the tardiness via $f_a$, guaranteeing that a specified quantile of response times will meet QoS deadline $t_{ref}$, and to minimize power $P$ by seeking an optimal $f_a$/$f_b$ combination, led by exploration in $f_b$, to discover the minima as in Figure 3. Controlling tardiness was proposed in [4] for known Pareto and Log-normal distributions, and generalized in [2] through the use of the Robbins-Monro method [1]. In this expansion of the work, we propose optimizing in $f_b$ to minimize total power using the Kiefer-Wolfowitz estimation algorithm [3].

[Figure 4: (a) System with control inputs: the controller receives total power $P(t) = P_a(t) + P_b(t)$ and response time $z(t) = r_a(t) + r_b(t)$ from the multi-tier cluster and returns $f_a(t+1)$ and $f_b(t+1)$. (b) Controller A: Robbins-Monro estimation of $x(t+1)$ from $z(t)$ and $t_{ref}$, followed by PID control on $err(t+1)$ to produce $f_a(t+1)$. (c) Controller B: Kiefer-Wolfowitz estimation of $f_b(t+1)$ from $P(t)$.]

4.1 Control of Tier A

Tardiness is the ratio of the end-to-end response time $z(t)$ achieved, to the target deadline $t_{ref}$. Instead of using a simple binary metric $\{0, 1\}$ that can only indicate whether or not a response has met its deadline, tardiness can tell us how close responses were to violating or meeting the deadline. The measure of tardiness is also a good indication of the load on the system; responses with a tardiness close to 1.0 (e.g. just meeting deadline) will indicate a more heavily loaded system than if the tardiness were closer to 0.0.

The Robbins-Monro method estimates the quantile of an unknown distribution [1]. We assume $M(x)$ is the expected value at $x$ of the system response, where $M$ is a monotone function of $x$. For each $x$ there corresponds a random variable $Y = Y(x)$ with a distribution function $\Pr[Y(x) \leq y] = H(y \mid x)$, such that

$$M(x) = \int_{-\infty}^{\infty} y \, dH(y \mid x). \tag{11}$$

Let $F(x)$ be an unknown distribution function, and

$$F(\Theta) = \alpha \; (0 < \alpha < 1), \quad F'(\Theta) > 0, \tag{12}$$

or, equivalently,

$$\Pr[X \leq \Theta] = \alpha, \tag{13}$$

where, as shown in Figures 4(a) and 4(b), $\alpha$ is the desired tardiness quantile. We let $z(t)$ be the independent random variable and the outcome of the experiment, the system response time, in our case, with distribution function $\Pr[z(t) \leq x] = F(x)$, and let $y(t)$ be defined as:

$$y(t) = \begin{cases} 1 & \text{if } z(t) \leq x(t) \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

Let $x(0)$ be an initial guess of $\Theta$, and let

$$x(t+1) = x(t) + a(t)\,(\alpha - y(t)), \tag{15}$$

where $a(t)$ is of type $1/t$ as explained in the footnote (footnote 2). It can then be proven that $M(x) = F(x)$ and that

$$\lim_{t \to \infty} x(t) = \Theta. \tag{16}$$

Footnote 2: For a sequence to be of type $1/t$, it must satisfy $0 < \sum_{1}^{\infty} a_t^2 = A < \infty$, and $\sum_{2}^{\infty} \frac{a_t}{a_1 + \ldots + a_{t-1}} = \infty$. In particular, $\frac{c'}{t} \leq a_t \leq \frac{c''}{t}$ satisfies these conditions, where $c'$ and $c''$ are positive constants.

The sequence $\{x_t\}$ can be proven to converge to $\Theta$ as the solution to $\Pr[X \leq \Theta] = \alpha$. For analysis of the convergence of $x$ to $\Theta$, the reader is referred to [1] and [2]. The parameter $a(t)$ can be set two ways: decreasing as $t$ goes to infinity, with some restrictions to guarantee convergence, as true to the original form of the Robbins-Monro method, or to a small fixed value $\varepsilon$. We use the latter in order to assure that the system can adapt to changing distributions arising from time-varying workloads. Figure 5 shows that the value $\varepsilon$ should be chosen carefully, as larger values will decrease convergence time, but at some tradeoff to the steady-state error, when $\varepsilon$ is fixed to values of 0.005 and 0.001, and the distribution to be estimated shifts at $t = 1000$.

[Figure 5: Effect of step size $\varepsilon$ in the Robbins-Monro algorithm on convergence time and steady-state error. The tardiness estimate $x(t+1)$ is plotted over $t$ = 0 to 2000 for step sizes $\varepsilon = 0.005$ and $\varepsilon = 0.001$.]

Next, as shown in Figure 4(b), we apply the updated tardiness estimate $x(t+1)$ obtained from the Robbins-Monro algorithm as the new setpoint to regulate $f_a$ via single-input, single-output (SISO) PID control, as shown in the following equations,

$$err(t+1) = x(t+1) - 1.0 \tag{17}$$
$$f_a(t+1) = f_a(t) + \left(K_p + d_a K_i + \frac{K_d}{d_a}\right) err(t+1) - \left(\frac{2K_d}{d_a} + K_p\right) err(t) + \frac{K_d}{d_a}\, err(t-1), \tag{18}$$

where the error is computed by subtracting 1.0, the desired tardiness, from the Robbins-Monro estimate $x(t+1)$. For a value of $\alpha = 0.95$ to indicate a target performance of having 95% of requests meet QoS deadline, it follows that, on average, $y(t)$ will also be equal to 0.95 (for 95% of responses, $y = 1$, and for 5% of responses, $y = 0$). Thus, the difference term in (15) will be equal to zero, $x(t+1) = x(t) = 1.0$, and the error term for the PID control will be zero, indicating no change to $f_a(t)$.

We use the form of PID control as developed in [4], where $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative terms, respectively, and the output of the controller is the new CPU frequency for the machine(s) in Tier A. The value for $K_p$ was obtained by measuring the change in the process variable (response time) over the change in the control variable (CPU frequency) from a step input. The value of $K_i$ is assigned a value of $2K_p$ and $K_d$ is assigned a value of $K_p/2$, which perform well in terms of rise time, overshoot, and steady-state error, as shown by Figure 6(a), operating without the Robbins-Monro algorithm, to control the tardiness of response times occurring without any noise added to the system model equations in (1) and (2).

[Figure 6: (a) Performance of the PID control, operating without the Robbins-Monro algorithm, to control tardiness of responses that occur with zero distribution. (b) Control of Tier A, with PID control and Robbins-Monro estimation ($\varepsilon = 0.005$), and a long-tailed response distribution. Each panel plots the response time $z(t)$ in ms against $t_{ref}$, and the Level A CPU frequency $f_a(t)$ in GHz against $f_a^{Min}$.]
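The two updates used by Controller A, the fixed-step Robbins-Monro recursion in (14)-(15) and the PID update in (18), can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' Matlab code: the samples fed to the estimator are taken to be tardiness values, the uniform test stream and all function names are hypothetical, and the gains in the usage note are arbitrary.

```python
import random


def rm_quantile_step(x, sample, alpha=0.95, eps=0.005):
    """One Robbins-Monro update, Equations (14)-(15), with a fixed step eps."""
    y = 1.0 if sample <= x else 0.0      # Equation (14)
    return x + eps * (alpha - y)         # Equation (15) with a(t) = eps


def estimate_quantile(samples, x0=1.0, alpha=0.95, eps=0.005):
    """Run the estimator over a stream and return the trajectory of x."""
    x, history = x0, []
    for sample in samples:
        x = rm_quantile_step(x, sample, alpha, eps)
        history.append(x)
    return history


def pid_step(f_prev, err_now, err_prev, err_prev2, kp, ki, kd, d_a):
    """PID update of Equation (18); returns the new frequency f_a(t+1)."""
    return (f_prev
            + (kp + d_a * ki + kd / d_a) * err_now
            - (2.0 * kd / d_a + kp) * err_prev
            + (kd / d_a) * err_prev2)


def uniform_stream(n, seed=0):
    """Hypothetical test input: n samples drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

For a uniform stream the 0.95 quantile is 0.95, so the tail of `estimate_quantile(uniform_stream(20000), x0=0.5)` should hover around that value. A fixed step keeps the estimate fluctuating slightly instead of converging exactly, which is the convergence-time versus steady-state-error tradeoff described for Figure 5.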
Introducing a long-tailed distribution to the system response, and then the Robbins-Monro algorithm to estimate the tardiness quantile, the controller continues to perform well, as shown by Figure 6(b), maintaining the tardiness quantile within less than 1% of its target of $\alpha = 0.95$, while maintaining $f_a$ within less than 3% of its theoretical optimal value $f_a^{Min}$.

4.2 Control of Tier B

The Level B controller implements the Kiefer-Wolfowitz method [3] of stochastic approximation to discover the global minimum of the convex function that characterizes the relationship between CPU frequency and total power consumption of Levels A and B, subject to the QoS constraint. The Kiefer-Wolfowitz algorithm is useful when the minimum cannot be computed directly because of unknown system parameters or noise within the measurements, but can only be estimated through observation of a variable. Let $Q(x)$ be a convex function which has an unknown minimum at $\Gamma$, where $Q(\cdot)$ is unknown, but observations can be made at any $x$. $Q$ is strictly decreasing for $x < \Gamma$, and strictly increasing for $x > \Gamma$. Let $H(P \mid x)$ be a family of distribution functions, and let

$$Q(x) = \int_{-\infty}^{\infty} P \, dH(P \mid x). \tag{19}$$

It can then be proven that $f_b(t)$ converges to $\Gamma$ as $t \to \infty$, where

$$f_b(t+1) = f_b(t) - a_t \, \frac{P_{f_b^+} - P_{f_b^-}}{c_t}, \tag{20}$$

and $P_{f_b^+}$ and $P_{f_b^-}$ are independently distributed random variables with distributions $H(P \mid f_b + c_t)$ and $H(P \mid f_b - c_t)$, respectively. Also, $\{a_t\}$ and $\{c_t\}$ are positive sequences such that

$$c_t \to 0, \quad \sum_{1}^{\infty} a_t = \infty, \quad \sum_{1}^{\infty} a_t c_t < \infty, \quad \sum_{1}^{\infty} a_t^2 c_t^{-2} < \infty. \tag{21}$$

For example, $a_t = \frac{1}{t}$ and $c_t = \frac{1}{t^{1/3}}$ would satisfy the constraints in (21). For a full discussion and proof of convergence, we refer readers to [3], and for criteria on choosing sequences $\{a_t\}$ and $\{c_t\}$, see [29]. For purposes of practical implementation, we assume a small constant value $\delta f_b$ to compare $P_{f_b^+}$ and $P_{f_b^-}$, rather than the diminishing sequence $\{c_t\}$, such that the expressions for the distribution $H(\cdot)$ become $H(P \mid f_b + \delta f_b)$ and $H(P \mid f_b - \delta f_b)$.
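A minimal Python sketch of the iteration in (20) with a constant probe $\delta$ in place of $c_t$ is given below. It is an illustration only: the quadratic "power curve", the gain schedule $a_t = a_0 / t$, the value of $a_0$ and all names are assumptions chosen to keep the example short, and the curve is not the paper's power model.

```python
import random


def kiefer_wolfowitz(measure, x0, delta, a0=0.25, iterations=2000):
    """Minimise a noisy convex function with the iteration of Equation (20).

    measure(x) returns one noisy observation at x; delta is the constant
    probe distance (the role played by delta f_b in the text); the gain
    a_t = a0 / t is an assumed schedule.
    """
    x = x0
    for t in range(1, iterations + 1):
        p_plus = measure(x + delta)    # observation after nudging upward
        p_minus = measure(x - delta)   # observation after nudging downward
        x = x - (a0 / t) * (p_plus - p_minus) / delta
    return x


def make_noisy_power(minimum=1.3, noise=0.01, seed=0):
    """Hypothetical convex power curve in W versus GHz with Gaussian noise."""
    rng = random.Random(seed)
    return lambda f: 45.0 + (f - minimum) ** 2 + rng.gauss(0.0, noise)
```

Starting from `x0=2.2` with `delta=0.05`, `kiefer_wolfowitz(make_noisy_power(), 2.2, 0.05)` should end close to the assumed minimum of 1.3. Because the measurement noise is divided by `delta`, a smaller probe makes each step noisier, which is one reason for averaging power measurements over a window before comparing them.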
The basic idea of the KW algorithm is to nudge the control variable $f_b$ by a small $\delta f_b$ in the positive and negative directions from its current value, and collect measurements of the observation variable (total power consumption $P$) such that iterations of (20) will converge $f_b$ toward the value of $\Gamma$ which results in the minimum power consumption for the system. To smooth for variations among the instantaneous power measurements, we take the average of a small number of cycles over a window of size $h\,d_a$, where $d_a$ is the duty cycle of Controller A, to reduce the controller's response to noise.

As both $\{a_t\}$ and $\{c_t\}$ decrease in $t$, the amount of disturbance introduced to the system by the KW algorithm will gradually become insignificant as $\{a_t\} \to 0$ and $\{c_t\} \to 0$. Thus, when workload shifts cause a change in distribution such that renewed exploration is necessary to minimize power consumption, we introduce a scheme to refresh $\{a_t\}$ and $\{c_t\}$ to their original values, and reinitialize the KW algorithm to seek the new minima.

4.3 Controller Interaction

The ratio of the duty cycles for Controller A and Controller B must be tuned such that Controller A (PID control of the tardiness) is given sufficient time to recover the response time near its setpoint to disturbances in $f_b$ injected by the Kiefer-Wolfowitz algorithm. If the ratio is too small, and Controller A does not have sufficient time to respond, Controller B will always prefer the lower setting for $f_b$ that results in a smaller power consumption. Hence, instantaneous power measurements for Controller B should be taken only after Controller A restores the response time to near its setpoint. Figure 7 shows the time, in iterations, for Controller A to restore the response time to 65% and 90% of its steady-state value, given no noise within the response times, for various fixed values of $\varepsilon$, the step size of the Robbins-Monro algorithm.

[Figure 7: Recovery time, in iterations, for Controller A to restore the response time $z(t)$ to 65% and 90% of its steady-state value for three values of $\varepsilon$ (0.001, 0.005 and 0.01), the step size of the Robbins-Monro algorithm. Iterations of the PID controller to recover are plotted against the disturbance in $f_b$ (MHz).]

Building on the insight gained from the experiments in Figure 7, it can be inferred that the duty cycle of Controller B can be determined by the amount of disturbance to which Controller A must respond. If the duty cycle of Controller B is fixed in relation to Controller A, it must be fixed to the time it takes for Controller A to respond to the maximum disturbance from Controller B, resulting in a longer convergence time (footnote 3).

Footnote 3: We experimented with the effect of fixing the duty cycle to a maximal period, long enough to allow Controller A to respond to a maximal disturbance in $f_b$ capped to 60 MHz, and found that the convergence time was about 15% slower than that of a controller that adapts the duty cycle to the varied disturbance in $f_b$.

If, as in Figure 8, the duty cycle of Controller B is fixed to some time smaller than this value, then instability can occur, when the duty cycle of Controller A $d_a$ is 4 time units, the duty cycle of Controller B is fixed to $d_b = 3 \cdot 60 d_a$ time units (e.g. nudge $f_b^+ = f_b + \delta f_b$, wait $60 d_a$, nudge $f_b^- = f_b - 2\delta f_b$, wait $60 d_a$, execute KW, wait $60 d_a$), and power measurements are collected for the last $h = 3 d_a$ intervals after Controller B actuates a change in $f_b$. Power measurements are averaged over the interval $[57 d_a, 60 d_a]$ after nudging $f_b$ to $f_b^+$ and $f_b^-$. Figure 8 shows that Controller B eventually seeks the lowest possible setting for $f_b$. This is caused by Controller A having insufficient time to respond to a disturbance.

Parameter                                  Value
Noise in response time $z(t)$              95% of requests have uniform ±1% noise, 5% have long-tail
Noise in instantaneous power $P(t)$        None
Initial value $f_a$                        1.1 GHz (lower than $f_a^{Min}$)
Initial value $f_b$                        1.6 GHz (higher than $f_b^{Min}$)
$\delta f_b$                               1 MHz
$\Delta f_b^{Max}$                         None
Duty cycle A, $d_a$                        4 time units
Duty cycle B, $d_b$                        Fixed at $3 \cdot 60 d_a$ time units
Size of smoothing window, $h$              $4 d_a$ time units

Figure 8: Experimental results when the duty cycle of Controller B is fixed. [Panels plot the response time $z(t)$ in ms against $t_{ref}$, the power $P(t)$ in Watts against $P^{Min}$, the Level A CPU frequency $f_a(t)$ in GHz against $f_a^{Min}$, and the Level B CPU frequency $f_b(t)$ in GHz against $f_b^{Min}$.]

If, however, Controller B's duty cycle can adapt according to the amount of disturbance, as determined by the plots shown in Figure 7, and the amount by which Controller B can change $f_b$ at a single iteration is capped at a maximum threshold, then Controller A will always be given sufficient time to respond, and convergence will occur earlier than if Controller B's duty cycle is fixed as per the maximum allowable change in $f_b$. Adaptation schemes to address controller interactions can be designed analytically, offline, as in [16], or experimentally as follows. Applying the data of recovery times collected in Figure 7, we assign a duty cycle $d_b$ for Controller B in accordance with the amount by which $f_b$ changes at each iteration. For an RM $\varepsilon = 0.005$ value, and opting for 90% recovery of the response time, the duty cycle of Controller B adapts as per the table in Figure 9. After the exploration period of KW, during which $f_b$ changes by a small delta value $\delta f_b = \pm 1$ MHz, Controller A is given $60 d_a$ iterations to restore the tardiness, a minimum waiting period which we refer to as $d_b^{Min}$. When Controller B updates the value of $f_b$ as in (20), the waiting period $\tau$ is assigned via Figure 9, such that $\Delta f_b$ for a single iteration of KW is capped at 60 MHz, and the waiting period is no longer than $\tau = 170 d_a$. Thus, as shown in Figure 10, the duty cycle for Controller B is $d_b = \tau + 2 d_b^{Min}$. Figure 9 shows the adaptation of $d_b$ to $\Delta f_b$ in time.

Controller B waiting period $\tau$ when $\varepsilon = 0.005$
If $\Delta f_b \leq 2$ MHz      $\tau = 60 d_a = d_b^{Min}$
If $\Delta f_b \leq 5$ MHz      $\tau = 110 d_a$
If $\Delta f_b \leq 10$ MHz     $\tau = 130 d_a$
If $\Delta f_b \leq 20$ MHz     $\tau = 150 d_a$
If $\Delta f_b \leq 40$ MHz     $\tau = 160 d_a$
If $\Delta f_b \leq 60$ MHz     $\tau = 170 d_a$
If $\Delta f_b > 60$ MHz        $\Delta f_b$ capped at 60 MHz and $\tau = 170 d_a$

Figure 9: Adaptation of the Controller B waiting period $\tau$ in $t$ per the changes in $f_b$ in $t$, which is derived from the data in the table of Figure 9.

[Figure 10: Control timing diagram. Controller B nudges $f_b$ to $f_b^+$ and $f_b^-$, waiting $d_b^{Min}$ after each nudge and averaging power over the last $h\,d_a$ time units of each wait, then executes KW and waits $\tau$, for a total period of $d_b = \tau + 2 d_b^{Min}$.]

5. RESULTS

Parameter                                              Value
Target tardiness quantile, $\alpha$                    0.95
Response time deadline, $t_{ref}$                      220 ms
Noise in response time $z(t)$                          95% of requests have uniform ±1% noise, 5% have long-tail
Noise in instantaneous power $P(t)$                    None
Initial value $f_a$                                    2.2 GHz ($f_a^{Max}$)
Constant value $f_b$                                   1.3 GHz (optimal value)
Duty cycle A, $d_a$                                    4 time units
CPU Utilization, A                                     1.0
CPU Utilization, B                                     1.0
Number of experimental runs                            10

Performance, Approximation Scheme
95th quantile of $z(t)$, last 50,000 samples           0.07% ± 0.01% below deadline $t_{ref}$
Average $z(t)$, last 50,000 samples                    217.93 ± 0.03 ms

Performance, Heuristic Scheme
95th quantile of $z(t)$, last 50,000 samples           3.88% ± 0.02% below deadline $t_{ref}$
Average $z(t)$, last 50,000 samples                    209.52 ± 0.05 ms

Figure 11: A performance comparison of a simple heuristic control scheme versus our approximation control scheme.

We first assess the quality of the RM approximation scheme in Level A, for controlling tardiness in $f_a$, against a simple heuristic scheme, developed as follows: Estimate the 95th quantile of tardiness by collecting four tardiness samples over the interval $d_a = 4$, compute the mean $\mu$, and add two estimated standard deviations $2\sigma$.
The simple heuristic algorithm increases the frequency of Tier A by 60 MHz if $\mu + 2\sigma > t_{ref}$. If $\mu + 2\sigma < 0.95\,t_{ref}$, the CPU frequency is decreased by 20 MHz. The heuristic algorithm leaves a $[0.95\,t_{ref}, t_{ref}]$ dead-zone in which no action is taken on the system. We assume that one sample measurement of response time is taken for each time instance, such that for a Controller A duty cycle $d_a = 4$ time units, 4 samples are collected. To test both control schemes, we omit the controller from Level B and hold $f_b$ constant at its optimal value as determined by a permutation of (9) to solve for $f_b^{Min}$. We take the last 50,000 samples of end-to-end response time from each experimental run to capture the steady-state performance of the control schemes. Figure 11 summarizes the results over 10 runs, showing that the heuristic scheme, in grey, performs well, but that the approximation scheme, in black, performs better in maintaining tardiness closer to the deadline $t_{ref}$. Figure 11 shows the results of one run from ten.

All results were obtained via simulations in Matlab 2008a and executed on a 3 GHz Intel Pentium 4 duo-core processor with 1 GB RAM. The workload is simulated such that one system response is generated every time instance as per the modeling equations (1)-(3), with 95% of response times having uniform ±1% noise, and 5% having a long-tail. The overhead of each control scheme is less than 1 millisecond.

To further make the case for choosing the approximation scheme over the heuristic scheme in Level A, we extend the comparison in Figure 11, combining each of the control schemes for Level A with the KW approximation scheme to minimize power in $f_b$. Now, there is a disturbance to the system in $f_b$ caused by Controller B. We assume that one sample measurement for response time and instantaneous power is taken for each time instance. We take the last 50,000 samples of response time and power consumption to capture the steady-state performance of the control schemes. The energy savings of each scheme are compared against an uncontrolled system, in which both tiers operate at their full capacity of 2.2 GHz.
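The heuristic baseline described at the start of this section can be sketched as a single update function. This is a hedged illustration: frequencies are held in MHz, the lower bound `f_min` is a hypothetical safeguard that the text does not mention, and the use of the sample standard deviation for $\sigma$ is an assumption.

```python
import statistics


def heuristic_step(f_a_mhz, response_times, t_ref, f_min=500, f_max=2200):
    """One update of the heuristic scheme for Tier A (frequencies in MHz).

    response_times holds the samples collected over one duty cycle of
    Controller A (four samples when d_a = 4), in the same unit as t_ref.
    """
    mu = statistics.mean(response_times)
    sigma = statistics.stdev(response_times)   # sample std; an assumption
    estimate = mu + 2.0 * sigma                # rough 95th-quantile estimate
    if estimate > t_ref:
        f_a_mhz += 60                          # too slow: raise by 60 MHz
    elif estimate < 0.95 * t_ref:
        f_a_mhz -= 20                          # comfortably early: lower by 20 MHz
    # Otherwise the estimate lies in the dead zone and nothing changes.
    return min(max(f_a_mhz, f_min), f_max)
```

For example, four samples around 230 ms against a 220 ms deadline raise a 1200 MHz setting to 1260 MHz, while four samples at exactly 215 ms fall in the dead zone and leave it unchanged.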
In Figure 12, it can be seen that the simple heuristic scheme, in grey, does not operate effectively with Controller B. The reason is that, in some instances, when KW nudges $f_b$ by the small interval $\delta = \pm 1$ MHz, the change in tardiness is insufficient for the heuristic control scheme to respond. Thus, Controller B changes $f_b$ according to (20), always opting for the lower setting of $f_b$ to decrease overall power consumption, although the new operating point diverges from the optimal setting. In response, Controller A raises $f_a$ to restore tardiness to an acceptable value. This trend continues until the system settles at a sub-optimal point (footnote 4) where $f_a = f_a^{Max}$. Thus, we conclude that the RM approximation scheme is preferable to the heuristic scheme in maintaining tardiness closer to its deadline, and in operating successfully with the power minimization scheme. Figure 12 shows the results of one unsuccessful case from the ten runs. The overhead of Controller A and Controller B, combined, is less than 1 millisecond.

Parameter                                              Value
Target tardiness quantile, $\alpha$                    0.95
Response time deadline, $t_{ref}$                      220 ms
Noise in response time $z(t)$                          95% of requests have uniform ±1% noise, 5% have long-tail
Noise in instantaneous power $P(t)$                    None
Initial value $f_a$                                    2.2 GHz ($f_a^{Max}$)
Initial value $f_b$                                    2.2 GHz ($f_b^{Max}$)
$\delta f_b$                                           1 MHz
$\Delta f_b^{Max}$                                     60 MHz
Duty cycle A, $d_a$                                    4 time units
Duty cycle B, $d_b$                                    Adaptive, $\tau + 2 d_b^{Min}$
Size of smoothing window, $h$                          $4 d_a$ time units
CPU Utilization, A                                     1.0
CPU Utilization, B                                     1.0
Number of experimental runs                            10

Performance, Approximation Scheme
95th quantile of $z(t)$, last 50,000 samples           0.06% ± 0.01% below deadline $t_{ref}$
Average $z(t)$, last 50,000 samples                    217.91 ± 0.04 ms
Average $P(t)$, last 50,000 samples                    0.07% ± 0.01% above theoretical min.
Average energy savings over all samples,
as compared to an uncontrolled system                  5.17% ± 0.01% below full capacity system

Performance, Heuristic Scheme
95th quantile of $z(t)$, last 50,000 samples           1.94% ± 0.33% below deadline $t_{ref}$
Average $z(t)$, last 50,000 samples                    214.03 ± 0.73 ms
Average $P(t)$, last 50,000 samples                    1.17% ± 0.01% above theoretical min.
Average energy savings over all samples,
as compared to an uncontrolled system                  3.68% ± 0.02% below full capacity system

Figure 12: Experimental results of the simple heuristic control scheme versus the approximation control scheme, with power minimization.
⁴ In the heuristic case, we have imposed a condition on Controller B to cease execution if the Controller A frequency is at its maximum at some time t > 0, anticipating that the system is trending toward instability.

5.1 Time-Varying Workload

We consider dynamic operating conditions, e.g., changes in workload intensity/mix and background processes, that alter response times and CPU utilizations. Hence, the minimum of the convex power consumption function will shift, and must be rediscovered. Figures 14-15 show the effects when a disturbance occurs at t = 250,000, given the parameters in Figure 13, after KW has converged near the minimum power consumption, and the CPU utilization changes to 1.1. We have reflected an increase in CPU utilization in (3) as a simple multiplier, e.g., z(t) = 1.1(r_a(t) + r_b(t)), to model the effects of changes in the workload intensity or mix. The workload is simulated such that one system response is generated at every time instant, adding uniformly distributed ±1% noise to 95% of response times, and long-tailed noise to 5% of response times. Controller A must then re-stabilize the system in order to maintain the tardiness quantile at α = 0.95. A change in CPU utilization, as a function of workload intensity, also changes the optimal theoretical values of the Controller A frequency, f_b, and P_Min, shown as the dotted reference lines.

Figure 14 illustrates that if Controller A is not given sufficient time to restore the system performance to its setpoint, Controller B will always achieve a lower power consumption by decreasing f_b, resulting in instability. The conclusion from Figure 14 is that the KW controller must wait until Controller A stabilizes tardiness before resuming execution. This is similar to the adaptive wait time τ for Controller B, in response to disturbances in f_b, except that in this case the disturbance is due to environmental factors, of which Controller B will have no knowledge without a signal from Controller A.
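The coordination requirement described above, where KW must pause until Controller A has stabilized tardiness, can be sketched with a generic Kiefer-Wolfowitz minimizer that accepts an external halt signal. This is an illustrative sketch, not the paper's update (20): the class name `GatedKW`, the gain sequences a_n = a/n and c_n = c/n^(1/3), the optional bounds, and the `reset` method are assumptions. The reset mirrors the restart of the KW sequences used in the scheme reported for Figure 15.

```python
import random


class GatedKW:
    """Kiefer-Wolfowitz minimizer that can be halted and reset externally."""

    def __init__(self, x0, a=1.0, c=1.0, x_min=None, x_max=None):
        self.x = x0
        self.a, self.c = a, c
        self.x_min, self.x_max = x_min, x_max
        self.n = 1  # iteration index that drives both gain sequences

    def reset(self):
        """Restart the gain sequences, e.g. after a workload disturbance."""
        self.n = 1

    def step(self, measure, halted=False):
        """Run one KW update; measure(x) returns a noisy cost at setting x.

        While halted (Controller A is still restoring tardiness in this
        sketch), the setting and the gain index are left unchanged.
        """
        if halted:
            return self.x
        a_n = self.a / self.n               # assumed step-size sequence
        c_n = self.c / self.n ** (1 / 3)    # assumed probe-width sequence
        # Central finite-difference estimate of the cost gradient.
        grad = (measure(self.x + c_n) - measure(self.x - c_n)) / (2 * c_n)
        self.x -= a_n * grad
        if self.x_min is not None:
            self.x = max(self.x_min, self.x)
        if self.x_max is not None:
            self.x = min(self.x_max, self.x)
        self.n += 1
        return self.x


if __name__ == "__main__":
    # Hypothetical convex cost with its minimum at x = 3 and small noise.
    rng = random.Random(0)
    cost = lambda x: (x - 3.0) ** 2 + rng.gauss(0.0, 0.01)
    kw = GatedKW(x0=0.0, a=0.5)
    for _ in range(2000):
        kw.step(cost)
    print(kw.x)
```

Because a_n shrinks as n grows, a minimizer that has already converged takes very small steps; without the reset it would follow a shifted minimum only slowly, which is the motivation for restarting the sequences once the halt is lifted.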
The results in Figure 15 are collected for a scheme in which Controller A halts Controller B when the mean tardiness since the last execution of Controller A deviates by more than 3% from its target. When Controller A recovers the mean tardiness to within 3% of the setpoint, it signals Controller B to resume, and also resets the sequences {a} and {c} to achieve faster convergence along the new power consumption curve. Figure 15 shows that the scheme is effective in maintaining the tardiness quantile at its target and optimizing power consumption to near its theoretical minimum P_Min. Our experiments indicate that for every 5% deviation in workload, for the current PID parameters, the performance controller is able to bring the tardiness quantile back under the specified QoS in about 140 iterations or fewer. Thus, if the controller were to execute every 200 milliseconds, then the workload may deviate by up to 5% every 28 seconds⁵ (140 iterations × 200 ms) and still expect good control performance, at the expense of halting the KW algorithm while performance is restored. Larger deviations in the workload may be handled by vary-on, vary-off control to adjust the size of the cluster, or by adaptively tuning the gain of the controller online.

Parameter
  Noise in response time z(t): 95% of requests have uniform ±1% noise, 5% have long-tail
  Noise in instantaneous power P(t): None
  Initial value, Controller A frequency: 2.2 GHz (Max)
  Initial value f_b: 2.2 GHz (f_b Max)
  δf_b: 1 MHz
  Δf_b Max: 60 MHz
  Duty cycle A, d_a: 4d_a time units
  Duty cycle B, d_b: Adaptive, τ + 2 d_b Min
  Size of smoothing window, h: 4d_a
  CPU Utilization, A: 1.0 for t < 250,000; 1.1 from t = 250,000
  CPU Utilization, B: 1.0 for t < 250,000; 1.1 from t = 250,000
  Number of exp. runs: 10

Figure 13: Experimental parameters for the results in Figures 14-15.

6. CONCLUSION

We have developed a coordinated technique for controlling end-to-end performance and minimizing power consumption using DVS for the back-end tiers in a three-tier execution environment used for web hosting.
We have applied the Robbins-Monro method of stochastic approximation to estimate the tardiness quantile of an unknown distribution, and have coupled it with a proportional-integral-derivative (PID) feedback controller to obtain the CPU frequency for a single tier that will maintain performance within a specified QoS deadline. Further, this technique is integrated with the Kiefer-Wolfowitz method of stochastic approximation to estimate a CPU frequency for a second tier that will converge to a minimum power consumption for the entire execution path. We measure performance in tardiness quantiles to guarantee that a certain percentage of requests will always meet the deadline. We show that the control scheme maintains tardiness closer to the target than a simple heuristic scheme that also moderates performance within a narrow range. Further, the approximation control scheme integrates successfully with the stochastic optimization scheme to minimize power consumption. Forthcoming work will validate the implementation on an experimental system with realistic, time-varying workload traces.

⁵ The average absolute deviation in the World Cup 98 workload, often used for enterprise web testing, is about 4% between consecutive 30-second intervals.

Figure 14: Experimental results when CPU utilization changes at t = 250,000, and KW continues execution without waiting for Controller A to restore the system near its setpoint.

Performance
  95th quantile of z(t), last 50,000 samples: 0.03% ± 0.05% below deadline ref
  Average z(t), last 50,000 samples: 218.09 ± 0.07 ms
  Average P(t), last 50,000 samples: 0.09% ± 0.03% above theoretical min.
  Average energy savings over all samples, as compared to an uncontrolled system: 4.41% ± 0.35% below full capacity system

Figure 15: Experimental results when CPU utilization changes as a function of increased workload intensity at t = 250,000, and KW is halted by Controller A when the average tardiness since Controller A's last execution deviates from the setpoint by more than 3%.