A Cool Load Balancer for Parallel Applications




A Cool Load Balancer for Parallel Applications

Osman Sarood, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA, sarood@illinois.edu
Laxmikant V. Kale, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA, kale@illinois.edu

ABSTRACT
Meeting the power requirements of the huge exascale machines of the future will be a major challenge. Our focus in this paper is on minimizing cooling power, and we propose a technique that uses a combination of DVFS and temperature aware load balancing to constrain core temperatures as well as save cooling energy. Our scheme is specifically designed to suit parallel applications, which are typically tightly coupled. The temperature control comes at the cost of execution time, and we try to minimize the timing penalty. We experiment with three applications (with different power utilization profiles), run on a 128-core (32-node) cluster with a dedicated air conditioning unit. We evaluate the efficacy of our scheme based on three metrics: the ability to control average core temperatures and thereby avoid hot spot occurrence, minimization of the timing penalty, and cooling energy savings. Our results show cooling energy savings of up to 57%, with only a modest timing penalty in most cases.

1. INTRODUCTION
Cooling energy is a substantial part of the total energy spent by a High Performance Computing (HPC) machine room or a data center. According to some reports, this can be as high as 50% [9], [3], [17] of the total energy budget. It is deemed essential to keep the computer room adequately cold in order to prevent processor cores from overheating beyond their safe thresholds. For one thing, continuous operation at higher temperatures can permanently damage processor chips. Also, processor cores operating at higher temperatures consume more power while running identical computations at the same speed, due to the positive feedback loop between temperature and power [8]. Cooling is therefore needed to dissipate the energy consumed by a processor chip, and thus to prevent overheating of the cores. This consumed energy has a static and a dynamic component. The dynamic component increases as the cube of the frequency at which the core is run. Therefore, an alternative way of preventing overheating is to reduce the frequency. Modern processors and operating systems support such frequency control (e.g. DVFS). With this, it becomes possible to run a computer in a room with high ambient temperature, by simply reducing frequencies whenever the temperature goes above a threshold. However, this method of temperature control is problematic for HPC applications, which tend to be tightly coupled. If even one of the cores is slowed down, the entire application slows down by roughly the same fraction due to dependences between computations on different processors. This is further exacerbated when the dependences are global, such as when global reductions are used with high frequency. Since individual processors may overheat at different rates and at different points in time, and since physical aspects of the room and the air flow may create regions which tend to be hotter, the situation where only a small subset of processors is operating at a reduced frequency will be quite common.
For HPC applications, this method is therefore not suitable as is. The question we address in this paper is whether we can substantially reduce cooling energy without a significant timing penalty. Our approach involves a temperature-aware dynamic load balancing strategy. In some preliminary work presented at a workshop [16], we have shown the feasibility of the basic idea in the context of a single eight-core node. The contributions of this paper include the development of a scalable load-balancing strategy, demonstrated on a 128-core machine, in a controlled machine room, and with explicit power measurements. Via experimental data, we show that cooling energy can be reduced by up to 57%, with the timing penalty remaining small in most cases. We begin in Section 2 by introducing the frequency control method and documenting the timing penalty it imposes on HPC applications. In Section 3 we describe our temperature-aware load balancer. It leverages object-based overdecomposition and the load-balancing framework in the Charm++ runtime system. Section 4 outlines the experimental setup for our work. We then present (Section 5) performance data to show that, with our strategy, the temperatures are kept within the requisite limits while the timing penalties are small. Some interesting issues that arise in understanding how different applications react to temperature control are analyzed in Section 6. Section 7 undertakes a detailed analysis of the impact of our strategies on machine energy and cooling energy. Section 8 summarizes related work and sets our work in its context, and is followed by a summary in Section 9.

2. CONSTRAINING CORE TEMPERATURES
Unrestrained, core temperatures can soar very high. The most common way to deal with this in today's HPC centers is through the use of additional cooling arrangements. But as we have already mentioned, cooling itself accounts for around 50% [9, 3, 17] of the total energy consumption of a data center, and this can rise even higher with the formation of hot spots. To motivate the technique of this paper, we start with a study of the interaction of core temperatures in parallel applications with the cooling settings of their surroundings. We run Wave2D, a finite differencing application, for ten minutes on 128 cores in our testbed. We provide more details of the testbed in Section 4. The cooling knob in this experiment was controlled by setting the computer room air conditioning (CRAC) unit to different temperature set points. Figure 1 shows the average core temperatures and the maximum difference of any core from the average temperature corresponding to two different CRAC set points. As expected, the cooling settings have a pronounced effect on the core temperatures. For example, the average core temperatures corresponding to CRAC set point 23.3 C are almost 6 C less than those for CRAC set point 25.6 C. Figure 1 also shows the maximum difference between the average temperature and any individual core's temperature, and as can be seen, the difference worsens as we decrease external cooling (an increase in the set point from 23.3 C to 25.6 C). The result: hot spots!

[Figure 1: Average core temperatures, along with the max. difference of any core from the average, for Wave2D at CRAC set points 25.6 C and 23.3 C]

The issue of core overheating is not new, and DVFS is a widely accepted solution to cope with it. DVFS is a technique used to adjust the frequency and input voltage of a microprocessor, mainly in order to conserve the dynamic power consumed by the processor. A shortcoming of DVFS is that it comes with an execution time and machine energy penalty. To establish the severity of these penalties, we performed an experiment with 128 cores, running Wave2D for a fixed number of iterations. We used DVFS to keep core temperatures under 44 C by periodically checking core temperatures and reducing the frequency by one level whenever a core got hot. The experiment was repeated for five different CRAC set points. The results, in Figure 2, show the normalized execution time and machine energy. Normalization is done with respect to the run where all cores run at full frequency without DVFS.

[Figure 2: Normalized time and machine energy using DVFS for Wave2D at different CRAC set points]

The high timing penalty (seen in Figure 2), coupled with an increase in machine energy, makes it infeasible for the HPC community to use such a technique. Having established that DVFS on its own cannot efficiently control core temperatures without incurring unacceptably high timing penalties, we now propose our approach to ameliorate the deficiencies of using DVFS without load balancing.

3. TEMPERATURE AWARE LOAD BALANCING
In this section, we propose a novel technique based on task migration that efficiently controls core temperatures and simultaneously minimizes the timing penalty. In addition, it also ends up saving total energy. Although our technique should work well with any parallel programming language which allows object migration, we chose Charm++ for our tests and implementation because it allows simple and straightforward task migration. We introduce Charm++, followed by a description of our temperature aware load balancing technique.
3.1 Charm++
Charm++ is a parallel programming runtime system that works on the principle of processor virtualization. It provides a methodology where the programmer divides the program into small computations (objects or tasks) which are distributed amongst the P available processors by the runtime system [5]. Each of these small problems is a migratable C++ object that can reside on any processor. The runtime keeps track of the execution time of all these tasks and logs them in a database which is used by a load balancer. The aim of load balancing is to ensure an equal distribution of computation and communication load amongst the processors. Charm++ uses the load balancing database to keep track of how much work each task is doing. Based on this information, the load balancer in the runtime system determines if there is a load imbalance and, if so, migrates objects from an overloaded processor to an underloaded one [24]. The load balancing decision is based on the heuristic principle of persistence, according to which computation and communication loads tend to persist over time for a certain class of iterative applications. Charm++ load balancers have proved to be very successful with iterative applications such as NAMD [13].
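The small C++ sketch below illustrates the idea of such a measurement-based load database and the principle of persistence. It is purely illustrative; it is not the Charm++ API, and all names and numbers in it are made up.

// Illustrative stand-in for a measurement-based load database (not the Charm++ API).
#include <cstdio>
#include <vector>

struct TaskRecord {
    int    id;       // task (object) identifier
    int    core;     // core the task currently resides on
    double loadMs;   // measured wall-clock time spent in this task in the last interval
};

// Under the principle of persistence, last interval's measurements serve as the
// prediction of each task's load for the next interval.
double predictedCoreLoad(const std::vector<TaskRecord>& db, int core) {
    double sum = 0.0;
    for (const TaskRecord& t : db)
        if (t.core == core) sum += t.loadMs;
    return sum;
}

int main() {
    std::vector<TaskRecord> db = {{0, 0, 12.0}, {1, 0, 9.5}, {2, 1, 4.0}};
    std::printf("predicted load of core 0: %.1f ms\n", predictedCoreLoad(db, 0));
}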

3.2 Refinement based temperature aware load balancing
We now describe our refinement based temperature aware load balancing scheme, which combines DVFS with intelligent load balancing of tasks according to frequencies in order to minimize the execution time penalty. The general idea is to let each core work at the maximum possible frequency as long as it is within the maximum temperature threshold. Currently, we do DVFS on a per-chip instead of a per-core basis, as the hardware did not allow us to do otherwise. When we change the frequency of all the cores on a chip, the core input voltage also drops, resulting in power savings. This raises a question: what condition should trigger a change in frequency? In our earlier work [16], we did DVFS when any core on a chip crossed the temperature threshold. But our recent results show that basing the DVFS decision on the average temperature of the chip provides better temperature control. Another important decision is how much the frequency should be lowered in case a chip exceeds the maximum threshold. Modern processors come with a set of frequencies (frequency levels) at which they can operate. Our testbed has 10 different frequency levels from 1.2 GHz to 2.4 GHz (each step differs by 0.13 GHz). In our scheme, we change the frequency by only one level at each decision time. The pseudocode for our scheme is given in Algorithm 1, with descriptions of the variables and functions given in Table 1. The application specifies a maximum temperature threshold and a time interval at which the runtime periodically checks the temperature and determines whether any node has crossed that threshold. The variable k in Algorithm 1 refers to the interval number the application is currently in. Our algorithm starts with each node i computing the average temperature of all cores present on it, i.e. t_i^k. Once the average temperature has been computed, each node compares it against the maximum temperature threshold (T_max). If the average temperature is greater than T_max, all cores on that chip shift one frequency level down. However, if the average temperature is less than T_max, we increase the frequency level of all the cores on that chip (lines 2-6). Once the frequencies have been changed, we need to take into account the speed differential with which each core can execute instructions. We start by gathering the load information from the load balancing database for each core and task. In Charm++, this load information is maintained in milliseconds. Hence, in order to neutralize the frequency difference amongst the loads of each task and core, we convert the load times into clock ticks by multiplying the load for each task and core with the frequency at which it was running (lines 8-15). It is important to note that without this conversion, it would be incorrect to compare the loads, and hence load balancing would result in inefficient schedules. Even with this conversion, the calculations are not completely accurate, but they give much better estimates. We also compute the total number of ticks required for all the tasks (line 10) for calculating the weighted averages according to the new core frequencies. Once the ticks are calculated, we create a max heap, i.e. overHeap, for overloaded cores and a set, i.e. underSet, for underloaded cores (line 16). The categorization of overloaded and underloaded cores is done by the isHeavy and isLight procedures on lines 25-28. A core is overloaded if its currently assigned ticks are greater than what it should be assigned, i.e. a weighted average of totalTicks according to the core's new frequency (line 26).
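Written out explicitly, the overload and underload tests used by the isHeavy and isLight procedures (lines 25-28 of Algorithm 1 below) compare a core's assigned ticks against a frequency-weighted fair share:

isHeavy(i):  ticks_i > (1 + tolerance) * totalTicks * f_i / freqSum
isLight(i):  ticks_i < totalTicks * f_i / freqSum

As a purely hypothetical illustration (not measured data): with two cores at 2.4 GHz and 1.2 GHz, freqSum = 3.6 GHz, and totalTicks = 90 * 10^9, the fair shares are 60 * 10^9 and 30 * 10^9 ticks respectively; with a tolerance of, say, 0.3, the faster core is flagged as heavy only if it holds more than 78 * 10^9 ticks.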
Table 1: Description of variables used in Algorithm 1
  n          - number of tasks in the application
  p          - number of cores
  T_max      - maximum temperature allowed
  C_i        - set of cores on the same chip as core i
  e_i^k      - execution time of task i during step k (in ms)
  l_i^k      - time spent by core i executing tasks during step k (in ms)
  f_i^k      - frequency of core i during step k (in Hz)
  m_i^k      - core number assigned to task i during step k
  ticks_i^k  - number of ticks taken by the i-th task/core during step k
  t_i^k      - average temperature of node i at the start of step k (in C)
  S^k        - {e_1^k, e_2^k, e_3^k, ..., e_n^k}
  overHeap   - heap of overloaded cores
  underSet   - set of underloaded cores
  P^k        - {l_1^k, l_2^k, l_3^k, ..., l_p^k}
  totalTicks - total number of ticks over all tasks

Notice the (1 + tolerance) factor in the expression on line 26. We use it in order to do refinement only for cores that are overloaded by some considerable margin. We set it to 0.3 for all our experiments. This means that a core is considered overloaded if its currently assigned ticks exceed its weighted average ticks by a factor of 0.3. A similar check is in place for the isLight procedure (line 28), but we do not include the tolerance there as it does not matter. Once the max heap of overloaded cores and the set of underloaded cores are ready, we start the load balancing. We pop the max element (the core with the maximum number of assigned ticks) out of overHeap (referred to as the donor). Next, we call the procedure getBestCoreAndTask, which selects the best task to donate to the best underloaded core. The bestTask is the largest task currently assigned to the donor such that it does not overload a core from the underSet, and the bestCore is the one which will remain underloaded after being assigned the bestTask. After determining the bestTask and bestCore, we perform the migration by recording the task mapping (line 20) and updating the donor and bestCore with the number of ticks in bestTask (lines 21 and 22). We then call updateHeapAndSet (line 23), which rechecks whether the donor is still overloaded; if it is, we reinsert it into overHeap. It also checks whether the donor has become underloaded, so that it is added to the underSet in case it has ended up with too little load. This completes the migration of one task from an overloaded core to an underloaded one. We repeat this procedure until overHeap is empty. It is important to note that the value of tolerance can affect the overhead of our load balancing: if the value is too large, it might ignore load imbalance, whereas if it is too small, it can result in a lot of object migration overhead. We have noticed that any value in a small range around our chosen setting performs equally well.
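To make the refinement pass concrete, here is a compact, self-contained C++ sketch of it under simplified assumptions: loads have already been converted from milliseconds to ticks, cores are indexed by id, and the data and tolerance value are made up for illustration. It is a sketch of the idea, not the actual Charm++ implementation; the authoritative description is the pseudocode in Algorithm 1 and Table 1.

// Illustrative sketch of the refinement pass (not the actual Charm++ implementation).
#include <cstdio>
#include <queue>
#include <set>
#include <vector>

struct Core { int id; double freq; double ticks; };   // freq in GHz, ticks already frequency-scaled
struct Task { int id; int core; double ticks; };

int main() {
    const double tolerance = 0.3;                      // example value only
    std::vector<Core> cores = {{0, 1.2, 90.0}, {1, 2.4, 60.0}, {2, 2.4, 30.0}};
    std::vector<Task> tasks = {{0, 0, 50.0}, {1, 0, 40.0}, {2, 1, 60.0}, {3, 2, 30.0}};

    double totalTicks = 0.0, freqSum = 0.0;
    for (const Core& c : cores) { totalTicks += c.ticks; freqSum += c.freq; }

    // A core's fair share of ticks is proportional to its (new) frequency.
    auto fairShare = [&](const Core& c) { return totalTicks * c.freq / freqSum; };
    auto isHeavy   = [&](const Core& c) { return c.ticks > (1.0 + tolerance) * fairShare(c); };
    auto isLight   = [&](const Core& c) { return c.ticks < fairShare(c); };

    // Max-heap of overloaded cores (ordered by assigned ticks) and set of underloaded cores.
    auto lessLoaded = [&](int a, int b) { return cores[a].ticks < cores[b].ticks; };
    std::priority_queue<int, std::vector<int>, decltype(lessLoaded)> overHeap(lessLoaded);
    std::set<int> underSet;
    for (const Core& c : cores) {
        if (isHeavy(c)) overHeap.push(c.id);
        else if (isLight(c)) underSet.insert(c.id);
    }

    while (!overHeap.empty()) {
        int donor = overHeap.top(); overHeap.pop();

        // Pick the largest task on the donor that fits on some underloaded core
        // without pushing that core past its fair share.
        int bestTask = -1, bestCore = -1;
        for (const Task& t : tasks) {
            if (t.core != donor) continue;
            for (int u : underSet) {
                if (cores[u].ticks + t.ticks <= fairShare(cores[u]) &&
                    (bestTask < 0 || t.ticks > tasks[bestTask].ticks)) {
                    bestTask = t.id; bestCore = u;
                }
            }
        }
        if (bestTask < 0) continue;                    // nothing movable; give up on this donor

        // Record the migration and update both cores.
        tasks[bestTask].core   = bestCore;
        cores[donor].ticks    -= tasks[bestTask].ticks;
        cores[bestCore].ticks += tasks[bestTask].ticks;
        if (!isLight(cores[bestCore])) underSet.erase(bestCore);
        if (isHeavy(cores[donor])) overHeap.push(donor);       // still heavy: try again
        else if (isLight(cores[donor])) underSet.insert(donor);

        std::printf("migrate task %d: core %d -> core %d\n", bestTask, donor, bestCore);
    }
}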

Algorithm 1 Temperature Aware Refinement Load Balancing
 1: At node i at start of step k
 2: if t_i^k > T_max then
 3:     decreaseOneLevel(C_i)    // reduce by 0.13 GHz
 4: else
 5:     increaseOneLevel(C_i)    // increase by 0.13 GHz
 6: end if
 7: At Master core
 8: for i in S^k do
 9:     ticks_i^k = e_i^k * f_(m_i^k)^k
10:     totalTicks = totalTicks + ticks_i^k
11: end for
12: for i in P^k do
13:     ticks_i^k = l_i^k * f_i^k
14:     freqSum = freqSum + f_i^k
15: end for
16: createOverHeapAndUnderSet()
17: while overHeap is NOT NULL do
18:     donor = overHeap->deleteMaxHeap()
19:     (bestTask, bestCore) = getBestCoreAndTask(donor, underSet)
20:     m_bestTask^k = bestCore
21:     ticks_donor^k = ticks_donor^k - bestSize
22:     ticks_bestCore^k = ticks_bestCore^k + bestSize
23:     updateHeapAndSet()
24: end while
25: procedure isHeavy(i)
26:     return (ticks_i^k > (1 + tolerance) * (totalTicks * f_i^k) / freqSum)
27: procedure isLight(i)
28:     return (ticks_i^k < totalTicks * f_i^k / freqSum)

4. EXPERIMENTAL SETUP
The primary objective of this work is to constrain core temperatures and save the energy spent on cooling. Our scheme ensures that all cores stay below a user-defined maximum threshold. We want to emphasize that all results reported in this work are actual measurements and not simulations. We used a 160-core (40-node, single-socket) testbed equipped with a dedicated CRAC. Each node is a single-socket machine with an Intel Xeon X3430 chip. It is a quad-core chip supporting 10 different frequency levels ranging from 1.2 GHz to 2.4 GHz. We use 128 cores out of the 160 cores available for all the runs that we report. All the nodes run Ubuntu 10.04, and we use the cpufreq module in order to do DVFS. The nodes are interconnected using a 48-port gigabit ethernet switch. We use the Liebert power unit installed with the rack to get power readings for the machines. The CRAC in our testbed is an air cooler that uses centrally chilled water to cool the air. It manipulates the flow of chilled water to achieve the temperature set point prescribed by the operator. The exhaust air (T_hot), i.e. the hot air coming in from the machine room, is compared against the set point, and the flow of the chilled water is adjusted accordingly to cover the difference in temperatures. This model of cooling is favorable considering that the temperature control is responsive to the thermal load (as it tries to bring the exhaust air to the temperature set point) instead of the room inlet temperature [9]. The machines and the CRAC are located in the Computer Science department of the University of Illinois at Urbana-Champaign. We were fortunate enough not only to be able to use DVFS on all the available cores but also to change the CRAC set points. There is no straightforward way of measuring the exact power draw of the CRAC, as it uses chilled water to cool the air, and the water in turn is cooled centrally for the whole building. This made it impossible for us to use a power meter. That is not unusual, as most data centers use similar cooling designs. Instead of using a power meter, we installed temperature sensors at the outlet and inlet of the CRAC. These sensors measure the temperature of the air coming from and going out to the machine room. The heat dissipated into the air is affected by core temperatures, and the CRAC has to cool this air to maintain a constant room temperature. The power consumed by the CRAC (P_ac) to bring the temperature of the exhaust air (T_hot) down to that of the cool inlet air (T_ac) is [9]:

P_ac = c_air * f_ac * (T_hot - T_ac)    (1)

where c_air is the heat capacity constant and f_ac is the constant flow rate of the cooling system. Although we are not using a power meter, our results are very accurate because there is no interference from other heat sources, as there would be in larger data centers where jobs from other users running on nearby nodes might dissipate a lot of heat and distort the cooling energy estimation for our experiments. To the best of our knowledge, this is the largest testbed on which any HPC researcher has reported results with DVFS. Also, we are not aware of any other work on constraining core temperatures and showing its benefit in cooling energy savings, in contrast to most of the earlier work, which emphasized savings in machine power consumption using DVFS. Most importantly, our work is unique in using load balancing to mitigate the effects of transient speed variations in the HPC world.
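Since c_air and f_ac stay fixed in our setup, dividing Equation (1) for any two runs cancels both constants, so the CRAC power of two runs can be compared purely through their temperature gaps:

P_ac^A / P_ac^B = (T_hot^A - T_ac^A) / (T_hot^B - T_ac^B)

This observation is what the normalized cooling energy metric in Section 7.2 builds on.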
We demonstrate the effectiveness of our scheme using three applications with different utilization and power profiles. The first is a canonical benchmark, Jacobi2D, which uses a 5-point stencil to average values over a 2D grid using a 2D decomposition. The second application, Wave2D, uses a finite differencing scheme to calculate pressure information over a discretized 2D grid. The third application, Mol3D, is from molecular dynamics and is a real-world application for simulating large biomolecular systems. For Jacobi2D and Wave2D we chose large square 2D grids, and for Mol3D we ran a system containing 92,224 atoms. We did an initial run of these applications without DVFS, with the CRAC working at 13.9 C, and noted the maximum average core temperature reached across all 128 cores. We then used our temperature aware load balancer to keep the core temperatures at 44 C, which was the maximum average temperature reached in the case of Jacobi2D (this was the lowest peak average temperature amongst the three applications). While keeping the threshold fixed at 44 C, we decreased the cooling by increasing the CRAC set point. In order to gauge the effectiveness of our scheme, we compared it with a scheme in which DVFS is used to constrain core temperatures without any load balancing (we refer to it as the w/o scheme throughout the paper).

5. TEMPERATURE CONTROL AND TIMING PENALTY
Temperature control is important for cooling energy considerations since it determines the heat dissipated into the air, which the CRAC is responsible for removing. In addition, core temperatures and the power consumption of a machine are related through a positive feedback loop, so that an increase in either causes an increase in the other [8]. Our earlier work [16] shows evidence of this: we ran Jacobi2D on a single node with 8 cores and measured the machine power consumption along with core temperatures. The results showed that an increase in core temperature can cause an increase of up to 9% in machine power consumption, and this figure can be huge for large data centers.

For our testbed in this work, Figure 3 shows the average temperature of all 128 cores over a period of ten minutes using our temperature aware load balancing. The CRAC was set to 21.1 C for these experiments. The horizontal line is drawn as a reference to show the maximum temperature threshold (44 C) used by our load balancer. As we can see, irrespective of how large the temperature gradient is, our scheme is able to restrain the average core temperature to within a small margin of the threshold. For example, core temperatures for Mol3D and Wave2D reach the threshold, i.e. 44 C, much sooner than for Jacobi2D, but all three applications stay very close to 44 C after reaching the threshold.

[Figure 3: Average core temperature for Wave2D, Jacobi2D and Mol3D with the CRAC set point at 21.1 C]

Temperature variation across nodes is another very important factor. Spatial temperature variation is known to cause hot spots, which can drastically increase the cooling costs of a data center. To get some insight into hot spot formation, we performed experiments on our testbed with different CRAC set points. Each experiment was run for ten minutes. Figure 4 shows the maximum difference of any core from the average core temperature for Wave2D when run with different CRAC set points. The "Without DVFS" run refers to all cores working at full frequency with no temperature control at the core level. It was observed that in the case of the "Without DVFS" run, the maximum difference is due to one specific node getting hot and staying hot throughout the execution, i.e. a hot spot. On the other hand, with our scheme, no single core is allowed to get much hotter than the maximum threshold.

[Figure 4: Max difference in core temperatures for Wave2D, with and without DVFS, at CRAC set points 25.6 C and 23.3 C]

Currently, for all our experiments, we do temperature measurement and DVFS after every 6-8 seconds. More frequent DVFS would result in a larger execution time penalty, since there is some overhead in doing task migration to balance the loads. We will return to these overheads later in this section.

The above experimental results showed the efficacy of our scheme in terms of limiting core temperatures. However, as shown in Section 2, this comes at the cost of execution time. We now use the execution time penalty as a metric to establish the superiority of our temperature aware load balancer in comparison to using DVFS without any load balancing. For this, we study the normalized execution times, t_norm, with and without our temperature aware load balancer, for all three applications under consideration. We define t_norm as follows:

t_norm = t_LB / t_base    (2)

where t_LB represents the execution time of the temperature aware load balanced run and t_base is the execution time without DVFS, so that all cores work at maximum frequency. The value of t_norm for the w/o run is calculated in a similar manner, except that we use t_NoLB instead of t_LB. We experiment with different CRAC set points. All the experiments were performed by actually changing the CRAC set point and allowing the room temperature to stabilize before any measurements were taken. To minimize errors, we averaged the execution times over three similar runs. Each run takes longer than ten minutes to allow a fair comparison between applications. The results of this experiment are summarized in Figure 5. The results show that our scheme consistently performs better than the w/o scheme, as manifested by the smaller timing penalties for all CRAC set points. As we reduce the cooling (i.e. increase the CRAC set point), we see degradation in the execution times, i.e. an increase in the timing penalty. This is not unexpected and is a direct consequence of the fact that the cores heat up sooner and scale down to lower frequencies, thus taking longer to complete the same run.

[Figure 5: Normalized execution time with and without Temperature Aware Load Balancing: (a) Jacobi2D, (b) Wave2D, (c) Mol3D]

It is interesting to observe from Figure 5 that the difference between our scheme and the w/o scheme is small to start with but grows as we increase the CRAC set point. This is because when the machine room is cooler, the cores take longer to heat up in the first place. As a result, even the cores falling in the hot spot area do not become so hot that they drop to a very low frequency (we decrease the frequency in steps of 0.13 GHz). But as we keep decreasing the cooling, the hot spots become more and more visible, so much so that when the CRAC set point is 25.6 C, one node (the hot spot in our testbed) runs at the minimum possible frequency almost throughout the experiment. Our scheme does not suffer from this problem since it intelligently assigns loads by taking core frequencies into account. But without our load balancer, the execution time increases greatly (refer to Figure 5 for CRAC set point 25.6 C). This happens because, in the absence of load balancing, the execution time is determined by the slowest core, i.e. the core with the minimum frequency.

For a more detailed look at our scheme's sequence of actions, we use Projections [6], a performance analysis tool from the Charm++ infrastructure. Projections provides visual representations of multiple kinds of performance data, including processor timelines showing their utilization. We carried out an experiment on 16 cores instead of 128 and used Projections to highlight the salient features of our scheme. We worked with a smaller number of cores since it would have been difficult to visually understand a 128-core timeline. Figure 6 shows the timelines and corresponding utilization for all 16 cores throughout the execution of Wave2D. Both runs in the figure had DVFS enabled. The upper run, i.e. the top 16 lines, is the one where Wave2D is executed without temperature aware load balancing, whereas the lower part, i.e. the bottom 16 lines, repeats the same execution with our temperature aware load balancing. The length of the timeline indicates the total time taken by an experiment. The green and pink colors show the computations, whereas the white portions represent idle time. Notice that the execution time with temperature aware load balancing is much less than that without it.

[Figure 6: Projections timeline with and without Temperature Aware Load Balancing for Wave2D]

To see how the processors spend their time, we zoomed into the boxed part of Figure 6 and reproduced it in Figure 7. It represents a few iterations of Wave2D. This zoomed part belongs to the run without temperature aware load balancing. We can see that, because of DVFS, the first four cores work at a lower frequency than the remaining cores. They therefore take longer to complete their tasks compared to the remaining cores (longer pink and green portions on the first 4 cores). The remaining cores finish their work quickly and then keep waiting for the first 4 cores to complete their tasks (depicted by white spaces towards the end of each iteration). These results clearly suggest that the timing penalty is dictated by the slowest cores. We substantiate this further with Figure 8, which shows the minimum frequency of any core during a w/o run (CRAC set point at 23.3 C). We can see from Figure 5 that Wave2D and Mol3D have higher penalties compared to Jacobi2D. This is because the minimum frequency reached in these applications is lower than that reached in Jacobi2D. We now discuss the overhead associated with our temperature aware load balancing.

[Figure 7: Zoomed Projections timeline for a few iterations of Wave2D]
[Figure 8: Minimum frequency of any core for all three applications]
As outlined in Algorithm 1, our scheme has to measure core temperatures, do DVFS, decide new assignments, and then exchange tasks according to the new schedule. The major overhead in our scheme comes from the last item, i.e. the exchange of tasks. In comparison, the temperature measurements, DVFS, and load balancing decisions take negligible time.

[Figure 9: Percentage of objects migrated during the temperature aware load balancing run]

To calibrate the communication load we incur on the system, we run an experiment with each of the three applications for ten minutes and count the number of tasks migrated at each step at which we check core temperatures. Figure 9 shows these percentages for all three applications. As we can see, the numbers are too small to make any significant difference. The small overhead of our scheme is also highlighted by its superiority over the w/o scheme, which does temperature control through DVFS but no load balancing (and hence no object migration). One important observation to be made from this figure is the larger number of migrations in Wave2D compared to the other two applications. This is because it has a higher CPU utilization. Wave2D also consumes and dissipates more power than the other two applications and hence has more transitions in its frequency. We explain and verify these application-specific differences in power consumption in the next section.

6. UNDERSTANDING APPLICATION REACTION TO TEMPERATURE CONTROL
One of the reasons we chose to work with three different applications was to be able to understand how application-specific characteristics react to temperature control. In this section, we highlight some of our findings and try to provide a comprehensive and logical explanation for them.

[Figure 10: Average frequency for all three applications with the CRAC at 23.3 C]

We start by referring back to Figure 5, which shows that Wave2D suffers the highest timing penalty, followed by Mol3D and Jacobi2D. Our intuition was that this difference could be explained by the frequencies at which each application runs, along with their CPU utilizations (see Table 2). Figure 10 shows the average frequency across all 128 cores during the execution of each application. We were surprised by Figure 10 because it showed that both Wave2D and Mol3D run at almost the same average frequency throughout the execution, and yet Wave2D ends up with a much higher penalty than Mol3D. Upon investigation, we found that Mol3D is less sensitive to frequency than Wave2D. To further gauge the sensitivity of our applications to frequency, we ran a set of experiments in which each application was run at all available frequency levels. Figure 11 shows the results, where execution times are normalized with respect to a base run where all 128 cores run at the maximum frequency, i.e. 2.4 GHz. We can see from Figure 11 that Wave2D has the steepest curve, indicating its sensitivity to frequency. On the other hand, Mol3D is the least sensitive to frequency, as shown by its small slope. This gave us one explanation for the higher timing penalties for Wave2D compared to the other two. However, if we follow this line of reasoning only, then Jacobi2D is more sensitive to frequency (as shown by Figure 11) and has a higher utilization (Table 2), and should therefore have a higher timing penalty than Mol3D. But Figure 5 suggests otherwise. Moreover, the average power consumption of Jacobi2D is also higher than that of Mol3D (see Table 2), which should imply cores getting hotter sooner while running Jacobi2D than with Mol3D and shifting to lower frequency levels. On the contrary, Figure 10 shows Jacobi2D running at a much higher frequency than Mol3D. These counter-intuitive results could only be explained in terms of CPU power consumption, which is higher in the case of Mol3D than for Jacobi2D. To summarize, these results suggest that although the total power consumption of the entire machine is smaller for Mol3D, the proportion consumed by the CPU is higher than for Jacobi2D.
For some mathematical backing for our claims, we look at the following expression for core temperatures [9]:

T_cpu = α * T_ac + β * P + γ    (3)

Here T_cpu is the core temperature, T_ac is the temperature of the air coming from the cooling unit, P is the power consumed by the chip, and α, β and γ are constants which depend on heat capacity and air flow, since our CRAC maintains a constant air flow. This expression shows that core temperatures depend on the power consumption of the chip rather than of the whole machine, and therefore it is possible that the cores get hotter earlier for Mol3D than for Jacobi2D due to higher CPU power consumption. So far, we have provided some logical and mathematical explanations for our counter-intuitive results. But we wanted to explore them thoroughly and find more cogent evidence for our claims. As a final step towards this verification, we ran all three applications on 128 cores using PerfSuite [7] and collected information about different performance counters, summarized in Table 2. We can see that Mol3D incurs fewer cache misses and has several times more traffic between the L1 and L2 caches (counter type "Traffic L1-L2"), resulting in higher MFLOP/s than Jacobi2D. The difference between the total power consumption of Jacobi2D and Mol3D can now be explained in terms of the greater DRAM traffic of Jacobi2D.

Table 2: Performance counters for one core (values as reported for Jacobi2D, Wave2D and Mol3D)
  Execution Time (secs): 474, 473, 469
  MFLOP/s: 4, 9
  Traffic L1-L2 (MB/s): 99, 3,44
  Traffic L2-DRAM (MB/s): 39, 97, 77
  Cache misses to DRAM (billions): 4.7, 4.
  CPU Utilization (%): 87, 83, 93
  Power (W): 47, 33, 8
  Memory Footprint (% of memory): 8., .4, 8.

We sum up our analysis by remarking that MFLOP/s seems to be the most viable deciding factor in determining the timing penalty that an application will have to bear when the cooling in the machine room is lowered. Figure 12 substantiates our claim: it shows that Wave2D, which has the highest MFLOP/s (Table 2), suffers the most penalty, followed by Mol3D and Jacobi2D.

[Figure 11: Normalized execution time at different frequency levels]
[Figure 12: Timing penalty for different CRAC set points]

7. ENERGY SAVINGS
This section is dedicated to a performance analysis of our temperature aware load balancing in terms of energy consumption. We first look at machine energy and cooling energy separately, and then combine them to look at the total energy.

7.1 Machine Energy Consumption
Figure 13 shows the normalized machine energy consumption (e_norm), calculated as:

e_norm = e_LB / e_base    (4)

where e_LB represents the energy consumed by the temperature aware load balanced run and e_base is the energy consumed without DVFS, with all cores working at maximum frequency. e_norm for the w/o run is calculated in a similar way, with e_LB replaced by e_NoLB. The static power of the CPU, along with the power consumed by the power supply, memory, hard disk and motherboard, mainly forms the idle power of a machine. A node of our testbed has an idle power of 40 W, which represents around 45% of the total power when the machine is working at full frequency, assuming 100% CPU utilization. It is this high idle/base power which inflates the total machine energy consumption in the case of the w/o runs, as shown in Figure 13: for every extra second of execution time penalty, we pay an extra 40 J per node in addition to the dynamic energy consumed by the CPU. Considering this, our scheme does well to keep the normalized machine energy consumption close to 1, as shown in Figure 13. We can better understand why the w/o run consumes much more energy than our scheme by referring back to Figure 7: although the lower cores are idle after they are done with their tasks (white portion enclosed in the rectangle), they still consume idle power, thereby increasing the total energy consumed.

7.2 Cooling Energy Consumption
While there exists some literature discussing techniques for saving cooling energy, those solutions are not applicable to HPC, where applications are tightly coupled. Our aim in this work is to come up with a framework for analyzing cooling energy consumption specifically from the perspective of HPC systems. Based on such a framework, we can design mechanisms to save cooling energy that are particularly suited to HPC applications. We now refer to Equation (1) to infer that T_hot and T_ac are enough to compare the energy consumption of the CRAC, as the rest are constants. So we arrive at the following expression for the normalized cooling energy (c_norm):

c_norm = ((T_hot^LB - T_ac^LB) / (T_hot^base - T_ac^base)) * t_norm^LB    (5)

where T_hot^LB represents the temperature of the hot air leaving the machine room (entering the CRAC) and T_ac^LB represents the temperature of the cold air entering the machine room, respectively, when using the temperature aware load balancer.
Similarly, when running all the cores at maximum frequency without any DVFS, T_hot^base is the temperature of the hot air leaving the machine room and T_ac^base is the temperature of the cold air entering the machine room. t_norm^LB is the normalized time for the temperature aware load balanced run.
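As a purely hypothetical illustration of Equation (5) (the temperatures below are made up, reusing the 21.1 C set point as the inlet temperature for both runs): with T_hot^LB = 29 C, T_ac^LB = 21.1 C, T_hot^base = 33 C, T_ac^base = 21.1 C and t_norm^LB = 1.10, we get c_norm = (7.9 / 11.9) * 1.10 ≈ 0.73, i.e. roughly 27% less cooling energy than the baseline despite the 10% longer run.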

[Figure 13: Normalized machine energy consumption with and without Temperature Aware Load Balancing: (a) Jacobi2D, (b) Wave2D, (c) Mol3D]

Notice that we include the timing penalty in our cooling energy model so that we incorporate the additional time for which cooling must be done. Figure 14 shows the normalized cooling energy both with and without the temperature aware load balancer. We can see from the figure that both schemes end up saving some cooling energy, but temperature aware load balancing outperforms the w/o scheme by a significant margin. Our temperature readings showed that the difference between T_hot and T_ac was very close in the two cases, i.e. our scheme and the w/o scheme, and the savings in our scheme were a result of the savings in t_norm.

7.3 Total Energy Consumption
Although most data centers report cooling to account for 50% [9, 3, 17] of total energy, we decided to take a conservative figure of 40% [11] for it in our calculations of total energy. Figure 15 shows the percentage of total energy we save and the corresponding timing penalty we end up paying for it. Although it seems that Wave2D does not give us much room to decrease its timing penalty and energy, we would like to mention that our maximum threshold of 44 C was very conservative for it. On the other hand, the results for Mol3D and Jacobi2D are very encouraging, in the sense that if a user is willing to sacrifice some execution time, a considerable amount of energy can be saved while keeping core temperatures in check. It should also be noted that our current constraints are very strict, considering that we do not allow any core to go above the threshold. If we were to allow a range of temperatures instead of one strict threshold, we could improve the timing penalty even more. For example, we did an experiment with Mol3D where the allowed core temperature range was set to 44 C - 49 C and the CRAC set point was 23.3 C. With these settings, we saved 8% energy after paying only a 4% timing penalty. To quantify the energy savings achievable with our technique, we plot normalized time against normalized energy (Figure 16). The figure shows data points for both our scheme and the w/o scheme. We can see that for each CRAC set point, our scheme moves the corresponding w/o point towards the left (reducing energy) and down (reducing the timing penalty). The slope of these curves gives the number of seconds by which the execution time increases for each joule of energy saved. As we can see, Jacobi2D has a higher potential for energy saving than Mol3D because of its lower MFLOP/s.

8. RELATED WORK
Most researchers from HPC have focused on minimizing machine energy consumption as opposed to cooling energy [10, 11, 15]. Given a target program, a DVFS-enabled cluster, and constraints on power consumption, the authors of [18] come up with a frequency schedule that minimizes execution time while staying within the power constraints. Our work differs in that we base our DVFS decisions on core temperatures in order to save cooling energy, whereas they devise frequency schedules according to the task schedule irrespective of core temperatures. Their scheme works with load balanced applications only, whereas ours has no such constraint; in fact, one of the major features of our scheme is that it strives to achieve a good load balance. A runtime system named PET (Performance, power, energy and temperature management), by Hanson et al. [4], tries to maximize performance while respecting power, energy and temperature constraints.
Our goal is similar to theirs, but we achieve it in a multicore environment, which adds the additional dimension of load balancing. The work of Banerjee et al. [1] comes closest to ours in the sense that they also try to minimize cooling costs in an HPC data center. But their focus is on controlling the CRAC set points rather than the core temperatures. In addition, they need to know the job start and end times beforehand to come up with the correct schedule, whereas our technique does not rely on any pre-runs. Merkel et al. [10] also explore the idea of task migration from hot to cold cores. However, they do not do it for parallel applications and therefore do not have to deal with the complications in task migration decisions caused by synchronization primitives. In another work, Tang et al. [21] have proposed a way to decrease cooling and avoid hot spots by minimizing the peak inlet temperature of the machine room through intelligent task assignment. But their work is based on a small-scale data center simulation, while ours comprises experimental results on a reasonably large testbed. Work related to cooling energy optimization and hot-spot avoidance has been done extensively in non-HPC data centers [2, 3, 12, 20, 22]. But most of this work relies on placing jobs such that jobs expected to generate more heat are placed on nodes located in relatively cooler areas of the machine room, and vice versa. Rajan et al. [14] discuss the effectiveness of system throttling for temperature aware scheduling. They claim system throttling rules to be the best one can achieve under certain assumptions. But one of their assumptions, the non-migratability of tasks, is clearly not true for the HPC applications we target.

[Figure 14: Normalized cooling energy consumption with and without Temperature Aware Load Balancing: (a) Jacobi2D, (b) Wave2D, (c) Mol3D]
[Figure 15: Timing penalty and power savings in percentage for temperature aware load balancing: (a) Jacobi2D, (b) Wave2D, (c) Mol3D]
[Figure 16: Normalized time as a function of normalized energy: (a) Jacobi2D, (b) Wave2D, (c) Mol3D]

Another recent approach is used by Le et al. [9], where machines are switched on and off in order to minimize the total energy needed to meet the core temperature constraints. However, they do not consider parallel applications.

9. CONCLUSION
We experimentally showed the possibility of saving cooling and total energy consumed by our small data center for tightly coupled parallel applications. Our technique not only saved cooling energy but also minimized the timing penalty associated with it. Our approach was conservative in that we set hard limits on the absolute values of core temperature. However, our technique can readily be applied to constrain core temperatures within a specified temperature range, which can result in a much smaller timing penalty. We carried out a detailed analysis to reveal the relationship between application characteristics and the timing penalty that can be expected when core temperatures are constrained. Our technique was successfully able to identify and neutralize a hot spot in our testbed. We plan to extend our work by incorporating critical path analysis of parallel applications in order to make sure that we always try to keep all tasks on the critical path on the fastest cores. This would further reduce our timing penalty and possibly reduce machine energy consumption. We also plan to extend our work so that, instead of using DVFS to constrain core temperatures, we apply it to meet a certain maximum power threshold that a data center wishes not to exceed.

Acknowledgments
We are thankful to Prof. Tarek Abdelzaher for letting us use the testbed for experimentation.

10. REFERENCES
[1] A. Banerjee, T. Mukherjee, G. Varsamopoulos, and S. Gupta. Cooling-aware and thermal-aware workload placement for green HPC data centers. In International Green Computing Conference, 2010.
[2] C. Bash and G. Forman. Cool job allocation: measuring the power savings of placing jobs at cooling-efficient locations in the data center. In 2007 USENIX Annual Technical Conference, Berkeley, CA, USA, 2007. USENIX Association.
[3] C. D. Patel, C. E. Bash, and R. Sharma. Smart cooling of data centers. In IPACK'03: The Pacific Rim/ASME International Electronics Packaging Technical Conference and Exhibition, 2003.
[4] H. Hanson, S. Keckler, K. Rajamani, S. Ghiasi, F. Rawson, and J. Rubio. Power, performance, and thermal management for high-performance systems. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), March 2007.
[5] L. Kalé. The Chare Kernel parallel programming language and system. In Proceedings of the International Conference on Parallel Processing, volume II, pages 17-25, Aug. 1990.
[6] L. V. Kalé and A. Sinha. Projections: A preliminary performance tool for Charm. In Parallel Systems Fair, International Parallel Processing Symposium, Newport Beach, CA, April 1993.
[7] R. Kufrin. PerfSuite: An accessible, open source performance analysis environment for Linux. In Proc. of the Linux Cluster Conference, Chapel Hill, NC, 2005.
[8] E. Kursun, C.-Y. Cher, A. Buyuktosunoglu, and P. Bose. Investigating the effects of task scheduling on thermal behavior. In Third Workshop on Temperature-Aware Computer Systems (TACS'06), 2006.
[9] H. Le, S. Li, N. Pham, J. Heo, and T. Abdelzaher. Joint optimization of computing and cooling energy: Analytic model and a machine room case study. In The Second International Green Computing Conference (in submission), 2011.
[10] A. Merkel and F. Bellosa. Balancing power consumption in multiprocessor systems. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys 2006. ACM.
[11] L. Minas and B. Ellison. Energy Efficiency for Information Technology: How to Reduce Power Consumption in Servers and Data Centers. Intel Press, 2009.
[12] L. Parolini, B. Sinopoli, and B. H. Krogh. Reducing data center energy consumption via coordinated cooling and load management. In Proceedings of the 2008 Conference on Power Aware Computing and Systems, HotPower'08, Berkeley, CA, USA, 2008. USENIX Association.
[13] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kalé. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of the ACM/IEEE Conference on Supercomputing, Baltimore, MD, September 2002.
[14] D. Rajan and P. Yu. Temperature-aware scheduling: When is system-throttling good enough? In Web-Age Information Management, 2008 (WAIM '08), The Ninth International Conference on, pages 397-404, July 2008.
[15] B. Rountree, D. K. Lowenthal, S. Funk, V. W. Freeh, B. R. de Supinski, and M. Schulz. Bounding energy consumption in large-scale MPI programs. In Proceedings of the ACM/IEEE Conference on Supercomputing, 2007.
[16] O. Sarood, A. Gupta, and L. V. Kale. Temperature aware load balancing for parallel applications: Preliminary work. In The Seventh Workshop on High-Performance, Power-Aware Computing (HPPAC'11), Anchorage, Alaska, USA, 2011.
[17] R. Sawyer. Calculating total power requirements for data centers. American Power Conversion, 2004.
[18] R. Springer, D. K. Lowenthal, B. Rountree, and V. W. Freeh. Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'06, New York, NY, USA, 2006. ACM.
[19] R. F. Sullivan. Alternating cold and hot aisles provides more reliable cooling for server farms. White Paper, Uptime Institute, 2000.
[20] Q. Tang, S. Gupta, D. Stanzione, and P. Cayton. Thermal-aware task scheduling to minimize energy usage of blade server based datacenters. In Dependable, Autonomic and Secure Computing, 2nd IEEE International Symposium on, 2006.
[21] Q. Tang, S. Gupta, and G. Varsamopoulos. Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach. Parallel and Distributed Systems, IEEE Transactions on, 19(11):1458-1472, 2008.
[22] L. Wang, G. von Laszewski, J. Dayal, and T. Furlani. Thermal aware workload scheduling with backfilling for green data centers. In Performance Computing and Communications Conference (IPCCC), 2009 IEEE 28th International, pages 289-296, 2009.
[23] L. Wang, G. von Laszewski, J. Dayal, X. He, A. Younge, and T. Furlani. Towards thermal aware workload scheduling in a data center. In Pervasive Systems, Algorithms, and Networks (ISPAN), 2009 10th International Symposium on, 2009.
[24] G. Zheng. Achieving high performance on extremely large parallel machines: performance prediction and load balancing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2005.