How To Make A Co-Location Work For Free

Enabling Fair Pricing on HPC Systems with Node Sharing

Alex D. Breslow, Ananta Tiwari, Martin Schulz, Laura Carrington, Lingjia Tang, Jason Mars

University of California, San Diego, CA, USA, abreslow@cs.ucsd.edu
San Diego Supercomputer Center, La Jolla, CA, USA, {tiwari, lcarring}@sdsc.edu
Lawrence Livermore National Laboratory, Livermore, CA, USA, schulzm@llnl.gov
University of Michigan, Ann Arbor, MI, USA, {lingjia, profmars}@eecs.umich.edu

ABSTRACT
Co-location, where multiple jobs share compute nodes in large-scale HPC systems, has been shown to increase aggregate throughput and energy efficiency by 10 to 20%. However, system operators disallow co-location due to fair-pricing concerns, i.e., the need for a pricing mechanism that considers performance interference from co-running jobs. In the current pricing model, application execution time determines the price, which results in unfair prices paid by the minority of users whose jobs suffer from co-location. This paper presents POPPA, a runtime system that enables fair pricing by delivering precise online interference detection and facilitates the adoption of supercomputers with co-locations. POPPA leverages a novel shutter mechanism, a cyclic, fine-grained interference sampling mechanism, to accurately deduce the interference between co-runners and provide unbiased pricing of jobs that share nodes. POPPA is able to quantify inter-application interference within 4% mean absolute error on a variety of co-located benchmark and real scientific workloads.

Keywords
Online Pricing, Supercomputer Accounting, Resource Sharing, Chip Multiprocessor, Contention

1. INTRODUCTION
Supercomputers typically have hundreds to thousands of users and consist of tens to thousands of individual servers connected over a high-speed optical interconnect. At any one time, many users concurrently utilize the system. The current approach has been to give each user a non-overlapping set of compute nodes on which to run his or her application. While this approach prevents jobs from different users from clobbering one another, it leads to a missed performance opportunity.
In fact, recent work has shown that co-location, where a set of jobs from different users runs on a shared set of compute nodes, can increase mean application performance and system energy efficiency by 20% by reducing contention for shared resources in the memory subsystem and inter-node network [38, 33, 2]. In addition, current architectural trends and exascale computing studies suggest that the benefit of co-location is likely to increase. The studies project that compute nodes will have hundreds to thousands of cores [16]. For some applications, it may not be possible to use all of these cores efficiently. In particular, 8% of all XSEDE jobs use fewer than 512 cores [11, 45], which means co-location will likely be necessary to utilize all of a node's cores. Co-location seems inevitable for larger jobs as well. Projected scaling trends suggest an increase in the number of cores per node that outpaces increases in memory bandwidth and cache capacity, which will reduce the resources available per core [16].

[Figure 1: bar chart. y-axis: Scaled to Running Alone (0 to 2.0); x-axis: Co-runners (amg, gems, lammps, milc, namd, taubench); series: Price SOP, Price Fair, Price Fair Mean.]
Figure 1: Performance of GTC, a plasma physics code, when co-located with the applications on the x-axis. The current pricing mechanism penalizes the user for co-locating their job by charging them more when their job degrades more.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SC13, November 17-21, 2013, Denver, CO, USA. Copyright 2013 ACM 978-1-4503-2378-9/13/11 ...$15.00. http://dx.doi.org/10.1145/2503210.2503256
To mitigate contention, resource-hungry jobs will have to be spread out over more compute nodes and paired with resource-light jobs to maintain high system utilization [2]. Although co-location is beneficial to performance and energy efficiency, it also creates a new set of challenges, one of which is fair pricing. Fair pricing is a concern because although there is a net benefit from co-location, some pairings can cause one of the applications to slow down. When this happens, we argue that the user should receive a discount. However, if we apply the current state-of-practice (SOP) in HPC infrastructures, where users are billed proportionally to the time to execute their job, we find there is gross inequity: users whose jobs benefit from co-location pay comparatively less while users whose jobs do not benefit pay more.

Figure 1 illustrates the challenge. Under the current state-of-practice, a user running GTC [41], a plasma physics code, pays 6% more when co-located with LAMMPS [3], a molecular dynamics code, versus AMG [15, 1], a parallel algebraic multigrid solver. To remedy this problem, we suggest discounting a user based on the interference caused by the other co-running applications. The greater the interference, the greater the discount. The green bars show one such scheme. Because co-location increases machine throughput per unit time, these discounts can be viewed as passing the efficiency savings from co-location back to the end user when their expectation of service is violated.

Although the concept of progressive discounts is simple, the realization of such a policy on real systems poses a number of practical challenges. In particular, a fair pricing model of this nature requires precisely quantifying the interference due to shared resource contention. While there has been significant research into predicting cross-core interference, many of the techniques make heavy use of static profiling or have been tailored to specific machines or applications [42, 25, 26]. Even though this work has yielded considerable insight into the problem of shared resource contention, we argue that it is not suitable in practice for precise pricing on a real HPC cluster. In this domain, static profiling and machine- or application-specific approaches are not suitable as jobs may run very shortly after submission and their characterizations may not be known a priori. Although application profiling may enrich the solution space, we note that altering even a single input parameter for an application can vastly change its characteristics. For example, doubling a single array dimension can often radically transform an application's sensitivity to and aggressiveness on the memory subsystem. Thus, an instantaneous and dynamic mechanism is needed to continuously monitor and quantify the interference jobs suffer to drive precise pricing.
In addition to being dynamic and precise, the fundamental pricing mechanism must also be lightweight. The underlying pricing agent has to be mostly invisible to the application and therefore must have a negligible overhead, below the system noise threshold. These objectives lead us to the two key insights of the work: only a software system that uses empirical, online tests is suitable for this problem domain, and such an approach must be agnostic to the underlying software and hardware.

In this paper, we present such a solution: the Persistent Online Precise Pricing Agent (POPPA). POPPA is a lightweight runtime system that utilizes a cyclic, fine-grain, interference sampling mechanism to accurately deduce the interference between co-runners. The key design feature of POPPA is a dynamic contention detection technique we call shuttering. For brief periods of execution, POPPA pauses all applications but one and measures how the selected application's performance changes versus running co-located. From the disparity between the application's rate of forward progress made while running co-located versus shuttered, POPPA is able to precisely determine the impact of interference resulting from co-location and use these measurements to drive fair pricing for all users' jobs.

The contributions of this work are as follows:
• We introduce POPPA, a lightweight, workload and machine agnostic runtime system that enables fair pricing for HPC clusters. POPPA functions entirely in software, requires no changes to the system stack in current HPC clusters, and is readily deployable.
• We present the design of precise shuttering, a mechanism for the precise online measurement of the performance impact of cross-core interference. Our precise shuttering approach functions dynamically and requires no a priori knowledge or profiling of the applications.
• We present a new pricing model for HPC clusters based on POPPA to provide fair pricing to users.
• We provide a thorough evaluation of POPPA's efficacy and robustness as the central accounting mechanism on HPC clusters with a mix of MPI benchmarks and real workloads.
POPPA predicts co-located application run time with 4% mean absolute error and incurs less than 1% overhead. Using POPPA, we are able to discount the average user by 7.4% and deliver a pricing distribution that closely resembles that of an omniscient oracle.

2. BACKGROUND AND MOTIVATION
In order to better understand why fair pricing is of such importance, we must first explore the current state-of-practice in accounting on supercomputers. We start by examining the accounting and allocation model found in the United States Department of Energy Office of Science INCITE program [12] and the National Science Foundation XSEDE program [11], two of the largest U.S. programs that provide resources to the general HPC research community. Each of these programs facilitates access to a number of large scale computing infrastructures. To successfully obtain an allocation, researchers submit grant proposals and, after reviews, are awarded time on those systems as a finite number of service units (SUs). When a user runs a job on a system, they deplete their bank of SUs at a rate proportional to the length of their program's execution and the number of compute nodes that they request. In this model, users need strong guarantees that the value of an SU will not be negatively affected by other users' jobs running on the same computing resources. Similarly, supercomputer administrators care about user satisfaction and are incentivized to provide users with the best possible experience because individual supercomputing centers are awarded funds largely based on the success and popularity of their facilities. Consequently, we observe that throughout all levels of the funding ladder, fair pricing and accounting are crucial concerns. Regardless of what mechanisms are implemented to improve supercomputer performance, energy efficiency or fault tolerance, they must not pervert the fairness of the pricing scheme.

2.1 MPI Programming Model
Most large scale scientific applications utilize the Message Passing Interface (MPI) as the core abstraction to facilitate workload distribution across a cluster.
Two main characteristics of MPI programs are as follows:

1) Single Program Multiple Data (SPMD): MPI processes execute the same static program binary and use unique identifiers called ranks to dictate communication patterns as well as which blocks of code get executed by different processes. While this allows for a large amount of potential diversity between processes, in practice most MPI programs have all processes execute the same core algorithm on different data. Thus within an MPI program, all the processes have high similarity, e.g., they all compete for the same resources.

2) Tightly coupled communication synchronization: The vast majority of MPI programs exhibit tightly coupled communication synchronization. Because of this tight synchronization, processes must execute in relative lock-step. If a process reaches an explicit or implicit barrier before the other necessary parties, it must wait until all others make similar progress before proceeding.

2.2 Co-location of MPI programs
When we reason about the nature of MPI programs, it quickly becomes evident that executing a single MPI program across a private set of compute nodes is an inefficient use of system resources. The homogeneity between MPI processes and the fact that they are tightly coupled mean that many processes will execute the same program regions with high concurrency. When this happens, there is high risk for resource contention and performance degradation: homogeneous processes have high propensity to evict one another's data in the shared last level cache (LLC), contend for the memory controller, saturate off-chip bandwidth to main memory, and cause a backlog of messages for internode communication. Previous research shows that homogeneous MPI processes can degrade one another's performance by more than 2x [2, 38]. In addition, these works show that introducing heterogeneity in workloads by co-locating multiple MPI programs on disjoint cores can drastically improve performance and energy efficiency. In fact, both studies find that aggregate throughput increases by 12 to 23% on average over the current state of practice, and [2] shows that system energy efficiency increases by 11 to 22%. In conclusion, given the high cost of large supercomputers and the great performance and efficiency benefit of co-location, it is essential that we provide fair pricing mechanisms to make co-location practical.

3. POPPA OVERVIEW
In this section, we present the overview of the Persistent Online Precise Pricing Agent (POPPA) framework. Our primary design objective for POPPA is to provide accurate performance interference estimates for parallel applications with negligible overhead. As shown in Figure 2, POPPA consists of a main monitoring agent called the Controller and a series of Execution Managers.

Execution Manager: Each Execution Manager is responsible for launching and overseeing the entire execution of a parallel application on a given machine. The Execution Managers read from the central job queue and select the next job to run according to the job priority and its resource needs. An Execution Manager launches the selected job and attaches a performance monitoring context (PMC) to the job. The PMC monitors the job performance by reading and evaluating appropriate hardware performance counters. During execution, the Execution Manager updates and reports the current status and performance data of the job to the Controller.

Controller: The Controller is the main component of POPPA. Its principal responsibility is to conduct shuttering, a mechanism to measure and quantify the performance interference among the co-running applications. In essence, the Controller periodically pauses all applications but one for a very short period and monitors the performance impact on the one running application. To measure this impact, the Controller probes the PMCs of each active job to acquire the performance data and logs it. We present more details of the shuttering mechanism including our algorithms and policies in Section 5 and evaluate its accuracy and overhead in Section 8.

[Figure 2: block diagram. Labels: Job Queue; Pop Job; Persistent Online Precise Pricing Agent; Pricer; Controller; Execution Managers; PMC; Send Performance Data; Spawn, Manage, Kill; Start Application; Send Price; Log; Read; Account Manager.]
Figure 2: Interaction between POPPA components and other entities

Figure 2 presents how POPPA can be used for pricing.
After execution of a job has completed, the Pricer thread analyzes the raw performance data logged by the Controller and quantifies the performance interference and degradation. More details of the analysis and pricing are presented in Sections 4 and 6. Based on the quantification, the Pricer produces the price to be charged and propagates it to the Account Manager, which then deducts the price from the user's bank of SUs.

4. PRICING MODEL
In this section, we discuss the key issues related to pricing and accounting on current supercomputers and extend those notions to a supercomputer with job co-locations.

4.1 Pricing Without Co-location
For purposes of this discussion, assume that a user wants to run a job on a supercomputer and that $P$ denotes the price that the user is charged for running. In present day systems, $P$ is given by Equation 1, where $L$ is a rate constant in terms of service units per core per time quanta, $C$ is the number of cores that a job uses in whole compute node increments, and $T$ is the run time of the program.

$$P = L \cdot C \cdot T \tag{1}$$

From this equation, we can see that the price variable $P$ is linearly proportional to both the cores variable $C$ and the time variable $T$.

4.2 Pricing With Co-location
In this section, we propose how one could modify the existing pricing model to more fairly price applications when co-locations are present. In particular, if we have a job that is co-located with a set of jobs $J$, we want a formula that will produce a reasonable price $P_{co(J)}$, which takes into account the net interference from all applications in $J$. To this end, we replace $L$ with a rate function $F$, yielding Equation 2, where $F : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. $T_{solo}$ is the run time when the job gets all compute nodes to itself and $T_{co(J)}$ is the run time of the job when it is co-located with the set $J$ of other jobs.

$$P_{co(J)} = F(T_{solo}, T_{co(J)}) \cdot C \cdot T_{solo} \tag{2}$$

Ideally, $F$ is monotonically non-increasing in $T_{co(J)}$ so that the more degradation an application suffers from co-location, the more the user is discounted. For the purposes of this paper, we assume utility is proportional to 1 minus the rational degradation. Therefore if we equate utility to fairness, then we select $F$ such that users are discounted at a rate proportional to the degradation that each of their jobs experiences due to contention from co-runners. Thus if $D_{co(J)}$ is the degradation, then we want $P_{co(J)} = (1 - D_{co(J)}) \cdot P_{solo}$. Consequently we define $F$ as follows:

$$F(T_{solo}, T_{co(J)}) = L \cdot \frac{T_{solo}}{T_{co(J)}} = L \cdot (1 - D_{co(J)}) \tag{3}$$

By substituting Equation 3 into Equation 2 we see that we achieve the specific pricing model shown in Equation 4.

$$P_{co(J)} = L \cdot \frac{T_{solo}}{T_{co(J)}} \cdot C \cdot T_{solo} \tag{4}$$

While Equation 4 is good for the user, we acknowledge that it is an idealistic model. Its simplicity makes it easy for end users to understand; however, we note that other factors such as resource manager queue wait times, job priority, workload composition, the ratio of each shared resource a job consumes, machine architecture, and scheduling policy (i.e., capability versus capacity) are also important when determining a fair price. Thus supercomputing facilities will have to decide what $F$ makes sense for each of their systems.

5. PRECISE SHUTTER MECHANISM
As previously mentioned, POPPA's chief design objective is to produce fair prices with high precision, low overhead, and without the need for a priori knowledge. To achieve these goals we have designed precise shuttering, an online co-runner interference masking approach.
Essentially, the precise shuttering mechanism functions by alternating an application's execution environment between one where co-runners are executing and another where they are effectively absent. Figure 3 shows shuttering in action on two applications A and B that are co-located. The shuttering algorithm alternates between execution regions where A and B co-execute, A executes while B sleeps, A and B co-execute, and B executes while A sleeps. We repeat this pattern throughout the execution of the programs.

To gain insight from shuttering, we must measure the performance of each application before, during, and after shutter regions. During each shutter of duration S, we leverage hardware performance monitors via libpfm4 [7, 27] to measure the instructions per cycle of the sole non-sleeping application. To infer the degradation due to co-runners, we also measure the instructions per cycle (IPC) of all active applications S microseconds before the shutter and S microseconds directly after it.

[Figure 3: timeline of Program A and Program B. Labels: Solo + IPC Measure; Shutter Period.]
Figure 3: Shown here is shuttering in action on two separate jobs. During a shutter, one job executes while all others sleep.

Since we are primarily concerned with how performance changes with the presence or absence of contention, we only need to monitor the performance during small windows around shutters. We also perform each shutter infrequently to minimize the perturbation of application execution and parameterize the rate of shutter samples to control POPPA's overhead. As we show in this work, frequent shutters are not required to produce an accurate predictive model.
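The alternation pattern described above can be sketched in a few lines of Python. This is only an illustration of the round-robin ordering of co-run and shutter regions, not POPPA's implementation: the function name `shutter_schedule`, the tuple layout, and the region labels are hypothetical, and no processes are actually paused or measured.

```python
def shutter_schedule(apps, num_shutters):
    """Return the sequence of execution regions for a round-robin shutter.

    Each region is a (kind, running, sleeping) tuple. A co-run region, in
    which every application executes, precedes each shutter; during a
    shutter exactly one application runs and all others sleep.
    NOTE: names and layout are illustrative assumptions, not POPPA's API.
    """
    regions = []
    for s in range(num_shutters):
        solo = apps[s % len(apps)]  # the application that keeps running
        sleeping = tuple(a for a in apps if a != solo)
        regions.append(("co-run", tuple(apps), ()))
        regions.append(("shutter", (solo,), sleeping))
    return regions


if __name__ == "__main__":
    # Two co-located applications, as in the A/B example in the text.
    for kind, running, sleeping in shutter_schedule(["A", "B"], 2):
        print(kind, "running:", running, "sleeping:", sleeping)
```

For two applications A and B, the schedule cycles through: A and B co-execute, A runs while B sleeps, A and B co-execute, B runs while A sleeps.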
5.1 Algorithms
In this section, we present the logic of the shutter mechanism, whose core parts are shown in Algorithms 1, 2 and 3. Below we define a list of common data structures and constants used by the algorithms:

• A, an array of co-located applications
• perfDict, a lookup table that stores the measured IPC values of each application
• K, the number of IPC measurements to make in a row in a specific region¹
• S, the length of each measurement in µs
• P, the length of time between groups of measurements, i.e., the normal execution period, in µs
• S_shutter, the length of a shutter, approximately K · S

Algorithm 1 Measure(i, S, K)
 1: Initialize array perfValue of length |A[i]|
 2: for k = 0 to K − 1 do
 3:   for each thread t that is part of A[i] do
 4:     perfValue[t] = ReadCounters(t)
 5:   end for
 6:   Sleep for S µs
 7:   for each thread t that is part of A[i] do
 8:     perfDict[t].append(ReadCounters(t) − perfValue[t])
 9:   end for
10: end for

Algorithm 2 Shutter Core(j, S, K)
 1: for i = 0 to |A| − 1, where i ≠ j do
 2:   for each thread t that is part of A[i] do
 3:     Pause t
 4:   end for
 5: end for
 6: Measure(j, S, K)
 7: for i = 0 to |A| − 1, where i ≠ j do
 8:   for each thread t that is part of A[i] do
 9:     Resume t
10:     perfDict[t].append(THREAD_ASLEEP)
11:   end for
12: end for

Algorithm 3 POPPA Core
 1: j = 0
 2: while true do
 3:   for i = 0 to |A| − 1 in parallel do
 4:     Measure(i, S, K)
 5:   end for
 6:   Shutter Core(j, S, K)
 7:   for i = 0 to |A| − 1 in parallel do
 8:     Measure(i, S, K)
 9:   end for
10:   j = (j + 1) mod |A|
11:   Sleep P µs
12: end while

The core routine is Algorithm 3. At each iteration, we first measure the IPC of each application while co-located (lines 3-5). We then call Shutter Core for application j (line 6), which pauses all other applications and calls Measure to measure the IPC while j is running alone. After that, we measure the IPC of all applications and increment j (lines 7-10). Then the shutter component of POPPA goes to sleep for P µs of normal execution (line 11). Since POPPA is persistent, this process repeats continually as applications end and new applications enter the application pool.

5.2 Tuning the Shutter Mechanism
The shutter implementation presents a number of challenges. In particular, selecting the correct granularity to shutter at is key to accurately quantifying interference without noticeably adding to it. The first parameter is the gap between shutters P. As P is decreased, the amount of time that POPPA is active increases, consequently also increasing overhead. Since utilization in supercomputers is often above 95%, we assume that each core has an application thread assigned to it. Due to this fact, POPPA must time slice with application threads. If POPPA is active for x% of a single core's execution time, then assuming POPPA threads do not migrate, one of the co-running applications is likely to suffer at least an x% hit to performance due to synchronization between processes.
Since the POPPA runtime inevitably has overhead, we experimented with conducting round-robin migration of the POPPA threads to distribute the performance impact of time slicing across all application threads; however, we determined that a better solution was to select values for K, P and S that make POPPA's CPU utilization very low, as migration is not guaranteed to be fine-grain enough to mitigate the effect of time slicing.

Another important parameter is S_shutter, the duration of a shutter. In our implementation, this quantity is equal to the base cost of doing a shutter on 8 MPI processes, approximately 12 to 2µs (see Figure 4 in Section 8.1), plus K · S, where K · S is the product of the number of consecutive measurements and the length of each such measurement. During a shutter, the paused application makes no progress, thus keeping shutter duration very short relative to P is a primary concern.

An unexpected find relating to the shutter mechanism is that in certain cases, POPPA actually slightly improves the performance of co-located applications. During shutters, applications that sleep sacrifice a small amount of forward progress and the lone runner receives a performance boost from reduced contention. When the net performance boost from running in isolation offsets the net performance loss from sleeping, applications speed up relative to the baseline co-schedule performance. For pairs of two applications, speedup occurs when a co-schedule increases one application's run time by more than 2x relative to running with half the cores idle per socket. This phenomenon is demonstrated empirically in Section 8.2.

¹ We fix K = 1 for experiments and analyses in Section 8.

6. ESTIMATING DEGRADATION
In this section, we present our method for linking the raw data that POPPA produces to the actual prices we charge.

6.1 Idealized Model for Degradation
Our pricing model assumes that for an application $i$, we know the degradation $D_{co(J)}$ that $i$ suffers as a result of co-location with a set $J$ of applications. In our pricing model discussion, we formulated $1 - D_{co(J)}$ as $T_{solo} / T_{co(J)}$. While this gives us a precise way to calculate degradation, POPPA cannot directly measure $T_{solo}$.
Thus, we modify the formulation such that it is amenable to the IPC data that POPPA produces. On modern chip multiprocessors, if we are given an execution time in seconds, we can convert this to a value in clock cycles. Thus if we know the clock ticks per second, we can write the performance of $i$ normalized to running alone as the ratio of clock cycles $C_{solo}$ and $C_{co(J)}$ (see below).

$$Perf_{norm} = 1 - D_{co(J)} = \frac{C_{solo}}{C_{co(J)}} \tag{5}$$

Additionally, if we assume $i$ to be a truly serial program, then it is the case that $i$'s dynamic instructions $I$ do not change. Thus $I_{solo} = I_{co(J)}$, and consequently we can transform Equation 5 into a ratio of IPCs by multiplying by $I_{co(J)} / I_{solo}$, yielding the following:

$$Perf_{norm} = \frac{IPC_{co(J)}}{IPC_{solo}} \tag{6}$$

6.2 Known Challenges with Parallel Programs
For parallel programs, however, it turns out that Equation 6 is often imprecise. Many parallel programs contain mutexes, semaphores, and other locking mechanisms to enforce program correctness by preventing data races. When a load imbalance occurs, that is, one parallel process advances faster than its siblings, these locking mechanisms can distort both dynamic instruction count and CPU clock cycles. With MPI, this issue is quite prevalent. If a communication routine is implemented as blocking, then it is common practice to have the thread that initiated the routine poll for a certain number of cycles and then sleep. During this polling period, the thread executes a while loop where it continually tests whether the communication operation has completed. If the thread fails to finish the communication operation within a certain interval, it is put to sleep and signaled to wake up when the operation has completed. Because contention and background noise on the system can cause this polling period to change in duration, the number of dynamic instructions attributed to these communication regions is variable. With MVAPICH2, an MPI-2 implementation, the maximum polling period can be adjusted [52]. While we were tempted to disable polling, we knew that doing so would be disadvantageous. In particular, polling greatly increases individual application performance because the blocking thread avoids the performance hit associated with going to sleep and waking back up, as it can proceed as soon as communication has finished. Thus, we decided to keep the parameters that maximized performance even though it made precise prediction more challenging.

6.3 Filtering
Even though Equation 6 is imprecise in the presence of variable execution, we find that in practice, it is still sufficient for producing reasonable degradation estimates. We also assume that the average over the N IPC samples that we collect is roughly equivalent to the actual average IPC during shutters ($IPC_{solo}$) and during normal paired execution ($IPC_{co}$). These assumptions are presented below in Equations 7 and 8.

$$Perf_{norm} \approx \frac{IPC_{co(J)}}{IPC_{solo}} \tag{7}$$

$$IPC_{solo} \approx \frac{\sum_{j=0}^{N_{solo}} IPC_{solo,j}}{N_{solo}} \quad \text{and} \quad IPC_{co} \approx \frac{\sum_{j=0}^{N_{co}} IPC_{co,j}}{N_{co}} \tag{8}$$

POPPA gives us data in the form of a stream of blocks of IPC measurements, each consisting of K IPC measurements just before a shutter, K measurements during a shutter, and K afterward. We denote this stream of blocks as $B$ and the $i$-th such block as $B_i$; within each block $B_i$, the K IPC values before the shutter are denoted as $IPC_i^{before}$, the K IPC values during a shutter as $IPC_i^{during}$, and the K IPC values after a shutter as $IPC_i^{after}$. Thus $B_i = (IPC_i^{before}, IPC_i^{during}, IPC_i^{after})$. We denote the arithmetic means of each of these values as $\overline{IPC}_i^{before}$, $\overline{IPC}_i^{during}$ and $\overline{IPC}_i^{after}$.
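As a minimal sketch of Equations 7 and 8 (not the authors' implementation), the unfiltered estimate can be computed directly from such blocks. The function name `unfiltered_perf_norm` and the representation of a block as a `(before, during, after)` tuple of IPC sample lists are assumptions made for illustration; the sketch also assumes that the samples taken before and after a shutter together form the co-located samples.

```python
from statistics import mean


def unfiltered_perf_norm(blocks):
    """Estimate normalized performance as mean co-run IPC / mean solo IPC.

    `blocks` is a list of (before, during, after) tuples, each element a
    list of IPC samples. Samples before and after a shutter are treated
    as co-located samples; samples during a shutter as solo samples.
    NOTE: this data layout is a hypothetical choice for the sketch.
    """
    co_samples, solo_samples = [], []
    for before, during, after in blocks:
        co_samples.extend(before)
        co_samples.extend(after)
        solo_samples.extend(during)
    ipc_co = mean(co_samples)      # average over the co-located samples
    ipc_solo = mean(solo_samples)  # average over the shuttered samples
    return ipc_co / ipc_solo


if __name__ == "__main__":
    blocks = [([1.0], [2.0], [1.0]), ([1.5], [2.0], [1.5])]
    perf = unfiltered_perf_norm(blocks)
    print("normalized performance:", perf)
    print("estimated degradation:", 1.0 - perf)
```

In this toy input the co-located IPC averages 1.25 and the solo IPC averages 2.0, so the estimate of normalized performance is 0.625 and the corresponding degradation is 0.375.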
Using this notation, we present the filtering algorithm (Algorithm 4) that allows us to increase the precision of the performance estimate.

Algorithm 4 Filtered Prediction(IPC Tuples B)
 1: Initialize $IPC_{co}$ and $IPC_{solo}$ to 0
 2: for each $(IPC_i^{before}, IPC_i^{during}, IPC_i^{after})$ in B do
 3:   if $|\overline{IPC}_i^{before} - \overline{IPC}_i^{after}| < \delta$ and $\overline{IPC}_i^{before} < \overline{IPC}_i^{during}$ and $\overline{IPC}_i^{after} < \overline{IPC}_i^{during}$ then
 4:     $IPC_{co}$ += $0.5 \, (\overline{IPC}_i^{before} + \overline{IPC}_i^{after})$
 5:     $IPC_{solo}$ += $\overline{IPC}_i^{during}$
 6:   end if
 7: end for
 8: Return $(IPC_{solo} - IPC_{co}) / IPC_{solo}$

Algorithm 4 aims to reduce noise from sampling IPC. It removes groups of IPC values where the IPC during a shutter is not greater than the IPC directly before and after. Since a shutter can only relieve shared resource contention, the IPC during a shutter should always exceed the IPC before and after a shutter if all measurements occur during the same computational phase. The second mechanism, which states that the absolute difference in IPC before and after cannot exceed δ, works to ensure that clusters that cross phase boundaries are removed. We empirically determined δ = .5 to be a reasonable value.

7. EXPERIMENTAL SETUP
This section describes our methodology. We ran our experiments on the Gordon Supercomputer [32, 49]. Each node is dual-socket. For each socket, there is an 8-core Intel EM64T Xeon E5 (Sandy Bridge) processor. Simultaneous multithreading is disabled [61]. The CPU frequency is 2.6 GHz, each core has private 32 KB instruction and data L1 caches and a private 256 KB L2 cache, and each socket has 20 MB of L3. There are 64 GB of DRAM. Compute nodes run CentOS Linux with kernel version 2.6.32. The interconnect is QDR InfiniBand with 8 GB/s of bidirectional bandwidth, and the topology is a 3D torus of switches [1, 57]. Our applications and benchmarks are shown in the table that follows.
These benchmarks and applications encompass a wide variety of scientific domains such as subatomic particle physics [5], plasma physics [41], molecular dynamics [3], ocean modeling [2], computational fluid dynamics [6, 8], shock hydrodynamics [36], finite element methods [4] along with various other numerical methods that are of high interest to the HPC community. We also note that GTC and MILC, in particular, use a substantial number of dedicated allocation hours on many leadership class machines.

Benchmarks, Miniapps and Applications
  Swim [9], ADVECT3D [51], pcubed [39]
  NAS Parallel Benchmarks: CG, FT, LU, MG [14, 47]
  LULESH [36], MiniGhost [4], MiniFE [4], NekBone [6, 8]
  GTC [41], LAMMPS [3], MILC [5], POP [2]

We compile GTC, LAMMPS, MILC, POP, CG, FT, LU and MG with GNU compilers version 4.7 and MVAPICH2 version 1.7. LULESH, MiniGhost, MiniFE, and NekBone are compiled with PGI compilers version 11.9 and OpenMPI version 1.6. In our experiments, we co-locate two 8 process MPI applications together on the same set of sockets. Each socket has half its cores run one application and the other half run the other. Applications co-run together for a minimum of 5 iterations of both applications. As soon as one application ends, we immediately restart it. Data collection stops once both applications have completed 5 iterations. For the shutter mechanism, we fix K = 1 and P = 200ms.

8. EVALUATION
In this section, we evaluate the accuracy, overhead, and the pricing fairness of POPPA.

8.1 Quantifying POPPA's Base Overhead
In this section, we quantify the minimum time to execute components within the main loop of the POPPA daemon. The main loop consists of the three core operations of Algorithm 3: measuring the IPC of the application just prior to the shutter, issuing the shutter and measuring the IPC of the application during that window, and measuring the IPC of the application immediately following the shutter. For these experiments, we co-locate two MPI benchmarks, an auto-generated loop from the pcubed benchmark suite and a busy loop, called the NULL co-runner, that runs for the duration of the pcubed loop. In POPPA, we set all of the sleep parameters to 0, so we can measure the minimum execution time for all subcomponents of the loop. During each iteration of the main loop, we measure its total execution time, the time to measure the IPC both before and after the shutter, the total execution time of the shutter, the time to send the SIGSTOP and SIGCONT signals, and the time to make the IPC measurements during the shutter.

[Figure 4: stacked bar chart. y-axis: Time (us); x-axis: Number of Threads per Job (1 to 8); series: Main Loop Residual, Read Counters Pre/Post Shutter, Shutter Residual, Read Counters Shutter.]
Figure 4: Breakdown of base overhead to execute a single iteration of POPPA's core algorithm, where reading PMC values dominates total time

Figure 4 presents the results. On the x-axis we vary the number of threads in each job. So 4 corresponds to four pcubed tasks bound to cores 0, 2, 4, and 6 and four busy loop tasks bound to cores 1, 3, 5, and 7. The y-axis shows the total time in µs to execute the main loop. When studying this figure, several interesting trends emerge. Not surprisingly, adding more threads increases the minimum loop execution time. Execution time is dominated by IPC measurement in the form of calls to libpfm, particularly those outside the shutter region. In fact, we spend about 4x as much time measuring the IPC outside of shutter regions compared to within them.
This difference in measurement overhead inside versus outside a shutter has two causes: 1) we only measure active threads within a shutter, which is an optimization decision that we made, so the overhead to read the performance counters doubles outside of a shutter, and 2) we make two sets of IPC measurements outside of a shutter (before and after) versus a single set of measurements during one. We see that the mean time to shutter does not exceed 13µs and the mean time to execute the main loop does not exceed 5µs. Thus, our mechanism is fine grained enough to measure the IPC at sub-millisecond intervals for thread counts that are representative of contemporary multi-socket systems.

In addition to the minimum delays incurred by shuttering, we quantify the effect of enlarging the amount of time spent in a shutter. For this experiment, we fix the sleep time at the end of the main loop, P (see Section 5.1), to 200,000µs and increase the shutter duration, S (see Section 5.1), multiplicatively by factors of 2 from 200µs to 409,600µs. We separately co-run each of the NAS Parallel Benchmarks (NPB) with the busy loop NULL. Since NULL generates no interference, any dilation in run time is a direct result of increasing the shutter window.

Figure 5: The relative overhead of expanding the duration of a shutter, where points correspond to measurements and lines correspond to instantiations of the model. (x-axis: shutter duration in µs; y-axis: program execution time in s; series: CG-NULL, FT-NULL, LU-NULL, and MG-NULL for the measurements and CG-FIT, FT-FIT, LU-FIT, and MG-FIT for the model.)

Figure 5 presents the results. All four benchmarks exhibit a similar trend. When S is small relative to P, the overhead is small, but as the ratio S : P increases, so does the overhead. However, the overhead begins to flatten out as S approaches and exceeds the value of P.

We next formulate an analytical model for the overhead that a pricing shutter creates for an arbitrary co-located pool of n jobs. To do so, we examine the overhead from n consecutive shutters. Over the course of n shutters, each job runs in isolation once and sleeps n − 1 times while a single other job enjoys the privilege. Each such shutter has duration S. Thus each job sleeps for (n − 1)S seconds.
The total time for n iterations of the main loop of the daemon is also important for the analysis. Measuring the IPC before, during, and after a shutter takes 3S in total, as each measurement takes time S. After this, the daemon sleeps for P seconds. This pattern is cyclic, so the combined time is n(3S + P). Equation 9 shows the ratio of sleep time to total time.

$$Z(S, P) = \frac{\text{sleep time}}{\text{total time}} = \frac{(n - 1)\,S}{n\,(3S + P)} \tag{9}$$

The model for the execution time of the jobs in Figure 5 is shown below:

$$T(S, P) = T_0 \cdot \frac{1}{1 - Z(S, P)} = T_0 \cdot \frac{n\,(3S + P)}{2nS + nP + S} \tag{10}$$

Here $T_0$ is the run time of the application when co-located with the NULL co-runner. When we examine the model fit to the data in Figure 5, we observe that CG-FIT, FT-FIT, LU-FIT, and MG-FIT almost exactly predict the actual overhead of the shutter for all S in $\{100 \cdot 2^k\,\mu s : 1 \le k \le 12\}$ and a fixed P of 200ms. This model incorporates S, P, and T; if we know any two of these quantities, we can solve for the third. Thus administrators can decide on a system-by-system basis what is an acceptable amount of degradation due to the pricing shutter and choose values of S and P accordingly.

8.2 Determining the Sampling Rate

In this section, we evaluate the precision and overhead of the POPPA daemon for different shutter lengths (S values) while keeping P fixed at 200ms. We saw in the previous section that the overhead due to the shuttering mechanism has an analytical upper bound given by Equation 10. Using this equation, we selected values of S with less than 5% overhead: 200, 400, 800, 1600, 3200, 6400, 12800, and 25600µs.
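The overhead model can be checked numerically. The following Python sketch evaluates the sleep ratio and the dilated run time for two co-located jobs (n = 2) and lists the shutter durations whose modeled overhead stays below 5%. The function names are illustrative and not part of POPPA; P = 200,000µs and the candidate durations S = 100 · 2^k µs for 1 ≤ k ≤ 12 are taken here as assumptions, and "overhead" is interpreted as the relative run-time increase T/T0 − 1.

```python
def sleep_ratio(S, P, n=2):
    """Fraction of total time that a job spends asleep (the ratio Z)."""
    return (n - 1) * S / (n * (3 * S + P))


def dilated_runtime(T0, S, P, n=2):
    """Modeled run time under shuttering, given the baseline run time T0."""
    return T0 / (1.0 - sleep_ratio(S, P, n))


def relative_overhead(S, P, n=2):
    """Relative run-time increase caused by the shutter (T / T0 - 1)."""
    return dilated_runtime(1.0, S, P, n) - 1.0


P_US = 200_000                                   # assumed sleep time P in µs
candidates = [100 * 2**k for k in range(1, 13)]  # 200 µs ... 409,600 µs

under_5_percent = [S for S in candidates if relative_overhead(S, P_US) < 0.05]
print(under_5_percent)
```

Under these assumptions the surviving values are 200 through 25600µs, and the same set results if the 5% threshold is applied to the sleep ratio Z directly instead of to T/T0 − 1.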

Figure 6: Effect of shutter duration on accuracy and overhead for each NPB co-run with ADVECT3D-256. (Panels: (a) CG, (b) FT, (c) LU, (d) MG; x-axis: duration of each IPC measurement in µs; y-axis: percent degradation; series: measured degradation with co-location only, measured degradation with co-location and POPPA, predicted degradation without filtering, predicted degradation with filtering.)

Figure 7: Effect of shutter duration on accuracy and overhead for each NPB co-run with Swim-15. (Panels, axes, and series as in Figure 6.)

We ran two sets of pairwise experiments. In the first, we co-located the NPBs with a contentious co-runner (ADVECT3D with a grid size of $256^3$), and in the other we co-scheduled the NPBs with a moderately contentious co-runner (Swim with a grid dimension of $15^3$). Figures 6a, 6b, 6c, and 6d show the performance prediction accuracy of the POPPA daemon for CG, FT, LU, and MG when they are co-located with ADVECT3D. The accuracies of both the unfiltered and filtered predictors are shown. For clarity, we opt not to present the results for 400, 1600, and 6400µs. In this set of experiments, we are able to predict the contention very accurately with negligible overhead. Filtering improves prediction performance. Our predictors have the largest error for FT. S = 200µs gives the highest accuracy, but as S increases, so does the error.
This error results from FT's very fine-grained phases, which coarser-granularity shutters have trouble capturing. Figures 7a, 7b, 7c, and 7d show the prediction accuracy for the NPBs paired with Swim. Again, our predictions are very accurate. In this case, we note that the filtered prediction is sometimes overly zealous when predicting contention. However, this result is unsurprising given that filtering removes clusters of IPC measurements where the IPC measured during a shutter does not exceed the IPC directly before and after.

A contrasting finding between the experiments with ADVECT3D and Swim concerns daemon overhead as a function of S. In the experiments with ADVECT3D, overhead is flat regardless of S, whereas it sharply increases with Swim. This divergence arises because ADVECT3D is configured to be contentious whereas Swim is not. During a shutter, the one running application receives a respite from the contention generated by the other application. In the case of the NPBs with ADVECT3D, this causes each NPB to speed up by approximately 2x, which offsets the lost throughput from sleeping during alternate shutters. By contrast, Swim degrades each NPB by at most 15%, so the time spent sleeping cannot be masked.

Figure 8: Overhead of POPPA on NAS benchmarks co-located with ADVECT-256. (x-axis: duration of each IPC measurement in µs; y-axis: overhead as a percentage of execution time.)

These experiments show that the shutter duration S is largely irrelevant for accuracy. Thus when selecting S, it makes sense to select a value that induces minimal overhead and run time variation. Figures 8 and 9 present both the daemon's overhead and its distribution for the surveyed values of S. In Figure 8, regardless of the value of S, overhead due to the pricing shutter never exceeds 2%. However, in Figure 9, this value exceeds 4%, which is clearly too costly. S = 3200µs delivers an overhead of less than 1% with the smallest variation. For this reason, we use S = 3200µs for the remainder of our experiments.
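The filtering rule mentioned in this section (discarding clusters of IPC measurements in which the IPC during a shutter does not exceed the IPC directly before and after) can be illustrated with a short Python sketch. The sample layout `(ipc_before, ipc_during, ipc_after)` and the function names are hypothetical. The degradation estimator is also an assumption made only for illustration: it treats run time as inversely proportional to IPC, and it is not necessarily the predictor that POPPA uses.

```python
from statistics import mean


def filter_samples(samples):
    """Keep only clusters whose in-shutter IPC exceeds both neighbours."""
    return [(b, d, a) for (b, d, a) in samples if d > b and d > a]


def estimate_degradation(samples):
    """Illustrative estimate of percent run-time degradation.

    Assumes run time is inversely proportional to IPC, so that
    degradation = IPC_isolated / IPC_co_run - 1 (an assumption).
    """
    if not samples:
        return 0.0
    co_run_ipc = mean((b + a) / 2 for b, _, a in samples)
    isolated_ipc = mean(d for _, d, _ in samples)
    return 100.0 * (isolated_ipc / co_run_ipc - 1.0)


# Hypothetical measurements; the second cluster shows no speed-up in isolation.
samples = [(1.0, 1.5, 1.0), (1.2, 1.1, 1.2), (0.8, 1.2, 0.8)]

unfiltered = estimate_degradation(samples)
filtered = estimate_degradation(filter_samples(samples))
```

With these made-up samples, the filtered estimate is larger than the unfiltered one because the cluster that shows no benefit from isolation is dropped, which is consistent with the observation that the filtered predictor can overstate contention.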
8.3 Pairwise Evaluation

In this section, we evaluate the precision of POPPA on pairwise co-locations. Since our filtered prediction was better in aggregate in our previous experiments, we apply that prediction mechanism rather than the simple one. We run co-schedules of all possible combinations of our 12 benchmarks and real applications.

Figure 9: Overhead of POPPA on NAS benchmarks co-located with Swim-15. (x-axis: duration of each IPC measurement in µs; y-axis: overhead as a percentage of execution time.)

Figure 10: Run time prediction accuracy (%) for jobs on the x-axis co-located with jobs on the y-axis. (Both axes list gtc, lammps, milc, pop, lulesh, minife, minighost, nekproxy, cg, ft, lu, and mg, plus a mean row and a mean column.)

Figure 11: Performance degradation (%) for jobs on the x-axis co-located with jobs on the y-axis. (Axes as in Figure 10.)

Figure 10 shows the accuracy of our filtered predictor at quantifying degradation. The x-axis lists the names of the benchmarks, and the y-axis lists the co-runners. Individual cells present the percentage difference in predicted run time versus actual, where negative values represent underprediction and positive values represent overprediction. The top row, mean, presents the mean absolute error across the apps, and the rightmost column, mean, presents the mean absolute error that an application creates in the prediction accuracy for the other codes. Figure 11 presents the degradation of each application as a percentage of run time relative to running with the NULL co-runner, i.e., with half the cores vacant on each socket. The top row presents the mean degradation of each scientific code on the x-axis, and the rightmost column presents the mean degradation that each application on the y-axis causes to its co-runners.

If we study Figures 10 and 11 in concert, a number of interesting trends emerge. POPPA does well at quantifying degradation for all pairings consisting exclusively of our real applications: GTC, LAMMPS, MILC, and POP. Our mean absolute error is 2.5% and the absolute error never exceeds 5.8%. We accurately characterize both ends of the spectrum. We predict high degradation for MILC paired with itself, and we neither significantly underpredict nor overpredict for pairings with low mutual contention such as GTC-LAMMPS and LAMMPS-POP. For pairings of real apps with benchmarks, the prediction accuracy is generally quite good except when MILC is co-located with MiniFE and FT. For our proxy apps LULESH, MiniFE, MiniGhost, and NekProxy (NekBone), the results are more mixed. We are able to predict their performance with a mean absolute error of 3.8%.
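The accuracy metrics described above (a signed percentage error per cell, with mean absolute errors aggregated per application and per co-runner) can be sketched in Python as follows. The two job names and all error values are hypothetical placeholders chosen for illustration; they are not data from the paper.

```python
def percent_error(predicted_runtime, actual_runtime):
    """Signed error in percent: negative means underprediction."""
    return 100.0 * (predicted_runtime - actual_runtime) / actual_runtime


def mean_abs(values):
    """Mean absolute value of an iterable of signed errors."""
    values = list(values)
    return sum(abs(v) for v in values) / len(values)


# errors[co_runner][app]: hypothetical signed errors in percent.
errors = {
    "A": {"A": -2.0, "B": 4.0},
    "B": {"A": 1.0, "B": -3.0},
}

apps = sorted(errors)
# "Top row": mean absolute error for each application on the x-axis.
per_app = {app: mean_abs(errors[co][app] for co in errors) for app in apps}
# "Rightmost column": mean absolute error each co-runner induces in others.
per_co_runner = {co: mean_abs(errors[co].values()) for co in errors}
# Single summary figure across all pairings.
overall = mean_abs(e for row in errors.values() for e in row.values())
```

Averaging absolute values rather than signed values prevents underpredictions and overpredictions from cancelling each other out in the summary statistics.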
MiniFE is particularly interesting because in each case we overpredict the degradation for its co-runner (mean of 7.5%). This overprediction is an artifact of the filtering algorithm. When we use our unfiltered predictor, we overpredict by at most 1.5% for MiniFE's co-runners. MiniGhost, by contrast, causes us to underpredict contention for some of its co-runners.

On the NPBs, our prediction error is slightly higher. If we exclude FT, our mean absolute prediction error is within 5.3%. FT, however, poses challenges both for its own prediction and for the applications it is co-located with. In both cases, we underpredict the actual degradation. This underprediction is due to the duration S of the shutter. If we reexamine Figure 6b, we observe that S = 200µs yields the highest accuracy when FT is co-located with a contentious co-runner. We also observe in Figure 7b that, out of the possible values for S, S = 3200µs predicts the lowest contention. On the whole, our system is generous and tends towards modestly underpredicting contention. Our mean absolute error across all pairings is 4.0%.

8.4 Pricing Fairness

In this section, we compare POPPA's pricing fairness with that of the state-of-practice and the oracle. Figure 12 shows the distribution of relative SUs charged for each application using the different pricing schemes. On average, the state-of-practice would charge users 14% more as a result of co-locating their jobs. Jobs that degrade more pay more. POPPA, on the other hand, discounts users by an average of 7.4%, which is close to the 11.5% discount that the oracle would offer. When we examine the minimum and maximum relative SUs charged, we also see favorable results for POPPA. The maximum discount given by POPPA is 40.8%, which is close to the oracle's 38.3%. The maximum normalized price paid by a user under POPPA is 13.8% of the spread baseline versus the oracle's 99.8%. In the minority of cases where POPPA charges more than the spread baseline (23/144), the surcharge is usually smaller than run-to-run variation, with a mean surcharge of 1.3%. In addition, the mean price paid for each application never exceeds 99.2% of the baseline, and thus over time, all users will receive a discount. Contrast this with the state-of-practice, where a user running MILC can pay up to 62.1% more in the worst case and on average would expect to pay 24.9% more as a result of cross-application interference.

Figure 12: The distribution of prices a user would pay for a given application when using either the state-of-practice (SOP), POPPA, or the maximally fair Oracle. (x-axis: applications; y-axis: price normalized to running solo.)

If we consider the impact of POPPA's discounts, we find they are entirely tenable. Recall that the job striping study [20] found that co-locating MPI benchmarks and full-scale applications at scale increased mean system throughput by 12 to 23%. Thus discounting users by a mean 7.4% does not inflate the purchasing power of SUs, and so SU allocation need not be changed.

9. RELATED WORK

There are a number of works that investigate pricing or identify pricing as a key issue for large-scale grid and cloud infrastructures [13, 63, 48, 54]. Our work differs from these works in that we address the pricing issue in supercomputers with co-locations. To the best of our knowledge, our work is the first to explore this problem space.

Although this work addresses challenges related to fair pricing, it shares similarities with research that addresses identifying and mitigating contention in multicore systems. Early work on simultaneous multithreading processors investigated co-scheduling of heterogeneous threads [55, 56, 21] as a way to increase throughput by reducing contention. Cross-core contention has also been extensively studied [22, 65, 44, 43]. A mechanism similar to the pricing shutter is explored in [44] but differs in that it targets the commercial data center space and focuses on L3 miss rates with and without the presence of contention.
Another solution to mitigating contention has been cache partitioning, both in software and in hardware [46, 58, 53, 23]. Core fusion is an architectural design that helps reduce the cross-core contention problem by dynamically combining simpler cores into larger cores [34, 59]. Others have examined using scheduling to mitigate contention [64, 29, 28, 18, 17], and [50, 60] investigate scheduling considerations in MapReduce environments. There are also studies that evaluate the effectiveness of analytical and statistical models to solve problems related to contention [40, 62, 24, 31]. The computational complexity, heuristics, and approximation algorithms for optimal multiprocessor scheduling are explored in [30, 19, 37, 35].

10. CONCLUSION

We have provided a mechanism to enable fair pricing on HPC systems, addressing one of the fundamental roadblocks to node sharing on such systems. By employing POPPA, we can accurately measure performance degradation across a range of MPI applications. Using this data, we price users in a fashion that approaches the optimal fairness provided by the oracle, and our mean absolute prediction error is 4% across all combinations of 12 application codes. POPPA is not a definitive solution to the pricing problem but a key part of a more holistic solution. Going forward, the development of additional lightweight techniques for application introspection will become essential. By harnessing this dynamic information, further optimization opportunities will arise. Through combining these solutions, the road to exascale supercomputers looks bright.

11. ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for their time and feedback. In addition, they thank Professor Mike Norman of UCSD, Professor Leo Porter of Skidmore College, and Terri Quinn of LLNL. Some of the ideas in this paper were inspired by discussions with the late Allan Snavely. Part of this work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-635977).
This work was supported in part by the DOE Office of Science, Advanced Scientific Computing Research, under award number 62855, "Beyond the Standard Model: Towards an Integrated Modeling Methodology for the Performance and Power"; PNNL lead institution; Program Manager Sonia Sachs. The authors acknowledge National Science Foundation support under CCF-132682.

12. REFERENCES

[1] ASC Sequoia benchmark codes. https://asc.llnl.gov/sequoia/benchmarks/.
[2] CESM1.0: Parallel ocean program (POP2). http://www.cesm.ucar.edu/models/cesm1.0/pop2/.
[3] Large-scale Atomic/Molecular Massively Parallel Simulator. http://lammps.sandia.gov/.
[4] Mantevo suite. http://www.mantevo.org/.
[5] MIMD Lattice Computation (MILC) Collaboration. http://www.physics.indiana.edu/~sg/milc.html.
[6] Nek5000 project. https://nek5000.mcs.anl.gov/index.php/main_page.
[7] perfmon2: improving performance monitoring on Linux. http://perfmon2.sourceforge.net/.
[8] Proxy-apps for thermal hydraulics. https://cesar.mcs.anl.gov/content/software/thermal_hydraulics.
[9] SPEC CPU 2000 benchmark suite. www.spec.org/cpu2000.
[10] Gordon user guide. http://www.sdsc.edu/us/resources/gordon/, 2012.
[11] Extreme Science and Engineering Discovery Environment. www.xsede.org, 2013.
[12] Innovative and Novel Computational Impact on Theory and Experiment. http://www.doeleadershipcomputing.org/incite-program/, 2013.
[13] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4), 2010.
[14] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS parallel benchmarks summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing '91, New York, NY, USA, 1991. ACM.
[15] A. H. Baker, T. Gamblin, M. Schulz, and U. M. Yang. Challenges of scaling algebraic multigrid across modern multicore architectures. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International. IEEE, 2011.
[16] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems. www.cse.nd.edu/reports/2008/tr-2008-13.pdf, 2008.
[17] S. Blagodurov and A. Fedorova. Towards the contention aware scheduling in HPC cluster environment. In Journal of Physics: Conference Series, volume 385. IOP Publishing, 2012.
[18] S. Blagodurov, S. Zhuravlev, and A. Fedorova. Contention-aware scheduling on multicore systems. ACM Transactions on Computer Systems, 28, 2010.
[19] J. Blazewicz, J. K. Lenstra, and A. Kan. Scheduling subject to resource constraints: classification and complexity. Discrete Applied Mathematics, 5(1), 1983.
[20] A. D. Breslow, L. Porter, A. Tiwari, M. Laurenzano, L. Carrington, D. M. Tullsen, and A. E. Snavely. The case for colocation of HPC workloads. Concurrency and Computation: Practice and Experience, 2013.
[21] F. J. Cazorla, P. M. Knijnenburg, R. Sakellariou, E. Fernández, A. Ramirez, and M. Valero. Predictable performance in SMT processors. In 1st Conference on Computing Frontiers, 2004.
[22] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In 11th International Symposium on High-Performance Computer Architecture, 2005.
[23] J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing. ACM, 2007.
[24] T. Dwyer, A. Fedorova, S. Blagodurov, M. Roth, F. Gaud, and J. Pei. A practical method for estimating performance degradation on multicore processors, and its application to HPC workloads. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 2012.
[25] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten. Cache pirating: Measuring the curse of the shared cache. In Parallel Processing (ICPP), 2011 International Conference on. IEEE, 2011.
[26] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten. Bandwidth bandit: Understanding memory contention. In Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on. IEEE, 2012.
[27] S. Eranian. Perfmon: Linux performance monitoring for IA-64. Downloadable software with documentation, http://www.hpl.hp.com/research/linux/perfmon, 2003.
[28] A. Fedorova, M. Seltzer, and M. D. Smith. Cache-fair thread scheduling for multicore processors. Division of Engineering and Applied Sciences, Harvard University, Tech. Rep. TR-17-06, 2006.
[29] A. Fedorova, M. Seltzer, and M. D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In 16th International Conference on Parallel Architecture and Compilation Techniques, 2007.
[30] M. R. Garey and D. S. Johnson. Complexity results for multiprocessor scheduling under resource constraints. SIAM Journal on Computing, 4(4), 1975.
[31] S. Govindan, J. Liu, A. Kansal, and A. Sivasubramaniam. Cuanta: quantifying effects of shared on-chip resource interference for consolidated virtual machines. In Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011.
[32] J. He, A. Jagatheesan, S. Gupta, J. Bennett, and A. Snavely. DASH: a recipe for a flash-based data intensive supercomputer. In 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.
[33] C. Iancu, S. Hofmeyr, F. Blagojevic, and Y. Zheng. Oversubscription on multicore processors. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, Apr. 2010.
[34] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating software diversity in chip multiprocessors. ACM SIGARCH Computer Architecture News, 35(2), 2007.
[35] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008.
[36] I. Karlin, A. Bhatele, J. Keasler, B. L. Chamberlain, J. Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, et al. Exploring traditional and emerging parallel programming models using a proxy application. 27th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2013), Boston, USA, 2013.
[37] H. Kasahara and S. Narita. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Transactions on Computers, 33(11), 1984.
[38] M. J. Koop, M. Luo, and D. K. Panda. Reducing network contention with mixed workloads on modern multicore clusters. In Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on. IEEE, 2009.
[39] M. A. Laurenzano, M. Meswani, L. Carrington, A. Snavely, M. M. Tikir, and S. Poole. Reducing energy usage with memory and computation-aware dynamic frequency scaling. In Proceedings of the 17th International Euro-Par Conference on Parallel Processing, EuroPar '11, Bordeaux, France, 2011.
[40] S.-H. Lim, J.-S. Huh, Y. Kim, G. M. Shipman, and C. R. Das. D-factor: a quantitative model of application slow-down in multi-resource shared systems. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems. ACM, 2012.
[41] Z. Lin, G. Rewoldt, S. Ethier, T. S. Hahm, W. W. Lee, J. L. V. Lewandowski, Y. Nishimura, and W. X. Wang. Particle-in-cell simulations of electron transport from plasma turbulence: recent progress in gyrokinetic particle simulations of turbulent plasmas. Journal of Physics: Conference Series, 16(1), 2005.
[42] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations. In MICRO '11: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, New York, NY, USA, 2011. ACM.
[43] J. Mars, L. Tang, and M. L. Soffa. Directly characterizing cross core interference through contention synthesis. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM, 2011.
[44] J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. Contention aware execution: online contention detection and response. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 2010.
[45] R. L. Moore, D. L. Hart, W. Pfeiffer, M. Tatineni, K. Yoshimoto, and W. S. Young. Trestles: a high-productivity HPC system targeted to modest-scale and gateway users. In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery, TG '11, New York, NY, USA, 2011. ACM.
[46] F. Mueller. Compiler support for software-based cache partitioning. In ACM SIGPLAN Notices, volume 30. ACM, 1995.
[47] NAS. NAS parallel benchmarks website. http://www.nas.nasa.gov/resources/software/npb.html.
[48] D. Niu, C. Feng, and B. Li. Pricing cloud bandwidth reservations under demand uncertainty. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems. ACM, 2012.
[49] M. Norman and A. Snavely. Accelerating data-intensive science with Gordon and Dash. In 2010 TeraGrid Conference, 2010.
[50] J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguadé, M. Steinder, and I. Whalley. Performance-driven task co-scheduling for MapReduce environments. In Network Operations and Management Symposium (NOMS), 2010 IEEE. IEEE, 2010.
[51] L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, and P. Sadayappan. Combined iterative and model-driven optimization in an automatic parallelization framework. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for. IEEE, 2010.
[52] MVAPICH Team. MVAPICH2 1.8 user guide. http://mvapich.cse.ohio-state.edu/support/mvapich2-1.8_user_guide.pdf, 2012.
[53] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2006.
[54] B. Sharma, R. K. Thulasiram, P. Thulasiraman, S. K. Garg, and R. Buyya. Pricing cloud compute commodities: a novel financial economic model. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012). IEEE Computer Society, 2012.
[55] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000.
[56] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In ACM SIGMETRICS Performance Evaluation Review, volume 30. ACM, 2002.
[57] S. M. Strande, P. Cicotti, R. S. Sinkovits, W. S. Young, R. Wagner, M. Tatineni, E. Hocks, A. Snavely, and M. Norman. Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer. In Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the extreme to the campus and beyond. ACM, 2012.
[58] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 28(1), 2004.
[59] M. A. Suleman, M. Hashemi, C. Wilkerson, Y. N. Patt, et al. Morphcore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP. In Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on. IEEE, 2012.
[60] C. Tian, H. Zhou, Y. He, and L. Zha. A dynamic MapReduce scheduler for heterogeneous workloads. In Grid and Cooperative Computing, 2009. GCC '09. Eighth International Conference on. IEEE, 2009.
[61] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In 23rd Annual International Symposium on Computer Architecture, May 1996.
[62] N. Vasić, D. Novaković, S. Miučin, D. Kostić, and R. Bianchini. Dejavu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2012.
[63] H. Wang, Q. Jing, R. Chen, B. He, Z. Qian, and L. Zhou. Distributed systems meet economics: pricing in the cloud. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, 2010.
[64] Y. Wiseman and D. Feitelson. Paired gang scheduling. Parallel and Distributed Systems, IEEE Transactions on, 14(6), June 2003.
[65] C. Xu, X. Chen, R. P. Dick, and Z. M. Mao. Cache contention and application performance prediction for multi-core systems. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010.