Heuristic Static Load-Balancing Algorithm Applied to CESM
Yuri Alexeev (1), Sheri Mickelson (1), Sven Leyffer (1), Robert Jacob (1), Anthony Craig (2)
(1) Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, USA
(2) CCSM Software Engineering Group, NCAR, Boulder, CO 80305, USA

Argonne National Laboratory supercomputers

Intrepid (IBM Blue Gene/P)
- 40,960 nodes / 163,840 cores
- 557 Teraflops peak
- PowerPC 450 with 4 cores/node at 850 MHz
- Double FPU, 2-wide double-precision SIMD
- 512 MB per core

Mira (IBM Blue Gene/Q)
- 49,152 nodes / 786,432 cores
- 10 Petaflops peak
- PowerPC A2 with 16 cores/node at 1.6 GHz
- Quad FPU, 4-wide double-precision SIMD
- 1 GB per core

CESM setup
- CESM fully coupled active components, 1 degree resolution: f09_g16.b
- Calculations were run on Intrepid (40 racks Blue Gene/P)
- Goal: minimize total execution time
[Figure: component layout; axes: time, node allocation]

Heuristic Static Load-Balancing (HSLB) Algorithm
(1) Gather data: Run CESM calculations D times, each with a different total number of cores. Collect the running times $y_j$ for each component.
(2) Fit: Solve a least squares problem for each component to determine the coefficients a, b, c, and d of its performance model.
(3) Solve: Determine the best allocation by solving the MINLP, and obtain the optimal size n for each component.
(4) Execute: Run the CESM simulations using the subgroup sizes determined in step (3).

Gather data for step (1)
- Calculations were run on 512, 1024, 2048, 4096, and 8192 cores

Performance model for step (2)

$$T_i(n_i) = T_i^{\mathrm{scal}}(n_i) + T_i^{\mathrm{nonlin}}(n_i) + T_i^{\mathrm{serial}} = \frac{a_i}{n_i} + b_i n_i^{c_i} + d_i, \qquad i = 1, \ldots, C$$

- $T_i(n_i)$: the wall-clock time to compute the $i$th component as a function of $n_i$, the number of cores allocated to process it
- $T_i^{\mathrm{scal}}(n_i) = a_i / n_i$: time spent in the perfectly scalable portion of the component
- $T_i^{\mathrm{serial}} = d_i$: time spent in the non-parallelized portion of the component
- $T_i^{\mathrm{nonlin}}(n_i) = b_i n_i^{c_i}$: time spent in the partially parallelized portion: initialization, communication, synchronization, etc. (anything nonlinear and not serial)
- The model makes sense both mathematically and from the viewpoint of Amdahl's law
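
The link to Amdahl's law can be made explicit with a short derivation. This is a sketch under one stated assumption that is not on the slide: the nonlinear term is switched off ($b_i = 0$), so that only the scalable and serial parts remain. The symbols $S_i$ (speedup) and $s_i$ (serial fraction) are introduced here for illustration.

```latex
% Assumption: b_i = 0, so T_i(n_i) = a_i / n_i + d_i.
\begin{align*}
S_i(n_i) &= \frac{T_i(1)}{T_i(n_i)}
          = \frac{a_i + d_i}{a_i / n_i + d_i}
          = \frac{1}{s_i + (1 - s_i)/n_i},
\qquad s_i = \frac{d_i}{a_i + d_i}, \\
\lim_{n_i \to \infty} S_i(n_i) &= \frac{1}{s_i} = 1 + \frac{a_i}{d_i}.
\end{align*}
```

With $b_i = 0$ the model reduces to Amdahl's law with serial fraction $s_i$; the term $b_i n_i^{c_i}$ then accounts for overheads that grow with the core count.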
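
Fitting data for step (2)
Obtain the best fit for each component by solving the least squares problem

$$\min_{a, b, c, d} \; \sum_{j=1}^{D} \left( y_j - \frac{a}{n_j} - b\, n_j^{c} - d \right)^2 \qquad \text{subject to } a, b, c, d \in \mathbb{R}_+$$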
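
A minimal, standard-library-only sketch of such a fit is given below. It is not the solver used by the authors: it assumes the exponent c is scanned over a user-supplied grid, and for each fixed c it solves the remaining nonnegative linear least squares problem in (a, b, d) by trying every subset of free coefficients. The function names, the grid, and the synthetic timings are all hypothetical.

```python
from itertools import combinations

def model(n, a, b, c, d):
    """Performance model T(n) = a/n + b*n**c + d."""
    return a / n + b * n ** c + d

def solve_linear(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting (None if singular)."""
    k = len(v)
    A = [list(M[r]) + [v[r]] for r in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-12:
            return None
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k + 1):
                A[r][j] -= f * A[col][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (A[r][k] - sum(A[r][j] * x[j] for j in range(r + 1, k))) / A[r][r]
    return x

def fit(ns, ys, c_grid):
    """Return (sse, a, b, c, d) with a, b, d >= 0 and c taken from c_grid."""
    best = None
    for c in c_grid:
        raw = [[1.0 / n, float(n) ** c, 1.0] for n in ns]
        # Scale each basis column to unit maximum to keep the normal equations well conditioned.
        scale = [max(abs(row[p]) for row in raw) for p in range(3)]
        basis = [[row[p] / scale[p] for p in range(3)] for row in raw]
        for size in range(1, 4):
            for free in combinations(range(3), size):
                # Normal equations restricted to the free coefficients; the rest stay at zero.
                M = [[sum(row[p] * row[q] for row in basis) for q in free] for p in free]
                v = [sum(row[p] * y for row, y in zip(basis, ys)) for p in free]
                sol = solve_linear(M, v)
                if sol is None or min(sol) < 0:
                    continue  # singular system or violates nonnegativity
                coef = [0.0, 0.0, 0.0]
                for idx, val in zip(free, sol):
                    coef[idx] = val / scale[idx]
                a, b, d = coef
                sse = sum((y - model(n, a, b, c, d)) ** 2 for n, y in zip(ns, ys))
                if best is None or sse < best[0]:
                    best = (sse, a, b, c, d)
    return best

# Core counts from step (1); the timings are synthetic, generated from made-up coefficients.
ns = [512, 1024, 2048, 4096, 8192]
true_coef = (40000.0, 0.5, 0.5, 20.0)  # hypothetical a, b, c, d
ys = [model(n, *true_coef) for n in ns]
sse, a, b, c, d = fit(ns, ys, [k / 10 for k in range(1, 16)])
print(f"a={a:.3f} b={b:.3f} c={c:.1f} d={d:.3f} sse={sse:.2e}")
```

Enumerating subsets is only practical because there are three linear coefficients; a production fit would more likely use a bounded nonlinear least squares routine.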

Formulating the Optimization Problem
Problem: optimize the number of nodes $n_i$ to be allocated to each component $i \in \{1, \ldots, C\}$
- minimize the total wall time over all components: $\min_{n} \sum_{i=1}^{C} T_i(n_i)$
- minimize the maximum wall time used by a component: $\min_{n} \max_{i} T_i(n_i)$
- maximize the minimum wall time used by a component: $\max_{n} \min_{i} T_i(n_i)$
[Figure: axes: number of nodes, time]

Formulating the mathematical problem for step (3)

Given:
- $\mathbb{Z}_+$: set of positive integer numbers
- $\mathbb{R}_+$: set of positive real numbers
- $C = \{\mathrm{ice}, \mathrm{lnd}, \mathrm{atm}, \mathrm{ocn}\} = \{i, l, a, o\}$: set of components
- $N \in \mathbb{Z}_+$: total number of nodes available for allocation
- $O = \{2, 4, \ldots, 480, 768\} = \{O_1, \ldots, O_m\}$: possible allocations for ocn
- $A = \{1, 2, \ldots, 1638, 1664\} = \{A_1, \ldots, A_m\}$: possible allocations for atm

Variables:
- $T \in \mathbb{R}_+$: wall-clock time obtained by solving the allocation problem
- $T_{\mathrm{icelnd}} \in \mathbb{R}_+$: wall-clock time to balance lnd and ice
- $T_{\mathrm{sync}} \in \mathbb{R}_+$: synchronization tolerance to balance lnd and ice
- $n_j \in \mathbb{Z}_+$: number of nodes allocated to component $j \in C$
- $T_j(n_j) \in \mathbb{R}_+$: (fitted) performance function modeling the time taken to run on $n_j$ nodes
- $z_k \in \{0, 1\}$: binary variables to model the selection of the number of nodes, $n_o$

Minimize: $T$

Subject to (constraints for layout (1)):
- $T_{\mathrm{icelnd}} \ge T_i(n_i)$
- $T_{\mathrm{icelnd}} \ge T_l(n_l)$
- $T \ge T_{\mathrm{icelnd}} + T_a(n_a)$
- $T \ge T_o(n_o)$
- $T_l(n_l) \ge T_i(n_i) - T_{\mathrm{sync}}$
- $T_l(n_l) \le T_i(n_i) + T_{\mathrm{sync}}$
- $n_a + n_o \le N$
- $n_i + n_l \le n_a$
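
For small node counts, the layout problem can be explored by plain enumeration. The sketch below is not the MINLP approach from the slides: it assumes ice and lnd run concurrently inside the atm allocation while ocn runs alongside, so that the total time is max(max(T_ice, T_lnd) + T_atm, T_ocn), and it treats T_sync as a fixed input tolerance. All coefficients, the allocation sets, and the function names are hypothetical.

```python
def model(n, coef):
    """Fitted performance model T(n) = a/n + b*n**c + d for coef = (a, b, c, d)."""
    a, b, c, d = coef
    return a / n + b * n ** c + d

def best_layout(coefs, total_nodes, ocn_choices, atm_choices, t_sync):
    """Exhaustively search allocations satisfying the layout constraints."""
    best = None
    for n_a in atm_choices:
        # Best ice/lnd split inside the atm allocation: n_i + n_l <= n_a.
        split = None
        for n_i in range(1, n_a):
            t_i = model(n_i, coefs["ice"])
            for n_l in range(1, n_a - n_i + 1):
                t_l = model(n_l, coefs["lnd"])
                if abs(t_l - t_i) > t_sync:
                    continue  # lnd and ice must finish within the tolerance
                t_icelnd = max(t_i, t_l)
                if split is None or t_icelnd < split[0]:
                    split = (t_icelnd, n_i, n_l)
        if split is None:
            continue  # no balanced ice/lnd split for this n_a
        t_seq = split[0] + model(n_a, coefs["atm"])
        for n_o in ocn_choices:
            if n_a + n_o > total_nodes:
                continue  # exceeds the node budget
            total = max(t_seq, model(n_o, coefs["ocn"]))
            if best is None or total < best["T"]:
                best = {"T": total, "ice": split[1], "lnd": split[2],
                        "atm": n_a, "ocn": n_o}
    return best

# Made-up coefficients (a, b, c, d); these are not fitted CESM values.
coefs = {
    "ice": (7000.0, 0.0, 1.0, 25.0),
    "lnd": (1200.0, 0.0, 1.0, 20.0),
    "atm": (30000.0, 0.01, 1.0, 15.0),
    "ocn": (8000.0, 0.0, 1.0, 30.0),
}
N = 128
T_SYNC = 10.0
layout = best_layout(coefs, N, range(2, N + 1, 2), range(1, N + 1), T_SYNC)
print(layout)
```

Because the total time is monotone in the ice/lnd time, the best split can be chosen once per atm allocation, which keeps the search cubic rather than quartic in the node count.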

Solving MINLP problem
- Formulation is written in AMPL
- Classical branch-and-bound [Dakin, 1965] implemented in MINOTAUR: http://wiki.mcs.anl.gov/minotaur
- Solve relaxed NLP (continuous relaxation); the solution value provides a lower bound
- Branch on y
- Solve NLP and branch until:
  - node is infeasible
  - node is integer feasible (gives an upper bound)
  - node lower bound is no better than the current upper bound
- Tree search is exhaustive but not a complete enumeration
- The method is guaranteed to find a globally optimal solution or to show that none exists
- Solution time is 10 seconds on a single core (155 components)
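
The branch-and-bound loop can be illustrated on a toy problem. The sketch below is not MINOTAUR and not the CESM formulation: it assumes a single integer variable n and the convex objective a/n + b*n + d (the performance model with c fixed to 1), whose continuous relaxation over an interval has a closed-form minimizer. The coefficients and function names are hypothetical.

```python
import math

def relax(a, b, d, lo, hi):
    """Minimize a/x + b*x + d over the real interval [lo, hi] (convex for a, b > 0)."""
    x = min(max(math.sqrt(a / b), lo), hi)
    return x, a / x + b * x + d

def branch_and_bound(a, b, d, lo, hi):
    """Minimize a/n + b*n + d over integers n in [lo, hi], with lo >= 1."""
    best_n, best_val = None, math.inf  # incumbent solution and upper bound
    stack = [(lo, hi)]
    explored = 0
    while stack:
        lo_k, hi_k = stack.pop()
        if lo_k > hi_k:
            continue  # node infeasible
        explored += 1
        x, bound = relax(a, b, d, lo_k, hi_k)
        if bound >= best_val:
            continue  # pruned: relaxation cannot beat the incumbent
        if abs(x - round(x)) < 1e-9:
            best_n, best_val = int(round(x)), bound  # integer feasible: new upper bound
            continue
        # Branch on the fractional value: n <= floor(x) or n >= ceil(x).
        stack.append((lo_k, math.floor(x)))
        stack.append((math.ceil(x), hi_k))
    return best_n, best_val, explored

# Hypothetical coefficients; the continuous minimizer sqrt(a/b) is fractional.
n_opt, t_opt, explored = branch_and_bound(1000.0, 0.3, 5.0, 1, 128)
print(n_opt, round(t_opt, 4), explored)
```

Only a handful of nodes are explored here because each relaxation bound prunes a whole interval of candidate allocations, which is the sense in which the search is exhaustive without being a complete enumeration.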

MINLP Tree Synthesis
- MINLP B&B tree: 10000+ nodes after 360 s

Results
- CESM fully coupled active components, 1 degree resolution: f09_g16.b
- Calculations were run on Intrepid (40 racks Blue Gene/P)

1 degree resolution, 128 nodes

             Manual                HSLB
Component    # nodes   Time, sec   Predicted # nodes   Predicted time, sec   Actual time, sec
lnd          24        63.766      15                  100.951               100.202
ice          80        109.054     89                  102.972               116.472
atm          104       306.952     104                 307.651               308.699
ocn          24        362.669     24                  365.649               365.853
Total time, sec        416.006                         410.623               425.171
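
The totals in the table can be cross-checked against the component times. The short script below assumes the total is max(max(T_ice, T_lnd) + T_atm, T_ocn), that is, ice and lnd run concurrently, atm follows, and ocn runs alongside; the function name `layout_total` is introduced here for illustration.

```python
# Timings in seconds, copied from the results table.
timings = {
    "manual": {"lnd": 63.766, "ice": 109.054, "atm": 306.952, "ocn": 362.669, "total": 416.006},
    "hslb_predicted": {"lnd": 100.951, "ice": 102.972, "atm": 307.651, "ocn": 365.649, "total": 410.623},
    "hslb_actual": {"lnd": 100.202, "ice": 116.472, "atm": 308.699, "ocn": 365.853, "total": 425.171},
}

def layout_total(t):
    """Total time assuming (ice || lnd) followed by atm, concurrent with ocn."""
    return max(max(t["ice"], t["lnd"]) + t["atm"], t["ocn"])

# Difference between the recomputed and the reported total for each column group.
mismatches = {name: layout_total(t) - t["total"] for name, t in timings.items()}
for name, diff in mismatches.items():
    print(f"{name}: recomputed - reported = {diff:+.4f} s")
```

Under this assumption all three reported totals are reproduced from the component rows.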

Results
- CESM fully coupled active components, 1/8 degree resolution: ne240_f02_t12.b

Prediction of Optimal Layout

Prediction of Efficiency

$$E(n) = \frac{T(64)}{T(n) \cdot n / 64}$$
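
As a small worked example, the efficiency formula can be evaluated directly. The function name and the timings below are made up for illustration; only the formula itself comes from the slide.

```python
def efficiency(t_base, t_n, n, n_base=64):
    """Parallel efficiency relative to a baseline run: E(n) = T(n_base) / (T(n) * n / n_base)."""
    return t_base / (t_n * (n / n_base))

# Hypothetical timings: doubling the allocation from 64 to 128 cuts the time from 100 s to 60 s.
e_128 = efficiency(100.0, 60.0, 128)
print(f"E(128) = {e_128:.3f}")
```

Perfect scaling (half the time on twice the allocation) gives E = 1; values below 1 measure how much of the added resource is lost to non-scalable work.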

Future work
- Convert the AMPL code to C++ to make it more portable
- Create scripts that will automate the load-balancing process
  - The first script will gather timing data for the scaling curve by creating and running 4-5 test layouts
  - The second script will analyze the timing files and produce a load-balanced layout based on how many cores the user would like to run on

Acknowledgments
- Thank you to Dr. Ray Loy and ALCF team members (Argonne National Laboratory), Jim Edwards, and Mariana Vertenstein for encouraging this work and for helpful discussions.
- Funding was provided by the U.S. Department of Energy