All images belong to their creator!
spcl.inf.ethz.ch @spcl_eth
TORSTEN HOEFLER
Towards fully automated interpretable performance models
in collaboration with Alexandru Calotoiu and Felix Wolf @ RWTH Aachen
with students Arnamoy Bhattacharyya and Grzegorz Kwasniewski @ SPCL
presented at University of Tennessee Knoxville, July 2015
Analytical application performance modeling
Scalability bug prediction
  Find latent scalability bugs early on (before machine deployment)
  SC13: A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
Automated performance testing
  Performance modeling as part of a software engineering discipline in HPC
  ICS'15: S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations?
Hardware/Software co-design
  Decide how to architect systems
Making performance development intuitive
Manual analytical performance modeling
Identify kernels
  Parts of the program that dominate its performance at larger scales
  Identified via small-scale tests and intuition
Create models
  Laborious process
  Still confined to a small community of skilled experts
Disadvantages
  Time consuming
  Error-prone, may overlook unscalable code
T. Hoefler, W. Gropp, M. Snir, and W. Kramer: Performance Modeling for Systematic Performance Tuning, SC13
Our first step: scalability bug detector
Input
  Program: main() { foo() bar() compute() }
  Instrumentation: all functions
  Weak scaling
  Performance measurements (profiles): p_1 = 128, p_2 = 256, p_3 = 512, p_4 = 1,024, p_5 = 2,048, p_6 = 4,096
Automated modeling
Output
  Ranking: 1. Asymptotic, 2. Target scale p_t
  Ranked functions: 1. foo, 2. compute, 3. main, 4. bar [...]
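The two ranking criteria named on this slide can be illustrated with a small Python sketch. It assumes each function already has a fitted scaling model made of terms c * p^i * log2(p)^j. The coefficients below are invented purely for illustration and are not measurements from the talk; the helper names (`predict`, `rank_asymptotic`, `rank_at_scale`) are hypothetical as well.

```python
from math import log2

# Hypothetical fitted models: each is a list of (coefficient, i, j) terms
# standing for coefficient * p**i * log2(p)**j. Values are made up.
models = {
    "foo":     [(1e-6, 2, 0)],   # grows like p^2
    "compute": [(1e-3, 1, 0)],   # grows like p
    "main":    [(0.5, 0, 1)],    # grows like log2(p)
    "bar":     [(2.0, 0, 0)],    # constant
}

def predict(model, p):
    """Evaluate a model at process count p."""
    return sum(c * p**i * log2(p)**j for c, i, j in model)

def rank_asymptotic(models):
    """Order functions by their fastest-growing term (compare i first, then j)."""
    return sorted(models, key=lambda f: max((i, j) for _, i, j in models[f]),
                  reverse=True)

def rank_at_scale(models, p_t):
    """Order functions by predicted time at the target scale p_t."""
    return sorted(models, key=lambda f: predict(models[f], p_t), reverse=True)

print(rank_asymptotic(models))
print(rank_at_scale(models, 128))
print(rank_at_scale(models, 2**16))
```

With these made-up coefficients the order at a small scale differs from the asymptotic order, which is the situation a scaling-trend ranking is meant to expose.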
Primary focus on scaling trend
Common performance analysis chart in a paper
[Figure: chart of three functions F with our ranking, positions 1 to 3]
Primary focus on scaling trend
Actual measurement in laboratory conditions
[Figure: chart of three functions F with our ranking, positions 1 to 3]
Primary focus on scaling trend
Production reality
[Figure: chart of three functions F with our ranking, positions 1 to 3]
How to mechanize the expert? Survey!
[Table: scaling behavior t(p) of LU, FFT, naive N-body, and Samplesort, listed separately for computation and communication; the FFT and Samplesort entries contain log(p) terms]
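Survey result: performance model normal form
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$, with $n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, $I, J \subset \mathbb{Q}$
Example: n = 1, I = {0, 1, 2}, J = {0, 1} gives the candidates
  $c_1$, $c_1 \cdot p$, $c_1 \cdot p^2$, $c_1 \cdot \log(p)$, $c_1 \cdot p \cdot \log(p)$, $c_1 \cdot p^2 \cdot \log(p)$
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13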
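Survey result: performance model normal form
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$, with $n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, $I, J \subset \mathbb{Q}$
Example: n = 2, I = {0, 1, 2}, J = {0, 1} gives every sum of two distinct candidate terms, such as
  $c_1 + c_2 \cdot p$, $c_1 + c_2 \cdot p^2$, $c_1 + c_2 \cdot \log(p)$, $c_1 + c_2 \cdot p \cdot \log(p)$, $c_1 + c_2 \cdot p^2 \cdot \log(p)$,
  $c_1 \cdot \log(p) + c_2 \cdot p$, $c_1 \cdot \log(p) + c_2 \cdot p \cdot \log(p)$, $c_1 \cdot p + c_2 \cdot p^2$, ...
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13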
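To make the size of the hypothesis space concrete, the normal form can be enumerated directly. The Python sketch below assumes that a hypothesis of size n is a choice of n distinct (i, j) exponent pairs, with the coefficients c_k left to a later fitting step. The function names are illustrative and not taken from the authors' tool.

```python
from itertools import combinations, product
from math import log2

def candidate_terms(I, J):
    """All (i, j) exponent pairs; each stands for the term p**i * log2(p)**j."""
    return list(product(I, J))

def hypotheses(I, J, n):
    """All hypotheses with exactly n distinct terms (coefficients not yet fitted)."""
    return list(combinations(candidate_terms(I, J), n))

def evaluate(hypothesis, coeffs, p):
    """Evaluate f(p) = sum_k c_k * p**i_k * log2(p)**j_k for given coefficients."""
    return sum(c * p**i * log2(p)**j for c, (i, j) in zip(coeffs, hypothesis))

I, J = [0, 1, 2], [0, 1]
print(len(hypotheses(I, J, 1)))   # number of single-term hypotheses
print(len(hypotheses(I, J, 2)))   # number of two-term hypotheses
# Example: f(p) = 2*p + 3*log2(p) evaluated at p = 8
print(evaluate(((1, 0), (0, 1)), (2.0, 3.0), 8))
```

For I = {0, 1, 2} and J = {0, 1} there are six candidate terms, so six single-term hypotheses and fifteen two-term hypotheses.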
Our automated generation workflow
Stages: Performance measurements, Statistical quality assurance, Performance profiles, Model generation, Scaling models, Performance extrapolation, Ranking of kernels
Model generation: Model refinement produces the scaling models
Decision "Accuracy saturated?": No leads to Kernel refinement, Yes leads to Performance extrapolation
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13
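Model refinement
Input data: $\{(p_1, t_1), \ldots, (p_6, t_6)\}$
1. Initialize $n = 1$; $\bar{R}_0^2 = -\infty$
2. Hypothesis generation; hypothesis size n
3. Hypothesis evaluation via cross-validation
4. Computation of $\bar{R}_n^2$ for best hypothesis
5. Check $\bar{R}_{n-1}^2 > \bar{R}_n^2 \lor n = n_{max}$: Yes yields the scaling model; No sets n++ and returns to step 2
Example setting: I = {0, 1, 2}; J = {0, 1}; n_max = 2
$R^2 = 1 - \frac{\text{residualSumSquares}}{\text{totalSumSquares}}$
$\bar{R}^2 = 1 - (1 - R^2) \cdot \frac{n - 1}{n - m - 1}$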
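The refinement loop can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: hypotheses are fitted by ordinary least squares, the "best" hypothesis of each size is the one with the lowest leave-one-out squared error, and the adjusted R^2 uses the standard definition with the number of observations and the number of model terms. The measurements are synthetic, and all names (`design`, `loo_error`, `adjusted_r2`, `refine`) are hypothetical.

```python
from itertools import combinations, product
import numpy as np

def design(hyp, p):
    """Design matrix with one column p**i * log2(p)**j per (i, j) term."""
    return np.column_stack([p**i * np.log2(p)**j for i, j in hyp])

def loo_error(hyp, p, t):
    """Leave-one-out cross-validation: summed squared prediction error."""
    err = 0.0
    for k in range(len(p)):
        keep = np.arange(len(p)) != k
        c, *_ = np.linalg.lstsq(design(hyp, p[keep]), t[keep], rcond=None)
        err += (t[k] - (design(hyp, p[k:k + 1]) @ c)[0]) ** 2
    return err

def adjusted_r2(hyp, p, t):
    """Fit on all points; return adjusted R^2 (standard definition) and coefficients."""
    X = design(hyp, p)
    c, *_ = np.linalg.lstsq(X, t, rcond=None)
    r2 = 1 - np.sum((t - X @ c) ** 2) / np.sum((t - t.mean()) ** 2)
    n_obs, m = len(t), len(hyp)
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - m - 1), c

def refine(p, t, I=(0, 1, 2), J=(0, 1), n_max=2):
    """Grow the hypothesis size until adjusted R^2 stops improving or n_max is hit."""
    terms = list(product(I, J))
    best, best_r2 = None, -np.inf
    for n in range(1, n_max + 1):
        hyp = min(combinations(terms, n), key=lambda h: loo_error(h, p, t))
        r2, coeffs = adjusted_r2(hyp, p, t)
        if r2 <= best_r2:          # the larger model is no better: keep the previous one
            break
        best, best_r2 = (hyp, coeffs), r2
    return best, best_r2

# Synthetic measurements following t(p) = 3 + 0.5 * log2(p)
p = np.array([128, 256, 512, 1024, 2048, 4096], dtype=float)
t = 3 + 0.5 * np.log2(p)
(hyp, coeffs), r2 = refine(p, t)
print(hyp, coeffs, r2)
```

On this noise-free input the loop selects the two-term hypothesis with a constant and a log2(p) term. Real profiles are noisy, which is why the slide's workflow also includes a statistical quality assurance step.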
Evaluation overview
Workflow: Performance measurements, Statistical quality assurance, Performance profiles, Model generation (with Model refinement), Scaling models, "Accuracy saturated?" (No: Kernel refinement; Yes: Performance extrapolation), Ranking of kernels
Search space: I = {0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2}, J = {0, 1, 2}, n = 5
Applications: Sweep3D, MILC, HOMME, XNS
Sweep3D communication performance
Solves neutron transport problem
  3D domain mapped onto 2D process grid
  Parallelism achieved through pipelined wave-front process
LogGP model for communication developed by Hoisie et al.
  t_comm given by Equation (6) in [1]
  We assume p = p_x * p_y
[1] A. Hoisie, O. M. Lubeck, and H. J. Wasserman. Performance analysis of wavefront algorithms on very-large scale distributed systems. In Workshop on Wide Area Networks and High Performance Computing, pages 171-187. Springer-Verlag, 1999.
Sweep3D communication performance
Kernel [2 of 40] | Runtime [%], p_t = 262k | Model [s], t = f(p) | Predictive error [%], p_t = 262k
sweep -> MPI_Recv | 65.35 | $4.03 \sqrt{p}$ | 5.10
sweep | 20.87 | 582.19 | 0.01
Measurements used for modeling: p_i <= 8k
#bytes = const., #msg = const.
MILC
MILC/su3_rmd from MILC suite of QCD codes with performance model manually created
Time per process should remain constant except for a rather small logarithmic term caused by global convergence checks
Kernel [3 of 479] | Model [s], t = f(p) | Predictive error [%], p_t = 64k
compute_gen_staple_field | $2.40 \cdot 10^{-2}$ | 0.43
g_vecdoublesum -> MPI_Allreduce | $6.30 \cdot 10^{-6} \cdot \log_2^2(p)$ | 0.01
mult_adj_su3_fieldlink_lathwec | $3.80 \cdot 10^{-3}$ | 0.04
Measurements used for modeling: p_i <= 16k
HOMME
Core of the Community Atmospheric Model (CAM)
Spectral element dynamical core on a cubed sphere grid
Kernel [3 of 194] | Model [s], t = f(p) | Predictive error [%], p_t = 130k
box_rearrange -> MPI_Reduce | $0.026 + 2.53 \cdot 10^{-6} \cdot p \sqrt{p} + 1.24 \cdot 10^{-12} \cdot p^3$ | 57.02
vlaplace_sphere_vk | 49.53 | 99.32
compute_and_apply_rhs | 48.68 | 1.65
Measurements used for modeling: p_i <= 15k
HOMME (2)
Core of the Community Atmospheric Model (CAM)
Spectral element dynamical core on a cubed sphere grid
Kernel [3 of 194] | Model [s], t = f(p) | Predictive error [%], p_t = 130k
box_rearrange -> MPI_Reduce | $3.63 \cdot 10^{-6} \cdot p \sqrt{p} + 7.21 \cdot 10^{-13} \cdot p^3$ | 30.34
vlaplace_sphere_vk | $24.44 + 2.26 \cdot 10^{-7} \cdot p^2$ | 4.28
compute_and_apply_rhs | 49.09 | 0.83
Measurements used for modeling: p_i <= 43k
HOMME (3)
Is this all?
No, it's just the beginning
We face several problems:
  Multiparameter modeling: search space explosion
    Interesting instance of the curse of dimensionality
  Modeling overheads
    Cross validation (leave-one-out) is slow, and our current profiling requires a lot of storage (>TBs)
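A rough count shows why adding parameters makes the search space explode. The Python sketch below uses the set sizes from the evaluation setup (seven exponents in I, three in J, hypotheses of five terms). The multi-parameter rule is an assumption made here for illustration only: each term is taken to be a product of one p^i * log^j factor per parameter. The talk does not specify the multi-parameter normal form, and the function names are hypothetical.

```python
from math import comb

def num_terms(size_I, size_J, params=1):
    """Candidate terms, ASSUMING one (i, j) factor per parameter in every term."""
    return (size_I * size_J) ** params

def num_hypotheses(size_I, size_J, n, params=1):
    """Number of ways to choose n distinct candidate terms."""
    return comb(num_terms(size_I, size_J, params), n)

for params in (1, 2, 3):
    print(params, num_terms(7, 3, params), num_hypotheses(7, 3, 5, params))
```

With one parameter there are 21 candidate terms and 20,349 five-term hypotheses. Under the stated assumption the count grows by several orders of magnitude with each additional parameter, and every hypothesis still has to be cross-validated.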
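Overview of the static modeling system
Pipeline: Parallel program, LLVM, Loop extraction, Affine loop synthesis, Closed form representation, Number of iterations, Program analysis
Closed form: $x(i_1, \ldots, i_r) = A_{final}(i_1, \ldots, i_r) \cdot x_0 + b_{final}(i_1, \ldots, i_r)$, with one iteration counter $i_k$ per loop level $k = 1 \ldots r$
Results: work W and depth D, both derived from the number of iterations N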
Case studies
NAS Parallel Benchmarks: EP
Case studies
CG (conjugate gradient)
  [Equations: W, D, T, and E as functions of k and m, including log terms]
IS (integer sort)
  [Equations: W, D, T, and E as functions of b, k, m, n, t, and u]
Performance Analysis 2.0: Automatic Models
  Is feasible
  Still a long way to go
  Offers insight
  Requires low effort
  Improves code coverage
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. Supercomputing (SC13).
T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs. SPAA 2014.
A. Bhattacharyya, T. Hoefler: PEMOGEN: Automatic Adaptive Performance Modeling during Program Runtime, PACT 2014.
S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations? ICS'15.
Backup
Why affine loops?
Closed form representation of the loop
Counting Arbitrary Affine Loop Nests
[Equation: closed-form iteration count, expressed with an arg min]
Why affine loops?
Closed form representation of the loop
Counting Arbitrary Affine Loop Nests
[Equation: closed-form iteration count, expressed with an arg min]
Example: for ( k=j; k < m; k = k + j ) verycomplicatedoperation(j,k);
[Figure: enclosing loop and equivalent while-loop form]
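The benefit of a closed form can be checked on the one loop that is legible in the example, `for (k = j; k < m; k = k + j)`. The Python sketch below compares a brute-force iteration count with the closed form ceil((m - j) / j), clamped at zero. It treats j as a given positive integer, since the enclosing loop header did not survive in the slide text; both function names are hypothetical.

```python
def count_by_execution(j, m):
    """Count iterations of: for (k = j; k < m; k = k + j) by running the loop."""
    count, k = 0, j
    while k < m:
        count += 1
        k += j
    return count

def count_closed_form(j, m):
    """Closed form for j >= 1: ceil((m - j) / j), or 0 if the loop never runs."""
    return max(0, -(-(m - j) // j))   # -(-a // b) is integer ceiling division

# The two counts agree on a grid of inputs, including loops that never execute.
for j in range(1, 21):
    for m in range(-5, 61):
        assert count_by_execution(j, m) == count_closed_form(j, m)
print(count_closed_form(2, 5), count_closed_form(3, 3))
```

The closed form needs constant time regardless of m, whereas executing the loop costs time proportional to the iteration count. That difference is what makes a static analysis of loop nests attractive.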
Loops
Multipath affine loops