Towards fully automated interpretable performance models



Transcription:

spcl.inf.ethz.ch @spcl_eth
TORSTEN HOEFLER
Towards fully automated interpretable performance models
in collaboration with Alexandru Calotoiu and Felix Wolf @ RWTH Aachen
with students Arnamoy Bhattacharyya and Grzegorz Kwasniewski @ SPCL
presented at University of Tennessee Knoxville, July 2015
All images belong to their creator!

Analytical application performance modeling
Scalability bug prediction
  Find latent scalability bugs early on (before machine deployment)
  SC13: A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
Automated performance testing
  Performance modeling as part of a software engineering discipline in HPC
  ICS'15: S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations?
Hardware/Software co-design
  Decide how to architect systems
Making performance development intuitive

Manual analytical performance modeling
Identify kernels
  Parts of the program that dominate its performance at larger scales
  Identified via small-scale tests and intuition
Create models
  Laborious process
  Still confined to a small community of skilled experts
Disadvantages
  Time consuming
  Error-prone, may overlook unscalable code
T. Hoefler, W. Gropp, M. Snir, and W. Kramer: Performance Modeling for Systematic Performance Tuning, SC'11

Our first step: scalability bug detector
Input
  Program: main() { foo() bar() compute() }
  Instrumentation: all functions
  Weak scaling
  Performance measurements (profiles): p_1 = 128, p_2 = 256, p_3 = 512, p_4 = 1,024, p_5 = 2,048, p_6 = 4,096
Automated modeling
Output
  Ranking by: 1. asymptotic behavior, 2. target scale p_t
  Example ranking: 1. foo, 2. compute, 3. main, 4. bar
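The two rankings named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each kernel is given a hypothetical single-term model t(p) = c * p^i * log2(p)^j, the kernel names are taken from the slide, and all coefficients and the target scale are invented for the example.

```python
import math

# Hypothetical models (c, i, j) meaning t(p) = c * p**i * log2(p)**j.
# Kernel names follow the slide; every number here is made up.
models = {
    "main":    (1.0e-1, 0.0, 1),
    "foo":     (2.0e-5, 1.0, 1),
    "bar":     (3.0e+0, 0.0, 0),
    "compute": (5.0e-3, 1.0, 0),
}

def predict(model, p):
    """Evaluate a single-term model at process count p."""
    c, i, j = model
    return c * p**i * math.log2(p)**j

def rank_asymptotic(models):
    """Fastest-growing kernel first: compare the exponent of p, then of log2(p)."""
    return sorted(models, key=lambda k: (models[k][1], models[k][2]), reverse=True)

def rank_at_scale(models, p_target):
    """Most expensive kernel first at the chosen target scale."""
    return sorted(models, key=lambda k: predict(models[k], p_target), reverse=True)

print(rank_asymptotic(models))
print(rank_at_scale(models, 2**20))
```

The two orderings can differ: a kernel with a large constant may dominate at the target scale even though another kernel grows faster asymptotically, which is presumably why the slide lists both criteria.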

Primary focus on scaling trend
[Figure: our ranking of the functions F1, F2, F3 next to a common performance analysis chart in a paper]

Primary focus on scaling trend
[Figure: our ranking of the functions F1, F2, F3 next to an actual measurement in laboratory conditions]

Primary focus on scaling trend
[Figure: our ranking of the functions F1, F2, F3 next to production reality]

How to mechanize the expert? Survey!
[Table: scaling behavior t(p) of LU, FFT, naive N-body, and samplesort, listed once for computation and once for communication. Only the FFT entries remain legible: t(p) ~ log(p) in both columns.]

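Survey result: performance model normal form
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
$n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, $I, J \subset \mathbb{Q}$
Example: $n = 1$, $I = \{0, 1, 2\}$, $J = \{0, 1\}$
Candidate models: $c_1$, $c_1 \cdot p$, $c_1 \cdot p^2$, $c_1 \cdot \log(p)$, $c_1 \cdot p \cdot \log(p)$, $c_1 \cdot p^2 \cdot \log(p)$
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13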

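Survey result: performance model normal form
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
$n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, $I, J \subset \mathbb{Q}$
Example: $n = 2$, $I = \{0, 1, 2\}$, $J = \{0, 1\}$
Candidate models:
  $c_1 + c_2 \cdot p$
  $c_1 + c_2 \cdot p^2$
  $c_1 + c_2 \cdot \log(p)$
  $c_1 + c_2 \cdot p \cdot \log(p)$
  $c_1 + c_2 \cdot p^2 \cdot \log(p)$
  $c_1 \cdot \log(p) + c_2 \cdot p$
  $c_1 \cdot \log(p) + c_2 \cdot p \cdot \log(p)$
  $c_1 \cdot \log(p) + c_2 \cdot p^2$
  $c_1 \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
  $c_1 \cdot p + c_2 \cdot p \cdot \log(p)$
  $c_1 \cdot p + c_2 \cdot p^2$
  $c_1 \cdot p + c_2 \cdot p^2 \cdot \log(p)$
  $c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2$
  $c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
  $c_1 \cdot p^2 + c_2 \cdot p^2 \cdot \log(p)$
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13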
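To make the size of the search space concrete, the following Python sketch enumerates hypotheses of the normal form for given sets I and J and evaluates one of them. It is a minimal illustration, not the authors' tool: the function names `term_shapes`, `hypotheses`, and `evaluate` are invented here, and a hypothesis is assumed to be a combination of n distinct (i, j) exponent pairs.

```python
import math
from itertools import combinations

def term_shapes(I, J):
    """All exponent pairs (i, j), each standing for a term p**i * log2(p)**j."""
    return [(i, j) for i in I for j in J]

def hypotheses(I, J, n):
    """All hypotheses with exactly n distinct terms (coefficients left open)."""
    return list(combinations(term_shapes(I, J), n))

def evaluate(coeffs, shapes, p):
    """f(p) = sum_k c_k * p**i_k * log2(p)**j_k for one concrete hypothesis."""
    return sum(c * p**i * math.log2(p)**j for c, (i, j) in zip(coeffs, shapes))

I, J = [0, 1, 2], [0, 1]
print(len(hypotheses(I, J, 1)), "hypotheses with one term")
print(len(hypotheses(I, J, 2)), "hypotheses with two terms")
print(evaluate([2.0], [(1, 1)], 8))  # 2 * p * log2(p) at p = 8
```

Because I and J are small finite sets, the candidates can be enumerated exhaustively, which is what makes an automatic search over model shapes practical.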

Our automated generation workflow
[Workflow figure with the stages: Performance measurements, Statistical quality assurance, Performance profiles, Model generation (with model refinement), Scaling models, Accuracy saturated? (No: Kernel refinement; Yes: Performance extrapolation, Ranking of kernels)]
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13

Model refinement
Input data: $\{(p_1, t_1), \ldots, (p_6, t_6)\}$
Start: $n = 1$; $\bar{R}^2_0 = -\infty$
Hypothesis generation; hypothesis size $n$
Hypothesis evaluation via cross-validation
Computation of $\bar{R}^2_n$ for best hypothesis
$\bar{R}^2_{n-1} > \bar{R}^2_n \lor n = n_{max}$? No: n++; Yes: scaling model
$R^2 = 1 - \frac{\mathit{residualSumSquares}}{\mathit{totalSumSquares}}$
$\bar{R}^2_n = 1 - (1 - R^2) \cdot \frac{6 - 1}{6 - n - 1}$
$I = \{0, 1, 2\}$; $J = \{0, 1\}$; $n_{max} = 2$
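The refinement loop on this slide can be approximated by the following self-contained Python sketch. It is an interpretation under several assumptions rather than the authors' implementation: hypotheses are combinations of n term shapes (i, j), coefficients come from ordinary least squares via the normal equations, cross-validation is leave-one-out, and the adjusted coefficient of determination uses the common (N - 1) / (N - n - 1) correction. All function names are hypothetical.

```python
import math
from itertools import combinations

def basis(shape, p):
    i, j = shape
    return p**i * math.log2(p)**j

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit(shapes, data):
    """Least-squares coefficients for t ~ sum_k c_k * basis(shape_k, p)."""
    rows = [[basis(s, p) for s in shapes] for p, _ in data]
    k = len(shapes)
    AtA = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Atb = [sum(r[a] * t for r, (_, t) in zip(rows, data)) for a in range(k)]
    return solve(AtA, Atb)

def predict(shapes, coeffs, p):
    return sum(c * basis(s, p) for c, s in zip(coeffs, shapes))

def loo_error(shapes, data):
    """Leave-one-out cross-validation: sum of squared held-out errors."""
    err = 0.0
    for idx, (p, t) in enumerate(data):
        train = data[:idx] + data[idx + 1:]
        err += (predict(shapes, fit(shapes, train), p) - t) ** 2
    return err

def adjusted_r2(shapes, data):
    coeffs = fit(shapes, data)
    mean = sum(t for _, t in data) / len(data)
    rss = sum((predict(shapes, coeffs, p) - t) ** 2 for p, t in data)
    tss = sum((t - mean) ** 2 for _, t in data)
    N, n = len(data), len(shapes)
    return 1.0 - (rss / tss) * (N - 1) / (N - n - 1)

def refine(data, I, J, n_max):
    """Grow the hypothesis size until adjusted R^2 drops or n_max is reached."""
    all_shapes = [(i, j) for i in I for j in J]
    best, best_r2 = None, -math.inf
    for n in range(1, n_max + 1):
        cand = min(combinations(all_shapes, n), key=lambda h: loo_error(h, data))
        r2 = adjusted_r2(cand, data)
        if best_r2 > r2:  # the previous, smaller model was better
            break
        best, best_r2 = cand, r2
    return best, best_r2

# Synthetic measurements following t(p) = 5 + 0.01 * p (invented data).
data = [(p, 5.0 + 0.01 * p) for p in (128, 256, 512, 1024, 2048, 4096)]
print(refine(data, [0, 1, 2], [0, 1], 2))
```

On the synthetic data above, the two-term hypothesis made of a constant and a linear term fits exactly, so the loop should settle on it. With only six measurements, the adjusted R² penalizes every extra term heavily, which keeps the selected models small.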


Evaluation overview
[Workflow figure with the stages: Performance measurements, Statistical quality assurance, Performance profiles, Model generation (with model refinement), Scaling models, Accuracy saturated? (No: Kernel refinement; Yes: Performance extrapolation, Ranking of kernels)]
Search space: $I = \{\frac{0}{2}, \frac{1}{2}, \frac{2}{2}, \frac{3}{2}, \frac{4}{2}, \frac{5}{2}, \frac{6}{2}\}$, $J = \{0, 1, 2\}$, $n = 5$
Applications: Sweep3D, MILC, HOMME, XNS

Sweep3D communication performance
Solves neutron transport problem
  3D domain mapped onto 2D process grid
  Parallelism achieved through pipelined wave-front process
LogGP model for communication developed by Hoisie et al.
  t_comm: Equation (6) in [1]
  We assume p = p_x * p_y
[1] A. Hoisie, O. M. Lubeck, and H. J. Wasserman. Performance analysis of wavefront algorithms on very-large scale distributed systems. In Workshop on Wide Area Networks and High Performance Computing, pages 171-187. Springer-Verlag, 1999.

Sweep3D communication performance
Kernel [2 of 40]      Runtime [%], p_t = 262k   Model [s], t = f(p)   Predictive error [%], p_t = 262k
sweep -> MPI_Recv     65.35                     4.03 * sqrt(p)        5.10
sweep                 20.87                     582.19                0.01
Measurements used for modeling: p_i <= 8k
#bytes = const., #msg = const.

MILC
MILC/su3_rmd from MILC suite of QCD codes with performance model manually created
Time per process should remain constant except for a rather small logarithmic term caused by global convergence checks
Kernel [3 of 479]                   Model [s], t = f(p)          Predictive error [%], p_t = 64k
compute_gen_staple_field            2.40 * 10^-2                 0.43
g_vecdoublesum -> MPI_Allreduce     6.30 * 10^-6 * log2^2(p)     0.01
mult_adj_su3_fieldlink_lathwec      3.80 * 10^-3                 0.04
Measurements used for modeling: p_i <= 16k

HOMME
Core of the Community Atmospheric Model (CAM)
Spectral element dynamical core on a cubed sphere grid
Kernel [3 of 194]              Model [s], t = f(p), p_i <= 15k                                  Predictive error [%], p_t = 130k
box_rearrange -> MPI_Reduce    0.026 + 2.53 * 10^-6 * p * sqrt(p) + 1.24 * 10^-12 * p^3         57.02
vlaplace_sphere_vk             49.53                                                            99.32
compute_and_apply_rhs          48.68                                                            1.65

HOMME (2)
Core of the Community Atmospheric Model (CAM)
Spectral element dynamical core on a cubed sphere grid
Kernel [3 of 194]              Model [s], t = f(p), p_i <= 43k                          Predictive error [%], p_t = 130k
box_rearrange -> MPI_Reduce    3.63 * 10^-6 * p * sqrt(p) + 7.21 * 10^-13 * p^3         30.34
vlaplace_sphere_vk             24.44 + 2.26 * 10^-7 * p                                 4.28
compute_and_apply_rhs          49.09                                                    0.83

HOMME (3)

Is this all?
No, it's just the beginning
We face several problems:
Multiparameter modeling
  Search space explosion
  Interesting instance of the curse of dimensionality
Modeling overheads
  Cross validation (leave-one-out) is slow
  Our current profiling requires a lot of storage (>TBs)
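A back-of-the-envelope count illustrates why the search space explodes once more than one parameter is modeled. The sketch below rests on an assumption that is not stated on the slide: each term of a multiparameter hypothesis is taken to be a product of one p^i * log^j factor per parameter, so the number of term shapes grows as (|I| * |J|)^m for m parameters. The function name `num_hypotheses` is hypothetical.

```python
from math import comb

def num_hypotheses(I, J, n, m):
    """Count n-term hypotheses over m parameters.

    Assumption (not from the slides): every term multiplies one
    (i, j) factor per parameter, giving (|I| * |J|)**m term shapes.
    """
    shapes = (len(I) * len(J)) ** m
    return comb(shapes, n)

# Seven exponents for p, three for the logarithm, five terms per model.
I = [k / 2 for k in range(7)]
J = [0, 1, 2]
for m in (1, 2, 3):
    print(m, "parameter(s):", num_hypotheses(I, J, 5, m), "hypotheses")
```

Since every hypothesis is additionally evaluated with leave-one-out cross-validation, each of these candidates costs as many fits as there are measurement points, which compounds the overhead mentioned on the slide.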

Overview of the static modeling system
[Figure: pipeline with the stages Parallel program, LLVM, Loop extraction, Affine loop synthesis, Closed form representation, Number of iterations, Program analysis, producing W and D. The formulas for the closed form representation are not recoverable.]

Case studies
NAS Parallel Benchmarks: EP


Case studies
CG (conjugate gradient) and IS (integer sort)
[Formulas for W, D, T, and E of both benchmarks are not recoverable]


Performance Analysis 2.0: Automatic Models
Is feasible
Offers insight
Requires low effort
Improves code coverage
Still a long way to go
A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. Supercomputing (SC13).
T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs. SPAA'14.
A. Bhattacharyya, T. Hoefler: PEMOGEN: Automatic Adaptive Performance Modeling during Program Runtime. PACT'14.
S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations? ICS'15.

Backup

Why affine loops?
Closed form representation of the loop
Counting Arbitrary Affine Loop Nests
[Equations for the closed form and the iteration count are not recoverable]

Why affine loops?
Closed form representation of the loop
Counting Arbitrary Affine Loop Nests
[Equations for the closed form and the iteration count are not recoverable]
Example
  [outer loop over j: header not recoverable]
    for (k = j; k < m; k = k + j)
      veryComplicatedOperation(j, k);
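The benefit of a closed form can be illustrated on the inner loop of the example, `for (k = j; k < m; k = k + j)`. The Python sketch below compares a closed-form trip count against brute-force execution. It is not the counting method from the slides: the outer loop, whose header is not legible in the transcript, is assumed here to double j starting from 1, and the function names are invented for the example.

```python
def inner_count(j, m):
    """Closed-form trip count of: for (k = j; k < m; k = k + j), with j >= 1."""
    return max(0, -(-(m - j) // j))  # ceil((m - j) / j), clamped at zero

def inner_count_brute(j, m):
    """Same count obtained by actually running the loop."""
    count, k = 0, j
    while k < m:
        count += 1
        k += j
    return count

def nest_count(n, m):
    """Total inner iterations for an ASSUMED outer loop j = 1, 2, 4, ... while j < n."""
    total, j = 0, 1
    while j < n:
        total += inner_count(j, m)
        j *= 2
    return total

# The closed form agrees with brute force on a small grid of inputs.
assert all(inner_count(j, m) == inner_count_brute(j, m)
           for j in range(1, 21) for m in range(0, 61))
print(nest_count(8, 16))
```

With the closed form, the cost of the inner loop is known without executing it, so the number of calls to `veryComplicatedOperation` can be expressed symbolically in the loop bounds.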

Loops
Multipath affine loops