Getting the Performance Out of High Performance Computing

Jack Dongarra
Innovative Computing Lab, University of Tennessee
Computer Science and Math Division, Oak Ridge National Lab
http://www.cs.utk.edu/~dongarra/
Technology Trends: Microprocessor Capacity

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months: 2X transistors per chip every 1.5 years, now called Moore's Law.
Microprocessors have become smaller, denser, and more powerful.
And not just processors: storage, Internet bandwidth, etc.

[Chart: peak performance of the fastest machines, 1950-2010, on a log scale from 1 KFlop/s to 1 PFlop/s, tracking Moore's Law. Scalar era: EDSAC 1, UNIVAC 1, IBM 7090, IBM 360/195, CDC 6600, CDC 7600. Superscalar/vector era: Cray 1, Cray 2, Cray X-MP, TMC CM-2. Superscalar/vector/parallel era: TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator.]
Next Generation: IBM Blue Gene/L and ASCI Purple

Announced 11/19/02; one of two machines for LLNL.
Blue Gene/L: 360 TFlop/s, 130,000 processors, Linux, FY 2005.
Plus ASCI Purple: IBM Power 5 based, 12K processors, 100 TFlop/s.

To Be Provocative: Citation in the Press, March 10th, 2008

"DOE Supercomputers Sit Idle. WASHINGTON, Mar. 10, 2008. GAO reports that after almost 5 years of effort and several hundreds of millions of dollars spent at the DOE labs, the high performance computers recently purchased did not meet users' expectations and are sitting idle. Alan Laub, head of the DOE efforts, reports that the computer equ..."

How could this happen?
The complexity of programming these machines was underestimated.
Users were unprepared for the lack of reliability of the hardware and software.
Little effort was spent on medium- and long-term research to solve problems that were foreseen 5 years earlier in the areas of applications, algorithms, middleware, programming models, and computer architectures.
Software Technology & Performance

There is a tendency to focus on the hardware; software is required to bridge an ever widening gap.
The gap between potential and delivered performance is very steep: performance comes only if the data and controls are set up just right.
Otherwise, dramatic performance degradations result, a very unstable situation.
It will become more unstable as systems change and become more complex.
The challenge for applications, libraries, and tools is formidable at the Tflop/s level, even greater at Pflop/s; some might say insurmountable.

Linpack (100x100) Analysis: The Machine on My Desk 12 Years Ago and Today

Compaq 386/S20 with FPA: 0.16 Mflop/s.
Pentium IV 2.8 GHz: 1317 Mflop/s.
Over 12 years we see a factor of ~8231, doubling in less than 12 months for 12 years.
Moore's Law gives us a factor of 256. How do we get a factor > 8000?
Clock speed increase = 128x
External bus width and caching, 16 vs. 64 bits = 4x
Floating point, 4/8-bit multiply vs. 64 bits in 1 clock = 8x
Compiler technology = 2x
However, the potential of that Pentium 4 is 5.6 Gflop/s, and here we are getting 1.32 Gflop/s, still a factor of 4.25 off of peak.
The reason is a complex set of interactions between application, algorithms, programming language, compiler, machine instructions, and hardware: many layers of translation from the application to the hardware, changing with each generation.
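A quick check of the arithmetic quoted above, written out as a worked calculation using only the figures on this slide:

    2^{12 / 1.5} = 2^{8} = 256                 % Moore's Law alone over 12 years
    128 \times 4 \times 8 \times 2 = 8192      % clock x bus/caching x floating point x compiler
    1317 / 0.16 \approx 8231                   % observed Linpack ratio, consistent with 8192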
Where Does Much of Our Lost Performance Go? Or: Why Should I Care About the Memory Hierarchy?

[Chart: Processor-DRAM memory gap (latency), 1980-2004, performance on a log scale from 1 to 1,000,000. CPU performance improves ~60%/year (2x per 1.5 years, Moore's Law); DRAM performance improves only ~9%/year (2x per 10 years). The processor-memory performance gap grows about 50% per year.]

Optimizing Computation and Memory Use

Computational optimizations. Theoretical peak = (# FPUs) x (flops/cycle) x (clock rate).
Pentium 4: (1 FPU) x (2 flops/cycle) x (2.8 GHz) = 5600 Mflop/s.
Operations like y = a*x + y need 3 operands (24 bytes) for every 2 flops; sustaining 5600 Mflop/s requires 8400 MWord/s of bandwidth from memory.

Memory optimization. Theoretical peak = (bus width) x (bus speed).
Pentium 4: (32 bits) x (533 MHz) = 2132 MB/s = 266 MWord/s.
That is off by a factor of about 30 from what is required to drive the processor from memory at peak performance.
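To make that argument concrete, here is the axpy kernel the slide refers to, written out as a minimal C sketch; the byte and flop counts in the comments are the same ones used above.

    #include <stddef.h>

    /* y = a*x + y: the memory-bound kernel behind the slide's bandwidth argument.
     * Per element: 2 flops (one multiply, one add) but 3 double-word memory
     * operands (load x[i], load y[i], store y[i]) = 24 bytes.
     * Running this at 5600 Mflop/s would need 2800M iterations/s x 3 words
     * = 8400 MWord/s of memory traffic, far beyond the 266 MWord/s bus peak,
     * so the loop is limited by memory bandwidth, not by the FPU. */
    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }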
Memory Hierarchy

By taking advantage of the principle of locality, a system can:
present the user with as much memory as is available in the cheapest technology, and
provide access at the speed offered by the fastest technology.

[Diagram: processor (control, datapath, registers) -> on-chip cache -> level 2 and 3 cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) and distributed memory -> tertiary storage (disk/tape) and remote cluster memory. Speeds range from ~1 ns (registers) through 10s-100s of ns (caches and DRAM), ~0.1 ms (distributed memory), 10s of ms (disk, remote cluster memory), up to 10s of seconds (tertiary storage); sizes range from 100s of bytes (registers) through KBs and MBs (caches), GBs, up to TBs (tertiary storage).]

Tools to Help Understand What's Going On in the Processor

The processor is a complex system with many filters.
We need to identify bottlenecks, prioritize optimizations, and focus on the important aspects.
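Not in the original slides, but a minimal C sketch of what exploiting locality means in practice: the same sum computed with a cache-friendly (unit-stride) traversal and a cache-hostile (large-stride) traversal. The array size N is an assumed illustration parameter.

    #include <stddef.h>

    #define N 4096  /* assumed size, large enough that the matrix does not fit in cache */

    /* Summing a row-major matrix two ways. Touching elements in the order they
     * are laid out in memory (row by row) streams through cache lines and
     * exploits spatial locality; traversing column by column strides through
     * memory and misses in cache far more often, so the same arithmetic runs
     * much slower on a hierarchical memory system. */
    double sum_row_major(const double a[N][N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];        /* unit stride: cache friendly */
        return s;
    }

    double sum_col_major(const double a[N][N])
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];        /* stride of N doubles: cache hostile */
        return s;
    }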
Tools for Performance Analysis, Modeling, and Optimization

PAPI
DynInst
SvPablo
ROSE compiler framework: recognition of high-level abstractions, specification of transformations.

Example: PERC on a Climate Model

Interaction with the SciDAC climate development effort.
Profiling: identifying performance bottlenecks and prioritizing enhancements.
Evaluation of the code over time: over a 9-month period, this produced a 400% improvement via decreased overhead and increased scalability.
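As an illustration of the kind of hardware-counter data these tools collect, here is a minimal PAPI sketch. It assumes the PAPI_FP_OPS and PAPI_TOT_CYC preset events are available on the platform (they are not everywhere), and it omits the return-code checking a real code would do.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    /* Count floating-point operations and total cycles for a small kernel
     * using PAPI's low-level API. Illustrative sketch only. */
    int main(void)
    {
        int eventset = PAPI_NULL;
        long long counts[2];
        double x[1000], sum = 0.0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return EXIT_FAILURE;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_FP_OPS);   /* floating-point operations */
        PAPI_add_event(eventset, PAPI_TOT_CYC);  /* total cycles */

        PAPI_start(eventset);
        for (int i = 0; i < 1000; i++) {         /* the kernel being measured */
            x[i] = 0.5 * i;
            sum += x[i] * x[i];
        }
        PAPI_stop(eventset, counts);

        printf("flops = %lld, cycles = %lld, sum = %g\n",
               counts[0], counts[1], sum);
        return 0;
    }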
Signatures: Key Factors in Applications and Systems That Affect Performance

Application signatures: a characterization of the operations the application needs to perform, and a description of the application's demands on resources.
Algorithm signatures: operation counts, memory reference patterns, data dependencies, I/O characteristics.
Software signatures: synchronization points, thread-level parallelism, instruction-level parallelism, ratio of memory references to floating-point operations.
Hardware signatures: the performance capabilities of the machine; latencies and bandwidths of the memory hierarchy, both local to a node and to remote nodes; instruction issue rates; cache size; TLB size.
The goal is to predict application behavior and performance: execution signatures combine application and machine signatures to provide accurate performance models.

[Diagram: a parallel or distributed application produces live performance data, which a performance monitor turns into an observation signature; this is compared against the model signature derived from the application signature, and the degree of similarity is fed back to the application.]

Algorithms vs. Applications

A matrix relating applications to the algorithms they rely on.
Applications: Lattice Gauge (QCD), Quantum Chemistry, Weather Simulation, Computational Fluid Dynamics, Adjustment of Geodetic Networks, Inverse Problems, Structural Mechanics, Electronic Device Simulation, Circuit Simulation.
Algorithms: Sparse Linear System Solvers, Linear Least Squares, Nonlinear Algebraic System Solvers, Sparse Eigenvalue Problems, FFT, Rapid Elliptic Problem Solvers, Multigrid Schemes, Stiff ODE Solvers, Monte Carlo Schemes, Integral Transformations.

From: "Supercomputing Tradeoffs and the Cedar System," E. Davidson, D. Kuck, D. Lawrie, and A. Sameh, in High-Speed Computing: Scientific Applications and Algorithm Design, ed. R. Wilhelmson, U. of Illinois Press, 1986.
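One simple way to turn combined application and machine signatures into an execution-time estimate is an additive model like the one below. This is an illustrative sketch only, not the specific model used in the signature work described above; all symbols are defined here for the example.

    T_{\text{exec}} \approx \frac{F}{R_{\text{flop}}} + \frac{B}{R_{\text{mem}}}
                           + N_{\text{msg}}\,\alpha + \frac{V_{\text{msg}}}{\beta}

Here F is the application's flop count and B its memory traffic (application/algorithm signature); R_flop and R_mem are the machine's sustained flop rate and memory bandwidth, and alpha and beta its network latency and bandwidth (hardware signature); N_msg and V_msg are the message count and volume (software signature).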
Update to Sameh's Table? The Application Performance Matrix

http://www.krellinst.org/matrix/
The next step is to look at application signatures, algorithm choices, software profiles, and the architecture (machine), and to data mine these to extract information.
We need signatures for A3S: applications, algorithms, architectures, and software.

Performance Tuning

Motivation: the performance of many applications is dominated by a few kernels.
The conventional approach is hand-tuning by the user or vendor: very time consuming and tedious work.
Even with intimate knowledge of the architecture and compiler, performance is hard to predict.
There is a growing list of kernels to tune, and the work must be redone for every architecture and compiler; compiler technology often lags the architecture.
It is not just a compiler problem: the best algorithm may depend on the input, so some tuning must happen at run time, and not all candidate algorithms are semantically or mathematically equivalent.
Automatic Performance Tuning to Hide Complexity

Approach, for each kernel:
1. Identify and generate a space of algorithms.
2. Search for the fastest one by running them.
What is a space of algorithms? Depending on the kernel and input, it may vary the instruction mix and order, memory access patterns, data structures, and mathematical formulation.
When do we search? Once per kernel and architecture, at compile time, at run time, or all of the above. (A minimal sketch of this generate-and-search idea appears after the project list below.)

Some Automatic Tuning Projects

ATLAS (www.netlib.org/atlas) (Dongarra, Whaley), used in Matlab and many SciDAC and ASCI projects.
PHiPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes, Asanovic, Vuduc, Demmel).
Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im).
Self-Adapting Linear Algebra Software (SALAS) (Dongarra, Eijkhout, Gropp, Keyes).
FFTs and signal processing: FFTW (www.fftw.org), winner of the 1999 Wilkinson Prize for Numerical Software; SPIRAL (www.ece.cmu.edu/~spiral), with extensions to other transforms and DSPs; UHFFT, with extensions to higher dimensions and parallelism.

[Bar chart: BLAS performance (MFlop/s, scale 0 to ~3500) of vendor BLAS, ATLAS BLAS, and the reference F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Intel P-III 933 MHz, Intel P-4 2.53 GHz w/SSE2, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]
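The sketch referred to above: an ATLAS-style empirical search in miniature, timing one parameterized kernel (a matrix multiply blocked by bs) over a handful of candidate block sizes and keeping the fastest. This illustrates the generate-and-search idea under assumed sizes and candidates; it is not the actual ATLAS code generator, which searches a far larger space.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define N 512  /* assumed matrix size for the probe runs */

    static double A[N * N], B[N * N], C[N * N];

    /* One point in the "space of algorithms": C += A*B blocked by bs. */
    static void matmul_bs(int bs)
    {
        for (int ii = 0; ii < N; ii += bs)
            for (int kk = 0; kk < N; kk += bs)
                for (int i = ii; i < ii + bs && i < N; i++)
                    for (int k = kk; k < kk + bs && k < N; k++)
                        for (int j = 0; j < N; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
    }

    /* Empirical search: run each candidate once and keep the fastest, the way
     * an autotuner probes the machine once per architecture at install time. */
    int main(void)
    {
        int candidates[] = {8, 16, 32, 64, 128};
        int best_bs = candidates[0];
        double best_time = 1e30;

        for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
            memset(C, 0, sizeof C);
            clock_t t0 = clock();
            matmul_bs(candidates[c]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("block size %3d: %.3f s\n", candidates[c], t);
            if (t < best_time) { best_time = t; best_bs = candidates[c]; }
        }
        printf("selected block size: %d\n", best_bs);
        return 0;
    }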
Futures for High Performance Scientific Computing

Numerical software will be adaptive, exploratory, and intelligent.
Determinism in numerical computing will be gone; after all, it is not reasonable to ask for exactness in numerical computations. Reproducibility will come at a cost.
The importance of floating point arithmetic will be undiminished: 16, 32, 64, 128 bits and beyond.
Reproducibility, fault tolerance, and auditability.
Adaptivity is key, so that applications can effectively use the resources.

Citation in the Press, March 10th, 2008

"DOE Supercomputers Live Up to Expectations. WASHINGTON, Mar. 10, 2008. GAO reported today that after almost 5 years of effort and several hundreds of millions of dollars spent at DOE labs, the high performance computers recently purchased have exceeded users' expectations and are helping to solve some of our most challenging problems. Alan Laub, head of DOE's HPC efforts, reported today at the annual meeting of the SciDAC PI..."

How can this happen?
Close interactions with the applications and the CS and Math ISIC groups.
Dramatic improvements in the adaptability of software to the execution environment.
Improved processor-memory bandwidth.
New large-scale system architectures and software.
Aggressive fault management and reliability.
Exploration of some alternative architectures and languages.
Application teams helping to drive the design of new architectures.
With Apologies to Gary Larson

SciDAC is helping: teams are developing the scientific computing software and hardware infrastructure needed to use terascale computers and beyond.