Getting the Performance Out Of High Performance Computing




Getting the Performance Out Of High Performance Computing
Jack Dongarra
Innovative Computing Lab, University of Tennessee
and Computer Science and Math Division, Oak Ridge National Lab
http://www.cs.utk.edu/~dongarra/

Getting the Performance into High Performance Computing
Jack Dongarra
Innovative Computing Lab, University of Tennessee
and Computer Science and Math Division, Oak Ridge National Lab
http://www.cs.utk.edu/~dongarra/

Technology Trends: Microprocessor Capacity

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
- 2X transistors per chip every 1.5 years: called "Moore's Law"
- Microprocessors have become smaller, denser, and more powerful.
- Not just processors: storage, internet bandwidth, etc.

[Chart: peak performance of high-end systems, 1950-2010, tracking Moore's Law. Scalar era (1 KFlop/s to 1 MFlop/s): EDSAC 1, UNIVAC 1, IBM 7090, IBM 360/195, CDC 6600, CDC 7600. Vector era (1 MFlop/s to 1 GFlop/s): Cray 1, Cray X-MP, Cray 2, TMC CM-2. Super scalar/vector/parallel era (1 GFlop/s to 1 TFlop/s and beyond): TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator, approaching 1 PFlop/s.]

Next Generation: IBM Blue Gene/L and ASCI Purple

- Announced 11/19/02; one of two machines for LLNL
- Blue Gene/L: 360 TFlop/s, 130,000 processors, Linux, FY 2005
- Plus ASCI Purple: IBM Power 5 based, 12K processors, 100 TFlop/s

To Be Provocative: Citation in the Press, March 10th, 2008

"DOE Supercomputers Sit Idle." WASHINGTON, Mar. 10, 2008. GAO reports that after almost 5 years of effort and several hundred M$ spent at the DOE labs, the high performance computers recently purchased did not meet users' expectations and are sitting idle. Alan Laub, head of the DOE efforts, reports that the computer equ...

How could this happen?
- The complexity of programming these machines was underestimated.
- Users were unprepared for the lack of reliability of the hardware and software.
- Little effort was spent on medium- and long-term research to solve problems that were foreseen 5 years ago in the areas of applications, algorithms, middleware, programming models, and computer architectures.

Software Technology & Performance

- Tendency to focus on the hardware; software is required to bridge an ever-widening gap.
- The gap between potential and delivered performance is wide and growing.
- Performance is achieved only if the data and controls are set up just right; otherwise, dramatic performance degradations result, a very unstable situation.
- It will become more unstable as systems change and grow more complex.
- The challenge for applications, libraries, and tools is formidable at the Tflop/s level, even greater at Pflop/s; some might say insurmountable.

Linpack (100x100) Analysis: The Machine on My Desk 12 Years Ago and Today

- Compaq 386/20 with FPA: 0.16 Mflop/s
- Pentium IV 2.8 GHz: 1317 Mflop/s
- Over 12 years we see a factor of ~8231: doubling in less than 12 months, for 12 years.
- Moore's Law gives us a factor of 256. How do we get a factor > 8000?
  - Clock speed increase = 128x
  - External bus width & caching, 16 vs. 64 bits = 4x
  - Floating point, 4/8-bit multiply vs. 64 bits (1 clock) = 8x
  - Compiler technology = 2x
- However, the potential of that Pentium 4 is 5.6 Gflop/s, and here we are getting 1.32 Gflop/s, still a factor of 4.25 off of peak.
- There is a complex set of interactions between application, algorithms, programming language, compiler, machine instructions, and hardware: many layers of translation from the application to the hardware, changing with each generation.
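The factor-of-8000 decomposition above is just the product of the individual contributions; a minimal sketch in C (numbers copied from the slide) that multiplies them out and compares against the measured Linpack ratio:

```c
#include <stdio.h>

int main(void) {
    /* Individual contributions cited on the slide */
    double clock     = 128.0;  /* clock speed increase          */
    double bus_cache = 4.0;    /* external bus width & caching  */
    double fp_width  = 8.0;    /* floating-point datapath width */
    double compiler  = 2.0;    /* compiler technology           */

    double predicted = clock * bus_cache * fp_width * compiler;  /* 8192 */

    /* Measured Linpack (100x100) ratio from the slide */
    double measured = 1317.0 / 0.16;  /* Pentium IV vs. 386, ~8231 */

    printf("predicted speedup: %.0f\n", predicted);
    printf("measured speedup : %.0f\n", measured);
    printf("Moore's Law alone over 12 years: %.0f\n", 256.0); /* 2^(12/1.5) */
    return 0;
}
```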

Where Does Much of Our Lost Performance Go? or, Why Should I Care About the Memory Hierarchy?

[Chart: Processor-DRAM memory gap (latency), 1980-2004, log scale. CPU performance improves ~60%/yr (2x per 1.5 years, Moore's Law); DRAM improves ~9%/yr (2x per 10 years); the processor-memory performance gap grows ~50% per year.]

Optimizing Computation and Memory Use

Computational optimization:
- Theoretical peak: (# FPUs) * (flops/cycle) * (clock rate)
- Pentium 4: (1 FPU) * (2 flops/cycle) * (2.8 GHz) = 5600 MFLOP/s
- Operations like y = a*x + y need 3 operands (24 bytes) for 2 flops; 5600 Mflop/s requires 8400 MWord/s of bandwidth from memory.

Memory optimization:
- Theoretical peak: (bus width) * (bus speed)
- Pentium 4: (32 bits) * (533 MHz) = 2132 MB/s = 266 MWord/s
- Off by a factor of ~30 from what is required to drive the processor from memory at peak performance.
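A minimal sketch, using the Pentium 4 numbers from the slide, that computes the compute peak, the memory peak, and the resulting bandwidth shortfall for an axpy-style kernel:

```c
#include <stdio.h>

int main(void) {
    /* Compute peak: (# FPUs) * (flops/cycle) * clock rate */
    double fpus = 1.0, flops_per_cycle = 2.0, clock_mhz = 2800.0;
    double peak_mflops = fpus * flops_per_cycle * clock_mhz;   /* 5600 MFLOP/s */

    /* axpy: y = a*x + y touches 3 words (24 bytes) per 2 flops */
    double words_per_flop = 3.0 / 2.0;
    double needed_mwords = peak_mflops * words_per_flop;       /* 8400 MWord/s */

    /* Memory peak: (bus width) * (bus speed) */
    double bus_bytes = 4.0, bus_mhz = 533.0;
    double mem_mb = bus_bytes * bus_mhz;                       /* 2132 MB/s */
    double mem_mwords = mem_mb / 8.0;                          /* ~266 MWord/s, 8-byte words */

    printf("compute peak  : %.0f MFLOP/s\n", peak_mflops);
    printf("bandwidth need: %.0f MWord/s\n", needed_mwords);
    printf("bandwidth have: %.0f MWord/s\n", mem_mwords);
    printf("shortfall     : %.1fx\n", needed_mwords / mem_mwords); /* ~30x */
    return 0;
}
```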

Memory Hierarchy

By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

[Diagram: the processor (control, datapath, registers) sits atop a hierarchy of on-chip cache, level 2 and 3 cache (SRAM), main memory (DRAM), distributed memory, secondary storage (disk), remote cluster memory, and tertiary storage (disk/tape). Access times range from ~1 ns for registers through 10s-100s of ns for caches and DRAM, ~0.1 ms for distributed memory, 10s of ms for disk and remote cluster memory, up to 10s of seconds for tape; capacities range from 100s of bytes in registers through KB/MB of cache and GB of DRAM up to TB of tertiary storage.]

Tool To Help Understand What's Going On in the Processor

- A complex system with many filters
- Need to identify bottlenecks
- Prioritize optimization
- Focus on the important aspects
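Returning to the locality principle from the memory-hierarchy slide, here is a small illustrative sketch (not from the talk): summing a matrix along rows visits memory with unit stride and reuses cache lines, while summing along columns strides through memory and typically runs several times slower on a cache-based machine.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

/* Row-major traversal: unit stride, cache friendly */
static double sum_rowwise(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}

/* Column-major traversal: stride N, poor cache-line reuse for large N */
static double sum_colwise(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof(double));
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_rowwise(a);
    clock_t t1 = clock();
    double s2 = sum_colwise(a);
    clock_t t2 = clock();

    printf("row-wise: %.2f s  col-wise: %.2f s  (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a);
    return 0;
}
```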

Tools for Performance Analysis, Modeling, and Optimization

- PAPI
- Dyninst
- SvPablo
- ROSE: compiler framework for recognition of high-level abstractions and specification of transformations

Example: PERC on a Climate Model

- Interaction with the SciDAC climate development effort
- Profiling: identifying performance bottlenecks and prioritizing enhancements
- Evaluation of the code over time (a 9-month period)
- Produced a 400% improvement via decreased overhead and increased scalability
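A rough sketch of how a hardware-counter library such as PAPI is used to measure a kernel (illustrative, not code from the talk; it assumes the PAPI_FP_OPS preset event is supported on the platform):

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    int eventset = PAPI_NULL;
    long long counts[1];

    /* Initialize the library and create an event set with one preset event */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_create_eventset(&eventset) != PAPI_OK) return 1;
    if (PAPI_add_event(eventset, PAPI_FP_OPS) != PAPI_OK) return 1;

    /* Count floating-point operations around the kernel of interest */
    PAPI_start(eventset);
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++) s += 1.0 / i;  /* stand-in kernel */
    PAPI_stop(eventset, counts);

    printf("sum = %f, FP ops counted = %lld\n", s, counts[0]);
    return 0;
}
```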

Signatures: Key Factors in Applications and Systems that Affect Performance

Application signatures:
- Characterization of the operations the application needs to perform
- Description of the application's demands on resources

Algorithm signatures:
- Operation counts
- Memory reference patterns
- Data dependencies
- I/O characteristics

Software signatures:
- Synchronization points
- Thread-level parallelism
- Instruction-level parallelism
- Ratio of memory references to floating-point operations

Hardware signatures (performance capabilities of the machine):
- Latencies and bandwidths of the memory hierarchy, local to the node and to remote nodes
- Instruction issue rates
- Cache size
- TLB size

The goal is to predict application behavior and performance: an execution signature combines the application and machine signatures to provide accurate performance models.

[Diagram: a performance monitor collects live performance data from the parallel or distributed application, generates an observation signature, compares it against the model signature derived from the application signature, and feeds back a degree of similarity.]

Algorithms vs. Applications

Applications: Lattice Gauge (QCD), Quantum Chemistry, Weather Simulation, Computational Fluid Dynamics, Adjustment of Geodetic Networks, Inverse Problems, Structural Mechanics, Electronic Device Simulation, Circuit Simulation.

Algorithms: Sparse Linear System Solvers, Linear Least Squares, Nonlinear Algebraic System Solvers, Sparse Eigenvalue Problems, FFT, Rapid Elliptic Problem Solvers, Multigrid Schemes, Stiff ODE Solvers, Monte Carlo Schemes, Integral Transformations.

From: "Supercomputing Tradeoffs and the Cedar System," E. Davidson, D. Kuck, D. Lawrie, and A. Sameh, in High-Speed Computing: Scientific Applications and Algorithm Design, ed. R. Wilhelmson, U. of Illinois Press, 1986.
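As a toy illustration (not the actual PERC methodology) of how an execution signature might combine an application signature with a machine signature into a simple bound-based performance model; all struct names and numbers below are hypothetical:

```c
#include <stdio.h>

/* Hypothetical application signature: operation counts and memory/network demand */
typedef struct {
    double flops;      /* floating-point operations  */
    double mem_bytes;  /* bytes moved to/from memory */
    double msgs;       /* messages sent              */
    double msg_bytes;  /* bytes communicated         */
} app_signature;

/* Hypothetical machine signature: capabilities of the system */
typedef struct {
    double peak_flops;   /* flop/s        */
    double mem_bw;       /* bytes/s       */
    double net_latency;  /* s per message */
    double net_bw;       /* bytes/s       */
} machine_signature;

/* Lower bound on runtime: each resource considered in isolation */
static double predict_seconds(app_signature a, machine_signature m) {
    double t_comp = a.flops / m.peak_flops;
    double t_mem  = a.mem_bytes / m.mem_bw;
    double t_net  = a.msgs * m.net_latency + a.msg_bytes / m.net_bw;
    double t = t_comp;
    if (t_mem > t) t = t_mem;
    if (t_net > t) t = t_net;
    return t;
}

int main(void) {
    app_signature app = { 1e12, 4e11, 1e5, 1e9 };
    machine_signature mach = { 5.6e9, 2.1e9, 5e-6, 1e8 };
    printf("predicted lower bound: %.1f s\n", predict_seconds(app, mach));
    return 0;
}
```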

Update to Sameh's Table?

- Application Performance Matrix: http://www.krellinst.org/matrix/
- Next step: look at application signatures, algorithm choices, software profile, and architecture (machine)
- Data mine to extract information
- Need signatures for A³S (applications, algorithms, architectures, and software)

Performance Tuning

- Motivation: the performance of many applications is dominated by a few kernels.
- Conventional approach: hand-tuning by the user or vendor
  - Very time-consuming and tedious work
  - Even with intimate knowledge of the architecture and compiler, performance is hard to predict
  - Growing list of kernels to tune
  - Must be redone for every architecture and compiler
- Compiler technology often lags the architecture.
- Not just a compiler problem:
  - The best algorithm may depend on the input, so some tuning must happen at runtime.
  - Not all algorithms are semantically or mathematically equivalent.

Automatic Performance Tuning to Hide Complexity

Approach: for each kernel,
1. Identify and generate a space of algorithms
2. Search for the fastest one, by running them

What is a space of algorithms? Depending on the kernel and input, it may vary:
- Instruction mix and order
- Memory access patterns
- Data structures
- Mathematical formulation

When do we search?
- Once per kernel and architecture
- At compile time
- At run time
- All of the above

Some Automatic Tuning Projects

- ATLAS (www.netlib.org/atlas) (Dongarra, Whaley): used in Matlab and many SciDAC and ASCI projects
- PHiPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes, Asanovic, Vuduc, Demmel)
- Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im)
- Self Adapting Linear Algebra Software (SALAS) (Dongarra, Eijkhout, Gropp, Keyes)
- FFTs and signal processing:
  - FFTW (www.fftw.org): won the 1999 Wilkinson Prize for Numerical Software
  - SPIRAL (www.ece.cmu.edu/~spiral): extensions to other transforms, DSPs
  - UHFFT: extensions to higher dimensions, parallelism

[Chart: BLAS performance (MFLOP/s, up to ~3500) of vendor BLAS, ATLAS-generated BLAS, and reference F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Intel P-III 933 MHz, Intel P-4 2.53 GHz w/SSE2, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]
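A minimal sketch of the "generate and search" idea above: time a blocked matrix multiply over a small space of candidate block sizes and keep the fastest. This is purely illustrative; real autotuners such as ATLAS search far richer spaces (instruction schedules, register blocking, data copies).

```c
#include <stdio.h>
#include <time.h>

#define N 256

/* One candidate in the search space: matmul with a given block size */
static void matmul_blocked(const double *a, const double *b, double *c, int bs) {
    for (int i = 0; i < N * N; i++) c[i] = 0.0;
    for (int ii = 0; ii < N; ii += bs)
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++) {
                        double aik = a[i * N + k];
                        for (int j = jj; j < jj + bs; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}

int main(void) {
    static double a[N * N], b[N * N], c[N * N];
    for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; }

    int candidates[] = { 8, 16, 32, 64, 128 };  /* the "space of algorithms" */
    int best = candidates[0];
    double best_t = 1e30;

    /* Search: run every candidate and keep the fastest */
    for (int v = 0; v < 5; v++) {
        clock_t t0 = clock();
        matmul_blocked(a, b, c, candidates[v]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block %3d: %.3f s\n", candidates[v], t);
        if (t < best_t) { best_t = t; best = candidates[v]; }
    }
    printf("selected block size: %d\n", best);
    return 0;
}
```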

Futures for High Performance Scientific Computing

- Numerical software will be adaptive, exploratory, and intelligent.
- Determinism in numerical computing will be gone. After all, it's not reasonable to ask for exactness in numerical computations; reproducibility will come at a cost.
- The importance of floating-point arithmetic will be undiminished: 16, 32, 64, 128 bits and beyond.
- Reproducibility, fault tolerance, and auditability
- Adaptivity is key, so that applications can effectively use the resources.

Citation in the Press, March 10th, 2008

"DOE Supercomputers Live up to Expectations." WASHINGTON, Mar. 10, 2008. GAO reported today that after almost 5 years of effort and several hundred M$ spent at DOE labs, the high performance computers recently purchased have exceeded users' expectations and are helping to solve some of our most challenging problems. Alan Laub, head of DOE's HPC efforts, reported today at the annual meeting of the SciDAC PI...

How can this happen?
- Close interactions with the applications and the CS and Math ISIC groups
- Dramatic improvements in the adaptability of software to the execution environment
- Improved processor-memory bandwidth
- New large-scale system architectures and software
- Aggressive fault management and reliability
- Exploration of alternative architectures and languages
- Application teams helping to drive the design of new architectures

With Apologies to Gary Larson

SciDAC is helping: teams are developing the scientific computing software and hardware infrastructure needed to use terascale computers and beyond.