Leistungsanalyse von Rechnersystemen (Performance Analysis of Computer Systems)

Transcription

1 Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen (Performance Analysis of Computer Systems), 29 October 2008
Nöthnitzer Straße 46, Room 1026, Tel
Holger Brunst (holger.brunst@tu-dresden.de), Matthias S. Mueller (matthias.mueller@tu-dresden.de)
Summary of Previous Lecture (1)
Remarks: Doherty (1970): Performance is the degree to which a computing system meets the expectations of the persons involved in it.
Main objective: get the highest performance for a given cost.
System: an arbitrary collection of hardware, software, and firmware, e.g. CPU, database, network of computers.
Metric: a criterion used to evaluate the performance of a system, e.g. response time, throughput, FLOPS.
Workload: the overall sum of user requests to a system, e.g. CPU workload: the collection of instructions to execute.

2 Summary of Previous Lecture (2)
Discussion of performance analysis examples and questions: selection of technique, metric, and workload; correctness of performance measurements; measurement and simulation design.
The art of performance analysis: a successful evaluation cannot be produced mechanically; evaluation requires detailed knowledge of the system to be modeled.
Summary of Previous Lecture (3)
Knowledge of common mistakes and games is important for choosing the right methodology as an analyst, and for questioning offers, recommendations, and advertisements as a consumer, buying agent, or decision maker.
Classes of common mistakes: goals, methodology, completeness, analysis, presentation.
Checklist for avoiding problems.
Systematic approach to performance evaluation.

3 Summary of Previous Lecture: Questions
What does performance mean?
What are the main reasons to do a performance analysis?
What are the main tasks?
What is a system in performance analysis terminology?
What do the terms metric and workload stand for?
What is a performance parameter?
What is a performance factor?
Parallel Metrics

4 Excursion on Speedup and Efficiency Metrics
Comparison of sequential and parallel algorithms.
Speedup: $S_n = T_1 / T_n$, where $n$ is the number of processors, $T_1$ is the execution time of the sequential algorithm, and $T_n$ is the execution time of the parallel algorithm with $n$ processors.
Efficiency: $E_p = S_p / p$. Its value estimates how well utilized $p$ processors are when solving a given problem. Usually between zero and one. Exception: superlinear speedup (later).
Amdahl's Law
Find the maximum expected improvement to an overall system when only part of the system is improved.
Serial execution time $= s + p$; parallel execution time $= s + p/n$, so $S_n = (s + p) / (s + p/n)$.
Normalizing with respect to the serial time, $s + p = 1$, results in $S_n = 1 / (s + p/n)$.
The speedup drops off rapidly as the serial fraction increases. The maximum possible speedup is $1/s$, independent of $n$, the number of processors!
Bad news: if an application has only 1% serial work ($s = 0.01$), then you will never see a speedup greater than 100. So why do we build systems with more than 100 processors? What is wrong with this argument?
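
The Amdahl bound can be explored numerically. The following is a minimal Python sketch of the normalized formula $S_n = 1/(s + p/n)$ with $s + p = 1$; the function names `amdahl_speedup` and `efficiency` are illustrative and do not come from the lecture.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup S_n = 1 / (s + p/n), with the serial time normalized to s + p = 1."""
    if not 0.0 <= s <= 1.0 or n < 1:
        raise ValueError("need 0 <= s <= 1 and n >= 1")
    p = 1.0 - s  # parallelizable fraction
    return 1.0 / (s + p / n)


def efficiency(speedup: float, n: int) -> float:
    """Efficiency E_n = S_n / n."""
    return speedup / n


if __name__ == "__main__":
    s = 0.01  # 1% serial work, as in the slide's example
    for n in (1, 10, 100, 1000, 10000):
        sn = amdahl_speedup(s, n)
        print(f"n={n:6d}  S_n={sn:8.2f}  E_n={efficiency(sn, n):.3f}")
```

With $s = 0.01$, 100 processors give a speedup of only about 50, and the speedup approaches but never reaches $1/s = 100$ as $n$ grows.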

5 Scaled Speedup (Gustafson-Barsis Law)
Amdahl's speedup equation assumes that $p$ is independent of $n$; in other words, the problem size remains the same.
The Gustafson-Barsis law states that any sufficiently large problem can be efficiently parallelized.
It is more realistic to assume that the runtime remains the same, NOT the problem size.
If the problem size scales up, does the serial part also increase?
Parallel execution time $= s + p$; serial execution time $= s + np$, so $S_{sn} = (s + pn) / (s + p)$.
Normalizing with respect to the parallel execution time, $s + p = 1$, results in $S_{sn} = n + (1 - n)s = p(n - 1) + 1$.
Efficiency and Serial Fraction
Strong scalability vs. weak scalability.
$E_n = S_n / n$ does not tell the whole story: is it necessarily bad if efficiency drops as you increase $n$ for a given problem size?
$s$ is supposed to be a constant; this assumes that the work is load balanced and that there is no overhead for synchronizing the processors.
Experimentally measure the serial fraction: if $s$ does not remain constant, what can we discern?
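
Both ideas on these slides can be sketched in a few lines of Python. `scaled_speedup` evaluates the normalized Gustafson-Barsis formula, and `measured_serial_fraction` solves Amdahl's normalized formula $S_n = 1/(s + (1-s)/n)$ for $s$, given a measured speedup. The function names are illustrative, and the measurements in the demo are invented for the example, not taken from the lecture.

```python
def scaled_speedup(s: float, n: int) -> float:
    """Gustafson-Barsis scaled speedup S_sn = n + (1 - n) * s, with s + p = 1."""
    return n + (1 - n) * s


def measured_serial_fraction(speedup: float, n: int) -> float:
    """Solve S_n = 1 / (s + (1 - s)/n) for s, given a measured speedup on n processors."""
    if n < 2:
        raise ValueError("need at least two processors")
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    print("scaled speedup, s=0.01, n=100:", scaled_speedup(0.01, 100))
    # Hypothetical measurements: (processors, measured speedup).
    for n, s_n in [(2, 1.9), (4, 3.5), (8, 6.0), (16, 9.0)]:
        print(f"n={n:2d}  S_n={s_n:4.1f}  s={measured_serial_fraction(s_n, n):.3f}")
```

If the estimated $s$ grows with $n$, as it does for the invented numbers above, the assumptions of perfect load balance and zero synchronization overhead probably do not hold. Note that $s$ is normalized to the serial run in Amdahl's formula and to the parallel run in the scaled formula, so the two values are not directly comparable.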

6 Superlinear/Superunitary Speedup
Work in algorithm $= W_{real} + W_{ovhd}$. What is $W_{ovhd}$?
Super-unitary speedup is possible if the total work done by $n$ processors is strictly less than that done by a single processor.
Reasons for super-unitary speedup: memory and cache effects; dividing up resource management overheads; hiding latency for remote operations; randomized algorithms.
In the literature, superlinear speedup is sometimes also referred to as superunitary speedup, which might be mathematically more correct.
Workload types, selection and characterization

7 Types of Workloads
Test workload: any workload used in performance studies; real or synthetic.
Real workload: observed on a system being used for normal operation; cannot be repeated; may contain sensitive data.
Synthetic workload: should be representative of a real workload; often smaller in size.
Historical examples of test workloads: addition instruction, instruction mixes, kernels, synthetic programs, application benchmarks.

8 Popular benchmarks: Eratosthenes sieve algorithm
Algorithm to find prime numbers. Kernel. Simple. An algorithm is always independent of a computer language or specific implementation. Not very representative of today's use of computers.
Popular benchmarks: Ackermann's Function
Ackermann(m,n) := n+1 if m=0; Ackermann(m-1,1) if n=0; Ackermann(m-1, Ackermann(m,n-1)) otherwise.
Used to assess the efficiency of procedure calls.
Ackermann(3,n) requires (512*4**(n-1) - 15*2**(n+3) + 9*n + 37)/3 calls and a stack size of 2**(n+3) - 4.
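
The call-count formula can be checked empirically. The sketch below is a plain recursive Python implementation with a call counter. The helper names `count_calls` and `predicted_calls` are illustrative, and the term 512*4**(n-1) is rewritten as 128*4**n so that the arithmetic stays in integers for n = 0.

```python
def ackermann(m: int, n: int, counter: list) -> int:
    """Ackermann's function; counter[0] is incremented once per call."""
    counter[0] += 1
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1, counter)
    return ackermann(m - 1, ackermann(m, n - 1, counter), counter)


def count_calls(m: int, n: int) -> tuple:
    """Return (value, number of calls) for Ackermann(m, n)."""
    counter = [0]
    value = ackermann(m, n, counter)
    return value, counter[0]


def predicted_calls(n: int) -> int:
    """Calls for Ackermann(3, n): (512*4**(n-1) - 15*2**(n+3) + 9*n + 37) / 3."""
    return (128 * 4**n - 15 * 2**(n + 3) + 9 * n + 37) // 3


if __name__ == "__main__":
    # Small n only: the recursion depth grows like 2**(n+3).
    for n in range(5):
        value, calls = count_calls(3, n)
        print(f"Ackermann(3,{n}) = {value}, calls = {calls}, predicted = {predicted_calls(n)}")
```

Because nearly all of the work is call and return overhead, the measured time per call is what the benchmark is meant to expose.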

9 Popular benchmarks: Whetstone
Used at the British Central Computer Agency. 11 modules. Representative of 949 ALGOL programs. Available in ALGOL, FORTRAN, PL/I and other languages. See Curnow and Wichmann (1975). Results in KWIPS (Kilo Whetstone Instructions Per Second).
Workload characteristics: floating point intensive, cache friendly, no I/O.
Popular benchmarks: LINPACK
Developed by Jack Dongarra (1983) at ANL (now ICL, UTK). Solves a dense system of linear equations. Algorithmic definition of the benchmark. Reference implementation available (HPL). Makes heavy use of BLAS. One fixed dataset: 100x100. Used as the benchmark for the TOP500 list. Many vendors have their own hand-tuned implementation.

10 Popular benchmarks: Dhrystone
Developed in 1984 by Reinhold Weicker at Siemens. Represents systems programming environments. Available in C, Pascal and Ada. Results are in Dhrystone Instructions Per Second (DIPS). Includes ground rules for building and executing Dhrystone (run rules).
Popular benchmarks: Lawrence Livermore Loops
24 separate tests. Largely vectorizable. Assembled at LLNL (see McMahon 1986).

11 Popular benchmarks: Transaction Processing (TPC-C)
Successor of the Debit-Credit benchmark. TPC-C is an on-line transaction processing benchmark. Results report performance (tpmC) and price/performance ($/tpmC). The reported system has to be available to the customer (at that price). Running the benchmark requires a costly setup:
SPEC groups and benchmarks
Open Systems Group (desktop systems, high-end workstations and servers): CPU (CPU benchmarks), JAVA (Java client and server side benchmarks), MAIL (mail server benchmarks), SFS (file server benchmarks), WEB (web server benchmarks).
High Performance Group (HPC systems): OMP (OpenMP benchmark), HPC (HPC application benchmark), MPI (MPI application benchmark).
Graphics Performance Groups (graphics): Apc (graphics application benchmarks), Opc (OpenGL performance benchmarks).

12 Workload Selection
System under Study
Seems to be an easy thing to define. Be aware of different abstraction layers.
Example: ISO/OSI reference model for computer networks, listed from the top layer down:
7. Application (mail, FTP)
6. Presentation (data compression, ...)
5. Session (dialogs)
4. Transport (messages)
3. Network (packets)
2. Datalink (frames)
1. Physical (bits)

13 Level of Detail of the Workload Description
Examples: most frequent request (e.g. addition); frequency of request type (instruction mix); time-stamped sequence of requests; average resource demand (e.g. 20 I/O requests per second); distribution of resource demands (not only the average, but also the probability distribution).
Representativeness
Benchmarks are not an end in themselves; they should represent real workloads.
Different characteristics to consider: arrival rate of requests; resource demands; resource usage profile (sequence and amounts of resources used by an application).
To be representative, a test workload has to follow the user behavior in a timely fashion!
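
The levels of detail listed above can be illustrated on a tiny time-stamped trace. The sketch below uses an invented trace; the request types, the I/O counts, and all function names are assumptions made for the example. It derives the coarser descriptions (most frequent request, request mix, average demand, demand distribution) from the most detailed one.

```python
from collections import Counter
from statistics import mean

# Hypothetical trace: (timestamp in seconds, request type, I/O operations).
TRACE = [
    (0.0, "add", 0), (0.1, "load", 2), (0.3, "add", 0),
    (0.4, "store", 3), (0.7, "add", 0), (0.9, "load", 1),
]


def most_frequent_request(trace):
    """Coarsest description: the single most frequent request type."""
    return Counter(kind for _, kind, _ in trace).most_common(1)[0][0]


def request_mix(trace):
    """Relative frequency of each request type (the 'instruction mix')."""
    counts = Counter(kind for _, kind, _ in trace)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()}


def average_demand(trace):
    """Average resource demand (I/O operations per request)."""
    return mean(io for _, _, io in trace)


def demand_distribution(trace):
    """Empirical probability distribution of the resource demand."""
    counts = Counter(io for _, _, io in trace)
    return {io: count / len(trace) for io, count in sorted(counts.items())}


if __name__ == "__main__":
    print(most_frequent_request(TRACE))
    print(request_mix(TRACE))
    print(average_demand(TRACE))
    print(demand_distribution(TRACE))
```

Each step down this list discards information: the mix loses the ordering and timing of requests, and the average hides that half of the invented requests perform no I/O at all.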

14 SPEC Benchmarks
Lecture: Performance Analysis (Vorlesung Leistungsanalyse)
Outline: What is SPEC? Who is SPEC? Some SPEC benchmarks: SPEC CPU, SPEC HPC, SPEC OMP, SPEC MPI. Summary.

15 What and who is SPEC?
What is SPEC?
The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops suites of benchmarks and also reviews and publishes submitted results from our member organizations and other benchmark licensees. For more details see

16 SPEC Members SPEC Members: 3DLabs * Acer Inc. * Advanced Micro Devices * Apple Computer, Inc. * ATI Research * Azul Systems, Inc. * BEA Systems * Borland * Bull S.A. * CommuniGate Systems * Dell * EMC * Exanet * Fabric7 Systems, Inc. * Freescale Semiconductor, Inc. * Fujitsu Limited * Fujitsu Siemens * Hewlett-Packard * Hitachi Data Systems * Hitachi Ltd. * IBM * Intel * ION Computer Systems * JBoss * Microsoft * Mirapoint * NEC - Japan * Network Appliance * Novell * NVIDIA * Openwave Systems * Oracle * P.A. Semi * Panasas * PathScale * The Portland Group * S3 Graphics Co., Ltd. * SAP AG * SGI * Sun Microsystems * Super Micro Computer, Inc. * Sybase * Symantec Corporation * Unisys * Verisign * Zeus Technology * SPEC Associates: California Institute of Technology * Center for Scientific Computing (CSC) * Defence Science and Technology Organisation - Stirling * Dresden University of Technology * Duke University * JAIST * Kyushu University * Leibniz Rechenzentrum - Germany * National University of Singapore * New South Wales Department of Education and Training * Purdue University * Queen's University * Rightmark * Stanford University * Technical University of Darmstadt * Texas A&M University * Tsinghua University * University of Aizu - Japan * University of California - Berkeley * University of Central Florida * University of Illinois - NCSA * University of Maryland * University of Modena * University of Nebraska, Lincoln * University of New Mexico * University of Pavia * University of Stuttgart * University of Texas at Austin * University of Texas at El Paso * University of Tsukuba * University of Waterloo * VA Austin Automation Center * SPEC members in Dresden: Workshop June

17 SPEC groups
Open Systems Group (desktop systems, high-end workstations and servers): CPU (CPU benchmarks), JAVA (Java client and server side benchmarks), MAIL (mail server benchmarks), SFS (file server benchmarks), WEB (web server benchmarks).
High Performance Group (HPC systems): OMP (OpenMP benchmark), HPC (HPC application benchmark), MPI (MPI application benchmark).
Graphics Performance Groups (graphics): Apc (graphics application benchmarks), Opc (OpenGL performance benchmarks).
SPEC HPG = SPEC High-Performance Group
Founded in 1994. Mission: to establish, maintain, and endorse a suite of benchmarks that are representative of real-world high-performance computing applications. SPEC/HPG includes members from both industry and academia.
Benchmark products: SPEC OMP (OMPM2001, OMPL2001); SPEC HPC2002, released at SC 2002; SPEC MPI (under development).

18 Currently active SPEC HPG Members
Fujitsu, HP, IBM, Intel, SGI, SUN, UNISYS, Purdue University, Technische Universität Dresden.
HPG (High Performance Group) Benchmark Suites
[Timeline figure: founding of SPEC HPG, HPC96, OMP2001, OMPL2001, HPC2002, MPI2007; surviving date labels: Jan, June 2001, June 2002, Jan]

19 Overview and Positioning
Where is SPEC Relative to Other Benchmarks?
There are many metrics; each one has its purpose. Ordered from computer hardware to applications:
Raw machine performance: Tflops
Microbenchmarks: Stream
Algorithmic benchmarks: Linpack
Compact apps/kernels: NAS benchmarks
Application suites: SPEC
User-specific applications: custom benchmarks

20 Why do we need benchmarks?
Identify problems: measure machine properties.
Time evolution: verify that we make progress.
Coverage: help the vendors to have representative codes; increase competition by transparency; drive future development (see SPEC CPU2000).
Relevance: help the customers to choose the right computer.
Comparison of different benchmark classes
[Table: rows Micro, Algorithmic, Kernels, SPEC, Apps; columns coverage, relevance, identify problems, time evolution; cell values not preserved]

21 SPEC CPU 2006
From John Henning's talk at the SPEC Workshop, June 2007, Dresden.
SPEC CPU2006 History
Released August 2006. Replaces CPU2000 (retired February 2007).
5th CPU benchmark suite: SPECmark (later called CPU89), SPEC92 (later called CPU92), CPU95, CPU2000, CPU2006.
Note: these updates are required to stay representative.
Question to the audience: what kind of application would you add?

22 CINT 2006
Benchmark | Language | Application Area | Brief Description
400.perlbench | C | Programming Language | Derived from Perl V; the workload includes SpamAssassin, MHonArc (an indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 | C | Compression | Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc | C | C Compiler | Based on gcc version 3.2, generates code for Opteron.
429.mcf | C | Combinatorial Optimization | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk | C | Artificial Intelligence: Go | Plays the game of Go, a simply described but deeply complex game.
456.hmmer | C | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng | C | Artificial Intelligence: Chess | A highly-ranked chess program that also plays several chess variants.
462.libquantum | C | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref | C | Video Compression | A reference implementation of H.264/AVC, encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp | C++ | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar | C++ | Path-finding Algorithms | Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk | C++ | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types.
CFP 2006 (part I)
Benchmark | Language | Application Area | Brief Description
410.bwaves | Fortran | Fluid Dynamics | Computes 3D transonic transient laminar viscous flow.
416.gamess | Fortran | Quantum Chemistry | Implements a wide range of quantum chemical computations. The SPEC workload does self-consistent field calculations using the Restricted Hartree-Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field.
433.milc | C | Physics / QCD | A gauge field generating program for lattice gauge theory with dynamical quarks.
434.zeusmp | Fortran | Physics / CFD | ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for the simulation of astrophysical phenomena.
435.gromacs | C, Fortran | Biochemistry / Molecular Dynamics | Simulates Newtonian equations of motion for hundreds to millions of particles. The test case simulates the protein lysozyme in a solution.
436.cactusADM | C, Fortran | Physics / General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog numerical method.
437.leslie3d | Fortran | Fluid Dynamics | Computational Fluid Dynamics (CFD) using Large-Eddy Simulations with Linear-Eddy Model in 3D. Uses MacCormack Predictor-Corrector time integration.
444.namd | C++ | Biology / Molecular Dynamics | Simulates biomolecular systems. The test case has 92,224 atoms of apolipoprotein A-I.
447.dealII | C++ | Finite Element Analysis | deal.II is a C++ library targeted at adaptive finite elements and error estimation. The test case solves a Helmholtz-type equation with nonconstant coefficients.

23 CFP 2006 (part II)
Benchmark | Language | Application Area | Brief Description
450.soplex | C++ | Linear Programming, Optimization | Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.
453.povray | C++ | Image Ray-tracing | Image rendering. The test case is a 1280x1024 antialiased image of a landscape with some abstract objects with textures using a Perlin noise function.
454.calculix | C, Fortran | Structural Mechanics | Finite element code for 3D structural applications. Uses the SPOOLES solver library.
459.GemsFDTD | Fortran | Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method.
465.tonto | Fortran | Quantum Chemistry | An open source quantum chemistry package, using an object-oriented design in Fortran 95. The test case places a constraint on a molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data.
470.lbm | C | Fluid Dynamics | Implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D.
481.wrf | C, Fortran | Weather | Weather modeling from scales of meters to thousands of kilometers. The test case is from a 30 km area over 2 days.
482.sphinx3 | C | Speech Recognition | A widely-known speech recognition system from Carnegie Mellon University.
[Figure: Code growth]

24 Metrics
Speed: SPECint_base2006 (required base result), SPECint2006 (optional peak result), SPECfp_base2006 (required base result), SPECfp2006 (optional peak result).
Throughput: SPECint_rate_base2006 (required base result), SPECint_rate2006 (optional peak result), SPECfp_rate_base2006 (required base result), SPECfp_rate2006 (optional peak result).
Speed Metric for Single Benchmark
For each benchmark in the suite, compute the ratio vs. the time on a reference system: a 1997 Sun system with a 296 MHz UltraSPARC II, similar but not identical to the CPU2000 reference machine.
Example: 400.perlbench on a year-2006 iMac took 948 seconds; on the reference system, it took 9770 seconds. SPECratio = 9770/948 = 10.3.
If your workload looks like perl, you might find that this modern iMac runs around 10x faster than a state-of-the-1997-art workstation.
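
The per-benchmark speed ratio is a single division. A minimal Python sketch follows; the function name `spec_ratio` is illustrative and not part of SPEC's tooling.

```python
def spec_ratio(reference_seconds: float, measured_seconds: float) -> float:
    """SPECratio: time on the reference system divided by time on the system under test."""
    if measured_seconds <= 0:
        raise ValueError("measured time must be positive")
    return reference_seconds / measured_seconds


if __name__ == "__main__":
    # 400.perlbench numbers from the slide: 9770 s on the reference system, 948 s on the iMac.
    print(f"SPECratio = {spec_ratio(9770, 948):.1f}")
```

A ratio above 1 means the system under test is faster than the reference system on that benchmark.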

25 Overall Speed Metric
To obtain the overall speed metrics, take the geometric mean of the individual SPECratios.
Why geometric mean? Because this is the best answer to the question: "Without knowing how much time I will spend in text processing vs. network mapping vs. compiling vs. video compression, please tell me about how much faster this machine will be than the reference system."
Motivation for Throughput Metric
Differs from speed. Stove analogy: one big flame cooks one big pot holding one hogshead in one hour; 6 little flames cook 6 little pots, each holding one firkin, in 15 minutes. Which is better?
Well, the big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour.
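
The overall speed metric can be sketched as follows. The SPECratios in the demo are invented, and `geometric_mean` is written out with logarithms rather than taken from a library. The demo also shows one property that makes the geometric mean attractive for ratios: the comparison between two machines does not depend on which reference system the ratios were normalized to.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Hypothetical SPECratios of two machines on the same three benchmarks.
    machine_a = [10.3, 8.0, 12.5]
    machine_b = [5.0, 9.0, 6.0]
    overall_a = geometric_mean(machine_a)
    overall_b = geometric_mean(machine_b)
    print(f"overall A = {overall_a:.2f}, overall B = {overall_b:.2f}")
    # The ratio of the geometric means equals the geometric mean of the per-benchmark ratios.
    direct = geometric_mean(a / b for a, b in zip(machine_a, machine_b))
    print(f"A vs. B: {overall_a / overall_b:.3f} == {direct:.3f}")
```

An arithmetic mean of ratios does not have this property, so its ranking of two machines can change with the choice of reference system.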

26 Throughput vs. Speed
The big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160 liters/hour, but the six little flames together do ~960 liters/hour.
Alternatives: if I only need to heat up an UNOPENED container holding 1 gallon of soup, supper can be served most quickly if I put it on the big flame. If I need to heat up one butt of soup (= 2 hogsheads), and if I can open the container, I'd be better off using many small flames.
In IT business: processing one image in Photoshop or Gimp vs. rendering the next movie with thousands of pictures.
CPU2006 Throughput Metric
Formula: number of copies run * reference time for the benchmark / elapsed time in seconds.
Example: a Sun Fire E25K runs 144 copies of 400.perlbench in 1066 seconds: 144 * 9770 / 1066 =
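
The throughput formula can be evaluated directly. A minimal Python sketch using the Sun Fire E25K numbers from the slide; the function name `spec_rate` is illustrative.

```python
def spec_rate(copies: int, reference_seconds: float, elapsed_seconds: float) -> float:
    """Throughput ratio: copies * reference time / elapsed time."""
    if copies < 1 or elapsed_seconds <= 0:
        raise ValueError("need at least one copy and a positive elapsed time")
    return copies * reference_seconds / elapsed_seconds


if __name__ == "__main__":
    # 144 concurrent copies of 400.perlbench finish in 1066 s; the reference time is 9770 s.
    print(f"rate = {spec_rate(144, 9770, 1066):.2f}")
```

The expression evaluates to roughly 1320. With a single copy, the formula reduces to the speed ratio of the same run.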

27 Summary of Metrics
Two different kinds of metrics: speed (single application turnaround) and rate (throughput).
Run rules make the difference between base and peak. Base: conservative optimization, less freedom. Peak: more aggressive optimization, more freedom.
Two benchmark sets: SPECint and SPECfp.
$2^3 = 8$ different metrics.
If you look at the single application results, you get 2*2*(12+17) = 116 different metrics.
Example for Run Rules
Base does not allow feedback-directed optimization (still legal in peak).
An unlimited number of flags may be set in base. Why? Because flag counting is not worth arguing about. For example, is -fast:np27 one flag, two, or three? Prove it. What if it's -fast_np27? What if it's -fast np27 or -fast -np27?
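
The count of eight metrics follows from three independent binary choices: benchmark set, speed or rate, and base or peak. The sketch below enumerates them and rebuilds the metric names listed on the Metrics slide; the helper names are illustrative.

```python
from itertools import product

SUITES = ("int", "fp")      # benchmark set
KINDS = ("speed", "rate")   # single-copy turnaround vs. throughput
TUNINGS = ("base", "peak")  # conservative vs. aggressive optimization


def metric_name(suite: str, kind: str, tuning: str) -> str:
    """Build a CPU2006 metric name such as SPECint_rate_base2006."""
    rate = "_rate" if kind == "rate" else ""
    base = "_base" if tuning == "base" else ""
    return f"SPEC{suite}{rate}{base}2006"


def all_metrics() -> list:
    """All combinations of suite, kind, and tuning."""
    return [metric_name(s, k, t) for s, k, t in product(SUITES, KINDS, TUNINGS)]


if __name__ == "__main__":
    for name in all_metrics():
        print(name)
    # Per-benchmark results: 2 kinds * 2 tunings * (12 integer + 17 floating-point benchmarks).
    print(len(KINDS) * len(TUNINGS) * (12 + 17))
```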

28 SPEC CPU2000 Result
[Figure: SPEC CPU2000 result]
Thank You!