Leistungsanalyse von Rechnersystemen

Center for Information Services and High Performance Computing (ZIH)
Leistungsanalyse von Rechnersystemen (Performance Analysis of Computer Systems)
Zellescher Weg 12, Willers-Bau A113
Matthias S. Mueller (matthias.mueller@tu-dresden.de)

Organization
- Lecture: every Wednesday in INF E07, 13:00 to 14:30
- Exercise: every Thursday in INF E010, 9:20 to 10:50
- First exercise: October 19th, guided tour through all new machine rooms at ZIH
- All slides will be in English
- Ten-minute summary of the last lecture at the beginning of each lecture, also given in English

Class Material on the Web
- Slides will be put on the web prior to or shortly after each class
- Bildungsportal Sachsen: login required, but identical to the ZIH or INF login
- ZIH web pages

Class Outline
14 lectures with 12 corresponding exercises. Class structure:
- Introduction and motivation
- Performance requirements and common evaluation mistakes
- Performance metrics and evaluation techniques
- Workload types, selection, and characterization
- Commonly used benchmarks
- Benchmarks specialized on I/O
- Monitoring techniques
- Capacity planning for future systems
- Performance data presentation
- Comparing systems using sample data
- Regression models
- Experimental design
- Performance simulation and prediction
- Introduction to queueing theory

Literature
- Raj Jain: The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., 1991 (ISBN: )
- Rainer Klar, Peter Dauphin, Franz Hartleb, Richard Hofmann, Bernd Mohr, Andreas Quick, Markus Siegle: Messung und Modellierung paralleler und verteilter Rechensysteme. B.G. Teubner Verlag, Stuttgart, 1995 (ISBN: )
- Dongarra, Gentzsch (Eds.): Computer Benchmarks. Advances in Parallel Computing 8, North Holland, 1993 (ISBN: x)

Motivation

Innovations that changed our daily life
- Energy: steam engine, motor
- Transportation: railway, car, airplanes
- Food: fertilizer
- Communication: telephone
- Data processing: computer

Speed of data processing
- Human: 10^-2 FLOPS
- Workstation, PC: 10^8 FLOPS
- Supercomputer: FLOPS
- Ratio: factor

HPC: A key technology?
- USA defines strategic mission of HPC
- Software, methods and human beings
- Main motivation from military applications
- Integration of know-how in the country
- Attraction of experts from all over the world
- Japan: creator of the Earth Simulator
- Petaflop special-purpose machine for MD simulations
- Petaflop project is in preparation
- EU is still discussing an initiative

Accelerated Strategic Computing Initiative (ASCI)
Strategic initiative in the U.S.A.
- ASCI Red (Sandia): Intel system with 1 TFLOPS (sustained)
- ASCI Blue (LANL; LLNL): IBM and SGI, 3 TFLOPS each (sustained)
- ASCI White (LLNL): IBM Power 3 (10 TFLOPS)
- ASCI Q (LANL): Compaq system (20 TFLOPS)
- Red Storm (Sandia): Cray XT3, Opteron (40 TFLOPS)
- ASCI Purple (LLNL): IBM Power 4 (100 TFLOPS)
- ASCI BlueGene (LLNL): IBM PowerPC (180/360 TFLOPS)

What kind of know-how is required for HPC?
- Algorithms and methods
- Performance
- Programming (paradigms and details of implementations)
- Operation of supercomputers (network, infrastructure, service, support)

Challenges
- Languages: Fortran95, C++, Java
- Parallelization: MPI, OpenMP
- Network: ATM, IPv6, Gigabit
- Scheduling
- Distributed components, mobile agents
- System architecture: processors, memory hierarchy
- What is the best programming model for clustered SMPs with a deep memory hierarchy?

Software: a key technology
- Software is a key factor for progress in our country
- Is Germany a location for software development?
- WWW is everywhere (E-Commerce, Google, eBay, ...)
- Contribution of HPC: optimizing servers, optimizing access to databases, optimizing applications

Center for Information Services and HPC: A short introduction

HPC in Germany

Center for Information Services and HPC (ZIH)
- Central Scientific Unit at TU Dresden
- Merged institution: TUD Computing Center (URZ) and Center for High Performance Computing (ZHR)
- Competence Center for Parallel Computing and Software Tools
- Strong commitment to support real users
- Development of algorithms and methods: cooperation with users from all departments

Structure
- Management: Director Prof. Dr. W. E. Nagel; Deputy Directors Dr. P. Fischer, Dr. M. Müller
- Unit ZSD (Central Systems and Services): Dr. S. Maletti
- Unit IAK (Interdisciplinary Application Development and Coordination): Dr. M. Müller
- Unit NK (Network and Communication): W. Wünsch
- Unit IMC (Innovative Methods of Computing): PD Dr. A. Deutsch
- Unit PSW (Programming and Software Tools): Dr. H. Mix

Responsibilities of ZIH
- Providing infrastructure and qualified service for TU Dresden and Saxony
- Research topics:
  - Architecture and performance analysis of High Performance Computers
  - Programming methods and techniques for HPC systems
  - Software tools to support programming and optimization
  - Modeling algorithms of biological processes
  - Mathematical models, algorithms, and efficient implementations
- Role of mediator between vendors, developers, and users
- Pick-up and preparation of new concepts, methods, and techniques
- Teaching and education

Procurement: Overall Infrastructure / Future Directions
- HPC-Server: main memory 4 TB
- HPC-SAN: capacity > 50 TB
- PC-SAN: capacity > 50 TB
- PetaByte Tape Silo: capacity 1 PB
- Link bandwidths shown in the diagram: 8 GB/s, 4 GB/s, 4 GB/s, 1.8 GB/s
- HPC component: SGI Altix 4700, > 2000 of the latest Itanium cores, 6 TByte main memory
- PC-Farm: system from Linux Networx, AMD Opteron CPUs, > 700 boards with > 2500 cores, InfiniBand networks between the nodes

Timeline Machine Room Upgrade
[Timeline chart, July to September of the following year: Installation Stage 1a (test operation), Installation Stage 1b, Installation Stage 2]

Stage 1a Test Systems

HRSK Stage 1a: HPC-Server (merkur.hrsk.tu-dresden.de)
- SGI Altix 3700 Bx2
- 192x 1.5 GHz / 4 MB L3 cache Itanium2 CPUs
- 1152 GFlop/s peak performance
- 768 GB shared memory (4 GB/CPU), NUMA
- 1 TB local discs + 34 TB SAN
- SuSE SLES 9 incl. SGI ProPack 4
- Intel compiler and tools, Allinea DDT debugger
- Batch system LSF
- 1x DDN RAID system S2A9500: 2x S2A9500 couplet (5 GB cache, 8x FC4 ports), 292x 146 GB 10k RPM FC disks (4 hot spare), 34 TB net capacity

PC-Farm Stage 1a (phobos.hrsk.tu-dresden.de)
- 64 dual-CPU nodes
- 128 AMD Opteron DP GHz (single-core) CPUs
- 563.2 GFlop/s peak performance
- 256 GB main memory (4 GB per node)
- SUSE operating system
- InfiniBand 4x interconnect
- 80 GB local disc per node
- 21.2 TB shared disc space: 2x DDN RAID system S2A x 146 GB 10k RPM FC discs (4 hot spare)
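
The peak performance quoted for the Stage 1a HPC server can be reproduced with a short calculation. The sketch below is illustrative only: the helper `peak_gflops` and the constant `ITANIUM2_FLOPS_PER_CYCLE` are names introduced here, and the value of 4 floating-point operations per cycle per Itanium 2 core (two fused multiply-add units) is an assumption that the slide does not state.

```python
def peak_gflops(num_cpus, clock_ghz, flops_per_cycle, cores_per_cpu=1):
    """Theoretical peak in GFlop/s: CPUs x cores x clock (GHz) x flops per cycle."""
    return num_cpus * cores_per_cpu * clock_ghz * flops_per_cycle

# Assumption (not on the slide): an Itanium 2 core completes 4 floating-point
# operations per cycle, i.e. two fused multiply-add units.
ITANIUM2_FLOPS_PER_CYCLE = 4

# 192 single-core Itanium2 CPUs at 1.5 GHz, as listed for the Altix 3700 Bx2.
altix_stage_1a = peak_gflops(num_cpus=192, clock_ghz=1.5,
                             flops_per_cycle=ITANIUM2_FLOPS_PER_CYCLE)
print(f"Altix 3700 Bx2 peak: {altix_stage_1a:.0f} GFlop/s")  # prints 1152 GFlop/s
```

Under the same assumption, one dual-core socket at 1.6 GHz gives `peak_gflops(1, 1.6, 4, cores_per_cpu=2)`, which matches the 12.8 GFlops peak quoted for the Montecito CPUs on a later slide. Peak values of this kind are upper bounds; sustained performance of real applications is usually lower.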

Stage 1b, 2

HRSK Stage 1b/2
- Start of installation: 31 July 2006
- Petabyte tape archive: SUN STK SL Slots
- 30 LTO-3 tape drives
- 2500 LTO-3 tapes

HRSK HPC-System
- SGI Altix x 42U racks
- 1024x sockets with Itanium2 Montecito dual-core CPUs (1.6 GHz / 9 MB L3 cache)
- 13 TFlop/s peak performance
- 6.5 TB shared memory

HRSK PC-Farm
- Linux Networx PC-Farm (final configuration)
- 26x 42U racks + water cooled
- AMD Opteron X85 dual-core chip at 2.6 GHz
- 384x single-CPU nodes
- 232x dual-CPU nodes
- 112x quad-CPU nodes
- 2 GB main memory (ECC) per core for Stage 2
- 12 TFlop/s peak performance

HRSK Stage 2: HPC-SAN and PC-SAN
- SGI InfiniteStorage 6700 (DDN S2A9500)
- HPC-SAN: 68 TB
- PC-SAN: 51 TB

HRSK (system diagram)
- HPC component: main memory 6.5 TB
- PC-Farm
- HPC-SAN: disk capacity 68 TB
- PC-SAN: disk capacity 51 TB
- Petabyte tape archive: capacity 1 PB
- Link bandwidths shown in the diagram: 8 GB/s, 4 GB/s, 4 GB/s, 1.8 GB/s

Trefftz and Willers Building

New Extension

Location of Computer Rooms: extension of the Trefftz-Bau

SGI Altix 4700 at ZIH

PC-Farm at ZIH

Configuration of overall system: SAN Overview

Description of the SGI solution: HPC-SAN
- Total capacity: 68 TB
- 4 Gb/s FC throughout
- 4x DDN S2A 9500 with 17 TB each
- 584 disks of 146 GB
- CXFS/DMF on Altix 350 (24 Itanium)
- TP 9300 (MDS storage subsystem): 14x 73 GB for metadata
- Access from the PC-Farm: NFS servers on 12x Altix 350 with 2 CPUs each, or Opteron (for RDMA access)

Description of the SGI solution: HPC component
- More than 500 dual-core Itanium 2 (Montecito)
- 1.6 GHz, 18 MB L3 (9 MB per core)
- 12.8 GFlops peak
- 4-8 GB RAM (DDR2), total = 6 TB
- Connected via SGI NumaLink 4
- Bandwidth: 3.2 GB/s per node and direction
- Fat-tree topology
- Graphics pipes + graphics compositor
- RASC blade with two FPGAs (RASClib)

Description of the SGI solution: PC-Farm
- More than 700 boards
- Processors: AMD Opteron
- Interconnect: IB X4
- Compute nodes connected via three switches (288 ports)
- Connection to the HPC-SAN via 12 NFS servers (CXFS clients)

Description of the SGI solution: PC-SAN
- Lustre FS
- 2x DDN S2A 9500
- Capacity: 50.9 TB
- 440 disks of 146 GB

Tape Silo: Details
- CXFS/DMF server on Altix 350 (24 CPUs, 48 GB)
- Data Migration Facility (licence for 1 or 2 PB)
- 2x FC switches (24 ports)
- StorageTek SL 8500 (SUN)
- ACSLS licence for 2500 slots

Performance of Computers at ZIH: Some Activities

Vampir: Technical Components
1. Trace generator
2. Classical Vampir viewer and analyzer
3. Vampir client viewer
4. Parallel server engine
5. Conversion and analysis tools
[Diagram labels: Trace 1, Trace 2, Trace 3, ..., Trace N; Tools; Worker 1, Worker 2, ..., Worker m; Server; Master]

Vampir: Timeline

Vampir: Scalability
sPPM ASCI benchmark (3D gas dynamics). Data to be analyzed: 16 processes, 200 MByte volume.

Number of Workers
Load Time          47.33  22.48  10.80   5.43   3.01   3.16
Timeline            0.10   0.09   0.06   0.08   0.09   0.09
Summary Profile     1.59   0.87   0.47   0.30   0.28   0.25
Process Profile     1.32   0.70   0.38   0.26   0.17   0.17
Com. Matrix         0.06   0.07   0.08   0.09   0.09   0.09
Stack Tree          2.57   1.39   0.70   0.44   0.25   0.25

Vampir: A Large Test Case
- IRS ASCI benchmark (Implicit Radiation Solver)
- Data to be analyzed: 64 processes in 8 streams, approx. events, 40 GByte data volume
- Analysis platform: jump.fz-juelich.de (41 IBM p690 nodes, 32 processors per node, 128 GByte per node)
- Visualization platform: remote laptop
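
The load times in the "Vampir: Scalability" table can be turned into speedup figures. The following is a minimal sketch, assuming only the six load-time values from the slide; the worker counts per column are missing in the transcript, so speedup is expressed relative to the first column rather than per worker. The function names `speedups` and `step_ratios` are introduced here for illustration.

```python
# Load times copied from the "Vampir: Scalability" table (units as on the slide).
load_times = [47.33, 22.48, 10.80, 5.43, 3.01, 3.16]

def speedups(times):
    """Return the speedup of each column relative to the first column."""
    baseline = times[0]
    return [baseline / t for t in times]

def step_ratios(times):
    """Return the ratio between consecutive columns (about 2.0 means the time halved)."""
    return [prev / cur for prev, cur in zip(times, times[1:])]

for column, (t, s) in enumerate(zip(load_times, speedups(load_times)), start=1):
    print(f"column {column}: load time {t:6.2f}, speedup {s:5.2f}")

print("step ratios:", [round(r, 2) for r in step_ratios(load_times)])
```

The step ratios stay close to 2 for the first columns and drop below 1 for the last one, which suggests that load time stops improving at the largest configuration for this 200 MByte data set.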

BenchIT: Key Components 1
- BenchIT measurement core
  - Measurement kernels
  - Exact timer
  - Running kernels with variable problem sizes
  - Generating result files

BenchIT: Key Components 2
- BenchIT measurement core
- Command line interface
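
The tasks listed for the measurement core (kernels, a timer, variable problem sizes, result files) can be sketched in a few lines. This is not BenchIT's actual interface: the kernel `vector_add_kernel`, the functions `measure` and `write_results`, and the CSV layout are all hypothetical and serve only to illustrate the general idea.

```python
import csv
import io
import time

def vector_add_kernel(n):
    """Hypothetical measurement kernel: element-wise addition of two lists of length n."""
    a = [1.0] * n
    b = [2.0] * n
    return [x + y for x, y in zip(a, b)]

def measure(kernel, problem_sizes, repetitions=3):
    """Run the kernel for each problem size and keep the best wall-clock time."""
    results = []
    for n in problem_sizes:
        best = float("inf")
        for _ in range(repetitions):
            start = time.perf_counter()  # high-resolution timer
            kernel(n)
            best = min(best, time.perf_counter() - start)
        results.append((n, best))
    return results

def write_results(results, stream):
    """Write (problem size, seconds) rows as CSV to any text stream."""
    writer = csv.writer(stream)
    writer.writerow(["problem_size", "seconds"])
    writer.writerows(results)

results = measure(vector_add_kernel, [1_000, 10_000, 100_000])
buffer = io.StringIO()  # stands in for a result file
write_results(results, buffer)
print(buffer.getvalue())
```

Taking the minimum over several repetitions is one common way to reduce noise from other activity on the machine; averaging or reporting the full distribution are alternatives.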

BenchIT: Key Components 3
- BenchIT measurement core
- Command line interface
- GUI

BenchIT: Key Components 4
- BenchIT measurement core
- Command line interface
- GUI
- Website

Parbench: Influence of the operating system (full load test)
- 72 sequential jobs
- 9 jobs, eightfold parallelized
- 144 CPUs
- Parallelized kernel sequences take more than 250 s CPU time
- Influence by OS

Parbench: Influence of the operating system (under load test)
- 68 sequential jobs
- 9 jobs, eightfold parallelized
- 144 CPUs, 4 CPUs free for OS
- Parallelized kernel sequences take less CPU time

Benchmarks: Scalability of /fastfs file system
[Plot: bandwidth in GB/s]

Pt2Pt latency between all possible pairs
[Plot of "64-2/result-all2all-latency.log"]

Pt2Pt bandwidth between all possible pairs
[Plot of "64/result-all2all-bandwidth.log"]

Thank you! Hope to see you next time.