Red Española de Supercomputación Zaragoza


1 Red Española de Supercomputación Zaragoza. Zaragoza, July 1, 2010. Mateo Valero, Director

2 Top10 2

3 Looking at the Gordon Bell Prize. 1 GFlop/s; 1988; Cray Y-MP; 8 processors; static finite element analysis. 1 TFlop/s; 1998; Cray T3E; 1024 processors; modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method. 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors; superconductive materials. 1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads). Jack Dongarra 3

4 Cores in the Top25 Over Last 10 Years 4

5 Exponential growth in parallelism for the foreseeable future 5

6 Increasing chip performance: Intel's Petaflop chip. 80 processors in a die of 300 square mm. Terabytes per second of memory bandwidth. Note: the Teraflops barrier was first broken by Intel in 1997 using Pentium Pro processors contained in more than 85 cabinets occupying 200 square meters. This will be possible within 3 years from now. Thanks to Intel. ICPP-2009, September 23rd

7 Intel/UPC: since 2002 (Roger Espasa, Toni Juan). 40 people. Microprocessor development (Larrabee x86 many-core). 7

8 NVIDIA Fermi Architecture. 16 Streaming Multiprocessors (512 cores) execute Thread Blocks. 620 Gigaflops. Wide DRAM interface provides 12 GB/s bandwidth. Unified 768KB L2 cache serves all threads. GigaThread hardware scheduler assigns Thread Blocks to SMs. 8

9 Cell Broadband Engine™: A Heterogeneous Multi-core Architecture. * Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. 9

10 Hybrid SMP-cluster parallel systems. An interconnect (Myrinet, IB, GE; 3D torus, tree, ...) links the nodes; each node is an SMP (memory + IN) built from multicore chips with a network-on-chip (bus, ring, direct, ...). Node types: homogeneous multicore (e.g. Larrabee); heterogeneous multicore, general-purpose cores plus accelerators (e.g. Cell); GPU; FPGA; ASIC (e.g. Anton for MD). 10

11 HPC hierarchy in the current Top 10, based on the June 2010 list. 11

12 Evolution towards Exaflop supercomputers. Table (cores, nodes, chips/node, cores/chip, ops/core, GHz, GFlops) comparing: current #1, #2 (Fermi) and #3 (Cell) of the June 2010 Top500, all systems in the Top500 (6/2010), a personal supercomputer (CPU), a personal supercomputer with accelerator (CPU/GPU), the announced Sequoia, and a possible exaflop system. Sequoia (LLNL, announced): 20 PF/s, 1.6 PB memory, 96 racks, 98,304 nodes, 1.6 M cores (1 GB/core), 50 PB Lustre file system, 6.0 MW power. 12

13

14 Holistic approach towards exaflop: applications, performance tools, programming model, load balancing, interconnection, processor/node architecture. "Can you imagine how it would be if there was no distance? If everything was here?" "Yes, he can!" Thanks to Jesus Labarta. New York, June 9th, 2009

15 The holistic approach towards exaflop: applications, performance tools, programming model, load balancing, interconnection, processor/node architecture. Address spaces, latency, dependences. "Can you imagine how we should think exaflop? In a holistic way?" "Yes, he can!" New York, June 10th, 2009. 15

16 Holistic approach towards exaflop. YES... We can. Applications: comput. complexity, async. algs., moldability. Job scheduling: user satisfaction. Load balancing: resource awareness. Programming model: concurrency extraction, locality optimization, work generation, dependencies, address space. Run time: malleability, power efficiency, overhead. Interconnection: topology and routing, external contention. Processor/node architecture: NIC design, HW counters, run time support, memory subsystem, core structure. M. Valero, keynote at ICS, NY, June 2009.

17 Challenges: view from Jesús Labarta.
Variability: everywhere, huge.
Efficiency: for performance and power reasons; avoid overkills.
Memory: logical and physical structure; bandwidth and latency.
Resilience: impossible to run an app without errors happening halfway.
Constraint: power; locality scheduling, minimize bandwidth.
Programmability: Don Grice: "we can do the hardware, but if it cannot be programmed..." (approx.).
Programming model: machine independent; what, not how; smooth migration path.
Runtime/execution model: data access awareness; asynchrony/dataflow; automatic load balance; malleability.
Algorithms: asynchrony, overlap; minimize bandwidth.
Resilience: from recovery to tolerance.
Holistic: applications as co-design vehicles; between system software layers and architecture.
Tools: fly with instruments; the importance of detail. Jesús Labarta 17

18 BSC-CNS and international initiatives: IESP. Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment. Build an international plan for developing the next generation of open source software for scientific high-performance computing. 18

19 Back to Babel? Book of Genesis: "Now the whole earth had one language and the same words." The computer age: Fortran & MPI. "Come, let us make bricks, and burn them thoroughly." "Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves." And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech." C++, Cilk++, Fortress, X10, CUDA, Sisal, HPF, RapidMind, StarSs, Sequoia, CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI 19

20 Different models of computation. The dream of automatically parallelizing compilers has not come true, so the programmer needs to express opportunities for parallel execution in the application. SPMD (OpenMP 2.5), nested fork-join (OpenMP 3.0), DAG data flow. Huge lookahead & reuse: latency/EBW/scheduling. And asynchrony (MPI and OpenMP are too synchronous): collectives/barriers multiply the effects of microscopic load imbalance, OS noise, ... 20
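
To make the contrast concrete, here is a minimal C sketch (illustrative only, not from the talk; the sizes and data are placeholders) of the same blocked reduction written first in the SPMD/fork-join style of OpenMP 2.5 and then in the task style of OpenMP 3.0, where one thread generates work, idle threads pick it up, and synchronization is deferred to a single taskwait instead of a barrier after every loop.

```c
/* Fork-join vs. task-based expression of the same blocked reduction.
 * Illustrative sketch; N, BLOCK and the data are placeholders. */
#include <stdio.h>

#define N 1024
#define BLOCK 128

static double work(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i] * v[i];
    return s;
}

int main(void) {
    double v[N], partial[N / BLOCK];
    for (int i = 0; i < N; i++) v[i] = 1.0;

    /* OpenMP 2.5 style: fork-join, with an implicit barrier at loop end. */
    #pragma omp parallel for
    for (int b = 0; b < N / BLOCK; b++)
        partial[b] = work(&v[b * BLOCK], BLOCK);

    /* OpenMP 3.0 style: one thread generates tasks, the others execute
     * them asynchronously; synchronization is one deferred taskwait. */
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < N / BLOCK; b++) {
            #pragma omp task firstprivate(b)
            partial[b] = work(&v[b * BLOCK], BLOCK);
        }
        #pragma omp taskwait
    }

    double sum = 0.0;
    for (int b = 0; b < N / BLOCK; b++) sum += partial[b];
    printf("sum = %f\n", sum);
    return 0;
}
```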

21 The holistic approach towards exaflop. Applications, performance tools, programming model, load balancing, interconnection, processor/node architecture; crosscutting axes: latency, bandwidth, parallelism, power, scheduling. Clean programming practices, abstraction. Asynchrony/dependences/dataflow, data access specification, abstract/simple memory model, fine-grain DLP, basic transformations (unroll, ...). Mechanisms to inject probes; merge nicely with node level; monitor/set tunable and scheduling variables. Asynchrony. Detect data production to enable overlapped transfer. Efficient support for basic mechanisms: data transfers, thread creation, dependence handling. 21

22 The TEXT project: Towards EXaflop applicaTions. Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way. Deploy at supercomputing centers: Jülich, EPCC, HLRS, BSC. Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms. Develop additional environment capabilities: tools (debug, performance), improvements in runtime systems (load balance and GPUSs). Support other users: identify users of TEXT applications; identify and support interested application developers. Contribute to standards (OpenMP ARB, PERI-XML). 22
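
As a hint of what the hybrid MPI/SMPSs style looks like, here is a short C sketch (an illustration under assumptions, not code from the project; the "#pragma css" forms follow published SMPSs examples and may differ between releases): annotated tasks expose intra-node dataflow parallelism, while MPI remains responsible for inter-node data movement.

```c
/* Hybrid MPI/SMPSs sketch: the SMPSs runtime builds a dependence graph
 * from the task annotations and runs independent block multiplications
 * asynchronously on the cores of each node. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BS 128   /* block size (placeholder) */
#define NB 4     /* blocks per dimension on this rank (placeholder) */

#pragma css task input(a, b) inout(c)
static void block_mm(double a[BS][BS], double b[BS][BS], double c[BS][BS]) {
    for (int i = 0; i < BS; i++)
        for (int k = 0; k < BS; k++)
            for (int j = 0; j < BS; j++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double A[NB][NB][BS][BS], B[NB][NB][BS][BS], C[NB][NB][BS][BS];
    memset(C, 0, sizeof C);

    /* Each call spawns a task; tasks updating different C blocks run
     * concurrently, while tasks updating the same block are ordered by
     * the inout dependence. */
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                block_mm(A[i][k], B[k][j], C[i][j]);

    #pragma css barrier   /* wait for local tasks before communicating */
    /* ... exchange boundary blocks with MPI here ... */

    if (rank == 0) printf("C[0][0][0][0] = %f\n", C[0][0][0][0]);
    MPI_Finalize();
    return 0;
}
```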

23 BSC-CNS. The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) was constituted on April 1, 2005 as the Spanish National Laboratory in supercomputing. In the short term, CSIC will join the BSC-CNS board of trustees. The new ownership percentages will be: Spanish Government 54%, Catalan Government 30%, UPC 11%, CSIC 5%. 23

24 BSC: Spanish National Center. More than 300 people from 27 different countries (Argentina, Belgium, Brazil, Bulgaria, Canada, China, Colombia, Cuba, Dominican Republic, France, Germany, India, Iran, Ireland, Italy, Jordan, Lebanon, Mexico, Pakistan, Poland, Russia, Serbia, Spain, Turkey, UK, USA). 24

25

26 Awaiting a new MareNostrum: MNv1 (50 TF), MNv2 (100 TF), MNv3 (400 TF). Chart: world and European Top500 positions of MareNostrum v1 and v2, Nov 2004 to Nov 2010.

27 Top20 in Europe 27

28 Red Española de Supercomputación Altamira Universidad de Cantabria MareNostrum BSC CaesarAugusta Universidad de Zaragoza Magerit Universidad Politécnica Madrid Picasso Universidad de Málaga La Palma IAC Tirant Universidad de Valencia Atlante ITC 28

29 1550 e-science projects 29

30 Dynamic load balancing. MPI + OpenMP: overcommit threads/core, with only one per processor active at a time; shift processors between processes in a node; e.g. 800 procs, 2.5x speedup. MPI + StarSs (in development): should result in higher flexibility; overcoming Amdahl's law in hybrid parallelization; start with one process per core; StarSs only in unbalanced regions. LeWI: A Runtime Balancing Algorithm for Nested Parallelism. M. Garcia, J. Corbalan, J. Labarta. ICPP'09 30
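
The slide's idea can be sketched in a few lines of hybrid C code (a simplified illustration, not the LeWI algorithm itself: LeWI lends a process's cores transparently whenever it blocks inside an MPI call, whereas this toy version just sizes each process's OpenMP team in proportion to a measured load; cores_per_node and the load metric are assumptions):

```c
/* Toy version of shifting cores between MPI processes on a node:
 * each rank takes a share of the node's cores proportional to its load. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int cores_per_node = 8;        /* assumption for the sketch */
    double my_load = 1.0 + rank;   /* stand-in for a measured load */

    /* Sum the loads of the processes sharing the node (here: all ranks,
     * assuming a single node) and take a proportional share of cores. */
    double total_load;
    MPI_Allreduce(&my_load, &total_load, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    int nthreads = (int)(cores_per_node * my_load / total_load + 0.5);
    if (nthreads < 1) nthreads = 1;
    omp_set_num_threads(nthreads);

    #pragma omp parallel
    #pragma omp single
    printf("rank %d runs with %d threads\n", rank, omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```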

31 Performance analysis of real production applications: describe actual behavior, identify optimization approaches, estimate potential. To optimize the applications in cooperation with developers; to drive other developments: programming model, run times, architecture, interconnect. Charts (speedup, CPU ratio, bandwidth in MB/s, % of computation time per code region) for four real codes: GROMACS (serialization!!), PEPC (profile over code regions), GADGET (load balance!!, 93.67% of elapsed time in one region), SPECFEM3D (endpoint contention!!; speedup predictions for 1, 5, 10, 100 MB/s and ideal bandwidth, up to 4096 processes).

32 Kaleidoscope Project.
Platform   Gflops    Speed-up   Power (W)   Gflops/W
JS21       8,...                            0,03
QS22       116,...                          0,32
2 TESLA    ...,8                            0,76
The work of 3 months is now done in 1 week (speed-up 14) or in 2 days (speed-up 42). On the Cell, 23.5 GB/s of memory BW is used out of the 25.6 GB/s max BW. On TESLA the I/O is now the real bottleneck. Awarded by "IEEE Spectrum" as one of the 2008 top 5 innovative technologies. Platt's award to the commercial technology of the year 2009. Barcelona, February 10, 2010

33 Consolider. The BSC-CNS coordinates a Consolider programme on supercomputing and e-science that brings together research groups expert in applications requiring supercomputing and groups expert in the design of supercomputer hardware and system software. Structure: application areas (Life Sciences, Earth Sciences, Astrophysics, Engineering, Physics), compilers and tuning of application kernels, programming models and performance tuning tools, architectures and hardware technologies.
Life Sciences: Grupo de Modelización Molecular y Bioinformática (U. de Barcelona, M. Orozco); Grup de Bioinformàtica del Genoma (Centre de Regulació Genòmica, R. Guigó); Grupo de Biología Estructural Computacional (Centro Nacional de Investigaciones Oncológicas, A. Valencia).
Earth Sciences: Grupo de Ciencias de la Tierra (BSC-CNS, J. M. Baldasano); Unidad de Contaminación Atmosférica (CIEMAT, F. Martín); Grupo de Diagnóstico y Modelización del Clima (U. Complutense de Madrid, R. García-Herrera).
Astrophysics: Grupo de Astrofísica Relativista y Cosmología (U. de Valencia, J. M. Ibáñez); Grupo de Simulación Numérica de Procesos Astrofísicos (Instituto de Astrofísica de Canarias, F. Moreno); Grupo GAIA de Astronomía Galáctica (U. de Barcelona, J. Torra); Grupo de Cosmología Computacional (U. Autónoma de Madrid, G. Yepes).
Engineering: Grupo de Mecánica de Fluidos Computacional (U. Politécnica de Madrid, J. Jiménez Sendín); Unidad de Simulación Numérica y Modelización de Procesos (CIEMAT, M. Uhlmann).
Physics: Grupo SIESTA (U. Autónoma de Madrid, J. M. Soler); Grupo SIESTA (Laboratorio de Estructura Electrónica de los Materiales, Instituto de Ciencia de Materiales de Barcelona ICMAB-CSIC, P. J. Ordejón). 33

34 Consolider (2). Basic research in supercomputing.
Compilers and tuning of application kernels: Departamento de Tecnologías de la Información (BSC-CNS, M. Valero); Grupo Computación de Altas Prestaciones (U. Politècnica de Catalunya, J. M. Llabería); Grupo de Arquitectura y Tecnología de Sistemas Informáticos (U. Complutense de Madrid, F. Tirado); Grupo de Arquitectura de Computadores (U. de Málaga, E. López Zapata).
Programming models and performance tuning tools: Departamento de Tecnologías de la Información (BSC-CNS, M. Valero); Grupo Computación de Altas Prestaciones (U. Politècnica de Catalunya, J. M. Llabería); Parallel Processing and Distributed Systems group (U. Autónoma de Barcelona, A. Ripoll).
Architectures and hardware technologies: Departamento de Tecnologías de la Información (BSC-CNS, M. Valero); Grupo Computación de Altas Prestaciones (U. Politècnica de Catalunya, J. M. Llabería); Grupo de Arquitectura de Computadores (U. de Zaragoza, V. Viñals); Grupo de Arquitectura y Tecnología de Sistemas Informáticos (U. Complutense de Madrid, F. Tirado); Grupo de Arquitectura y Tecnología de Computadores (U. de Cantabria, J. R. Beivide); Grupo de Arquitectura de Computadores (U. de Málaga, E. López Zapata); Grupo de Arquitectura de Computadores (U. de Las Palmas de Gran Canaria, Instituto Universitario de Ciencias y Tecnologías Cibernéticas, E. Fernández). 34

35 SyeC Consolider Project. Nanotechnology (SIESTA code): improving scalability up to 1000 cores; load balancing; hybrid parallelism MPI+OpenMP; parallel I/O. Charts: scalability (speed-up vs. number of processors, against ideal) of the matrix-vector product, the eigenvalue solver, and the Hamiltonian builder.
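
As an illustration of the hybrid MPI+OpenMP parallelism mentioned for SIESTA, here is a minimal matrix-vector product sketch (not SIESTA code; N, the data, and the block-row distribution are placeholder assumptions): MPI ranks own strips of rows, OpenMP threads share each strip, and an Allgather assembles the result vector.

```c
/* Hybrid MPI+OpenMP matrix-vector product sketch: block-row distribution
 * across ranks, loop-level OpenMP parallelism inside each rank. */
#include <mpi.h>
#include <stdio.h>

#define N 1024   /* global dimension (assumed divisible by nprocs) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;
    static double a[N][N], x[N], y_local[N], y[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;
    for (int i = 0; i < rows; i++)          /* fill the local strip only */
        for (int j = 0; j < N; j++) a[i][j] = 1.0;

    /* OpenMP parallelizes the local strip; MPI gathers the full vector. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++) s += a[i][j] * x[j];
        y_local[i] = s;
    }
    MPI_Allgather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    if (rank == 0) printf("y[0] = %f\n", y[0]);
    MPI_Finalize();
    return 0;
}
```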

36 SyeC Consolider Project. Astrophysics codes on general relativistic resistive magnetohydrodynamics; porting to MPI + OpenMP or GPUs; GAIA operative infrastructure. 36

37 SyeC Consolider Project. DNS in CFD (turbulence + particles): global communications are critical; boundary layer modeling; hybrid code MPI+OpenMP; Reθ = 6150 on BG/P at ANL (34 Mh of core time); parallel I/O; 100 TB total output data => 5 years to analyze these data. 37
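
The parallel I/O item can be illustrated with a minimal MPI-IO sketch (an illustration under assumptions, not the project's code; the file name and sizes are placeholders): each rank writes its contiguous slice of a field to one shared file with a collective call, which lets the I/O layer aggregate requests rather than serialize ranks.

```c
/* Collective MPI-IO sketch: non-overlapping per-rank offsets into one
 * shared file, written with a collective call. */
#include <mpi.h>

#define LOCAL_N (1 << 20)   /* doubles per rank, placeholder size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double field[LOCAL_N];
    for (int i = 0; i < LOCAL_N; i++) field[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: offsets never overlap, so the I/O layer can
     * aggregate requests instead of serializing the ranks. */
    MPI_Offset off = (MPI_Offset)rank * LOCAL_N * sizeof(double);
    MPI_File_write_at_all(fh, off, field, LOCAL_N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```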

38 SyeC Consolider Project. Atmospheric modeling: new numerical schemes; improving scalability; MPI+OpenMP + parallel I/O; Cell and GPU code. First dust simulation including all processes except dry deposition. 38

39 SyeC Consolider Project. Life Sciences (protein interactions & genomics): workflows, databases, GPU code porting, new sequencing hardware. Small folding: 1 experiment = 4 Tb of data, 40 Gb processed data; 1 machine: 2 experiments a week; a medium-sized center: 10 machines. 39

40 SyeC Consolider Project Computer Sciences Code scalability Programming Models Computer architecture 40

41 PRACE Project Consortium: hosting partners (tier-0) and general, non-hosting partners (tier-1). 41

42 First machine available: JUGENE (IBM) at FZJ, Jülich, Germany. Next machine: France, summer. 42

43 PRACE Early Access Call. Opening date: 10th May 2010. Closing date: 10th June 2010. Start date: 1st August or 1st December 2010. Allocation period: 4 months. Type of access: Project (1 proposal) or Preparatory + Project (combined, 2 linked proposals). Information: 43

44 MareIncognito: project structure. 4 relevant apps plus general kernels: materials (SIESTA), geophysics imaging (RTM), computational mechanics (ALYA), plasma (EUTERPE). Performance analysis tools: automatic analysis, coarse/fine-grain prediction, sampling, clustering, integration with Peekperf. Programming models: StarSs (CellSs, SMPSs), OpenMP@Cell, OpenMP++, MPI + OpenMP/StarSs. Models and prototype. Interconnect: contention, collectives, overlap of computation/communication, slimmed networks, direct versus indirect networks. Load balancing: coordinated scheduling (run time, process, job), power efficiency. Processor and node: contribution to the new Cell design; support for the programming model, for load balancing, and for performance tools; issues for future processors. 44

45 RIS: BSC-CNS and Latin America. Establishment of RIS, the Red Iberoamericana de Supercomputación (Ibero-American Supercomputing Network), through CYTED: shared use of resources (RES plus Ibero-American centers), training, research. Countries: Portugal, Argentina, Brazil, Colombia, Chile, Dominican Republic, Cuba, Mexico, Ecuador. Connect RIS to the EU programmes. Barcelona, February 10, 2010

46 Barcelona Computing Week, July 5-9, 2010: Programming and Tuning Massively Parallel Systems. Coordinators: Mateo Valero (BSC), Wen-mei Hwu (Illinois). Instructors: Wen-mei Hwu (University of Illinois), David B. Kirk (NVIDIA Corporation). Audience: three parallel tracks specially designed for beginner, advanced and teacher profiles. Programming languages: CUDA, OpenCL, OpenMP, StarSs. Numerical methods and case studies: FFT, graph, tiling, grid, Monte Carlo, FDTD, sparse matrices. Hands-on labs: afternoon labs with teaching assistants for each audience/level. Applications for July. 46

47 ACACES 10, "HiPEAC Summer School": a one-week summer school for computer architects and compiler builders. 294 applications, 197 participants; 17 industry attendants, 8 companies; 31 countries. Keynotes: Insup Lee (University of Pennsylvania), Jesus Labarta (BSC). Courses:
Andreas Herkersdorf (TU Muenchen): Application-Specific (MP)SoC Architectures for Internet Networking.
Michael Scott (University of Rochester): Transactional Memory.
Vivek Sarkar (Rice University): Multicore Programming Models and their Compilation Challenges.
David Brooks (Harvard University): Variation-Aware Processor Design.
Derek Chiou (University of Texas at Austin): Fast and Accurate Computer System Simulators.
Scott Mahlke (University of Michigan): Compilation for Multicore Processors.
Dan Sorin (Duke University): Fault Tolerant Computer Architecture.
Donatella Sciuto (Politecnico di Milano): FPGA-based reconfigurable computing.
Steven Hand (Citrix): System Virtualization.
Theodore Ts'o (Google): File Systems and Storage Technologies.
Mahmut Kandemir (Pennsylvania State University): Embedded Systems: A Software Perspective.
Andrzej Brud and Per Stenström (Chalmers): How to transform research results into a business. 47

48 BMW10: Barcelona Multicore Workshop, October 22-23, 2010. Multi-core and many-core processors have already arrived. The issue facing the software community is how to program those machines in the most productive way; the hardware community has to design the manycores so as to maximize the potential performance. The BMW workshop consists of a combination of invited talks, two panel discussions and time for discussion. Co-organized by the BSC-Microsoft Research Center and HiPEAC. Organizers: BSC-Microsoft: Mateo Valero, Fabrizio Gagliardi, Osman Unsal; HiPEAC: Per Stenstrom, Georgi Gaydadjiev, Manolis Katevenis, Eduard Ayguade. 48

49 Master on HPC. Kernel: Supercomputer Architectures; Methods and Algorithms for Parallel Programming; Optimization and Parallelization of Numerical Simulations. Free options: High Performance Computational Mechanics; Performance tuning and analysis tools; Data Mining; Seminar on Supercomputing I, II, III. Applications: Computational Astrophysics, Bioinformatics, Earth Sciences; Applications of Computational Astrophysics, Applications of Bioinformatics, Applications of Earth Sciences. Thesis. 49

50 Education for Parallel Programming: multi-core programming, many-core programming, massive parallel programming, games. "Multicore-based pacifier" (cartoon). 50

51 Thank you very much. Barcelona, February 10, 2010

52 SIESTA project: ab-initio DFT molecular dynamics code; BSC is working on its development. Example: Neptune mid-layer, H2O + NH3 + CH4. Temperature 1500-2500 K; pressure 0.15-60 GPa; simulated time 10 ps, equivalent to 20,000 molecular dynamics steps. Number of atoms: 1269 atoms (100 processors, 2007); now, more than ... atoms using more than 1000 processors. 52

53 High Performance Computing as key-enabler. Chart (courtesy AIRBUS France): available computational capacity [Flop/s] and capacity in # of overnight loads cases run, 1980-2030, as CFD fidelity grows from RANS (low speed, high speed) through unsteady RANS to LES. Milestones: HS design; data set; CFD-based LOADS & HQ; aero optimisation & CFD-CSM; full MDO; capability achieved during one-night batch; CFD-based noise simulation; real-time CFD-based in-flight simulation. Smart use of HPC power: algorithms, data mining, knowledge. 53

54 Design of the ITER TOKAMAK (JET, Oxford) 54

55 Weather, Climate and Earth Sciences: Roadmap. 2009: resolution 80 km; memory: ... GB; storage: 8 TB; ~10^14 FLOPS; NEC SX-9, 48 vector procs: 40-day run. 2015: resolution 20 km; memory: 3.5 TB; storage: 180 TB; 1x10^16 FLOPS; high-resolution model with complete carbon cycle; challenges: data viz and post-processing, data discovery, archiving. Beyond: resolution 1 km; memory: 4 PB; storage: 150 PB; 1x10^18 FLOPS; higher resolution with global cloud-resolving model; challenges: data sharing, transfer, memory management, I/O management.

56 Supercomputing, theory and experimentation. Courtesy of IBM. 56

57

58 ORNL: 1.75 PF/s Cray XT5-HE system. Quad-core AMD Opteron processors running at 2.6 GHz, 224,162 cores. Power: 6.95 MWatts. 300 terabytes of memory; 10 petabytes of disk space; 240 gigabytes per second disk bandwidth; Cray's SeaStar2+ interconnect network. Jack Dongarra 58

59 PRACE AISBL. First Council meeting of the legal form. Three more countries formally adhered to the legal form: Sweden, Cyprus and the Czech Republic. The BoD was formally appointed; chair of the BoD selected: Sergi Girona. Operating budget approved. 59

60 PRACE regular calls. Preparatory access: code testing and optimisation, technical support if requested; continuous call, fast-track assessment, maximum allocation of 6 months. Project access: 2 calls per year, 1-year allocation, allocation in November and May. Programme access: large projects of a research group, 2-year allocation; calls coincide with the calls for project access; possible review at the end of the first year of allocation. A final report is mandatory for all types of proposals. 60

61 PRACE regular calls. The 1st PRACE regular call opened 15th June for allocations starting 1st November. The only available machine at present is JUGENE. All proposals will be subject to PRACE peer review, which will be handled on-line. The Scientific Steering Committee will be responsible for advising on the scientific direction of PRACE. 61

62 Grand Challenge problems. Systems biology: modeling & simulation leading to predictive models with clinical or environmental impact; multi-scale patient-specific data; genetic variability; gene expression profiling; protein expression profiling; multi-modal imaging; data analysis and modeling. Sustainable systems: taking into account their multi-scale nature; models linked to experimental data, providing corroboration of experiments. Turbulence & chaos: characterize boundary layer effects and their impact on the global solution and stability. Environmental: global warming/climate change, energy, water, biodiversity and land use, chemicals, toxics and heavy metals, air pollution, waste management, stratospheric ozone depletion, oceans & fisheries, deforestation. 62

63 BSC-CNS: synergy with infrastructures. The BSC-CNS is a fundamental complement to experimental scientific infrastructures: IAC, ICFO, the Synchrotron, IRB, CIEMAT (TJ II). Barcelona, July 20, 2009. 63

64 Factors that Necessitate Redesign. Steepness of the ascent from terascale to petascale to exascale. Extreme parallelism and hybrid design: preparing for million/billion-way parallelism. Tightening memory/bandwidth bottleneck: limits on power/clock speed and their implications for multicore; the pressure to reduce communication will become much more intense; memory per core changes, the byte-to-flop ratio will change. Necessary fault tolerance: MTTF will drop; checkpoint/restart has limitations; the software infrastructure does not exist today. 64

65 The BSC-CNS. Mission of the BSC-CNS: to research, develop and manage technologies that facilitate the advancement of science. Objectives of the BSC-CNS: R&D in computer sciences, life sciences and earth sciences; supercomputing support for external research. 5 scientific/technical departments. Barcelona, February 10, 2010

66 BSC-CNS: backbone of the supercomputing service in Spain. MareNostrum (BSC), Magerit (Universidad Politécnica de Madrid), CaesarAugusta (Universidad de Zaragoza), La Palma (IAC), Picasso (Universidad de Málaga), Altamira (Universidad de Cantabria), Atlante (Gobierno de Canarias), Tirant (Universidad de Valencia). Barcelona, February 10, 2010. 66

67 BSC-CNS and REPSOL: the Kaleidoscope project. 2008: awarded by "IEEE Spectrum" as one of the top 5 innovative technologies. Barcelona, February 10, 2010

68 Kaleidoscope: WEM vs. RTM Data Property of TGS 68

