The PRACE Project: Applications, Benchmarks and Prototypes. Dr. Peter Michielse (NCF, Netherlands)




Introduction to me
- Ph.D. in numerical mathematics (parallel adaptive multigrid solvers) from Delft University of Technology in 1990
- Philips Electronics, Convex/HP computers
- Principal Systems Engineer at Silicon Graphics
- From 2004: deputy director of NCF (National NL Computing Facilities)
- Secretary of the scientific committee of NCF for the review of computing time requests
- Secretary of the Dutch national grid project (BiG Grid)
- Benchmark coordinator of PRACE

Agenda
- Introduction to PRACE
- Organisation of the PRACE project
- Prototypes
- Applications
- Final benchmark set
- Access to the prototypes
- Future architectures
- Future PRACE implementation phase
- Exascale computing

Introduction to PRACE - 1
History:
- First ideas towards European supercomputing in June 2005
- Meeting at ISC2005 in Heidelberg: Fr, Ge, NL, Sp, UK
- Investments in future HPC equipment were regarded as beyond the scope and potential of a single country
- The conclusion was that a sustainable HPC infrastructure in Europe, comparable to those in the US and Japan, would be necessary
- Tier-0 systems in Europe, comparable to US and Japanese systems
- How to fund, how to operate and where to install? What science to facilitate?
But first:
- Build the science case
- Seek awareness at the European (EC) level
- Involve communities and industries

Introduction to PRACE - 2
Science case, prepared by HET (HPC in Europe Taskforce):
- Extensive survey on benefits for science
- Workshops with scientists
- Including societal and economic benefits
- Conducted in 2006, with participation of the Netherlands
EC awareness:
- High Performance Computing facilities put on the ESFRI roadmap (among 34 other projects)
- Estimated costs of 200-400 M€ on an annual basis (investments and running costs) for one new system each year

Introduction to PRACE - 3
In 2007, a Memorandum of Understanding (MoU) was signed by 14 countries: Au, Ch, Fi, Fr, Ge, Gr, It, NL, No, Po, Pt, Sp, Sw, UK
- To work on the establishment of a European HPC infrastructure
- To prepare a project proposal to the EC to investigate and prepare such a European HPC infrastructure
- To be prepared to provide matching funding to execute the project
First known as PACE, later as PRACE: PaRtnership for Advanced Computing in Europe
- Project proposal for 20 M€ in total (50% by the EC, 50% by the 14 countries)
- 10 M€ by the EC in the framework of the preparation of the ESFRI projects
- Project granted; started in January 2008 for 2 years
- In 2008 and 2009, Ireland, Turkey, Cyprus and Serbia have already joined, with more to follow (Bulgaria, Czech Republic)

Organisation of the PRACE project - 1
16 partners from 14 countries (Germany with 3 partners: FZJ, HLRS, LRZ)
- Fr: GENCI (CEA, CINES); NL: NCF (with SARA and Univ. Groningen); Sp: BSC; UK: EPSRC (EPCC, STFC); Fi: CSC; CH: CSCS; It: CINECA; ...
Sustainable Tier-0 infrastructure, preparation phase:
- Administrative part: legal issues, peer review, operational and funding models, ownership, seat of PRACE, dissemination, education
- Technical part: system software, system management, applications and benchmarking, architecture prototypes, future technology
Organised in a total of 8 work packages

Organisation of the PRACE project - 2
Key overall goal: facilitate European scientific research with top HPC facilities.
In the PRACE project this translates technically to:
- Investigate and improve scalability of applications to petascale and beyond
- Test applications on petascale equipment
- Develop a benchmark suite of representative applications for purchasing and testing
- Investigate future technology for usage in future systems
This has led to:
- Applications selection
- Benchmark definition
- Prototype systems (mainstream and future)

Prototypes (mainstream technology) - 1
Hosting partners buy the prototypes, with matching funding from PRACE.
- To avoid misunderstanding: PRACE partners have given access to their already existing systems for PRACE usage, and have typically bought a small additional part to do PRACE-specific work; this small part has been matched
Criteria:
- Immediate (as soon as possible) usage in the PRACE project
- Coverage of all mainstream architectures: MPP, SMP-ThinNode, SMP-FatNode, Hybrid (combination of architecture/processor technology)
- For both internal and external usage

Prototypes (mainstream technology) - 2
Results:
- MPP: IBM BlueGene/P, FZ Juelich, Germany
- MPP: Cray XT5, CSC, Finland (in combination with CSCS, Switzerland)
- SMP-ThinNode: Bull cluster with Intel Nehalem, CEA/FZJ, France/Germany
- SMP-FatNode: IBM Power6 cluster, SARA/NCF, The Netherlands
- Hybrid: IBM Power6 + Cell, BSC, Spain
- Hybrid: NEC vector + Intel x86 cluster, HLRS, Germany
Internal usage:
- Assessment of the prototypes by running synthetic benchmarks
- Scalability testing and preparation of the application benchmark suite
External usage:
- Scientific communities with their own codes, through a light review mechanism
- Experiences possibly included in PRACE deliverables

Applications - 1
Goals in PRACE with respect to applications:
- Investigate scalability of applications
- Construct a PRACE Application Benchmark Suite (PABS)
Process (part 1):
- Determine application areas
- Determine codes within application areas, based on usage in Europe
- Define an application/prototype matrix with respect to porting efforts
- Start testing with this initial PABS (July 2008 - May 2009)

Applications - 2
Application areas and codes:
- Particle Physics: QCD
- Computational Chemistry: VASP, CPMD, CP2K, NAMD, GROMACS, HELIUM, GPAW, SIESTA
- Computational Fluid Dynamics: SATURNE, AVBP, NS3D, ALYA
- Astronomy and Cosmology: GADGET
- Earth Sciences: NEMO, ECHAM5, BSIT
- Plasma Physics: PEPC, TORB/EUTERPE
- Computational Engineering: TRIPOLI_4

Applications - 3
Benchmark code | Languages | Libraries | Programming model | I/O characteristics
QCD | Fortran 90, C | | MPI | no special
VASP | Fortran 90 | BLACS, SCALAPACK | MPI (+ pthreads) | no special
NAMD | C++ | Charm++, FFTW, TCL | Charm++, MPI, master-slave | no special
CPMD | Fortran 77 | BLAS, LAPACK | MPI |
Code_Saturne | Fortran 77, C99, Python | BLAS | MPI | read at start, write periodically
GADGET | C 89 | FFTW, GSL, HDF5 | MPI |
TORB | Fortran 90 | PETSc, FFTW | MPI | read at start, write periodically
ECHAM5 | Fortran 90 | BLAS, LAPACK, NetCDF | MPI/OpenMP | read at start, write periodically
NEMO | Fortran 90 | NetCDF | MPI | read at start, write periodically
CP2K | Fortran 95 | FFTW, LAPACK, ACML | MPI | checkpoints and output, intense
GROMACS | C, assembler | FFTW, BLAS, LAPACK | MPI | read at start, write periodically, relaxed
NS3D | Fortran 90 | EAS3, Netlib (FFT) | MPI + NEC microtasking | read at start, write periodically
AVBP | Fortran 90 | HDF5, szip, Metis | MPI | read at start, write periodically
HELIUM | Fortran 90 | | MPI | read at start, write periodically
TRIPOLI_4 | C++ | | TCP/IP sockets | read at start, write periodically
PEPC | Fortran 90 | | MPI | read at start, write periodically
GPAW | Python, C | LAPACK, BLAS | MPI | read at start, write at end
ALYA | Fortran 90 | Metis | MPI/OpenMP | read at start, write periodically
SIESTA | Fortran 90 | Metis, BLAS, SCALAPACK | MPI | read at start, write periodically
BSIT | Fortran 95, C | compression library | MPI/OpenMP | read at start, write periodically
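Several codes in the table (ECHAM5, ALYA, BSIT) use a hybrid MPI/OpenMP programming model. As a generic illustration of that pattern, not taken from any of the listed codes, a minimal hybrid sketch in C could look like the following; the array size and the reduction are arbitrary placeholders.

    /* Minimal hybrid MPI + OpenMP sketch (illustrative, not from any PRACE benchmark):
     * each MPI rank owns a slice of a distributed array, OpenMP threads share the
     * local work, and MPI combines the per-rank partial results. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define NLOCAL 1000000              /* local slice size per rank (arbitrary) */
    static double a[NLOCAL];

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED: only the main thread makes MPI calls, OpenMP regions in between. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local_sum = 0.0, global_sum = 0.0;

        /* Threads of one rank share the loop over the rank's local slice. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < NLOCAL; i++) {
            a[i] = (double)(rank * NLOCAL + i);
            local_sum += a[i];
        }

        /* Ranks combine their partial sums. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %e  (%d ranks x %d threads)\n",
                   global_sum, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

With a typical MPI + GCC toolchain this would be built with something like mpicc -fopenmp hybrid.c and run with, e.g., OMP_NUM_THREADS=4 mpirun -np 8 ./a.out. The point is simply that node-level parallelism (threads) and cluster-level parallelism (ranks) are expressed separately, which is the pattern the MPI/OpenMP entries in the table refer to.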

Applications - 4
What kind of work?
- Porting of applications to the selected prototypes (if not done already)
- Profiling and optimisation work: recommendations, best practices
- Investigation and improvement of scalability: recommendations, best practices
- Preparation of the decision-making process on the final benchmark suite
Per application, one person is responsible for the work on all aspects:
- Responsibility includes porting, optimisation and petascaling
- Benchmark Code Owner (BCO)
- May manage a team of co-workers

Applications - 5
Other assumptions/goals:
- Each prototype platform has at least 5 to 6 applications
- Each application is ported to and run on at least 3 prototype platforms (where possible)
- Proportional distribution of applications to PRACE partners (BCOs) with respect to available manpower (person-months)

Applications - 6 (status as of June 2009)
Porting status per application across the prototypes (MPP-BG, MPP-Cray, SMP-TN-x86, SMP-FN-pwr6, SMP-FN+Cell, SMP-TN+vector):
- QCD: Done, Done, Done
- VASP: Done, Done, Stopped, Yet to start
- NAMD: Done, Done, Done, Yet to start
- CPMD: Done, Done, Done, Done
- Code_Saturne: Done, Done, Done, Stopped, Done
- GADGET: Done, Done, Done
- TORB: Done, Done, In progress
- ECHAM5: Stopped, Done, Done, Yet to start
- NEMO: Done, Done, Done, Done
- CP2K: Done, Done, Done
- GROMACS: Done, Done, Done
- NS3D: Done, In progress, Done, Done
- AVBP: Done, Done, Done
- HELIUM: Done, Done, Done, Done
- TRIPOLI_4: In progress, Done
- PEPC: Done, Done, Done
- GPAW: Done, Done, Done
- ALYA: Done, Done
- SIESTA: (no status listed)
- BSIT: Done, Done

Applications - 7
QCD kernel contents:
1. Berlin Quantum Chromo Dynamics
2. SU3_AHiggs
3. Wuppertal Portable Inverter
4.-5. Two others (unnamed)
Scalability for all kernels is good (the example here is kernel 3).
[Figure: performance (1/sec) versus partition size (Tflop/s) for kernel 3 on IBM BlueGene/P, Cray XT5 and IBM Power6]
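A common way to quantify the "good scalability" claim (a general definition, not a PRACE-specific metric) is the relative speedup and parallel efficiency with respect to the smallest partition measured:

    S(P) = \frac{T(P_0)}{T(P)}, \qquad
    E(P) = \frac{S(P)}{P / P_0} = \frac{P_0 \, T(P_0)}{P \, T(P)}

where T(P) is the runtime on a partition of size P and P_0 is the baseline partition. E(P) close to 1 corresponds to the near-linear performance-versus-partition-size behaviour shown for the QCD kernels.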

Applications - 8
PRACE aims to facilitate European scientific research with top HPC facilities. Hence, the benchmark suite needs to cover:
- Relevant scientific areas (already done)
- Scalability (potential) on petascale systems
- Licensing aspects with respect to further usage
- US/Japan efforts on applications
- Extension in certain scientific areas
These aspects (part 2 of the process) have been investigated to define the final PRACE Application Benchmark Suite.

Final Benchmark Set
Final PABS:
- Particle Physics: QCD
- Computational Chemistry: CPMD, CP2K, NAMD, GROMACS, HELIUM, GPAW, Quantum_Espresso, Octopus
- Computational Fluid Dynamics: SATURNE, AVBP, NS3D, ALYA
- Astronomy and Cosmology: GADGET
- Earth Sciences: NEMO, BSIT, SPECFEM3D, WRF
- Plasma Physics: PEPC, TORB/EUTERPE
- Computational Engineering: TRIPOLI_4, ELMER

Access to the prototypes - 1
Available for testing both internal and external to PRACE:
Host | Machine | Committed dedicated (cores) | Committed non-dedicated (%) | #cores total non-dedicated | Available since | #months
FZJ | BlueGene | 0 | 2.00 | 65536 | Jul 2008 | 17
CSC/CSCS | Cray XT5 | 1440 | 10.00 | 9424 | Dec 2008 | 12
FZJ+CEA | TN-x86 (CEA) | 1024 | 0.00 | 0 | Apr 2009 | 8
FZJ+CEA | TN-x86 (FZJ) | 0 | 2.00 | 17664 | Jun 2009 | 6
NCF/SARA | FN-pwr6 | 128 | 5.00 | 3328 | Dec 2008 | 12
BSC | Hyb-Cell | 144 | 0.00 | 0 | Dec 2008 | 12
HLRS | Hyb-vector | 96 | 0.00 | 0 | Mar 2009 | 9
Until Aug. 2009: ca. 325k core hours used on the NCF/SARA FN-pwr6 for internal PRACE work (benchmark runs)

Access to the prototypes - 2
External access to the prototypes: http://www.prace-project.eu/prototype-access
- Fill out the prototype access form (available on the web site)
- Applicable to European groups from academia and industry
- Assessment of proposals is done by the PRACE Prototype Access Committee, with representatives from the prototype host centre and from PRACE WP5 and WP6
- Applications can be sent all year round
- In 2009 the application deadlines are April 15, June 15 and September 15; decision dates are May 1, July 1 and October 1, respectively

Access to the prototypes - 3
The following criteria are used to assess proposals for access to the PRACE prototypes:
1. Amount of time requested
2. Potential of the information being useful to the PRACE project
3. Evaluation and testing purposes only
4. The code being investigated (codes already included in the PRACE benchmark suite will not be given high priority)
5. Comparisons of prototypes
6. Written assessment of results
7. Research into potential petascale optimisation techniques
8. Licensing of the code

Access to the prototypes - 4
The first projects were granted access to the PRACE prototype systems in July 2009: a total of 4.4 million core hours to 3 projects:
- Simulation of interfaces of biological systems (Niall English, Dublin)
- Porting the astrophysical fluid code HYDRA (Turlough Downes, Dublin)
- Development of the X10 environment on a parallel machine (Marc Tajchman, CEA)
Supporting training events:
- Code porting workshops: http://www.prace-project.eu/hpc-training-events

Future architectures - 1
Investigation of future Petaflop/s computer technologies beyond 2010:
- Set up a permanent structure (STRATOS) to enable research on future technology and software, with academic and industrial collaborators
- STRATOS: the PRACE advisory group for Strategic Technologies
- Define promising technologies for mainstream use in the 2011-2012 time frame, and install these as prototype systems
- This has led to another set of more experimental prototypes (which include software developments)
- Access to these prototypes is to be discussed with the respective owners

Future architectures - 2
Sites, hardware/software and porting effort:
- CEA (GPU/CAPS): 1U Tesla Server T1070 (CUDA, CAPS, DDT), Intel Harpertown nodes. Evaluate GPU accelerators and GPGPU programming models and middleware (e.g., port a pollutant migration code, a ray-tracing algorithm, to CUDA and HMPP).
- CINES-LRZ (LRB/CS): hybrid SGI ICE2/UV with Nehalem-EP & Nehalem-EX / ClearSpeed / Larrabee. Codes: GADGET, SPECFEM3D_GLOBE, RAxML, Rinf, RandomAccess, ApexMap, Intel MPI BM.
- CSCS (UPC/CAF): prototype PGAS language compilers (CAF + UPC for Cray XT systems). The applications chosen for this analysis will include some of those already selected as benchmark codes in WP6.
- EPCC (FPGA): Maxwell FPGA prototype (VHDL support and consultancy plus software licenses, e.g. Mitrion-C). Several of the PRACE benchmark codes will be ported to the system, chosen based on their suitability for execution on such a system.

Future architectures - 3
Sites, hardware/software and porting effort:
- FZJ (BSC), Cell & FPGA interconnect: eQPACE (PowerXCell cluster with a special network processor). Extend the FPGA-based interconnect beyond QCD applications.
- LRZ, RapidMind (streaming-processing programming paradigm) on x86, GPGPU and Cell: ApexMap, multigrid, FZJ (QCD), CINECA (linear algebra kernels involved in solvers for ordinary differential equations), SNIC (astronomical many-body simulation).
- NCF, ClearSpeed: ClearSpeed CATS 700 units. Iterative sparse solvers with preconditioning, a finite element code, cryomicrotome image analysis.
- CINECA: I/O subsystem (SSD, Lustre, pNFS).
- KTH: standalone system with AMD Istanbul 6-core CPUs (subject to formal EC approval). Exploit novel AMD power-control features to maximise energy efficiency.

Future PRACE implementation phase
The PRACE project ends on 31 December 2009. The final deliverables of the project prepare for continuation and/or implementation:
- Legal aspects such as statutes and the seat of the office
- Procurement schedules
- Tier-0 access, operation and ownership models
- European-wide peer review
- Benchmark set for future procurements
Preparation of prolongation proposals is in progress in EC FP7 calls:
- Software aspects: 100k+ heterogeneous cores, exascale systems
- Linkage to global efforts like the International Exascale Software Project (IESP)

Exascale computing - 1
IESP project:
- Assess the short-term, medium-term and long-term needs of applications for peta-/exascale systems
- Explore how laboratories, universities and vendors can work together on coordinated HPC software
- Understand existing R&D plans addressing new programming models and tools for extreme scale, multicore, heterogeneity and performance
- Start development of a roadmap for software on extreme-scale systems
Starting to get organised:
- Workshops held in the US (Santa Fe, April 2009) and Europe (Paris, June 2009)
- Next one in Japan (Tsukuba, October 2009)

Exascale Computing - 2
[Figure (PRACE); source: Rick Stevens, Argonne National Laboratory]

Exascale computing - 3
- Levels of concurrency: 10^6 now, 10^9 at exascale
- Clock rate of a core: 1-4 GHz now, 1-4 GHz at exascale
- RAM per core: 1-2 GB now, 1-4 GB at exascale
- Total number of cores: 200K now, 100M at exascale
- Number of cores per node: 8 now, 64-512 at exascale
- Aggressive fault management in HW and SW
- I/O channels: >10^3 now, 10^5 at exascale
- Power consumption: 10 MW now, 40-150 MW at exascale
- Programming model: MPI now, MPI + X at exascale
Source: Rick Stevens, Argonne National Laboratory
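An illustrative back-of-envelope reading of these numbers (not taken from the slide): since per-core clock rates stay flat at 1-4 GHz, essentially the whole thousandfold step from petascale to exascale must come from added parallelism. Assuming a few Gflop/s to a few tens of Gflop/s per core:

    \frac{10^{18}\ \mathrm{flop/s}}{10^{9}\text{--}10^{10}\ \mathrm{flop/s\ per\ core}}
    \approx 10^{8}\text{--}10^{9}\ \text{cores}

which is consistent with the listed jump from about 10^6 concurrent activities today to 10^9 at exascale.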

Exascale computing - 4
Many challenges:
- Power consumption
- Software development to enable efficient usage of millions of cores simultaneously: system software, numerical libraries, programming models, languages, applications
- And more
NL involvement: Patrick Aerts in the Executive Committee, Peter Michielse in the Program Committee
Source: Rick Stevens, Argonne National Laboratory

Thank you! Questions?