Red Española de Supercomputación Zaragoza


Red Española de Supercomputación. Zaragoza, July 1, 2010. Mateo Valero, Director

Top10

Looking at the Gordon Bell Prize
- 1 GFlop/s; 1988; Cray Y-MP; 8 processors. Static finite element analysis.
- 1 TFlop/s; 1998; Cray T3E; 1024 processors. Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method.
- 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors. Superconductive materials.
- 1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads).
Jack Dongarra
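The milestones above advance by a factor of 1000 every ten years; a quick sketch (illustrative, not from the slide) of the implied annual growth rate:

```python
# Gordon Bell Prize milestones quoted above: (flop/s, year).
milestones = [(1e9, 1988), (1e12, 1998), (1e15, 2008), (1e18, 2018)]

# Each step is 1000x in 10 years, so the implied annual growth
# factor is 1000**(1/10), i.e. performance roughly doubles per year.
for (f0, y0), (f1, y1) in zip(milestones, milestones[1:]):
    annual = (f1 / f0) ** (1.0 / (y1 - y0))
    print(f"{y0}-{y1}: x{f1 / f0:,.0f} overall, x{annual:.2f} per year")
```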

Cores in the Top25 over the last 10 years

Exponential growth in parallelism for the foreseeable future

Increasing chip performance: Intel's Petaflop chip. 80 processors in a die of 300 square mm; terabytes per second of memory bandwidth. Note: the Teraflops barrier was first broken by Intel in 1996 using about 10,000 Pentium Pro processors contained in more than 85 cabinets occupying 200 square meters. This will be possible in 3 years from now. Thanks to Intel. ICPP-2009, September 23rd, 2009.

Intel/UPC: since 2002 (Roger Espasa, Toni Juan); 40 people; microprocessor development (Larrabee x86 many-core)

NVIDIA Fermi Architecture
- 16 Streaming Multiprocessors (512 cores) execute Thread Blocks
- 620 Gigaflops
- Wide DRAM interface provides 12 GB/s bandwidth
- Unified 768 KB L2 cache serves all threads
- GigaThread hardware scheduler assigns Thread Blocks to SMs

Cell Broadband Engine(TM): A Heterogeneous Multi-core Architecture. (*Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.)

Hybrid SMP-cluster parallel systems. Nodes (SMP memory, multicore chips) are linked by an interconnect (Myrinet, IB, GigE; 3D torus, tree, ...). A node may contain homogeneous multicore (e.g. Larrabee), heterogeneous multicore (e.g. Cell), or general-purpose cores plus accelerators (GPU, FPGA, or an ASIC such as Anton for MD), with a network-on-chip (bus, ring, direct, ...) inside the chip.

HPC hierarchy in the current Top 10. Based on the June 2010 list.

Evolution towards Exaflop supercomputers. [Table, partly garbled in transcription: cores, nodes, chips/node, cores/chip, ops/core and clock for the current #1, #2 (Fermi) and #3 (Cell) systems of 6/2010, all of the Top500 (6/2010), personal supercomputers (CPU and CPU/GPU), the announced Sequoia, and a possible exaflop system.] Sequoia, LLNL (announced): 20 PF/s, 1.6 PB memory, 96 racks, 98,304 nodes, 1.6 M cores (1 GB/core), 50 PB Lustre file system, 6.0 MW power.


Holistic approach towards exaflop: applications, performance tools, programming model, load balancing, interconnection, processor/node architecture. Can you imagine how it would be if there was no distance? If everything was here? Yes, he can! Thanks to Jesus Labarta. New York, June 9th, 2009.

The holistic approach towards exaflop: applications, performance tools, programming model, load balancing, interconnection, processor/node architecture; address spaces, latency, dependences. Can you imagine how we should think exaflop? In a holistic way? Yes, he can! New York, June 10th, 2009.

Holistic approach towards exaflop. Applications: computational complexity, asynchronous algorithms, moldability, user satisfaction. Programming model: malleability, concurrency extraction, work generation, dependencies, address space. Run time: load balancing, resource awareness, power efficiency, locality optimization, overhead, run-time support. Job scheduling. Interconnection: topology and routing, external contention. Processor/node architecture: NIC design, hw counters, memory subsystem, core structure. YES... We can. M. Valero, keynote at ICS, NY, June 2009.

Challenges: view from Jesús Labarta
- Variability: everywhere, huge.
- Efficiency: performance and power reasons; avoid overkills.
- Memory: logical and physical structure; bandwidth and latency.
- Resilience: impossible to run an app without errors happening halfway.
- Constraint: power.
- Locality: scheduling, minimize bandwidth.
- Programmability: Don Grice: "we can do the hardware, but if it cannot be programmed..." (approx.)
- Programming model: machine independent; what, not how; smooth migration path.
- Runtime/execution model: data access awareness; asynchrony/dataflow; automatic load balance; malleability.
- Algorithms: asynchrony, overlap; minimize bandwidth; resilience, from recovery to tolerance.
- Holistic: applications as co-design vehicles between system software layers and architecture; tools: fly with instruments; the importance of detail.
Jesús Labarta

BSC-CNS and international initiatives: IESP. Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment. Build an international plan for developing the next generation of open-source software for scientific high-performance computing.

Back to Babel? Book of Genesis: "Now the whole earth had one language and the same words... Come, let us make bricks, and burn them thoroughly. Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves." And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech." The computer age: Fortran & MPI; then C++, Cilk++, Fortress, X10, CUDA, Sisal, HPF, RapidMind, StarSs, Sequoia, CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI.

Different models of computation. The dream of automatic parallelizing compilers has not come true, so the programmer needs to express opportunities for parallel execution in the application: SPMD (OpenMP 2.5), nested fork-join (OpenMP 3.0), DAG data flow (huge lookahead and reuse; latency/EBW/scheduling). And asynchrony (MPI and OpenMP are too synchronous): collectives/barriers multiply the effects of microscopic load imbalance, OS noise, ...
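A minimal sketch of the DAG data-flow idea, assuming nothing about any particular runtime (StarSs, OpenMP tasks, etc.): tasks fire as soon as their inputs are available, rather than waiting at a global barrier as in a synchronous SPMD phase. The function name and the toy graph are invented for illustration:

```python
# Minimal data-flow executor sketch (not any specific runtime's API).
def run_dataflow(tasks):
    """tasks: name -> (list of input names, function(list of inputs) -> value)."""
    done = {}
    pending = dict(tasks)
    while pending:
        # A task is ready as soon as all its inputs have been produced.
        ready = [n for n, (deps, _) in pending.items()
                 if all(d in done for d in deps)]
        if not ready:
            raise RuntimeError("cycle in task graph")
        for name in ready:  # a real runtime would run these concurrently
            deps, fn = pending.pop(name)
            done[name] = fn([done[d] for d in deps])
    return done

# Tiny invented graph: c depends on a and b; d on b; e consumes c and d.
graph = {
    "a": ([], lambda _: 1),
    "b": ([], lambda _: 2),
    "c": (["a", "b"], lambda v: v[0] + v[1]),
    "d": (["b"], lambda v: v[0] * 10),
    "e": (["c", "d"], lambda v: v[0] + v[1]),
}
print(run_dataflow(graph)["e"])  # 3 + 20 = 23
```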

The holistic approach towards exaflop.
- Applications: clean programming practices; abstraction.
- Programming model: asynchrony/dependences/dataflow; data access specification; abstract/simple memory model; fine-grain DLP; basic transformations (unroll, ...).
- Performance tools: mechanisms to inject probes; monitor/set tunable and scheduling variables.
- Load balancing: merge nicely with node-level asynchrony.
- Interconnection: detect data production to enable overlapped transfer.
- Processor/node architecture: efficient support for basic mechanisms (data transfers, thread creation, dependence handling).
(Cross-cutting concerns: latency, bandwidth, parallelism, scheduling, power.)

The TEXT project: Towards EXaflop applications. Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way. Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC. Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms. Develop additional environment capabilities: tools (debug, performance) and improvements in runtime systems (load balance and GPUSs). Support other users: identify users of TEXT applications; identify and support interested application developers. Contribute to standards (OpenMP ARB, PERI-XML).

BSC-CNS. The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) was established on April 1, 2005 as the Spanish National Laboratory in supercomputing. In the short term, CSIC will join the BSC board of trustees. The new ownership percentages will be: Spanish Government 54%; Catalan Government 30%; UPC 11%; CSIC 5%.

BSC: Spanish National Center. More than 300 people from 27 different countries (Argentina, Belgium, Brazil, Bulgaria, Canada, China, Colombia, Cuba, Dominican Republic, France, Germany, India, Iran, Ireland, Italy, Jordan, Lebanon, Mexico, Pakistan, Poland, Russia, Serbia, Spain, Turkey, UK, USA).


Awaiting a new MareNostrum. Timeline 2004-2011: MNv1 (2004), MNv2 (2006), MNv3 projected (50 TF, 100 TF, 400 TF stages).
Top500 positions (world / Europe): Nov04 4/1; Jun05 5/1; Nov05 8/1; Jun06 11/3; Nov06 5/1; Jun07 9/1; Nov07 13/3; Jun08 26/8; Nov08 40/10; Jun09 60/19; Nov09 77/22. (MareNostrum v1 through Jun06; MareNostrum v2 from Nov06.)

Top20 in Europe

Red Española de Supercomputación:
- MareNostrum, BSC
- Altamira, Universidad de Cantabria
- CaesarAugusta, Universidad de Zaragoza
- Magerit, Universidad Politécnica de Madrid
- Picasso, Universidad de Málaga
- La Palma, IAC
- Tirant, Universidad de Valencia
- Atlante, ITC

1550 e-science projects

Dynamic load balancing.
MPI + OpenMP: over-commit threads/core (only one per processor active at a time); shift processors between processes within the node; e.g. GADGET at 800 procs: 2.5x speedup.
MPI + StarSs (in development): should result in higher flexibility; overcoming Amdahl's law in hybrid parallelization; start with one process per core; StarSs only in unbalanced regions.
LeWI: A Runtime Balancing Algorithm for Nested Parallelism. M. Garcia, J. Corbalan, J. Labarta. ICPP09.
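Why imbalance hurts at scale can be sketched with Amdahl's law: any serialized or imbalanced fraction acts as a cap on speedup, which is what runtime rebalancing (as in LeWI) attacks. The 5% figure below is an invented illustration, not a measurement from the slide:

```python
# Amdahl's law: with serial fraction s, speedup on p cores is
# 1 / (s + (1 - s) / p). Load imbalance behaves like an extra
# serial fraction, so rebalancing at runtime pays off.
def amdahl(s, p):
    return 1.0 / (s + (1.0 - s) / p)

# Illustrative: a 5% serialized/imbalanced fraction caps 800-way
# speedup far below the ideal 800.
print(round(amdahl(0.05, 800), 1))  # 19.5
```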

Performance analysis of real production applications: describe actual behavior, identify optimization approaches, estimate potential; to optimize the applications in cooperation with developers, and to drive other developments (programming model, run times, architecture, interconnect). Findings: GROMACS: serialization; PEPC: load balance; SPECFEM3D: endpoint contention; GADGET: per-code-region parallel efficiencies of 93.67%, 97.49% and 99.11%. [Charts: predicted speedup vs. CPU ratio for bandwidths of 1 to 100 MB/s against the ideal; % of computation time and % of elapsed time per code region.]

Kaleidoscope Project.
Platform | Gflops | Speed-up | Power (W) | Gflops/W
JS21 | 8.3 | 1 | 267 | 0.03
QS22 | 116.6 | 14 | 370 | 0.32
2x TESLA 1060 | 350 | 42 | 90+368.8 | 0.76
The work of 3 months is now done in 1 week (speed-up 14) or in 2 days (speed-up 42). On the Cell, 23.5 GB/s of memory bandwidth is used out of a 25.6 GB/s maximum; on TESLA the I/O is now the real bottleneck. Awarded by IEEE Spectrum as one of the 2008 top 5 innovative technologies; Platts award to the commercial technology of the year 2009. Barcelona, February 10, 2010.
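The Gflops/W column can be checked directly (a small sketch; the TESLA power is read as host plus accelerators, 90 + 368.8 W):

```python
# Checking the efficiency column of the Kaleidoscope table:
# Gflops/W = Gflops / Power(W).
rows = [("JS21", 8.3, 267.0),
        ("QS22", 116.6, 370.0),
        ("2x TESLA 1060", 350.0, 90.0 + 368.8)]
for name, gflops, watts in rows:
    print(f"{name}: {gflops / watts:.2f} Gflops/W")  # 0.03, 0.32, 0.76
```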

Consolider. The BSC-CNS coordinates a Consolider programme on supercomputing and e-science, which brings together research groups expert in applications that require supercomputing and groups expert in the design of supercomputer hardware and base software. (Supporting areas: compilers and tuning of application kernels; programming models and performance tuning tools; architectures and hardware technologies.)
- Application scope Life Sciences: Grupo de Modelización Molecular y Bioinformática (U. de Barcelona, M. Orozco); Grup de Bioinformàtica del Genoma (Centre de Regulació Genòmica, R. Guigó); Grupo de Biología Estructural Computacional (Centro Nacional de Investigaciones Oncológicas, A. Valencia).
- Application scope Earth Sciences: Grupo de Ciencias de la Tierra (BSC-CNS, J. M. Baldasano); Unidad de Contaminación Atmosférica (CIEMAT, F. Martín); Grupo de Diagnóstico y Modelización del Clima (U. Complutense de Madrid, R. García-Herrera).
- Application scope Astrophysics: Grupo de Astrofísica Relativista y Cosmología (U. de Valencia, J. M. Ibáñez); Grupo de Simulación Numérica de Procesos Astrofísicos (Instituto de Astrofísica de Canarias, F. Moreno); Grupo GAIA de Astronomía Galáctica (U. de Barcelona, J. Torra); Grupo de Cosmología Computacional (U. Autónoma de Madrid, G. Yepes).
- Application scope Engineering: Grupo de Mecánica de Fluidos Computacional (U. Politécnica de Madrid, J. Jiménez Sendín); Unidad de Simulación Numérica y Modelización de Procesos (CIEMAT, M. Uhlmann).
- Application scope Physics: Grupo SIESTA (U. Autónoma de Madrid, J. M. Soler); Grupo SIESTA (Laboratorio de Estructura Electrónica de los Materiales, Instituto de Ciencia de Materiales de Barcelona ICMAB-CSIC, P. J. Ordejón).

Consolider (2). Basic research in supercomputing:
- Compilers and tuning of application kernels: Departamento de Tecnologías de la Información (BSC-CNS, M. Valero); Grupo Computación de Altas Prestaciones (U. Politècnica de Catalunya, J. M. Llabería); Grupo de Arquitectura y Tecnología de Sistemas Informáticos (U. Complutense de Madrid, F. Tirado); Grupo de Arquitectura de Computadores (U. de Málaga, E. López Zapata).
- Programming models and performance tuning tools: Departamento de Tecnologías de la Información (BSC-CNS, M. Valero); Grupo Computación de Altas Prestaciones (U. Politècnica de Catalunya, J. M. Llabería); Parallel Processing and Distributed Systems group (U. Autónoma de Barcelona, A. Ripoll).
- Architectures and hardware technologies: Departamento de Tecnologías de la Información (BSC-CNS, M. Valero); Grupo Computación de Altas Prestaciones (U. Politècnica de Catalunya, J. M. Llabería); Grupo de Arquitectura de Computadores (U. de Zaragoza, V. Viñals); Grupo de Arquitectura y Tecnología de Sistemas Informáticos (U. Complutense de Madrid, F. Tirado); Grupo de Arquitectura y Tecnología de Computadores (U. de Cantabria, J. R. Beivide); Grupo de Arquitectura de Computadores (U. de Málaga, E. López Zapata); Grupo de Arquitectura de Computadores (U. de Las Palmas de Gran Canaria, Instituto Universitario de Ciencias y Tecnologías Cibernéticas, E. Fernández).

SyeC Consolider Project. Nanotechnology (SIESTA code): improving scalability up to 1000 cores; load balancing; hybrid parallelism MPI+OpenMP; parallel I/O. [Chart: speed-up vs. number of processors (1 to 1000) for the matrix-vector operations, the eigenvalue solver and the Hamiltonian builder, against the ideal.]

SyeC Consolider Project. Astrophysics codes: general relativistic resistive magnetohydrodynamics; porting to MPI + OpenMP or GPUs; GAIA operative infrastructure.

SyeC Consolider Project. DNS in CFD (turbulence + particles): global communications are critical; boundary-layer modeling; hybrid MPI+OpenMP code; Re_theta = 6150 on BG/P at ANL, 32768 cores (34 M hours); parallel I/O; 100 TB of total output data, about 5 years to analyze these data.

SyeC Consolider Project. Atmospheric modeling: new numerical schemes; improving scalability; MPI+OpenMP+parallel I/O; Cell and GPU code; first dust simulation including all processes except dry deposition.

SyeC Consolider Project. Life Sciences (protein interactions & genomics): workflows; databases; GPU code porting; new sequencing hardware. Small folding: 1 experiment yields 4 TB of data (40 GB of processed data); 1 machine runs 2 experiments a week; a medium-sized center has 10 machines.

SyeC Consolider Project. Computer Sciences: code scalability; programming models; computer architecture.

PRACE Project Consortium: Tier-0 hosting partners and Tier-1 general (non-hosting) partners.

First machine available: JUGENE, IBM BlueGene/P at FZJ, Jülich, Germany. Next machine available at GENCI, France, Summer 2011.

PRACE Early Access Call.
- Opening date: 10th May 2010
- Closing date: 10th June 2010
- Start date: 1st August or 1st December 2010
- Allocation period: 4 months
- Type of access: Project (1 proposal) or Preparatory + Project (combined, 2 linked proposals)
- Information: http://www.prace-project.eu/hpc-access

MareIncognito: project structure.
- Applications: 4 relevant apps: materials (SIESTA), geophysics imaging (RTM), computational mechanics (ALYA), plasma (EUTERPE); general kernels.
- Performance analysis tools: automatic analysis; coarse/fine-grain prediction; sampling; clustering; integration with Peekperf.
- Programming models: StarSs (CellSs, SMPSs); OpenMP@Cell; OpenMP++; MPI + OpenMP/StarSs.
- Interconnect: models and prototype; contention, collectives; overlap of computation and communication; slimmed networks; direct versus indirect networks.
- Load balancing: coordinated scheduling across run time, process and job levels; power efficiency.
- Processor and node: contribution to the new Cell design; support for the programming model, load balancing and performance tools; issues for future processors.

RIS: BSC-CNS and Latin America. Establishment of RIS (Red Iberoamericana de Supercomputación) through CYTED: shared use of resources (the RES plus Ibero-American centers); training; research. Countries: Portugal, Argentina, Brazil, Colombia, Chile, Dominican Republic, Cuba, Mexico, Ecuador. Connecting RIS to EU programmes. Barcelona, February 10, 2010.

Barcelona Computing Week, July 5-9, 2010: Programming and Tuning Massively Parallel Systems. Coordinators: Mateo Valero (BSC), Wen-mei Hwu (Illinois). Instructors: Wen-mei Hwu (University of Illinois), David B. Kirk (NVIDIA Corporation). Audience: three parallel tracks specially designed for beginner, advanced and teacher profiles. Programming languages: CUDA, OpenCL, OpenMP, StarSs. Numerical methods and case studies: FFT, graph, tiling, grid, Monte Carlo, FDTD, sparse matrices. Hands-on labs: afternoon labs with teaching assistants for each audience/level. http://bcw.ac.upc.edu. 250 applications for July 2010.

ACACES 10, the HiPEAC Summer School: a one-week summer school for computer architects and compiler builders. 294 applications, 197 participants; 17 industry attendants from 8 companies; 31 countries. Keynotes: Insup Lee (University of Pennsylvania), Jesus Labarta (BSC). Courses:
- Andreas Herkersdorf (TU Muenchen): Application-Specific (MP)SoC Architectures for Internet Networking
- Michael Scott (University of Rochester): Transactional Memory
- Vivek Sarkar (Rice University): Multicore Programming Models and their Compilation Challenges
- David Brooks (Harvard University): Variation-Aware Processor Design
- Derek Chiou (University of Texas at Austin): Fast and Accurate Computer System Simulators
- Scott Mahlke (University of Michigan): Compilation for Multicore Processors
- Dan Sorin (Duke University): Fault Tolerant Computer Architecture
- Donatella Sciuto (Politecnico di Milano): FPGA-based reconfigurable computing
- Steven Hand (Citrix): System Virtualization
- Theodore Ts'o (Google): File Systems and Storage Technologies
- Mahmut Kandemir (Pennsylvania State University): Embedded Systems: A Software Perspective
- Andrzej Brud and Per Stenström (Chalmers): How to transform research results into a business

BMW10: Barcelona Multicore Workshop, October 22-23, 2010. Multi-core and many-core processors have already arrived. The issue facing the software community is how to program those machines in the most productive way; the hardware community has to design the many-cores so as to maximize the potential performance. The BMW workshop consists of a combination of invited talks, two panel discussions and time for discussion. Co-organized by the BSC-Microsoft Research Center and HiPEAC. Organizers: BSC-Microsoft: Mateo Valero, Fabrizio Gagliardi, Osman Unsal; HiPEAC: Per Stenstrom, Georgi Gaydadjiev, Manolis Katevenis, Eduard Ayguade.

Master on HPC.
Kernel: Supercomputer Architectures; Methods and Algorithms for Parallel Programming; Optimization and Parallelization of Numerical Simulations.
Free options: High Performance Computational Mechanics; Performance tuning and analysis tools; Data Mining; Seminars on Supercomputing I, II, III.
Applications: Computational Astrophysics; Bioinformatics; Earth Sciences; Applications of Computational Astrophysics; Applications of Bioinformatics; Applications of Earth Sciences.
Thesis.

Education for Parallel Programming. [Slide images: multi-core programming, many-core programming, massive parallel programming, games; a multicore-based pacifier.]

Thank you very much. Barcelona, February 10, 2010.

SIESTA project: ab-initio DFT molecular dynamics code; BSC is working on its development. Example: Neptune mid-layer (H2O + NH3 + CH4). Temperature: 1500 K to 2500 K; pressure: 0.15 GPa to 60 GPa; simulated time: 10 ps, equivalent to 20,000 molecular dynamics steps; number of atoms: 1269 (100 processors, 2007); now more than 500,000 atoms using more than 1000 processors.

High Performance Computing as a key enabler. Capacity (number of overnight load cases run, from 10^2 to 10^6) grows with the available computational capacity [Flop/s], moving from low-speed RANS through high-speed RANS and unsteady RANS to LES. Timeline: 1980 HS design; 1990 data set; 2000 CFD-based loads & HQ; 2010 aero optimisation & CFD-CSM; 2020 full MDO capability achieved during a one-night batch, CFD-based noise simulation; 2030 real-time CFD-based in-flight simulation. Smart use of HPC power: algorithms; data mining; knowledge. Courtesy AIRBUS France.

Design of the ITER TOKAMAK (JET, Oxford)

Weather, Climate and Earth Sciences: Roadmap.
- 2009: resolution 80 km; memory 110 GB; 3x10^14 FLOPS; storage 8 TB; a 40-day run on a NEC SX-9 with 48 vector processors.
- 2015: resolution 20 km; memory 3.5 TB; 1x10^16 FLOPS; storage 180 TB; high-resolution model with complete carbon-cycle model. Challenges: data visualization and post-processing, data discovery, archiving.
- 2020: resolution 1 km; memory 4 PB; 1x10^19 FLOPS; storage 150 PB; higher resolution with global cloud-resolving model. Challenges: data sharing, transfer, memory management, I/O management.
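Reading the roadmap's figures as roughly 3x10^14, 1x10^16 and 1x10^19 flop/s at 80 km, 20 km and 1 km resolution, an illustrative fit (not from the slide) of the implied cost-versus-resolution exponent:

```python
import math

# Roadmap figures read as (resolution_km, flop/s); the fitted exponent
# alpha in cost ~ resolution**(-alpha) is illustrative only.
points = [(80, 3e14), (20, 1e16), (1, 1e19)]
for (r0, f0), (r1, f1) in zip(points, points[1:]):
    alpha = math.log(f1 / f0) / math.log(r0 / r1)
    print(f"{r0} km -> {r1} km: cost grows as resolution^-{alpha:.1f}")
```

Both steps come out near alpha = 2.5, i.e. between pure 2D horizontal refinement (alpha = 2) and refinement that also shortens the timestep (alpha = 3).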

Supercomputing, theory and experimentation. Courtesy of IBM.


Jaguar @ ORNL: 1.75 PF/s Cray XT5-HE system. Quad-core AMD Opteron processors running at 2.6 GHz, 224,162 cores; power: 6.95 MW; 300 terabytes of memory; 10 petabytes of disk space; 240 gigabytes per second of disk bandwidth; Cray's SeaStar2+ interconnect network. Jack Dongarra.

PRACE AISBL. First Council meeting of the legal form; three more countries formally adhered to the legal form: Sweden, Cyprus and the Czech Republic. The BoD was formally appointed and its Chair selected: Sergi Girona. Operating budget approved.

PRACE regular calls.
- Preparatory access: code testing and optimisation, technical support if requested; continuous call, fast-track assessment; maximum allocation of 6 months.
- Project access: 2 calls per year; 1-year allocation; allocation in November and May.
- Programme access: large projects of a research group; 2-year allocation; calls coincide with the calls for project access; possible review at the end of the first year's allocation.
A final report is mandatory for all types of proposals.

PRACE regular calls. The 1st PRACE regular call opened 15th June for allocations starting 1st November. The only available machine at present is JUGENE. All proposals will be subject to PRACE Peer Review, which will be handled on-line. The Scientific Steering Committee will be responsible for advising on the scientific direction of PRACE.

Grand Challenge problems.
- Systems biology: modeling & simulation leading to predictive models with clinical or environmental impact; multi-scale patient-specific data; genetic variability; gene and protein expression profiling; multi-modal imaging; data analysis and modeling.
- Sustainable systems: taking the multi-scale nature into account; models linked to experimental data, providing corroboration of experiments.
- Turbulence & chaos: characterize boundary-layer effects and their impact on the global solution and stability.
- Environmental: global warming/climate change; energy; water; biodiversity and land use; chemicals, toxics and heavy metals; air pollution; waste management; stratospheric ozone depletion; oceans & fisheries; deforestation.

BSC-CNS: synergy with infrastructures. The BSC-CNS is a fundamental complement to experimental scientific infrastructures: IAC, ICFO, the Synchrotron, IRB, CIEMAT (TJ-II). Barcelona, July 20, 2009.

Factors that necessitate redesign.
- Steepness of the ascent from terascale to petascale to exascale.
- Extreme parallelism and hybrid design: preparing for million/billion-way parallelism.
- Tightening memory/bandwidth bottleneck: limits on power/clock speed imply multicore; the pressure to reduce communication will become much more intense; memory per core changes, and the byte-to-flop ratio will change.
- Necessary fault tolerance: MTTF will drop; checkpoint/restart has limitations; the software infrastructure does not exist today.
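On the checkpoint/restart limits above: a common first-order estimate (Young's approximation, not mentioned on the slide) shows how the interval between checkpoints shrinks as MTBF drops. The numbers below are invented illustrations:

```python
import math

# Young's first-order approximation for the optimal checkpoint interval:
# T_opt ~ sqrt(2 * C * MTBF), where C is the cost of writing a checkpoint.
# As system MTBF drops at scale, useful work between checkpoints shrinks.
def young_interval(checkpoint_cost_s, mtbf_s):
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative: 10-minute checkpoints, 1-day system MTBF.
t = young_interval(600.0, 86400.0)
print(f"checkpoint every ~{t / 3600:.1f} hours")
```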

The BSC-CNS. Mission of the BSC-CNS: to research, develop and manage technologies that facilitate the advancement of science. Objectives of the BSC-CNS: R&D in computer sciences, life sciences and earth sciences; supercomputing support for external research. 5 scientific/technical departments. Barcelona, February 10, 2010.

BSC-CNS: backbone of the supercomputing service in Spain: MareNostrum (BSC), Magerit (Universidad Politécnica de Madrid), CaesarAugusta (Universidad de Zaragoza), La Palma (IAC), Picasso (Universidad de Málaga), Altamira (Universidad de Cantabria), Atlante (Gobierno de Canarias), Tirant (Universidad de Valencia). Barcelona, February 10, 2010.

BSC-CNS and REPSOL: the Kaleidoscope project. 2008: awarded by IEEE Spectrum as one of the top 5 innovative technologies. Barcelona, February 10, 2010.

Kaleidoscope: WEM vs. RTM. Data property of TGS.
