High Performance Computing, an Introduction



Transcription:

High Performance Computing, an Introduction
Nicolas Renon, Ph.D., Research Engineer in Scientific Applications, CALMIP - DTSI, Université Paul Sabatier, University of Toulouse (nicolas.renon@univ-tlse3.fr)
Michel Fournié, Maître de conférences, IMT, Université de Toulouse
Toulouse University Computing Center: http://www.calmip.cict.fr
Page 1

Summary: High Performance Computing - HPC
HPC, what is it? What for?
Concepts in HPC (processors and parallelism): Moore's Law, Amdahl's Law
HPC computational systems and processor architecture
Taxonomy: Shared Memory Machines, Distributed Memory Machines; scalar, superscalar, multi-core processors
Code optimisation
Parallel programming
Make it happen: Message Passing Interface MPI (+BE)
Execution model: why exchange messages
MPI library: most important routines
Basics: speedup, parallel efficiency and traps
Page 2

Summary: High Performance Computing - HPC
HPC, what is it? What for?
Concepts in HPC (processors and parallelism): Moore's Law, Amdahl's Law
HPC computational systems and processor architecture
Taxonomy: Shared Memory Machines, Distributed Memory Machines; scalar, superscalar, multi-core processors
Code optimisation
Parallel programming
Make it happen: Message Passing Interface MPI (+BE), OpenMP
Basics: speedup, parallel efficiency and traps
Page 3

Principle of supercomputers: HPC computing systems
Hardware and software achieve performance on floating-point computation: Flop/s + memory (RAM and file system)
Flop: floating-point operation (mult, add), e.g. 3.14159 - 6.8675456342E+08
Storage, I/O (input/output): expensive (€++ or $++)
Mutualisation: several (tens of?) users on a single (big) system
Need coherency in the use of resources: rules (fair sharing?) => job scheduler
Supercomputers are dedicated to computing scientific applications, with back-up (huge) file storage and remote access
Nowadays: huge constraints on infrastructure facilities: electricity supply (back-up), cooling, weight, safety, security
Page 4

HPC system taxonomy
Shared memory systems: multiprocessors, one single address space, shared memory
UMA: Uniform Memory Access
NUMA: Non-Uniform Memory Access
Distributed memory systems (clusters): multi-computers, multiple (different) address spaces
NORMA: no remote memory access
Page 5

OS: Operating System
Operating system ("système d'exploitation" in French): a collection of programs that unifies the hardware in order to make it usable!
Examples: Windows, Linux, Mac OS X; Unix-like: AIX (IBM), HP-UX (HP), Solaris (Sun)
The OS allocates hardware resources to your program.
(Operating system diagram, Wikipedia)
Page 6

UMA architecture (shared memory)
User side: a single machine (single OS), several processors, one single memory address space. How to program: an extension of sequential programming.
Machine side: SMP, Symmetric MultiProcessor. A bus interconnect between memory and processors; central memory and I/O are shared by all processors; processors access the same memory.
(Diagram: two processors, each with a register file, functional units (mult, add), cache and cache-coherency logic, sharing memory over a bus interconnect, under one OS.)
Page 7

UMA architecture (shared memory)
Machine side: SMP, Symmetric MultiProcessor. Processors access the same memory, so cache coherency is needed when different processors modify data in the same cache line.
Page 8

Cache coherency
(Diagram, slides 9-10: a value A = 1.5E8 in shared memory is loaded over the bus interconnect into the caches of two of the three processors (FPUs); each now holds its own copy of A.)
Pages 9-10

Cache coherency
(Diagram, slides 11-12: one processor updates its cached copy to A = 0.6E-2 while the other cache and the shared memory still hold A = 1.5E8.)
Processors now have different values of A: unacceptable from the developer's point of view.
Pages 11-12

UMA machine: cache coherency
Several protocols:
Invalidation: one write invalidates the data in the other processors' caches
Snooping
Handled by the hardware and OS, not by the user. But it can slow parallel performance, so the developer may need to be aware of it (OpenMP).
Page 13

UMA architecture (shared memory)
Scalability: tests on multithreaded DGEMM (MKL, ifort v11), matrix size 10000 x 10000 (DP), automatic parallelisation.
(Chart: time in seconds versus number of cores (4, 8, 16, 32) for NHM-EX 2.6 GHz, Itanium 1.5 GHz and NHM-EP 2.8 GHz.)
Page 14

UMA architecture
Memory access: concurrent access to central memory => bottleneck; access time increases. Remedy: increase the size (and number of levels) of cache memory.
Consequence: only a small number of processors. Another paradigm/option: distribute the memory?
Page 15

Laptops, workstations and PCs are multi-core: shared memory
2 distinct cores: distinct caches or a shared cache, with cache coherency
2 dual-cores: cache coherency within each dual-core
This is a shared memory architecture! A parallel application is compulsory to exploit it.
Page 16

Distributed memory(norma) Processor and memory tighly interconnected MPP : Massively Parallel Processing : Cluster : machines(comput nodes) interconnection Page 17 Distributed memory : Cluster Cluster : massive technology A lot of processor. A lot of machine (nodes) interconnected nodes: multi-processors, multi-core M P E/S M P E/S M P E/S M P E/S interconnect E/S M E/S M E/S P M E/S M P P P Page 18 9

Distributed memory: multi-computer architecture (clusters)
Machine side: a massive technology. Each process accesses its own (local) memory space. Interconnected nodes: like the internet (ethernet), but process-to-process communication needs to be much faster (bandwidth and latency).
(Diagram: two nodes, each with main memory, register file, cache, functional units (mult, add), processor, disk and its own OS.)
Page 19

Distributed memory: multi-computer architecture (clusters)
User side: n different nodes (n OS instances) interconnected, with 1 (or more) processor per node. Parallel programming by message passing (MPI; the work is done by the developer - you?). Efficient tools are needed to properly access computing resources.
Page 20

Interconnection: the key of a parallel machine
In a parallel machine the hardware (processors/memory) has to be connected:
Specific (fast) protocols: InfiniBand, Myrinet, proprietary
Topology: ring, hypercube, torus, fat-tree, ...
Interconnection characteristics:
Latency: how much time to get connected? On the order of a microsecond.
Bandwidth (throughput): rate of data transfer, in MB/s.
Topology: how many paths from one point to another?
1 MB/s = 1 megabyte/s = 10^6 bytes/s; 1 GB/s = 1 gigabyte/s = 10^9 bytes/s; 1 TB/s = 1 terabyte/s = 10^12 bytes/s
Page 21

Interconnect
Topology:
Best choice, each processor connected to all the others: the price (affordable?) limits it to a few cores; impossible at large scale (1000+ cores).
The least bad: try to avoid bottlenecks; scalability of the network topology. Different strategies, e.g. the cross-bar.
Page 22

Interconnect: example of a 2D torus
(Diagram: a 2D torus interconnect with numbered nodes (B0-B21) and link latencies.)
Page 23

Interconnection/network: protocols
(Table of the different protocols. Source: Journées JoSy, Groupe Calcul CNRS, 13/09/2007, Lyon.)
Page 24

Top500, June 2012
TOP 500 list and interconnect: #1 in 2012, #1 in 2011
The CURIE machine, European programme "PRACE" (France)
For comparison, a nuclear plant delivers between 40 MW and 1450 MW.
Page 25

Interconnect/network in the TOP 500
Page 26

Interconnect: topology
Hypercube topology: 3D, 4D, ..., 8D
Page 27

Interconnect: topology
"Mirrored" fat-tree
Page 28

Summary: High Performance Computing - HPC
HPC, what is it? What for?
Concepts in HPC (processors and parallelism): Moore's Law, Amdahl's Law
HPC computational systems and processor architecture
Taxonomy: Shared Memory Machines, Distributed Memory Machines; scalar, superscalar, multi-core processors
Code optimisation
Parallel programming
Make it happen: Message Passing Interface MPI (+BE), OpenMP
Basics: speedup, parallel efficiency and traps
Page 29

Processor architecture: MIMD
MIMD: Multiple Instruction, Multiple Data
"Classic" processor versus HPC processor ($$), for example:
Pentium IV: 3.2 GHz, 500 kB cache, 6.4 Gflop/s peak, Linpack 0.7 Gflop/s
Itanium II: 1.5 GHz, 6 MB cache, 6 Gflop/s peak, Linpack 5.4 Gflop/s
Time to solution: 8 times faster!
Vendors: Intel (Itanium), AMD (Opteron), IBM (Power), Fujitsu (UltraSparc), NEC (vector processors)
Page 30

Processor architecture: MIMD
Processor cycle: the clock frequency is the number of pulses per second; 200 MHz = 200 million cycles per second. Aim: retire one or more operations per cycle.
Optimised architecture:
Instruction Level Parallelism (ILP): pipeline, multiple functional units (FPUs)
Hierarchical memory (access time): cache levels L1, L2, L3
Speculative execution: branch prediction, prefetching
Page 31

Processor architecture: MIMD
(Diagram: a processor with FPU/ALU and registers, a control unit, instruction memory and data memory, connected to main memory holding data and instructions.)
Page 32

Processor architecture: MIMD
Example architecture scheme of a superscalar processor: the Itanium, a "massively parallel" architecture: 2 FPUs, 4 I&MM units, 3 branch-prediction units.
Page 33

Processor architecture: MIMD
Pipeline aim: 1 cycle = 1 retired operation.
Very roughly speaking, A1 = B1 + C1 takes 3 phases (load, exec, write), at 1 phase per cycle.
Pipeline example: 3 independent operations:
A1 = B1 + C1
A2 = B2 + C2
A3 = B3 + C3
Page 34

Processor architecture: MIMD
Pipeline aim: 1 cycle = 1 retired operation.
Independent operations: A1 = B1 + C1, A2 = B2 + C2, A3 = B3 + C3, each taking 3 phases (load, exec, write).
Without pipelining, one operation at a time:
Cycle 1: load B1, C1 / Cycle 2: add B1, C1 / Cycle 3: store in A1
Cycle 4: load B2, C2 / Cycle 5: add B2, C2 / Cycle 6: store in A2
Cycle 7: load B3, C3 / Cycle 8: add B3, C3 / Cycle 9: store in A3
3 retired operations in 9 cycles; resources sit idle.
Page 35

Processor architecture: MIMD
With pipelining (latency 3 cycles, then 1 result per cycle):
Cycle 1: load B1, C1
Cycle 2: load B2, C2 / add B1, C1
Cycle 3: load B3, C3 / add B2, C2 / store in A1
Cycle 4: add B3, C3 / store in A2
Cycle 5: store in A3
3 retired operations in 5 cycles.
Moral: exhibit the maximum number of independent operations to feed the pipeline.
Page 36

Processor architecture: MIMD
Feed with data: hierarchical memory. The floating-point units work on data held in registers, which are tiny. For A1 = B1 + C1: load B1; hit? yes: load C1; no = miss: fetch from the cache, at a cost of n "wait" cycles.
Page 37

Processor architecture: MIMD
Cache levels: example of the Intel Itanium 2, latency and throughput (1.5 GHz => 1 cycle = 0.6 ns); figures as transcribed from the slide:
Registers: 128 integer registers (1 kB) and 128 FP registers (1 kB)
L1D: 16 kB, ~2 cycles
L2U: 256 kB, 5+1 cycles
L3U: 1.5 / 6 / 9 MB, 12+1 cycles, 16 read / 6 write ports
Bandwidths between levels: 6.4 to 32 GB/s
Main memory (on the Altix): 145+ ns
Page 38

Processor architecture: MIMD
Cache-level effect: (chart on the slide)
Page 39

Processor architecture: MIMD
Speculate to exhibit independent operations:
if (cond1) then a1 = b1 + c1 else a1 = b1 * c1 end if
Here a1 depends on cond1. Branch prediction "breaks" the dependency by betting that cond1 is true, computing a1 = b1 + c1 right away, and checking later.
Page 40

Processor architecture: HPC
A core: what is it?
Page 41

Processor architecture: HPC
(Diagram on the slide.)
Page 42

Processor architecture: HPC
Trends nowadays: multi-core. Chip makers ("fondeurs"): Intel, AMD, IBM, Fujitsu (Sparc).
Process shrink: 45 nm, 32 nm, 22 nm, ? nm
From 1 single-core processor to 1 multi-core processor (or socket), with multi = 2, 4, 6, 8, 10, 12, 20 ...
More raw power (x2, 4, 6, 8), better flop/watt and flop/m2 ratios, at the same frequency or lower!
Bottleneck? As in SMP: bandwidth to the RAM.
Page 43

Processor architecture: HPC
TOP 500, June 2013
Page 44

Processor architecture: SIMD
Vector processors operate on vectors (not on scalars): Single Instruction, Multiple Data (SIMD); each cycle => n scalar retired operations. The code/program must lend itself to vectorisation (it needs a lot of data). Affordable?
X   = (X1, X2, X3, ..., Xn)
Y   = (Y1, Y2, Y3, ..., Yn)
X+Y = (X1+Y1, X2+Y2, X3+Y3, ..., Xn+Yn)
The vector principle (SIMD) now lives on in scalar/superscalar processors: x86 SSE/AVX (Streaming SIMD Extensions), AltiVec, superscalar (2 FPUs).
Page 45

Processor architecture: SIMD
(Diagram: a SIMD processor - one control/instruction unit with FPU/ALU and registers driving multiple data memories, versus the MIMD scheme of page 32.)
Page 46

Processor architecture: SIMD
Accelerators, GP-GPU: General-Purpose Graphics Processing Units
Precision, compilers, languages (CUDA), etc.
Stream processing (data-centric processing): SIMD, intensive data parallelism, data locality. Future: integrated in the processor chip?
GP-GPU examples (peak, on-board RAM, RAM access time):
Tesla (2008): 500 Gflop/s, about 1 GB, 400-600 ns
Tesla Kepler: 4.5 Tflop/s, 6 GB, 400-600 ns
ATI: 1.5 Tflop/s, about 2 GB, 400-600 ns
Accelerators from Intel: Xeon Phi, 50 cores! MIMD (x86) cores.
Page 47

Processor architecture: SIMD
(Diagram: accelerator architecture - many small control + data-memory units in parallel.)
Page 48

TOP 500: accelerators
TOP 500, June 2012 and June 2013
Page 49