High Performance Computing, an Introduction to

Size: px
Start display at page:

Download "High Performance Computing, an Introduction to"

Transcription

1 High Performance ing, an Introduction to Nicolas Renon, Ph. D, Research Engineer in Scientific ations CALMIP - DTSI Université Paul Sabatier University of Toulouse (nicolas.renon@univ-tlse3.fr) Michel Fournié, Maître de conférence IMT, Université de Toulouse Toulouse University ing Center : Page 1 Summary : High Performance ing - HPC HPC, what is it? What for? Concepts in HPC (processors and parallelism) Moore s Law, Amhdal s Law HPC computational System and processor architecture Taxonomy : Shared Memory Machines Distributed Memory Machines Scalar, Superscalar, multi-core processors Code optimisation Parallel programing Make it happen : Message Passing Interface MPI (+BE) Execution model Why exchanging messages MPI Library : most important routines Basics : speed up, parallel efficiency and traps Page 2 1

2 Summary : High Performance ing - HPC HPC, what is it? What for? Concepts in HPC (processors and parallelism) Moore s Law, Amhdal s Law HPC computational System and processor architecture Taxonomy : Shared Memory Machines Distributed Memory Machines Scalar, Superscalar, multi-core processors Code optimisation Parallel ation Make it happen : Message Passing Interface MPI (+BE) OpenMP Basics : speed up, parallel efficiency and traps Page 3 Principle of Supercomputers : HPC computing Systems Hardware et Software Achieve Performance on floating computation Flop/s + Memory (RAM and file system) Flop : floating operation (mult, add) 3, , E+08 Storage, I/O (Input/output) ++ or $++ Mutualisation Several(tens?)users single (big) system Need coherency in the use of ressources Rules (Fair sharing?) => job sheduler Supercomputers are only dedicated to compute Scientific Applications Back-up (huge) file storage Remote access Nowdays : huge constraints on infrastructures facility Electricity supply (back-up) Cooling Weight Safety Security Page 4 2

3 HPC System taxonomy Shared memory Systems: Multiprocessors One single address space Shared memory UMA Uniform Memory Access NUMA Non Uniform Memory Access PROGRAM Distributed memory system(cluster) : Multi-computers Multiple (different) address space NORMA no-remote memory access Page 5 OS : Operanding System Operanding System («Système d exploitation» in French) : A collection of program that unify the hardware in order to make it usable! Exemple : Windows, Linux, MacOSX, Unix like : AIX (IBM), HPUX (HP), SOLARIS (SUN) user OS : «allocate hardware ressources to your program» Operanding system Hardware (Schéma Operanding system Wikipedia) Page 6 3

4 UMA Architecture (Shared Memory) User side: A single machine (single OS) several processors one single space memory adress How to program : extension of sequential programmation Machine side: SMP Symmetric MultiProcessor (SMP) Bus Interconnexion between memory and processors Central memory and I/O : shared by all processors Processors access to the same memory memory Register File Functional (mult, add) Register File Functional (mult, add) Cache Cache Coherency Processor Cache Cache Coherency Processor BUS interconnect OS Page 7 UMA Architecture (Shared memory) Machine side: SMP Symmetric MultiProcessor (SMP) Processors access to the same memory Cache Coherency Cache Coherency # processor modifing data in the same line Shared memory Register File Functional (mult, add) Cache Cache Coherency Processor Register File Functional (mult, add) Cache Cache Coherency Processor OS Page 8 4

5 Cache Coherency A=1.5E8 Shared memory Bus interconnect A=1.5E8 FPU FPU FPU Page 9 Cache Coherency A=1.5E8 Shared memory Bus interconnect A=1.5E8 A=1.5E8 FPU FPU FPU Page 10 5

6 Cache Coherency A=1.5E8 Shared memory Bus interconnect A=1.5E8 A=0.6E-2 A=1.5E8 FPU FPU FPU Processors have different value of A Page 11 Cache Coherency A=1.5E8 Shared memory Bus interconnect A=1.5E8 A=0.6E-2 A=1.5E8 FPU FPU FPU? Processors have different value of A Unacceptable for the developper point of view Page 12 6

7 Machine UMA Cache Coherency Several protocols Invalidation 1 write invalidate data on other processors Snooping Handle by the OS, not the user But : Can slow parallel performance Developper may be aware of that OpenMP Page 13 UMA Architecture (Shared memory) Scalability : tests on DGEMM multithreaded (MKL, ifort : v11), size matrix = x (DP) Automatic parallelisation MKL DGEMM v time in seconds NHM EX 2,6 Ghz Itanium 1,5 Ghz NHM EP 2, numbre of cores Page 14 7

8 UMA Architecture Memory access: Concurrent access to central memory => bottleneck - Time access increase. Increase size (and so level) of s memory Register File Functional (mult, add) Cache Cache Coherency Processor Register File Functional (mult, add) Cache Cache Coherency Processor OS Consequence : few number of processor Another paradigma/option : distribute memory? Page 15 Laptop, Workstation, PC are Multi-core : Shared Memory 2 cores distincts Caches distincts Cache partagé Cohérence des s 2 bi-core Cohérence des s/ bi-core Shared Memory Architecture! Parallel application : compulsory Page 16 8

9 Distributed memory(norma) Processor and memory tighly interconnected MPP : Massively Parallel Processing : Cluster : machines(comput nodes) interconnection Page 17 Distributed memory : Cluster Cluster : massive technology A lot of processor. A lot of machine (nodes) interconnected nodes: multi-processors, multi-core M P E/S M P E/S M P E/S M P E/S interconnect E/S M E/S M E/S P M E/S M P P P Page 18 9

10 Distributed memory : multi-computer Architecture (Clusters) Machine side: Massive technology Process access to its own (local) memory space Interconnect nodes : Like internet (ethernet) need much faster (bandwidth and latency) process to process communication Main memory Register File Cache Functional (mult, add) Main memory Register File Cache Functional (mult, add) Processor Processor disk OS 1 disk OS 2 Page 19 Distributed memory : multi-computer Architecture (Clusters) User side: n different nodes (n OS) interconnected, 1 (or +) processor per node. Parallel programming Message Passing (MPI, work done by developper you?) Need efficient tools to properly access computing ressources Main memory Register File Cache Functional (mult, add) Main memory Register File Cache Functional (mult, add) Processor Processor disk OS 1 disk OS 2 Page 20 10

11 Interconnection : key of parallel machine In parallel machine hardware (processor / memory) have to be connected specific (fast) protocols infiniband (shrink of ethernet), myrinet, proprietary topology: ring, hypercude, Torus, fat-tree, Interconnexion : Latency : how many time to be connected? Order of microsecond bandwidth (throughput): rate of data transfer? Mbytes/sec Topology : how many path from a point to another 1 Mo/s 1 Megaoctet/s 10 6 octets/sec 1 Go/s 1 Gigaoctet/s 10 9 octets/sec 1 To/s 1 Téraoctet/s octets/s Page 21 Interconnect Interconnexion : Topology : Best choice: Each processor to all: price (affordable?), few numbers of cores impossible for large scale (1000 cores ) The least bad: try to avoid bottleneck scalability of the network topology Differents strategies Cross-bar Page 22 11

12 Interconnect : example Tore 2D Latency 69 B21 42,43 B12 24,25 B13 26,27 B14 28,29 B15 30,31 B16 32,33 B17 34,35 B0 0, B1 B2 2,3 4,5 B3 6,7 B4 8,9 48 B5 10,11 40 Tore 2D Page 23 Interconnexion / network : protocols Differents Protocols Source : Journées JoSy, Groupe Calcul CNRS, 13/09/2007 Lyon Page 24 12

13 Top500 Juin 2012 TOP 500 List and Interconnect # # Machine CURIE Programme Europe «PRACE» (France) Nuclear Plant between 40 MW and 1450 MW Page 25 Interconnect / Network TOP 500 Page 26 13

14 Interconnect : Topology topologie Hypercube 3D, 4D, 8D Page 27 Interconnect : Topology Fat-tree «mirrored» Page 28 14

15 Summary : High Performance ing - HPC HPC, what is it? What for? Concepts in HPC (processors and parallelism) Moore s Law, Amhdal s Law HPC computational System and processor architecture Taxonomy : Shared Memory Machines Distributed Memory Machines Scalar, Superscalar, multi-core processors Code optimisation Parallel ation Make it happen : Message Passing Interface MPI (+BE) OpenMP Basics : speed up, parallel efficiency and traps Page 29 Processors Architecture : MIMD MIMD : Multiple Instruction Multiple Data «Classic» Processor( ) HPC (, $ ) Exemple : Pentium IV Exemple : Itanium II 3,2 GHz Cache 500 kbytes 6,4 Gflop/s peak Linpack 0,7 Gflop/s 1,5 GHz Cache 6 MBytes 6 Gflop/s peak Linpack 5,4 Gflop/s Time to solution 8 times faster! Vendors : Intel (Itanium), AMD (Opteron), IBM (Power), FUJITSU (UltraSparc), NEC (Vector Processors) Page 30 15

16 Processors Architecture : MIMD Processor cycle Clock frequency = number of pulse per second 200 Mhz 200 Millions cycles per second aim : retire one or more operation per cycle Optimised Architecture: Instruction Level Parallelism (ILP) : Pipeline multiple Functional (FPU) Hierarchical Memory Time access Cache level L1, L2, L3 Speculative Execution Branch prediction Prefetching Page 31 Processors Architecture : MIMD Processor Architecture FPU, ALU register Instruction memory Control Data memory < Memory () Data+instruction < Page 32 16

17 Processors Architecture : MIMD Example Architecture Scheme Proc. Superscalaire: Itanium Architecure «massively parallel»: 2 FPU 4 I&MM s 3 Branch Prediction Page 33 Pipeline Aim 1 cycle = 1 retired operation Processors Architecture : MIMD Very roughly Speaking : A1 = B1 + C1 load exec write 3 phases 1 phase/cycle Pipeline : example 3 Independants Operations A1 = B1 + C1 A2 = B2 + C2 A3 = B3 + C3 Page 34 17

18 Processors Architecture : MIMD Pipeline Aim 1 cycle = 1 retired operation Independant Operation A1 = B1 + C1 A2 = B2 + C2 A3 = B3 + C3 load exec write Cycle 1 Load B1,C1 Cycle 2 Add B1,C1 Cycle 3 Store in A1 Cycle 4 Load B2,C2 Add B2,C2 Store in A2 3 retired operations : 9 cycles Load B3,C3 Cycle 8 Add B3,C3 Cycle 9 Store in A3 Cycles (Time) Ressources «idle» Page 35 Processors Architecture : MIMD Pipeline Aim 1 cycle = 1 retired operation Independant operation A1 = B1 + C1 A2 = B2 + C2 A3 = B3 + C3 R3 retired Operation : 9 cycles load exec write Pipelining : Latency : 3 cycles 1 res/cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Load B1,C1 Load B2,C2 Load B3,C3 Add B1,C1 Add B2,C2 Add B3,C3 Store in A1 Store in A2 Cycle 5 Store in A3 Cycles (Time) Independant Operation Exhibit maximum of independant Operations Feed the Pipeline Page 36 18

19 Processors Architecture : MIMD Feed with data : Hierarchical Memory Temps (Floating Point ) «work on» data in register Registre : tsmall size Mo A1 = B1 + C1 load B1 Hit? yes load C1 no = miss Cache? Coût n cycles e «WAIT» n cycles Page 37 Processors Architecture : MIMD Level Cache : example processor Itanium2 d INTEL latency and throughput 1,5 Ghz => 1 cycle = 0,6 ns 2 cycles 5 cycles 12 cycles 1 ko 128 Integer Registers 1 ko 128 FP Registers 16 ko L1D 16 Go/s 32 Go/s 16 Go/s 32 Go/s 16 Go/s 5+1 cycles L2U 256 ko Mo-9 Mo 32 Go/s 6.4 Go/s 12+1cycles L3U 16Rd / 6Wr Altix : 145+ ns Page 38 19

20 Processors Architecture : MIMD Level Cache effect: Page 39 Speculate : exhibit independant operations Processors Architecture : MIMD If (cond1) then a1 = b1 + c1 else a1 = b1 * c1 End if «Break» dependancy Dependancy : Cond1 a1 = b1 + c1, Branch prediction «bet» Cond1 true compute a1 = b1 + c1,. Check later. Page 40 20

21 Processors Architecture HPC A core what is it? Page 41 Processors Architecture HPC L g I Page 42 21

22 Processors Architecture HPC Trends nowdays : multi-core Fondeurs : Intel, AMD, IBM, FUJITSU(Sparc) Cache Cache core Engrave Shrink (45 nm, 32nm, 22, nm,?nm) core core core core 1 processeurs mono-core core core core core More RAW power ( 2, 4, 6, 8 ) better ratio flop/watt, flop/m 2 same frequency or lower! Bottleneck?(SMP) => bandwidth with the RAM 1 processor ( or socket) multi-core (multi = 2, 4, 6, 8, 10, 12, 20 ) Page 43 Processors Architecture HPC TOP 500 JUNE 2013 Page 44 22

23 Processor Architecture : SIMD Vector Processor Operation on vector (not on scalar): Single instruction multiple data (SIMD) each => cycle n scalar retired operation code/program may fit to vectorization (need to fit with a lot of data) Affordable? X X1 X2 X3 Xn Y Y1 Y2 Y3 Yn X+Y X1+Y1 X2+Y2 X3+Y3 Xn+Yn vector principle(simd) now on scalar/superscalar processors: X86 SSE/AVX: Streaming SIMD Extension Altivec Superscalaire (2 FPU) Page 45 Processors Architecture : SIMD Processor Architecture Data memory Instruction memory Control FPU, ALU Data memory register Data memory < Data memory Memory () Data+instruction < Page 46 23

24 Processor Architecture : SIMD Accelerators GP-GPU : General Purpose Graphic process Precision, compilers, languages (CUDA), etc Stream processing (data centric process) SIMD intensive data parallelism data locality Future : in processor chip? GP-GPU Peak Gflop Ram sur la carte Time access ram Tesla an Gflop/s 1Go environ ns Tesla Kepler 4,5 TF 6Go ns ATI 1,5 Tflop/s 2 G environ ns Accelerators Intel : Xeon Phi 50 cores! MIMD (x86) cores Page 47 Processors Architecture : SIMD Accelerator Architecture Control Data memory Control Data memory Control Data memory Control Data memory < Control Data memory Control Data memory Control Data memory < Page 48 24

25 TOP 500 accelerator TOP 500 JUNE 2012 TOP 500 JUNE 2013 Page 49 25

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Lecture 23: Multiprocessors

Lecture 23: Multiprocessors Lecture 23: Multiprocessors Today s topics: RAID Multiprocessor taxonomy Snooping-based cache coherence protocol 1 RAID 0 and RAID 1 RAID 0 has no additional redundancy (misnomer) it uses an array of disks

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

How To Write A Parallel Computer Program

How To Write A Parallel Computer Program An Introduction to Parallel Programming An Introduction to Parallel Programming Tobias Wittwer VSSD Tobias Wittwer First edition 2006 Published by: VSSD Leeghwaterstraat 42, 2628 CA Delft, The Netherlands

More information

Chapter 2 Parallel Architecture, Software And Performance

Chapter 2 Parallel Architecture, Software And Performance Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Introduction to High Performance Cluster Computing. Cluster Training for UCL Part 1

Introduction to High Performance Cluster Computing. Cluster Training for UCL Part 1 Introduction to High Performance Cluster Computing Cluster Training for UCL Part 1 What is HPC HPC = High Performance Computing Includes Supercomputing HPCC = High Performance Cluster Computing Note: these

More information

Lecture 1: the anatomy of a supercomputer

Lecture 1: the anatomy of a supercomputer Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949

More information

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Chapter 2 Parallel Computer Architecture

Chapter 2 Parallel Computer Architecture Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general

More information

Supercomputing 2004 - Status und Trends (Conference Report) Peter Wegner

Supercomputing 2004 - Status und Trends (Conference Report) Peter Wegner (Conference Report) Peter Wegner SC2004 conference Top500 List BG/L Moors Law, problems of recent architectures Solutions Interconnects Software Lattice QCD machines DESY @SC2004 QCDOC Conclusions Technical

More information

Cluster Computing at HRI

Cluster Computing at HRI Cluster Computing at HRI J.S.Bagla Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019. E-mail: jasjeet@mri.ernet.in 1 Introduction and some local history High performance computing

More information

White Paper The Numascale Solution: Extreme BIG DATA Computing

White Paper The Numascale Solution: Extreme BIG DATA Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad ABOUT THE AUTHOR Einar Rustad is CTO of Numascale and has a background as CPU, Computer Systems and HPC Systems De-signer

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

64-Bit versus 32-Bit CPUs in Scientific Computing

64-Bit versus 32-Bit CPUs in Scientific Computing 64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples

More information

Programming Techniques for Supercomputers: Multicore processors. There is no way back Modern multi-/manycore chips Basic Compute Node Architecture

Programming Techniques for Supercomputers: Multicore processors. There is no way back Modern multi-/manycore chips Basic Compute Node Architecture Programming Techniques for Supercomputers: Multicore processors There is no way back Modern multi-/manycore chips Basic ompute Node Architecture SimultaneousMultiThreading (SMT) Prof. Dr. G. Wellein (a,b),

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT

numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT numascale Hardware Accellerated Data Intensive Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad www.numascale.com Supemicro delivers 108 node system with Numascale

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

Annotation to the assignments and the solution sheet. Note the following points

Annotation to the assignments and the solution sheet. Note the following points Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed

More information

High Performance Computing

High Performance Computing High Performance Computing Trey Breckenridge Computing Systems Manager Engineering Research Center Mississippi State University What is High Performance Computing? HPC is ill defined and context dependent.

More information

White Paper The Numascale Solution: Affordable BIG DATA Computing

White Paper The Numascale Solution: Affordable BIG DATA Computing White Paper The Numascale Solution: Affordable BIG DATA Computing By: John Russel PRODUCED BY: Tabor Custom Publishing IN CONJUNCTION WITH: ABSTRACT Big Data applications once limited to a few exotic disciplines

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

ANALYSIS OF SUPERCOMPUTER DESIGN

ANALYSIS OF SUPERCOMPUTER DESIGN ANALYSIS OF SUPERCOMPUTER DESIGN CS/ECE 566 Parallel Processing Fall 2011 1 Anh Huy Bui Nilesh Malpekar Vishnu Gajendran AGENDA Brief introduction of supercomputer Supercomputer design concerns and analysis

More information

Parallel Programming

Parallel Programming Parallel Programming Parallel Architectures Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Parallel Architectures Acknowledgements Prof. Felix

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Current Trend of Supercomputer Architecture

Current Trend of Supercomputer Architecture Current Trend of Supercomputer Architecture Haibei Zhang Department of Computer Science and Engineering haibei.zhang@huskymail.uconn.edu Abstract As computer technology evolves at an amazingly fast pace,

More information

CALMIP a Computing Meso-center for middle HPC academis Users

CALMIP a Computing Meso-center for middle HPC academis Users CALMIP a Computing Meso-center for middle HPC academis Users CALMIP : Calcul en Midi-Pyrénées Méso-centre de Calcul Boris Dintrans, President Operational Board CALMIP, Laboratoire d Astrophysique de Toulouse

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

High Performance Computing in the Multi-core Area

High Performance Computing in the Multi-core Area High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable

More information

Lecture 1. Course Introduction

Lecture 1. Course Introduction Lecture 1 Course Introduction Welcome to CSE 262! Your instructor is Scott B. Baden Office hours (week 1) Tues/Thurs 3.30 to 4.30 Room 3244 EBU3B 2010 Scott B. Baden / CSE 262 /Spring 2011 2 Content Our

More information

A Very Brief History of High-Performance Computing

A Very Brief History of High-Performance Computing A Very Brief History of High-Performance Computing CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Very Brief History of High-Performance Computing Spring 2016 1

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

CS 352H: Computer Systems Architecture

CS 352H: Computer Systems Architecture CS 352H: Computer Systems Architecture Topic 14: Multicores, Multiprocessors, and Clusters University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Introduction Goal:

More information

Benchmarking Large Scale Cloud Computing in Asia Pacific

Benchmarking Large Scale Cloud Computing in Asia Pacific 2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

10- High Performance Compu5ng

10- High Performance Compu5ng 10- High Performance Compu5ng (Herramientas Computacionales Avanzadas para la Inves6gación Aplicada) Rafael Palacios, Fernando de Cuadra MRE Contents Implemen8ng computa8onal tools 1. High Performance

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Principles and characteristics of distributed systems and environments

Principles and characteristics of distributed systems and environments Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single

More information

ST810 Advanced Computing

ST810 Advanced Computing ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Building Clusters for Gromacs and other HPC applications

Building Clusters for Gromacs and other HPC applications Building Clusters for Gromacs and other HPC applications Erik Lindahl lindahl@cbr.su.se CBR Outline: Clusters Clusters vs. small networks of machines Why do YOU need a cluster? Computer hardware Network

More information

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes Anthony Kenisky, VP of North America Sales About Appro Over 20 Years of Experience 1991 2000 OEM Server Manufacturer 2001-2007

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

Clusters: Mainstream Technology for CAE

Clusters: Mainstream Technology for CAE Clusters: Mainstream Technology for CAE Alanna Dwyer HPC Division, HP Linux and Clusters Sparked a Revolution in High Performance Computing! Supercomputing performance now affordable and accessible Linux

More information

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS Structure Page Nos. 2.0 Introduction 27 2.1 Objectives 27 2.2 Types of Classification 28 2.3 Flynn s Classification 28 2.3.1 Instruction Cycle 2.3.2 Instruction

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

More information

Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez

Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez Energy efficient computing on Embedded and Mobile devices Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez A brief look at the (outdated) Top500 list Most systems are built

More information

Building an Inexpensive Parallel Computer

Building an Inexpensive Parallel Computer Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University

More information

Performance Evaluation of Amazon EC2 for NASA HPC Applications!

Performance Evaluation of Amazon EC2 for NASA HPC Applications! National Aeronautics and Space Administration Performance Evaluation of Amazon EC2 for NASA HPC Applications! Piyush Mehrotra!! J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff,! S. Saini, R. Biswas!

More information

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are

More information

Cluster Computing in a College of Criminal Justice

Cluster Computing in a College of Criminal Justice Cluster Computing in a College of Criminal Justice Boris Bondarenko and Douglas E. Salane Mathematics & Computer Science Dept. John Jay College of Criminal Justice The City University of New York 2004

More information

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Vorlesung Rechnerarchitektur 2 Seite 178 DASH Vorlesung Rechnerarchitektur 2 Seite 178 Architecture for Shared () The -architecture is a cache coherent, NUMA multiprocessor system, developed at CSL-Stanford by John Hennessy, Daniel Lenoski, Monica

More information

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

CS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson

CS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson CS 3530 Operating Systems L02 OS Intro Part 1 Dr. Ken Hoganson Chapter 1 Basic Concepts of Operating Systems Computer Systems A computer system consists of two basic types of components: Hardware components,

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Jun-93 Aug-94 Oct-95 Dec-96 Feb-98 Apr-99 Jun-00 Aug-01 Oct-02 Dec-03

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture ELE 356 Computer Engineering II Section 1 Foundations Class 6 Architecture History ENIAC Video 2 tj History Mechanical Devices Abacus 3 tj History Mechanical Devices The Antikythera Mechanism Oldest known

More information

A quick tutorial on Intel's Xeon Phi Coprocessor

A quick tutorial on Intel's Xeon Phi Coprocessor A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be damien.francois@uclouvain.be Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed

More information

Jean-Pierre Panziera Teratec 2011

Jean-Pierre Panziera Teratec 2011 Technologies for the future HPC systems Jean-Pierre Panziera Teratec 2011 3 petaflop systems : TERA 100, CURIE & IFERC Tera100 Curie IFERC 1.25 PetaFlops 256 TB ory 30 PB disk storage 140 000+ Xeon cores

More information

Kriterien für ein PetaFlop System

Kriterien für ein PetaFlop System Kriterien für ein PetaFlop System Rainer Keller, HLRS :: :: :: Context: Organizational HLRS is one of the three national supercomputing centers in Germany. The national supercomputing centers are working

More information

Basic Concepts in Parallelization

Basic Concepts in Parallelization 1 Basic Concepts in Parallelization Ruud van der Pas Senior Staff Engineer Oracle Solaris Studio Oracle Menlo Park, CA, USA IWOMP 2010 CCS, University of Tsukuba Tsukuba, Japan June 14-16, 2010 2 Outline

More information

Algorithms of Scientific Computing II

Algorithms of Scientific Computing II Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware

More information

1 DCSC/AU: HUGE. DeIC Sekretariat 2013-03-12/RB. Bilag 1. DeIC (DCSC) Scientific Computing Installations

1 DCSC/AU: HUGE. DeIC Sekretariat 2013-03-12/RB. Bilag 1. DeIC (DCSC) Scientific Computing Installations Bilag 1 2013-03-12/RB DeIC (DCSC) Scientific Computing Installations DeIC, previously DCSC, currently has a number of scientific computing installations, distributed at five regional operating centres.

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

QTP Computing Laboratory Strategy

QTP Computing Laboratory Strategy Introduction QTP Computing Laboratory Strategy Erik Deumens Quantum Theory Project 12 September 2001 From the beginning of its computer operations (1980-1982) QTP has worked from a strategy and an architecture

More information

Altix Usage and Application Programming. Welcome and Introduction

Altix Usage and Application Programming. Welcome and Introduction Zentrum für Informationsdienste und Hochleistungsrechnen Altix Usage and Application Programming Welcome and Introduction Zellescher Weg 12 Tel. +49 351-463 - 35450 Dresden, November 30th 2005 Wolfgang

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

Cluster Implementation and Management; Scheduling

Cluster Implementation and Management; Scheduling Cluster Implementation and Management; Scheduling CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Cluster Implementation and Management; Scheduling Spring 2013 1 /

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis Middleware and Distributed Systems Introduction Dr. Martin v. Löwis 14 3. Software Engineering What is Middleware? Bauer et al. Software Engineering, Report on a conference sponsored by the NATO SCIENCE

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information