An Introduction to High Performance Computing
Nicolas Renon, Ph.D., Research Engineer in Scientific Computing, CALMIP - DTSI, Université Paul Sabatier, University of Toulouse (nicolas.renon@univ-tlse3.fr)
Michel Fournié, Maître de conférences, IMT, Université de Toulouse
Toulouse University Computing Center

Summary: High Performance Computing - HPC
- HPC: what is it? What for?
- Concepts in HPC (processors and parallelism): Moore's Law, Amdahl's Law
- HPC computing systems and processor architecture. Taxonomy: shared-memory machines, distributed-memory machines; scalar, superscalar and multi-core processors
- Code optimisation
- Parallel programming. Make it happen: Message Passing Interface, MPI (+BE); OpenMP. Execution model; why exchange messages; the MPI library: most important routines
- Basics: speed-up, parallel efficiency and traps
Principle of supercomputers: HPC computing systems
- Hardware and software built to achieve performance on floating-point computation (Flop/s), plus memory (RAM and file system). Flop: floating-point operation (multiply, add).
- Storage, I/O (input/output): ++ or $++
- Mutualisation: several (tens of?) users share a single (big) system; coherent use of the resources is needed; rules (fair sharing?) => job scheduler.
- Supercomputers are dedicated to computing scientific applications; (huge) back-up file storage; remote access.
- Nowadays: huge constraints on the infrastructure facility: electricity supply (back-up), cooling, weight, safety, security.
HPC system taxonomy
- Shared-memory systems: multiprocessors; one single address space; UMA (Uniform Memory Access); NUMA (Non-Uniform Memory Access).
- Distributed-memory systems (clusters): multi-computers; multiple (different) address spaces; NORMA (no remote memory access).

OS: Operating System
- Operating system («système d'exploitation» in French): a collection of programs that unifies the hardware in order to make it usable!
- Examples: Windows, Linux, Mac OS X; Unix-like: AIX (IBM), HP-UX (HP), Solaris (Sun).
- The OS "allocates hardware resources to your program". (Diagram: operating system between user and hardware, from Wikipedia.)
UMA architecture (shared memory)
- User side: a single machine (single OS), several processors, one single memory address space. How to program: an extension of sequential programming.
- Machine side: SMP (Symmetric MultiProcessor); bus interconnection between memory and processors; central memory and I/O shared by all processors; all processors access the same memory.
- (Diagram: each processor with its register file, functional units (multiply, add) and cache, connected to shared memory over the bus.)
- Cache coherency is needed because different processors may modify data in the same cache line.
Cache coherency: an example
- A = 1.5E8 sits in shared memory; two processors read it, so each cache holds a copy of A = 1.5E8.
- One processor then writes A = 0.6E-2 into its own cache: the processors now hold different values of A.
- This is unacceptable from the developer's point of view, so the hardware must keep the caches coherent.
UMA machines: cache coherency protocols
- Several protocols exist. Invalidation: one write invalidates the copies held in the other processors' caches. Snooping.
- Handled by the hardware/OS, not by the user. But it can slow parallel performance, so the developer should be aware of it (OpenMP).

UMA architecture (shared memory): scalability
- Tests on multithreaded DGEMM (MKL, ifort v11), double precision, with automatic parallelisation. (Plot: DGEMM time in seconds versus number of cores, for Nehalem-EX 2.6 GHz, Nehalem-EP and Itanium 1.5 GHz.)
UMA architecture: memory access
- Concurrent access to central memory => bottleneck: access time increases.
- Remedy: increase the size (and the number of levels) of cache memory.
- Consequence: only a small number of processors is practical. Another paradigm/option: distribute the memory?

Laptops, workstations and PCs are multi-core: shared memory
- Two distinct cores with distinct caches, or with a shared cache; cache coherency between the cores, and between two dual-core chips.
- This is a shared-memory architecture! A parallel application is compulsory to exploit it.
9 Distributed memory(norma) Processor and memory tighly interconnected MPP : Massively Parallel Processing : Cluster : machines(comput nodes) interconnection Page 17 Distributed memory : Cluster Cluster : massive technology A lot of processor. A lot of machine (nodes) interconnected nodes: multi-processors, multi-core M P E/S M P E/S M P E/S M P E/S interconnect E/S M E/S M E/S P M E/S M P P P Page 18 9
Distributed memory: multi-computer architecture (clusters)
- Machine side: massive technology; each process accesses its own (local) memory space. Interconnected nodes: like the internet (Ethernet), but much faster (bandwidth and latency) process-to-process communication is needed.
- User side: n different nodes (n OSes) interconnected, 1 (or more) processor(s) per node. Parallel programming: message passing (MPI — work done by the developer: you?). Efficient tools are needed to properly access the computing resources.
- (Diagram: two nodes, each with main memory, cache, register file, functional units, processor, disk and its own OS.)
Interconnection: the key of a parallel machine
- In a parallel machine the hardware (processors / memory) has to be connected: specific (fast) protocols: InfiniBand, Myrinet, proprietary ones. Topology: ring, hypercube, torus, fat-tree, ...
- Latency: how long does it take to establish a connection? On the order of a microsecond.
- Bandwidth (throughput): rate of data transfer, in Mbytes/s.
- Topology: how many paths from one point to another?
- Units: 1 MB/s = 10^6 bytes/s; 1 GB/s = 10^9 bytes/s; 1 TB/s = 10^12 bytes/s.

Interconnect topology
- Best choice: each processor connected to all the others — the price (affordable?) restricts it to a few cores; impossible at large scale (1000 cores and beyond).
- The least bad: try to avoid bottlenecks; scalability of the network topology; different strategies: crossbar, ...
Interconnect example: a 2D torus
- (Figure: 2D torus of nodes B0-B21 with per-link latencies.)

Interconnection / network protocols
- Different protocols compared. Source: Journées JoSy, Groupe Calcul CNRS, 13/09/2007, Lyon.
TOP500, June 2012: list and interconnect
- Example machine: CURIE, European programme "PRACE" (France).
- For scale: a nuclear plant delivers between 40 MW and 1450 MW.

Interconnect / network share in the TOP500.
Interconnect topology examples
- Hypercube: 3D, 4D, 8D.
- "Mirrored" fat-tree.
Processor architecture: MIMD
- MIMD: Multiple Instruction, Multiple Data.
- A "classic" processor versus an HPC processor (more expensive). Example: Pentium IV, 3.2 GHz, 500 KB cache, 6.4 Gflop/s peak, Linpack 0.7 Gflop/s — versus Itanium II, 1.5 GHz, 6 MB cache, 6 Gflop/s peak, Linpack 5.4 Gflop/s. Time to solution: 8 times faster!
- Vendors: Intel (Itanium), AMD (Opteron), IBM (Power), Fujitsu (UltraSPARC), NEC (vector processors).
Processor architecture: MIMD
- Processor cycle: the clock frequency is the number of pulses per second; 200 MHz = 200 million cycles per second. Aim: retire one or more operations per cycle.
- Optimised architecture: Instruction-Level Parallelism (ILP): pipelining, multiple functional units (FPUs); hierarchical memory: access time, cache levels L1, L2, L3; speculative execution: branch prediction, prefetching.
- (Diagram: processor with FPU/ALU, registers, control unit, instruction memory and data memory, backed by main memory holding data + instructions.)
Processor architecture: MIMD — example
- Superscalar processor scheme: the Itanium architecture is "massively parallel": 2 FPUs, 4 integer/multimedia units, 3 branch units with branch prediction.

Pipelining
- Aim: 1 cycle = 1 retired operation. Very roughly speaking, A1 = B1 + C1 has 3 phases — load, execute, write — at 1 phase per cycle.
- Pipeline example: 3 independent operations: A1 = B1 + C1; A2 = B2 + C2; A3 = B3 + C3.
Pipelining: without overlap
- Without pipelining, each operation occupies the three stages in turn: cycle 1 load B1,C1; cycle 2 add B1,C1; cycle 3 store A1; then the same for A2 and A3. The 3 retired operations take 9 cycles, and resources sit idle most of the time.

Pipelining: with overlap
- With pipelining (latency: 3 cycles, then 1 result per cycle): cycle 1 load B1,C1; cycle 2 load B2,C2 while adding B1,C1; cycle 3 load B3,C3, add B2,C2, store A1; cycle 4 store A2; cycle 5 store A3. The 3 operations take only 5 cycles.
- Lesson: exhibit the maximum number of independent operations to feed the pipeline.
Processor architecture: MIMD — feeding the pipeline with data
- Hierarchical memory: the floating-point units "work on" data held in registers; registers are tiny.
- A1 = B1 + C1: load B1 — hit? yes; load C1 — no: a miss. Go to the cache, at a cost of n cycles: the unit must "wait" n cycles.

Cache levels: example, the Intel Itanium 2
- 1.5 GHz => 1 cycle = 0.6 ns.
- Registers: 128 integer + 128 FP (about 1 KB each).
- L1D: 16 KB, 2-cycle latency.
- L2U: 256 KB, 5+1 cycles.
- L3U: up to 9 MB, 12+1 cycles, 16 reads / 6 writes.
- Bandwidths between levels: 16-32 GB/s; memory bandwidth: 6.4 GB/s.
- Main memory (Altix): 145+ ns.
Processor architecture: MIMD — cache level effect
- (Plot: performance as a function of data size, with a step at each cache level.)

Speculation: exhibiting independent operations
- if (cond1) then a1 = b1 + c1 else a1 = b1 * c1 end if
- Dependency: a1 = b1 + c1 depends on cond1. Branch prediction "breaks" the dependency: the processor "bets" that cond1 is true, computes a1 = b1 + c1, and checks later.
Processor architecture, HPC
- A core: what is it? (Die photographs and block diagram.)
Processor architecture, HPC: the trend nowadays is multi-core
- Chip makers ("fondeurs"): Intel, AMD, IBM, Fujitsu (SPARC).
- Engraving shrinks: 45 nm, 32 nm, 22 nm, ... ? nm.
- One single-core processor becomes one multi-core processor (or socket), with multi = 2, 4, 6, 8, 10, 12, 20, ... cores.
- More raw power (x2, x4, x6, x8, ...) and a better flop/watt and flop/m² ratio, at the same frequency or lower!
- Bottleneck? As in SMP: bandwidth to the RAM.

Processor architecture, HPC: TOP500, June 2013.
Processor architecture: SIMD — vector processors
- Operations on vectors (not on scalars): Single Instruction, Multiple Data (SIMD); each cycle retires n scalar operations.
- The code/program must fit vectorization (it needs to work on a lot of data). Affordable?
- Principle: X = (X1, X2, X3, ..., Xn), Y = (Y1, Y2, Y3, ..., Yn), X+Y = (X1+Y1, X2+Y2, X3+Y3, ..., Xn+Yn).
- The vector (SIMD) principle now lives inside scalar/superscalar processors: x86 SSE/AVX (Streaming SIMD Extensions), AltiVec; superscalar (2 FPUs).
- (Diagram: SIMD processor — one control unit driving several data paths.)
Processor architecture: SIMD — accelerators
- GP-GPU: General-Purpose Graphics Processing Unit. Issues: precision, compilers, languages (CUDA), etc.
- Stream processing (data-centric processing): SIMD, intensive data parallelism, data locality. Future: inside the processor chip?
- GP-GPU examples (peak performance, on-board RAM, RAM access time): Tesla: ... Gflop/s, 1 GB, about ... ns; Tesla Kepler: 4.5 Tflop/s, 6 GB, ... ns; ATI: 1.5 Tflop/s, 2 GB, about ... ns.
- Accelerators at Intel: Xeon Phi, 50 cores! MIMD (x86) cores.
- (Diagram: accelerator architecture — many small control + data-memory units.)
TOP500: accelerators
- Accelerator share in the TOP500, June 2012 and June 2013.
National Aeronautics and Space Administration Performance Evaluation of Amazon EC2 for NASA HPC Applications! Piyush Mehrotra!! J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff,! S. Saini, R. Biswas!
More informationWhy Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are
More informationCluster Computing in a College of Criminal Justice
Cluster Computing in a College of Criminal Justice Boris Bondarenko and Douglas E. Salane Mathematics & Computer Science Dept. John Jay College of Criminal Justice The City University of New York 2004
More informationVorlesung Rechnerarchitektur 2 Seite 178 DASH
Vorlesung Rechnerarchitektur 2 Seite 178 Architecture for Shared () The -architecture is a cache coherent, NUMA multiprocessor system, developed at CSL-Stanford by John Hennessy, Daniel Lenoski, Monica
More informationDavid Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationCS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson
CS 3530 Operating Systems L02 OS Intro Part 1 Dr. Ken Hoganson Chapter 1 Basic Concepts of Operating Systems Computer Systems A computer system consists of two basic types of components: Hardware components,
More informationIntroduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More informationExascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation
Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Jun-93 Aug-94 Oct-95 Dec-96 Feb-98 Apr-99 Jun-00 Aug-01 Oct-02 Dec-03
More informationGraphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
More informationELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture
ELE 356 Computer Engineering II Section 1 Foundations Class 6 Architecture History ENIAC Video 2 tj History Mechanical Devices Abacus 3 tj History Mechanical Devices The Antikythera Mechanism Oldest known
More informationA quick tutorial on Intel's Xeon Phi Coprocessor
A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be damien.francois@uclouvain.be Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed
More informationJean-Pierre Panziera Teratec 2011
Technologies for the future HPC systems Jean-Pierre Panziera Teratec 2011 3 petaflop systems : TERA 100, CURIE & IFERC Tera100 Curie IFERC 1.25 PetaFlops 256 TB ory 30 PB disk storage 140 000+ Xeon cores
More informationKriterien für ein PetaFlop System
Kriterien für ein PetaFlop System Rainer Keller, HLRS :: :: :: Context: Organizational HLRS is one of the three national supercomputing centers in Germany. The national supercomputing centers are working
More informationBasic Concepts in Parallelization
1 Basic Concepts in Parallelization Ruud van der Pas Senior Staff Engineer Oracle Solaris Studio Oracle Menlo Park, CA, USA IWOMP 2010 CCS, University of Tsukuba Tsukuba, Japan June 14-16, 2010 2 Outline
More informationAlgorithms of Scientific Computing II
Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware
More information1 DCSC/AU: HUGE. DeIC Sekretariat 2013-03-12/RB. Bilag 1. DeIC (DCSC) Scientific Computing Installations
Bilag 1 2013-03-12/RB DeIC (DCSC) Scientific Computing Installations DeIC, previously DCSC, currently has a number of scientific computing installations, distributed at five regional operating centres.
More information- An Essential Building Block for Stable and Reliable Compute Clusters
Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative
More informationHETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
More informationQTP Computing Laboratory Strategy
Introduction QTP Computing Laboratory Strategy Erik Deumens Quantum Theory Project 12 September 2001 From the beginning of its computer operations (1980-1982) QTP has worked from a strategy and an architecture
More informationAltix Usage and Application Programming. Welcome and Introduction
Zentrum für Informationsdienste und Hochleistungsrechnen Altix Usage and Application Programming Welcome and Introduction Zellescher Weg 12 Tel. +49 351-463 - 35450 Dresden, November 30th 2005 Wolfgang
More informationIntroduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
More informationCluster Implementation and Management; Scheduling
Cluster Implementation and Management; Scheduling CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Cluster Implementation and Management; Scheduling Spring 2013 1 /
More informationOpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
More informationMiddleware and Distributed Systems. Introduction. Dr. Martin v. Löwis
Middleware and Distributed Systems Introduction Dr. Martin v. Löwis 14 3. Software Engineering What is Middleware? Bauer et al. Software Engineering, Report on a conference sponsored by the NATO SCIENCE
More informationOverview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
More information