Parallel Programming and High-Performance Computing

1 Parallel Programming and High-Performance Computing Part 1: Introduction Dr. Ralf-Peter Mundani CeSIM / IGSSE

2 General Remarks materials: Ralf-Peter Mundani, phone, room 3181 (city centre), consultation hour: Tuesday, 4:00-6:00 pm (room); Ioan Lucian Muntean, phone, room; lecture (2 SWS): weekly, Tuesday, start at 12:15 pm, room; exercises (1 SWS): fortnightly, Wednesday, start at 4:45 pm, room

3 General Remarks content: part 1: introduction; part 2: high-performance networks; part 3: foundations; part 4: programming memory-coupled systems; part 5: programming message-coupled systems; part 6: dynamic load balancing; part 7: examples of parallel algorithms

4 Overview motivation; classification of parallel computers; levels of parallelism; quantitative performance evaluation. "I think there is a world market for maybe five computers." Thomas Watson, chairman of IBM

5 Motivation numerical simulation: from phenomena to predictions (physical phenomenon, technical process): 1. modelling: determination of parameters, expression of relations; 2. numerical treatment: model discretisation, algorithm development; 3. implementation: software development, parallelisation; 4. visualisation: illustration of abstract simulation results; 5. validation: comparison of results with reality; 6. embedding: insertion into working process; disciplines involved: mathematics, computer science, application

6 Motivation why parallel programming and HPC? complex problems (especially the so-called grand challenges) demand for more computing power: climate or geophysics simulation (tsunami, e. g.), structure or flow simulation (crash test, e. g.), development systems (CAD, e. g.), large data analysis (Large Hadron Collider at CERN, e. g.), military applications (crypto analysis, e. g.); performance increase due to faster hardware, more memory ("work harder"), more efficient algorithms, optimisation ("work smarter"), parallel computing ("get some help")

7 Motivation objectives (in case all resources were available N times over): throughput: compute N problems simultaneously, i. e. running N instances of a sequential program with different data sets ("embarrassing parallelism"); SETI@home, e. g.; drawback: limited resources of single nodes. response time: compute one problem in a fraction (1/N) of the time, i. e. running one instance (i. e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e. g.; drawback: writing a parallel program; communication. problem size: compute one problem with N-times larger data, i. e. running one instance (i. e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e. g.; drawback: writing a parallel program; communication

8 Overview motivation classification of parallel computers levels of parallelism quantitative performance evaluation

9 definition: "A collection of processing elements that communicate and cooperate to solve large problems" (ALMASI and GOTTLIEB, 1989); possible appearances of such processing elements: specialised units (steps of a vector pipeline, e. g.), parallel features in modern monoprocessors (superscalar architectures, instruction pipelining, VLIW, multithreading, multicore, ...), several uniform arithmetical units (processing elements of array computers, e. g.), processors of a multiprocessor computer (i. e. the actual parallel computers), complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers), parallel computers or clusters connected via WAN (so-called metacomputers)

10 reminder: dual core, quad core, manycore, and multicore observation: increasing frequency (and thus core voltage) over the past years; problem: thermal power dissipation increases linearly with frequency and with the square of the core voltage

11 reminder: dual core, quad core, manycore, and multicore (cont'd) a 25% reduction in frequency (and thus core voltage) leads to a 50% reduction in dissipation [chart: dissipation and performance of a normal CPU vs. a reduced CPU]

12 reminder: dual core, quad core, manycore, and multicore (cont'd) idea: installation of two cores per die with the same dissipation as a single-core system [chart: dissipation and performance of a single-core vs. a dual-core system]

13 commercial parallel computers manufacturers: starting from 1983, big players and small start-ups (see table; "out of business" means no longer in the parallel business); names have been coming and going rapidly; in addition: several manufacturers of vector computers and non-standard architectures
company     country   year   status in 2003
Sequent     U.S.             acquired by IBM
Intel       U.S.             out of business
Meiko       U.K.             bankrupt
nCUBE       U.S.             out of business
Parsytec    Germany   1985   out of business
Alliant     U.S.             bankrupt

14 commercial parallel computers (cont'd)
company                   country   year   status in 2003
Encore                    U.S.             out of business
Floating Point Systems    U.S.             acquired by SUN
Myrias                    Canada    1987   out of business
Ametek                    U.S.             out of business
Silicon Graphics          U.S.             active
C-DAC                     India     1991   active
Kendall Square Research   U.S.             bankrupt
IBM                       U.S.             active
NEC                       Japan     1993   active
SUN Microsystems          U.S.             active
Cray Research             U.S.             active

15 arrival of clusters in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices, making them increasingly attractive as building blocks for parallel computers; 1994: Beowulf, the first parallel computer built completely out of commodity hardware (NASA Goddard Space Flight Centre): 16 Intel DX4 processors, multiple 10 Mbit Ethernet links, Linux with GNU compilers, MPI library; 1996: a Beowulf cluster performing more than 1 GFlops; 1997: a 140-node cluster performing more than 10 GFlops

16 arrival of clusters (cont'd) 2005: InfiniBand cluster at TUM: 36 Opteron nodes (quad boards), 4 Itanium nodes (quad boards), 4 Xeon nodes (dual boards) for interactive tasks, InfiniBand 4 switch with 96 ports, Linux (SuSE and Redhat)

17 supercomputers supercomputing or high-performance scientific computing as the most important application of the big number crunchers; national initiatives due to huge budget requirements: Accelerated Strategic Computing Initiative (ASCI) in the U.S., launched in the wake of the nuclear testing moratorium in 1992/93; decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.; start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer); then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, ...; meanwhile a new high-end computing memorandum (2004)

18 supercomputers (cont'd) federal Bundeshöchstleistungsrechner initiative in Germany, decided in the mid-nineties: three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich), one new installation every second year (i. e. a six-year upgrade cycle for each centre), the newest one to be among the top 10 of the world; overview and state of the art: Top500 list (updated every six months), see

19 MOORE's law observation of Intel co-founder Gordon E. MOORE (1965), describing an important trend in the history of computer hardware: the number of transistors that can be placed on an integrated circuit is increasing exponentially, doubling approximately every two years

20 some numbers: Top500 [charts from the Top500 list]

21 some numbers: Top500 (cont'd) [charts from the Top500 list]

22 some numbers: Top500 (cont'd) cluster: #nodes > #processors/node; constellation: #nodes < #processors/node

23-29 some numbers: Top500 (cont'd) [charts from the Top500 list]

30 The Earth Simulator world's #1 from 2002 until 2004; installed in 2002 in Yokohama, Japan; ES building (approx. 50 m x 65 m x 17 m); based on the NEC SX-6 architecture, developed by three governmental agencies; highly parallel vector supercomputer; consists of 640 nodes (plus 2 control and 128 data switching nodes), each with 8 vector processors (8 GFlops each) and 16 GB shared memory, i. e. 5120 processors (40.96 TFlops peak performance) and 10 TB memory; ... TFlops sustained performance (Linpack); nodes connected by a single-stage crossbar (83,200 cables with a total length of 2400 km; 8 TBps total bandwidth); further 700 TB disc space and 1.6 PB mass storage

31 BlueGene/L world's #1 since 2004; installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM); cooperation of DoE, LLNL, and IBM; massively parallel supercomputer; consists of 65,536 nodes (plus 12 front-end and 1204 I/O nodes), each with 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory, i. e. 131,072 processors (367 TFlops peak performance) and 33.5 TB memory; ... TFlops sustained performance (Linpack); nodes configured as a 3D torus; global reduction tree for fast operations (global max / sum) in a few microseconds; 1024 Gbps link to the global parallel file system; further 806 TB disc space; operating system SuSE SLES

32 HLRB II (world's #6 for 04/2006) installed in 2006 at LRZ, Garching; installation costs 38 M EUR, monthly costs approx. 400,000 EUR; upgrade in 2007 (finished); one of Germany's 3 supercomputers; SGI Altix 4700: consists of 19 nodes (SGI NUMAlink 2D torus) with 256 blades each (ccNUMA link with partition fat tree), Intel Itanium2 Montecito Dual Core (12.8 GFlops), 4 GB memory per core, i. e. 9728 processor cores (62.3 TFlops peak performance) and 39 TB memory; 56.5 TFlops sustained performance (Linpack); footprint 24 m x 12 m; total weight 103 metric tons

33 standard classification according to FLYNN global data and instruction streams as criterion: instruction stream: sequence of commands to be executed; data stream: sequence of data subject to instruction streams; two-dimensional subdivision according to the number of instructions a computer can execute per time unit and the number of data elements a computer can process per time unit; hence, FLYNN distinguishes four classes of architectures: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), MIMD (multiple instruction, multiple data); drawback: very different computers may belong to the same class

34 standard classification according to FLYNN (cont'd) SISD: one processing unit that has access to one data memory and to one program memory; classical monoprocessor following VON NEUMANN's principle [diagram: data memory, processor, program memory]

35 standard classification according to FLYNN (cont'd) SIMD: several processing units, each with separate access to a (shared or distributed) data memory; one program memory; synchronous execution of instructions; examples: array computer, vector computer; advantage: easy programming model due to a control flow with strictly synchronous-parallel execution of all instructions; drawbacks: specialised hardware necessary, easily becomes outdated due to recent developments on the commodity market [diagram: several processors, each with a data memory, sharing one program memory]

36 standard classification according to FLYNN (cont'd) MISD: several processing units that have access to one data memory; several program memories; not a very popular class (mainly for special applications such as digital signal processing); operates on a single stream of data, forwarding results from one processing unit to the next; example: systolic array (network of primitive processing elements that "pump" data) [diagram: several processors, each with its own program memory, sharing one data memory]

37 standard classification according to FLYNN (cont'd) MIMD: several processing units, each with separate access to a (shared or distributed) data memory; several program memories; classification according to (physical) memory organisation: shared memory (shared (global) address space) vs. distributed memory (distributed (local) address space); examples: multiprocessor systems, networks of computers [diagram: several processors, each with its own data memory and program memory]

38 processor coupling cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation; the following types of processor coupling can be distinguished: memory-coupled multiprocessor systems (MemMS) and message-coupled multiprocessor systems (MesMS)
                      shared address space     distributed address space
global memory         MemMS, SMP               -
distributed memory    Mem-MesMS (hybrid)       MesMS

39 processor coupling (cont'd) central issues: scalability: costs for adding new nodes / processors; programming model: costs for writing parallel programs; portability: costs for porting (migrating), i. e. transferring a program from one system to another while preserving executability and flexibility; load distribution: costs for obtaining a uniform load distribution among all nodes / processors; MesMS are advantageous concerning scalability, MemMS are typically better concerning the rest; hence, a combination of MemMS and MesMS for exploiting all advantages: distributed / virtual shared memory (DSM / VSM), i. e. physically distributed memory with a global shared address space

40 processor coupling (cont'd) uniform memory access (UMA): each processor P has direct access via the network to each memory module M, with the same access times to all data; the standard programming model can be used (i. e. no explicit send / receive of messages necessary); communication and synchronisation via shared variables (inconsistencies (write conflicts, e. g.) have to be prevented, in general by the programmer) [diagram: memory modules M connected to processors P via a network]

41 processor coupling (cont'd) symmetric multiprocessor (SMP): only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability; cache coherence implemented in hardware (i. e. a read always provides a variable's value from its last write); examples: double or quad boards, SGI Challenge [diagram: memory M on a bus with processors P, each with a cache C]

42 processor coupling (cont'd) non-uniform memory access (NUMA): memory modules physically distributed among the processors; shared address space, but access times depend on the location of the data (i. e. local addresses are faster than remote addresses); differences in access times are visible in the program; examples: DSM / VSM, Cray T3E [diagram: processors P with local memories M connected via a network]

43 processor coupling (cont'd) cache-coherent non-uniform memory access (ccNUMA): caches for local and remote addresses; cache coherence implemented in hardware for the entire address space; problem with scalability due to frequent cache updates; example: SGI Origin 2000 [diagram: processors P with caches C and local memories M connected via a network]

44 processor coupling (cont'd) cache-only memory access (COMA): each processor has only cache memory; the entirety of all cache memories forms the global shared memory; cache coherence implemented in hardware; example: Kendall Square Research KSR-1 [diagram: processors P with caches C connected via a network]

45 processor coupling (cont'd) no remote memory access (NORMA): each processor has direct access to its local memory only; access to remote memory is possible only via explicit message exchange (due to the distributed address space); synchronisation implicitly via the exchange of messages; performance improvement between memory and I/O due to parallel data transfer (Direct Memory Access, e. g.) possible; examples: IBM SP2, ASCI Red / Blue / White [diagram: processors P with local memories M connected via a network]

46 Overview motivation classification of parallel computers levels of parallelism quantitative performance evaluation

47 Levels of Parallelism the suitability of a parallel architecture for a given parallel program strongly depends on the granularity of parallelism; some remarks on granularity: quantitative meaning: ratio of computational effort to communication / synchronisation effort (amount of instructions between two necessary communication / synchronisation steps); qualitative meaning: level on which work is done in parallel, from coarse-grain to fine-grain parallelism: program level, process level, block level, instruction level, sub-instruction level

48 Levels of Parallelism program level: parallel processing of different programs; independent units without any shared data; no or only a small amount of communication / synchronisation; organised by the OS. process level: a program is subdivided into processes to be executed in parallel; each process consists of a larger amount of sequential instructions and has a private address space; synchronisation necessary (in case all processes belong to one program); communication in most cases necessary (data exchange, e. g.); support by the OS via routines for process management, process communication, and process synchronisation; the term process often refers to a heavy-weight process

49 Levels of Parallelism block level: blocks of instructions are executed in parallel; each block consists of a smaller amount of instructions and shares the address space with other blocks; communication via shared variables, synchronisation mechanisms; a block is often referred to as a light-weight process (thread), see the sketch below. instruction level: parallel execution of machine instructions; optimising compilers can increase this potential by modifying the order of commands (better exploitation of superscalar architectures and pipelining mechanisms). sub-instruction level: instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e. g.)
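As an illustration of block-level parallelism, the following minimal Python sketch (not part of the lecture) starts several threads that share one address space; they communicate via a shared variable and use a lock as the synchronisation mechanism mentioned above.

import threading

counter = 0                      # shared variable in the common address space
lock = threading.Lock()          # synchronisation mechanism

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # prevent write conflicts on the shared variable
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 4 * 10000 = 40000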

50 Overview motivation classification of parallel computers levels of parallelism quantitative performance evaluation

51 Quantitative Performance Evaluation execution time: time T of a parallel program between the start of the execution on the first processor and the end of all computations on the last processor; during execution, every processor is in one of the following states: compute (computation time T_COMP: time spent for computations), communicate (communication time T_COMM: time spent for send and receive operations), idle (idle time T_IDLE: time spent waiting for messages to be sent / received); hence T = T_COMP + T_COMM + T_IDLE
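A minimal Python sketch (not from the lecture) of how these three contributions might be measured for one process; compute_step and exchange_halo are hypothetical stand-ins for the compute and communicate phases of a real parallel program.

import time

def compute_step(data):          # stand-in for the local computation phase
    return [x * 0.5 for x in data]

def exchange_halo(data):         # stand-in for a send / receive phase
    time.sleep(0.001)            # pretend communication latency
    return data

def run(steps=10):
    data = [1.0] * 1000
    t_comp = t_comm = 0.0
    t_start = time.perf_counter()
    for _ in range(steps):
        t0 = time.perf_counter()
        data = compute_step(data)
        t1 = time.perf_counter()
        data = exchange_halo(data)
        t2 = time.perf_counter()
        t_comp += t1 - t0
        t_comm += t2 - t1
    t_total = time.perf_counter() - t_start
    t_idle = t_total - t_comp - t_comm     # whatever is left: waiting time
    print(f"T={t_total:.4f}s T_COMP={t_comp:.4f}s T_COMM={t_comm:.4f}s T_IDLE={t_idle:.4f}s")

run()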

52 Quantitative Performance Evaluation parallel profile: measures the amount of parallelism of a parallel program; graphical representation: the x-axis shows time, the y-axis shows the number of parallel activities; identification of computation, communication, and idle periods [example chart: processes A, B, and C over time, each interval marked as compute, communicate, or idle; the number of active processes is read off the y-axis]

53 Quantitative Performance Evaluation parallel profile (cont'd) degree of parallelism: P(t) indicates the number of processes (of one application) that can be executed in parallel at any point in time (i. e. the y-values of the previous example for any time t); average parallelism (often referred to as parallel index): A(p) indicates the average number of processes that can be executed in parallel, hence A(p) = 1/(t2 - t1) * integral from t1 to t2 of P(t) dt, or A(p) = (sum_{i=1..p} i*t_i) / (sum_{i=1..p} t_i), where p is the number of processes and t_i is the time during which exactly i processes are busy
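A minimal Python sketch (not from the lecture) of the discrete formula for A(p); the busy_time values below are made up for illustration.

def average_parallelism(busy_time):
    # A(p) = (sum_i i * t_i) / (sum_i t_i); busy_time[i] is t_i
    weighted = sum(i * t for i, t in busy_time.items())
    total = sum(busy_time.values())
    return weighted / total

busy_time = {1: 10.0, 2: 15.0, 3: 10.0}     # hypothetical parallel profile
print(average_parallelism(busy_time))       # (10 + 30 + 30) / 35 = 2.0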

54 Quantitative Performance Evaluation parallel profile (cont'd) previous example: A(p) = 65/35 = 1.86; for A(p) there exist several theoretical (typically quite pessimistic) estimates, often used as arguments against parallel systems; example: estimate of MINSKY (1971); problem: the number of used processors is halved in every step; parallel summation of 2p numbers on p processors, e. g.; result?

55 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor correlation of multi- and monoprocessor system performance; important: a program that can be executed on both systems; definitions: P(1): number of unit operations of the program on the monoprocessor system; P(p): number of unit operations of the program on the multiprocessor system with p processors; T(1): execution time of the program on the monoprocessor system (measured in steps or clock cycles); T(p): execution time of the program on the multiprocessor system with p processors (measured in steps or clock cycles)

56 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) simplifying preconditions: T(1) = P(1), i. e. exactly one operation is executed per step on the monoprocessor system; T(p) ≤ P(p), i. e. more than one operation may be executed per step (for p ≥ 2) on the multiprocessor system with p processors

57 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) speed-up: S(p) = T(1) / T(p) indicates the improvement in processing speed; in general, 1 ≤ S(p) ≤ p. efficiency: E(p) = S(p) / p indicates the relative improvement in processing speed, i. e. the improvement normalised by the number of processors p; in general, 1/p ≤ E(p) ≤ 1

58 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) speed-up and efficiency can be seen in two different ways: algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system (absolute speed-up, absolute efficiency); algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system, which is unfair due to communication and synchronisation overhead (relative speed-up, relative efficiency)

59 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) overhead: O(p) = P(p) / P(1) indicates the necessary overhead of a multiprocessor system for organisation, communication, and synchronisation; in general, 1 ≤ O(p). parallel index: I(p) = P(p) / T(p) indicates the number of operations executed on average per time unit; I(p) corresponds to a relative speed-up that takes the overhead into account

60 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) utilisation: U(p) = I(p) / p indicates the number of operations each processor executes on average per time unit; it conforms to the normalised parallel index. conclusions: all defined quantities have the value 1 for p = 1; the parallel index is an upper bound for the speed-up: 1 ≤ S(p) ≤ I(p) ≤ p; the utilisation (workload) is an upper bound for the efficiency: 1/p ≤ E(p) ≤ U(p) ≤ 1

61 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) example (1): a monoprocessor system needs 6000 steps for the execution of 6000 operations to compute some result; a multiprocessor system with five processors needs 6750 operations for the computation of the same result, but only 1500 steps for the execution; thus P(1) = T(1) = 6000, P(5) = 6750, and T(5) = 1500; speed-up and efficiency can be computed as S(5) = 6000/1500 = 4 and E(5) = 4/5 = 0.8; there is an acceleration by a factor of 4 compared to the monoprocessor system, i. e. on average each processor of the multiprocessor system contributes an improvement of 80%

62 Quantitative Performance Evaluation comparison multiprocessor / monoprocessor (cont'd) example (2): parallel index and utilisation can be computed as I(5) = 6750/1500 = 4.5 and U(5) = 4.5/5 = 0.9; on average 4.5 processors are simultaneously busy, i. e. each processor is working only for 90% of the execution time; the overhead can be computed as O(5) = 6750/6000 = 1.125, i. e. there is an overhead of 12.5% on the multiprocessor system compared to the monoprocessor system; see the sketch below
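The following Python sketch (not from the lecture) evaluates the metrics defined on the preceding slides for exactly this worked example.

def metrics(P1, T1, Pp, Tp, p):
    S = T1 / Tp          # speed-up
    E = S / p            # efficiency
    I = Pp / Tp          # parallel index
    U = I / p            # utilisation
    O = Pp / P1          # overhead
    return S, E, I, U, O

print(metrics(P1=6000, T1=6000, Pp=6750, Tp=1500, p=5))
# (4.0, 0.8, 4.5, 0.9, 1.125)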

63 Quantitative Performance Evaluation scalability objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i. e. a linear performance increase with an efficiency close to 1; important for scalability is a sufficient problem size: one porter may carry one suitcase in a minute, 60 porters won't do it in a second, but 60 porters may carry 60 suitcases in a minute; in case of a fixed problem size and an increasing number of processors, saturation will occur for a certain value of p, hence scalability is limited; when scaling the number of processors together with the problem size (so-called scaled problem analysis) this effect will not appear for well-scalable hardware and software systems

64 Quantitative Performance Evaluation AMDAHL's law probably the most important and most famous estimate for the speed-up (even if quite pessimistic); underlying model: each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way (synchronisation, data I/O, e. g.); furthermore, each program consists of a parallelisable part 1 - s that can be executed in parallel by several processes (finding the maximum value within a set of numbers, e. g.); hence, the execution time of the parallel program executed on p processors can be written as T(p) = s*T(1) + ((1 - s)/p)*T(1)

65 Quantitative Performance Evaluation AMDAHL's law (cont'd) the speed-up can thus be computed as S(p) = T(1) / T(p) = T(1) / (s*T(1) + ((1 - s)/p)*T(1)) = 1 / (s + (1 - s)/p); when increasing p we finally get AMDAHL's law: lim_{p -> inf} S(p) = lim_{p -> inf} 1 / (s + (1 - s)/p) = 1/s, i. e. the speed-up is bounded: S(p) ≤ 1/s; the sequential part can have a dramatic impact on the speed-up, therefore the central effort of all (parallel) algorithms is to keep s small; many parallel programs have a small sequential part (s < 0.1)
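A minimal Python sketch (not from the lecture) of AMDAHL's law, showing how the speed-up approaches, but never exceeds, the bound 1/s.

def amdahl_speedup(s, p):
    # S(p) = 1 / (s + (1 - s)/p) for sequential part s and p processors
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1                                    # sequential part
for p in (1, 2, 4, 8, 16, 64, 1024):
    print(p, round(amdahl_speedup(s, p), 2))
# the values approach, but never exceed, 1/s = 10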

66 Quantitative Performance Evaluation AMDAHL's law (cont'd) example: s = 0.1 and, thus, S(p) ≤ 10 independent of p; the speed-up is bounded by this limit; where's the error? [plot: S(p) over p for s = 0.1, approaching the bound 10]

67 Quantitative Performance Evaluation GUSTAFSON's law addresses the shortcomings of AMDAHL's law, as it states that any sufficiently large problem can be efficiently parallelised; instead of a fixed problem size it supposes a fixed-time concept; underlying model: the execution time on the parallel machine is normalised to 1; this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1; hence, the execution time of the sequential program on the monoprocessor can be written as T(1) = σ + p*(1 - σ); the speed-up can thus be computed as S(p) = σ + p*(1 - σ) = p + σ*(1 - p)

68 Quantitative Performance Evaluation GUSTAFSON's law (cont'd) difference to AMDAHL: the sequential part s(p) is not constant, but gets smaller with increasing p: s(p) = σ / (σ + p*(1 - σ)), with s(p) in ]0, 1[; often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, fewer declarations, ...); the speed-up is not bounded for increasing p
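A minimal Python sketch (not from the lecture) of GUSTAFSON's law: the speed-up grows without bound while the effective sequential part s(p) shrinks towards 0.

def gustafson_speedup(sigma, p):
    # S(p) = p + sigma * (1 - p) for non-parallelisable part sigma
    return p + sigma * (1 - p)

def effective_sequential_part(sigma, p):
    # s(p) = sigma / (sigma + p * (1 - sigma))
    return sigma / (sigma + p * (1 - sigma))

sigma = 0.1
for p in (1, 2, 4, 8, 16, 64, 1024):
    print(p, gustafson_speedup(sigma, p), round(effective_sequential_part(sigma, p), 4))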

69 Quantitative Performance Evaluation GUSTAFSON's law (cont'd) some more thoughts about speed-up: theory tells us that a superlinear speed-up does not exist, since each parallel algorithm can be simulated on a monoprocessor system by emulating, in a loop, one step of each processor of the multiprocessor system in turn; but a superlinear speed-up can be observed when improving an inferior sequential algorithm, or when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in the caches and main memories of the nodes of the multiprocessor system

70 Quantitative Performance Evaluation communication-computation ratio (CCR) an important quantity measuring the success of a parallelisation; gives the ratio of pure communication time to pure computing time; a small CCR is favourable; typically, the CCR decreases with increasing problem size; example: an N x N matrix distributed among p processors (N/p rows each); iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values, hence the two neighbouring rows are always necessary; computation time: 8*N*N/p; communication time: 2*N; CCR: p/(4*N); what does this mean? (see the sketch below)
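A minimal Python sketch (not from the lecture) of the CCR for this row-wise distributed N x N stencil example, illustrating that the CCR shrinks as the problem size grows.

def ccr(N, p):
    t_comp = 8 * N * N / p      # 8 neighbour accesses per element, N*N/p elements per processor
    t_comm = 2 * N              # two neighbouring rows of length N have to be exchanged
    return t_comm / t_comp      # equals p / (4 * N)

for N in (100, 1000, 10000):
    print(N, ccr(N, p=16))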
