High Performance Computing
1 High Performance Computing, Oliver Rheinbach
Lecture: Introduction to High Performance Computing; audience: 3. CMS, 5. Mm; every week, Tue 14:00-15:30, room PRÜ-1103
Exercise: Introduction to High Performance Computing; audience: 3. CMS, 5. Mm; every week, Fri 11:00-12:30, room PRÜ-1103 or PC-Pool 1
2 Intel Processor Roadmap
1993: 486 processor, die size 81 mm², 1.2 million transistors (1000 nm, 600 nm), 33 MHz; 1 flop per cycle gives 33 MFlops peak.
1999: Pentium III (Coppermine), die size 140 mm², 28 million transistors (250 nm, 130 nm).
2004: Pentium 4 (Prescott), 122 mm² (90 nm), 2.8 GHz.
2009: Core 2 Duo, die size 111 mm², 274 million transistors (45 nm), 3 GHz; 4 flops per cycle x 2 cores gives 24 GFlops theoretical peak, which is (naively) a 727-fold increase over 1993.
Core i7 (Bloomfield): 4 cores, 731 million transistors (45 nm), 3.33 GHz.
2013: Core i7 (Haswell): 4 cores, die size 177 mm², 1400 million transistors (22 nm).
2014: Xeon E7 (Ivy Bridge EX): 15 cores, die size 541 mm² (22 nm).
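As a quick check of these peak numbers, simply multiplying the figures quoted above gives
\[
4\ \tfrac{\text{flops}}{\text{cycle}} \times 2\ \text{cores} \times 3\ \text{GHz} = 24\ \text{GFlops},
\qquad
\frac{24\ \text{GFlops}}{33\ \text{MFlops}} \approx 727 .
\]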
3 Theoretical Limits
According to (a variant of) Moore's law, the floating point performance of single processor cores has increased ten-fold every 5 years since the 1950s. (The original Moore's law states a doubling of the number of transistors roughly every two years.) Today's processor cores are thus many orders of magnitude faster than those of the 1950s.
There are theoretical limits to the performance of synchronized single processors.
Structure size: today's processors have structures of 22 nm = 2.2 x 10^-8 m (Ivy Bridge, 2014). The size of an atom is on the order of 10^-10 m. A factor of roughly 220 to go! Intel roadmap: 14 nm (Broadwell), planned for 2014, pushed to 2015.
Speed of light: c = 3 x 10^10 cm/s. A signal can cross a chip with a diameter of 1 cm at most about 3 x 10^10 times per second, i.e., the clock is limited to the order of 30 GHz. A factor of 10 to go!
4 Current TOP500 List 06/2014 (name, institution)
Tianhe-2, Nat. U of Defense Techn.
Titan, Oak Ridge NL
Sequoia, Lawrence Livermore NL
K Computer, RIKEN
Mira, Argonne NL
Piz Daint, Swiss Nat. SC
Stampede, U of Texas
JUQUEEN, FZ Jülich
Vulcan, Lawrence Livermore NL
Gov. USA
(12) SuperMUC, Leibniz RZ
Performance has grown by a factor of hundreds of thousands from 1993 to 2013; a factor of 3000 of this comes from more parallelism. The challenges for implicit methods are even higher than for other classes of parallel algorithms. New approaches may be needed for exascale computing.
5 Supercomputers 1993 to 2013: a factor of 3000 through more parallelism.
6 High Performance Computing. TOP500 list: a factor of 3000 in parallelism! A factor of hundreds of thousands in performance!
7 Hermit (HLRS Stuttgart). Rank 12 in the TOP500 of 11/2011; rank 34 in the TOP500 of 06/2013. The processor dies of the 7092 Opteron 6276 processors cover an area of > 0.7 m² of silicon. The maximum frequency at which that area could be clocked synchronously is about 300 MHz; the actual frequency is 2.3 GHz.
8-17 TOP500 06/2014 Statistics (charts from Wikipedia and other sources; not reproduced here).
18 OpenMP
OpenMP 1.0 is from 1998; MPI 1.0 is from 1994. OpenMP is the de facto standard for shared memory applications in C, C++ and Fortran, whereas MPI is the de facto standard for distributed memory applications. OpenMP is portable and relatively simple, perfect for multicore (for the moment; new parallel languages will come up [see Juelich]). Note that MPI gives good performance on distributed AND shared memory machines, but it is more complicated.
OpenMP consists of new compiler directives, a runtime library, and environment variables.
OpenMP uses thread-based parallelism. What is a thread? A thread is a lightweight process, i.e., an instance of the program plus its data. Each thread has its own ID (0, ..., N-1) and its own flow of control. Threads can share data but also have private data. All communication is done via the shared data.
19 OpenMP Programming Model
All threads have access to the same globally shared memory. Data can be shared or private (shared data is visible to all threads, private data only to its owner). Data transfer is transparent to the programmer. Synchronization takes place but is mostly implicit.
Fork and join model: a master thread forks a team of parallel threads 1, ..., n for a parallel region; at the end of the region the threads synchronize and join.
Today's compilers can do automatic parallelization of loops, BUT they may fail because auto-parallelization may not be able to determine whether there is a data dependence or not (the situation is better in Fortran than in C), or because the granularity is not high enough: the compiler will parallelize at the lowest level but should do so at the highest level. State-of-the-art speedup using auto-parallelization is typically 4-8 at most. Here, the programmer can help by using explicit parallelization with OpenMP.
20 Some OpenMP Directives: omp parallel
When a thread reaches a PARALLEL directive, it creates a team of N threads and becomes the master of the team. The master is a member of the team and has thread number 0. The number of threads N is determined at runtime from OMP_NUM_THREADS; all threads execute the next statement, or the next block if the statement is a { }-block. After the statement, the threads join back into one. There is an implied barrier at the end of the block.

C:
#pragma omp parallel
{
  // Code inside this region runs in parallel.
  printf("Hello!\n");
}

Fortran:
!$OMP PARALLEL
      PRINT *, 'Hello!'
!$OMP END PARALLEL
21 omp parallel

#pragma omp parallel
{
  // Code inside this region runs in parallel.
  printf("Hello!\n");
}

This example prints the text Hello! followed by a newline as many times as there are threads in the team. For a quad-core system, it will output:
Hello!
Hello!
Hello!
Hello!
Note that there is overhead in creating and destroying a team. That is why it is often best to parallelize the outermost loop.
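A minimal sketch of this advice (the array a and its dimensions are made up for illustration): open the parallel region once around the outermost loop, so the thread team is created only once instead of once per outer iteration.

#include <omp.h>

#define N 1000
#define M 1000

double a[N][M];

int main() {
  int i, j;
  /* One team for the whole loop nest: parallelizing the inner loop
     instead would create and destroy a team N times. */
  #pragma omp parallel for private(j)
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
      a[i][j] = (double)(i + j);
  return 0;
}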
22 omp parallel private
Variables (except for the automatic variables local to the block) are by default shared by the whole thread team. Sometimes this is not the desired behavior. In that case the PRIVATE clause can be used: all variables in its list are private to each thread, i.e., there exist N copies of them. The private variables are uninitialized! But see also firstprivate and lastprivate later.
23 omp parallel private (C)

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main () {
  int nthreads, tid;

  /* Fork a team of threads giving them their own copies of variables */
  #pragma omp parallel private(nthreads, tid)
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and disband */
}
24 omp parallel private (Fortran)

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM
C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
C     Obtain thread number
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
C     Only master thread does this
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads = ', NTHREADS
      END IF
C     All threads join master thread and disband
!$OMP END PARALLEL
      END
25 omp parallel private (Output)

Output:
Hello World from thread = 1
Hello World from thread = 3
Hello World from thread = 0
Number of threads = 4
Hello World from thread = 2
26 omp barrier
Threads can be synchronized by using an OpenMP barrier. The barrier directive causes threads encountering the barrier to wait until all the other threads in the same team have encountered the barrier. A barrier is implied after: end parallel, end for, end sections, end single.
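A minimal sketch of an explicit barrier (the array buf and the two phases are made up for illustration): no thread starts the read phase before all threads have finished the write phase.

#include <omp.h>
#include <stdio.h>

#define N 64
int buf[N];

int main() {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if (tid < N) buf[tid] = tid * tid;   /* phase 1: each thread fills its slot */
    #pragma omp barrier                  /* wait until every thread is done writing */
    if (tid == 0)                        /* phase 2: reading is now safe */
      printf("buf[0]=%d buf[1]=%d\n", buf[0], buf[1]);
  }
  return 0;
}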
27 omp parallel for
The PARALLEL FOR directive splits the for-loop so that each thread in the current team handles a different portion of the loop. All portions are roughly of the same size. Note that (in C) some restrictions apply to the for-loop, e.g., the number of iterations must be known at runtime at the entry of the loop. There is an implicit barrier at the end of the loop.
28 Add vectors in parallel in C

cat omp_add.c

#include <stdio.h>
#include <omp.h>

int main()
{
  int n=8;
  int a[]={1,2,3,4,5,6,7,8};
  int b[]={1,2,3,4,5,6,7,8};
  int c[]={0,0,0,0,0,0,0,0};
  int i,num;
  int nthreads;

  #pragma omp parallel for shared(n,a,b,c) private(i,num)
  for(i=0;i<n;i++) {
    c[i]=a[i]+b[i];
  }
29 (omp_add.c continued)

  #pragma omp parallel
  {
    num = omp_get_num_threads();
    printf("Num threads: %d\n",num);
  }

  #pragma omp master
  {
    printf("\n");
    for(i=0;i<n;i++) {
      printf("%d ",c[i]);
    }
  }
  printf("\n");
}

How to Log in, Edit, Compile and Run

rheinbach@quattro:~$ g++ -fopenmp omp_add.c
rheinbach@quattro:~$ ./a.out
30 Output

Num threads: 4
Num threads: 4
Num threads: 4
Num threads: 4

2 4 6 8 10 12 14 16
31 Add vectors in parallel (FORTRAN)

In Fortran the directive is PARALLEL DO:

!$OMP PARALLEL DO
!$OMP& SHARED(A,B,C) PRIVATE(I)
      DO I = 1, N
        C(I) = A(I) + B(I)
      ENDDO
32 Changing the number of threads
The number of threads can be changed by setting the environment variable OMP_NUM_THREADS. A typical scalability example:

rheinbach@quattro:~$ export OMP_NUM_THREADS=1
rheinbach@quattro:~$ time ./a.out
real 0m7.022s
user 0m7.020s
sys  0m0.000s

rheinbach@quattro:~$ export OMP_NUM_THREADS=2
rheinbach@quattro:~$ time ./a.out
real 0m3.489s
user 0m6.136s
sys  0m0.000s

rheinbach@quattro:~$ export OMP_NUM_THREADS=4
rheinbach@quattro:~$ time ./a.out
real 0m1.913s
user 0m6.036s
sys  0m0.000s
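From the wall-clock (real) times in this example the parallel speedup can be read off directly:
\[
S(2) = \frac{7.022\,\mathrm{s}}{3.489\,\mathrm{s}} \approx 2.0,
\qquad
S(4) = \frac{7.022\,\mathrm{s}}{1.913\,\mathrm{s}} \approx 3.7 .
\]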
33 omp parallel reduction
The REDUCTION clause has the form REDUCTION(op:variable, op:variable, ...). It performs a reduction operation on the variables in its list using the given operator. They must be shared in the enclosing block. A private copy of each variable, with the same name, is created at entry of the block. At the end of the block the reduction operation is applied to the private copies and the result is written into the shared variable of the same name. The private copies are automatically initialized to an appropriate value (see below). You may also initialize the shared variable to a value other than 0 before the loop.

C:
#pragma omp parallel for private(i) reduction(+:result)
for (i=0; i < n; i++) {
  result = result + (a[i] * b[i]);
}
34 omp parallel reduction

FORTRAN:
!$OMP PARALLEL DO PRIVATE(I)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT

The operator in the REDUCTION list may be:
Operator      Initialization value
+, -, ^       0
*, &&         1
&             ~0
35 How to compile and run

ssh pico
loop.c
gcc -fopenmp loop.c
export OMP_NUM_THREADS=4
time ./a.out
36 omp single and omp master
The single directive specifies that the given block is executed by only one thread. It is unspecified which thread executes the block! The other threads skip the statement/block and wait at the implicit barrier at the end of the block. With the master directive the block is run by the master thread and there is no implied barrier at the end: the other threads skip the construct without waiting.

#pragma omp parallel
{
  ParallelPart1();
  #pragma omp single
  {
    SequentialPart();
  }
  ParallelPart2();
}
37 Race Conditions / Thread Safety
If several threads write different values to a shared variable at the same time, the result is undefined. Such a situation is referred to as a race condition ("Wettlaufsituation"). Race conditions are simply incorrect parallel algorithms. Such a race condition arises in the vector scalar product if the reduction clause is omitted. Concurrent writes to a variable can also be prevented by the critical directive. Note that the C++ STL is NOT THREAD-SAFE: concurrent access to containers is not allowed, other than const access.
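A minimal sketch of protecting the scalar-product update with critical instead of reduction (the arrays a, b and the length N are made up for illustration); this is correct but much slower than a reduction, since the threads serialize on the update.

#include <stdio.h>

#define N 8

int main() {
  double a[N], b[N], result = 0.0;
  int i;
  for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

  #pragma omp parallel for private(i) shared(a, b, result)
  for (i = 0; i < N; i++) {
    double t = a[i] * b[i];
    #pragma omp critical
    result = result + t;   /* only one thread at a time updates result */
  }
  printf("result = %f\n", result);
  return 0;
}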
38 omp flush
After a write to a shared variable, the result may, due to optimizations, still remain in a register, so a different thread may read a stale value from memory. A flush enforces a consistent view of memory at that point in the program. To ensure that both variables in the example on the next slide are flushed before the if clause, a barrier would have to be added after the flush. A flush is always needed after a write and before a read by another thread, but a flush is implied after: barrier, critical, end critical, parallel, end parallel, end sections, end single.
39 omp flush (Example)
In this example, when a thread reaches the flush directive its result is written to shared memory. Therefore either none of the threads will enter its if-block (if they flush at the same time) or one of them will (if it reaches its if-clause before the other thread flushes):

/* a=0 and b=0 */

/* Thread 1 */
b=1;
#pragma omp flush(a,b)
if(a==0) { ... }

/* Thread 2 */
a=1;
#pragma omp flush(a,b)
if(b==0) { ... }
40 Memory Cells and Registers
Variables can be stored in memory cells, i.e., they reside in main memory; access is then slow and typically goes through caches. Variables can also reside in internal memory of the processor, called registers; access is then immediate. But the number of registers is very limited, e.g., 16 general-purpose registers on your x86-64/amd64 computer. Remark: C has the (ancient) keyword register, which is a recommendation to the compiler to store the variable in a register for fast access, e.g., as a loop index. Today it is obsolete, since compilers optimize on their own and may even ignore register. With respect to the language, however, they still enforce that the address (&) of a register variable cannot be taken.
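A minimal sketch of the register keyword and the remaining language rule (the loop itself is made up for illustration):

#include <stdio.h>

int main() {
  register int i;            /* hint: keep the loop index in a register */
  int sum = 0;
  for (i = 0; i < 10; i++)
    sum += i;
  /* int *p = &i;  -- would not compile: the address of a
                      register variable cannot be taken in C */
  printf("sum = %d\n", sum);
  return 0;
}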
