Basic Concepts in Parallelization




1 Basic Concepts in Parallelization. Ruud van der Pas, Senior Staff Engineer, Oracle Solaris Studio, Oracle, Menlo Park, CA, USA. IWOMP 2010, CCS, University of Tsukuba, Tsukuba, Japan, June 14-16, 2010.

2 Outline: Introduction, Parallel Architectures, Parallel Programming Models, Data Races, Summary.

3 Introduction

4 Why Parallelization? Parallelization is another optimization technique. The goal is to reduce the execution time, and to this end multiple processors, or cores, are used. Ideally, using 4 cores the execution time is 1/4 of the single-core time.

5 What Is Parallelization? "Something" is parallel if there is a certain level of independence in the order of operations; in other words, it does not matter in what order those operations are performed. This independence can exist at different levels of granularity: a sequence of machine instructions, a collection of program statements, an algorithm, or the problem you're trying to solve.

6 What is a Thread? Loosely said, a thread consists of a series of instructions with its own program counter (PC) and state. A parallel program executes its threads in parallel, and these threads are then scheduled onto processors.

7 Parallel overhead. The total CPU time often exceeds the serial CPU time: the newly introduced parallel portions in your program need to be executed, and threads need time sending data to each other and synchronizing (communication). Communication is often the key contributor, spoiling all the fun, and typically things get worse as the number of threads increases. Efficient parallelization is about minimizing the communication overhead.

8 Communication. Without communication (embarrassingly parallel), four threads run 4x faster: the wallclock time is 1/4 of the serial wallclock time. With communication, the program is less than 4x faster: the communication consumes additional resources, the wallclock time is more than 1/4 of the serial wallclock time, and the total CPU time increases.

9 Load balancing. With perfect load balancing, all threads finish in the same amount of time and no thread is idle. With load imbalance, different threads need a different amount of time to finish their task, some threads sit idle, the total wallclock time increases, and the program does not scale well.

10 About scalability. Define the speed-up S(P) as S(P) := T(1)/T(P). The efficiency E(P) is defined as E(P) := S(P)/P. In the ideal case, S(P) = P and E(P) = P/P = 1 = 100%. In some cases S(P) exceeds P; this is called "superlinear" behaviour, but don't count on it to happen. Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve; past this point P_opt, the application sees less and less benefit from adding processors. Note that both metrics give no information on the actual run time, and as such they can be dangerous to use.
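Not from the slides, but as a small illustration, a C sketch that computes these two metrics from measured run times (the timing values are made up):

    /* Minimal sketch (not from the slides): compute speed-up and efficiency
       from measured run times. The timing values below are made up. */
    #include <stdio.h>

    int main(void)
    {
        double T1 = 60.0;     /* hypothetical run time on 1 processor, in seconds */
        double TP = 10.0;     /* hypothetical run time on P processors            */
        int    P  = 8;

        double S = T1 / TP;   /* speed-up   S(P) = T(1)/T(P) */
        double E = S / P;     /* efficiency E(P) = S(P)/P    */

        printf("S(%d) = %.2f, E(%d) = %.0f%%\n", P, S, P, 100.0 * E);
        return 0;
    }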

11 Amdahl's Law. Assume our program has a parallel fraction f. This implies the execution time T(1) := f*T(1) + (1-f)*T(1). On P processors: T(P) = (f/P)*T(1) + (1-f)*T(1). Amdahl's law: S(P) = T(1)/T(P) = 1 / (f/P + 1 - f). Comments: this "law" describes the effect the non-parallelizable part of a program has on scalability. Note that the additional overhead caused by parallelization, and speed-up because of cache effects, are not taken into account.

12 Amdahl's Law. [Chart: speed-up versus number of processors (1-16) for f = 1.00, 0.99, 0.75, 0.50, 0.25 and 0.00.] It is easy to scale on a small number of processors, but scalable performance requires a high degree of parallelization, i.e. f very close to 1. This implies that you need to parallelize the part of the code where the majority of the time is spent.

13 Amdahl's Law in practice. We can estimate the parallel fraction f. Recall: T(P) = (f/P)*T(1) + (1-f)*T(1). It is trivial to solve this equation for f: f = (1 - T(P)/T(1))/(1 - (1/P)). Example: T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70 and f = (1 - 37/100)/(1 - (1/4)) = 0.63/0.75 = 0.84 = 84%. The estimated performance on 8 processors is then T(8) = (0.84/8)*100 + (1 - 0.84)*100 = 26.5, so S(8) = T(1)/T(8) = 100/26.5 ≈ 3.8.
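The same estimate, written out as a small C sketch (not from the slides; the run times are the ones used above):

    /* Sketch of the estimate above: derive f from two measured run times and
       use Amdahl's law to predict the run time on 8 processors. */
    #include <stdio.h>

    int main(void)
    {
        double T1 = 100.0, T4 = 37.0;   /* measured run times (from the slide) */
        int    P  = 4;

        double f  = (1.0 - T4 / T1) / (1.0 - 1.0 / P);   /* parallel fraction */
        double T8 = (f / 8.0) * T1 + (1.0 - f) * T1;     /* predicted T(8)    */

        printf("f = %.2f, T(8) = %.1f, S(8) = %.2f\n", f, T8, T1 / T8);
        return 0;
    }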

14 Threads Are Getting Cheap... [Chart: elapsed time and speed-up versus number of cores used (1-16) for f = 84%.] Using 8 cores: 100/26.5 = 3.8x faster.

15 Numerical results. Consider: A = B + C + D + E. Serial processing: A = B + C; A = A + D; A = A + E. Parallel processing: thread 0 computes T1 = B + C while thread 1 computes T2 = D + E, and the partial results are then combined: A = T1 + T2. The roundoff behaviour is different and so the numerical results may be different too. This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug...
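Not from the slides: a tiny C example of this effect. The values of B, C, D and E are arbitrary, chosen so that the difference is visible in single precision.

    /* Sketch: summing the same four numbers in a different order can give a
       different floating-point result (values chosen to make this visible). */
    #include <stdio.h>

    int main(void)
    {
        float B = 1.0f, C = 1.0e8f, D = -1.0e8f, E = 1.0f;

        float serial   = ((B + C) + D) + E;   /* A = B + C; A = A + D; A = A + E     */
        float parallel = (B + C) + (D + E);   /* T1 = B + C; T2 = D + E; A = T1 + T2 */

        printf("serial = %f, parallel = %f\n", serial, parallel);
        return 0;
    }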

16 Cache Coherence

17 Typical cache based system? [Diagram: CPU with L1 cache, L2 cache, and memory.]

18 Cache line modifications. [Diagram: a cache line from a page in main memory is copied into several caches; modifying one copy creates a cache line inconsistency.]

19 Caches in a parallel computer. A cache line starts in memory; over time, multiple copies of this line may exist in different caches. Cache coherence ("cc") tracks changes in the copies and makes sure the correct cache line is used. Different implementations are possible, but hardware support is needed to make it efficient. [Diagram: CPUs 0-3 with caches and memory; after a write, the stale copies in the other caches are invalidated.]

20 Parallel Architectures

21 Uniform Memory Access (UMA). Also called "SMP" (Symmetric Multi Processor). Memory access time is uniform for all CPUs; each CPU can be multicore. The interconnect (bus or crossbar) is cache coherent ("cc"). There is no fragmentation: memory and I/O are shared resources. Pros: easy to use and to administer, efficient use of resources. Cons: said to be expensive, said to be non-scalable.

22 NUMA. Also called "Distributed Memory" or NORMA (No Remote Memory Access). Memory access time is non-uniform, hence the name "NUMA". The interconnect is not cache coherent: Ethernet, InfiniBand, etc. The system runs N copies of the OS; memory and I/O are distributed resources. Pros: said to be cheap, said to be scalable. Cons: difficult to use and administer, inefficient use of resources.

23 The Hybrid Architecture. SMP/multicore nodes are connected by a second-level interconnect that is not cache coherent (Ethernet, InfiniBand, etc.). The hybrid architecture therefore combines all the pros and cons: UMA within one SMP/multicore node, NUMA across nodes.

24 cc-NUMA. A two-level interconnect: UMA/SMP within one system, NUMA between the systems. Both interconnects support cache coherence, i.e. the system is fully cache coherent. It has all the advantages ('look and feel') of an SMP; the downside is the non-uniform memory access time.

25 Parallel Programming Models

26 How To Program A Parallel Computer? There are numerous parallel programming models. The best-known are, for distributed memory (any cluster and/or a single SMP): Sockets (standardized, low level), PVM - Parallel Virtual Machine (obsolete), and MPI - Message Passing Interface (the de-facto standard); and for shared memory (SMP only): POSIX Threads (standardized, low level), OpenMP (the de-facto standard), and automatic parallelization (the compiler does it for you).

27 Parallel Programming Models Distributed Memory - MPI

28 What is MPI? MPI stands for the Message Passing Interface. MPI is a very extensive de-facto standard parallel programming API for distributed memory systems (i.e. a cluster); an MPI program can, however, also be executed on a shared memory system. The first specification appeared in 1994; the current MPI-2 specification was released in 1997, with major enhancements: remote memory management, parallel I/O, and dynamic process management.

29 More about MPI. MPI has its own data types (e.g. MPI_INT); user-defined data types are supported as well. MPI supports C, C++ and Fortran: include file <mpi.h> in C/C++ and mpif.h in Fortran. An MPI environment typically consists of at least a library implementing the API, a compiler and linker that support the library, and a run-time environment to launch an MPI program. Various implementations are available: HPC ClusterTools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, ...

30 The MPI Programming Model. [Diagram: a cluster of systems running processes 0 through 5, each with its own memory M.]

31 The MPI Execution Model. [Diagram: all processes start, call MPI_Init, communicate with each other during execution, call MPI_Finalize, and all processes finish.]

32 The MPI Memory Model. All threads/processes have access to their own, private, memory only. Data transfer and most synchronization has to be programmed explicitly. All data is private; data is shared explicitly by exchanging buffers.

33 Example - Send N Integers

    #include <mpi.h>                       /* include file               */

    you = 0; him = 1;
    MPI_Init(&argc, &argv);                /* initialize MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);    /* get process ID             */

    if ( me == 0 ) {                       /* process 0 sends            */
       error_code = MPI_Send(&data_buffer, N, MPI_INT, him,
                             1957, MPI_COMM_WORLD);
    } else if ( me == 1 ) {                /* process 1 receives         */
       error_code = MPI_Recv(&data_buffer, N, MPI_INT, you,
                             1957, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();                        /* leave the MPI environment  */
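For reference, a complete, compilable version of this fragment (the declarations, the value of N and the buffer contents are assumptions, not part of the slide):

    /* Complete version of the send/receive example above; declarations,
       N and the buffer contents are assumptions added for illustration. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 8

    int main(int argc, char *argv[])
    {
        int me, you = 0, him = 1, error_code;
        int data_buffer[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        if (me == 0) {                     /* process 0 sends            */
            for (int i = 0; i < N; i++) data_buffer[i] = i;
            error_code = MPI_Send(data_buffer, N, MPI_INT, him, 1957,
                                  MPI_COMM_WORLD);
        } else if (me == 1) {              /* process 1 receives         */
            error_code = MPI_Recv(data_buffer, N, MPI_INT, you, 1957,
                                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (error_code == MPI_SUCCESS)
                printf("process 1 received %d integers, last one = %d\n",
                       N, data_buffer[N - 1]);
        }

        MPI_Finalize();
        return 0;
    }

With a typical MPI installation this could be built and run with something like mpicc send_ints.c -o send_ints followed by mpirun -np 2 ./send_ints; the exact commands depend on the implementation.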

34 Run time Behavior. Process 0 (me = 0) calls MPI_Send with destination him = 1 and message label 1957; process 1 (me = 1) calls MPI_Recv for N integers with sender you = 0 and the same label 1957. The label and the sender/destination match, so the connection is established and the N integers are transferred.

35 The Pros and Cons of MPI Advantages of MPI: Flexibility - Can use any cluster of any size Straightforward - Just plug in the MPI calls Widely available - Several implementations out there Widely used - Very popular programming model Disadvantages of MPI: Redesign of application - Could be a lot of work Easy to make mistakes - Many details to handle Hard to debug - Need to dig into underlying system More resources - Typically, more memory is needed Special care - Input/Output

36 Parallel Programming Models Shared Memory - Automatic Parallelization

37 Automatic Parallelization (-xautopar). The compiler performs the parallelization (loop based): different iterations of the loop are executed in parallel, and the same binary is used for any number of threads. For example, for

    for (i=0; i<1000; i++)
       a[i] = b[i] + c[i];

with OMP_NUM_THREADS=4, thread 0 executes iterations 0-249, thread 1 executes 250-499, thread 2 executes 500-749, and thread 3 executes 750-999.
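As a rough sketch (not from the slides), this is how such a loop might be packaged and compiled with the Oracle Solaris Studio C compiler; the file name and the flags other than -xautopar are assumptions on my part.

    /* Sketch only: a loop whose iterations are independent, so the compiler
     * can parallelize it automatically. A possible build/run sequence with
     * Oracle Solaris Studio might be (flags other than -xautopar assumed):
     *     cc -fast -xautopar -xloopinfo autopar.c -o autopar
     *     export OMP_NUM_THREADS=4
     *     ./autopar
     */
    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* Independent iterations: a candidate for automatic parallelization */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[N-1] = %.1f\n", a[N - 1]);
        return 0;
    }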

38 Parallel Programming Models Shared Memory - OpenMP

39 Shared Memory. [Diagram: processors 0 through P attached to shared memory.] http://www.openmp.org

40 Shared Memory Model. All threads have access to the same, globally shared, memory, and each thread also has its own private data. Data can be shared or private: shared data is accessible by all threads, while private data can only be accessed by the thread that owns it. Data transfer is transparent to the programmer, and synchronization takes place, but it is mostly implicit.

41 About data. In a shared memory parallel program, variables have a "label" attached to them. Labelled "private": visible to one thread only; a change made in this local data is not seen by others; example - local variables in a function that is executed in parallel. Labelled "shared": visible to all threads; a change made in this global data is seen by all others; example - global data.

42 Example - Matrix times vector

    #pragma omp parallel for default(none) \
            private(i,j,sum) shared(m,n,a,b,c)
    for (i=0; i<m; i++)
    {
       sum = 0.0;
       for (j=0; j<n; j++)
          sum += b[i][j]*c[j];
       a[i] = sum;
    }

With two threads and m = 10, thread 0 (TID = 0) executes iterations i = 0,1,2,3,4 and thread 1 (TID = 1) executes i = 5,6,7,8,9; each thread accumulates its own private sum for row i and stores the result in a[i].
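A minimal self-contained version of this example (the array sizes and data values are arbitrary choices, not from the slide):

    /* Self-contained matrix-times-vector sketch; M, N and the data values
       are arbitrary. Compile with OpenMP enabled (e.g. -fopenmp or -xopenmp). */
    #include <stdio.h>

    #define M 10
    #define N 4

    int main(void)
    {
        double a[M], b[M][N], c[N], sum;
        int i, j;

        for (j = 0; j < N; j++) c[j] = 1.0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++) b[i][j] = i + j;

        #pragma omp parallel for default(none) private(i, j, sum) shared(a, b, c)
        for (i = 0; i < M; i++) {
            sum = 0.0;
            for (j = 0; j < N; j++)
                sum += b[i][j] * c[j];
            a[i] = sum;
        }

        for (i = 0; i < M; i++) printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }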

43 A Black and White comparison

    MPI                                     | OpenMP
    ----------------------------------------+-------------------------------------
    De-facto standard                       | De-facto standard
    Endorsed by all key players             | Endorsed by all key players
    Runs on any number of (cheap) systems   | Limited to one (SMP) system
    Grid ready                              | Not (yet?) grid ready
    High and steep learning curve           | Easier to get started (but, ...)
    You're on your own                      | Assistance from the compiler
    All or nothing model                    | Mix and match model
    No data scoping (shared, private, ...)  | Requires data scoping
    More widely used (but ...)              | Increasingly popular (CMT!)
    Sequential version is not preserved     | Preserves the sequential code
    Requires a library only                 | Needs a compiler
    Requires a run-time environment         | No special environment
    Easier to understand performance        | Performance issues implicit

44 The Hybrid Parallel Programming Model

45 The Hybrid Programming Model. [Diagram: distributed memory connecting several shared memory nodes; within each node, threads 0 through P share the node's memory.]

46 Data Races

47 About Parallelism. Parallelism means independence: there is no fixed ordering of the operations. "Something" that does not obey this rule is not parallel (at that level...).

48 Shared Memory Programming. [Diagram: several threads, each with private data, reading and writing variables X and Y in shared memory.] Threads communicate via shared memory.

49 What is a Data Race? Two different threads in a multi-threaded shared memory program access the same (= shared) memory location, asynchronously, and without holding any common exclusive locks, and at least one of the accesses is a write/store.

50 Example of a data race

    #pragma omp parallel shared(n)
    { n = omp_get_thread_num(); }

Every thread in the parallel region writes to the same shared variable n at the same time: a data race.

51 Another example

    #pragma omp parallel shared(x)
    { x = x + 1; }

Every thread both reads and writes the shared variable x without any protection: again a data race.
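Not part of the slides: one common way to make this particular update well protected is an atomic construct (a critical section or a reduction would also work), for example:

    /* Sketch: the same update, now protected with an atomic construct so
       that the increments from different threads do not race. */
    #include <stdio.h>

    int main(void)
    {
        int x = 0;

        #pragma omp parallel shared(x)
        {
            #pragma omp atomic
            x = x + 1;           /* protected update: no data race */
        }

        printf("x = %d\n", x);   /* equals the number of threads, every run */
        return 0;
    }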

52 About Data Races. Loosely described, a data race means that the update of a shared variable is not well protected. A data race tends to show up in a nasty way: numerical results are (somewhat) different from run to run, and especially with floating-point data this is difficult to distinguish from a numerical side-effect. Changing the number of threads can cause the problem to seemingly (dis)appear; it may also depend on the load on the system, and it may only show up when using many threads.

53 A parallel loop

    for (i=0; i<8; i++)
       a[i] = a[i] + b[i];

Every iteration in this loop is independent of the other iterations, so the eight updates can be divided over two threads and executed concurrently, in any order, without changing the result.

54 Not a parallel loop

    for (i=0; i<8; i++)
       a[i] = a[i+1] + b[i];

Here iteration i reads a[i+1], the element that iteration i+1 overwrites; for example, a[1] = a[2] + b[1] needs the old value of a[2], which another thread executing a[2] = a[3] + b[2] may already have replaced. The result is not deterministic when the loop is run in parallel!

55 About the experiment. We manually parallelized the previous loop; the compiler detects the data dependence and does not parallelize it. Vectors a and b are of type integer. We use the checksum of a as a measure for correctness: checksum += a[i] for i = 0, 1, 2, ..., n-2. The correct, sequential, checksum result is computed as a reference. We ran the program using 1, 2, 4, 32 and 48 threads, and each of these experiments was repeated 4 times.
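A hedged sketch of such an experiment (not the original code; the array length n and the initial values are assumptions):

    /* Sketch of the experiment described above. The loop carries a
       dependence (a[i] reads a[i+1]), so forcing it to run in parallel
       creates a data race and the checksum may differ from the reference.
       Compile with OpenMP enabled (e.g. -fopenmp or -xopenmp). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        int n = 64;
        int *a   = malloc((n + 1) * sizeof(int));
        int *ref = malloc((n + 1) * sizeof(int));
        int *b   = malloc(n * sizeof(int));
        long checksum = 0, correct = 0;

        for (int i = 0; i <= n; i++) { a[i] = i; ref[i] = i; }
        for (int i = 0; i <  n; i++) b[i] = 1;

        /* Correct, sequential reference result */
        for (int i = 0; i < n; i++) ref[i] = ref[i + 1] + b[i];
        for (int i = 0; i < n - 1; i++) correct += ref[i];

        /* The same loop, deliberately (and incorrectly) run in parallel */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = a[i + 1] + b[i];

        for (int i = 0; i < n - 1; i++) checksum += a[i];

        printf("threads: %d checksum %ld correct value %ld\n",
               omp_get_max_threads(), checksum, correct);

        free(a); free(ref); free(b);
        return 0;
    }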

56 Numerical results (4 runs per thread count):

    threads:  1   checksum 1953 1953 1953 1953   correct value 1953
    threads:  2   checksum 1953 1953 1953 1953   correct value 1953
    threads:  4   checksum 1905 1905 1953 1937   correct value 1953
    threads: 32   checksum 1525 1473 1489 1513   correct value 1953
    threads: 48   checksum  936 1007  887  822   correct value 1953

Data Race In Action!

57 Summary

58 Parallelism Is Everywhere. Multiple levels of parallelism:

    Granularity       | Technology   | Programming Model
    ------------------+--------------+----------------------
    Instruction Level | Superscalar  | Compiler
    Chip Level        | Multicore    | Compiler, OpenMP, MPI
    System Level      | SMP/cc-NUMA  | Compiler, OpenMP, MPI
    Grid Level        | Cluster      | MPI

Threads Are Getting Cheap.