Introducción a la computación de altas prestaciones (Introduction to High Performance Computing)


1 Introducción a la computación de altas prestaciones (Introduction to High Performance Computing)

2 Questions Why Parallel Computers? How Can the Quality of the Algorithms be Analyzed? How Should Parallel Computers Be Programmed? Why the Message Passing Programming Paradigm? Why the Shared Memory Programming Paradigm?

3 OUTLINE Introduction to Parallel Computing Performance Metrics Models of Parallel Computers The MPI Message Passing Library Examples The OpenMP Shared Memory Library Examples Improvements in black hole detection using parallelism

4 Why Parallel Computers? Applications demanding more computational power: Artificial Intelligence, Weather Prediction, Biosphere Modeling, Processing of Large Amounts of Data (from sources such as satellites), Combinatorial Optimization, Image Processing, Neural Networks, Speech Recognition, Natural Language Understanding, etc.

5 Top500

6 Performance Metrics

7 Speed-up T_s = Sequential Run Time: time elapsed between the beginning and the end of the execution on a sequential computer. T_p = Parallel Run Time: time that elapses from the moment a parallel computation starts to the moment the last processor finishes the execution. Speed-up = T*_s / T_p <= p, where T*_s is the time of the best sequential algorithm to solve the problem.

8 Speed-up [Figure: linear speed-up as a function of the number of processors]

9 Speed-up [Figure: linear speed-up versus actual speed-up 1]

10 Speed-up [Figure: linear speed-up versus actual speed-ups 1 and 2]

11 Speed-up [Figure: linear and actual speed-ups; the optimal number of processors is marked on the second actual curve]

12 Efficiency In practice, the ideal behavior of a speed-up equal to p is not achieved, because while executing a parallel algorithm the processing elements cannot devote 100% of their time to the computations of the algorithm. Efficiency: measure of the fraction of time for which a processing element is usefully employed. E = (Speed-up / p) x 100%
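As a quick illustration (not part of the original slides), the following C fragment computes the speed-up and efficiency defined above from measured run times; the timing values and processor count are made-up examples.

#include <stdio.h>

int main(void) {
    double t_seq = 120.0;   /* hypothetical sequential run time, in seconds */
    double t_par = 18.0;    /* hypothetical parallel run time on p processors */
    int    p     = 8;

    double speedup    = t_seq / t_par;        /* S = T_s / T_p */
    double efficiency = speedup / p * 100.0;  /* E = (S / p) x 100% */

    printf("speed-up   = %.2f\n", speedup);
    printf("efficiency = %.1f %%\n", efficiency);
    return 0;
}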

13 Efficiency [Figure: optimum efficiency versus actual efficiency curves, in %, as a function of the number of processors]

14 Amdahl's Law Amdahl's law attempts to give a maximum bound for the speed-up, based on the nature of the algorithm chosen for the parallel implementation. Seq = proportion of time the algorithm needs to spend in purely sequential parts. Par = proportion of time that might be done in parallel. Seq + Par = 1 (where 1 is for algebraic simplicity). Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p). For p = 1000:
% Seq    Seq      Par      Maximum Speed-up
0.1%     0.001    0.999    500.25
0.5%     0.005    0.995    166.81
1%       0.01     0.99     90.99
10%      0.1      0.9      9.91
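The table above can be reproduced with a few lines of C; this is only a sketch of the formula, and the processor count p = 1000 is an assumption consistent with the values shown.

#include <stdio.h>

/* Maximum speed-up predicted by Amdahl's law: 1 / (Seq + Par/p) */
static double amdahl(double seq, int p) {
    double par = 1.0 - seq;
    return 1.0 / (seq + par / (double)p);
}

int main(void) {
    const double seq_fractions[] = {0.001, 0.005, 0.01, 0.1};
    const int p = 1000;   /* assumed processor count matching the table */
    for (int i = 0; i < 4; i++)
        printf("Seq = %5.3f  ->  maximum speed-up for p = %d: %7.2f\n",
               seq_fractions[i], p, amdahl(seq_fractions[i], p));
    return 0;
}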

15 Example A problem to be solved many times over several different inputs: evaluate F(x,y,z) with x in {1,..., 20}, y in {1,..., 10}, z in {1,..., 3}. The total number of evaluations is 20*10*3 = 600. The cost to evaluate F at one point (x, y, z) is t, so the total running time is t * 600. If t is equal to 3 hours, the total running time for the 600 evaluations is 1800 hours, about 75 days. Since the evaluations are independent, distributing them among p processors would reduce this to roughly 1800/p hours.

16 Speed-up [Figure: linear, actual and super-linear speed-up curves as a function of the number of processors]

17 Models

18 The Sequential Model The RAM model expresses computations on von Neumann architectures. The von Neumann architecture is universally accepted for sequential computations.

19 The Parallel Model: Computational Models, Programming Models, Architectural Models

20 Address-Space Organization

21 Digital AlphaServer 8400 [Figure: hardware architecture]

22 SGI Origin 2000 [Figure: hardware architecture]

23 The SGI Origin 3000 Architecture (1/2) jen50.ciemat.es 160 processors MIPS R14000 / 600MHz On 40 nodes with 4 processors each Data and instruction cache on-chip Irix Operating System Hypercubic Network

24 The SGI Origin 3000 Architecture (2/2) cc-NUMA memory architecture. 1 Gflops peak speed. 8 MB external cache. 1 GB main memory per processor. 1 TB hard disk.

25 Beowulf Computers [Figure: cluster of commodity PCs connected by a local network]

26 Towards Grid Computing.

27 The Parallel Model: Computational Models, Programming Models, Architectural Models

28 Drawbacks that arise when solving problems using parallelism: Parallel programming is more complex than sequential programming. Results may vary as a consequence of intrinsic non-determinism. New problems appear: deadlocks, starvation... It is more difficult to debug parallel programs. Parallel programs are less portable.

29 The Message Passing Model

30 The Message Passing Model [Figure: several processors connected through an interconnection network; each processor has its own local memory and communicates by sending and receiving messages]

31

32 MPI

33 MPI What Is MPI? Message Passing Interface standard The first standard and portable message passing library with good performance "Standard" by consensus of MPI Forum participants from over 40 organizations Finished and published in May 1994, updated in June 1995 What does MPI offer? Standardization - on many levels Portability - to existing and new systems Performance - comparable to vendors' proprietary libraries Richness - extensive functionality, many quality implementations

34 A Simple MPI Program

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello from processor %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}

$> mpicc -o hello hello.c
$> mpirun -np 4 hello
Hello from processor 2 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
Hello from processor 0 of 4

35 "" -.#"!/!#- 0.#"!/#- 0.#"!1 $#- 1 #%# 2'34&5 # $!678 3B #%::'&8 9 9;% 9 9< +: &8 9 9#=% 9 9< +:$&8 $> mpirun -np 4 helloms Processor 2 of 4 Processor 3 of 4 Processor 1 of 4 processor 0, p = 4 greetings from process 1! greetings from process 2! greetings from process 3! A Simple MPI Program #% C67&5 $#%1?!?!@1 $&8 $#% 1# $?!C1 &8! 678 9!% "% &DB 9! 99<+&8 A"5 $#%1$7$6?!1$&8 %6B8/$8DD&5 9'% B < +:&8 &8 A A 9#"#=%&8 A

36 Linear Model to Predict Communication Performance: time to send n bytes = τ·n + β [Figure: fitted linear models; BEOULL: 7E-07·n + 0.0003, CRAYT3E: 5E-08·n + 5E-05]

37 Performance Prediction: Fast Ethernet, Gigabit Ethernet, Myrinet [Figure: fitted linear models; Fast: y = 9E-08·x + 0.0007, Giga: y = 5E-08·x + 0.0003, Myrinet: y = 4E-09·x + 2E-…]
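To make the linear cost model concrete, here is a small sketch (not from the original slides) that predicts the transfer time T(n) = β + τ·n using the Fast Ethernet coefficients fitted above; the message sizes are arbitrary examples.

#include <stdio.h>

/* Linear point-to-point communication model: T(n) = beta + tau * n */
static double comm_time(double beta, double tau, long n_bytes) {
    return beta + tau * (double)n_bytes;
}

int main(void) {
    /* Coefficients fitted for Fast Ethernet on the slide: y = 9E-08*x + 0.0007 */
    const double beta = 0.0007;   /* latency, in seconds */
    const double tau  = 9e-08;    /* time per byte, in seconds */

    long sizes[] = {1024, 65536, 1048576};
    for (int i = 0; i < 3; i++)
        printf("n = %8ld bytes  ->  predicted time = %.6f s\n",
               sizes[i], comm_time(beta, tau, sizes[i]));
    return 0;
}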

38 Basic Communication Operations

39 One-to-all Broadcast / Single-node Accumulation [Figure]

40 Broadcast on Hypercubes [Figure: broadcast steps on hypercubes of dimension 2 and 3]

41 Broadcast on Hypercubes [Figure: broadcast steps on a 3-D hypercube]

42 MPI Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);

Broadcasts a message from the process with rank "root" to all other processes of the group.
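A minimal usage sketch (not part of the original slides): the root process sets a parameter and broadcasts it, so that every process works with the same value; the variable name n and the value 1000 are arbitrary.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 1000;    /* e.g. a problem size known only to the root */

    /* After the call, every process in MPI_COMM_WORLD holds the same n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("process %d sees n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}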

43 Reduction with a commutative and associative operator: A_i resides in processor i, and every processor has to obtain A_0 op A_1 op ... op A_(P-1) [Figure: reduction steps]

44 Reductions with MPI

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

The reduced result is returned only in the root process.

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

Every process in the communicator receives the reduced result (there is no root argument).
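A short sketch (again not from the original slides) of MPI_Allreduce: every process contributes a partial value and all of them receive the global sum.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)(rank + 1);   /* each process contributes a partial value */

    /* Global sum; unlike MPI_Reduce, every process receives the result */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: global sum = %g\n", rank, global);
    MPI_Finalize();
    return 0;
}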

45 All-to-All Broadcast / Multinode Accumulation [Figure]

46 MPI Collective Operations

MPI Operator   Operation
MPI_MAX        maximum
MPI_MIN        minimum
MPI_SUM        sum
MPI_PROD       product
MPI_LAND       logical and
MPI_BAND       bitwise and
MPI_LOR        logical or
MPI_BOR        bitwise or
MPI_LXOR       logical exclusive or
MPI_BXOR       bitwise exclusive or
MPI_MAXLOC     max value and location
MPI_MINLOC     min value and location
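The location operators deserve a brief illustration (not in the original slides): with MPI_MINLOC each process submits a (value, rank) pair and everyone learns both the minimum value and which process owns it, which is useful for selecting a best solution; the per-process score below is made up.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    struct { double value; int rank; } local, best;   /* layout of MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local.value = 100.0 - rank;   /* hypothetical per-process score */
    local.rank  = rank;

    /* Every process receives the minimum value and the rank that produced it */
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("minimum %.1f found on process %d\n", best.value, best.rank);
    MPI_Finalize();
    return 0;
}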

47 Computing π: Sequential

π = ∫₀¹ 4/(1+x²) dx

double h = 1.0 / (double) n;
double sum = 0.0, x, pi;
int i;
for (i = 0; i < n; i++) {
  x = h * ((double)i + 0.5);
  sum += 4.0 / (1.0 + x*x);
}
pi = h * sum;

48 Computing π: Parallel

π = ∫₀¹ 4/(1+x²) dx

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
mypi = 0.0;
for (i = myid; i < n; i += numprocs) {
  x = h * ((double)i + 0.5);
  mypi += 4.0 / (1.0 + x*x);
}
mypi *= h;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

49 The Master-Slave Paradigm [Figure]

50 Condor University of Wisconsin-Madison. A problem to be solved many times over several different inputs. The problem to be solved is computationally expensive.

51 Condor Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. In many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

52 What is OpenMP? Application Program Interface (API) for shared memory parallel programming What the application programmer inserts into code to make it run in parallel Addresses only shared memory multiprocessors Directive based approach with library support Concept of base language and extensions to base language OpenMP is available for Fortran 90 / 77 and C / C++

53 OpenMP is not... Not Automatic parallelization User explicitly specifies parallel execution Compiler does not ignore user directives even if wrong Not just loop level parallelism Functionality to enable coarse grained parallelism Not a research project Only practical constructs that can be implemented with high performance in commercial compilers Goal of parallel programming: application speedup Simple/Minimal with Opportunities for Extensibility

54 Why OpenMP? Parallel programming landscape before OpenMP Standard way to program distributed memory computers (MPI and PVM) No standard API for shared memory programming Several vendors had directive based APIs for shared memory programming Silicon Graphics, Cray Research, Kuck & Associates, Digital Equipment. All different, vendor proprietary Sometimes similar but with different spellings Most were targeted at loop level parallelism Limited functionality - mostly for parallelizing loops

55 Why OpenMP? (cont.) Commercial users and high-end software vendors have a big investment in existing code. They are not very eager to rewrite their code in new languages, and they have performance concerns about new languages. End result: users who want portability were forced to program shared memory machines using MPI. MPI is library based, with good performance and scalability, but it sacrifices the built-in shared memory advantages of the hardware. Both approaches require major investment in time and money: a major effort, since the entire program needs to be rewritten, and new features need to be curtailed during conversion.

56 The OpenMP API Multi-platform shared-memory parallel programming OpenMP is portable: supported by Compaq, HP, IBM, Intel, SGI, Sun and others on Unix and NT Multi-Language: C, C++, F77, F90 Scalable Loop Level parallel control Coarse grained parallel control

57 The OpenMP API Single source parallel/serial programming: OpenMP is not intrusive (to the original serial code). Instructions appear in comment statements for Fortran and through pragmas for C/C++.

!$omp parallel do
      do i = 1, n
         ...
      enddo

#pragma omp parallel for
for (i = 0; i < n; i++) {
   ...
}

Incremental implementation: OpenMP programs can be implemented incrementally, one subroutine (function) or even one do (for) loop at a time.

58 Multithreading: Threads Sharing a single CPU between multiple tasks (or "threads") in a way designed to minimise the time required to switch threads This is accomplished by sharing as much as possible of the program execution environment between the different threads so that very little state needs to be saved and restored when changing thread. Threads share more of their environment with each other than do tasks under multitasking. Threads may be distinguished only by the value of their program counters and stack pointers while sharing a single address space and set of global variables

59 OpenMP Overview: How do threads interact? OpenMP is a shared memory model. Threads communicate by sharing variables. Unintended sharing of data causes race conditions: a race condition occurs when the program's outcome changes as the threads are scheduled differently. To control race conditions: use synchronization to protect data conflicts. Synchronization is expensive, so: change how data is accessed to minimize the need for synchronization.
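As an illustration (not from the original slides), the sketch below shows the kind of race condition being described and one way to remove it with a reduction clause rather than explicit locking; the loop itself is a trivial example.

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    long sum = 0;

    /* Racy version (for comparison): all threads would update sum unprotected,
       so the outcome changes with the thread schedule:
           #pragma omp parallel for
           for (int i = 0; i < n; i++) sum += i;                        */

    /* Safe version: each thread accumulates a private copy and the
       partial sums are combined at the end of the loop                 */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;

    printf("sum = %ld\n", sum);
    return 0;
}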

60 OpenMP Parallel Computing Solution Stack [Diagram] User layer: End User, Application. Programming layer (OpenMP API): Directives, OpenMP Library, Environment Variables. System layer: Runtime Library, OS/system support for shared memory.

61 Reasoning about programming Programming is a process of successive refinement of a solution relative to a hierarchy of models. The models represent the problem at different levels of abstraction. The top levels express the problem in the original problem domain. The lower levels represent the problem in the computer's domain. The models are informal, but detailed enough to support simulation. Source: J.-M. Hoc, T.R.G. Green, R. Samurcay and D.J. Gilmore (eds.), Psychology of Programming, Academic Press Ltd., 1990

62 Layers of abstraction in Programming [Diagram: the domains Problem, Specification, Algorithm, Source Code, Computation and Hardware, bridged by the Specification, Programming, Computational and Cost models. OpenMP only defines two of these models!]

63 The OpenMP Computational Model OpenMP was created with a particular abstract machine or computational model in mind: multiple processing elements; a shared address space with equal-time access for each processor; multiple lightweight processes (threads) managed outside of OpenMP (by the OS or some other third party). [Figure: processors Proc 1 ... Proc N attached to a shared address space]

64 The OpenMP programming model fork-join parallelism: Master thread spawns a team of threads as needed Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program Master Thread Parallel Regions A nested Parallel Region

65 So, How good is OpenMP? A high quality programming environment supports transparent mapping between models OpenMP does this quite well for the models it defines: Programming model: Threads forked by OpenMP map onto threads in modern OSs. Computational model: Multiprocessor systems with cache coherency map onto OpenMP shared address space

66 What about the cost model? OpenMP doesn't say much about the cost model; programmers are left to their own devices. Real systems have memory hierarchies, while OpenMP's assumed machine model doesn't: caches mean some data is closer to some processors, and scalable multiprocessor systems organize their RAM into modules - another source of NUMA. OpenMP programmers must deal with these issues as they optimize performance on each platform and scale to run on larger NUMA machines.

67 What about the specification model? Programmers reason in terms of a specification model as they design parallel algorithms. Some parallel algorithms are natural in OpenMP: specification models implied by loop-splitting and SPMD algorithms map well onto OpenMP's programming model. Some parallel algorithms are hard for OpenMP: recursive problems and list processing are challenging for OpenMP's models.

68 Is OpenMP a good API? Model (bridge between domains) ratings: Specification: Fair (5); Programming: Good (8); Computational: Good (9); Cost: Poor (3). Overall score: OpenMP is OK (6), but it sure could be better!

69 OpenMP today Hardware Vendors Compaq/Digital (DEC) Hewlett-Packard (HP) IBM Intel Silicon Graphics Sun Microsystems 3rd Party Software Vendors Absoft Edinburgh Portable Compilers (EPC) Kuck & Associates (KAI) Myrias Numerical Algorithms Group (NAG) Portland Group (PGI)

70 The OpenMP components: Directives, Environment Variables, Shared / Private Variables, Runtime Library, OS Threads

71 The OpenMP directives Directives are special comments in the language. Fortran fixed form: !$OMP, C$OMP, *$OMP. Fortran free form: !$OMP. Special comments are interpreted by OpenMP compilers.

      w = 1.0/n
      sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
      do I=1,n
         x = w*(I-0.5)
         sum = sum + f(x)
      end do
      pi = w*sum
      print *,pi
      end

72 The OpenMP directives Look like a comment (sentinel / pragma syntax) F77/F90!$OMP directive_name [clauses] C/C++ #pragma omp pragmas_name [clauses] Declare start and end of multithread execution Control work distribution Control how data is brought into and taken out of parallel sections Control how data is written/read inside sections

73 The OpenMP environment variables OMP_NUM_THREADS - number of threads to run in a parallel section. MPSTKZ - size of stack to provide for each thread. OMP_SCHEDULE - controls the schedule of loop executions:
setenv OMP_SCHEDULE "STATIC,5"
setenv OMP_SCHEDULE "GUIDED,8"
setenv OMP_SCHEDULE "DYNAMIC"
OMP_DYNAMIC, OMP_NESTED
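A small sketch (not in the original slides) of how OMP_SCHEDULE interacts with code: a loop declared with schedule(runtime) picks up whatever schedule the environment variable specifies at launch time.

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* The schedule is chosen at run time from OMP_SCHEDULE, e.g.
           setenv OMP_SCHEDULE "DYNAMIC,4"
       before launching the program                                  */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 16; i++)
        printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}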

74 Shared / Private variables Shared variables can be accessed by all of the threads. Private variables are local to each thread. In a typical parallel loop, the loop index is private, while the data being indexed is shared.

!$omp parallel do
!$omp& shared(x,y,z), private(i)
      do I=1, 100
         Z(I) = X(I) + Y(I)
      end do
!$omp end parallel do

75 OpenMP Runtime routines Writing a parallel section of code is matter of asking two questions How many threads are working in this section? Which thread am I? Other things you may wish to know How many processors are there? Am I in a parallel section? How do I control the number of threads? Control the execution by setting directive state
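The questions on this slide map directly onto runtime calls; here is a minimal sketch (not from the original slides) using the standard OpenMP runtime routines.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("processors available: %d\n", omp_get_num_procs());
    printf("in a parallel section? %d\n", omp_in_parallel());   /* 0 here */

    omp_set_num_threads(4);   /* control the number of threads */

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();   /* how many threads work here? */
        int tid      = omp_get_thread_num();    /* which thread am I? */
        printf("thread %d of %d\n", tid, nthreads);
    }
    return 0;
}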

76 OS threads In the case of Linux, the system needs to be installed with an SMP kernel. It is not a good idea to assign more threads than the CPUs available: omp_set_num_threads(omp_get_num_procs())

77 A simple example: computing π

π = ∫₀¹ 4/(1+x²) dx

78 Computing π

double t, pi = 0.0, w;
long i, n = ... ;
...
w = 1.0 / n;
for (i = 0; i < n; i++) {
  t = (i + 0.5) * w;
  pi += 4.0 / (1.0 + t*t);
}
pi *= w;
...

79 Computing π #pragma omp directives in C are ignored by non-OpenMP compilers.

double t, pi = 0.0, w;
long i, n = ... ;
...
w = 1.0 / n;
#pragma omp parallel for reduction(+:pi) private(t)
for (i = 0; i < n; i++) {
  t = (i + 0.5) * w;
  pi += 4.0 / (1.0 + t*t);
}
pi *= w;
...

80 Computing π on a SunFire 6800

81 Compiling OpenMP programs OpenMP directives are ignored by default. Example on SGI Irix platforms:
f90 -O3 foo.f
cc -O3 foo.c
OpenMP directives are enabled with -mp. Example on SGI Irix platforms:
f90 -O3 -mp foo.f
cc -O3 -mp foo.c

82 Fortran example

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
         do i=1,n
            a(i,j)=i+(j-1)*n
         enddo
      enddo
c$omp end parallel
      end

OpenMP directives used:
c$omp parallel [clauses]
c$omp end parallel
Parallel clauses include: default(none|private|shared), private(...), shared(...)

83 Fortran example [same code as slide 82, shown with a figure] Each arrow denotes one thread; all threads perform the identical task.

84 Fortran example [same code as slide 82] With default scheduling: thread a works on j = 1:5, thread b on j = 6:10, thread c on j = 11:15 and thread d on j = 16:20.

85 The Foundations of OpenMP: OpenMP: a parallel programming API Parallelism Working with concurrency Layers of abstractions or models used to understand and use OpenMP

86 Summary of OpenMP Basics
Parallel Region (to create threads): C$omp parallel / #pragma omp parallel
Worksharing (to split up work between threads): C$omp do / #pragma omp for; C$omp sections / #pragma omp sections; C$omp single / #pragma omp single; C$omp workshare
Data Environment (to manage data sharing): directive: threadprivate; clauses: shared, private, lastprivate, reduction, copyin, copyprivate
Synchronization directives: critical, barrier, atomic, flush, ordered, master
Runtime functions / environment variables
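To tie these constructs together, here is a short sketch (not part of the original slides) combining a parallel region, sections, single and critical; the work inside each construct is just a placeholder.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int total = 0;

    #pragma omp parallel              /* parallel region: create threads */
    {
        #pragma omp sections          /* worksharing: one section per thread */
        {
            #pragma omp section
            { printf("section A on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("section B on thread %d\n", omp_get_thread_num()); }
        }

        #pragma omp single            /* executed by exactly one thread */
        printf("single block on thread %d\n", omp_get_thread_num());

        #pragma omp critical          /* synchronization: one thread at a time */
        total += 1;
    }                                 /* implicit barrier at the end */

    printf("number of threads that updated total: %d\n", total);
    return 0;
}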

87 Improvements in black hole detection using parallelism

88 Introduction (1/3) Very frequently there is a divorce between computer scientists and researchers in other scientific disciplines This work collects the experiences of a collaboration between researchers coming from two different fields: astrophysics and parallel computing We present different approaches to the parallelization of a scientific code that solves an important problem in astrophysics: the detection of supermassive black holes

89 Introduction (2/3) The IAC co-authors initially developed a Fortran77 code solving the problem. The execution time of this original code was not acceptable, and that motivated them to contact researchers with expertise in the parallel computing field. We know from experience that such scientist programmers deal with intensive, time-consuming sequential codes that are not difficult to tackle using HPC techniques. Researchers with a purely scientific background are interested in these techniques, but they are not willing to spend time learning them.

90 Introduction (3/3) One of our constraints was to introduce the minimum amount of changes in the original code, even with the knowledge that some optimizations could be done in the sequential code. Preserving the use of the NAG functions was another restriction in our development.

91 The Problem Outline Black holes and quasars The method: gravitational lensing Fluctuations in quasars light curves Mathematical formulation of the problem Sequential code Parallelizations: MPI, OpenMP, Mixed MPI/OpenMP Computational results Conclusions

92 Black holes Supermassive black holes (SMBH) are supposed to exist in the nucleus of many if not all the galaxies Some of these objects are surrounded by a disk of material continuously spiraling towards the deep gravitational potential pit of the SMBH and releasing huge quantities of energy giving rise to the phenomena known as quasars (QSO)

93 The accretion disk

94 Quasars (Quasi Stellar Objects, QSO) QSOs are currently believed to be the most luminous and distant objects in the universe QSOs are the cores of massive galaxies with super giant black holes that devour stars at a rapid rate, enough to produce the amount of energy observed by a telescope

95 The method We are interested in objects of dimensions comparable to the Solar System in galaxies very far away from the Milky Way. Objects of this size cannot be directly imaged, so alternative observational methods are used to study their structure. The method we use is the observation of QSO images affected by a microlensing event to study the structure of the accretion disk.

96 Gravitational Microlensing Gravitational lensing (the attraction of light by matter) was predicted by General Relativity and observationally confirmed in 1919. If light from a QSO passes through a galaxy located between the QSO and the observer, it is possible that a star in the intervening galaxy crosses the QSO light beam. The gravitational field of the star amplifies the light emission coming from the accretion disk (gravitational microlensing).

97 Microlensing Quasar-Star [Figure: quasar, microimages and macroimages as seen by the observer]

98 Microlensing The phenomenon is more complex because the magnification is not due to a single isolated microlens, but it rather is a collective effect of many stars As the stars are moving with respect to the QSO light beam, the amplification varies during the crossing

99 Microlensing

100 Double Microlens

101 Multiple Microlens Magnification pattern in the source plane, produced by a dense field of stars in the lensing galaxy. The color reflects the magnification as a function of the quasar position: the sequence blue-green-red-yellow indicates increasing magnification

102 Q (1/2) So far the best example of a microlensed quasar is the quadruple quasar Q

103 Q (2/2)

104 Fluctuations in Q light curves Lightcurves of the four images of Q over a period of almost ten years The changes in relative brightness are very obvious

105 Q In Q, thanks to the unusually small distance between the observer and the lensing galaxy, the microlensing events have a timescale of the order of months. We observed Q from October 1999 for approximately 4 months at the Roque de los Muchachos Observatory.

106 Fluctuations in light curves The curve representing the change in luminosity of the QSO with time depends on the position of the star and also on the structure of the accretion disk Microlens-induced fluctuations in the observed brightness of the quasar contain information about the light-emitting source (size of continuum region or broad line region of the quasar, its brightness profile, etc.) Hence from a comparison between observed and simulated quasar microlensing we can draw conclusions about the accretion disk

107 Q light curves (2/2) Our goal is to model light curves of QSO images affected by a microlensing event to study the unresolved structure of the accretion disk

108 Mathematical formulation (1/5) Leaving aside the physical meaning of the different variables, the function modeling the dependence of the observed flux on time t can be written as: where … is the ratio between the outer and inner radii of the accretion disk (we will adopt …).

109 Mathematical formulation (2/5) And G is the function: To speed up the computation, G has been approximated by using MATHEMATICA

110 Mathematical formulation (3/5) Our goal is to estimate the values of the parameters …, B, C, …, t_0 in the observed flux by fitting the model to the observational data.

111 Mathematical formulation (4/5) Specifically, to find the values of the 5 parameters that minimize the error between the theoretical model and the observational data according to a chi-square criterion: where N is the number of data points corresponding to times t_i (i = 1, 2, ..., N), … is the theoretical function evaluated at time t_i, and … is the observational error associated to each data value.

112 Mathematical formulation (5/5) To minimize the error we use the e04ccf NAG routine, which only requires evaluation of the function and not of its derivatives. The determination of the minimum in the 5-parameter space depends on the initial conditions, so we consider a 5-dimensional grid of starting points, with m sampling intervals in each variable, and for each one of the points of this grid we compute a local minimum. Finally, we select the absolute minimum among them.

113 The sequential code

      program seq_black_hole

      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit

c     Data input
c     Initialize best solution
      do k1=1, m
       do k2=1, m
        do k3=1, m
         do k4=1, m
          do k5=1, m
c           Initialize starting point x(1),..., x(5)
            call jic2(nfit,x,fx)
            call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
            if (fx improves best fx) then
              update(best (x, fx))
            endif
          enddo
         enddo
        enddo
       enddo
      enddo
      end

114 The sequential code [same code as slide 113]

115 Sequential Times
In a Sun Blade 100 workstation running Solaris 5.0 and using the native Fortran77 Sun compiler (v. 5.0) with full optimizations, this code takes:
5.89 hours for sampling intervals of size m=4
… hours for sampling intervals of size m=5
In a SGI Origin 3000 using the native MIPSpro F77 compiler (v. 7.4) with full optimizations:
0.91 hours for m=4
… hours for m=5

116 Loop transformation

      program seq_black_hole2
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      integer k1, k2, k3, k4, k5
      common/var/t, fit

c     Data input
c     Initialize best solution
      do k = 1, m^5
c       Index transformation
c       Initialize starting point x(1),..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
      end seq_black_hole
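The "Index transformation" comment is the key step of this refactoring: the single index k has to be decomposed back into the five original loop indices. A possible sketch of that decomposition (written in C, since the Fortran the authors used for this step is not shown) assuming m sampling intervals per variable:

#include <stdio.h>

#define M    4    /* sampling intervals per variable (m in the slides) */
#define NVAR 5    /* number of fitted parameters */

int main(void) {
    long total = 1;
    for (int v = 0; v < NVAR; v++) total *= M;   /* M^5 grid points */

    /* Single flat loop replacing the five nested loops */
    for (long k = 0; k < total; k++) {
        int idx[NVAR];
        long r = k;
        /* Recover the five original indices (each in 0..M-1) from k */
        for (int v = 0; v < NVAR; v++) {
            idx[v] = (int)(r % M);
            r /= M;
        }
        /* idx[0..4] now identify the starting point of one local minimization */
        if (k < 3)   /* print a few examples */
            printf("k = %ld -> (%d,%d,%d,%d,%d)\n",
                   k, idx[0], idx[1], idx[2], idx[3], idx[4]);
    }
    return 0;
}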

117 MPI & OpenMP (1/2) In the last years OpenMP and MPI have been universally accepted as the standard tools to develop parallel applications OpenMP is a standard for shared memory programming It uses a fork-join model and is mainly based on compiler directives that are added to the code that indicate the compiler regions of code to be executed in parallel MPI Uses an SPMD model Processes can read and write only to their respective local memory Data are copied across local memories using subroutine calls The MPI standard defines the set of functions and procedures available to the programmer

118 MPI & OpenMP (2/2) Each one of these two alternatives have both advantages and disadvantages Very frequently it is not obvious which one should be selected for a specific code: MPI programs run on both distributed and shared memory architectures while OpenMP run only on shared memory The abstraction level is higher in OpenMP MPI is particularly adaptable to coarse grain parallelism. OpenMP is suitable for both coarse and fine grain parallelism While it is easy to obtain a parallel version of a sequential code in OpenMP, usually it requires a higher level of expertise in the case of MPI. A parallel code can be written in OpenMP just by inserting proper directives in a sequential code, preserving its semantics, while in MPI mayor changes are usually needed Portability is higher in MPI

119 MPI-OpenMP hybridization (1/2) Hybrid codes match the architecture of SMP clusters, which are an increasingly popular platform. MPI may suffer from efficiency problems on shared memory architectures, and MPI codes may need too much memory. Some vendors have attempted to extend MPI in shared memory, but the result is not as efficient as OpenMP. OpenMP is easy to use but is limited to the shared memory architecture.

120 MPI-OpenMP hybridization (2/2) A hybrid code may provide better scalability, or simply enable a problem to exploit more processors. It is not necessarily faster than pure MPI/OpenMP; it depends on the code, the architecture, and how the programming models interact.

121 MPI code

      program black_hole_mpi
      include 'mpif.h'
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c     Data input
c     Initialize best solution
      do k = myid, m^5-1, numprocs
c       Initialize starting point x(1),..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
      call MPI_FINALIZE( ierr )
      end

122 MPI code [same code as slide 121]

123 OpenMP code

      program black_hole_omp
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
!$omp THREADPRIVATE(/var/, /data/)
c     Data input
c     Initialize best solution
!$omp PARALLEL DO DEFAULT(SHARED)
!$omp& PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$omp& COPYIN(/data/) LASTPRIVATE(fx)
      do k = 0, m^5-1
c       Initialize starting point x(1),..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
!$omp END PARALLEL DO
      end

124 OpenMP code [same code as slide 123]

125 OpenMP code [same code as slide 123]

126 Hybrid MPI-OpenMP code

      program black_hole_mpi_omp
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
!$omp THREADPRIVATE(/var/, /data/)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs, ierr)
c     Data input
c     Initialize best solution
!$omp PARALLEL DO DEFAULT(SHARED)
!$omp& PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$omp& COPYIN(/data/) LASTPRIVATE(fx)
      do k = myid, m^5-1, mpi_numprocs
c       Initialize starting point x(1),..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
c     Reduce the OpenMP best solution
!$omp END PARALLEL DO
c     Reduce the MPI best solution
      call MPI_FINALIZE(ierr)
      end

127 Hybrid MPI-OpenMP code [same code as slide 126]

128 The SGI Origin 3000 Architecture (1/2) jen50.ciemat.es 160 processors MIPS R14000 / 600MHz On 40 nodes with 4 processors each Data and instruction cache on-chip Irix Operating System Hypercubic Network

129 The SGI Origin 3000 Architecture (2/2) cc-NUMA memory architecture. 1 Gflops peak speed. 8 MB external cache. 1 GB main memory per processor. 1 TB hard disk.

130 Computational results (m=4) [Table: execution times (secs.) and speed-ups for the MPI, OpenMP and mixed MPI-OpenMP versions] As we do not have exclusive-mode access to the architecture, the times correspond to the minimum time from 5 different executions. The figures corresponding to the mixed-mode code (label MPI-OpenMP) correspond to the minimum times obtained for different combinations of MPI processes/OpenMP threads.

131 Parallel execution time (m=4) [Figure: time in seconds vs. number of processors for MPI, OpenMP and MPI-OpenMP]

132 Speedup (m=4) [Figure: speed-up vs. number of processors for MPI, OpenMP and MPI-OpenMP]

133 Results for mixed MPI-OpenMP [Figure: time in seconds vs. number of processors for 1, 2, 4, 8, 16 and 32 threads]

134 Computational results (m=5) [Table: execution times (secs.) and speed-ups for the MPI, OpenMP and mixed MPI-OpenMP versions]

135 Parallel execution time (m=5) [Figure: time in seconds vs. number of processors for MPI, OpenMP and MPI-OpenMP]

136 Speedup (m=5) [Figure: speed-up vs. number of processors for MPI, OpenMP and MPI-OpenMP]

137 Results for mixed MPI-OpenMP (m=5) [Figure: time in seconds vs. number of processors for 1, 2, 4, 8, 16 and 32 threads]

138 Conclusions (1/3) The computational results obtained from all the parallel versions confirm the robustness of the method. For non-expert users and the kind of codes we have been dealing with, we believe that the MPI parallel versions are easier and safer. In the case of OpenMP, the proper usage of data scope attributes for the variables involved may be a handicap for users without parallel programming expertise. The higher current portability of the MPI version is another factor to be considered.

139 Conclusions (2/3) The mixed MPI/OpenMP parallel version is the most expertise-demanding. Nevertheless, as has been stated by several authors, even for the case of a hybrid architecture this version does not offer any clear advantage, and it has the disadvantage that the combination of processes/threads has to be tuned.

140 Conclusions (3/3) This work concludes a first step of cooperation in applying HPC techniques to improve performance in astrophysics codes. The scientific aim of applying HPC to computationally-intensive codes in astrophysics has been successfully achieved.

141 Conclusions and Future Work The relevance of our results does not come directly from the particular application chosen, but from showing that parallel computing techniques are the key to tackling large-scale real problems in the mentioned scientific field. From now on we plan to continue this fruitful collaboration by applying parallel computing techniques to other astrophysical challenge problems.

142 Supercomputing Centers

143 MPI links
Message Passing Interface Forum
MPI: The Complete Reference
Parallel Programming with MPI, by Peter Pacheco

144 OpenMP links

145 Introducción a la computación de altas prestaciones (Introduction to High Performance Computing)


More information

MPI Application Development Using the Analysis Tool MARMOT

MPI Application Development Using the Analysis Tool MARMOT MPI Application Development Using the Analysis Tool MARMOT HLRS High Performance Computing Center Stuttgart Allmandring 30 D-70550 Stuttgart http://www.hlrs.de 24.02.2005 1 Höchstleistungsrechenzentrum

More information

Debugging with TotalView

Debugging with TotalView Tim Cramer 17.03.2015 IT Center der RWTH Aachen University Why to use a Debugger? If your program goes haywire, you may... ( wand (... buy a magic... read the source code again and again and...... enrich

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

An Introduction to Parallel Programming with OpenMP

An Introduction to Parallel Programming with OpenMP An Introduction to Parallel Programming with OpenMP by Alina Kiessling E U N I V E R S I H T T Y O H F G R E D I N B U A Pedagogical Seminar April 2009 ii Contents 1 Parallel Programming with OpenMP 1

More information

End-user Tools for Application Performance Analysis Using Hardware Counters

End-user Tools for Application Performance Analysis Using Hardware Counters 1 End-user Tools for Application Performance Analysis Using Hardware Counters K. London, J. Dongarra, S. Moore, P. Mucci, K. Seymour, T. Spencer Abstract One purpose of the end-user tools described in

More information

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell R&D Manager, Scalable System So#ware Department Sandia National Laboratories is a multi-program laboratory managed and

More information

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters User s Manual for the HPCVL DMSM Library Gang Liu and Hartmut L. Schmider High Performance Computing

More information

An Implementation Of Multiprocessor Linux

An Implementation Of Multiprocessor Linux An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than

More information

Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures Gabriele Jost *, Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 USA {gjost,hjin}@nas.nasa.gov

More information

Keys to node-level performance analysis and threading in HPC applications

Keys to node-level performance analysis and threading in HPC applications Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

RA MPI Compilers Debuggers Profiling. March 25, 2009

RA MPI Compilers Debuggers Profiling. March 25, 2009 RA MPI Compilers Debuggers Profiling March 25, 2009 Examples and Slides To download examples on RA 1. mkdir class 2. cd class 3. wget http://geco.mines.edu/workshop/class2/examples/examples.tgz 4. tar

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

MPI / ClusterTools Update and Plans

MPI / ClusterTools Update and Plans HPC Technical Training Seminar July 7, 2008 October 26, 2007 2 nd HLRS Parallel Tools Workshop Sun HPC ClusterTools 7+: A Binary Distribution of Open MPI MPI / ClusterTools Update and Plans Len Wisniewski

More information

Workshare Process of Thread Programming and MPI Model on Multicore Architecture

Workshare Process of Thread Programming and MPI Model on Multicore Architecture Vol., No. 7, 011 Workshare Process of Thread Programming and MPI Model on Multicore Architecture R. Refianti 1, A.B. Mutiara, D.T Hasta 3 Faculty of Computer Science and Information Technology, Gunadarma

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

More information

Software and the Concurrency Revolution

Software and the Concurrency Revolution Software and the Concurrency Revolution A: The world s fastest supercomputer, with up to 4 processors, 128MB RAM, 942 MFLOPS (peak). 2 Q: What is a 1984 Cray X-MP? (Or a fractional 2005 vintage Xbox )

More information

Advanced MPI. Hybrid programming, profiling and debugging of MPI applications. Hristo Iliev RZ. Rechen- und Kommunikationszentrum (RZ)

Advanced MPI. Hybrid programming, profiling and debugging of MPI applications. Hristo Iliev RZ. Rechen- und Kommunikationszentrum (RZ) Advanced MPI Hybrid programming, profiling and debugging of MPI applications Hristo Iliev RZ Rechen- und Kommunikationszentrum (RZ) Agenda Halos (ghost cells) Hybrid programming Profiling of MPI applications

More information

Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp

Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from non-petascale? What has changed since

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Operating System Impact on SMT Architecture

Operating System Impact on SMT Architecture Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th

More information

Lightning Introduction to MPI Programming

Lightning Introduction to MPI Programming Lightning Introduction to MPI Programming May, 2015 What is MPI? Message Passing Interface A standard, not a product First published 1994, MPI-2 published 1997 De facto standard for distributed-memory

More information

Effective Computing with SMP Linux

Effective Computing with SMP Linux Effective Computing with SMP Linux Multi-processor systems were once a feature of high-end servers and mainframes, but today, even desktops for personal use have multiple processors. Linux is a popular

More information

Linux tools for debugging and profiling MPI codes

Linux tools for debugging and profiling MPI codes Competence in High Performance Computing Linux tools for debugging and profiling MPI codes Werner Krotz-Vogel, Pallas GmbH MRCCS September 02000 Pallas GmbH Hermülheimer Straße 10 D-50321

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Parallelization of video compressing with FFmpeg and OpenMP in supercomputing environment

Parallelization of video compressing with FFmpeg and OpenMP in supercomputing environment Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 231 237 doi: 10.14794/ICAI.9.2014.1.231 Parallelization of video compressing

More information

High performance computing systems. Lab 1

High performance computing systems. Lab 1 High performance computing systems Lab 1 Dept. of Computer Architecture Faculty of ETI Gdansk University of Technology Paweł Czarnul For this exercise, study basic MPI functions such as: 1. for MPI management:

More information

Cluster, Grid, Cloud Concepts

Cluster, Grid, Cloud Concepts Cluster, Grid, Cloud Concepts Kalaiselvan.K Contents Section 1: Cluster Section 2: Grid Section 3: Cloud Cluster An Overview Need for a Cluster Cluster categorizations A computer cluster is a group of

More information

University of Amsterdam - SURFsara. High Performance Computing and Big Data Course

University of Amsterdam - SURFsara. High Performance Computing and Big Data Course University of Amsterdam - SURFsara High Performance Computing and Big Data Course Workshop 7: OpenMP and MPI Assignments Clemens Grelck C.Grelck@uva.nl Roy Bakker R.Bakker@uva.nl Adam Belloum A.S.Z.Belloum@uva.nl

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Building an Inexpensive Parallel Computer

Building an Inexpensive Parallel Computer Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University

More information