Modellierung, Simulation, Optimierung Computer Architektur
1 Modellierung, Simulation, Optimierung, Computer Architektur
Prof. Michael Resch
Dr. Martin Bernreuther, Dr. Natalia Currle-Linde, Dr. Martin Hecht, Uwe Küster, Dr. Oliver Mangold, Melanie Mochmann, Christoph Niethammer, Ralf Schneider
HLRS, IHR
28. Juni 2012
2 Outline
- A simple computer
- Computer and programs
- Modern computers in detail
  - principal parallel architectures
  - examples of large machines
  - parameters describing hardware
- Types of Parallelism
- performance modeling
3 A simple computer
4 von Neumann Machine (!)
- 4 units:
  - Arithmetic Logical Unit
  - Control Unit
  - Memory for instructions and data
  - Input and Output Unit
- the instructions are executed in order by incrementing an instruction pointer
- branches make jumps in the instruction sequence by changing the instruction pointer
- no parallelism
- memory is linearly addressed
- typical for computer architecture
http://de.wikipedia.org/wiki/John_von_Neumann
Von-Neumann-Architektur
5 design of a simple computer (!)
- processor does all operations
- memory is a fast but volatile storage
- bus connecting processor, memory and IO
- input and output devices:
  - disk storing data permanently
  - keyboard
  - display
  - network interface for communication
[Diagram: processor, memory, disk and the interface devices keyboard, display, mouse and network, connected by a bus]
6 Computer and programs
7 executing a program on a computer (!)
- all the work a computer can do is specified by executable programs
- a program consists of a large sequence of instructions combined with data
- the processor or core loads an instruction from memory
- the instruction is analysed and executed by the hardware
- the instruction may contain data, memory addresses, and the specification of an operation telling what to do with the data
- the instruction pointer points to the current instruction to be loaded and executed
  - after issuing the instruction it is incremented and points to the next instruction; in this way the machine operates on a sequence of instructions
  - it can be changed by special branch and jump instructions to point to a different location in the instruction stream; in this way loops and alternatives are formulated
- the same instruction sequence may be executed on different data
- typical instruction classes are
  - load, store
  - integer add, multiply
  - floating point add, multiply
  - branch and jump (also calling procedures)
  - compare
  - stack instructions (push, pop)
8 purpose of the operating system (!)
the operating system is a collection of executable programs for
- the administration of all devices such as disk, keyboard, display, mouse, hardware interfaces, ...
- scheduling and starting executable programs, assigning resources to them, finishing them
- handling data, files and the file system
- connecting to the outer world, e.g. other computing nodes, the internet
it hides the complexity of the machine from the user
examples: Microsoft Windows, Unix/Linux, NEC SuperUX, Android, ...
9 from the source code to the start of the executable (!)
- the source code of a program is written in a high level programming language, because machine code is too complicated to read, understand and handle, and machine code is architecture dependent
- the source code is translated to the executable program by the following steps:
  - compilation of the files by the compiler generates objects in machine code; relations between the objects are still lacking
  - the linker binds the objects into the executable program, which is written to disk
- to run the executable, the loader loads the executable from disk to memory and starts it by setting the instruction pointer to the first instruction
- high level languages are for example
  - C, C++
  - Fortran
  - C#, Visual Basic, Delphi
  - Java works differently: the source code is translated to an intermediate byte code which is independent of the machine; this intermediate code is executed by a run time system on the computer
10 Pseudo Assembler
Assembler is near to the machine code but better readable; it is used only for special purposes, e.g. drivers. Here we use some kind of pseudo assembler.
Fortran:
  do i=1,imax
    a(i)=b(i)+c(i)
  enddo
Java:
  for (i = 1; i <= imax; i++)
    a[i] = b[i] + c[i];
Pseudo assembler:
        i=0
  start:
        i=i+1
        if i > imax goto end
        load b(i) to register 0
        load c(i) to register 1
        add register 0 to register 1 and write to register 2
        store register 2 to memory at location a(i)
        goto start
  end:
11 in X86 Assembler
  Block 7:
  movsdq  (%rsi,%rcx,8), %xmm0      ; c(i  ) -> xmm0[63..0]
  movsdq  0x10(%rsi,%rcx,8), %xmm1  ; c(i+2) -> xmm1[63..0]
  movsdq  0x20(%rsi,%rcx,8), %xmm2  ; c(i+4) -> xmm2[63..0]
  movsdq  0x30(%rsi,%rcx,8), %xmm3  ; c(i+6) -> xmm3[63..0]
  movhpdq 0x8(%rsi,%rcx,8), %xmm0   ; c(i+1) -> xmm0[127..64]
  movhpdq 0x18(%rsi,%rcx,8), %xmm1  ; c(i+3) -> xmm1[127..64]
  movhpdq 0x28(%rsi,%rcx,8), %xmm2  ; c(i+5) -> xmm2[127..64]
  movhpdq 0x38(%rsi,%rcx,8), %xmm3  ; c(i+7) -> xmm3[127..64]
  addpdx  (%rdx,%rcx,8), %xmm0      ; b(i  ),b(i+1) + xmm0 -> xmm0
  addpdx  0x10(%rdx,%rcx,8), %xmm1  ; b(i+2),b(i+3) + xmm1 -> xmm1
  addpdx  0x20(%rdx,%rcx,8), %xmm2  ; b(i+4),b(i+5) + xmm2 -> xmm2
  addpdx  0x30(%rdx,%rcx,8), %xmm3  ; b(i+6),b(i+7) + xmm3 -> xmm3
  movapsx %xmm0, (%rdi,%rcx,8)      ; xmm0 -> a(i  ),a(i+1)
  movapsx %xmm1, 0x10(%rdi,%rcx,8)  ; xmm1 -> a(i+2),a(i+3)
  movapsx %xmm2, 0x20(%rdi,%rcx,8)  ; xmm2 -> a(i+4),a(i+5)
  movapsx %xmm3, 0x30(%rdi,%rcx,8)  ; xmm3 -> a(i+6),a(i+7)
  add $0x8, %rcx                    ; i=i+8
  cmp %rax, %rcx                    ; imax > i?
  jb 0x303f <Block 7>               ; if true -> goto <Block 7>
12 some instructions of the X86 Assembler
  Block 7:                          begin of a block
  movsdq (%rsi,%rcx,8), %xmm0       load from address rsi+8*rcx into register xmm0 [63:0]
  movhpdq 0x8(%rsi,%rcx,8), %xmm0   load from address 0x8+rsi+8*rcx into xmm0 [127:64]
  addpdx (%rdx,%rcx,8), %xmm0       add from rdx+8*rcx to register xmm0, writing to xmm0
  addpdx 0x10(%rdx,%rcx,8), %xmm1   add from 0x10+rdx+8*rcx to register xmm1
  movapsx %xmm0, (%rdi,%rcx,8)      store xmm0 to address rdi+8*rcx
  movapsx %xmm1, 0x10(%rdi,%rcx,8)  store xmm1 to address 0x10+rdi+8*rcx
  add $0x8, %rcx                    add 0x8 to rcx with target rcx
  cmp %rax, %rcx                    compare rax and rcx
  jb 0x303f <Block 7>               jump to 0x303f if rax > rcx
13 Test program in Fortran; part 1
  module test_module
  contains
    subroutine func(a,b,c,imax)
      integer :: imax, i
      real(kind=8), dimension(imax) :: a,b,c
      do i=1,imax
        a(i)=b(i)+c(i)
      enddo
    end subroutine func
  end module test_module
14 Test program in Fortran; part 2
  program test_prog
    use test_module
    implicit none
    integer :: i,j=0,k=200                           ! initial values lost in transcription; restored from the Java version below
    integer,parameter :: imax=1000000                ! original value lost in transcription; assumed
    real(kind=8),allocatable, dimension(:) :: a,b,c  ! are dynamic arrays
    allocate(a(imax))                                ! reserving space for array a
    allocate(b(imax))
    allocate(c(imax))
    do i=1,imax                                      ! definition of array elements
      b(i)=i+j
      c(i)=i*5
    enddo
    call func(a,b,c,imax)                            ! call the procedure
    write(*,*)a(1)
  end program test_prog
15 Test program in Java; part 1
  package test.module;
  public class TestModule
  {
    public static void func(double a[], double b[], double c[], int imax)
    {
      for (int i = 0; i < imax; i++)
        a[i] = b[i] + c[i];
    }
  }
16 Test program in Java; part 2
  package test.prog;
  import test.module.TestModule;
  public class TestProg
  {
    public static void main(String[] args)
    {
      int i, j = 0, k = 200;
      final int imax = 1000000;   // original value lost in transcription; assumed
      double a[], b[], c[];
      a = new double[imax];       // reserving space for array a
      b = new double[imax];
      c = new double[imax];
      for (i = 0; i < imax; i++)
      {
        b[i] = i + j;
        c[i] = i * 5;
      }
      TestModule.func(a, b, c, imax);   // call the procedure
      System.out.println(a[0]);
    }
  }
17 Modern computers in detail
18 Core with Caches and Memory (!)
- data are stored in a connected memory hierarchy:
  - L1 cache (Level 1 cache)
  - L2 cache
  - L3 cache (not on all architectures)
  - memory
- the cache is an intermediate memory with the same addressing scheme, replicating the data in memory; it has a smaller access latency and provides higher bandwidth
- a load/store first looks in the (low level) cache; if the data is not there, the next level will be tested
- all cached data are organized in cache lines (of 64 B for X86, X86 64)
- load/store moves cache lines, not single data
- old data will be overwritten, but saved to higher level caches or to memory if changed
- writing to a cache line assumes that the cache line is present in cache
- lower level caches are smaller than higher level caches
- nearby caches have higher bandwidth and smaller access latency
- load/store instructions can reach all memory locations
[Diagram: C - L1 - L2 - L3 - M]
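The cache line granularity explains why access patterns matter: a contiguous loop consumes all 8 doubles of every 64 B line it loads, while a strided loop touches only one double per line and moves roughly eight times the data for the same arithmetic. A minimal sketch (not from the slides; array size and stride are arbitrary choices):

  program cacheline_demo
    implicit none
    integer, parameter :: n = 1000000, stride = 8
    real(kind=8), allocatable, dimension(:) :: x
    real(kind=8) :: s
    integer :: i
    allocate(x(n*stride))
    x = 1.0d0
    s = 0.0d0
    do i = 1, n                    ! contiguous: all 8 doubles of each 64 B cache line are used
       s = s + x(i)
    enddo
    do i = 1, n*stride, stride     ! strided: only 1 double per 64 B cache line is used,
       s = s + x(i)                ! so roughly 8x the memory traffic for the same number of adds
    enddo
    write(*,*) s
  end program cacheline_demo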
19 Processor with Cores and Memory (!)
here: Intel Sandy Bridge (2012)
- multi core processor with 8 cores
- L3 cache is shared but partitioned
- communication ring with high bandwidth: 4 rings for data, requests, acknowledgments, snooping
- memory controller in ring
- QPI (Quick Path Interconnect) interface in ring
- snooping enables all cores to hear which shared cache lines are changed by other cores (cache coherence protocol)
[Diagram: 8 cores, each with L1, L2 and an L3 slice, connected by the 4 rings (data, acknowledgments, requests, snoop) to the memory controller (MC), memory (M) and QPI]
20 Intel Sandy Bridge architectural data 1
- 8 cores
- AVX vector unit per core with 4 mult-plus operations/cp
- core with L1, L2 cache
- shared L3-cache
- 4 memory channels with 1600 MHz DDR3
- 2 QPI links with up to 8 GT/s
- 51.2 GB/s nominal bandwidth
- >2.0 GHz frequency
- daughter card on PCIe 3.0 up to 16 GB/s
(What is PCIe?) PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting external devices such as graphics cards, network interface controllers, sound cards, ...
21 Intel Sandy Bridge architectural data 2
[figure only]
22 AMD Interlagos Processor
- 16 cores in 2 dies with 4 core pairs each
- AVX vector unit per core pair(!) with 4 mult-plus operations/cp
- core with L1 cache
- 2 MB L2-cache per core pair
- 8 MB shared L3-cache per die, in total 16 MB
- 2 memory channels with 1600 MHz DDR3 per die
- one Hyper-Transport link between the dies
- one Hyper-Transport link per die going outwards
- 51.2 GB/s nominal bandwidth for 4 channels
- >2.0 GHz frequency
(picture from AMD)
23 Intel Knights Ferry
- 30 cores
- 4 vector units per core
- Pentium II based
- 100 GB/s nominal bandwidth
- 1.05 GHz frequency
- daughter card on PCIe 2.0
- prototype; will be followed by Knights Corner
24 Nvidia Graphics Card (no longer up to date)
- ... cores
- 1 8-Byte FP-unit per core
- ... multiprocessors
- ... GB/s nominal bandwidth
- ... GHz frequency
- ... DP peak performance
- daughter card on PCIe 2.0
(Oliver Mangold)
25 principal parallel architectures
26 Shared memory system (!)
- all processors/cores are using the same memory
- the address space is identical for all cores
- memory load/store instructions can reach all locations
- each thread may share its memory part with others
- shared memory may consist of different banks on different channels
- memory can be local to a processor but accessible by the others
  - Intel: Quick Path Interconnect (QPI)
  - AMD: Hyper-Transport (HT)
- bus/interconnect is a bottleneck
- memory bandwidth is insufficient
[Diagram: two processors (P) with local memory banks (M), coupled by QPI/HT, with shared IO]
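The shared address space is what makes thread programming possible: all threads operate on the same arrays without any data transfer. A hedged sketch of the loop from slide 10 using OpenMP (OpenMP is not named on this slide; compile e.g. with gfortran -fopenmp):

  subroutine func_omp(a, b, c, imax)
    implicit none
    integer :: imax, i
    real(kind=8), dimension(imax) :: a, b, c
    !$omp parallel do          ! all threads see the same a, b, c in shared memory
    do i = 1, imax
       a(i) = b(i) + c(i)
    enddo
    !$omp end parallel do
  end subroutine func_omp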
27 Distributed memory system (!)
- nodes are using different memories
- the address space is different for all cores
- memory load/store instructions cannot reach locations on other nodes
- memory bus/interconnect is independent per node
- memory bandwidth scales with the number of nodes
- network access via a Network Interface Controller (NIC)
- network links connected via Routers (R)
[Diagram: four nodes, each memory (M) - processor (P) - NIC, connected through routers (R)]
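On a distributed memory system data must be moved by explicit messages, typically with MPI (MPI is not named on this slide; a minimal sketch assuming an MPI installation with the Fortran mpi module):

  program dm_demo
    use mpi
    implicit none
    integer :: ierr, rank, nprocs
    real(kind=8) :: x
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    x = rank                             ! x lives in the private memory of each process
    if (rank == 0 .and. nprocs > 1) then ! rank 0 cannot load the x of rank 1 directly;
       call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, &
                     MPI_STATUS_IGNORE, ierr)
    else if (rank == 1) then             ! the value has to be sent over the network
       call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
    endif
    call MPI_Finalize(ierr)
  end program dm_demo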
28 Hierarchical Parallel Computer System (!)
the different levels of a parallel system: cores, processors, nodes, supernodes, system
29 Hierarchical Parallel Computer System (!)
- vector floating point units (FPU) (2-8 per core)
- cores (C) (4-64 per CPU, 500 per GPU)
- processors (CPU) (2-8 per node)
- nodes (N) (1-8 per supernode)
- supernodes, blades, cages, ... (2-8 per rack)
- racks (R) (... per system)
- distributed system
by multiplying small numbers we get a very large number
30 addressing of the total memory (!)
- shared memory
  - Uniform Memory Access (UMA)
  - Non-Uniform Memory Access (NUMA)
- distributed memory
  - locally addressable; nodes may have shared memory
  - Partitioned Global Address Space (PGAS)
    - globally addressable: every processor can address the memory in all other processors or nodes
    - Remote Direct Memory Access (RDMA), done by the Network Interface Controller (NIC)
    - enables simpler handling of global data structures
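Fortran 2008 coarrays are one concrete PGAS realization (coarrays are not mentioned on the slide): every image keeps its own copy of a variable, but remote copies are addressable with square-bracket syntax and are fetched by RDMA-like mechanisms. A minimal sketch (compile e.g. with gfortran -fcoarray=single or a parallel coarray implementation):

  program pgas_demo
    implicit none
    real(kind=8) :: x[*]            ! one copy of x per image, globally addressable
    x = 1.0d0 * this_image()
    sync all                        ! make sure all images have written their x
    if (this_image() == 1 .and. num_images() > 1) then
       write(*,*) 'x on image 2 is', x[2]   ! direct access to remote memory
    endif
  end program pgas_demo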
31 examples of large machines
32 HLRS machine Cray XE6
- Cray XE6: 3648 nodes (96 N x 38 R) with two AMD Interlagos processors each
- Gemini interconnect
- working since November 2011
- 16 C x 2 CPU x 96 N x 38 R = 116,736 cores (HLRS 2011)
- 32 GB x 96 N x 38 R = 116 TB memory
- 1.8 PB disk
(picture: Cray XE6 at HLRS)
33 K-computer at Riken
- Fujitsu
- ... MW
- finished in 2012
- 8 C x 4 N x 24 SN x 864 R = 82,944 CPUs = 663,552 cores
- 10.6 PetaFlop/s peak performance
(pictures: K-computer at Riken near Kobe/Japan; rack of the K-computer)
34 Blue Waters System
- > 235 Cray XE6 cabinets
- > 30 Cray XK6 cabinets
- > ... compute nodes
- > 1 TB/s usable storage bandwidth
- > 1.5 PB aggregate system memory
- 4 GB memory per core
- > 9000 Gemini network cables
- 3D torus interconnect topology
- > ... disks
- > ... memory DIMMs
- > 25 PB usable storage
- > 11.5 PF peak performance
- > ... AMD processors
- > ... cores
- > 3000 GPUs
- 100 GB/s bandwidth to near-line storage
(picture: Blue Waters machine under construction)
35 parameters describing hardware
36 Moore's Law 1 (!)
[figure only]
37 Moore's Law 2 (!)
- Moore's Law from 1965 states that the number of components on an integrated circuit doubles every 18 (24) months
- this is the reason that computers are inexpensive despite their complexity
- for some time it was understood as a doubling of the frequency every 18 months
38 stagnating frequency (!)
- the power consumption depends heavily on the frequency of the processor: power ~ frequency^3
- the number of cores can be increased if the frequency is reduced
(List of Intel processors: pressroom/kits/quickreffam.htm; data from Gerhard Wellein / Erlangen)
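A toy evaluation of the power ~ frequency^3 rule (the conclusion drawn in the comments is an illustration, not from the slide): lowering the clock by 20 % roughly halves the power, so two slower cores fit into the power budget of one fast core.

  program power_model
    implicit none
    real(kind=8) :: f_rel
    integer :: i
    do i = 10, 5, -1
       f_rel = i / 10.0d0
       ! relative power under the assumption power ~ frequency**3;
       ! e.g. frequency x0.8 -> power x0.512
       write(*,*) 'frequency x', f_rel, ' -> power x', f_rel**3
    enddo
  end program power_model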
39 Floating Point Performance Parameters (!)
- clock rate of the core/processor: its inverse is the cycle time for scheduling the instructions; typically 1-4 GHz
- number of floating point operations per clock tick: instruction parallelism in the core/processor [FLOPs/FP unit]; typically 2-8
- peak performance of a processor: maximum number of floating point operations per second [MFLOPs, GFLOPs, TFLOPs]; typically 4-100 GFLOPs
- total peak performance of the machine: the number of nodes defines the size of the cluster
- total performance = frequency x #FP-units x #cores x #processors x #nodes
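A small program evaluating the formula above for a made-up cluster (all parameter values are assumptions for illustration):

  program peak_perf
    implicit none
    real(kind=8) :: freq_ghz, peak_gflops
    integer :: fp_per_cp, cores, cpus, nodes
    freq_ghz  = 2.0d0   ! clock rate in GHz (assumed)
    fp_per_cp = 8       ! floating point operations per clock tick (assumed)
    cores     = 16      ! cores per processor (assumed)
    cpus      = 2       ! processors per node (assumed)
    nodes     = 1000    ! nodes in the machine (assumed)
    peak_gflops = freq_ghz * fp_per_cp * cores * cpus * nodes
    write(*,*) 'total peak performance:', peak_gflops, 'GFLOPs'  ! 512000 GFLOPs = 0.512 PFLOPs
  end program peak_perf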
40 Communication Parameters: Bandwidth (!)
bandwidth: amount of data transferred per time through a transportation path; a typical parameter for all transportation systems
for computers:
- L1 cache bandwidth (4x8 B/CP)
- L2 cache bandwidth (4x8 B/CP)
- L3 cache bandwidth (4x8 B/CP)
- memory bandwidth
  o (1,2,3,4) channels x 8 B x bus frequency (at 1066, 1333, 1600 MHz), up to 51.2 GB/s
- interconnect bandwidth
  o InfiniBand (SDR 1 GB/s, DDR 2 GB/s, QDR 4 GB/s, EDR 12.5 GB/s)
  o Cray Gemini (5 GB/s x 2 directions)
  o Ethernet (1 GBit, 10 GBit, 40 GBit)
- IO bandwidth
  o SATA disk (60 MB/s)
  o Solid State Device SSD (... MB/s)
  o RAID storage subsystem (... GB/s)
  o parallel Lustre filesystem (> 100 GB/s)
41 Communication Parameters: Latency (!)
latency: time to get the first data after sending the request; connected to the length of the pipeline
all system parts show a typical latency:
- L1 cache (2-4 CP)
- L2 cache (... CP)
- L3 cache (25-70 CP)
- memory (... CP)
- interconnect (1-50 µs)
- disk (1-5 ms)
the number of open transfers is limited by the number of open requests; the achievable bandwidth is limited by
  achievable bandwidth <= number of open requests x transferred buffer size / latency
the latency is the start up time of the transfer pipeline
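The request-limit formula can be evaluated directly; with an assumed 70 ns memory latency and 64 B cache lines, a single outstanding request sustains less than 1 GB/s, so many open requests are needed to approach the nominal bandwidth (values are assumptions for illustration):

  program latency_bw
    implicit none
    real(kind=8) :: latency_ns, line_bytes, bw_gbs
    integer :: open_requests
    latency_ns = 70.0d0   ! memory latency (assumed)
    line_bytes = 64.0d0   ! transferred buffer size = one cache line
    do open_requests = 1, 10
       ! achievable bandwidth <= open requests * buffer size / latency
       bw_gbs = open_requests * line_bytes / latency_ns   ! B/ns = GB/s
       write(*,*) open_requests, 'open requests ->', bw_gbs, 'GB/s'
    enddo
  end program latency_bw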
42 Effective Memory Bandwidth (!)
The nominal memory bandwidth of a modern PC-system can be calculated by
  BW_nom = path width x number of channels x bus frequency
with path width = 8 B, number of channels = 1, 2, 3, 4 and bus frequency = 1066, 1333, 1600, 1866, 2133 MHz.
The effective memory bandwidth is smaller than the nominal bandwidth:
  BW_eff = size of data transferred / (latency + transfer time)
The transfer time is related to the nominal bandwidth by
  transfer time = size of data transferred / BW_nom
So the effective bandwidth is influenced by the memory or cache latency:
  BW_eff = size of data transferred / (latency + size of data transferred / BW_nom)
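A short program evaluating this formula; it anticipates the two worked examples on the next slides (51.2 GB/s nominal bandwidth and 70 ns latency as given there):

  program eff_bw
    implicit none
    real(kind=8) :: bw_nom, latency_ns, size_b, bw_eff
    bw_nom     = 51.2d0   ! GB/s: 8 B x 4 channels x 1600 MHz
    latency_ns = 70.0d0   ! memory latency in ns
    size_b = 64.0d0                                    ! one cache line
    bw_eff = size_b / (latency_ns + size_b / bw_nom)   ! B/ns = GB/s
    write(*,*) '  1 cache line :', bw_eff, 'GB/s'      ! ~0.898 GB/s
    size_b = 6400.0d0                                  ! 100 cache lines
    bw_eff = size_b / (latency_ns + size_b / bw_nom)
    write(*,*) '100 cache lines:', bw_eff, 'GB/s'      ! ~32.8 GB/s
  end program eff_bw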
43 Example 1
Intel's Sandy Bridge processor has 4 channels running at a bus frequency of 1600 MHz. The channels are 8 B wide.
  BW_nom = 8 B x 4 x 1600 MHz = 51.2 GB/s
The memory latency is 70 nsec. The resulting effective bandwidth for a single cache line of 64 B is
  BW_eff = 64 B / (70 nsec + 64 B / 51.2 GB/s) = 64 B / (70 nsec + 1.25 nsec) = 0.898 GB/s (!)
We are far from the nominal bandwidth! The estimations are not completely correct: the cache line uses a single channel, and the cache line is transported within a few bus clock periods.
44 Example 2
The resulting effective bandwidth for 100 cache lines of 64 B is
  BW_eff = 6400 B / (70 nsec + 6400 B / 51.2 GB/s) = 6400 B / (70 nsec + 125 nsec) = 32.82 GB/s (!)
How to overcome this problem?
45 Prefetching (!)
- the data access has to be started before consumption
- prefetching initiates the data access independently of the use of the data
- difficult to handle
- may be counterproductive for data already in the caches
46 Types of Parallelism (!)
- single instruction single data (SISD): simple or traditional processor
- single instruction multiple data (SIMD): vector model; in a multiprocessor of a graphics card
- multiple instructions multiple data (MIMD): multicore processors as shared memory processors; different multiprocessors of graphics cards
- multiple programs multiple data (MPMD): multicore processors for different processes; distributed memory processors
- multiple instructions single data (MISD)??
SISD, SIMD, MIMD, MISD: Flynn's taxonomy
- pipeline parallelism
47 pipeline parallelism (!)
- phase shifted parallelism
- a very general principle in production, transport, flow of goods
  - assembly line in production systems
  - important model in bureaucracy (first come first serve)
- different parts are in operation at different stages at the same time
- dominant in modern computer architectures; in computers:
  - accessing caches or memory within a loop
  - accessing data from a remote processor
  - input and output system
- description model of the pipeline:
  - the pipeline is filled after startup steps; this is the length of the pipeline and at the same time the degree of its parallelism; it can also be understood as latency
  - after the pipeline is filled it produces one result per step
48 performance modeling
49 performance modeling
50 performance behaviour of the pipeline (!)
The pipeline mechanism is characterized by
- the fixed startup or latency cost (time for filling the pipeline): startup
- the time per unit: T
- the operations per unit: op
- the performance: perf
The resulting performance is
  perf(n) = n op / (startup + n T)
Peak performance is the limit for increasing n:
  perf_peak = lim_{n -> infinity} perf(n) = op / T
We get half of the peak performance for n_1/2 = startup / T:
  perf(n_1/2) = (startup/T) op / (startup + (startup/T) T) = (1/2) op / T = (1/2) perf_peak
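Evaluating the model with assumed parameters (startup = 10 steps, one operation per step of length T = 1) shows the characteristic behaviour: half of peak at n = n_1/2, peak performance only asymptotically:

  program pipeline_model
    implicit none
    real(kind=8), parameter :: startup = 10.0d0, t = 1.0d0, op = 1.0d0  ! assumed
    real(kind=8) :: perf
    integer :: n
    do n = 1, 100
       perf = n * op / (startup + n * t)              ! perf(n)
       if (n == nint(startup / t)) then               ! n_1/2 = startup/T
          write(*,*) 'n_1/2 =', n, ': perf =', perf   ! = 0.5 = half of op/T
       endif
    enddo
    write(*,*) 'peak performance op/T =', op / t
  end program pipeline_model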
51 Amdahl's Law: limits of parallel speedup (!)
Assume a parallelizable program running a fixed size test case having
- a parallel part taking a total time t_par
- a serial (non parallelizable) part taking a total time t_ser
- total time t_tot = t_par + t_ser
We will run the program with p parallel processes. The speedup is
  speedup(p) = t_tot / (t_par/p + t_ser) <= t_tot / t_ser
and the efficiency
  efficiency(p) = speedup(p) / p
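Evaluating the speedup formula for an assumed 99 % parallel / 1 % serial split shows the hard ceiling t_tot/t_ser = 100:

  program amdahl
    implicit none
    real(kind=8) :: t_par, t_ser, t_tot, speedup
    integer :: p
    t_par = 0.99d0   ! parallel share (assumed)
    t_ser = 0.01d0   ! serial share (assumed)
    t_tot = t_par + t_ser
    p = 1
    do while (p <= 65536)
       speedup = t_tot / (t_par / p + t_ser)
       write(*,*) p, speedup, speedup / p   ! p, speedup(p), efficiency(p)
       p = p * 2
    enddo
    write(*,*) 'upper limit t_tot / t_ser =', t_tot / t_ser   ! 100 here
  end program amdahl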
52 Amdahl's Law: limits of parallel speedup (!)
- these assumptions hold for any type of parallel work!
- the pictures show the speedup in dependence on t_ser/(t_par + t_ser) in percent
- the behaviour is not encouraging! Why can large scale parallel computing still be successful?
- the speedup behaviour of a parallel program for a fixed size test case is named strong scaling
53 Amdahl's Law with communication
Even worse: assume that communication times increase with the number of processes.
- a parallel part taking a total time t_par
- a communication time com*p, assumed proportional to the number of processes p; other dependencies to be discussed
- a serial part taking a total time t_ser
- total time t_tot = t_par + t_ser
We will run the program with p parallel processes. The speedup is
  speedup(p) = t_tot / (t_par/p + com*p + t_ser) <= t_tot / (com*p + t_ser)
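With the communication term the speedup even has a maximum and then decays; a sketch with an assumed com = 1.0e-4 per process (the maximum lies near p = sqrt(t_par/com), about 100 here):

  program amdahl_comm
    implicit none
    real(kind=8) :: t_par, t_ser, com, speedup
    integer :: p
    t_par = 0.99d0   ! parallel share (assumed)
    t_ser = 0.01d0   ! serial share (assumed)
    com   = 1.0d-4   ! communication cost per process (assumed)
    p = 1
    do while (p <= 65536)
       speedup = (t_par + t_ser) / (t_par / p + com * p + t_ser)
       write(*,*) p, speedup   ! grows, peaks near p ~ 100, then decreases
       p = p * 2
    enddo
  end program amdahl_comm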
54 Gustafson: increase the work 1 (!)
Amdahl assumes a fixed amount of work. But what can we do in a fixed amount of time?
- a parallel part taking a total time t_par = t_proc * p, proportional to the number of processors
- a serial part taking a total time t_ser
The speedup is
  speedup(p) = (t_proc p + t_ser) / (t_proc p / p + t_ser) = p t_proc / (t_proc + t_ser) + t_ser / (t_proc + t_ser)
and the efficiency
  efficiency(p) = speedup(p) / p = t_proc / (t_proc + t_ser) + t_ser / ((t_proc + t_ser) p)
The efficiency decreases to a lower limit.
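The same kind of sketch for Gustafson's scaled problem (again a 99 % / 1 % split assumed): the speedup grows without bound while the efficiency settles at t_proc/(t_proc + t_ser) = 0.99:

  program gustafson
    implicit none
    real(kind=8) :: t_proc, t_ser, speedup
    integer :: p
    t_proc = 0.99d0   ! per-process parallel work (assumed)
    t_ser  = 0.01d0   ! serial part (assumed)
    p = 1
    do while (p <= 65536)
       speedup = (t_proc * p + t_ser) / (t_proc + t_ser)
       write(*,*) p, speedup, speedup / p   ! efficiency -> t_proc/(t_proc+t_ser)
       p = p * 2
    enddo
  end program gustafson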
55 Gustafson: increase the work 2 (!)
- there are a lot of simulation problems of this type!
- but we are neglecting any necessary communication
- the speedup behaviour of a parallel program for a test case which grows with the number of active processors/cores is named weak scaling
56 Thank you for your attention!
Kuester[at]hlrs.de