Hiroaki Kobayashi 8/3/2011
Slide 2
- Introduction
- The Difficulty of Creating Parallel Processing Programs
- Shared Memory Multiprocessors
- Clusters and Other Message-Passing Multiprocessors
- Cache Coherence on Shared-Memory Multiprocessors
  - Snooping protocol for cache coherence on shared-memory multiprocessors
  - Directory-based protocol for cache coherence on message-passing multiprocessors
- SISD, MIMD, SIMD, SPMD, and Vector
- Introduction to Multiprocessor Network Topologies
- Multiprocessor Benchmarks
- Summary
Slide 3
- The goal of multiprocessors is to create powerful computers simply by connecting many existing smaller ones.
  - Replace large, inefficient processors with many smaller, efficient processors
  - Scalability, availability, power efficiency
- Job-level (or process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - A single program runs on multiple processors
  - Short turnaround time
Slide 4: Introduction (2/2)
- A cluster is composed of microprocessors housed in many independent servers or PCs
  - Now very popular for search engines, Web servers, and database servers in data centers
- Multicore/manycore microprocessors
  - Chips with multiple processors (cores)
  - Difficulty in seeking higher clock rates and improved CPI on a single processor due to the power problem
  - The number of cores is expected to double every two years
[Figure: Intel Nehalem-EP processor with 4 cores]
Programmers who care about performance must become parallel programmers!
Slide 5
Hardware/software categorization (sequential vs. concurrent software, serial vs. parallel hardware):
- Serial hardware, sequential software: matrix multiply written in C, compiled for and running on an Intel Pentium 4
- Parallel hardware, sequential software: matrix multiply written in C, compiled for and running on an Intel Xeon E5345 (Clovertown)
- Serial hardware, concurrent software: Windows Vista operating system running on an Intel Pentium 4
- Parallel hardware, concurrent software: Windows Vista operating system running on an Intel Xeon E5345 (Clovertown)
Challenge: making effective use of parallel hardware for sequential or concurrent software
Slide 6
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
  - Uniprocessor design techniques such as superscalar and out-of-order execution take advantage of ILP, normally without the involvement of the programmer
- Difficulties in the design of parallel programs
  - Partitioning (as equally as possible)
  - Coordination/scheduling for load balancing
  - Synchronization and communication overheads
The difficulties get worse as the number of processors increases.
Slide 7
- Amdahl's Law says the sequential part can limit speedup
- Example: 100 processors, speedup of 90?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
  - The sequential part must be only 0.1% of the original execution time
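As a quick check of this arithmetic, here is a minimal sketch in C; the function and variable names are illustrative, not from the slides.

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - f) + f/p), where f is the
       parallelizable fraction and p is the processor count. */
    double amdahl(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        /* To reach a speedup of 90 on 100 processors, solve
           90 = 1 / ((1 - f) + f/100)  =>  f = (1 - 1/90) / (1 - 1/100) */
        double f = (1.0 - 1.0 / 90.0) / (1.0 - 1.0 / 100.0);
        printf("required parallel fraction: %.4f\n", f);        /* ~0.999 */
        printf("speedup with f = 0.999:    %.1f\n", amdahl(0.999, 100));
        return 0;
    }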
Slide 8
- Workload: sum of 10 scalars, and 10 x 10 matrix sum
  - Speedup from 10 to 100 processors
- Single processor: Time = (10 + 100) x t_add = 110 x t_add
- 10 processors
  - Time = 10 x t_add + (100/10) x t_add = 20 x t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 x t_add + (100/100) x t_add = 11 x t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes the load can be balanced across processors
Slide 9
- What if the matrix size is 100 x 100?
- Single processor: Time = (10 + 10000) x t_add = 10010 x t_add
- 10 processors
  - Time = 10 x t_add + (10000/10) x t_add = 1010 x t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 x t_add + (10000/100) x t_add = 110 x t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming the load is balanced
Slide 10
Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
- Strong scaling: problem size fixed
  - As in the previous example
- Weak scaling: problem size proportional to the number of processors
  - 10 processors, 10 x 10 matrix: Time = 20 x t_add
  - 100 processors, 32 x 32 matrix: Time = 10 x t_add + (1000/100) x t_add = 20 x t_add
  - Constant performance in this example
Slide 11
- In the previous example, each processor had 1% of the work to achieve the speedup of 91 on the larger problem with 100 processors.
- If one processor's load is higher than all the rest, show the impact on speedup. Calculate for 2% and 5%:
  - If one processor has 2% of the 10,000 additions, the other 99 share the remaining 9,800:
    - Execution time = max(9800t/99, 200t/1) + 10t = 210t
    - Speedup drops to 10,010t/210t = 48
  - If one processor has 5% of the 10,000 additions, the other 99 share the remaining 9,500:
    - Execution time = max(9500t/99, 500t/1) + 10t = 510t
    - Speedup drops to 10,010t/510t = 20
Slide 12
- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Shared variables are synchronized using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)
[Figure: Uniform Memory Access (UMA) multiprocessor]
Slide 13
Single address space with nonuniform memory access (NUMA):
- Lower latency to nearby memory
- Higher scalability
- More effort required for efficient parallel programming
Slide 14
- Synchronization: the process of coordinating the behavior of two or more processes, which may be running on different processors.
  - As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data.
- Lock: a synchronization device that allows access to data by only one process at a time.
  - Other processors interested in the shared data must wait until the original processor unlocks the variable.
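As an illustration of a lock in practice, here is a minimal sketch using POSIX threads; the variable and function names are illustrative, not from the slides.

    #include <pthread.h>

    /* A mutex plays the role of the lock described above. */
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
    static double shared_sum = 0.0;

    void add_partial(double partial) {
        pthread_mutex_lock(&sum_lock);   /* only one thread may enter */
        shared_sum += partial;           /* critical section on shared data */
        pthread_mutex_unlock(&sum_lock); /* release so others can proceed */
    }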
Slide 15
- Example: sum 100,000 numbers on a 100-processor UMA machine
  - Each processor has an ID: 0 <= Pn <= 99
  - Partition: 1,000 numbers per processor
  - Initial summation on each processor:

        sum[Pn] = 0;
        for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
            sum[Pn] = sum[Pn] + A[i];

- Now we need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, ...
  - Need to synchronize between reduction steps
Slide 16
The reduction, run on every processor (synch() is a barrier that all processors must reach before any proceeds):

    half = 100;
    repeat
        synch();
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* conditional sum needed when half is odd;
               processor 0 gets the missing element */
        half = half / 2;    /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
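Putting the partial summation and the reduction together, here is a minimal runnable sketch assuming POSIX threads, with pthread_barrier_wait() standing in for synch(); the thread layout and names are assumptions, not the slides' prescription. The barrier must be initialized once in main with pthread_barrier_init(&bar, NULL, P) before the P threads are launched.

    #include <pthread.h>

    #define P 100                  /* number of processors/threads */
    static double A[100000];       /* the numbers to sum */
    static double sum[P];          /* per-processor partial sums */
    static pthread_barrier_t bar;  /* plays the role of synch() */

    void *worker(void *arg) {
        int Pn = (int)(long)arg;   /* processor ID: 0 .. P-1 */
        int i, half;

        /* Phase 1: each thread sums its 1,000 numbers. */
        sum[Pn] = 0;
        for (i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction with a barrier between steps. */
        half = P;
        do {
            pthread_barrier_wait(&bar);   /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];  /* odd case: P0 takes the extra */
            half /= 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
        return NULL;                      /* sum[0] holds the total */
    }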
Slide 17
Message-passing multiprocessors:
- Each processor has a private physical address space
- Hardware sends/receives messages between processors
- Multicomputers, clusters, networks of workstations (NOW)
- Suited for job-level parallelism and applications with little communication
  - Web search, mail servers, file servers
Slide 18
- A cluster is a network of independent computers
  - Each has private memory and its own OS
  - Connected using the I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth, cf. processor/memory bandwidth on an SMP
Slide 19
- Example: sum 100,000 numbers on 100 processors with message passing
- First distribute 1,000 numbers to each processor
  - Then do the partial sums:

        sum = 0;
        for (i = 0; i < 1000; i = i + 1)
            sum = sum + AN[i];

- Reduction
  - Half the processors send, the other half receive and add
  - Then a quarter send, a quarter receive and add, ...
Slide 20
Given send() and receive() operations:

    limit = 100; half = 100;    /* 100 processors */
    repeat
        half = (half + 1) / 2;  /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);
        if (Pn < (limit / 2))
            sum = sum + receive();
        limit = half;           /* upper limit of senders */
    until (half == 1);          /* exit with final sum */

- send()/receive() also provide synchronization
- Assumes send()/receive() take a similar time to an addition
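For concreteness, here is a hedged sketch of the same tree reduction using MPI point-to-point calls in place of the abstract send()/receive(); the function name and structure are illustrative, and each rank is assumed to already hold its partial sum.

    #include <mpi.h>

    /* Tree reduction over nprocs ranks; the final total is valid on
       rank 0. Mirrors the repeat/until loop above: a sender at rank
       Pn delivers to rank Pn - half, so a receiver reads Pn + half. */
    double tree_sum(double my_sum, int Pn, int nprocs) {
        int limit = nprocs, half = nprocs;
        double recv;
        MPI_Status st;

        do {
            half = (half + 1) / 2;          /* dividing line */
            if (Pn >= half && Pn < limit)
                MPI_Send(&my_sum, 1, MPI_DOUBLE, Pn - half, 0,
                         MPI_COMM_WORLD);
            if (Pn < limit / 2) {
                MPI_Recv(&recv, 1, MPI_DOUBLE, Pn + half, 0,
                         MPI_COMM_WORLD, &st);
                my_sum += recv;
            }
            limit = half;                   /* upper limit of senders */
        } while (half != 1);
        return my_sum;
    }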
Slide 21
- Grid computing: separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units are farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home
  - Also named volunteer computing
  - 5+ million computers, 200+ countries, 257 TFLOP/s for searching for extraterrestrial intelligence in 2006
Slide 22
- Suppose two CPU cores share a physical address space, with write-through caches:

    Time step | Event               | CPU A's cache | CPU B's cache | Memory
    0         |                     |               |               | 0
    1         | CPU A reads X       | 0             |               | 0
    2         | CPU B reads X       | 0             | 0             | 0
    3         | CPU A writes 1 to X | 1             | 0             | 1

- After step 3, CPU B's cached copy (0) is inconsistent with memory and CPU A's cache (1).
Slide 23
- Operations performed by caches in multiprocessors to ensure coherence:
  - Migration of data to local caches
    - Reduces bandwidth demand on shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record the sharing status of blocks in a directory
Slide 24
[Figure: a snoop-based system built around a broadcast bus vs. a directory-based system]
Slide 25
- A cache gets exclusive access to a block when the block is to be written
  - It broadcasts an invalidate message on the bus
  - A subsequent read in another cache misses, and the owning cache supplies the updated value
  - If two processors attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated
    - This enforces write serialization

    CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory
                        |                  |               |               | 0
    CPU A reads X       | Cache miss for X | 0             |               | 0
    CPU B reads X       | Cache miss for X | 0             | 0             | 0
    CPU A writes 1 to X | Invalidate for X | 1             | (invalidated) | 0
    CPU B reads X       | Cache miss for X | 1             | 1             | 1
Slide 26
- Coherence misses: a new class of cache misses due to sharing data on a multiprocessor
- True sharing
  - A write to shared data in one processor's cache invalidates the entire block containing that data in other processors' caches, and a miss occurs when another processor accesses a modified word in that block.
- False sharing
  - The same block-wide invalidation occurs, but the miss happens even when another processor accesses a different word that merely resides in the same block.
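To make false sharing concrete, here is a minimal C sketch; the struct layout and the 64-byte block size are assumptions for illustration.

    /* Two counters updated by two different threads. In the first
       layout they share one cache block, so every write by one core
       invalidates the other core's copy even though the threads never
       touch the same word (false sharing). Padding separates them. */
    struct counters_bad {
        long a;                       /* updated only by thread 0 */
        long b;                       /* updated only by thread 1, but
                                         in the same block as a */
    };

    struct counters_good {
        long a;
        char pad[64 - sizeof(long)];  /* assume 64-byte cache blocks */
        long b;                       /* now in a different block */
    };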
Slide 28
Flynn's Classification (1966):

                                   Single data stream        Multiple data streams
    Single instruction stream      SISD: Intel Pentium 4     SIMD: SSE instructions of x86
    Multiple instruction streams   MISD: no examples today   MIMD: Intel Xeon E5345

- SPMD: Single Program, Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors
Slide 29
- SIMD machines operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86: multiple data elements in 128-bit-wide registers
- All processors execute the same instruction at the same time
  - Each with a different data address, etc.
- Simplifies synchronization
- Reduced instruction-control hardware
- Works best for highly data-parallel applications
- Weakest with case or switch statements
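As an illustration of the SSE style of SIMD mentioned above, here is a minimal sketch using x86 SSE intrinsics; the function name is illustrative.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add four packed single-precision floats with one SIMD
       operation: an SSE 128-bit register holds 4 floats. */
    void add4(const float *x, const float *y, float *z) {
        __m128 a = _mm_loadu_ps(x);    /* load 4 floats from x */
        __m128 b = _mm_loadu_ps(y);    /* load 4 floats from y */
        __m128 c = _mm_add_ps(a, b);   /* 4 additions in parallel */
        _mm_storeu_ps(z, c);           /* store 4 results to z */
    }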
Slide 30
- Vector processors use highly pipelined function units
- Stream data from/to vector registers through the units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: vector extension to MIPS
  - 32 x 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add two vectors of double
    - addvs.d: add a scalar to each element of a vector of double
- Significantly reduces instruction-fetch bandwidth
Slide 31
- Vector registers
  - Each vector register is a fixed-length bank holding a single vector, and must have at least two read ports and one write port.
- Vector functional units
  - Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards, both from conflicts for the functional units and from conflicts for register accesses.
- Vector load-store unit
  - A vector memory unit that loads or stores a vector to or from memory. Vector loads and stores are fully pipelined, so words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency.
[Figure: pipelined FP add unit reading from VR1 and VR2, writing to VR3]
Slide 32
Conventional MIPS code:

        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
    loop:
        l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a * x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a * x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

    Vector MIPS code:

        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result
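For reference, both sequences implement the DAXPY kernel Y = a*X + Y over 64 double-precision elements (512 bytes, matching the #512 bound above); a minimal C equivalent, with an illustrative function name:

    /* DAXPY: y[i] = a * x[i] + y[i] for 64 doubles. */
    void daxpy(double a, const double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }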
Slide 33
- Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of the absence of loop-carried dependences
    - Reduced checking in hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
- More general than ad hoc media extensions (such as MMX and SSE)
  - Better match with compiler technology
Slide 34
- Early video cards (VGA controllers)
  - Frame buffer memory with address generation for video output
- 3D graphics processing
  - Originally on high-end computers (e.g., SGI)
  - Moore's Law: lower cost, higher density
  - 3D graphics cards for PCs and game consoles
- Graphics Processing Units (GPUs)
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization
Slide 35
[Figure: CPU-GPU system organization; the link shown provides 16 GB/s]
Slide 37
- GPU processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)
Slide 39
[Figure: GPU architecture; each streaming multiprocessor contains 8 streaming processors]
Peak performance: 345.6 GFLOP/s = 16 MPs x 8 SPs x 2 FLOPs x 1.35 x 10^9 Hz
Slide 40
- Streaming processors (SPs)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: a group of 32 threads
  - Executed in parallel, SIMD style: 8 SPs x 4 clock cycles
  - Hardware contexts for 24 warps: registers, PCs, ...
Slide 41
- GPUs don't fit nicely into the SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general-purpose code with care

                                         Instruction-Level Parallelism   Data-Level Parallelism
    Static: discovered at compile time   VLIW                            SIMD or vector
    Dynamic: discovered at runtime       Superscalar                     Tesla multiprocessor
Slide 42
- Network topologies: arrangements of processors, switches, and links
[Figure: bus, ring, N-cube (N = 3), 2D mesh, and fully connected topologies]
Slide 44
- Performance (a small worked example for two topologies follows this list)
  - Latency per message (on an unloaded network)
  - Throughput
    - Link bandwidth: peak data transfer rate per link
    - Total network bandwidth: collective data transfer rate of all links in the network
    - Bisection bandwidth: the bandwidth between two equal parts of a multiprocessor, measured for a worst-case split of the machine
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon
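Here is a minimal sketch comparing these bandwidth metrics for a ring and a fully connected network of P nodes with per-link bandwidth B; the formulas are the usual textbook-style counts, assumed here rather than quoted from the slides.

    #include <stdio.h>

    /* Ring of P nodes:            total = P*B links' worth,
                                   bisection = 2*B (two links cut).
       Fully connected P nodes:    total = P*(P-1)/2 * B,
                                   bisection = (P/2)*(P/2) * B. */
    int main(void) {
        int P = 64;
        double B = 1.0;   /* one bandwidth unit per link */
        printf("ring:            total = %.0f, bisection = %.0f\n",
               P * B, 2 * B);
        printf("fully connected: total = %.0f, bisection = %.0f\n",
               P * (P - 1) / 2 * B, (P / 2) * (P / 2) * B);
        return 0;
    }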
Slide 45
- Linpack: matrix linear algebra
  - DAXPY routine; weak scaling
- SPECrate: parallel runs of SPEC CPU programs
  - Job-level parallelism (no communication between the jobs)
  - Throughput metric; weak scaling
- SPLASH: Stanford Parallel Applications for Shared Memory
  - Mix of kernels and applications; strong scaling
- NAS (NASA Advanced Supercomputing) suite
  - Computational fluid dynamics kernels; weak scaling
- PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  - Multithreaded applications using Pthreads and OpenMP; weak scaling
Slide 47
- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower latency, higher-bandwidth interconnect
- An ongoing challenge for computer architects!
Slide 48
- 8/10: GPU
- 8/15: web