CS 575 Parallel Processing

1 CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

2 Course Topics: Introduction, Background; Orders of magnitude, Recurrences; Models of Parallel Computing, communication; Performance, Speedup, Efficiency; Parallel Algorithms: Dense Linear Algebra, Sorting, Graphs, Search, Fast Fourier Transform. CS575 lecture 1 2

3 Course Organization. Course reorganization: unite 575 and 575dl; modernize with more parallel (//) algorithms and GPUs. We have separate course streams in networking and distributed systems. Check the web page regularly; course organization is described on the web (let's go look...). Project changes regularly to stay fresh. Second half of the course: GPUs/CUDA.

4 Cost-effective Parallel Computing. Off-the-shelf, commodity processors are very fast. Memory is very cheap. Building a processor that is a small factor faster costs an order of magnitude more. Clusters: cheapest way to get more performance: multiprocessor. NoW: Networks of Workstations. Datacenters employ O(100K) simple processors with cheap interconnects. Workstation can be an SMP: shared memory, bus or crossbar (e.g. Cray).

5 Wile E. Coyote's Parallel Computer. Get a lot of the fastest processors. Get a lot of memory per processor. Get the fastest network. Hook it all together. And then what???

11 Now you gotta program it! Parallel programming introduces: task partitioning, task scheduling; data partitioning, distribution; synchronization; load balancing; latency issues (hiding, tolerance).

15 Problem with Wile E. Coyote Architecture. For high speed, processors have lots of state: cache, stack, global memory. To tolerate latency, we need fast context switch. WHY? No free lunch: can't have both, certainly not if the processor was not designed for both. Memory wall: memory gets slower and slower. WHY? HOW?

16 Problem with Wile E. Coyote Architecture. For high speed, processors have lots of state: cache, stack, global memory. To tolerate latency, we need fast context switch. WHY? No free lunch: can't have both, certainly not if the processor was not designed for both. Memory wall: memory gets slower and slower in terms of the number of cycles it takes to access. Memory hierarchy gets more complex.

17 Sequential Algorithms. Efficient sequential algorithms: minimize time and space; maximize state (avoiding re-computation). Efficiency is portable: efficient program on Pentium ~ efficient program on Opteron.

18 Parallel Algorithms. Efficient parallel algorithms: use efficient sequential algorithms; maximize parallelism (re-computation is sometimes better than communication); minimize overhead (synchronization, remote accesses). Parallel efficiency is architecture dependent.

19 Speedup. Ideal: n processors → n-fold speedup. Ideal not always possible. WHY? Tasks are data dependent; not all processors are always busy; remote data needs communication; memory wall PLUS communication wall. Linear speedup: α·n speedup (α ≤ 1).
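A minimal Python sketch of how speedup and the factor α on this slide might be computed from measured run times. The function names and the sample timings are illustrative assumptions, not measurements from the course; α here plays the role usually called parallel efficiency.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup = serial run time / parallel run time."""
    return t_serial / t_parallel


def alpha(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Fraction of the ideal n-fold speedup actually achieved."""
    return speedup(t_serial, t_parallel) / n_procs


if __name__ == "__main__":
    # Hypothetical timings: 100 s on 1 processor, 16 s on 8 processors.
    s = speedup(100.0, 16.0)
    a = alpha(100.0, 16.0, 8)
    print(f"speedup = {s:.2f}, alpha = {a:.2f}")
```

With these made-up numbers α stays below 1, which is the linear-speedup case; a value above 1 would be the super linear case discussed next.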

20 Super linear speedup. Super linear speedup: α > 1. Discuss... is it possible?

21 Super linear speedup. Super linear speedup: α > 1. Nonsense! Because we can execute the faster parallel program sequentially.

22 Super linear speedup. Super linear speedup: α > 1. No nonsense!! Because parallel computers do not just have more processors, they have more local memory / caches.

23 Parallel Programming Paradigms. Implicit parallel programming: Super Compilers. Compiler extracts parallelism from sequential code; distributes data, creates and schedules tasks. Complication: side effects. - The sequential order of reads and writes to a memory location determines the program outcome. - A parallelizing compiler must obey the sequential order of side-effecting statements and still create parallelism (//ism). - Pointers, aliases, and indirect array references make analyzing which statements access which locations hard or impossible. - 40 years of compiler research for general purpose parallel computing has not brought much result.

24 Paradigms cont. Implicit parallel programming cont. Simple, clean case: Functional Programming (FP). Functions: no side effects, order of execution less constrained. In F( P(x,y), Q(y,z) ), P and Q can be executed in parallel. Simple single-assignment memory model: no pointers, no write-after-read or write-after-write hazards (dataflow semantics). FP was long doomed: too high level, too inefficient, because the simple memory model causes lots of copies. FP is coming back: the MapReduce approach in data centers (Google) is a data parallel functional paradigm.
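The F( P(x,y), Q(y,z) ) case can be sketched in Python. This is only an illustration under stated assumptions: the bodies of `P`, `Q` and `F` are made up, and `evaluate_parallel` is a hypothetical helper. Because `P` and `Q` have no side effects, they can be submitted in either order or at the same time without changing the result.

```python
from concurrent.futures import ThreadPoolExecutor


def P(x, y):
    return x + y  # pure: result depends only on the arguments


def Q(y, z):
    return y * z  # pure: no shared state is read or written


def F(a, b):
    return a - b


def evaluate_parallel(x, y, z):
    """Evaluate F(P(x, y), Q(y, z)) with P and Q submitted concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        p_future = pool.submit(P, x, y)
        q_future = pool.submit(Q, y, z)
        # F needs both results, so it waits on the two futures (dataflow style).
        return F(p_future.result(), q_future.result())
```

In standard CPython the global interpreter lock means this shows the structure of the computation rather than an actual speedup for CPU-bound work.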

25 Explicit parallel programming. Multithreading: OpenMP. Message passing: MPI. Data parallel programming (important niche): CUDA. Explicit parallelism complicates programming: creation, allocation, scheduling of processes; data partitioning; synchronization (semaphores, locks, messages).
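A small Python sketch of what the programmer has to manage explicitly. It uses Python's `threading` module as a stand-in, not OpenMP, MPI or CUDA, and the function name `parallel_sum` and the default thread count are assumptions made for illustration.

```python
import threading


def parallel_sum(data, n_threads=4):
    """Sum a list using explicitly created threads and a lock."""
    total = 0
    lock = threading.Lock()
    chunk = (len(data) + n_threads - 1) // n_threads  # ceiling division

    def worker(part):
        nonlocal total
        local = sum(part)      # work on this thread's partition only
        with lock:             # synchronization: one writer at a time
            total += local

    threads = []
    for i in range(n_threads):
        part = data[i * chunk:(i + 1) * chunk]   # data partitioning
        t = threading.Thread(target=worker, args=(part,))
        threads.append(t)
        t.start()              # thread creation; scheduling is left to the OS
    for t in threads:
        t.join()               # wait for all workers to finish
    return total
```

Even in this tiny case the partitioning, thread creation, locking and joining take more code than the sum itself.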

26 Example 1: Weather Prediction. Area, segments: 3000*3000*11 cubic miles; .1*.1*.1 cubic mile: ~$10^{11}$ segments. Two day prediction, half hour time steps: ~100 time steps. Computation per segment: temp, pressure, humidity, wind speed, wind direction for each time step in each segment. Assume ~100 FLOPs per time step per segment.

27 Performance: Weather Prediction. Computational requirement: $10^{15}$ FLOPs. Assume one FLOP per clock cycle; 1 core: 4 GHz. Total serial time: $25 \times 10^{4}$ sec ~ 70 hours. Not too good for 48 hour weather prediction.
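The serial estimate can be worked through step by step. This is a sketch that assumes about $10^{11}$ segments (the 3000*3000*11 cubic mile volume cut into 0.1-mile cubes), roughly 100 time steps, and 100 FLOPs per segment per time step:

```latex
\begin{align*}
\text{segments} &= \frac{3000 \cdot 3000 \cdot 11}{0.1^{3}} \approx 10^{11} \\
\text{FLOPs} &\approx 10^{11}\ \text{segments} \times 100\ \text{time steps} \times 100\ \text{FLOPs} = 10^{15} \\
\text{serial time} &\approx \frac{10^{15}\ \text{FLOPs}}{4 \times 10^{9}\ \text{FLOPs/sec}} = 2.5 \times 10^{5}\ \text{sec} \approx 70\ \text{hours}
\end{align*}
```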

28 Parallel Weather Prediction. 1 K workstations, grid connected. $10^{8}$ segment computations per processor; $10^{8}$ instructions per second; 100 instructions per segment computation; 100 time steps: $10^{4}$ seconds = ~3 hours. Much more acceptable. Assumption: communication not a problem here. Why is this assumption reasonable? More workstations: finer grid, better accuracy.
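The parallel estimate can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the function name and parameter names are assumptions, the default values are the figures from the slide (with about $10^{11}$ segments in total), and communication cost is ignored, as the slide assumes.

```python
def parallel_time_seconds(total_segments=10**11, processors=10**3,
                          instr_per_segment=100, time_steps=100,
                          instr_per_second=10**8):
    """Per-processor run time, ignoring communication cost."""
    segments_per_proc = total_segments // processors
    instr_per_proc = segments_per_proc * instr_per_segment * time_steps
    return instr_per_proc / instr_per_second


if __name__ == "__main__":
    t = parallel_time_seconds()
    print(f"{t:.0f} s = {t / 3600:.1f} hours")
```

Under the same no-communication assumption, doubling `processors` halves the estimate; alternatively the extra processors can be spent on a finer grid.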

29 Example 2: N body problem. Astronomy: bodies in space attract each other: gravitational force, Newton's law. $O(n^2)$ calculations per snapshot. Galaxy: ~$10^{11}$ bodies -> ~$10^{22}$ calculations/snapshot. Calculation: 1 micro sec. Snapshot: $10^{16}$ secs = ~$10^{11}$ days = ~$3 \times 10^{8}$ years. Is parallelism going to help us? NO. What does help? Better algorithm: Barnes-Hut. Divides the space in a quad tree (or oct tree); treats far away quads as one body: $O(n \log n)$. How much time per snapshot now?
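A rough Python estimate of the two snapshot times. The assumptions are loud ones: about $10^{11}$ bodies, 1 microsecond per calculation, a base-2 logarithm, and all constant factors of the tree algorithm ignored. The function name and parameters are made up for this sketch.

```python
import math

SECONDS_PER_DAY = 86_400


def snapshot_seconds(n_bodies, calc_seconds=1e-6, barnes_hut=False):
    """Estimated time for one snapshot.

    Direct method: n^2 pairwise calculations.
    Barnes-Hut style estimate: n * log2(n) calculations (constants ignored).
    """
    if barnes_hut:
        calcs = n_bodies * math.log2(n_bodies)
    else:
        calcs = n_bodies ** 2
    return calcs * calc_seconds


if __name__ == "__main__":
    n = 10**11  # assumed galaxy size
    direct = snapshot_seconds(n)
    bh = snapshot_seconds(n, barnes_hut=True)
    print(f"direct:     {direct / SECONDS_PER_DAY:.2e} days")
    print(f"Barnes-Hut: {bh / SECONDS_PER_DAY:.1f} days")
```

Under these assumptions the $O(n \log n)$ estimate comes out at weeks rather than hundreds of millions of years, a range where parallelism becomes worthwhile.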

30 Other Challenging Applications. Satellite data acquisition: billions of bits / sec. Pollution levels, remote sensing of materials. Image recognition. Discrete optimization problems: planning, scheduling, VLSI design. Bio-informatics, computational chemistry. Airplane/satellite/vehicle design. Internet (Google search).

31 Application Specific Architectures. ASICs: Application Specific Integrated Circuits. Levels of specificity: full custom ASICs, standard cell ASICs, field programmable gate arrays. Computational models: dataflow graphs, systolic arrays. Promising orders of magnitude better performance, lower power.

32 ASICs cont. How much faster than general purpose? Example: 1D 1024 FFT. General purpose machine (G4): 25 micro secs. ASIC device (MIT Lincoln Labs): 32 nano secs. ASIC device uses 20 milliwatts (100× less power). Other applications: Finite Impulse Response (FIR) filters, matrix multiply, QR decomposition. What do these all have in common?

33 Background. If you do not have the necessary background in analysis of algorithms, see the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein, or go online. Topics to study: introduction, growth of functions, summations, recurrences.

34 Background: Orders of Magnitude (O, Ω, Θ). $f(n) = O(g(n))$ iff $\exists c, n_0 : f(n) < c \cdot g(n) \ \forall n > n_0$; used for upper bound of algorithm complexity: this particular algorithm takes at most $c \cdot g(n)$ time. $f(n) = \Omega(g(n))$ iff $\exists c, n_0 : f(n) > c \cdot g(n) \ \forall n > n_0$; used for lower bound of problem complexity: any algorithm for solving this problem takes at least $c \cdot g(n)$ time. $f(n) = \Theta(g(n))$ iff $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$: tight bound.

35 Background: Closed problems. Closed problem P: there is an algorithm X with O(X) = Ω(P), e.g. sort has tight bound $\Theta(n \log n)$. Problem P has an algorithmic gap: P is not closed, e.g. all NP-complete problems (problems with polynomial lower bound but currently exponential upper bound, such as TSP).

36 Recurrence Relations. Algorithmic complexity is often described using recurrence relations: $f(n) = R(f(1), \ldots, f(n-1))$. Two important types of recurrence relations: linear, and divide and conquer. cs420(dl) covers these.

37 Repeated substitution. Simple recurrence relations (one recurrent term in the rhs) can sometimes be solved using repeated substitution. Two types: Linear and DivCo. Linear: $F(n) = aF(n-d) + g(n)$, base: $F(1) = v_1$. DivCo: $F(n) = aF(n/d) + g(n)$, base: $F(1) = v_1$. Two questions: what is the pattern, and how often is it applied until we hit the base case?

38 Linear Example. $M(n) = 2M(n-1) + 1$, $M(1) = 1$ (recognize this recurrence?). $M(n) = 2M(n-1) + 1 = 2(2M(n-2) + 1) + 1 = 4M(n-2) + 2 + 1 = 4(2M(n-3) + 1) + 2 + 1 = 8M(n-3) + 4 + 2 + 1 = \ldots$ Inductive step: $= 2^k M(n-k) + 2^{k-1} + 2^{k-2} + \cdots + 1 = 2^k M(n-k) + 2^k - 1$. Hit base for $k = n-1$: $= 2^{n-1} M(1) + 2^{n-1} - 1 = 2^n - 1$. For more on linear recurrence relations, see 420dl.
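The closed form $2^n - 1$ can be checked numerically. The sketch below evaluates the recurrence directly; it also counts the moves of the classic recursive Towers of Hanoi solution, which is one well-known place where the same recurrence arises (whether that is the example the lecture had in mind is an assumption). `hanoi_moves` is a hypothetical helper name.

```python
def M(n: int) -> int:
    """Evaluate M(n) = 2*M(n-1) + 1 with M(1) = 1 iteratively."""
    value = 1                 # base case M(1)
    for _ in range(2, n + 1):
        value = 2 * value + 1
    return value


def hanoi_moves(n: int) -> int:
    """Count moves made by the classic recursive Towers of Hanoi solution."""
    if n == 1:
        return 1
    # Move n-1 disks aside, move the largest disk, move n-1 disks back on top.
    return hanoi_moves(n - 1) + 1 + hanoi_moves(n - 1)
```

Comparing `M(n)` with `2**n - 1` for a range of `n` is a quick sanity check on the repeated-substitution result, not a proof.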

39 DivCo example. Merge sort: $T(n) = 2T(n/2) + n$, $T(1) = 1$, $n = 2^k$. $T(n) = 2(2T(n/4) + n/2) + n = 4T(n/4) + 2n = 8T(n/8) + 3n = \ldots$ Inductive step: $= 2^k T(n/2^k) + kn$. Hit base for $k = \log n$: $2^k T(n/2^k) + kn = n\,T(1) + kn = n + n \log n = O(n \log n)$.
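A short Python check of the substitution result for powers of two. It evaluates the recurrence directly and compares it with $n \log_2 n + n$; the function names are illustrative assumptions.

```python
def T(n: int) -> int:
    """Evaluate T(n) = 2*T(n/2) + n with T(1) = 1, for n a power of two."""
    if n == 1:
        return 1
    return 2 * T(n // 2) + n


def closed_form(n: int) -> int:
    """n*log2(n) + n, from the substitution with k = log2(n)."""
    k = n.bit_length() - 1    # exact log2 for powers of two
    return n * k + n
```

For example, `T(4)` unfolds as `2*T(2) + 4 = 2*(2*1 + 2) + 4 = 12`, and `closed_form(4)` gives `4*2 + 4 = 12`.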

40 Another one: binary search. $f(n) = f(n/2) + c$, $f(1) = 1$; let $n = 2^k$. $f(n) = f(n/2) + c = f(n/4) + 2c = f(n/8) + 3c = \ldots = f(n/2^k) + kc$. Hit base for $k = \log n$: $f(1) + c \log n = O(\log n)$.
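A minimal iterative binary search in Python that also counts how many times the search interval is halved, to connect the code with the recurrence. Returning the step count is an addition made for illustration only.

```python
def binary_search(sorted_items, target):
    """Return (index of target or -1, number of halving steps)."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1       # discard the lower half
        else:
            hi = mid - 1       # discard the upper half
    return -1, steps
```

Each loop iteration does a constant amount of work and discards at least half of the remaining range, which is the $f(n) = f(n/2) + c$ pattern; for 1024 items the loop runs at most 11 times.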

41 Master Method. Cookbook approach to solution, based on repeated substitution (Cormen et al. or Rosen). $A_n = C\,A_{n/d} + k n^p$. $A_n = O(n^p)$ if $C < d^p$, e.g. $A_n = 3A_{n/2} + n^2$. $A_n = O(n^p \log n)$ if $C = d^p$, e.g. $A_n = 2A_{n/2} + n$. $A_n = O(n^{\log_d C})$ if $C > d^p$, e.g. $A_n = 3A_{n/2} + n$. Do binary search and merge sort with this method.
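The three cases can be written as a small Python classifier. This is a sketch of the cookbook rule as stated on this slide, not a general master-theorem implementation; the function name `master_bound` and the string output format are assumptions.

```python
import math


def master_bound(C: float, d: float, p: float) -> str:
    """Classify A_n = C*A_{n/d} + k*n^p using the three cases on the slide."""
    dp = d ** p
    if math.isclose(C, dp):
        return f"O(n^{p:g} log n)"
    if C < dp:
        return f"O(n^{p:g})"
    return f"O(n^{math.log(C, d):.3f})"   # exponent is log_d(C)


if __name__ == "__main__":
    # The three examples from the slide, as (C, d, p).
    for C, d, p in [(3, 2, 2), (2, 2, 1), (3, 2, 1)]:
        print(C, d, p, "->", master_bound(C, d, p))
```

The comparison of `C` with `d ** p` decides whether the work at the root, at every level, or at the leaves of the recursion dominates.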

42 Examples. Merge sort: $T(n) = 2T(n/2) + n$, $T(1) = 1$: C=? d=? p=? $d^p$=? T(n) = O(???). Binary search: $f(n) = f(n/2) + c$, $f(1) = 1$: C=? d=? p=? $d^p$=? f(n) = O(???).