Performance metrics for parallelism


Dwight Lindsey
10 years ago

1 Performance metrics for parallelism 8th of November, 2013

2 Sources
Rob H. Bisseling; Parallel Scientific Computation, Oxford University Press.
Grama, Gupta, Karypis, Kumar; Introduction to Parallel Computing, Addison Wesley.

3 Definition (Parallel overhead) Let $T_{seq}$ be the time taken by a sequential algorithm, and let $T_p$ be the time taken by a parallelisation of that algorithm using $p$ processes. Then the parallel overhead $T_o$ is given by $T_o = p T_p - T_{seq}$. (Effort is proportional to the number of workers multiplied by the duration of their work, that is, equal to $p T_p$.) Best case: $T_o = 0$, such that $T_p = T_{seq}/p$.

4 Definition (Speedup) Let $T_{seq}$, $p$, and $T_p$ be as before. Then the speedup $S$ is given by $S(p) = T_{seq}/T_p$. Target: $S = p$ (no overhead; $T_o = 0$). Best case: $S > p$ (superlinear speedup). Worst case: $S < 1$ (slowdown).
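The two definitions above translate directly into code. The following is a minimal sketch; the function names and the timings are hypothetical and chosen only for illustration, not measurements from the lecture.

```python
def parallel_overhead(t_seq: float, t_p: float, p: int) -> float:
    """T_o = p * T_p - T_seq: total effort minus the sequential work."""
    return p * t_p - t_seq


def speedup(t_seq: float, t_p: float) -> float:
    """S(p) = T_seq / T_p."""
    return t_seq / t_p


# Hypothetical timings in seconds (assumed values, for illustration only).
t_seq = 12.0  # sequential run time
t_p = 2.0     # parallel run time on p processes
p = 8

print("overhead:", parallel_overhead(t_seq, t_p, p))
print("speedup: ", speedup(t_seq, t_p))
```

With these assumed numbers the speedup is below the target $S = p$, which matches a strictly positive overhead.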

6 What is $T_{seq}$? Many sequential algorithms solve the same problem. When determining the speedup $S$, compare against the best sequential algorithm (that is available on your architecture). When determining the overhead $T_o$, compare against the most similar algorithm (maybe even take $T_{seq} = T_1$).

9 Definition (strong scaling) $S(p) = T_{seq}/T_p = \Omega(p)$ (i.e., $\limsup_{p \to \infty} S(p)/p > 0$).
[Figure: MulticoreBSP for Java, FFT on a Sun UltraSparc T2. $\log_2$ speedup versus $\log_2 p$; series: perfect scaling, 512k-length vector.]
Question: is it reasonable to expect strong scalability for (good) parallel algorithms?

10 [Figure: the same plot with a 32M-length vector series added.]
Answer: not as $p \to \infty$. You cannot efficiently clean a table with 50 people, or paint a wall with 500 painters.
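A strong-scaling plot of this kind can be derived from raw timings. The sketch below converts (p, T_p) pairs into the plotted coordinates and also reports $S(p)/p$, the quantity in the definition. The helper name and all timings are hypothetical, not data from the figures.

```python
import math


def scaling_points(t_seq, timings):
    """Return (log2 p, log2 S, S/p) for each (p, T_p) measurement."""
    points = []
    for p, t_p in timings:
        s = t_seq / t_p  # speedup S(p) = T_seq / T_p
        points.append((math.log2(p), math.log2(s), s / p))
    return points


# Hypothetical timings in seconds (assumed values, for illustration only).
t_seq = 64.0
timings = [(1, 64.0), (2, 32.0), (4, 16.0), (8, 10.0), (16, 8.0)]

for log_p, log_s, per_process in scaling_points(t_seq, timings):
    print(f"log2 p = {log_p:.0f}  log2 S = {log_s:.2f}  S/p = {per_process:.2f}")
```

In this made-up series $S(p)/p$ starts to drop once $p$ exceeds 4, which is the behaviour the answer above describes for growing $p$.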

11 Definition (weak scaling) $S(n) = T_{seq}(n)/T_p(n) = \Omega(1)$, with $p$ fixed.
[Figure: BSPlib, FFT on two different parallel machines. Speedup versus $\log_2 n$; series: perfect speedup, Lynx (distributed-memory), HP DL980 (shared-memory).]
For large enough problems, we do expect to make maximum use of our parallel computer.

12 Example
[Figure: MulticoreBSP for Java, FFT on a Sun UltraSparc T2. Speedup versus $\log_2 n$; series: perfect scaling, 64 processes, 32 processes.]
This processor advertises 64 processors, but only has 32 FPUs. With 64 processes, we oversubscribed! Be careful with oversubscription (including hyperthreading)!
Q: would you say this algorithm scales on the Sun UltraSparc T2?

16 A: if the speedup stabilises around 16x, then yes, since the relative efficiency is stable. Definition (Parallel efficiency) Let $T_{seq}$, $p$, $T_p$, and $S$ be as before. The parallel efficiency $E$ equals $E = \frac{T_{seq}/p}{T_p} = \frac{T_{seq}}{p T_p} = S/p$.
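As a small illustration of the efficiency definition, the sketch below evaluates $E$ for a series of problem sizes at fixed $p = 32$. The timings are hypothetical and were picked so that the speedup settles at 16x; they are not the measurements behind the figure.

```python
def efficiency(t_seq: float, t_p: float, p: int) -> float:
    """E = T_seq / (p * T_p) = S / p."""
    return t_seq / (p * t_p)


# Hypothetical (n, T_seq, T_p) triples in seconds for p = 32 processes.
p = 32
measurements = [
    (2**14, 1.0, 0.25),
    (2**20, 64.0, 4.0),
    (2**26, 4096.0, 256.0),
]

for n, t_seq, t_p in measurements:
    s = t_seq / t_p
    print(f"n = {n:>9}  S = {s:5.1f}  E = {efficiency(t_seq, t_p, p):.3f}")
```

Once the speedup stops changing with $n$, so does $E$, which is the sense in which the answer above calls the efficiency stable.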

17 If there is no overhead ($T_o = p T_p - T_{seq} = 0$), the efficiency $E = 1$; decreasing the overhead increases the efficiency. Weak scalability: what happens if the problem size increases? But what is a sensible definition of the problem size? Consider the following applications: inner-product calculation; binary search; sorting (quicksort).

19
Problem          Size                 Run-time
Inner-product    $\Theta(n)$ bytes    $\Theta(n)$ flops
Binary search    $\Theta(n)$ bytes    $\Theta(\log_2 n)$ comparisons
Sorting          $\Theta(n)$ bytes    $\Theta(n \log_2 n)$ swaps
FFT              $\Theta(n)$ bytes    $\Theta(n \log_2 n)$ flops
Hence the problem size is best identified by $T_{seq}$. Question: how should the ratio $T_o/T_{seq}$ behave as $T_{seq} \to \infty$, for the algorithm to scale in a weak sense?

20 If $T_o/T_{seq} = c$, with $c \in \mathbb{R}_{\geq 0}$ constant, then $\frac{p T_p - T_{seq}}{T_{seq}} = p S^{-1} - 1 = c$, so $S = \frac{p}{c+1}$. Note that here $E = S/p = \frac{1}{c+1}$, which is constant when $p$ is fixed. Question: how should the ratio $T_o/T_{seq}$ behave as $p \to \infty$, for the algorithm to scale in a strong sense?

21 Answer: exactly the same! With $T_o/T_{seq} = c$, the efficiency $E = \frac{1}{c+1}$ is still constant as $p$ grows. Both strong and weak scalability are iso-efficiency constraints ($E$ remains constant).
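The relation $S = p/(c+1)$ can be checked numerically. The sketch below assumes a constant relative overhead $c = T_o/T_{seq}$ and recovers $T_p$ from the overhead definition; the function name and the chosen values of $c$, $T_{seq}$ and $p$ are illustrative assumptions.

```python
import math


def speedup_with_relative_overhead(t_seq: float, p: int, c: float) -> float:
    """Speedup when T_o = c * T_seq, using T_o = p * T_p - T_seq."""
    t_o = c * t_seq
    t_p = (t_seq + t_o) / p  # solve the overhead definition for T_p
    return t_seq / t_p


c = 0.25  # assumed relative overhead
for t_seq in (10.0, 1000.0):
    for p in (2, 64):
        s = speedup_with_relative_overhead(t_seq, p, c)
        # E = S / p should match 1 / (c + 1) for every T_seq and p.
        assert math.isclose(s / p, 1.0 / (c + 1.0))
        print(f"T_seq = {t_seq:7.1f}  p = {p:2d}  S = {s:6.2f}  E = {s / p:.2f}")
```

The efficiency printed in the last column does not depend on $T_{seq}$ or $p$, only on $c$.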

22 Definition (iso-efficiency) Let $E$ be as before. Suppose $T_o = f(T_{seq}, p)$ is a known function. Then the iso-efficiency relation is given by $T_{seq} = \frac{1}{\frac{1}{E} - 1} f(T_{seq}, p)$. This follows from the definition of $E$: $E^{-1} = \frac{p T_p}{T_{seq}} = \frac{p T_p}{T_{seq}} + 1 - \frac{T_{seq}}{T_{seq}} = 1 + \frac{p T_p - T_{seq}}{T_{seq}} = 1 + \frac{T_o}{T_{seq}}$, so $T_{seq}(E^{-1} - 1) = T_o$.
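When the overhead depends on $p$ only, the iso-efficiency relation gives the required sequential work directly. The sketch below covers just that special case; the overhead model $p \log_2 p$ is a hypothetical example, not one taken from the lecture, and an overhead that also depends on $T_{seq}$ would require solving the implicit equation instead.

```python
import math


def iso_efficiency_t_seq(target_e: float, p: int, overhead) -> float:
    """Return the T_seq that yields efficiency target_e when T_o = overhead(p)."""
    if not 0.0 < target_e < 1.0:
        raise ValueError("target efficiency must lie strictly between 0 and 1")
    return overhead(p) / (1.0 / target_e - 1.0)


def example_overhead(p: int) -> float:
    """Hypothetical overhead model T_o = p * log2(p) (an assumption)."""
    return p * math.log2(p)


for p in (2, 8, 64):
    t_seq = iso_efficiency_t_seq(0.8, p, example_overhead)
    # Check against E = T_seq / (T_seq + T_o).
    e = t_seq / (t_seq + example_overhead(p))
    print(f"p = {p:2d}  required T_seq = {t_seq:8.1f}  E = {e:.2f}")
```

The required $T_{seq}$ grows with $p$ at the same rate as the assumed overhead, which is what keeping $E$ fixed demands.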

23 Questions Does this scale?
[Figure: time versus problem size.]

24 Yes! (It's again an FFT.)
[Figure: execution time (logarithmic) versus $\log_2 n$; series: measured running time, theoretical optimum (asymptotic).]

25 Questions Better use speedups when investigating scalability.
[Figure: HP DL, speedup versus $\log_2 n$.]

26 In summary... A BSP algorithm is scalable when $T = O(T_{seq}/p + p)$. This considers scalability of the speedup and includes parallel overhead. It does not include memory scalability: $M = O(M_{seq}/p + p)$, where $M$ is the memory taken by one BSP process and $M_{seq}$ the memory requirement of the best sequential algorithm. Question: does this definition of a scalable algorithm make sense?

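28 In summary... It does indeed make sense: if $T_o = \Theta(p)$, then $E^{-1} = \frac{p T_p}{T_{seq}} = 1 + \frac{p T_p - T_{seq}}{T_{seq}} = 1 + \frac{T_o}{T_{seq}} = 1 + \frac{p}{T_{seq}}$ (taking the constant hidden in $\Theta$ to be 1), so $E = \frac{T_{seq}}{T_{seq} + p}$. Note $\lim_{p \to \infty} E = 0$ (strong scalability, Amdahl's Law) and $\lim_{T_{seq} \to \infty} E = 1$ (weak scalability). For iso-efficiency, $T_{seq}$ and $p$ have to change together appropriately: $E$ stays constant when the ratio $T_{seq}/p$ does.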
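The two limits can be made concrete by evaluating $E = T_{seq}/(T_{seq} + p)$. The sketch below assumes $T_o = p$ exactly, with $T_{seq}$ and $p$ in the same abstract time unit; the sample values are arbitrary.

```python
def efficiency_linear_overhead(t_seq: float, p: float) -> float:
    """E = T_seq / (T_seq + p), assuming the overhead is exactly T_o = p."""
    return t_seq / (t_seq + p)


# Strong-scaling direction: fixed problem, growing p, so E drops towards 0.
for p in (10, 1_000, 100_000):
    print(f"T_seq = 1000  p = {p:6d}  E = {efficiency_linear_overhead(1000.0, p):.4f}")

# Weak-scaling direction: fixed p, growing problem, so E rises towards 1.
for t_seq in (10.0, 1_000.0, 100_000.0):
    print(f"T_seq = {t_seq:8.0f}  p = 1000  E = {efficiency_linear_overhead(t_seq, 1000):.4f}")

# Iso-efficiency: scaling T_seq and p by the same factor leaves E unchanged.
print(efficiency_linear_overhead(1000.0, 1000), efficiency_linear_overhead(8000.0, 8000))
```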

29 Practical session
Coming Thursday, see ~albert-jan.yzelman/education/parco13/
Subjects:
1 Amdahl's Law
2 Parallel Sorting
3 Parallel Grid computations

30 Insights from the practical session This intuition leads to a definition of how parallel certain algorithms are: Definition (Parallelism) Consider a parallel algorithm that runs in $T_p$ time. Let $T_{seq}$ be the time taken by the best sequential algorithm that solves the same problem. Then the parallelism is given by $\frac{T_{seq}}{T_\infty} = \lim_{p \to \infty} \frac{T_{seq}}{T_p}$. This kind of analysis is fundamental for fine-grained parallelisation schemes.
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. SIGPLAN Not. 30, 8 (August 1995), pp
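One way to make the parallelism definition tangible is to count operations for a concrete algorithm. The sketch below assumes a balanced pairwise summation of $n$ numbers with unit-cost additions. It takes $T_{seq}$ as the total number of additions and $T_\infty$ as the longest chain of dependent additions. These modelling choices and the function name are assumptions for illustration, not part of the lecture.

```python
def tree_sum_work_and_span(n: int) -> tuple[int, int]:
    """Return (total additions, longest dependent chain) for summing n values."""
    if n < 1:
        raise ValueError("need at least one value")
    if n == 1:
        return 0, 0  # a single value needs no addition
    left = n // 2
    right = n - left
    work_l, span_l = tree_sum_work_and_span(left)
    work_r, span_r = tree_sum_work_and_span(right)
    # Both halves can run concurrently; one extra addition combines them.
    return work_l + work_r + 1, max(span_l, span_r) + 1


n = 1024
work, span = tree_sum_work_and_span(n)
print("T_seq (additions):", work)
print("T_inf (chain):    ", span)
print("parallelism:      ", work / span)
```

Under these assumptions the parallelism grows like $n / \log_2 n$, so using more processes than that cannot shorten the run time further.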
