Improving Perfect Parallelism

1 Improving Perfect Parallelism: Replacing shared resource competition with friendly cooperation. Lars Karlsson, in collaboration with Carl Christian Kjelgaard Mikkelsen and Bo Kågström

2 Who are we? Parallel and Scientific Computing group: research group directed by Prof. Bo Kågström; long experience with parallel and high-performance computing; theory, algorithms, and software; contributions to widely used software libraries. Project directed by Prof. Mats G Larson (Math dept): interdisciplinary research environment (Phys+Math+CS); physics simulations and scalable computing. HPC2N

3 Who am I? Lars Karlsson: PhD in Computing Science, Umeå University, 2011. Parallel numerical linear algebra: matrix factorizations; eigenvalue problems. Focus on efficient resource utilization: scheduling; cache blocking/tiling; data layouts and in-place conversion. Novel parallel algorithms and implementations.

4 Perfectly parallel problems: A perfectly parallel problem is one that is easy to decompose into a large number of independent tasks. Solutions exhibit perfect parallelism.

5 Some perfectly parallel problems: parameter sweep studies; database searches (e.g. BLAST); Monte Carlo simulations; uncertainty quantification; ...

6 Standard parallelization scheme: Run each task sequentially. Spread the tasks over the processors. Balance the load dynamically.
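
As a rough illustration of the standard scheme, the sketch below simulates dynamic load balancing: each task is handed to whichever processor becomes free first. The function name `dynamic_schedule` and the task durations are hypothetical, and the model assumes the processors never slow each other down, which is exactly the assumption the talk goes on to question.

```python
import heapq

def dynamic_schedule(task_times, p):
    """Simulate dynamic load balancing of independent sequential tasks.

    Each task is handed, in order, to the processor that becomes free
    first. Returns the makespan (time at which the last task finishes).
    Assumes processors do not slow each other down.
    """
    free_at = [0.0] * p          # time at which each processor is next idle
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)      # earliest idle processor
        heapq.heappush(free_at, start + t)  # busy until the task finishes
    return max(free_at)

# Hypothetical example: 12 unit-time tasks on 6 processors.
tasks = [1.0] * 12
makespan = dynamic_schedule(tasks, 6)
speedup = sum(tasks) / makespan  # equals p only in this idealized model
print(makespan, speedup)
```

In this idealized model a balanced load gives a speedup equal to the number of processors; on real multicore hardware the per-task times themselves may grow once cores share resources.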

7 The perfect speedup assumption: Perfect parallelism => perfect speedup. Hidden assumption: the N independent tasks are scheduled onto p << N independent processors. This is an invalid assumption if by processor we mean a single core within a multicore processor.

8 Some shared resources: Cores in a multicore (may) share caches, memory buses, floating point units, I/O channels, ...

9 HPC2N AMD Bulldozer processors: 4 sockets per node; 2 NUMA domains per socket; 6 cores per NUMA domain.

10 Introducing forced parallelization. Forced parallelization: parallelization for some purpose other than increasing the degree of concurrency.

11 Why forced parallelization? More main memory per task: avoid swapping to disk.

12 Why forced parallelization? More cache per task: avoid spilling over to main memory.

13 Effective cache capacity per task. (Diagram labels: L3, L2, L2, L2.)

14 Cache per task: Sequential (solo). Running solo; lots of cache. (Diagram labels: L3, L2, L2, L2, Idle.)

15 Cache per task: Standard scheme. Performing one task; competing for cache. (Diagram labels: L3 with 6-way competition; L2, L2, L2 with 2-way competition; other tasks.)

16 Cache per task: Forced parallel. Cooperating on the same task; lots and lots of cache. (Diagram labels: L3, L2, L2, L2.)
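
The three cache pictures can be turned into a back-of-the-envelope calculation. The sketch below is only a simplified model: it assumes that competing tasks split a shared cache evenly, that six cores share one L3, and that every two cores share one L2 (as the 6-way and 2-way competition labels suggest). The function name `cache_per_task` and the capacities in the example are hypothetical, not measurements from the talk.

```python
def cache_per_task(l3, l2, cores=6, cores_per_l2=2):
    """Rough effective cache capacity per task for one NUMA domain.

    l3: capacity of the shared L3; l2: capacity of one L2 (same unit).
    Assumed layout: `cores` cores share one L3 and every
    `cores_per_l2` cores share one L2. Competing tasks are assumed
    to split a shared cache evenly, which is a simplification.
    """
    n_l2 = cores // cores_per_l2
    return {
        # One task alone on the domain: whole L3 plus the L2 of its core.
        "solo": l3 + l2,
        # One task per core: L3 split among all cores, L2 among its sharers.
        "standard": l3 / cores + l2 / cores_per_l2,
        # All cores cooperate on one task: whole L3 plus every L2.
        "forced": l3 + n_l2 * l2,
    }

# Hypothetical capacities in MB, chosen only for illustration.
print(cache_per_task(l3=6, l2=2))
```

Under these assumptions the forced-parallel task sees more cache than even the solo run, because it also gains the L2 caches of the cooperating cores.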

17 Sample applications. Copy benchmark: copy memory back and forth. Spike workload: solve heat equations using a direct tridiagonal solver. Radix sort workload: sort lists of integers using radix sort. Power method workload: compute the dominant eigenvector of sparse matrices.

18 Copy: Copying back and forth. Best case scenario.

19 Copy: Copying back and forth.

20 Spike: Fast banded solver. Heat equation; many initial conditions.
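
To make the workload concrete, here is a minimal sketch of a sequential direct tridiagonal solver, the classical Thomas algorithm. It is not the Spike solver used in the talk, only an illustration of the kind of system that an implicit time step of a one-dimensional heat equation produces. The function name `solve_tridiagonal` and the example data are hypothetical.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    No pivoting: assumes a diagonally dominant matrix.
    """
    n = len(b)
    cp = [0.0] * n   # modified super-diagonal
    dp = [0.0] * n   # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):            # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Hypothetical 3x3 system; each right-hand side is one independent task.
a, b, c = [0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0]
right_hand_sides = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0]]
solutions = [solve_tridiagonal(a, b, c, d) for d in right_hand_sides]
```

With many initial conditions the same matrix is solved against many right-hand sides, so each solve is a natural independent task.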

21 Spike: Fast banded solver. (Plot labels: Speedup, Slowdown.)

22 Radix sort: Sorting integer lists. Permute list; repeat.
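
The "permute list; repeat" loop can be sketched as a least-significant-digit radix sort. This is a generic textbook version for non-negative integers, not the implementation benchmarked in the talk; the name `radix_sort` and the 8-bit digit size are assumptions made for illustration.

```python
def radix_sort(values, bits_per_pass=8):
    """LSD radix sort for non-negative integers (returns a new list)."""
    if not values:
        return []
    radix = 1 << bits_per_pass
    mask = radix - 1
    out = list(values)
    max_value = max(out)
    shift = 0
    while (max_value >> shift) > 0:
        # Count how many keys fall into each bucket for this digit.
        counts = [0] * radix
        for v in out:
            counts[(v >> shift) & mask] += 1
        # Prefix sums give the first output position of each bucket.
        starts = [0] * radix
        pos = 0
        for digit in range(radix):
            starts[digit] = pos
            pos += counts[digit]
        # Stable permutation of the list by the current digit.
        permuted = [0] * len(out)
        for v in out:
            digit = (v >> shift) & mask
            permuted[starts[digit]] = v
            starts[digit] += 1
        out = permuted
        shift += bits_per_pass   # repeat with the next digit
    return out
```

Each pass reads the whole list and writes it back in a scattered order, which is why the amount of cache available per task matters for this workload.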

23 Radix sort: Sorting integer lists. (Plot labels: Speedup, Slowdown.)

24 Power method: Sparse y = Ax. v := A*v; v := v / ||v||; repeat.
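
A minimal sketch of this iteration follows, with the sparse matrix stored in compressed sparse row (CSR) form. The slides do not say which norm or storage format was used, so the Euclidean norm, the CSR layout, the fixed iteration count, and the names `csr_matvec` and `power_method` are all assumptions.

```python
import math

def csr_matvec(indptr, indices, data, v):
    """Sparse matrix-vector product y = A*v with A in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * v[indices[k]]
    return y

def power_method(indptr, indices, data, v, iterations=100):
    """Repeat v := A*v; v := v / ||v|| for a fixed number of iterations.

    No convergence test; assumes A*v never becomes the zero vector.
    """
    for _ in range(iterations):
        v = csr_matvec(indptr, indices, data, v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Hypothetical 2x2 example: A = diag(2, 1), dominant eigenvector (1, 0).
v = power_method([0, 1, 2], [0, 1], [2.0, 1.0], [1.0, 1.0])
```

Every iteration streams through the whole matrix again, so a matrix that fits in the combined cache of the cooperating cores can be reused without going back to main memory.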

25 Power method: Sparse y = Ax. (Plot labels: Speedup, Slowdown.)

26 Connection to superlinear speedup. Superlinear speedup: scale one problem over many processors; more cache available to solve the problem. Forced parallelization: resource competition => less cache per task; scale one task over a few processors; more cache available per task; superlinear speedup of each task.

27 Connection to superlinear speedup. T1 = time per task using the standard scheme; T6 = time per task using 6 cores/task. Forced parallelization being faster means that T1 > 6*T6, or equivalently T1 / T6 > 6. Hence, superlinear speedup.
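
The inequality can be read as a throughput comparison on one 6-core NUMA domain. The derivation below assumes that the standard scheme keeps 6 tasks running at once, each taking $T_1$, while forced parallelization runs one task at a time, each taking $T_6$:

```latex
\[
\underbrace{\frac{1}{T_6}}_{\text{forced: tasks per unit time}}
\;>\;
\underbrace{\frac{6}{T_1}}_{\text{standard: tasks per unit time}}
\quad\Longleftrightarrow\quad
T_1 > 6\,T_6
\quad\Longleftrightarrow\quad
\frac{T_1}{T_6} > 6 .
\]
```

Note that $T_1$ is measured while the other five cores run competing tasks, so the speedup is relative to a contended run rather than a solo run.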

28 Superlinear speedup. John L. Gustafson (1990): "However, tiered memory can make performance increase instead of decrease as problem size per processor shrinks, and workload can shift to routines with higher speed as the problem is scaled. Superlinear speedup results in such cases."

29 Conflicting results? From Parallel Computing (1986)

30 Title page; Abstract

32 Superlinear speedup: Conclusion. Superlinear speedups proved for large-scale problems. Forced parallelization relies on superlinear speedup, but only at small scales (a few cores).

33 Guidelines. 1. Is there any resource competition? Run sequentially and in parallel on one processor. Less than perfect speedup? 2. Is there any cache reuse in the tasks? If not, then additional cache will not help. 3. Can tasks be parallelized easily? If not, then too much overhead. 4. If YES to all of the above, consider forced parallelization.
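
The checklist can be written down as a small decision helper. This is one reading of the guidelines, not a procedure given in the talk: the first question is interpreted as "does a task take noticeably longer when every core runs its own task than when it runs alone", and the function name, its parameters, and the 5% tolerance are all hypothetical.

```python
def consider_forced_parallelization(t_solo, t_contended,
                                    has_cache_reuse, easy_to_parallelize,
                                    tolerance=0.05):
    """Apply the three guideline questions; all must be answered yes.

    t_solo: time of one task run alone on the processor.
    t_contended: time per task when every core runs its own task.
    tolerance: hypothetical threshold for calling a slowdown significant.
    """
    # Question 1: less than perfect speedup means tasks slow each other down.
    competition = t_contended > t_solo * (1.0 + tolerance)
    # Questions 2 and 3 are judgments supplied by the user.
    return competition and has_cache_reuse and easy_to_parallelize

# Hypothetical timings: a task takes 1.0 s alone and 1.5 s under competition.
print(consider_forced_parallelization(1.0, 1.5, True, True))
```

A positive answer only says that forced parallelization is worth trying; whether it pays off still has to be measured.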

34 Take home message. Perfect parallelism != perfect speedup. Cores compete for resources. Consider forced parallelization.
