Scientific Computing Programming with Parallel Objects


1 Scientific Computing Programming with Parallel Objects
Esteban Meneses, PhD
School of Computing, Costa Rica Institute of Technology

2 Parallel Architectures Galore
Personal Computing, Embedded Computing, Mobile Computing, Supercomputing
Moore's Law, Dennard Scaling

3 My Parallel Laptop
Processor (multicore): Intel Core i7, 4 cores, 2.5 GHz, 160 GFLOPs.
Accelerator (manycore): NVIDIA GeForce GT 750M, 384 cores, 967 MHz, GFLOPs.

4 It's movie time! Heat Transfer Problem

5 Speedup: Heat Transfer Problem
[Figure: time (seconds) and speedup for the sequential, multicore, and manycore versions.]

6 Supercomputer
IBM BlueGene/L Architecture

7 Top500
Source: (June 2015)

8 Exascale
Big Data, Big Network (Internet of Things), Big Intelligence (Deep Learning), Big Compute (Exascale)
Challenges: heterogeneity, low resilience, thermal variation, irregular computation, programmability.
Source: (June 2015)

9 Single Program Multiple Data (SPMD)
Sequential versus parallel (message passing with MPI): CPUs exchange data through send and receive.
Parallel program = data decomposition + communication.
Poor functional decomposition. Synchronized communication.
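The data decomposition half of SPMD can be illustrated with a minimal sketch. The helper names `block_range` and `decompose` are assumptions made for illustration; the code makes no MPI calls and only shows how every rank, running the same program, could compute its own slice of the data.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open index range [start, stop) owned by `rank`.

    Items are split as evenly as possible; the first `n_items % n_ranks`
    ranks each receive one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def decompose(n_items, n_ranks):
    """List the index range of every rank, in rank order."""
    return [block_range(n_items, n_ranks, r) for r in range(n_ranks)]
```

In a real SPMD program each rank would call `block_range` with its own rank number and then exchange boundary data with its neighbors through send and receive.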

10 Parallel Objects
Flexible distribution. Non-blocking communication operations. Entities and interactions. Asynchronous communication.
[Figure: NAMD. Source: Charm++]

11 Parallel Objects Model
An application is decomposed into wudus (work and data units).
Objects are reactive entities: each exposes an interface of remote methods.
All message-passing operations are nonblocking: asynchronous method invocation.
Execution is message-driven, similar to Active Messages.
Objects know how to serialize and deserialize themselves through the pack/unpack (PUP) framework.
Goals: latency hiding, load balancing, adaptivity.
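The message-driven model above can be sketched in a few lines of plain Python. This is a toy, single-process illustration under stated assumptions: the `Scheduler` and `Counter` classes and their method names are hypothetical and are not part of Charm++ or any other library.

```python
from collections import deque


class Scheduler:
    """Toy message scheduler: a FIFO queue of pending method invocations."""

    def __init__(self):
        self.queue = deque()

    def send(self, target, method, *args):
        # Nonblocking: the call is only enqueued; the caller continues at once.
        self.queue.append((target, method, args))

    def run(self):
        # Message-driven loop: deliver one message at a time until none remain.
        delivered = 0
        while self.queue:
            target, method, args = self.queue.popleft()
            getattr(target, method)(*args)
            delivered += 1
        return delivered


class Counter:
    """Reactive object: it only does work when one of its methods is invoked."""

    def __init__(self, scheduler, name):
        self.scheduler = scheduler
        self.name = name
        self.total = 0
        self.log = []

    def add(self, value, reply_to):
        self.total += value
        # Reply asynchronously instead of returning a value to the sender.
        self.scheduler.send(reply_to, "done", self.name, self.total)

    def done(self, sender, total):
        self.log.append((sender, total))
```

A caller would create a `Scheduler`, create a few `Counter` objects, enqueue invocations with `send`, and then call `run`. Because no sender ever waits for a reply, a real runtime is free to overlap communication with other pending work, which is the latency-hiding goal listed on the slide.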

12 Introspective Runtime System
A thin layer between the application and the machine.
Based on object-based overdecomposition: many more objects than processing entities.
Components: message scheduler, routing tables, load and communication monitoring.
[Diagram: objects placed on Node A, Node B, Node C, and Node D by the Adaptive Runtime System.]

13 Migration
The underlying system consists of a collection of processing entities (processors or nodes).
Objects are distributed among the processing entities, and that assignment may change dynamically if load imbalance arises.
An introspective runtime system detects performance bottlenecks and balances load by moving objects between processing entities.
[Diagram: objects redistributed across Node A, Node B, Node C, and Node D.]
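Migration combines two ideas from the preceding slides: objects that can serialize themselves, and a routing table that records where each object currently lives. The sketch below is a toy model under stated assumptions: `Chunk` and `Runtime` are hypothetical names, and Python's `pickle` stands in for the PUP framework rather than reproducing it.

```python
import pickle


class Chunk:
    """A migratable object holding some state."""

    def __init__(self, ident, values):
        self.ident = ident
        self.values = values

    def pack(self):
        # Serialize the object's state into bytes (stand-in for PUP packing).
        return pickle.dumps({"ident": self.ident, "values": self.values})

    @classmethod
    def unpack(cls, payload):
        # Rebuild an equivalent object from the packed bytes.
        state = pickle.loads(payload)
        return cls(state["ident"], state["values"])


class Runtime:
    """Toy runtime: per-node object tables plus a routing table."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.location = {}  # routing table: object id -> node name

    def place(self, obj, node):
        self.nodes[node][obj.ident] = obj
        self.location[obj.ident] = node

    def migrate(self, ident, destination):
        source = self.location[ident]
        payload = self.nodes[source].pop(ident).pack()  # leave the source node
        self.place(Chunk.unpack(payload), destination)  # arrive at destination

    def lookup(self, ident):
        # Callers address objects by id; the runtime resolves the current node.
        return self.nodes[self.location[ident]][ident]
```

Since callers refer to objects by identifier rather than by node, a migration only has to update the routing table; application code is unaffected.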

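14 Dynamic Load Balancing
NP-complete problem.
The runtime system collects load information and the communication graph.
Strategies: greedy heuristics, graph partitioning.
The runtime system moves objects between processing entities to avoid overloading any of them.
Principle of persistence: recent load and communication patterns tend to persist, so past measurements predict the near future.
Migration relies on the PUP framework.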
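One of the greedy strategies mentioned above can be sketched as follows. This is a generic heaviest-first heuristic, not the implementation of any particular Charm++ load balancer; the function name `greedy_assign` is an assumption, and the sketch uses only measured loads and ignores the communication graph.

```python
import heapq


def greedy_assign(loads, n_procs):
    """Assign objects to processors, heaviest first, to the least-loaded one.

    `loads` maps object id -> measured load. Returns (assignment, proc_loads).
    """
    # Min-heap of (current load, processor index).
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(loads.items(), key=lambda kv: kv[1], reverse=True):
        proc_load, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    # Recompute the per-processor totals from the final assignment.
    proc_loads = [0.0] * n_procs
    for obj, proc in assignment.items():
        proc_loads[proc] += loads[obj]
    return assignment, proc_loads
```

The principle of persistence is what makes the input meaningful: loads measured over recent iterations are used as the prediction for the next ones.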

15 Charm++
Actively developed since the mid-1990s.
Features language extensions, network layers, load balancers, tools, and several applications.
Objects are called chares.
Chare arrays are the main collection of objects.
Source:

16 Charm++ (cont.) Source:

17 Charm++ (cont.) Source:

18 Charm++ Runtime System Source:

19 MPI vs Charm++

Feature                     MPI     Charm++
Over-decomposition          No*     Yes
Load Balancing              No*     Yes
Fault Tolerance             No*     Yes
Non-blocking Collectives    Yes**   Yes
Dynamic Adaptivity          No      Yes
Introspection               No      Yes
Wide Adoption               Yes     No

* Some third-party libraries may implement this feature.
** As of the MPI-3 standard.

20 Example: Heat Transfer Problem Source:

21 Example: Heat Transfer Problem Source:
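The code shown on these two slides did not survive extraction. As a point of reference only, the following is a minimal sequential sketch of a heat transfer computation, assuming the usual formulation: steady-state heat diffusion on a 2D grid solved by Jacobi iteration with fixed boundary temperatures. It is not the Charm++ code from the slides, and the function names are hypothetical.

```python
def jacobi_step(grid):
    """One Jacobi sweep: each interior cell becomes the mean of its 4 neighbors.

    Boundary cells are held fixed (Dirichlet boundary conditions).
    Returns the new grid and the largest change in any cell.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    max_change = 0.0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
            max_change = max(max_change, abs(new[i][j] - grid[i][j]))
    return new, max_change


def solve(grid, tolerance=1e-6, max_iterations=10_000):
    """Iterate until the largest per-cell change drops below `tolerance`."""
    for iteration in range(1, max_iterations + 1):
        grid, change = jacobi_step(grid)
        if change < tolerance:
            break
    return grid, iteration


def hot_top_plate(n, hot=100.0):
    """n x n plate at 0 degrees whose top edge is held at `hot`."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [hot] * n
    return grid
```

In a parallel-objects version, the grid would be split into many more blocks than processors, and each block would exchange its boundary rows and columns with its neighbors through asynchronous method invocations.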

22 Computational Fluid Dynamics

# Grids    # Particles   # Species   Required memory (GB)   GFLOP per iteration   # Iterations   Serial run time (at 1 GFLOP/s)
10^6       6 x 10^6      9           1.69                   29.5                  60,000         20.5 days
10^6       6 x 10^6      19          2.48                   90.7                  60,000         63 days
5 x 10^6   50 x 10^6     19          24.0                   544.7                 220,000        3.8 years
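The serial run times in the table follow directly from the other columns: GFLOP per iteration times the number of iterations, divided by the sustained rate of 1 GFLOP/s. A short sketch of that arithmetic, with a hypothetical helper name and assuming a 365-day year:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def serial_runtime_days(gflop_per_iteration, iterations, gflops_per_second=1.0):
    """Serial run time in days for a fixed cost per iteration."""
    seconds = gflop_per_iteration * iterations / gflops_per_second
    return seconds / SECONDS_PER_DAY


# (GFLOP per iteration, number of iterations) for the three rows of the table.
cases = [(29.5, 60_000), (90.7, 60_000), (544.7, 220_000)]
for gflop, iterations in cases:
    days = serial_runtime_days(gflop, iterations)
    print(f"{gflop:6.1f} GFLOP x {iterations:7,d} iterations: "
          f"{days:7.1f} days ({days / 365:.1f} years)")
```

The three rows come out at about 20.5 days, 63 days, and 3.8 years, matching the table.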

23 IPLMCFD
IPLMCFD: Irregularly Portioned Lagrangian Monte Carlo Finite Difference.
A massively parallel solver for turbulent reactive flows.
Large eddy simulation (LES) via the filtered density function (FDF).

24 Load Imbalance
IPLMCFD uses a graph partitioning library (METIS) to redistribute work.
Execution must be split into segments separated by calls that repartition the cells.

25 IPLMCFD
Goals: load-balance processors through weighted graph partitioning while minimizing the edge-cut.
Irregularly shaped decompositions:
Disadvantages: nontrivial communication patterns; increased communication cost.
Advantage (major): evenly distributed load among partitions.
P. H. Pisciuneri et al., SIAM J. Sci. Comput., vol. 35, no. 4, pp. C438-C452 (2013).
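The two quantities traded off above, edge-cut and load balance, are easy to state in code. The sketch below only evaluates a given partition; it does not call METIS or compute a partition itself, and the function names are hypothetical.

```python
def edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different partitions.

    `edges` is a list of (u, v, weight); `part` maps vertex -> partition id.
    """
    return sum(w for u, v, w in edges if part[u] != part[v])


def imbalance(vertex_weights, part, n_parts):
    """Ratio of the heaviest partition's load to the average load (1.0 is ideal)."""
    loads = [0.0] * n_parts
    for vertex, weight in vertex_weights.items():
        loads[part[vertex]] += weight
    return max(loads) / (sum(loads) / n_parts)
```

As a made-up example, take four cells in a chain with unit-weight edges where cell 0 costs 3 units of work and the others cost 1. The regular split {0, 1} | {2, 3} has an edge-cut of 1 but leaves one partition with twice the load of the other, while the irregular split {0} | {1, 2, 3} has the same edge-cut and perfectly even load.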

26 Simulation of a Premixed Flame

27 Performance of IPLMCFD
T_Unbalanced - T_IPLMCFD = 30 hours

28 Cost of Repartitioning
O(10^2) to O(10^3) iterations

29 HPC Languages
HPF, UPC, Fortran, C/C++, CAF, Chapel, Python

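30 Parallel Objects in Python
[Diagram: Patch i and Patch j interact through Compute (i,j); the objects are placed on Node X, Node Y, and Node Z.]

    # i, j, computes, and patches are left undefined, as on the slide.
    class Patch:
        particles = ...

        def send(self):
            # Forward this patch's particles to the shared compute object.
            computes[i, j].recv(self.particles)

        def update(self, part_info):
            ...

    class Compute:
        def recv(self, particles):
            ...
            patches[i].update(part_info)
            patches[j].update(part_info)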

31 Acknowledgments
University of Illinois: Prof. Laxmikant V. Kalé (Computer Science)
University of Pittsburgh: Dr. Patrick Pisciuneri (Center for Simulation and Modeling), Prof. Peyman Givi (Mechanical Engineering)
Images extracted from Wikipedia and

32 Conclusions
Big potential of parallel objects in scientific computing: simplified programming model, improved performance due to overdecomposition, dynamic load balancing.
Research opportunity: parallel-objects abstractions in Python.
Thank you!
