Scientific Computing Programming with Parallel Objects Esteban Meneses, PhD School of Computing, Costa Rica Institute of Technology
Parallel Architectures Galore: Personal Computing, Embedded Computing, Mobile Computing, Supercomputing. Moore's Law, Dennard Scaling.
My Parallel Laptop
Processor (multicore): Intel Core i7, 4 cores, 2.5 GHz, 160 GFLOPS
Accelerator (manycore): NVIDIA GeForce GT 750M, 384 cores, 967 MHz, 742.7 GFLOPS
It's movie time! Heat Transfer Problem
Speedup: Heat Transfer Problem
Time (seconds): Sequential 32.674, Multicore 8.83, Manycore 0.475
Speedup (sequential time / parallel time): Sequential 1, Multicore 3.7, Manycore 68.78
Supercomputer IBM BlueGene/L Architecture
Top500 Source: http://www.top500.org (June 2015)
Exascale: Big Data, Big Network (Internet of Things), Big Intelligence (Deep Learning), Big Compute (Exascale). Challenges: heterogeneity, low resilience, thermal variation, irregular computation, programmability. Source: http://www.top500.org (June 2015)
Single Program Multiple Data (SPMD): a sequential program plus message passing (send/receive) runs on every CPU; MPI is the standard realization. Parallelization = data decomposition + communication. Drawbacks: poor functional decomposition, synchronized communication.
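A minimal SPMD sketch of the send/receive style described above, assuming mpi4py is available; the array size and the ring-style neighbor exchange are illustrative, not taken from the slides:

    # SPMD: every rank runs this same script; ranks differ only in their data.
    # Requires mpi4py; run with, e.g., "mpiexec -n 4 python spmd_sketch.py".
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Data decomposition: each rank owns one block of the global array.
    local = np.full(100, float(rank))

    # Synchronized communication: exchange one boundary value with neighbors
    # in a ring (Sendrecv avoids the deadlock a plain Send/Recv pair can cause).
    right = (rank + 1) % size
    left = (rank - 1) % size
    ghost = np.empty(1)
    comm.Sendrecv(local[-1:], dest=right, recvbuf=ghost, source=left)
    print(f"rank {rank}: got boundary value {ghost[0]} from rank {left}")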
Parallel Objects: entities and interactions, flexible distribution, non-blocking communication operations, asynchronous communication. Examples: Charm++, NAMD. Source: http://charm.cs.illinois.edu
Parallel Objects Model: An application is decomposed into wudus (work and data units). Objects are reactive entities with an interface of remote methods. All message-passing operations are non-blocking: asynchronous method invocation. Execution is message-driven, similar to Active Messages. Objects know how to serialize and deserialize themselves, also called the pack-unpack (PUP) framework. Goals: latency hiding, load balancing, adaptivity.
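A small, hypothetical sketch of what such a reactive object could look like in Python (class and method names are illustrative, not the Charm++ API): its public methods react to incoming messages without blocking the sender, and a pup method lets the runtime pack the object's state for messaging or migration.

    # Illustrative "wudu": a reactive object with remote-style methods and a
    # pack/unpack (PUP) hook.  Names are hypothetical, not the Charm++ API.
    import pickle

    class Wudu:
        def __init__(self, data):
            self.data = data
            self.inbox = []            # messages delivered by the runtime

        def recv(self, msg):
            # Invoked asynchronously by the runtime; the sender never waits.
            self.inbox.append(msg)

        def pup(self):
            # Pack the object's state so the runtime can ship or migrate it.
            return pickle.dumps({"data": self.data, "inbox": self.inbox})

        @classmethod
        def unpup(cls, blob):
            # Rebuild the object from its packed state on the receiving side.
            state = pickle.loads(blob)
            obj = cls(state["data"])
            obj.inbox = state["inbox"]
            return obj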
Introspective Runtime System: A thin layer between the application and the machine. Based on object-based overdecomposition: many more objects than processing entities. Components: message scheduler, routing tables, load and communication monitoring. (Figure: the adaptive runtime system distributes objects across Nodes A-D.)
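A toy sketch of the message-driven execution these components support, assuming each processing entity runs a loop like the one below; the class name and bookkeeping are illustrative, not the actual runtime:

    # Toy per-node runtime: a routing table of local objects, a message queue,
    # and per-object load counters that a load balancer could later inspect.
    from collections import deque, defaultdict
    import time

    class ToyRuntime:
        def __init__(self):
            self.objects = {}                 # routing table: object id -> object
            self.queue = deque()              # pending (id, method, payload) messages
            self.load = defaultdict(float)    # measured CPU time per object

        def register(self, oid, obj):
            self.objects[oid] = obj

        def send(self, oid, method, payload):
            self.queue.append((oid, method, payload))   # non-blocking

        def run(self):
            # Message scheduler: deliver messages one by one, timing each handler.
            while self.queue:
                oid, method, payload = self.queue.popleft()
                start = time.perf_counter()
                getattr(self.objects[oid], method)(payload)
                self.load[oid] += time.perf_counter() - start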
Migration: The underlying system consists of a collection of processing entities (processors, or nodes). Objects are distributed among the processing entities. That assignment may change dynamically if load imbalance arises. An introspective runtime system detects performance bottlenecks and balances load by moving objects around. (Figure: objects redistributed across Nodes A-D.)
Dynamic Load Balancing: In general an NP-complete problem. The runtime system collects load information and the communication graph. Strategies: greedy heuristics, graph partitioning. The runtime system shuffles objects around to avoid overloading any processing entity. Principle of persistence: recent load and communication patterns predict future ones. Migration relies on the PUP framework.
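A minimal sketch of a greedy strategy of the kind mentioned above, assuming the runtime has already measured a load per object (principle of persistence): the heaviest remaining object goes to the currently least-loaded processing entity. Function and variable names are illustrative.

    # Greedy load balancing sketch: heaviest object first onto the least-loaded PE.
    import heapq

    def greedy_balance(object_loads, num_pes):
        # object_loads: dict object id -> measured load; returns object id -> PE.
        pes = [(0.0, pe) for pe in range(num_pes)]   # min-heap of (load, PE id)
        heapq.heapify(pes)
        assignment = {}
        for oid, load in sorted(object_loads.items(), key=lambda kv: kv[1], reverse=True):
            pe_load, pe = heapq.heappop(pes)
            assignment[oid] = pe
            heapq.heappush(pes, (pe_load + load, pe))
        return assignment

    # Example: 8 objects with assorted loads mapped onto 3 PEs.
    print(greedy_balance({i: (i % 4) + 1.0 for i in range(8)}, 3))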
Charm++: Actively developed since the mid-1990s. Features language extensions, network layers, load balancers, tools, and several applications. Objects are called chares. Chare arrays are the main collection of objects. Source: http://charm.cs.illinois.edu
Charm++ (cont.) Source: http://charm.cs.illinois.edu
Charm++ Runtime System Source: http://charm.cs.illinois.edu
MPI vs Charm++
                          MPI     Charm++
Over-decomposition        No*     Yes
Load Balancing            No*     Yes
Fault Tolerance           No*     Yes
Non-blocking Collectives  Yes**   Yes
Dynamic Adaptivity        No      Yes
Introspection             No      Yes
Wide Adoption             Yes     No
* Some third-party libraries may implement this feature.
** As of the MPI-3 standard.
Example: Heat Transfer Problem. Source: http://charm.cs.illinois.edu
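The example itself appears only as figures in the slides; below is a minimal sequential sketch of the Jacobi-style stencil such a heat transfer example typically iterates (grid size, boundary values, and iteration count are illustrative). In the parallel-objects version, the grid would be over-decomposed into many patch objects that exchange boundary rows asynchronously.

    # Sequential Jacobi sketch of 2D heat diffusion: each interior cell is
    # repeatedly replaced by the average of its four neighbors.
    import numpy as np

    n, steps = 128, 500
    grid = np.zeros((n, n))
    grid[0, :] = 100.0          # hot top boundary, everything else cold

    for _ in range(steps):
        new = grid.copy()
        new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])
        grid = new

    print("mean temperature:", grid.mean())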
Computational Fluid Dynamics
Grids      Particles   Species  Required Memory (GB)  GFLOP per iteration  Iterations  Serial run-time (1 GFLOP/s)
10^6       6 x 10^6    9        1.69                  29.5                 60,000      20.5 days
10^6       6 x 10^6    19       2.48                  90.7                 60,000      63 days
5 x 10^6   50 x 10^6   19       24.0                  544.7                220,000     3.8 years
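The last column follows from simple arithmetic: serial run-time ≈ (GFLOP per iteration × number of iterations) / 1 GFLOP/s. A quick check of the three rows:

    # Serial run-time = GFLOP/iteration * iterations / (1 GFLOP/s machine)
    rows = [(29.5, 60_000), (90.7, 60_000), (544.7, 220_000)]
    for gflop_per_iter, iters in rows:
        seconds = gflop_per_iter * iters / 1.0
        print(f"{seconds / 86_400:.1f} days ({seconds / (365 * 86_400):.2f} years)")
    # Prints roughly 20.5 days, 63.0 days, and 1387 days (3.80 years),
    # matching the table above.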
IPLMCFD: Irregularly Portioned Lagrangian Monte Carlo Finite Difference. A massively parallel solver for turbulent reactive flows. Large eddy simulation (LES) via the filtered density function (FDF).
Load Imbalance: IPLMCFD uses a graph-partitioning library (METIS) to redistribute work. This requires splitting the execution between calls that repartition the cells.
IPLMCFD Goals: load-balance the processors through weighted graph partitioning while minimizing the edge-cut. Irregularly shaped decompositions. Disadvantages: nontrivial communication patterns, increased communication cost. Major advantage: evenly distributed load among partitions. P. H. Pisciuneri et al., SIAM J. Sci. Comput., vol. 35, no. 4, pp. C438-C452 (2013).
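The two quantities being traded off can be made concrete. Given an assignment of cells (graph nodes) to partitions, the edge-cut sums the weights of edges that cross partitions (a proxy for communication), and the imbalance compares the heaviest partition to the average. A small sketch with an illustrative four-node graph:

    # Edge-cut and load-imbalance metrics for a given partition assignment.
    from collections import defaultdict

    def edge_cut(edges, part):
        # edges: iterable of (u, v, weight); part: dict node -> partition id.
        return sum(w for u, v, w in edges if part[u] != part[v])

    def imbalance(node_weights, part):
        # Max partition weight divided by the average partition weight.
        loads = defaultdict(float)
        for node, w in node_weights.items():
            loads[part[node]] += w
        return max(loads.values()) / (sum(loads.values()) / len(loads))

    edges = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 2.0), (3, 0, 1.0)]
    weights = {0: 3.0, 1: 1.0, 2: 3.0, 3: 1.0}
    part = {0: 0, 1: 0, 2: 1, 3: 1}
    print(edge_cut(edges, part), imbalance(weights, part))   # 2.0 1.0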
Simulation of a Premixed Flame
Performance of IPLMCFD: T_unbalanced - T_IPLMCFD = 30 hours
Cost of Repartitioning: O(10^2)-O(10^3) iterations
HPC Languages: HPF, UPC, Fortran, C/C++, CAF, Chapel, Python
Parallel Objects in Python: patches i and j interact through a Compute (i, j) object; the objects are distributed across nodes (Node X, Node Y, Node Z).

    class Patch:
        particles = ...
        def send(self):
            computes[i, j].recv(self.particles)
        def update(self, part_info):
            ...

    class Compute:
        def recv(self, particles):
            ...
            patches[i].update(part_info)
            patches[j].update(part_info)
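A self-contained, single-process variant of this sketch is given below, with the indexing and a tiny driver filled in so it can actually run; everything here is hypothetical illustration. A real parallel-objects runtime would place the Patch and Compute objects on different nodes and deliver the recv/update calls asynchronously.

    # Hypothetical, runnable variant of the Patch/Compute sketch on one process.
    patches = {}
    computes = {}

    class Patch:
        def __init__(self, idx, particles):
            self.idx, self.particles = idx, particles
        def send(self, other):
            # Send this patch's particles to the Compute shared with patch `other`.
            key = tuple(sorted((self.idx, other)))
            computes[key].recv(self.idx, self.particles)
        def update(self, part_info):
            self.particles = part_info

    class Compute:
        def __init__(self, i, j):
            self.i, self.j = i, j
            self.buffer = {}
        def recv(self, src, particles):
            self.buffer[src] = particles
            if len(self.buffer) == 2:        # both patches have sent their particles
                part_info = self.buffer[self.i] + self.buffer[self.j]
                patches[self.i].update(part_info)
                patches[self.j].update(part_info)

    patches[0], patches[1] = Patch(0, [1.0]), Patch(1, [2.0])
    computes[0, 1] = Compute(0, 1)
    patches[0].send(1)
    patches[1].send(0)
    print(patches[0].particles)   # [1.0, 2.0] after the exchange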
Acknowledgments
University of Illinois: Prof. Laxmikant V. Kalé (Computer Science)
University of Pittsburgh: Dr. Patrick Pisciuneri (Center for Simulation and Modeling), Prof. Peyman Givi (Mechanical Engineering)
Images extracted from Wikipedia, www.defenceindustrydaily.com, www.maclife.com, www.theregister.co.uk, and www.geforce.com
Conclusions: Big potential of parallel objects in scientific computing: simplified programming model, improved performance due to overdecomposition, dynamic load balancing. Research opportunity: parallel-objects abstractions in Python. Thank you! esteban.meneses@acm.org, www.emeneses.org