Scientific Computing Programming with Parallel Objects


Scientific Computing Programming with Parallel Objects
Esteban Meneses, PhD
School of Computing, Costa Rica Institute of Technology

Parallel Architectures Galore: Personal Computing, Embedded Computing, Mobile Computing, Supercomputing; Moore's Law, Dennard Scaling 2

My Parallel Laptop

                   Processor (multicore)   Accelerator (manycore)
Model              Intel Core i7           NVIDIA GeForce GT 750M
Cores              4                       384
Clock              2.5 GHz                 967 MHz
Peak performance   160 GFLOPs              742.7 GFLOPs
3
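The peak-performance numbers are consistent with a simple cores x clock x FLOPs-per-cycle estimate. A quick sketch; the FLOPs-per-cycle figures (16 single-precision FLOPs per cycle for the CPU with AVX and FMA, 2 per CUDA core) are assumptions, not stated on the slide:

```python
# Peak single-precision throughput: cores x clock (GHz) x FLOPs per cycle.
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

cpu = peak_gflops(4, 2.5, 16)      # Core i7: assumes AVX + FMA, 16 SP FLOPs/cycle
gpu = peak_gflops(384, 0.967, 2)   # GT 750M: fused multiply-add, 2 FLOPs/cycle
print(cpu, round(gpu, 1))  # 160.0 742.7
```

Both results match the slide's 160 and 742.7 GFLOPs.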

It's movie time! Heat Transfer Problem 4

Speedup, Heat Transfer Problem

             Time (seconds)   Speedup
Sequential   32.674           1
Multicore    8.83             3.7
Manycore     0.475            68.78
5
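The speedup column follows directly from the timings as T_sequential / T_parallel; a quick check of the slide's numbers:

```python
# Speedup relative to the sequential run: T_seq / T_par.
def speedup(t_seq, t_par):
    return t_seq / t_par

times = {"Sequential": 32.674, "Multicore": 8.83, "Manycore": 0.475}
for name, t_par in times.items():
    print(f"{name}: {speedup(times['Sequential'], t_par):.2f}x")
# The manycore ratio comes out near the ~68.8x reported on the slide.
```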

Supercomputer IBM BlueGene/L Architecture 6

Top500 Source: http://www.top500.org (June 2015) 7

Exascale: Big Data, Big Network (Internet of Things), Big Intelligence (Deep Learning), Big Compute (Exascale). Challenges: heterogeneity, low resilience, thermal variation, irregular computation, programmability. Source: http://www.top500.org (June 2015) 8

Single Program Multiple Data (SPMD). Sequential vs. parallel execution: CPUs exchange data through explicit send/receive message passing (MPI). Data decomposition + communication; poor functional decomposition; synchronized communication. 9
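The SPMD pattern, one program whose behavior branches on the process rank, with explicit sends and receives over decomposed data, can be sketched in plain Python. Threads and queues stand in for MPI processes and messages here; this is a toy illustration, not real MPI:

```python
# A minimal SPMD sketch: every "rank" runs the same program on its own slice
# of the data and communicates results explicitly (queues emulate messages).
import threading, queue

NRANKS = 4
data = list(range(16))                        # global data, decomposed by rank
chunks = [data[r::NRANKS] for r in range(NRANKS)]
inbox = [queue.Queue() for _ in range(NRANKS)]
results = [None] * NRANKS

def program(rank):
    local_sum = sum(chunks[rank])             # compute on the local slice
    if rank != 0:
        inbox[0].put(local_sum)               # "send" partial result to rank 0
    else:
        total = local_sum
        for _ in range(NRANKS - 1):
            total += inbox[0].get()           # "receive" from every other rank
        results[0] = total

threads = [threading.Thread(target=program, args=(r,)) for r in range(NRANKS)]
for t in threads: t.start()
for t in threads: t.join()
print(results[0])  # 120, the sum of 0..15
```

Note how every rank executes the same `program`; only the rank-dependent branch distinguishes sender from receiver, which is exactly the structure of an MPI reduction written by hand.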

Parallel Objects: flexible distribution, non-blocking communication operations, entities and interactions, asynchronous communication (NAMD, Charm++). Source: http://charm.cs.illinois.edu 10

Parallel Objects Model. An application is decomposed into wudus (work and data units). Objects are reactive entities: an interface of remote methods. All message-passing operations are non-blocking: asynchronous method invocation. Execution is message-driven, similar to Active Messages. Objects know how to serialize/deserialize themselves via the pack-unpack (PUP) framework. Goals: latency hiding, load balancing, adaptivity. 11
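The message-driven, non-blocking style described above can be illustrated with a toy scheduler. All names here are hypothetical; real Charm++ generates proxy classes and runs a far more sophisticated scheduler:

```python
# A toy message-driven scheduler: invoking a method on a remote object only
# enqueues a message; a scheduler loop later delivers it, so the caller never
# blocks. This is the essence of asynchronous method invocation.
from collections import deque

pending = deque()          # single global message queue, for brevity

def invoke(obj, method, *args):
    pending.append((obj, method, args))      # non-blocking: just enqueue

class Wudu:                # a "work and data unit"
    def __init__(self, name):
        self.name = name
        self.log = []
    def greet(self, sender):
        self.log.append(f"{self.name} got hello from {sender}")

a, b = Wudu("A"), Wudu("B")
invoke(b, "greet", "A")    # returns immediately, before b runs anything
invoke(a, "greet", "B")
while pending:             # scheduler: deliver messages until quiescence
    obj, method, args = pending.popleft()
    getattr(obj, method)(*args)
print(a.log, b.log)
```

The caller is free to keep computing between `invoke` and delivery, which is what enables the latency hiding listed among the goals.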

Introspective Runtime System. A thin layer between the application and the machine, based on object-based overdecomposition: many more objects than processing entities. Components: message scheduler, routing tables, load and communication monitoring. The adaptive runtime system maps the objects onto the nodes (Node A, Node B, Node C, Node D). 12

Migration. The underlying system consists of a collection of processing entities (processors, or nodes). Objects are distributed among the processing entities, and that assignment may change dynamically if load imbalance arises. An introspective runtime system detects performance bottlenecks and balances load by moving objects around. 13

Dynamic Load Balancing. An NP-complete problem. The runtime system collects load information and the communication graph, then shuffles objects around to avoid overloading, using greedy strategies or graph partitioning. Relies on the principle of persistence and is based on the PUP framework. 14
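One of the greedy strategies alluded to above is the classic "heaviest object to the least-loaded processor" heuristic. A minimal sketch, not Charm++'s actual balancer:

```python
# Greedy load balancing: sort objects by measured load (heaviest first) and
# always place the next object on the currently least-loaded processor,
# tracked with a min-heap keyed on processor load.
import heapq

def greedy_balance(object_loads, nprocs):
    heap = [(0.0, p) for p in range(nprocs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)  # least-loaded processor
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return assignment

loads = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 3.0, "E": 2.0, "F": 1.0}
print(greedy_balance(loads, 2))  # both processors end up with load 9.0
```

In a real runtime the per-object loads come from the monitoring component, relying on the principle of persistence: recent load is a good predictor of future load.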

Charm++. Actively developed since the mid-1990s. Features language extensions, network layers, load balancers, tools, and several applications. Objects are called chares; chare arrays are the main collection of objects. Source: http://charm.cs.illinois.edu 15

Charm++ (cont.) Source: http://charm.cs.illinois.edu 16

Charm++ (cont.) Source: http://charm.cs.illinois.edu 17

Charm++ Runtime System Source: http://charm.cs.illinois.edu 18

MPI vs Charm++

                           MPI     Charm++
Over-decomposition         No*     Yes
Load Balancing             No*     Yes
Fault Tolerance            No*     Yes
Non-blocking Collectives   Yes**   Yes
Dynamic Adaptivity         No      Yes
Introspection              No      Yes
Wide Adoption              Yes     No

* Some third-party libraries may implement this feature.
** As of the MPI-3 standard.
19

Example: Heat Transfer Problem Source: http://charm.cs.illinois.edu 20

Example: Heat Transfer Problem Source: http://charm.cs.illinois.edu 21

Computational Fluid Dynamics

# Grids    # Particles   # Species   Required Memory (GBs)   GFLOP per iteration   # Iterations   Serial run-time (1 GFLOP/s)
10^6       6 x 10^6      9           1.69                    29.5                  60,000         20.5 days
10^6       6 x 10^6      19          2.48                    90.7                  60,000         63 days
5 x 10^6   50 x 10^6     19          24.0                    544.7                 220,000        3.8 years
22
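The serial run-time column can be reproduced from the other columns: total work is GFLOP per iteration times the number of iterations, executed at 1 GFLOP/s:

```python
# Serial run-time = (GFLOP/iteration x iterations) / machine rate, in days.
def serial_days(gflop_per_iter, iterations, gflops_rate=1.0):
    seconds = gflop_per_iter * iterations / gflops_rate
    return seconds / 86400

print(round(serial_days(29.5, 60_000), 1))             # 20.5 days
print(round(serial_days(90.7, 60_000), 1))             # 63.0 days
print(round(serial_days(544.7, 220_000) / 365.25, 1))  # 3.8 years
```

All three rows of the table check out, which is what motivates the massively parallel solver introduced on the next slide.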

IPLMCFD: Irregularly Portioned Lagrangian Monte Carlo Finite Difference. A massively parallel solver for turbulent reactive flows. Large eddy simulation (LES) via the filtered density function (FDF). 23

Load Imbalance. IPLMCFD uses a graph partitioning library (METIS) to redistribute work, which requires splitting the execution between calls that repartition the cells. 24

IPLMCFD goals: load balance processors through weighted graph partitioning while minimizing the edge-cut. This yields irregularly shaped decompositions. Disadvantages: nontrivial communication patterns and increased communication cost. Major advantage: evenly distributed load among partitions. P. H. Pisciuneri et al., SIAM J. Sci. Comput., vol. 35, no. 4, pp. C438-C452 (2013). 25
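The two quantities the partitioner trades off, edge-cut and per-partition load, can be computed directly. A toy illustration on a made-up mesh graph, not IPLMCFD's actual data:

```python
# Evaluating a partition the way a graph partitioner would: the edge-cut is
# the number of edges whose endpoints land in different parts, and the load
# of a part is the sum of its vertex weights.
def edge_cut(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

def part_loads(weights, part, nparts):
    loads = [0.0] * nparts
    for v, w in weights.items():
        loads[part[v]] += w
    return loads

# A tiny mesh graph: vertices 0..5, weights = per-cell work estimates.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 3)]
weights = {0: 2.0, 1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0, 5: 1.0}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # split the chain in the middle
print(edge_cut(edges, part), part_loads(weights, part, 2))  # 2 [4.0, 4.0]
```

This partition is perfectly load-balanced at the price of a nonzero edge-cut, the same trade-off the slide describes: even load, but extra communication across cut edges.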

Simulation of a Premixed Flame 26

Performance of IPLMCFD: T_unbalanced - T_IPLMCFD = 30 hours 27

Cost of Repartitioning: O(10^2)-O(10^3) iterations 28

HPC Languages: HPF, UPC, Fortran, C/C++, CAF, Chapel, Python 29

Parallel Objects in Python. Patch i and Patch j interact through Compute (i,j), distributed across Node X, Node Y, Node Z:

    class Patch:
        particles = ...
        def send():
            computes[i,j].recv(particles)
        def update(part_info):
            ...

    class Compute:
        def recv(particles):
            patches[i].update(part_info)
            patches[j].update(part_info)
30
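The slide's pseudocode can be made runnable with a small amount of glue. A simplified sketch in which "remote" calls are dispatched directly; in a real parallel-objects runtime they would be delivered asynchronously through proxies:

```python
# A runnable version of the Patch/Compute sketch: two patches send their
# particles to a shared compute object, which computes a pairwise result
# and sends updated info back to both patches.
class Patch:
    def __init__(self, pid, particles, computes):
        self.pid, self.particles, self.computes = pid, particles, computes
        self.part_info = None
    def send(self, key):
        self.computes[key].recv(self.pid, self.particles)
    def update(self, part_info):
        self.part_info = part_info

class Compute:
    def __init__(self, i, j, patches):
        self.i, self.j, self.patches = i, j, patches
        self.buffer = {}
    def recv(self, pid, particles):
        self.buffer[pid] = particles
        if len(self.buffer) == 2:          # both patches have reported
            info = sum(sum(p) for p in self.buffer.values())
            self.patches[self.i].update(info)
            self.patches[self.j].update(info)

patches, computes = {}, {}
patches[0] = Patch(0, [1.0, 2.0], computes)
patches[1] = Patch(1, [3.0], computes)
computes[(0, 1)] = Compute(0, 1, patches)
patches[0].send((0, 1))
patches[1].send((0, 1))   # second arrival triggers the pairwise interaction
print(patches[0].part_info, patches[1].part_info)  # 6.0 6.0
```

The placeholder "interaction" here is just a sum of particle values; the point is the object structure, patches and pairwise computes addressed through indexed collections, which mirrors chare arrays in Charm++.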

Acknowledgments. University of Illinois: Prof. Laxmikant V. Kalé (Computer Science). University of Pittsburgh: Dr. Patrick Pisciuneri (Center for Simulation and Modeling), Prof. Peyman Givi (Mechanical Engineering). Images extracted from Wikipedia, www.defenceindustrydaily.com, www.maclife.com, www.theregister.co.uk, and www.geforce.com. 31

Conclusions. Parallel objects hold big potential for scientific computing: a simplified programming model, improved performance due to overdecomposition, and dynamic load balancing. Research opportunity: parallel-objects abstractions in Python. Thank you! esteban.meneses@acm.org www.emeneses.org 32