Charm++, what's that?!



Charm++, what's that?! Les Mardis du dev. François Tessier, Runtime team. October 15, 2013.

Outline: 1 Introduction, 2 Charm++, 3 Basic examples, 4 Load Balancing, 5 Conclusion


Introduction: Parallel programming

Decomposition: what to do in parallel.
Mapping: which processor executes each task.
Scheduling: the order of execution.
Machine-dependent expression: express the above decisions for the particular parallel machine.

Introduction: Scalable execution of parallel applications

Cluster computing: the number of cores keeps increasing, but the memory per core is decreasing (or increasing slowly), and applications need to communicate more and more...
[Figure: number of cores and amount of memory per core (in MB) of the top-ranked cluster in the Top500 list since 1993]

Introduction: How to communicate between nodes?

Message-passing paradigm: inter-process communication (MPI, SOAP, Charm++); a minimal sketch follows.
[Figure: communication from process 1 to process 4 over the network with a message-passing interface]
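To make the paradigm concrete, here is a minimal sketch (not from the talk) of the process-1-to-process-4 exchange from the figure, written against the standard MPI C API; the ranks, tag, and payload are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int payload = 42;                 /* illustrative message content */
  if (rank == 0) {                  /* "process 1" in the figure */
    MPI_Send(&payload, 1, MPI_INT, 3, 0, MPI_COMM_WORLD);
  } else if (rank == 3) {           /* "process 4" in the figure */
    MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 4 received %d from process 1\n", payload);
  }

  MPI_Finalize();
  return 0;
}

Run with at least four processes (e.g. mpirun -np 4). Charm++ keeps the same message-passing idea but replaces explicit ranks and receives with method calls on objects.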


Charm++

Charm++: Presentation

Developed at the PPL (Urbana-Champaign, IL).
High-level abstraction of parallel programming.
C++, Python.

How to define Charm++?
Object-oriented: based on the C++ programming language; objects called chares contain data, send and receive messages, and perform tasks.
Asynchronous message passing: chares exchange messages to communicate; messages are sent and received asynchronously with respect to code execution.
Parallel: chunks of code plus messages.
Programming paradigm: a way of writing programs; features and structures added on top of C++.

Charm++: What is a Charm++ program?

A collection of communicating chare objects, including the "main" chare (a global object space).
The number of processing units and the type of interconnect do not matter: the runtime system (RTS) takes care of them.
A message is sent by calling an entry method of another chare (the reception point); the call returns immediately from the perspective of the calling chare, as in the sketch below.
[Figure: user's view of a Charm++ application (credits: http://charm.cs.illinois.edu)]
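To illustrate that last point, a minimal sketch (not from the slides; the Worker chare and its process method are hypothetical) of what an asynchronous entry-method call looks like on the sender's side:

// Assuming a chare array declared in the .ci file as:
//   array [1D] Worker { entry Worker(); entry void process(int value); };
CProxy_Worker workers = CProxy_Worker::ckNew(8);   // create 8 chare elements

// Invoking an entry method through the proxy only sends a message:
// the call returns immediately, and the RTS delivers it to whichever
// processor currently hosts element 3.
workers[3].process(42);
// ... the caller keeps executing here without waiting for process() to run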

Charm++: Compilation and the Runtime System

First strength: it works!
The application is written as a collection of communicating objects.
Compilation and execution target a specific platform and its physical resources (the number of processors given at launch); a typical build and run is sketched below.
The RTS manages the details of the physical resources... and can take decisions about: mapping chare objects to physical processors, load-balancing chare objects, checkpointing, fault tolerance...
[Figure: compilation process for a chare class (credits: http://charm.cs.illinois.edu)]
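A concrete illustration of that compilation flow (a sketch, not from the slides; the file names are illustrative, and the Charm++ tools charmc and charmrun are assumed to be on the PATH):

charmc hello.ci          # translate the interface file; generates <module>.decl.h and <module>.def.h
charmc -o hello hello.C  # compile the chare code and link it against the Charm++ RTS
./charmrun +p4 ./hello   # launch on 4 processors; the RTS maps chares onto them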


Charm++ - Hello World!

Listing 1: Hello.h

#ifndef MAIN_H
#define MAIN_H

class Main : public CBase_Main {
public:
  Main(CkArgMsg *msg);
  Main(CkMigrateMessage *msg);
};

#endif // MAIN_H

Listing 2: Hello.ci

mainmodule main {
  mainchare Main {
    entry Main(CkArgMsg *msg);
  };
};

Listing 3: Hello.C

#include "main.decl.h"
#include "main.h"

// Entry point of the Charm++ application
Main::Main(CkArgMsg *msg) {
  // Print a message for the user
  CkPrintf("Hello World!\n");
  // Exit the application
  CkExit();
}

// Constructor needed for chare object migration (ignore for now)
// NOTE: this constructor does not need to appear in the ".ci" file
Main::Main(CkMigrateMessage *msg) { }

#include "main.def.h"

Charm++ - 1Darray

Listing 4: 1Darray.ci

mainmodule hello {
  readonly CProxy_Main mainProxy;
  readonly int nElements;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(void);
  };

  array [1D] Hello {
    entry Hello(void);
    entry void SayHi(int from);
  };
};

Listing 5: 1Darray.C

#include <stdio.h>
#include <stdlib.h>
#include "hello.decl.h"

/* readonly */ CProxy_Main mainProxy;
/* readonly */ int nElements;

/* mainchare */
class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    if (m->argc > 1) nElements = atoi(m->argv[1]);
    delete m;
    CkPrintf("Run on %d processors for %d el.\n", CkNumPes(), nElements);
    mainProxy = thisProxy;
    CProxy_Hello arr = CProxy_Hello::ckNew(nElements);
    arr[0].SayHi(-1);
  }

  void done(void) {
    CkPrintf("All done\n");
    CkExit();
  }
};

/* array [1D] */
class Hello : public CBase_Hello {
public:
  Hello() {
    CkPrintf("Hello %d created\n", thisIndex);
  }

  Hello(CkMigrateMessage *m) {}

  void SayHi(int from) {
    CkPrintf("\"Hello\" from Hello chare # %d on processor %d (told by %d).\n",
             thisIndex, CkMyPe(), from);
    if (thisIndex < nElements - 1)
      thisProxy[thisIndex + 1].SayHi(thisIndex);   // pass the greeting to the next element
    else
      mainProxy.done();                            // Done!
  }
};

#include "hello.def.h"

Charm++ - 1Darray

$ ./charmrun +p3 ./hello 10
Running "Hello World" with 10 elements using 3 processors.
"Hello" from Hello chare # 0 on processor 0 (told by -1).
"Hello" from Hello chare # 1 on processor 1 (told by 0).
"Hello" from Hello chare # 2 on processor 2 (told by 1).
"Hello" from Hello chare # 3 on processor 0 (told by 2).
"Hello" from Hello chare # 4 on processor 1 (told by 3).
"Hello" from Hello chare # 5 on processor 2 (told by 4).
"Hello" from Hello chare # 6 on processor 0 (told by 5).
"Hello" from Hello chare # 7 on processor 1 (told by 6).
"Hello" from Hello chare # 9 on processor 0 (told by 8).
"Hello" from Hello chare # 8 on processor 2 (told by 7).

Note that the lines for chares 8 and 9 appear out of order: printed output from different processors is not globally ordered.


Charm++ - Load balancing

Load balancing: balance the load between the processing units.
Goal: optimize CPU consumption, temperature, or other metrics...

Charm++ - Load balancing

[Figure: chare-to-processor distribution, initial state and after load balancing (LB)]

Charm++ - Load balancing

Listing 6: NewLB.C

CreateLBFunc_Def(NewLB, "Create NewLB")

void NewLB::work(BaseLB::LDStats *stats)
{
  CkPrintf("%d chares executed on %d procs\n", stats->n_obj, stats->nprocs());

  // Processor array
  ProcArray *parr = new ProcArray(stats);
  // Object graph
  ObjGraph *ogr = new ObjGraph(stats);

  std::vector<Vertex>::iterator v_it;
  std::vector<Edge>::iterator e_it;

  for (v_it = ogr->vertices.begin(); v_it != ogr->vertices.end(); ++v_it) {
    double load = (*v_it).getVertexLoad();
    for (e_it = (*v_it).sendToList.begin(); e_it != (*v_it).sendToList.end(); ++e_it) {
      int from = (*v_it).getVertexId();
      int to = (*e_it).getNeighborId();
      if (from != to) {
        CkPrintf("%d msgs sent from %d to %d\n", (*e_it).getNumMsgs(), from, to);
      }
    }
  }

  // Example decision: move object 12 to processor 2, then commit the decisions
  ogr->vertices[12].setNewPe(2);
  ogr->convertDecisions(stats);
}

Charm++ - Load balancing

Load balancing for iterative applications:
charmrun +p64 ./app +balancer NewLB +LBDebug 1 [...]
Objects can migrate using a pup function (Pack and UnPack). A sketch of the per-iteration protocol follows the listing.

Listing 7: PUP function

class MyClass {
public:
  int a;
  int *b;

  MyClass() {
    a = 42;
    b = (int *) malloc(a * sizeof(int));
  }

  void pup(PUP::er &p) {
    p | a;
    if (p.isUnpacking()) {
      // Allocate the buffer before unpacking into it
      b = (int *) malloc(a * sizeof(int));
    }
    PUParray(p, b, a);
  }
};
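For that iterative pattern, here is a minimal sketch (not from the slides; the class and method names are hypothetical) of how an array element usually hands control to the load balancer between iterations with the AtSync protocol:

// Sketch of a migratable array element using AtSync load balancing.
// Assumes a matching "array [1D] Worker" declaration in the .ci file
// with entry methods Worker() and iterate().
class Worker : public CBase_Worker {
public:
  Worker() : iteration(0) {
    usesAtSync = true;            // this chare participates in AtSync load balancing
  }
  Worker(CkMigrateMessage *m) {}

  void pup(PUP::er &p) {          // migratable state goes through pup, as above
    p | iteration;
  }

  void iterate() {
    // ... do one iteration of work ...
    iteration++;
    AtSync();                     // suspend; the RTS may now migrate this object
  }

  void ResumeFromSync() {         // called by the RTS once load balancing is done
    thisProxy[thisIndex].iterate();   // start the next iteration
  }

private:
  int iteration;
};

The +balancer flag shown above then selects which strategy runs at each load-balancing step.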

Charm++ - Load balancing

Several load balancers are provided: GreedyLB, RefineLB, hierarchical load balancers, thermal-aware load balancers...
[Figure: CPU load and the groups of chares assigned to cores 0-7, before and after load balancing]

Charm++ - Results

kneighbor: benchmark application designed to simulate intensive communication between processes.
Experiments on PlaFRIM: 8 nodes with 8 cores each (Intel Xeon 5550).
[Figure: execution time (in seconds) of kneighbor on 64 cores, 256 elements, 1 MB message size, comparing DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased]


Conclusion

To conclude...
Charm++: message passing since 1984.
High-level abstraction of a parallel program.
Interesting features: load balancing, fault tolerance,...
Useful tools: CharmDebug, Projections.
Used by some important applications: ChaNGa, NAMD, OpenAtom.

Want to know more?
Website: http://charm.cs.uiuc.edu/
Tutorials: http://charm.cs.illinois.edu/tutorial/tableofcontents.htm

The End. Thanks for your attention!