Charm++, what's that?!
1 Charm++, what's that?! Les Mardis du dev. François Tessier, Runtime team. October 15, 2013.
2 Outline: 1 Introduction, 2 Charm++, 3 Basic examples, 4 Load Balancing, 5 Conclusion.
4 Introduction. Parallel programming involves: decomposition (what to do in parallel), mapping (which processor executes each task), scheduling (in what order), and machine-dependent expression (expressing the above decisions for the particular parallel machine).
5 Introduction. Scalable execution of parallel applications: in cluster computing, the number of cores keeps increasing while the memory per core is decreasing (or increasing slowly), so applications need to communicate more and more. Figure: number of cores and amount of memory per core (in MB) of the top-ranked cluster in the Top500 list since 1993.
6 Introduction. How to communicate between nodes? The message passing paradigm: inter-process communication (MPI, SOAP, Charm++) over the communication network. Figure: communication from process 1 to process 4 with a message passing interface.
8 Charm++
13 Charm++ Presentation. Developed at the PPL (Urbana-Champaign, IL); a high-level abstraction of parallel programming; C++, Python. How to define Charm++? Object-oriented: based on the C++ programming language, with objects called chares that contain data, send and receive messages, and perform tasks. Asynchronous message passing: chares exchange messages to communicate, and messages are sent and received asynchronously with respect to code execution. Parallel: chunks of code, messages. Programming paradigm: a way of writing programs, with features and structures added on top of C++.
14 Charm++. What is a Charm++ program? A collection of communicating chare objects, including the "main" chare (a global object space). The number of processing units and the type of interconnect do not matter: the RTS handles them. A message is sent by calling another chare's entry point function (the reception point); the call returns immediately from the perspective of the calling chare, as the sketch below illustrates. Figure: user's view of a Charm++ application.
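A minimal sketch of this asynchronous invocation (the Worker chare and its compute entry method are hypothetical; ckNew and proxies are the standard Charm++ mechanisms):

// Assumed: a chare class Worker with an entry method compute(int),
// declared in a .ci interface file.
CProxy_Worker worker = CProxy_Worker::ckNew(); // create a chare; the RTS picks where it lives
worker.compute(42); // asynchronous: the message is enqueued and the call
                    // returns at once; the caller keeps executing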
15 Charm++. The Charm++ compilation and the runtime system. First strength: it works! The application is written as a collection of communicating objects. Compilation and execution target a specific platform and set of physical resources (-np). The RTS manages the details of the physical resources and can take decisions about: mapping chare objects to physical processors, load-balancing chare objects, checkpointing, fault tolerance... Figure: compilation process for a chare class.
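As a rough sketch of that build workflow (assuming an interface file main.ci, matching the includes in the Hello World example that follows; charmc and charmrun are the standard Charm++ tools):

# Translate the interface file; generates main.decl.h and main.def.h
charmc main.ci
# Compile and link the chare implementation against the Charm++ RTS
charmc -o hello main.C
# Launch on 4 processing elements via the runtime's launcher
./charmrun +p4 ./hello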
17 Charm++ - Hello World!

Listing 1: Hello.h
#ifndef MAIN_H
#define MAIN_H

class Main : public CBase_Main {
public:
  Main(CkArgMsg *msg);
  Main(CkMigrateMessage *msg);
};

#endif // MAIN_H

Listing 2: Hello.ci
mainmodule main {
  mainchare Main {
    entry Main(CkArgMsg *msg);
  };
};

Listing 3: Hello.C
#include "main.decl.h"
#include "main.h"

// Entry point of the Charm++ application
Main::Main(CkArgMsg *msg) {
  // Print a message for the user
  CkPrintf("Hello World!\n");
  // Exit the application
  CkExit();
}

// Constructor needed for chare object migration (ignore for now)
// NOTE: This constructor does not need to appear in the ".ci" file
Main::Main(CkMigrateMessage *msg) { }

#include "main.def.h"
18 Charm++ - 1D array

Listing 4: 1Darray.ci
mainmodule hello {
  readonly CProxy_Main mainproxy;
  readonly int nelements;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(void);
  };

  array [1D] Hello {
    entry Hello(void);
    entry void SayHi(int from);
  };
};

Listing 5: 1Darray.C
#include <stdio.h>
#include "hello.decl.h"

/* readonly */ CProxy_Main mainproxy;
/* readonly */ int nelements;

/* mainchare */
class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    if (m->argc > 1) nelements = atoi(m->argv[1]);
    delete m;
    CkPrintf("Run on %d processors for %d el.\n", CkNumPes(), nelements);
    mainproxy = thisProxy;
    // Create the 1D chare array and start the chain at element 0
    CProxy_Hello arr = CProxy_Hello::ckNew(nelements);
    arr[0].SayHi(-1);
  }

  void done(void) {
    CkPrintf("All done\n");
    CkExit();
  }
};

/* array [1D] */
class Hello : public CBase_Hello {
public:
  Hello() {
    CkPrintf("Hello %d created\n", thisIndex);
  }
  Hello(CkMigrateMessage *m) {}

  void SayHi(int from) {
    CkPrintf("\"Hello\" from Hello chare # %d on processor %d (told by %d).\n",
             thisIndex, CkMyPe(), from);
    if (thisIndex < nelements - 1)
      thisProxy[thisIndex + 1].SayHi(thisIndex); // pass the message to the next element
    else
      mainproxy.done(); // Done!
  }
};

#include "hello.def.h"
19 Charm++ - 1D array

$ ./charmrun +p3 ./hello 10
Running "Hello World" with 10 elements using 3 processors.
"Hello" from Hello chare # 0 on processor 0 (told by -1).
"Hello" from Hello chare # 1 on processor 1 (told by 0).
"Hello" from Hello chare # 2 on processor 2 (told by 1).
"Hello" from Hello chare # 3 on processor 0 (told by 2).
"Hello" from Hello chare # 4 on processor 1 (told by 3).
"Hello" from Hello chare # 5 on processor 2 (told by 4).
"Hello" from Hello chare # 6 on processor 0 (told by 5).
"Hello" from Hello chare # 7 on processor 1 (told by 6).
"Hello" from Hello chare # 9 on processor 0 (told by 8).
"Hello" from Hello chare # 8 on processor 2 (told by 7).

Note that the lines for chares 8 and 9 appear out of order: message delivery and printing are asynchronous.
21 Charm++ - Load balancing. Load balancing distributes the load between the processing units. Goal: optimize CPU utilization, temperature, or other metrics...
22 Charm++ - Load balancing. Figure: initial state of the chare distribution across processors, then the distribution after the load balancer (LB) runs.
24 Charm++ - Load balancing

Listing 6: NewLB.C
CreateLBFunc_Def(NewLB, "Create NewLB")

void NewLB::work(BaseLB::LDStats *stats) {
  CkPrintf("%d chares executed on %d procs\n", stats->n_objs, stats->nprocs());

  // Processor array
  ProcArray *parr = new ProcArray(stats);
  // Object graph
  ObjGraph *ogr = new ObjGraph(stats);

  std::vector<Vertex>::iterator v_it;
  std::vector<Edge>::iterator e_it;

  // Walk the communication graph: for each chare (vertex), inspect its
  // load and the messages it sends to its neighbors (edges)
  for (v_it = ogr->vertices.begin(); v_it != ogr->vertices.end(); ++v_it) {
    double load = (*v_it).getVertexLoad();
    for (e_it = (*v_it).sendToList.begin(); e_it != (*v_it).sendToList.end(); ++e_it) {
      int from = (*v_it).getVertexId();
      int to = (*e_it).getNeighborId();
      if (from != to) {
        CkPrintf("%d msgs sent from %d to %d\n", (*e_it).getNumMsgs(), from, to);
      }
    }
  }

  // Example migration decision: move chare 12 to processor 2
  ogr->vertices[12].setNewPe(2);
  ogr->convertDecisions(stats);
}
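A hedged sketch of how such a custom balancer is typically wired in (the module name is assumed to match the class; -module and +balancer are the usual charmc/charmrun mechanisms):

# Link the custom balancer module into the application
charmc -o app app.C -module NewLB
# Select it at launch time, with load-balancing debug output
./charmrun +p64 ./app +balancer NewLB +LBDebug 1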
25 Charm++ - Load balancing. Load balancing is intended for iterative applications:
charmrun +p64 ./app +balancer NewLB +LBDebug 1 [...]
Objects can migrate using a pup function (Pack and UnPack).

Listing 7: PUP function
class MyClass {
public:
  int a;
  int *b;

  MyClass() {
    a = 42;
    b = (int *) malloc(a * sizeof(int));
  }

  void pup(PUP::er &p) {
    p | a;
    // When unpacking, allocate the buffer before filling it
    if (p.isUnpacking()) {
      b = (int *) malloc(a * sizeof(int));
    }
    PUParray(p, b, a);
  }
};
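For such iterative applications, load balancing is usually triggered from the chares themselves. A minimal sketch of the AtSync protocol (MyChare and its iteration counter are hypothetical; usesAtSync, AtSync and ResumeFromSync are the standard Charm++ hooks):

// In the chare array element's constructor: opt in to synchronized LB
//   usesAtSync = true;

void MyChare::iterate() {
  // ... one step of computation and communication ...
  if (++iter % 100 == 0)
    AtSync();                        // pause here and let the load balancer run
  else
    thisProxy[thisIndex].iterate();  // schedule the next step
}

// Called by the RTS once balancing (and any migration) is finished
void MyChare::ResumeFromSync() {
  thisProxy[thisIndex].iterate();    // resume iterating
}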
26 Charm++ - Load balancing. Several load balancers are available: GreedyLB, RefineLB, hierarchical load balancers, thermal-aware load balancers. Figure: CPU load, with groups of chares assigned to cores.
30 Charm++ - Results. kNeighbor: a benchmark application designed to simulate intensive communication between processes. Experiments on PlaFRIM: 8 nodes with 8 cores each (Intel Xeon 5550). Figure: execution time (in seconds) of kNeighbor on 64 cores, 256 elements, 1 MB message size, comparing DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased.
32 Conclusion. To conclude... Charm++: message passing since 1984; a high-level abstraction of a parallel program; interesting features: load balancing, fault tolerance,... Useful tools: CharmDebug, Projections. Used by some important applications: ChaNGa, NAMD, OpenAtom. Want to know more? Website: Tutorials: edu/tutorial/tableofcontents.htm
33 The End. Thanks for your attention!