PGAS, Global Arrays, and MPI-3: How Do They Fit Together?
Brad Chamberlain, Chapel Team, Cray Inc.
Global Arrays Technical Meeting, May 7, 2010


Disclaimer
This talk's contents should be considered my personal opinions (or at least one facet of them) and not necessarily those of Cray Inc. nor my funding sources.

Outline
- PGAS programming models
- PGAS and communication libraries like ARMCI and MPI-3
- PGAS and Global Arrays

PGAS Definition
PGAS: Partitioned Global Address Space (or perhaps better: partitioned global namespace)
Concept:
- support a strong sense of ownership and locality: each variable is stored in a particular memory segment; local variables are cheaper to access than remote ones; details vary between languages
- support a shared namespace: permit tasks to access any lexically visible variable, local or remote (a short Chapel sketch follows)

  var x = 10;
  { x }   // x is lexically visible
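To make the shared-namespace idea concrete, here is a minimal Chapel sketch (my illustration, not from the slides; it assumes the program runs on more than one locale): a task that moves to another locale can still refer to a variable declared back on the original locale, and the compiler/runtime turn that reference into a remote access.

  var x = 10;              // x is stored on the locale where this code starts running
  on Locales[1] {          // migrate execution to a second locale
    writeln(x);            // x is still lexically visible; this read is a remote access
  }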

PGAS Languages
- charter members: UPC, Co-Array Fortran, Titanium (extensions to C, Fortran, and Java, respectively); details vary, but with the potential for arrays that are decomposed across nodes and pointers that refer to remote objects
- honorary retroactive members: HPF, ZPL, Global Arrays, ...
- the next generation: Chapel, X10, ...

PGAS: What's in a Name?
(columns: memory model, programming model, execution model, data structures, communication)

MPI: distributed memory; cooperating executables (often SPMD in practice); manually fragmented data structures; communication via APIs
OpenMP: shared memory; global-view parallelism; shared-memory multithreaded execution; shared-memory arrays; communication N/A
CAF / UPC / Titanium: PGAS; Single Program, Multiple Data (SPMD); data structures: co-arrays / 1D block-cyclic arrays and distributed pointers / class-based arrays and distributed pointers; communication: co-array refs / implicit / method-based
Chapel: PGAS; global-view parallelism; distributed-memory multithreaded execution; global-view distributed arrays; implicit communication

PGAS: What's in a Name? (continued)

X10: PGAS; global-view parallelism; distributed-memory multithreaded execution; global-view distributed arrays; communication visible in the syntax
Global Arrays: PGAS; Single Program, Multiple Data (SPMD); global-view distributed arrays; implicit communication via APIs
(the CAF / UPC / Titanium and Chapel rows repeat from the previous slide)

Evaluation: Traditional PGAS Languages
Strengths:
+ Clean expression of communication through the variable namespace
+ Able to reason about locality/affinity to support scalable performance
+ Separation of data transfer from synchronization to reduce overheads
+ Elegant, reasonably minimalist extensions to established languages
Weaknesses:
- Limited to a single-threaded SPMD programming/execution model, unless you mix in some other model (if you are able to)
- Conflation of parallelism and locality (MPI has this problem as well: a PE is the unit for both concepts); fails to support Harrison's non-process-centric computing
- Arrays more restricted than ideal: no native support for sparse, associative, or unstructured arrays; CAF: no global view, supports distributed arrays of local arrays; UPC: as in C, 1D arrays only

A Design Principle HPC Should Revisit: "Support the general case, optimize for the common case"
Claim: a lot of suffering in HPC is due to programming models that focus too much on common cases:
- e.g., only supporting a single mode of parallelism
- e.g., exposing too much about the target architecture and implementation
Impacts:
- hybrid models needed to target all modes of parallelism (HW & SW)
- challenges arise when architectures change (e.g., multicore, GPUs)
- presents challenges to adoption ("linguistic dead ends")
That said, this approach is also pragmatic, particularly given community size and (relatively) limited resources; and frankly, we've achieved a lot of great science when things fit. But we shouldn't stop striving for more general approaches.

How do HPCS PGAS languages differ?
- focus on general parallel programming: support for task- and data-parallelism, concurrency, and nestings of these; support for multiple granularities of software and hardware parallelism
- distinct concepts for parallelism vs. architectural locality: tasks are units of parallel work, locales/places are units of architectural locality, and each locale/place can execute multiple tasks (see the sketch below)
- post-SPMD (APGAS) programming/execution models: the user is not forced to code in an SPMD style (but may); ability to run multiple threads within each process/node
- rich support for arrays: multidimensional, sparse, associative, unstructured; user-defined distributions and memory layouts
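As a rough illustration of the tasks-vs.-locales distinction (my sketch, not from the talk; work() is a hypothetical per-task routine), a Chapel program can create one task per locale and then several more tasks within each locale:

  coforall loc in Locales do    // one task per locale: locales capture architectural locality
    on loc {
      coforall tid in 1..4 do   // several tasks within each locale: tasks capture parallelism
        work(tid);              // work() is hypothetical; any per-task computation goes here
    }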

Implementing PGAS
[Table: requirements for implementing PGAS languages: single-sided communication, remote task creation, thread safety, and portability; compared for Traditional PGAS vs. HPCS PGAS, and then rated for the candidate communication layers MPI-2, SHMEM, ARMCI, GASNet, and MPI-3 (whose entries are all open questions).]

MPI-3 for PGAS
Two key working groups:
- Remote Memory Access (single-sided communication)
  Goal: do 1-sided communication right (fix MPI-2's support)
  Status: active
  References: "Investigating High Performance RMA Interfaces for the MPI-3 Standard", Tipparaju et al., ICPP 2009; https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/rmawikipage
  Challenges: memory consistency model (how strict or weak?)
  Concern: are the ARMCI and GASNet teams sufficiently involved?
- Active Messages
  Goal: add support for active messages to MPI
  Status: inactive
  References: http://meetings.mpi-forum.org/mpi3.0_am.php
  Note: active messages are a nice fallback for implementing RMA in cases the hardware doesn't, so this shouldn't get left behind

MPI-3 for PGAS: Great Potential
- Would provide a long-overdue standard for 1-sided communication
- Would result in multiple implementations of the same API
- Would result in vendor-optimized implementations
- Would support standardized single-sided benchmarks
- Could drive hardware designs in favor of global address spaces and better support for single-sided communication
- Would greatly help with interoperability between distinct programming models
- Would permit the ARMCI/GASNet teams to claim victory and work on things other than keeping up with each new network
Failing this, a merge of ARMCI and GASNet would be the next best thing for users like us, helping with many of the above.

What about GA and HPCS languages?
First, let me touch on a few relevant Chapel features.

Chapel Domains and Arrays
Domains: first-class index sets
- describe the size/shape of arrays (potentially distributed)
- describe iteration spaces
- describe index sets for high-level operations

  var D: domain(2) = [1..4, 1..8];
  var Inner: subdomain(D) = [2..3, 2..7];

Arrays: mappings from domain indices to variables
- multidimensional: arbitrary dimensions and element types
- associative: map from arbitrary values to elements (hash-table-like)
- unstructured: support irregular (pointer-based) data structures
- sparse: support sparse subsets of any of the above (the associative and sparse flavors are sketched below)

  var A, B: [D] real;
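As a small illustrative sketch of the associative and sparse flavors (my example, written in the same domain/array style as the slide above):

  var Names: domain(string);      // associative domain: indices are arbitrary string values
  Names += "alpha";               // add an index
  var Count: [Names] int;         // a hash-table-like array over that domain

  var SpD: sparse subdomain(D);   // a sparse subset of the dense 2D domain D above
  SpD += (2, 3);                  // add one index to the sparse set
  var S: [SpD] real;              // stores elements only for the indices in SpD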

Chapel Distributions and Layouts (Domain Maps)
Goal: separate algorithm from implementation for parallel, distributed data structures
Approach: permit users to author their own distributions and layouts, written in Chapel
- rather than a simple mapping, use a method-based approach: authors create classes that implement domains and arrays: how they're mapped between locales, how they're stored in memory, how to access them, iterate over them, slice them, etc.
- implement the standard domain maps using the identical framework, to avoid performance cliffs between user-defined and built-in cases
- a work in progress: operational today (see the sketch below), but still evolving and being optimized
For a general overview, see the upcoming HotPar paper: "User-Defined Distributions and Layouts in Chapel: Philosophy and Framework"
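For a flavor of how a standard domain map is used (a sketch assuming the Block distribution from Chapel's standard modules; exact spellings have varied across releases):

  use BlockDist;                  // standard Block domain map

  config const n = 8;
  var D: domain(2) dmapped Block(boundingBox=[1..n, 1..n]) = [1..n, 1..n];
  var A: [D] real;                // A's elements are spread across the locales per the domain map
  forall (i, j) in D do           // data-parallel iteration follows the distribution
    A[i, j] = i + j/10.0;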

Interoperability in Chapel
A crucial goal for the adoption of Chapel (or any new language) is general interoperability with existing languages:
- to preserve legacy application code
- to re-use existing libraries
- to permit existing kernels to be rewritten using new technologies
- to support differing tastes and diverse needs within a single application
Our initial focus is primarily on supporting calls from Chapel to C in order to reuse key libraries (see the sketch below).
We believe user-defined distributions will play a key role in interoperating with existing distributed data structures in situ, e.g., "I want to view my distributed MPI data structures as a global array within Chapel."
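A minimal sketch of what a Chapel-to-C call looks like (my example; the exact extern syntax has varied across Chapel releases, and it assumes the C routine is available at link time):

  extern proc getpid(): int(32);              // declare the C library's getpid() for use from Chapel
  writeln("running as process ", getpid());   // the call compiles down to a plain C call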

Global Arrays and (HPCS) PGAS Languages
At first blush, there are both synergies and mismatches:
Similarities:
- global-view data structures permit users to think in logical terms rather than physical ones
Differences:
- SPMD vs. post-SPMD programming/execution models
- moving data only vs. moving data and computation
- varying degrees of richness in data structures: limitations on rank, element type, flavor, distributions
- different levels of maturity: GA is here, in use, and successful; HPCS languages are still emerging and proving themselves
In spite of the differences, let me propose four possible interaction models.

1) Support GA-based Distributions
Concept: use Chapel's interoperability features to create a distribution that wraps Global Arrays
- benefit from the effort invested in GA and preserve it
- provide language-based support for GA-style computations (continued in the sketch below)

  var D: domain(2) dmapped GA( ) = [1..4, 1..8];
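A hypothetical continuation of that sketch (no GA domain map exists today; this is only what usage could look like): arrays declared over such a domain would get GA-backed storage while keeping Chapel's global view.

  var A: [D] real;        // element storage and communication would be handled by GA
  forall (i, j) in D do   // global-view data parallelism over the GA-backed array
    A[i, j] = i + j;
  writeln(+ reduce A);    // whole-array operations remain available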

2) Support GA-HPCS Interoperability
Concept: investigate what it would take to support Chapel/X10 and GA within a single program
- in terms of mechanics, like runtime software layers
- in terms of language semantics: at what granularity could the languages execute within an application? could they share data structures? what types of control-flow exchanges could occur?
Note: Tom Epperly and Jim McGraw from LLNL (Babel team) are working on organizing an interoperability workshop this summer/fall to look at questions like this across today's major parallel models (MPI, OpenMP, PGAS, HPCS, etc.). Please contact them if you are interested in participating.

3) Implement Global Arrays over an HPCS Language
Fast-forward ten years into the future and imagine that Chapel and X10 are unmitigated successes (!). Yet legacy GA code still exists.
Concept: implement GA in terms of Chapel/X10
- removes the need to duplicate lower-level layers of software
- preserves a familiar interface for existing codes and users
A general challenge: how do we reduce our need to maintain every programming model going forward? What is the right way to retire a programming model? As indicated before, I don't think the answer is "cease to make new ones."

4) Exchange of Information
Many of GA's emerging directions mesh with themes being pursued in the HPCS languages:
- task queuing for dynamic load balancing/work stealing
- sparsity
- user-defined data structures
- next-generation applications
- petascale and exascale architectures
Concept: exchange information in these common areas. Better: architect code so that it can be shared.
Also: perform cross-language comparisons of flexibility, readability, and maintainability in application codes and idioms.

Summary
GA and PGAS languages have similar needs:
- single-sided communication
- active messages (in GA, for future idioms and architectures)
- in my opinion, support for both of these things in MPI-3 would represent a huge step forward for HPC
Yet there are also differences:
- GA is more evolutionary: what works well and what can we add next?
- HPCS strives for revolution: what might an ideal parallel language be like?
We have many common challenges to work on:
- more advanced computational idioms
- dynamic load balancing
- resiliency
- exascale's increasingly heterogeneous/hierarchical nodes
DOE workshops have identified a desire for evolutionary and revolutionary, yet existing, programming models.

For More Information on HPCS Languages
Chapel: chapel_info@cray.com, http://chapel.cray.com, https://sourceforge.net/projects/chapel/
X10: http://x10.codehaus.org/, http://www.research.ibm.com/x10/