Center for Shock Wave-processing of Advanced Reactive Materials (C-SWARM) Computer Science Andrew Lumsdaine
Overview Challenges and Goals Exascale Overview Overall Computer Science Strategy ParalleX and HPX Runtime System Active System Libraries 2
C-SWARM is a Good CS Problem Problem sizes that can challenge any capability machine More than just an adaptive mesh Multiple zones of computation, with moving boundaries Significantly different computational paradigms in each Data-driven algorithmic feedback to ensure accuracy Not just dynamic load-balancing but also dynamic data migration and dynamic control at run-time No single current architecture/software model is a good fit Different kernels need different computational mixes Fine-grained variation in load and computation scheduling Significant memory/communication demands 3
Three Key CS Questions How can we guide C-SWARM development to mesh most effectively with future Exascale systems? How can we manage the highly dynamic C-SWARM computations to run efficiently on such systems? How can we simplify the production of the C-SWARM code in ways that provide continuing and long-term improvements in performance and productivity? 4
Overview Challenges and Goals Exascale Overview Overall Computer Science Strategy ParalleX and HPX Runtime System Active System Libraries 5
Technology Events Have Radically Changed the Architecture Base of Future Exascale Candidates Through ~2004: Moore's Law: transistor size and intrinsic delay continued to decrease Single-core microprocessors: more capable and faster; power increases offset by lower voltages Memory: more memory per chip, due to density increases and bigger chips; memory speed grew slowly System interconnect: tracked the clock; wire-driven Around 2004, several limits hit at once: operating voltage stopped decreasing, coolable max power per chip hit its ceiling, off-chip I/O maxed out, the economics of DRAM inhibited bigger chips, and wire interconnect peaked From 2004 through today (and toward 2020): the rise of multi-core: more, simpler cores per die; slower clocks; relatively constant off-die bandwidth Memory: slow density increase; slow growth in off-memory bandwidth Interconnect: complex, power-consuming wire; very complex fiber optics 6
Real NNSA Codes Are Also NOT LINPACK How are applications changing? [Chart: contrasts what we traditionally care about ("our systems are designed for here") with informatics applications ("what industry cares about"); NNSA apps (and C-SWARM) are going the informatics way] From: Murphy and Kogge, "On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Trans. on Computers, July 2007. Original chart from R. Murphy, SNL, June 2010. 7
Getting C-SWARM to Exascale Target architectures will become very massively parallel, heavily heterogeneous, with many FPUs per core, with at best small memory and limited per-core bandwidth, and with power dissipation a first-class design constraint At the limits of capability machines, 1 billion+ operations must be managed every cycle We must understand key performance metrics and how they scale across emerging architectures We must understand optimal policies for scheduling and load balancing We must focus on designing C-SWARM around memory and communication activity, with efficient flops a secondary concern 8
Answering the Metrics vs. Architecture Question Cross-cutting theme in C-SWARM Work in tandem with algorithm and application designers Understand scaling of basic kernels Select/instrument/measure metrics that reflect the tall poles in kernel execution, especially memory and node-to-node traffic Develop architecture-based models for predicting future C-SWARM performance Explore alternative policies for scheduling kernels Validate against execution on C-SWARM clusters Use continuing ties to other DOE/NNSA Exascale initiatives to expand applicability 9
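For illustration only: a minimal, self-contained C++ sketch (the kernel and sizes are assumptions, not C-SWARM code) of the kind of measurement implied here, timing a memory-bound streaming kernel and reporting its effective bandwidth, one of the "tall pole" metrics named above.

#include <chrono>
#include <vector>
#include <cstdio>

int main() {
  const std::size_t n = 1 << 24;               // 16M doubles per array (~128 MB each)
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
  const double alpha = 0.5;

  auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i)
    c[i] = a[i] + alpha * b[i];                 // STREAM-triad-like, memory-bound kernel
  auto t1 = std::chrono::steady_clock::now();

  double sec   = std::chrono::duration<double>(t1 - t0).count();
  double bytes = 3.0 * n * sizeof(double);      // two loads + one store per element
  std::printf("time %.3f s, effective bandwidth %.2f GB/s\n", sec, bytes / sec / 1e9);
  return 0;
}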
Overview Challenges and Goals Exascale Overview Overall Computer Science Strategy ParalleX and HPX Runtime System Active System Libraries 10
Computer Science Contributions to C-SWARM Empower scientists to do science (not computer science) Develop software infrastructure to make C-SWARM applications possible on current and next-generation (Exascale) HPC platforms Efficiency and scalability At the same time Productivity and reliability Separate domain science from details of runtime system, computer architecture via advanced software libraries (ASLib) and run-time interfaces (XPI) 11
Quotes: What is the Need? "I spend 80% of my time trying to trick MPI into doing what it doesn't want to do." "20% of existing code is physics." Our response: No tricks needed with HPX: the control needs of the problem are met by the mechanisms of the runtime system Active System Libraries facilitate separation of concerns between the science description and machine-oriented optimization 12
Algorithms Implementation Model Scientist writes software in a domain-specific fashion Computer scientist specifies the mapping from the high level (domain) to the low level (execution target) Compiler/library framework applies the mappings Runtime system carries out load balancing, fault tolerance, and adaptive tuning/optimization Hardware is accessed via the interface to the runtime system; a sketch of this separation is given below [Diagram: conventional path (scientist's scientific program, compiler, libraries, environment, hardware) vs. C-SWARM path (domain algorithms, data structures, and metadata with specialization hints; metaprogramming framework / domain-specific compiler with active libraries producing specialized source and machine code using hardware information)] 13
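A minimal sketch of that separation in plain C++; the names for_each_cell and serial_target are hypothetical placeholders, not part of XPI or any ASLib. The domain code is written once against an abstract iteration, and the mapping to a target is supplied separately so it can be swapped without touching the science code.

// Domain-level code: the scientist writes against an abstract execution policy.
// A computer scientist provides specializations that map the same call onto a
// particular target (a serial loop here; a runtime-backed version elsewhere).
#include <cstddef>
#include <vector>
#include <cstdio>

struct serial_target {};                        // hypothetical low-level mapping

template <class Target, class F>
void for_each_cell(std::size_t n, F&& body, Target) {
  for (std::size_t i = 0; i < n; ++i) body(i);  // simplest mapping: serial loop
}

// Domain code: written once, independent of the mapping.
void relax(std::vector<double>& u, const std::vector<double>& f) {
  for_each_cell(u.size(), [&](std::size_t i) {
    u[i] = 0.5 * (u[i] + f[i]);                 // toy "physics" update
  }, serial_target{});
}

int main() {
  std::vector<double> u(8, 1.0), f(8, 3.0);
  relax(u, f);
  std::printf("u[0] = %g\n", u[0]);
}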
Computer Science Strategy Apply and extend advanced execution model with essential properties for computational challenges For dramatic improvements in efficiency and scalability Deliver runtime system for dynamic resource management and task scheduling Exploit runtime information for continuous and adaptive control Develop domain-specific hierarchical programming interface For rapid algorithm development and testing To enable separate optimization (parameterization and composition) Support UQ and V&V Phased development for immediate impact on project and long term improvements in performance and productivity Leverage on-going research for mutual benefit of DOE programs 14
Concrete Computer Science Deliverables ParalleX execution model Cross-cutting guiding principles for co-design of application and system software HPX runtime system Dynamic adaptive resource management and task scheduling Advanced control policies and parallelism discovery XPI low-level programming interface Stable interface to HPX Target for high-level programming methods ASLib(s) Separation of concerns Generic libraries (compositional) Domain-specific language (DSL) Semantics necessary for physics, hiding system-specific issues Optimization framework 15
Overview Challenges and Goals Exascale Overview Overall Computer Science Strategy ParalleX and HPX Runtime System Active System Libraries 16
A Quick ParalleX Overview Localities: synchronous domains AGAS: active global address space ParalleX processes: with capabilities protection Computational complexes: threads and fine-grain dataflow Local Control Objects: synchronization and global distributed control state Distributed control operation: global mutable data structures Parcels: message-driven execution and continuation migration Percolation: heterogeneous control Micro-checkpointing: compute-validate-commit Self-awareness: introspection and declarative management 17
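The following toy example, written against standard C++ rather than the HPX API, illustrates the Local Control Object idea: two producers complete independently and a consumer continues only when both results are available. In HPX the same pattern runs on lightweight user threads instead of blocking OS threads.

#include <future>
#include <cstdio>

// Two independent "producers" (e.g., flux computations on neighboring cells)
// and a consumer that fires only when both inputs are ready.
int main() {
  std::future<double> left  = std::async(std::launch::async, [] { return 1.0; });
  std::future<double> right = std::async(std::launch::async, [] { return 2.0; });

  // The get() calls play the role of an LCO: the consumer waits for its
  // inputs, then continues with the combined update.
  double update = 0.5 * (left.get() + right.get());
  std::printf("update = %g\n", update);
}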
HPX Runtime System ParalleX execution model provides the conceptual framework for C-SWARM software co-design and integration Attacks the sources of performance degradation for efficiency and scalability: starvation, latency, overhead, contention, and the uncertainty of asynchrony Addresses critical C-SWARM needs for dynamic adaptive system software Provides support for multi-scale, multi-physics, time-varying applications Dynamic adaptive resource management and task scheduling for unprecedented efficiency and scalability Optimized scheduling and allocation policies Reliability through micro-checkpointing and the compute-validate-commit cycle XPI low-level programming interface Low-risk means of building C-SWARM on top of HPX Leverages and complements prior and ongoing work of other programs 18
HPX Addresses C-SWARM Computational Needs Starvation Discover parallelism within adaptive runtime data structures Lightweight dynamic threads deliver new parallelism relative to static MPI processes Load balance to match algorithm parallelism to physical resources Pipeline to overlap successive phases of computation Latency Overlap computation and communication (multithreading) Message-driven computation (move work to data) Overheads Active global address space Powerful but lightweight synchronization constructs Lightweight thread context switching 19
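An illustrative sketch, using standard C++ futures rather than HPX itself, of overlapping communication with computation: a stubbed halo exchange is launched asynchronously, interior work proceeds while it is in flight, and the boundary contribution is folded in only when actually needed.

#include <future>
#include <vector>
#include <numeric>
#include <cstdio>

std::vector<double> exchange_halo() {            // stand-in for a real exchange
  return std::vector<double>(4, 1.0);
}

int main() {
  std::vector<double> interior(1000, 2.0);

  auto halo = std::async(std::launch::async, exchange_halo);   // communication in flight

  // Interior computation overlaps the exchange.
  double interior_sum = std::accumulate(interior.begin(), interior.end(), 0.0);

  std::vector<double> boundary = halo.get();                   // wait only when needed
  double boundary_sum = std::accumulate(boundary.begin(), boundary.end(), 0.0);

  std::printf("sum = %g\n", interior_sum + boundary_sum);
}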
Overview Challenges and Goals Exascale Overview Overall Computer Science Strategy ParalleX and HPX Runtime System Active System Libraries 20
Enabling Software Technology Maxwell's compatibility criterion Retains Lagrangian representation in WAMR solver Enhance GFEM for PGFem3D Coupled with Lagrangian face-offsetting tracking method Surfaces represented explicitly Immersed Boundary Method for WAMR Unstructured triangulated surface grid Riemann problem in a local coordinate system Enhance Multi-time / Multi-domain Geometric Integrators Mixed mode of integration (explicit & implicit) Asynchronous time steps Enhance GCTH and Develop MPU Nested coupling algorithm between M&m 21
Computational and Computer Science Synergies BSP-based programming with MPI is static and inefficient: time is spent on data structures, domain decomposition, load balancing, communications, etc. [Diagram: macro-, meso-, and micro-scale regions (shock zone, transition zone, inert zone; O(0.1 m), O(0.1 mm), O(0.1 μm)); macro-continuum coupled to micro-continuum] Dynamic adaptive resource management and task scheduling - ParalleX execution model Reliable runtime system - HPX embodiment of ParalleX Productive code development - XPI and ASLibs Starvation, latency, overhead, contention, uncertainty of asynchrony, etc. C-SWARM software suite reliability 22
Computational and Computer Science Synergies Starvation [Figure: adaptive collocation points at t = 100 µs and t = 200 µs] The number and position of collocation points is not known a priori Extensive use of adaptive space/time/model refinements Lightweight user threads Exploiting new forms of parallelism, as in the sketch below 23
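A minimal sketch (standard C++, with a stand-in work function) of parallelism discovered at run time: regions flagged by the error estimator are not known up front, so one task is spawned per flagged region as the flags appear. HPX would use much lighter-weight user threads for the same pattern.

#include <future>
#include <vector>
#include <cstdio>

double refine_region(int region) { return region * 0.5; }     // stand-in work

int main() {
  std::vector<int> flagged = {2, 5, 7, 11};                    // produced at run time
  std::vector<std::future<double>> tasks;

  // Spawn a task per newly discovered region of work.
  for (int r : flagged)
    tasks.push_back(std::async(std::launch::async, refine_region, r));

  double total = 0.0;
  for (auto& t : tasks) total += t.get();
  std::printf("refined work = %g\n", total);
}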
Computational and Computer Science Synergies Latency [Figure: meso-scale ensemble 1, meso-scale ensemble 2, and the macro-scale shock; representative ellipsoidal pack for the 75 % wt rice, 25 % wt black mustard pack, with particles identified as rice or mustard from their average grayscale values] The order of solution of cells is not known a priori Concurrent coupling based on nested iterations Multi-threaded local execution control Message-driven computation that moves work to data 24
Computational and Computer Science Synergies Overhead [Figure: macro-scale implicit step t2 and explicit step t1; meso-scale RUCs that pass the required tolerance vs. an RUC that fails to reach it, near the shock] Asynchronous time integration (explicit and implicit) Error control in space, time, and the constitutive update Lightweight synchronization such as dataflow Percolation, a form of controlled workflow management A minimal sketch of such an error-controlled, asynchronous update follows 25
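An illustrative sketch, in standard C++ with a stand-in RUC solver, of an asynchronous, error-controlled update: the expensive meso-scale solve is launched as a task, its result is validated against a tolerance, and the step is retried with a smaller increment if validation fails.

#include <future>
#include <cmath>
#include <cstdio>

struct ruc_result { double value; double error; };

ruc_result solve_ruc(double dt) {            // stand-in for the implicit meso solve
  return { std::sin(dt), dt * dt };          // toy "error" shrinks with the step size
}

int main() {
  double dt = 0.1, tol = 1e-3;

  auto pending = std::async(std::launch::async, solve_ruc, dt);
  // ... macro-scale explicit work could proceed here while the RUC solve runs ...

  ruc_result r = pending.get();
  while (r.error > tol) {                    // validate; on failure retry with dt/2
    dt *= 0.5;
    r = solve_ruc(dt);
  }
  std::printf("accepted dt = %g, value = %g\n", dt, r.value);
}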
Software Integration Framework [Diagram: macro-continuum (WAMR, Maxwell's compatibility, immersed boundary, variational integrators, constitutive update, MPU, WAMR-HPX) and micro-continuum (PGFem3D, phase transformation, chemical kinetics, L12 plasticity, transient GCTH, MPU, PGFem3D-HPX), with DAKOTA V&V/UQ, layered over the exascale architecture (ParalleX model, HPX, XPI, I2-ASLib, I2-MPI, data profiling, I/O, visualization)] I2-MPI / I2-ASLib / I2-XPI: high-level abstractions of interface data; physics-aware, input-data-aware, numeric-aware, V&V/UQ-aware, visualization-aware Typically, solvers exchange data horizontally; we propose vertical coupling across scales Modularity preserved with exceptional scaling properties Similar to the Roccom module of the ASC CSAR center 26
Active System Libraries Metaprogramming-based interfaces to low-level libraries Rather than just calling functions, the library and the calling code influence each other Code generation and customized compilation enable usability and performance Benefits: Check uses of the library and better diagnose errors Configure the library to a specific computer system or use case Domain-specific notation Optimize code that uses the library Techniques: C++ template metaprogramming (cf. Parallel Boost Graph, AM++, STAPL) Compiler infrastructures (e.g., ROSE, LLVM) Run-time code generation (e.g., SEJITS, CorePy) 27
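A small, self-contained illustration of the "check uses of the library and better diagnose errors" benefit using C++ template metaprogramming; axpy here is a toy library routine, not an ASLib interface. The static_assert turns a misuse into a compile-time diagnostic worded by the library itself.

#include <type_traits>
#include <vector>
#include <cstdio>

// Library-side check: reject element types the routine cannot handle.
template <class T>
void axpy(double a, const std::vector<T>& x, std::vector<T>& y) {
  static_assert(std::is_floating_point<T>::value,
                "axpy: element type must be a floating-point type");
  for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

int main() {
  std::vector<double> x(4, 1.0), y(4, 2.0);
  axpy(0.5, x, y);                      // fine
  // std::vector<int> xi(4), yi(4);
  // axpy(2, xi, yi);                   // would fail with the library's own message
  std::printf("y[0] = %g\n", y[0]);
}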
Progressive Implementation Strategy Basic domain-specific libraries layered on XPI Generic programming approach to separate concerns Allow simultaneous development of M&m and HPX Provide declarative semantics of M&m codes Bridge the abstraction-performance divide Provide an insulation layer between the C-SWARM framework, the HPX runtime system, and the hardware Evolve to a DSL Reliability through micro-checkpointing and the compute-validate-commit cycle, sketched below 28
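A minimal sketch, in plain C++, of the compute-validate-commit idea with a micro-checkpoint: the touched state is copied before the compute phase, the result is validated, and the update is either committed or rolled back to the checkpoint. The validation test and state are stand-ins.

#include <vector>
#include <cmath>
#include <cstdio>

bool validate(const std::vector<double>& u) {
  for (double v : u) if (!std::isfinite(v)) return false;   // toy validity test
  return true;
}

int main() {
  std::vector<double> u(8, 1.0);

  std::vector<double> checkpoint = u;        // micro-checkpoint: only the touched state
  for (double& v : u) v = v * 1.5 - 0.25;    // compute

  if (validate(u)) {
    std::printf("committed\n");              // commit: checkpoint can be discarded
  } else {
    u = checkpoint;                          // roll back and (in a real run) recompute
    std::printf("rolled back\n");
  }
}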
Center for Shock Wave-processing of Advanced Reactive Materials (C-SWARM) Exascale Plan Andrew Lumsdaine