UINTAH-X: A Software Approach that Leads towards Exascale
Martin Berzins, Chuck Hansen, James Sutherland, Matt Might, Valerio Pascucci
Funding: DOE (CSAFE project 97-10, NETL, NNSA, INCITE & ALCC), NSF (SDCI and PetaApps), XSEDE, Argonne, Oak Ridge
www.uintah.utah.edu
Planned CS Contributions
UINTAH-X: Advancing Progress towards Exascale Computing through CS Contributions
1. UINTAH-X Runtime System (RTS) (Martin): will show that an asynchronous DAG-based RTS gives scalable exascale solutions for the Alstom problem when combined with code from
2. TASC (WASATCH/NEBO), a Domain-Specific Language (Matt/James): will generate efficient, extensible exascale DAG code from high-level complex physics (Alstom), which in turn uses
3. ViSUS (PIDX) and VisIt for I/O, Analytics & Visualization (Chuck/Valerio): performs in-situ visualization and efficient I/O of exascale solutions, e.g. of the Alstom boiler.
4. Related work: synergistic work on e.g. fault tolerance, formal methods for debugging, and languages (Martin, James, and Matt, with NSF funding).
Uintah-X Architecture
Problem specification via ARCHES or the WASATCH/NEBO EDSL; the abstract task-graph program executes on a runtime system with: asynchronous out-of-order execution; work stealing; overlap of communication and computation; tasks running on cores and accelerators; scalable I/O via ViSUS PIDX; data analysis and visualization with VisIt.
(diagram): Simulation Controller; ARCHES and WASATCH/NEBO; Runtime System with Scheduler and Load Balancer; PIDX; VisIt.
"Exascale programming will require prioritization of critical-path and non-critical-path tasks, adaptive directed acyclic graph scheduling of critical-path tasks, and adaptive rebalancing of all tasks..." [Brown et al., Exascale Report]
UINTAH-X Architecture (diagram): ARCHES or WASATCH/NEBO XML task specification → compile → run time (each timestep, in the runtime system): calculate residuals, solve equations, parallel I/O via ViSUS PIDX and VisIt.
Uintah-X Runtime System (diagram): a task-graph execution layer runs on multiple cores (threads 1-3) over a per-node data-management layer. From the internal task queue, a task is selected and its MPI receives posted (MPI_Irecv); communication records are checked (MPI_Test) to find tasks whose data are valid and ready. From the external task queue, a ready task is selected and executed, with get/put access to the Data Warehouse (one per node) and shared per-node objects; on completion, its MPI sends are posted (MPI_Isend) to the network.
Unified Heterogeneous Scheduler & Runtime (diagram): the task graph feeds CPU task queues and device task queues (for device-enabled tasks). CPU threads run tasks with get/put access to the host Data Warehouse (variables); device kernels run device tasks against a Device Data Warehouse, with H2D and D2H copies between host and device memory. Completed tasks signal the scheduler (shared scheduler objects in host memory); MPI sends and receives move data over the network, and tasks move to the ready queues when their MPI data arrive.
MPM AMR ICE strong scaling on Mira, with one mesh patch for every 3 cores (attempting to debug what should have been this run caused DDT to crash). A complex fluid-structure interaction problem with adaptive mesh refinement; see the SC13 paper. NSF funding.
Scalability is at least partially achieved by not executing tasks in order, e.g. in AMR fluid-structure interaction. In the plot, the straight line represents the given order of tasks and each green X shows when a task was actually executed: above the line means late execution, below the line means early execution. There are more late tasks than early ones; e.g. the task list 1 2 3 4 5 may execute as 1 4 2 3 5.
Linear solves arise from the Navier-Stokes equations. The full model includes turbulence, chemical reactions, and radiation. We arrive at a pressure Poisson equation to solve for p, using the Hypre solver distributed by LLNL. Many linear solvers are used, including preconditioned conjugate gradients on regular mesh patches with a multigrid preconditioner. Careful adaptive strategies are needed to get scalability (CCGrid13 paper). One radiation solve every 10 timesteps.
NVIDIA AmgX: Linear Solvers on GPUs. Fast, scalable iterative GPU linear solvers: a flexible toolkit providing a GPU-accelerated Ax = b solver with a simple API for multiple application domains, scaling across multiple (potentially thousands of) GPUs. Key features: Ruge-Stüben algebraic multigrid; Krylov methods (CG, GMRES, BiCGStab); smoothers and solvers (block Jacobi, Gauss-Seidel, incomplete LU); a flexible composition system; scalar or coupled block systems; MPI and OpenMP support; a flexible, high-level C API; free for non-commercial use. Utah access via the Utah CUDA Center of Excellence.
Uintah-X Parting Thoughts. The Uintah DAG-based approach solves a wide range of problems with different communication requirements: fluid-structure interaction (MPMICE), combustion (ARCHES), radiation (RMCRT). Excellent scaling on many petascale supercomputers: Kraken, Titan, Stampede, Mira. Ability to use heterogeneous resources (GPU and MIC), scheduling tasks dynamically according to runtime MPI communication costs. Future work: task stealing across CPU/device queues; improved GPU/MIC memory management.
Wasatch Energy Equation Expressions Example: the enthalpy diffusive flux. Create individual expressions and register them; the expression tree captures the dependency specification and the execution order. Then request calculation of any number of fields: the graph to accomplish this is created automatically!
Nebo: an Embedded Domain-Specific Language for HPC
Why an EDSL?
Expressive: MATLAB-style array syntax; the programmer expresses intent (problem structure), not implementation.
Efficient: high performance; should match hand-tuned C++ code in performance.
Extensible: insulates the programmer from architecture changes (e.g. multicore, GPU, ...); the EDSL back-end compiles into code for the target architecture, mediating between the model, the discretization, and the architecture.
Plays well with others: the programmer can write in C++ and interoperate with the EDSL; it is not an all-or-none approach, allows concurrent development of the EDSL and application codes, and adapts to appropriate data structures without impacting the application programmer.
Implementation: C++ template metaprogramming, embedded in C++, with compile-time reasoning about the problem.
Chained Stencil Operations: φ = ∇·(λ ∇T)

phi <<= divx( interpx(lambda) * gradx(temperature) )
      + divy( interpy(lambda) * grady(temperature) )
      + divz( interpz(lambda) * gradz(temperature) );

One inlined grid loop, no temporaries. Better parallel performance than without chaining. Compile-time consistency checking (field-operator and field-field compatibility).
Performance & Scalability
(chart) Speedup using the DSL relative to other Uintah codes (ARCHES and ICE), from roughly 1.3x at 8^3 up to 9.8x at 128^3. Comparison to ICE and ARCHES, sister codes in Uintah, on a 3D Taylor-Green vortex problem, run on a single processor.
(chart) Weak scaling of Wasatch/Uintah on Titan for patch sizes 32^3, 43^3, 64^3, and 128^3, from 10 to roughly 100,000 processes (up to 2.2 trillion DOF); a value of 1 indicates perfect weak scaling.
Multicore & GPU Performance

φ = ∇·(λ ∇T):
phi <<= divx( -interpx(lambda) * gradx(t) )
      + divy( -interpy(lambda) * grady(t) )
      + divz( -interpz(lambda) * gradz(t) );

Speedup vs. thread count (2, 4, 6, 8, 10, 12) and GPU:
64x64x64:     1.3  2.4  2.6  3.3  3.3  3.3   GPU: 13.8
128x128x128:  1.9  3.9  4.9  6.8  7.8  6.0   GPU: 26.0

∇·(φu + J):
rhs <<= -divopx_( xconvflux_ + xdiffflux_ )
      - divopy_( yconvflux_ + ydiffflux_ )
      - divopz_( zconvflux_ + zdiffflux_ );

64x64x64:     1.8  2.9  2.9  3.4  3.6  3.6   GPU: 16.3
128x128x128:  2.0  3.6  5.0  6.5  6.1  4.8   GPU: 13.5
Multicore & GPU: pointwise calculations

delg <<= 16.0 * PI / 3.0 * molecularvolume * molecularvolume
         * surfaceeng * surfaceeng * surfaceeng
         / (KB * KB * temperature * temperature * log(supersat) * log(supersat))
       + KB * temperature * log(supersat)
       - surfaceeng * pow(36.0 * PI * molecularvolume * molecularvolume, 1.0 / 3.0);
ic <<= 32.0 * PI / 3.0 * molecularvolume * molecularvolume
       * surfaceeng * surfaceeng * surfaceeng
       / (KB * KB * KB * temperature * temperature * temperature
          * log(supersat) * log(supersat) * log(supersat));
z <<= sqrt( delg / (3.0 * PI * KB * temperature * ic * ic) );
kf <<= diffusioncoef * pow(48.0 * PI * PI * molecularvolume * ic, 1.0 / 3.0);
N1 <<= NA * eqconc * supersat;
result <<= cond( supersat > 1.0, z * kf * N1 * N1 * exp(-delg / KB / temperature) )
               ( 0.0 );

Speedup vs. thread count (2, 4, 6, 8, 10, 12) and GPU:
64x64x64:     1.8  2.3  3.6  4.9  5.9  5.5   GPU: 27.3
128x128x128:  1.9  3.6  5.2  6.0  7.3  6.6   GPU: 37.4
(E)DSL Parting Thoughts
Hierarchical parallelization is key for scalability:
Domain decomposition (SPMD): should allow a node to compute on its interior while waiting on communication from neighbors.
Task decomposition (MIMD): decompose the solution into a DAG that can be scheduled asynchronously.
Vectorized parallelism (SIMD): break grid operations across multicore, GPU, etc.
The DAG representation is a scalable abstraction that handles problem complexity gracefully, provides a convenient separation of the problem's structure from its data, and allows sophisticated scheduling algorithms to optimize scalability and performance.
(E)DSLs are very useful future-proofing: they separate intent from implementation, EDSLs allow seamless transition of a code base, and template metaprogramming pushes work from run time to compile time for greater efficiency.
We build our data management solution on the most advanced technology in big-data streaming analytics and visualization. Live demonstration at SC12 & SC13: ~4 TB per time step (hundreds of PF3D timesteps generated on Intrepid), streaming live from the ANL visualization cluster; interactive, immersive analysis and visualization; infrastructure that scales gracefully with available hardware resources.
High-Performance Data Movement for Real-Time Monitoring of Large-Scale Simulations. On Hopper (fat tree) and Intrepid (torus), we scale simulation dumps to 130K cores with better performance than state-of-the-art libraries while enabling real-time, remote visualization. Pipeline: simulation → ViSUS IDX file format → end user; compute nodes → storage nodes → real-time data analysis and remote visualization, with user feedback. [SC12a] Efficient Data Restructuring and Aggregation for IO Acceleration in PIDX.
Flexible DMAV Architecture that Allows Exploiting the Possible Exascale Hardware Available.
Location of the compute resources: same cores as the simulation; dedicated cores; dedicated nodes; external resources.
Data access, placement, and persistence: shared data structures; shared memory; non-volatile on-node storage (NVRAM); network transfer.
Synchronization and scheduling: execute synchronously with the simulation every n-th simulation time step, or execute asynchronously.
(diagram): nodes 1..N, each with CPUs, DRAM, NVRAM, SSD, and hard disk; staging options place data in DRAM, NVRAM, or SSD, either sharing cores with the simulation, using distinct cores on the same node, or processing data on remote nodes over the network.
Systematic Exploration of the Efficiency and Cost of Data Movement for In-Situ Data Structures. Different mappings of the data structure (TXYZ, tiltXY) to nodes; PF3D simulation on Sequoia (LLNL). [SC12a] Mapping Applications with Collectives over Sub-Communicators on Torus Networks.
For Analysis and Visualization Algorithms We Determine the Effectiveness of In-Situ Task Allocations. In-situ + in-transit workflows enable matching algorithms with available architectures. PVR = parallel volume rendering; SSA = streaming statistical analysis; RTC = reduced topology computation. 4896 cores total (4480 simulation/in-situ; 256 in-transit; 160 task scheduling/data movement); simulation size 1600x1372x430; all measurements are per simulation time step. [SC12b] Combining In-Situ and In-Transit Processing to Enable Extreme-Scale Scientific Analysis.
Study of the High-Dimensional Space of Input, Tunable, and Output Parameters for (Auto)tuning.
Independent variables (input): n, initial set of nodes; V, variable count; D, data size/resolution; M, machine specs/memory.
Tunable parameters: n', nodes participating in aggregation; A, aggregator count; a_f, aggregation factor; F, file count.
Dependent variables (output): network throughput; I/O throughput; combined throughput.
Intrepid (BlueGene/P): peak performance 557 teraflops/s; RAM 2 GiB/node (4 cores); GPFS file system; contiguous batch/job allocation of nodes.
Hopper (Cray XE6): peak performance 1.28 petaflops/s; RAM 36 GiB/node (24 cores); Lustre file system; noncontiguous batch/job allocation of nodes.
[SC13a] Characterization and Modeling of PIDX Parallel I/O for Performance Optimization.
We Evaluate the Feasibility of Exascale In-Situ Feature Extraction and Tracking and its Impact on Power Consumption. Analysis algorithms differ from the simulation in communication patterns, instruction profile, and memory-access patterns. We develop power models with machine-independent characteristics (Byfl and MPI), validate the power models with empirical studies on instrumented platforms, and extrapolate to Titan. [SC13b] Exploring Power Behaviors and Trade-offs of In-situ Data Analytics.