UINTAH-X A Software Approach that leads towards Exascale
1 UINTAH-X: A Software Approach that Leads towards Exascale. Martin Berzins, Chuck Hansen, James Sutherland, Matt Might, Valerio Pascucci. Funding: DOE (C-SAFE project 1997-2010, NETL, NNSA, INCITE & ALCC), NSF (SDCI and PetaApps), XSEDE, Argonne, Oak Ridge. Planned CS Contributions.
3 UINTAH-X: Advancing Progress towards Exascale Computing through CS contributions in: 1. The UINTAH-X RTS runtime system (Martin): will show that an asynchronous DAG-based RTS gives scalable exascale solutions for the Alstom problem when combined with code from 2. TASC (WASATCH/NEBO), a domain-specific language (Matt/James): will generate efficient, extensible exascale DAG code from high-level complex physics (Alstom), which in turn uses 3. ViSUS (PIDX) and VisIt for I/O, analytics & visualization (Chuck/Valerio): performs in-situ visualization and efficient I/O of exascale solutions, e.g. the Alstom boiler. 4. Related work: synergistic efforts include fault tolerance, formal methods for debugging, and languages (Martin, James and Matt, with NSF funding).
4 Uintah-X Architecture: Problem specification via ARCHES or the WASATCH/NEBO EDSL. The abstract task-graph program executes on a runtime system with: asynchronous out-of-order execution; work stealing; overlap of communication & computation; tasks running on cores and accelerators. Scalable I/O via ViSUS PIDX; data analysis and visualization via VisIt. [Diagram: Simulation Controller, ARCHES, WASATCH/NEBO, Runtime System (Scheduler, Load Balancer), PIDX, VisIt.] "Exascale programming will require prioritization of critical-path and non-critical path tasks, adaptive directed acyclic graph scheduling of critical-path tasks, and adaptive rebalancing of all tasks..." [Brown et al., Exascale Report]
5 UINTAH-X Architecture [Diagram]: ARCHES or WASATCH/NEBO → XML task specification → compile → run time (each timestep) → RUNTIME SYSTEM: calculate residuals, solve equations, parallel I/O via ViSUS PIDX and VisIt.
6 Uintah-X Runtime System [Diagram]: Task-graph execution layer (runs on multiple cores; threads 1-3). Data management: a Data Warehouse (one per node) with get/put access to shared objects (one per node). Internal task queue: select a task & post its MPI receives (MPI_Irecv); communication records are checked (MPI_Test) to find ready tasks whose data is valid. External task queue: select a task & execute it, then post its MPI sends (MPI_Isend) over the network.
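The queue mechanics above can be sketched in plain C++. This is a hedged, minimal illustration (all names hypothetical, no MPI): tasks wait until their dependency counts reach zero, then move to a ready queue and execute, possibly out of their creation order. Uintah's actual runtime adds MPI receives, a data warehouse, and per-thread execution.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Minimal dependency-counting task scheduler (illustrative only).
struct Task {
    std::string name;
    std::function<void()> work;
    int unsatisfied_deps = 0;       // inputs not yet produced
    std::vector<int> dependents;    // tasks waiting on this one
};

class MiniScheduler {
public:
    int add_task(std::string name, std::function<void()> work) {
        tasks_.push_back({std::move(name), std::move(work), 0, {}});
        return static_cast<int>(tasks_.size()) - 1;
    }
    void add_edge(int producer, int consumer) {   // producer -> consumer
        tasks_[producer].dependents.push_back(consumer);
        tasks_[consumer].unsatisfied_deps++;
    }
    // Run all tasks; returns the execution order, which may differ from
    // the order in which tasks were created.
    std::vector<std::string> run() {
        std::queue<int> ready;
        for (int i = 0; i < static_cast<int>(tasks_.size()); ++i)
            if (tasks_[i].unsatisfied_deps == 0) ready.push(i);
        std::vector<std::string> order;
        while (!ready.empty()) {
            int t = ready.front(); ready.pop();
            tasks_[t].work();
            order.push_back(tasks_[t].name);
            for (int d : tasks_[t].dependents)    // outputs now available
                if (--tasks_[d].unsatisfied_deps == 0) ready.push(d);
        }
        return order;
    }
private:
    std::vector<Task> tasks_;
};
```

In the real system the "ready" transition is driven by MPI_Test on posted receives rather than purely local dependency counts.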
7 Unified Heterogeneous Scheduler & Runtime [Diagram]: CPU threads run CPU tasks against the host Data Warehouse (get/put of variables); device kernels run device tasks against a Device Data Warehouse, with H2D/D2H copies. The task graph feeds CPU task queues and device task queues (device-enabled tasks); completed tasks signal the scheduler; shared scheduler objects live in host memory; MPI sends and receives move data over the network, and tasks become ready when their MPI data arrives.
8 MPM AMR ICE strong scaling on Mira (one mesh patch for every 3 cores): a complex fluid-structure interaction problem with adaptive mesh refinement; see the SC13 paper. NSF funding. (Debugging what should have been this run caused DDT to crash.)
9 Scalability is at least partially achieved by not executing tasks in order, e.g. for AMR fluid-structure interaction. [Plot: the straight line represents the given order of tasks; each green X shows when a task was actually executed. Above the line means late execution; below the line means early execution. There are more late tasks than early ones.]
10 Linear solves arise from the Navier-Stokes equations. The full model includes turbulence, chemical reactions and radiation. We arrive at a pressure Poisson equation to solve for p, and use the Hypre solver distributed by LLNL. Many linear solvers are used, including preconditioned conjugate gradients on regular mesh patches, with a multigrid preconditioner. Careful adaptive strategies are needed to get scalability; see the CCGrid13 paper. One radiation solve per timestep.
11 (As above, but with one radiation solve every 10 timesteps.)
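For a constant-density flow, the pressure Poisson equation referenced above follows from taking the divergence of the momentum equation and enforcing incompressibility (a standard derivation, not specific to ARCHES):

```latex
% Incompressible momentum equation (constant density); the viscous term
% drops out of the pressure equation because
% \nabla\cdot\nabla^{2}\mathbf{u} = \nabla^{2}(\nabla\cdot\mathbf{u}) = 0:
%   \partial_t \mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u}
%     = -\tfrac{1}{\rho}\nabla p + \nu\,\nabla^{2}\mathbf{u}
% Taking the divergence and using \nabla\cdot\mathbf{u} = 0:
\nabla^{2} p = -\rho\,\nabla\cdot\bigl[(\mathbf{u}\cdot\nabla)\mathbf{u}\bigr]
```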
12 NVIDIA AmgX: Linear Solvers on GPUs. Fast, scalable iterative GPU linear solvers: a flexible toolkit providing a GPU-accelerated Ax = b solver with a simple API for multiple application domains, scaling across multiple GPUs (potentially thousands). Key features: Ruge-Stüben algebraic multigrid; Krylov methods (CG, GMRES, BiCGStab); smoothers and solvers (block Jacobi, Gauss-Seidel, incomplete LU); flexible composition system; scalar or coupled block systems; MPI and OpenMP support; flexible, high-level C API; free for non-commercial use. Utah access via the Utah CUDA Center of Excellence.
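The AmgX C API itself is not reproduced here; as a sketch of the kind of Krylov solver listed above, here is a minimal Jacobi-preconditioned conjugate gradient in plain C++ (dense storage for brevity; production codes use sparse formats, multigrid preconditioners, and GPU kernels):

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense, for brevity only

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Jacobi-preconditioned CG for a symmetric positive-definite A.
Vec pcg(const Mat& A, const Vec& b, double tol = 1e-10, int maxit = 1000) {
    std::size_t n = b.size();
    Vec x(n, 0.0), r = b, z(n);            // x0 = 0, so r0 = b
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];  // M = diag(A)
    Vec p = z;
    double rz = dot(r, z);
    for (int k = 0; k < maxit && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return x;
}
```

The multigrid preconditioning mentioned on the previous slides replaces the diagonal scaling step with a V-cycle; the outer Krylov iteration is unchanged.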
13 Uintah-X Parting Thoughts: The Uintah DAG-based approach solves a wide range of problems with different communication requirements: fluid-structure interaction (MPM-ICE), combustion (ARCHES), radiation (RMCRT). Excellent scaling on many petascale supercomputers: Kraken, Titan, Stampede, Mira... Ability to use heterogeneous resources (GPU and MIC); tasks are scheduled dynamically according to run-time MPI communication costs. Future work: task stealing across CPU/device queues; improved GPU/MIC memory management.
18 Wasatch Energy-Equation Expressions Example (enthalpy diffusive flux): create individual expressions and register them; the expression tree captures the dependency specification and the execution order. Request calculation of any number of fields; the graph to accomplish this is created automatically!
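The register-then-request pattern above can be sketched as follows. This is a toy scalar version with hypothetical names: each expression is registered with the fields it depends on, and asking for one field walks the dependencies and derives the execution order automatically (the real ExprLib works on fields and builds a task graph rather than evaluating recursively):

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

class ExprRegistry {
public:
    using Values = std::map<std::string, double>;

    // Register an expression for `field`, declaring its dependencies.
    void register_expr(const std::string& field,
                       std::vector<std::string> deps,
                       std::function<double(const Values&)> eval) {
        deps_[field] = std::move(deps);
        eval_[field] = std::move(eval);
    }

    // Evaluate `field`, recursively evaluating its dependencies first.
    // `order`, if given, records the derived execution order.
    double request(const std::string& field, Values& values,
                   std::vector<std::string>* order = nullptr) {
        if (values.count(field)) return values[field];   // already computed
        for (const auto& d : deps_[field]) request(d, values, order);
        double v = eval_[field](values);
        values[field] = v;
        if (order) order->push_back(field);
        return v;
    }

private:
    std::map<std::string, std::vector<std::string>> deps_;
    std::map<std::string, std::function<double(const Values&)>> eval_;
};
```

Only the dependency declarations are written by hand; the ordering falls out of the graph, which is the point of the slide.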
19 Nebo: An Embedded Domain-Specific Language for HPC
27 Why an EDSL?
- Expressive syntax (MATLAB-style array operations): the programmer expresses intent (problem structure), not implementation.
- High performance: should match hand-tuned C++ in performance.
- Extensible: insulates the programmer from architecture changes (e.g. multicore, GPU...); the EDSL back-end compiles into code for the target architecture.
- Plays well with others: allows the programmer to write in C++ and inter-operate with the EDSL; not an all-or-none approach; allows concurrent development of the EDSL and application codes; adapts to appropriate data structures without impacting the application programmer.
- C++ template metaprogramming: embedded in C++; compile-time reasoning about the problem.
[Charts: expressiveness vs. efficiency, with MATLAB high on expressiveness, C++ high on efficiency, and the EDSL aiming for both; model, discretization, and architecture layered around the EDSL.]
29 Chained Stencil Operations, e.g. the diffusive flux q = -λ∇T and its divergence ∇·(λ∇T):

phi <<= divx( interpx(lambda) * gradx(temperature) )
      + divy( interpy(lambda) * grady(temperature) )
      + divz( interpz(lambda) * gradz(temperature) );

One inlined grid loop, no temporaries. Better parallel performance than without chaining. Compile-time consistency checking (field-operator and field-field compatibility).
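A stripped-down illustration of how such chaining can compile to a single loop: classic expression templates (not Nebo's actual implementation). Each operator builds a lightweight node type instead of a result array, and assignment walks every operand once per grid index, so no temporary fields are allocated:

```cpp
#include <cstddef>
#include <vector>

// A 1-D "field" standing in for a 3-D grid field.
struct Field {
    std::vector<double> data;
    explicit Field(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    // One inlined loop: the whole chained expression is evaluated per index.
    template <class Expr>
    void assign(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
    }
};

// Expression nodes: hold references, evaluate lazily per index.
template <class L, class R> struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};
template <class L, class R> struct Mul {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};
template <class L, class R> Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
template <class L, class R> Mul<L, R> operator*(const L& l, const R& r) { return {l, r}; }
```

A real EDSL adds stencil operators (div, grad, interp) as further node types and, as the slide notes, encodes field locations in the types so incompatible combinations fail at compile time.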
32 Performance & Scalability. Speedup using the DSL relative to other Uintah codes (ARCHES and ICE, sister codes in Uintah) on a 3D Taylor-Green vortex problem, run on a single processor. [Bar chart over several grid sizes.] Scalability: weak scaling of Wasatch/Uintah on Titan on a trillion-DOF problem. [Plot: #proc axis from 1,000 to 100,000; a value of 1 indicates perfect weak scaling.]
38 Multicore & GPU Performance. Diffusive flux example, q = -λ∇T:

phi <<= divx( -interpx(lambda) * gradx(t) )
      + divy( -interpy(lambda) * grady(t) )
      + divz( -interpz(lambda) * gradz(t) );

[GPU performance plot: 64x64x64 and 128x128x128 grids.]

Scalar transport right-hand side, -∇·(φu + J):

rhs <<= -divopx_( xconvflux_ + xdiffflux_ )
      - divopy_( yconvflux_ + ydiffflux_ )
      - divopz_( zconvflux_ + zdiffflux_ );

[GPU performance plot: 64x64x64 and 128x128x128 grids.]
40 Multicore & GPU: pointwise calculations (nucleation-rate example):

delg <<= 16.0 * PI / 3.0 * molecularvolume * molecularvolume
         * surfaceeng * surfaceeng * surfaceeng
         / (KB * KB * temperature * temperature * log(supersat) * log(supersat))
       + KB * temperature * log(supersat)
       - surfaceeng * pow(36.0 * PI * molecularvolume * molecularvolume, 1.0 / 3.0);

ic <<= 32.0 * PI / 3.0 * molecularvolume * molecularvolume
       * surfaceeng * surfaceeng * surfaceeng
       / (KB * KB * KB * temperature * temperature * temperature
          * log(supersat) * log(supersat) * log(supersat));

z  <<= sqrt( delg / (3.0 * PI * KB * temperature * ic * ic) );
kf <<= diffusioncoef * pow(48.0 * PI * PI * molecularvolume * ic, 1.0 / 3.0);
N1 <<= NA * eqconc * supersat;

result <<= cond( supersat > 1.0,
                 z * kf * N1 * N1 * exp( -delg / KB / temperature ) )
               ( 0.0 );

[GPU performance plot: 64x64x64 and 128x128x128 grids.]
44 (E)DSL Parting Thoughts:
- Hierarchical parallelization is key for scalability: domain decomposition (SPMD) should allow a node to compute on its interior while waiting on communication from neighbors; task decomposition (MIMD) decomposes the solution into a DAG that can be scheduled asynchronously; vectorized parallelism (SIMD) breaks grid operations across multicore, GPU, etc.
- The DAG representation is a scalable abstraction that handles problem complexity gracefully, provides a convenient separation of the problem's structure from its data, and allows sophisticated scheduling algorithms to optimize scalability & performance.
- (E)DSLs are very useful future-proofing: they separate intent from implementation, allow seamless transition of a code base, and template metaprogramming pushes work from run time to compile time for greater efficiency.
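The compile-time reasoning point can be illustrated with a tiny type-tag sketch (tag names hypothetical; Nebo's real checks cover field-operator and field-field compatibility across mesh locations):

```cpp
// A field's mesh location is carried in its type, so mixing incompatible
// fields is a compile error rather than a silent run-time bug.
struct CellCentered {};   // cell-centred location (hypothetical tag)
struct XFace {};          // x-face location (hypothetical tag)

template <class Location>
struct TypedField { double value; };

// Only fields at the same mesh location may be combined.
template <class Loc>
TypedField<Loc> operator+(const TypedField<Loc>& a, const TypedField<Loc>& b) {
    return {a.value + b.value};
}

// TypedField<CellCentered>{1.0} + TypedField<XFace>{2.0}
//   ^-- would not compile: no operator+ for mixed locations.
```

Because the check happens during template instantiation, there is no run-time cost: the work is pushed from run time to compile time, exactly as the slide claims.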
45 We build our data management solution on the most advanced technology in big-data streaming analytics and visualization. Live demonstration at SC12 & SC13: ~4 TB per time step (100s of PF 3D timesteps generated on Intrepid); streaming live from the ANL visualization cluster; interactive, immersive analysis and visualization; infrastructure that scales gracefully with available hardware resources.
46 High-Performance Data Movements for Real-Time Monitoring of Large-Scale Simulations. Hopper (fat tree), Intrepid (torus): simulation dumps scale to 130K cores with better performance than state-of-the-art libraries while enabling real-time, remote visualization. [Pipeline: simulation → ViSUS IDX file format → end user; compute nodes → storage nodes; real-time data analysis and remote visualization with user feedback.] [SC12a] Efficient Data Restructuring and Aggregation for IO Acceleration in PIDX
47 Flexible DMAV Architecture that Allows Exploiting the Possible Exascale Hardware Available.
- Location of the compute resources: same cores as the simulation; dedicated cores; dedicated nodes; an external resource.
- Data access, placement, and persistence: shared data structures; shared memory; non-volatile on-node storage (NVRAM); network transfer.
- Synchronization and scheduling: execute synchronously with the simulation every n-th simulation time step, or execute asynchronously.
[Diagram: nodes 1..N, each with CPUs, DRAM, NVRAM, SSD, and hard disk; staging options 1-3: sharing cores with the simulation, using distinct cores on the same node, or processing data on remote nodes over the network.]
48 Systematic Exploration of the Efficiency and Cost of Data Movements for In-Situ Data Structures. Different mappings of the data structure to nodes (e.g. TXYZ, tiltXY) on Sequoia (LLNL) with the PF3D simulation. [SC12a] Mapping Applications with Collectives over Sub-Communicators on Torus Networks
49 For Analysis and Visualization Algorithms We Determine the Effectiveness of In-Situ Task Allocations. In-situ + in-transit workflows enable matching algorithms with available architectures. PVR = Parallel Volume Rendering; SSA = Streaming Statistical Analysis; RTC = Reduced Topology Computation. 4,896 cores total (4,480 simulation/in-situ; 256 in-transit; 160 task scheduling/data movement). Simulation size: 1600x1372x430; all measurements are per simulation time step. [SC12b] Combining In-Situ and In-Transit Processing to Enable Extreme-Scale Scientific Analysis
50 Study of the High-Dimensional Space of Input, Tunable, and Output Parameters for (Auto)tuning.
- Independent variables (input): n, initial set of nodes; V, variable count; D, data size/resolution; M, machine specs/memory.
- Tunable parameters: n', nodes participating in aggregation; A, aggregator count; a_f, aggregation factor; F, file counts.
- Dependent variables (output): network throughput; I/O throughput; combined throughput.
- Intrepid (BlueGene/P): peak 557 teraflops/s; RAM 2 GiB/node (4 cores); GPFS file system; contiguous node allocation.
- Hopper (Cray XE6): peak 1.28 petaflops/s; RAM 36 GiB/node (24 cores); Lustre file system; noncontiguous node allocation.
[SC13a] Characterization and Modeling of PIDX Parallel I/O for Performance Optimization
52 We Evaluate the Feasibility of Exascale In-Situ Feature Extraction and Tracking and its Impact on Power Consumption. Analysis algorithms and the simulation have different communication/instruction profiles and memory access patterns. We develop power models with machine-independent characteristics (Byfl and MPI), validate the power models using empirical studies on instrumented platforms, and extrapolate to Titan. [SC13b] Exploring Power Behaviors and Trade-offs of In-situ Data Analytics
In-situ Visualization
In-situ Visualization Dr. Jean M. Favre Scientific Computing Research Group 13-01-2011 Outline Motivations How is parallel visualization done today Visualization pipelines Execution paradigms Many grids
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
Spring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms
PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms P. E. Vincent! Department of Aeronautics Imperial College London! 25 th March 2014 Overview Motivation Flux Reconstruction Many-Core
Mesh Generation and Load Balancing
Mesh Generation and Load Balancing Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee April 04, 2012 CS 594 04/04/2012 Slide 1 / 19 Outline Motivation Reliable
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado
AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS
AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS by M. Storti, L. Dalcín, R. Paz Centro Internacional de Métodos Numéricos en Ingeniería - CIMEC INTEC, (CONICET-UNL), Santa Fe, Argentina
Addressing Big Data Challenges in Simulation-based Science
Addressing Big Data Challenges in Simulation-based Science Manish Parashar* Rutgers Discovery Informatics Institute (RDI2) NSF Center for Cloud and Autonomic Computing (CAC) Department of Electrical and
NA-ASC-128R-14-VOL.2-PN
L L N L - X X X X - X X X X X Advanced Simulation and Computing FY15 19 Program Notebook for Advanced Technology Development & Mitigation, Computational Systems and Software Environment and Facility Operations
Solvers, Algorithms and Libraries (SAL)
Solvers, Algorithms and Libraries (SAL) D. Knoll (LANL), J. Wohlbier (LANL), M. Heroux (SNL), R. Hoekstra (SNL), S. Pautz (SNL), R. Hornung (LLNL), U. Yang (LLNL), P. Hovland (ANL), R. Mills (ORNL), S.
Parallel Computing for Data Science
Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint
Fast Multipole Method for particle interactions: an open source parallel library component
Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,
Performance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Hybrid Software Architectures for Big Data. [email protected] @hurence http://www.hurence.com
Hybrid Software Architectures for Big Data [email protected] @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Lecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model
5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model C99, C++, F2003 Compilers Optimizing Vectorizing Parallelizing Graphical parallel tools PGDBG debugger PGPROF profiler Intel, AMD, NVIDIA
HIGH PERFORMANCE BIG DATA ANALYTICS
HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning
Past, Present and Future Scalability of the Uintah Software
Past, Present and Future Scalability of the Uintah Software Martin Berzins [email protected] John Schmidt [email protected] Alan Humphrey [email protected] Qingyu Meng [email protected] ABSTRACT The
Big Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
Distributed communication-aware load balancing with TreeMatch in Charm++
Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration
How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx
Part I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
Performance Evaluation of Amazon EC2 for NASA HPC Applications!
National Aeronautics and Space Administration Performance Evaluation of Amazon EC2 for NASA HPC Applications! Piyush Mehrotra!! J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff,! S. Saini, R. Biswas!
Trends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp
Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from non-petascale? What has changed since
