Software and Performance Engineering of the Community Atmosphere Model
|
|
|
- Harry Dalton
- 10 years ago
- Views:
Transcription
1 Software and Performance Engineering of the Community Atmosphere Model Patrick H. Worley Oak Ridge National Laboratory Arthur A. Mirin Lawrence Livermore National Laboratory Science Team Meeting Climate Change Prediction Program April 24-26, 2006 Royal Sonesta Hotel Boston Cambridge, MA 1
2 Overview SciDAC-sponsored Community Atmosphere Model (CAM) software and performance engineering activities since last Science Team Meeting: Introducing and maintaining performance optimizations on both vector and nonvector computer systems as code evolves Improving portability and performance portability Measuring and analyzing performance Poster Outline CAM modifications and impacts CAM performance tuning options CAM performance and performance analysis Performance issues and future activities 2
3 Community Atmosphere Model (CAM) Atmospheric global circulation model Timestepping code with two primary phases per timestep Dynamics: advances evolution equations for atmospheric flow Physics: approximates subgrid phenomena, such as precipitation, clouds, radiation, turbulent mixing, Multiple options for dynamics: Spectral Eulerian (EUL) dynamical core (dycore) Spectral semi-lagrangian (SLD) dycore Finite-Volume semi-lagrangian (FV) dycore all using tensor product latitude x longitude x vertical level grid over the sphere, but not same grid, same placement of variables on grid, or same domain decomposition in parallel implementation Separate data structures for dynamics and physics and explicit data movement between them each timestep (in a coupler ) Developed at NCAR, with contributions from DOE and NASA 3
4 CAM Performance Experiments 1. Spectral Eulerian dycore running on T85L26 computational grid 128x256x26 (latitude by longitude by vertical) grid Current production dynamical core and grid resolution in CCSM 2. Finite Volume dycore running on 1.9x2.5 degree horizontal grid with 26 vertical levels 96x144x26 (latitude by longitude by vertical) grid Finite volume dycore is the preferred (required among current options) dycore for atmospheric chemistry due to its conservation properties. 1.9x2.5 degree resolution is the initial CCSM production grid size. 3. Finite Volume dycore running on 0.5x0.625 degree horizontal grid ( D grid ) with 26 vertical levels 361x576x26 (latitude by longitude by vertical) grid 15 times larger than FV production grid resolution. 4
5 Experimental Platforms Cray X1 at Oak Ridge National Laboratory (ORNL): way vector SMP nodes and a 4-D hypercube interconnect. Each processor has 8 64-bit floating point vector units running at 800 MHz. Cray X1E at ORNL: way vector SMP nodes and a 4-D hypercube interconnect. Each processor has 8 64-bit floating point vector units running at 1.13 GHz. Cray XT3 at ORNL: 5294 single processor nodes (2.4 GHz AMD Opteron) and a 3-D torus interconnect. Earth Simulator: way vector SMP nodes and a 640x640 single-stage crossbar interconnect. Each processor has 8 64-bit floating point vector units running at 500 MHz. IBM p575 cluster at the National Energy Research Scientific Computing Center (NERSC): way p575 SMP nodes (1.9 GHz POWER5) and an HPS interconnect with 1 two-link network adapter per node. IBM p690 cluster at ORNL: way p690 SMP nodes (1.3 GHz POWER4) and an HPS interconnect with 2 two-link network adapters per node. IBM SP at NERSC: 184 Nighthawk II 16-way SMP nodes (375MHz POWER3-II) and an SP Switch2 with two network adapters per node. Itanium2 cluster at Lawrence Livermore National Laboratory (LLNL): way Tiger4 nodes (1.4 GHz Intel Itanium 2) and a Quadrics QsNetII Elan4 interconnect. SGI Altix 3700 at ORNL: way SMP nodes and a NUMAflex fat-tree interconnect. Each processor is a 1.5 GHz Itanium 2 with a 6 MB L3 cache. SGI Altix 3700 Bx2 at NASA: way SMP nodes and a NUMAflex fat-tree interconnect. Each processor is a 1.6 GHz Itanium 2 with a 9 MB L3 cache. 5
6 CAM Modifications 11/1/04 (cam3_0_21): Additional communication optimization options in spectral dycores and in physics load balancing. 1/6/05 (cam3_0_24): Load balancing and communication optimizations in spectral dycores and physics when using reduced horizontal grid. 1/20/05 (cam3_0_27): Improved support for math library FFTs; Additional vectorization in the FV dycore; Bug fixes for X1. 3/15/05 (cam3_0_34 / cam3.1): Restoration of X1 performance lost in earlier check-in; Serial performance optimizations; Additional vectorization in physics; Support for SSP mode on X1. 8/5/05 (cam3_2_11): Improved support for math library FFTs; Additional vectorization and streaming in the FV dycore. 6
7 CAM Modifications 9/6/05 (cam3_2_19): Restoration of X1E performance lost after radiation routine refactoring; Initial XT3 optimizations. 9/29/05 (cam3_2_23): Explicit typing of variables and constants to improve portability. 10/12/05 (cam3_2_26): Optimization and vectorization of new code in FV dycore (added as part of GEOS5/CAM merge). Vectorization of buffer copies in communication utility layer. 11/29/05 (cam3_2_38): Restoration of FV optimization options lost in GEOS5/CAM merge. 4/2/06 (cam3_2_57): Optimization and vectorization of code in CAM, CLM2, and MCT used in new atmosphere/land interface. 7
8 CAM Performance History: X1E Performance impact of each SciDAC check-in on the Cray X1E. Not all check-ins improved performance, nor were expected to - some improved portability, added new performance tuning options, or fixed bugs. Maintaining performance as CAM evolves is (and will be) as important as further improving current performance. 8
9 CAM Performance History: XT3 Performance impact of each SciDAC check-in on the Cray XT3. Beyond initial optimization (cam3_2_19), little attention has been paid to XT3 performance. This will change as we move to higher resolutions and increase exploitable parallelism in the physics. 9
10 CAM Performance Portability Goals 1) Maximize single processor performance, e.g. a) Optimize memory access patterns b) Maximize vectorization or other fine-grain parallelism 2) Minimize parallel overhead, e.g. a) Minimize communication costs b) Minimize load imbalance c) Minimize redundant computation for a range of target systems, a range of problem specifications (grid size, physical processes, ) a range of processor counts while preserving maintainability and extensibility. No optimal solution for all desired (platform,problem,processor count) specifications. Approach: compile-time and runtime optimization options. 10
11 CAM Performance Optimization Options 1. Physics data structures Index range, dimension declaration 2. Physics load balance Variety of load balancing options, with different communication overheads SMP-aware load balancing options 3. Communication options MPI protocols (two-sided and one-sided) Co-Array Fortran SHMEM protocols and choice of pt-2-pt implementations or collective communication operators 4. OpenMP parallelism Instead of some MPI parallelism In addition to MPI parallelism 5. Aspect ratio of dynamics 2D domain decomposition (FV-only) 11
12 Performance Tuning Physics Data Structures pcols parameter determines vector length and cache locality in physics. Altix: minimum at pcols = 8 p575 : minimum at pcols = 80 p690: minimum at pcols = 24 X1E : minimum at pcols = 514 or 1026 XT3: minimum at pcols = 34 pcols <= 4 bad for all systems. 12
13 Evaluating Load Balancing: FV / D Grid On the IBM p690, full load balancing is always best, but no load balancing is, at worst, only 7% slower. On the Cray XT3, full load balancing is usually best, but pairwise exchange load balancing is competitive. No load balancing is usually more than 4% slower, and as much as 12% slower. The best example of the advantage of full load balancing is on the Cray X1E. It is so much better that it was clear early on in the tuning process and the other load balancing options were pruned from the search tree. 13
14 1D vs. 2D Decompositions: FV / D Grid 2D decomposition is superior to 1D when number of MPI processes decomposing latitude in 1D exceeds some limit, ~70 on two Cray systems (not using OpenMP). Similarly, (P/7)x7 decomposition outperforms (P/4)x4 when the number of MPI processes decomposing latitude in (P/4)x4 exceeds ~70. On IBM p690, OpenMP and the 1D decomposition is superior to 2D decompositions up to 672 processors, even when the optimum for 1D uses more threads per process than the optimum for 2D. (Note: 1D never uses more than 84 MPI processes.) 14
15 MPI vs. OpenMP: FV / D Grid on p690 cluster Primary utility of OpenMP is to extend scalability. For 2D decomposition, fewer threads is generally better for the same number of processors. For 1D decomposition, it is more important not to let the number of processors decomposing latitude exceed approx. 64 than it is to minimize the number of threads. 15
16 Performance Tuning Impact: EUL / T85L26 CCM mode - tuning option settings representative of what was used in CAM s predecessor. Big win on IBM, primarily due to ability to use OpenMP parallelism in physics when dynamics parallelism exhausted (at approximately 128 processors). As vector length for CCM mode is near optimal on the X1E, performance difference is primarily due to load balancing. 16
17 Platform Comparisons: FV / D Grid Earth Simulator results courtesy of D. Parks. SP results courtesy of M. Wehner. Maximum number of MPI processes is 960. IBM systems and Earth Simulator use OpenMP to increase scalability. Recent performance optimizations backported into CAM 3.1 for these experiments. 17
18 Performance Diagnosis: X1E vs. XT3 Dynamics and physics scaling similarly on XT3. On X1E, physics scaling much better. (Dynamics best performance at 448 processors.) 18
19 Performance Diagnosis: p575 vs. X1E At large scale, dynamics performance on p575 similar to performance on X1E. However, p575 uses OpenMP and X1E does not in these experiments. P575 using a coarser dynamics domain decomposition for same processor count. 19
20 Current Limits to FV Scalability 1. Polar filter introduces load imbalances, especially on vector systems (because the small number of short FFTs do not vectorize well). 2. Requirement that at least 3 latitudes and 3 vertical levels be present in each block of domain decomposition within FV dycore. For D grid with 26 vertical levels, limit is 120x8 processor grid, or 960 MPI processes. For 1.9x2.5 degree grid, limit is 32x8 processor grid, or 256 MPI processes. 3. Physics can use many more processors, but currently limited to the same number of MPI processes as the dynamical core. OpenMP can be used to assign more processors to physics than to dynamics, mitigating this to some degree. There is also some OpenMP parallelism available within the dynamics. 4. On vector systems, additional parallelism in physics is of limited utility, as vector length drop below 220 for D grid when using more than 960 processors (and drops below 110 in radiation routines). 20
21 New Issue: Tracer Transport Atmospheric chemistry introduces not only additional cost in the physics, but also requires the advection of many new fields. Experiment measures CAM runtime when advecting additional fields for 1x1.25 grid using a 48x7 processor grid on the LLNL Itanium2 cluster (Thunder) and a 48x4 processor grid on the X1E. Cost per tracer is minimally 1-2%, so more than doubles CAM runtime when advecting 100 new fields. 21
22 New and Planned Activities 1. Dynamics Scaling Generalize dynamics/physics interface to support dycores not using lon/lat grid. Investigate new, more scalable dycores, for example, FV on a cubed sphere computational grid. 2. Physics Scaling Add support to use different numbers of MPI processes in the dynamics and in the physics (generalizing current OpenMP approach to pure MPI codes). 3. Atmospheric chemistry Investigate parallelizing advection over species and optimizing interprocess communication. Investigate parallelizing chemistry over species. Investigate 3D decomposition for chemistry. 4. Increasing model resolution exacerbates the current I/O bottlenecks and memory impact of the (few) remaining global arrays. Investigate parallel I/O on target platforms. 22
23 Source Material A. Mirin and W. Sawyer, A Scalable Implementation of a Finite-Volume Dynamical Core in the Community Atmosphere Model, International Journal for High Performance Computer Applications, 19(3), August 2005, pp W. Putman, S-J. Lin, and B-W. Shen, Cross-Platform Performance of a Portable Communication Module and the NASA Finite Volume General Circulation Model, International Journal for High Performance Computer Applications, 19(3), August 2005, pp L. Oliker, J. Carter, M. Wehner, A. Canning, S. Ethier, A. Mirin, G. Bala, D. Parks, P. Worley, S. Kitawaki, and Y. Tsuda, Leading Computational Methods on Scalar and Vector HEC Platforms, in Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking and Storage (SC05), Seattle, WA, November 12-18, P. Worley and J. Drake, Performance Portability in the Physical Parameterizations of the Community Atmospheric Model, International Journal for High Performance Computer Applications, 19(3), August 2005, pp P. Worley, Benchmarking using the Community Atmosphere Model, in Proceedings of the 2006 SPEC Benchmark Workshop, Austin, TX, January 23,
24 Acknowledgements Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC and Contract No. W-7405-Eng-48 with the University of California Lawrence Livermore National Laboratory. These slides have been authored by contractors of the U.S. Government under contracts No. DE-AC05-00OR22725 and No. W-7405-Eng-48. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes. This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725, and of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC03-76SF
Performance Engineering of the Community Atmosphere Model
Performance Engineering of the Community Atmosphere Model Patrick H. Worley Oak Ridge National Laboratory Arthur A. Mirin Lawrence Livermore National Laboratory 11th Annual CCSM Workshop June 20-22, 2006
Mariana Vertenstein. NCAR Earth System Laboratory CESM Software Engineering Group (CSEG) NCAR is sponsored by the National Science Foundation
Leveraging the New CESM1 CPL7 Architecture - Current and Future Challenges Mariana Vertenstein NCAR Earth System Laboratory CESM Software Engineering Group (CSEG) NCAR is sponsored by the National Science
Performance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
OpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl [email protected] Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
Multicore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
UTS: An Unbalanced Tree Search Benchmark
UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks
Lecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
Performance of the JMA NWP models on the PC cluster TSUBAME.
Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)
The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation
The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation Mark Baker, Garry Smith and Ahmad Hasaan SSE, University of Reading Paravirtualization A full assessment of paravirtualization
64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
Climate-Weather Modeling Studies Using a Prototype Global Cloud-System Resolving Model
ANL/ALCF/ESP-13/1 Climate-Weather Modeling Studies Using a Prototype Global Cloud-System Resolving Model ALCF-2 Early Science Program Technical Report Argonne Leadership Computing Facility About Argonne
On-Chip Interconnection Networks Low-Power Interconnect
On-Chip Interconnection Networks Low-Power Interconnect William J. Dally Computer Systems Laboratory Stanford University ISLPED August 27, 2007 ISLPED: 1 Aug 27, 2007 Outline Demand for On-Chip Networks
How To Compare Amazon Ec2 To A Supercomputer For Scientific Applications
Amazon Cloud Performance Compared David Adams Amazon EC2 performance comparison How does EC2 compare to traditional supercomputer for scientific applications? "Performance Analysis of High Performance
A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
BSC - Barcelona Supercomputer Center
Objectives Research in Supercomputing and Computer Architecture Collaborate in R&D e-science projects with prestigious scientific teams Manage BSC supercomputers to accelerate relevant contributions to
The CNMS Computer Cluster
The CNMS Computer Cluster This page describes the CNMS Computational Cluster, how to access it, and how to use it. Introduction (2014) The latest block of the CNMS Cluster (2010) Previous blocks of the
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
Load Imbalance Analysis
With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization
Large Scale Simulation on Clusters using COMSOL 4.2
Large Scale Simulation on Clusters using COMSOL 4.2 Darrell W. Pepper 1 Xiuling Wang 2 Steven Senator 3 Joseph Lombardo 4 David Carrington 5 with David Kan and Ed Fontes 6 1 DVP-USAFA-UNLV, 2 Purdue-Calumet,
Performance Characteristics of Large SMP Machines
Performance Characteristics of Large SMP Machines Dirk Schmidl, Dieter an Mey, Matthias S. Müller [email protected] Rechen- und Kommunikationszentrum (RZ) Agenda Investigated Hardware Kernel Benchmark
Scalability evaluation of barrier algorithms for OpenMP
Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science
Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
Parallel Analysis and Visualization on Cray Compute Node Linux
Parallel Analysis and Visualization on Cray Compute Node Linux David Pugmire, Oak Ridge National Laboratory and Hank Childs, Lawrence Livermore National Laboratory and Sean Ahern, Oak Ridge National Laboratory
From Hypercubes to Dragonflies a short history of interconnect
From Hypercubes to Dragonflies a short history of interconnect William J. Dally Computer Science Department Stanford University IAA Workshop July 21, 2008 IAA: # Outline The low-radix era High-radix routers
1 Bull, 2011 Bull Extreme Computing
1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance
Highly Scalable Dynamic Load Balancing in the Atmospheric Modeling System COSMO-SPECS+FD4
Center for Information Services and High Performance Computing (ZIH) Highly Scalable Dynamic Load Balancing in the Atmospheric Modeling System COSMO-SPECS+FD4 PARA 2010, June 9, Reykjavík, Iceland Matthias
FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling
Center for Information Services and High Performance Computing (ZIH) FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling Symposium on HPC and Data-Intensive Applications in Earth
Supercomputing 2004 - Status und Trends (Conference Report) Peter Wegner
(Conference Report) Peter Wegner SC2004 conference Top500 List BG/L Moors Law, problems of recent architectures Solutions Interconnects Software Lattice QCD machines DESY @SC2004 QCDOC Conclusions Technical
ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008
ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element
Distributed communication-aware load balancing with TreeMatch in Charm++
Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration
High Performance Computing
High Performance Computing Trey Breckenridge Computing Systems Manager Engineering Research Center Mississippi State University What is High Performance Computing? HPC is ill defined and context dependent.
Large Vector-Field Visualization, Theory and Practice: Large Data and Parallel Visualization Hank Childs + D. Pugmire, D. Camp, C. Garth, G.
Large Vector-Field Visualization, Theory and Practice: Large Data and Parallel Visualization Hank Childs + D. Pugmire, D. Camp, C. Garth, G. Weber, S. Ahern, & K. Joy Lawrence Berkeley National Laboratory
On the Importance of Thread Placement on Multicore Architectures
On the Importance of Thread Placement on Multicore Architectures HPCLatAm 2011 Keynote Cordoba, Argentina August 31, 2011 Tobias Klug Motivation: Many possibilities can lead to non-deterministic runtimes...
BLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
Scaling Study of LS-DYNA MPP on High Performance Servers
Scaling Study of LS-DYNA MPP on High Performance Servers Youn-Seo Roh Sun Microsystems, Inc. 901 San Antonio Rd, MS MPK24-201 Palo Alto, CA 94303 USA [email protected] 17-25 ABSTRACT With LS-DYNA MPP,
A Case Study - Scaling Legacy Code on Next Generation Platforms
Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 00 (2015) 000 000 www.elsevier.com/locate/procedia 24th International Meshing Roundtable (IMR24) A Case Study - Scaling Legacy
Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp
Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University [email protected] COMP
Part I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) Massively Parallel Multilevel Finite
Recommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
Kashif Iqbal - PhD [email protected]
HPC/HTC vs. Cloud Benchmarking An empirical evalua.on of the performance and cost implica.ons Kashif Iqbal - PhD [email protected] ICHEC, NUI Galway, Ireland With acknowledgment to Michele MicheloDo
Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp
Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from non-petascale? What has changed since
LS DYNA Performance Benchmarks and Profiling. January 2009
LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The
A Pattern-Based Approach to. Automated Application Performance Analysis
A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,
HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect [email protected]
HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect [email protected] EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Gabriele Jost and Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 {gjost,hjin}@nas.nasa.gov
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Performance and Scalability of the NAS Parallel Benchmarks in Java
Performance and Scalability of the NAS Parallel Benchmarks in Java Michael A. Frumkin, Matthew Schultz, Haoqiang Jin, and Jerry Yan NASA Advanced Supercomputing (NAS) Division NASA Ames Research Center,
Performance Analysis of a Hybrid MPI/OpenMP Application on Multi-core Clusters
Performance Analysis of a Hybrid MPI/OpenMP Application on Multi-core Clusters Martin J. Chorley a, David W. Walker a a School of Computer Science and Informatics, Cardiff University, Cardiff, UK Abstract
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
Barry Bolding, Ph.D. VP, Cray Product Division
Barry Bolding, Ph.D. VP, Cray Product Division 1 Corporate Overview Trends in Supercomputing Types of Supercomputing and Cray s Approach The Cloud The Exascale Challenge Conclusion 2 Slide 3 Seymour Cray
High Performance Computing. Course Notes 2007-2008. HPC Fundamentals
High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
Linux tools for debugging and profiling MPI codes
Competence in High Performance Computing Linux tools for debugging and profiling MPI codes Werner Krotz-Vogel, Pallas GmbH MRCCS September 02000 Pallas GmbH Hermülheimer Straße 10 D-50321
Sourcery Overview & Virtual Machine Installation
Sourcery Overview & Virtual Machine Installation Damian Rouson, Ph.D., P.E. Sourcery, Inc. www.sourceryinstitute.org Sourcery, Inc. About Us Sourcery, Inc., is a software consultancy founded by and for
Dell High-Performance Computing Clusters and Reservoir Simulation Research at UT Austin. http://www.dell.com/clustering
Dell High-Performance Computing Clusters and Reservoir Simulation Research at UT Austin Reza Rooholamini, Ph.D. Director Enterprise Solutions Dell Computer Corp. [email protected] http://www.dell.com/clustering
Parallel Software usage on UK National HPC Facilities 2009-2015: How well have applications kept up with increasingly parallel hardware?
Parallel Software usage on UK National HPC Facilities 2009-2015: How well have applications kept up with increasingly parallel hardware? Dr Andrew Turner EPCC University of Edinburgh Edinburgh, UK [email protected]
Lustre Networking BY PETER J. BRAAM
Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information
Architecture of Hitachi SR-8000
Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data
WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX620 S6
WHITE PAPER PERFORMANCE REPORT PRIMERGY BX620 S6 WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX620 S6 This document contains a summary of the benchmarks executed for the PRIMERGY BX620
NA-ASC-128R-14-VOL.2-PN
L L N L - X X X X - X X X X X Advanced Simulation and Computing FY15 19 Program Notebook for Advanced Technology Development & Mitigation, Computational Systems and Software Environment and Facility Operations
An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations. Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1
An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1 1 University of California, Davis 2 University of Nebraska, Lincoln
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.
LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability
Symmetric Multiprocessing
Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called
Developing Continuous SCM/CRM Forcing Using NWP Products Constrained by ARM Observations
Developing Continuous SCM/CRM Forcing Using NWP Products Constrained by ARM Observations S. C. Xie, R. T. Cederwall, and J. J. Yio Lawrence Livermore National Laboratory Livermore, California M. H. Zhang
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
- An Essential Building Block for Stable and Reliable Compute Clusters
Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative
DIY Parallel Data Analysis
I have had my results for a long time, but I do not yet know how I am to arrive at them. Carl Friedrich Gauss, 1777-1855 DIY Parallel Data Analysis APTESC Talk 8/6/13 Image courtesy pigtimes.com Tom Peterka
