Software and Performance Engineering of the Community Atmosphere Model

Patrick H. Worley, Oak Ridge National Laboratory
Arthur A. Mirin, Lawrence Livermore National Laboratory

Science Team Meeting, Climate Change Prediction Program
April 24-26, 2006
Royal Sonesta Hotel Boston, Cambridge, MA
Overview

SciDAC-sponsored Community Atmosphere Model (CAM) software and performance engineering activities since the last Science Team Meeting:
- Introducing and maintaining performance optimizations on both vector and nonvector computer systems as the code evolves
- Improving portability and performance portability
- Measuring and analyzing performance

Poster Outline
- CAM modifications and impacts
- CAM performance tuning options
- CAM performance and performance analysis
- Performance issues and future activities
Community Atmosphere Model (CAM)

- Atmospheric global circulation model
- Timestepping code with two primary phases per timestep:
  - Dynamics: advances the evolution equations for atmospheric flow
  - Physics: approximates subgrid phenomena, such as precipitation, clouds, radiation, and turbulent mixing
- Multiple options for the dynamics:
  - Spectral Eulerian (EUL) dynamical core (dycore)
  - Spectral semi-Lagrangian (SLD) dycore
  - Finite-Volume semi-Lagrangian (FV) dycore
  All use a tensor product latitude x longitude x vertical level grid over the sphere, but not the same grid, the same placement of variables on the grid, or the same domain decomposition in the parallel implementation.
- Separate data structures for dynamics and physics, with explicit data movement between them each timestep (in a "coupler")
- Developed at NCAR, with contributions from DOE and NASA
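The dynamics/coupler/physics cycle above can be sketched as a toy timestep loop. This is an illustrative sketch only: the routine names and the toy "dynamics" and "physics" equations are invented for the example, not CAM's actual code.

```python
# Illustrative sketch of CAM's per-timestep control flow: dynamics advances
# the flow, a coupler moves state between the separate dynamics and physics
# data structures, and physics applies subgrid forcing. All names and
# equations here are hypothetical, not CAM source.

def advance_dynamics(state, dt):
    # Toy "dynamics": advect temperature at the local wind speed.
    return {"t": state["t"] + dt * state["wind"], "wind": state["wind"]}

def dyn_to_phys(dyn):
    # Coupler: copy dynamics state into the physics data structure.
    return {"temperature": dyn["t"]}

def run_physics(phys, dt):
    # Toy "physics": relax temperature toward a reference value.
    phys["temperature"] += dt * (300.0 - phys["temperature"]) * 0.1
    return phys

def phys_to_dyn(dyn, phys):
    # Coupler: copy the physics update back to the dynamics structure.
    dyn["t"] = phys["temperature"]
    return dyn

def step(state, dt):
    """One timestep: dynamics, then coupler, then physics, then coupler."""
    dyn = advance_dynamics(state, dt)
    phys = run_physics(dyn_to_phys(dyn), dt)
    return phys_to_dyn(dyn, phys)
```

The explicit copies in `dyn_to_phys`/`phys_to_dyn` are the point: because each dycore lays out its grid differently, the physics keeps its own data structures, and the per-timestep data movement between them is where decomposition and load-balancing choices interact.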
CAM Performance Experiments

1. Spectral Eulerian dycore running on the T85L26 computational grid
   - 128x256x26 (latitude by longitude by vertical) grid
   - Current production dynamical core and grid resolution in CCSM
2. Finite Volume dycore running on a 1.9x2.5 degree horizontal grid with 26 vertical levels
   - 96x144x26 (latitude by longitude by vertical) grid
   - The finite volume dycore is the preferred (required, among current options) dycore for atmospheric chemistry due to its conservation properties.
   - The 1.9x2.5 degree resolution is the initial CCSM production grid size.
3. Finite Volume dycore running on a 0.5x0.625 degree horizontal grid ("D" grid) with 26 vertical levels
   - 361x576x26 (latitude by longitude by vertical) grid
   - 15 times larger than the FV production grid resolution.
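The "15 times larger" claim follows directly from the stated grid dimensions:

```python
# Check the grid-size ratio between the FV "D" grid and the FV production
# grid, using the dimensions given above.
fv_prod = 96 * 144 * 26    # FV 1.9x2.5 degree production grid
fv_d = 361 * 576 * 26      # FV 0.5x0.625 degree "D" grid

ratio = fv_d / fv_prod
print(round(ratio, 1))     # ~15, matching the slide's claim
```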
Experimental Platforms

- Cray X1 at Oak Ridge National Laboratory (ORNL): 128 4-way vector SMP nodes and a 4-D hypercube interconnect. Each processor has 8 64-bit floating point vector units running at 800 MHz.
- Cray X1E at ORNL: 256 4-way vector SMP nodes and a 4-D hypercube interconnect. Each processor has 8 64-bit floating point vector units running at 1.13 GHz.
- Cray XT3 at ORNL: 5294 single-processor nodes (2.4 GHz AMD Opteron) and a 3-D torus interconnect.
- Earth Simulator: 640 8-way vector SMP nodes and a 640x640 single-stage crossbar interconnect. Each processor has 8 64-bit floating point vector units running at 500 MHz.
- IBM p575 cluster at the National Energy Research Scientific Computing Center (NERSC): 122 8-way p575 SMP nodes (1.9 GHz POWER5) and an HPS interconnect with 1 two-link network adapter per node.
- IBM p690 cluster at ORNL: 27 32-way p690 SMP nodes (1.3 GHz POWER4) and an HPS interconnect with 2 two-link network adapters per node.
- IBM SP at NERSC: 184 Nighthawk II 16-way SMP nodes (375 MHz POWER3-II) and an SP Switch2 with two network adapters per node.
- Itanium2 cluster at Lawrence Livermore National Laboratory (LLNL): 1024 4-way Tiger4 nodes (1.4 GHz Intel Itanium 2) and a Quadrics QsNetII Elan4 interconnect.
- SGI Altix 3700 at ORNL: 128 2-way SMP nodes and a NUMAflex fat-tree interconnect. Each processor is a 1.5 GHz Itanium 2 with a 6 MB L3 cache.
- SGI Altix 3700 Bx2 at NASA: 1024 2-way SMP nodes and a NUMAflex fat-tree interconnect. Each processor is a 1.6 GHz Itanium 2 with a 9 MB L3 cache.
CAM Modifications

- 11/1/04 (cam3_0_21): Additional communication optimization options in the spectral dycores and in physics load balancing.
- 1/6/05 (cam3_0_24): Load balancing and communication optimizations in the spectral dycores and physics when using the reduced horizontal grid.
- 1/20/05 (cam3_0_27): Improved support for math library FFTs; additional vectorization in the FV dycore; bug fixes for the X1.
- 3/15/05 (cam3_0_34 / cam3.1): Restoration of X1 performance lost in an earlier check-in; serial performance optimizations; additional vectorization in the physics; support for SSP mode on the X1.
- 8/5/05 (cam3_2_11): Improved support for math library FFTs; additional vectorization and streaming in the FV dycore.
CAM Modifications (continued)

- 9/6/05 (cam3_2_19): Restoration of X1E performance lost after the radiation routine refactoring; initial XT3 optimizations.
- 9/29/05 (cam3_2_23): Explicit typing of variables and constants to improve portability.
- 10/12/05 (cam3_2_26): Optimization and vectorization of new code in the FV dycore (added as part of the GEOS5/CAM merge); vectorization of buffer copies in the communication utility layer.
- 11/29/05 (cam3_2_38): Restoration of FV optimization options lost in the GEOS5/CAM merge.
- 4/2/06 (cam3_2_57): Optimization and vectorization of code in CAM, CLM2, and MCT used in the new atmosphere/land interface.
CAM Performance History: X1E

Performance impact of each SciDAC check-in on the Cray X1E. Not all check-ins improved performance, nor were all expected to: some improved portability, added new performance tuning options, or fixed bugs. Maintaining performance as CAM evolves is (and will be) as important as further improving current performance.
CAM Performance History: XT3

Performance impact of each SciDAC check-in on the Cray XT3. Beyond the initial optimization (cam3_2_19), little attention has been paid to XT3 performance. This will change as we move to higher resolutions and increase the exploitable parallelism in the physics.
CAM Performance Portability Goals

1) Maximize single processor performance, e.g.
   a) Optimize memory access patterns
   b) Maximize vectorization or other fine-grain parallelism
2) Minimize parallel overhead, e.g.
   a) Minimize communication costs
   b) Minimize load imbalance
   c) Minimize redundant computation

for a range of target systems, a range of problem specifications (grid size, physical processes, ...), and a range of processor counts, while preserving maintainability and extensibility.

There is no optimal solution for all desired (platform, problem, processor count) specifications. Approach: compile-time and runtime optimization options.
CAM Performance Optimization Options

1. Physics data structures
   - Index range, dimension declaration
2. Physics load balance
   - Variety of load balancing options, with different communication overheads
   - SMP-aware load balancing options
3. Communication options
   - MPI protocols (two-sided and one-sided), Co-Array Fortran, and SHMEM protocols, with a choice of point-to-point implementations or collective communication operators
4. OpenMP parallelism
   - Instead of some MPI parallelism
   - In addition to MPI parallelism
5. Aspect ratio of the dynamics 2D domain decomposition (FV only)
Performance Tuning: Physics Data Structures

The pcols parameter determines vector length and cache locality in the physics. Runtime is minimized at:
- Altix: pcols = 8
- p575: pcols = 80
- p690: pcols = 24
- X1E: pcols = 514 or 1026
- XT3: pcols = 34

pcols <= 4 is bad for all systems.
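The role of pcols can be illustrated with a simple chunking sketch: the physics columns assigned to a process are grouped into blocks whose inner dimension is at most pcols. This is a minimal round-up chunking model, an assumption for illustration rather than CAM's actual chunking code.

```python
def chunk_columns(ncols, pcols):
    """Split ncols physics columns into chunks of at most pcols columns.

    pcols fixes the inner (vector/cache-blocking) dimension of the physics
    data structures. Small pcols favors cache locality (e.g. pcols = 8 on
    the Altix); large pcols favors long vectors (e.g. pcols = 514 on the
    X1E). The last chunk may be only partially filled.
    """
    nchunks = (ncols + pcols - 1) // pcols            # round up
    return [min(pcols, ncols - i * pcols) for i in range(nchunks)]

# Example: 13824 columns of the 1.9x2.5 degree grid with the p575's
# best-measured pcols of 80.
sizes = chunk_columns(13824, 80)
```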
Evaluating Load Balancing: FV / D Grid

On the IBM p690, full load balancing is always best, but no load balancing is, at worst, only 7% slower. On the Cray XT3, full load balancing is usually best, but pairwise exchange load balancing is competitive; no load balancing is usually more than 4% slower, and as much as 12% slower. The advantage of full load balancing is greatest on the Cray X1E: it was so much better that this became clear early in the tuning process, and the other load balancing options were pruned from the search tree.
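The intuition behind pairwise-exchange balancing can be sketched as follows: physics cost varies with latitude and solar angle, so each process trades work with a single mirror partner (for instance, a northern band with its southern counterpart), which bounds the communication to one partner per process. This is a toy cost model of the idea, not CAM's implementation.

```python
def pairwise_partners(nprocs):
    """Partner map for pairwise-exchange balancing: process p trades work
    only with process nprocs-1-p, so each process communicates with a
    single partner (illustrative pairing rule, assumed for this sketch)."""
    return [nprocs - 1 - p for p in range(nprocs)]

def balanced_cost(costs):
    """Per-process cost after each pair evenly splits its combined load
    (toy model of what the exchange buys)."""
    n = len(costs)
    return [(costs[p] + costs[n - 1 - p]) / 2.0 for p in range(n)]

# Example: imbalanced physics costs across 4 processes. The runtime is set
# by the most loaded process, which the pairing reduces.
costs = [4.0, 3.0, 1.0, 2.0]
after = balanced_cost(costs)   # max load drops from 4.0 to 3.0
```

Full load balancing redistributes columns globally instead, which balances better (as seen on the X1E) at the price of all-to-all-style communication.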
1D vs. 2D Decompositions: FV / D Grid

The 2D decomposition is superior to 1D when the number of MPI processes decomposing latitude in 1D exceeds some limit, ~70 on the two Cray systems (not using OpenMP). Similarly, a (P/7)x7 decomposition outperforms (P/4)x4 when the number of MPI processes decomposing latitude in (P/4)x4 exceeds ~70. On the IBM p690, OpenMP with the 1D decomposition is superior to 2D decompositions up to 672 processors, even when the optimum for 1D uses more threads per process than the optimum for 2D. (Note: 1D never uses more than 84 MPI processes.)
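The crossover rule above can be captured in a small helper. This encodes the slide's rule of thumb (switch to 2D once the latitude dimension would be split more than ~70 ways), as an illustration only, not CAM's actual decomposition-selection logic.

```python
def choose_shape(P, nvert=4, crossover=70):
    """Pick a 1D or (P/nvert) x nvert 2D FV decomposition.

    Rule of thumb from the Cray measurements: 1D (all P processes across
    latitude) wins until latitude would be split more than ~70 ways; past
    that, spread nvert processes across the vertical instead. Illustrative
    heuristic, assumed for this sketch.
    """
    if P <= crossover:
        return (P, 1)                  # 1D: all processes across latitude
    assert P % nvert == 0, "P must be divisible by the vertical count"
    return (P // nvert, nvert)         # 2D: latitude x vertical

# e.g. choose_shape(64) stays 1D; choose_shape(256) becomes 64x4.
```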
MPI vs. OpenMP: FV / D Grid on the p690 cluster

The primary utility of OpenMP is to extend scalability. For the 2D decomposition, fewer threads is generally better for the same number of processors. For the 1D decomposition, it is more important not to let the number of MPI processes decomposing latitude exceed approximately 64 than it is to minimize the number of threads.
Performance Tuning Impact: EUL / T85L26

CCM mode: tuning option settings representative of those used in CAM's predecessor, the Community Climate Model (CCM). Tuning is a big win on the IBM, primarily due to the ability to use OpenMP parallelism in the physics when dynamics parallelism is exhausted (at approximately 128 processors). As the vector length in CCM mode is near optimal on the X1E, the performance difference there is primarily due to load balancing.
Platform Comparisons: FV / D Grid

Earth Simulator results courtesy of D. Parks. SP results courtesy of M. Wehner. The maximum number of MPI processes is 960. The IBM systems and the Earth Simulator use OpenMP to increase scalability. Recent performance optimizations were backported into CAM 3.1 for these experiments.
Performance Diagnosis: X1E vs. XT3

The dynamics and physics scale similarly on the XT3. On the X1E, the physics scales much better. (Dynamics performance is best at 448 processors.)
Performance Diagnosis: p575 vs. X1E

At large scale, dynamics performance on the p575 is similar to that on the X1E. However, the p575 uses OpenMP and the X1E does not in these experiments, so the p575 is using a coarser dynamics domain decomposition for the same processor count.
Current Limits to FV Scalability

1. The polar filter introduces load imbalances, especially on vector systems (because the small number of short FFTs do not vectorize well).
2. At least 3 latitudes and 3 vertical levels must be present in each block of the domain decomposition within the FV dycore. For the D grid with 26 vertical levels, the limit is a 120x8 processor grid, or 960 MPI processes. For the 1.9x2.5 degree grid, the limit is a 32x8 processor grid, or 256 MPI processes.
3. The physics can use many more processors, but is currently limited to the same number of MPI processes as the dynamical core. OpenMP can be used to assign more processors to the physics than to the dynamics, mitigating this to some degree. There is also some OpenMP parallelism available within the dynamics.
4. On vector systems, additional parallelism in the physics is of limited utility, as the vector length drops below 220 for the D grid when using more than 960 processors (and drops below 110 in the radiation routines).
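The process-grid limits in item 2 follow arithmetically from the block-size requirement:

```python
def max_procs(nlat, nlev, min_lat=3, min_lev=3):
    """Largest (latitude x vertical) FV process grid when each block must
    contain at least min_lat latitudes and min_lev vertical levels."""
    return (nlat // min_lat, nlev // min_lev)

# D grid: 361 latitudes, 26 levels -> 120x8 = 960 MPI processes.
# 1.9x2.5 degree grid: 96 latitudes, 26 levels -> 32x8 = 256.
print(max_procs(361, 26))
print(max_procs(96, 26))
```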
New Issue: Tracer Transport

Atmospheric chemistry not only adds cost in the physics, but also requires the advection of many new fields. The experiment measures CAM runtime when advecting additional fields for the 1x1.25 degree grid, using a 48x7 processor grid on the LLNL Itanium2 cluster (Thunder) and a 48x4 processor grid on the X1E. The cost per tracer is at least 1-2% of the baseline runtime, so advecting 100 new fields more than doubles CAM runtime.
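The implied cost model is linear in the tracer count, which makes the "more than doubles" conclusion a one-line calculation. The fixed per-tracer fraction is the slide's measured lower bound; the model itself is a simplifying assumption.

```python
def runtime_multiplier(ntracers, cost_per_tracer=0.01):
    """Runtime relative to the no-added-tracer baseline, assuming each
    advected tracer adds a fixed fraction of the baseline runtime
    (1-2% per the measurements above)."""
    return 1.0 + cost_per_tracer * ntracers

# At the 1% lower bound, 100 new fields doubles the runtime;
# at 2% per tracer it triples it.
print(runtime_multiplier(100, 0.01))
print(runtime_multiplier(100, 0.02))
```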
New and Planned Activities

1. Dynamics scaling
   - Generalize the dynamics/physics interface to support dycores not using a lon/lat grid.
   - Investigate new, more scalable dycores, for example, FV on a cubed sphere computational grid.
2. Physics scaling
   - Add support for using different numbers of MPI processes in the dynamics and in the physics (generalizing the current OpenMP approach to pure MPI codes).
3. Atmospheric chemistry
   - Investigate parallelizing advection over species and optimizing interprocess communication.
   - Investigate parallelizing chemistry over species.
   - Investigate a 3D decomposition for chemistry.
4. I/O and memory
   - Increasing model resolution exacerbates the current I/O bottlenecks and the memory impact of the (few) remaining global arrays. Investigate parallel I/O on target platforms.
Source Material

- A. Mirin and W. Sawyer, "A Scalable Implementation of a Finite-Volume Dynamical Core in the Community Atmosphere Model," International Journal of High Performance Computing Applications, 19(3), August 2005, pp. 203-212.
- W. Putman, S-J. Lin, and B-W. Shen, "Cross-Platform Performance of a Portable Communication Module and the NASA Finite Volume General Circulation Model," International Journal of High Performance Computing Applications, 19(3), August 2005, pp. 213-223.
- L. Oliker, J. Carter, M. Wehner, A. Canning, S. Ethier, A. Mirin, G. Bala, D. Parks, P. Worley, S. Kitawaki, and Y. Tsuda, "Leading Computational Methods on Scalar and Vector HEC Platforms," in Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking and Storage (SC05), Seattle, WA, November 12-18, 2005.
- P. Worley and J. Drake, "Performance Portability in the Physical Parameterizations of the Community Atmospheric Model," International Journal of High Performance Computing Applications, 19(3), August 2005, pp. 187-201.
- P. Worley, "Benchmarking using the Community Atmosphere Model," in Proceedings of the 2006 SPEC Benchmark Workshop, Austin, TX, January 23, 2006.
Acknowledgements

Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC and Contract No. W-7405-Eng-48 with the University of California Lawrence Livermore National Laboratory.

These slides have been authored by contractors of the U.S. Government under contracts No. DE-AC05-00OR22725 and No. W-7405-Eng-48. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725, and of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.