Performance of the JMA NWP models on the PC cluster TSUBAME.




Performance of the JMA NWP models on the PC cluster TSUBAME
K. Takenouchi 1), S. Yokoi 1), T. Hara 1)*, T. Aoki 2), C. Muroi 1), K. Aranami 1), K. Iwamura 1), Y. Aikawa 1)
1) Japan Meteorological Agency (JMA)  2) Tokyo Institute of Technology  * Presenter
13th Workshop on the Use of High Performance Computing in Meteorology

History of Supercomputers at JMA. [Chart: peak performance (KFLOPS, log scale) versus year, 1959-2012.] IBM 704: 12 KFLOPS; HITAC 5020: 307 KFLOPS; HITAC 8800: 4.55 MFLOPS; HITAC M200H: 23.8 MFLOPS; HITAC S810: 630 MFLOPS; HITACHI S3800: 32 GFLOPS; HITACHI SR8000: 768 GFLOPS; HITACHI SR11000: 27.5 TFLOPS; 2012-: ????

Present and Future of Supercomputers. In the TOP500 list, the proportion of clusters has been rising remarkably since 2000. Will the future belong to PC clusters or to massively parallel scalar machines? (Reference: TOP500.org) Can a PC cluster be part of the next supercomputer system for NWP?

PC cluster compared with the current supercomputer. Advantages: much less expensive, and requires much less electric power. These factors are crucial when procuring an operational computer.

PC cluster compared with the current supercomputer. Disadvantages to be considered: each processor is less powerful, both in CPU processing performance (Intel vs. POWER) and in transfer speed from memory (memory bandwidth). This must be compensated with more processors/nodes, so excellent scalability is necessary; the approach is pointless if so many processors are required that the cluster no longer beats the supercomputer on expense and electric power. Further weak points are communication latency between nodes and the amount of memory.

TSUBAME. TSUBAME is the PC cluster of the Tokyo Institute of Technology, and it is entirely different from what used to be called a PC cluster. It is comparable with a supercomputer in number of CPUs, peak performance, and amount of memory per node, and it uses InfiniBand (2 GB/s) for transfers between nodes. (By courtesy of Prof. Matsuoka, TITECH)

Our Questions. Can we use the latest scalar machines or PC clusters like TSUBAME as an operational NWP system (at least as part of the system)? If we can, it opens the possibility of large cost reductions and gives us much more freedom of choice when procuring the system. What should we, the users, do to use them effectively? What do PC clusters need, especially in terms of hardware and software from vendors, so that NWP models can be operated on them? To try to answer these questions, we ran JMA NWP models on TSUBAME and investigated their performance with reference to the current operational supercomputer, the SR11000.

Profile of TSUBAME & SR11000

                        GSIC TSUBAME                                JMA SR11000
Total performance       85 TFlops                                   27.5 TFlops
Total nodes             655 nodes (16 CPUs/node)                    210 nodes (16 CPUs/node)
Peak/node               96 GFlops                                   135 GFlops
Memory                  32 GB/node                                  64 GB/node
Memory bandwidth        6.4 GB/s                                    15 GB/s
Speed between nodes     two-way 2 GB/s                              two-way 16 GB/s
CPU                     AMD Opteron 880 (2.4 GHz) / 885 (2.6 GHz)   POWER5+ (2.1 GHz)

Single-node performance. JMA-NHM on one node, parallelized with MPI. Running time ratio SR11000 : TSUBAME = 1 : 4.6 (normalized by peak performance). The model has not been tuned at all for TSUBAME; if tuned, the performance might improve. Major differences between the SR11000 and TSUBAME: memory bandwidth (TSUBAME 6.4 GB/s vs. SR11000 15 GB/s) and speed between nodes (TSUBAME 2 GB/s vs. SR11000 16 GB/s).

Influence of Memory Throughput. [Chart: performance ratio for 8 CPUs/node (8 nodes), 4 CPUs/node (16 nodes), and 2 CPUs/node (32 nodes), TSUBAME vs. SR11000.] When the number of CPUs per node is changed while keeping the total CPU count constant (64 CPUs), the fewer CPUs per node, the better the performance TSUBAME shows. The performance of the SR11000, on the other hand, is almost independent of this. TSUBAME's memory throughput does not look sufficient. To secure better performance, memory throughput can be an essential factor to consider in system design.
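
The memory-throughput sensitivity above can be probed on any node with a STREAM-style triad loop. The following is a minimal sketch, not part of the original experiments; the array size and the use of OpenMP threads are arbitrary choices for illustration.

```c
/* Minimal STREAM-style triad sketch (not from the original slides):
 * probes sustained memory bandwidth per node, the quantity the slide
 * identifies as the limiting factor on TSUBAME. Array size is an
 * arbitrary choice, large enough to exceed the caches (~512 MB each). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 26)                    /* ~67 M doubles per array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];      /* triad: 2 loads + 1 store */
    double t1 = omp_get_wtime();

    /* three arrays of doubles move through memory in the timed loop */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("triad bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}
```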

To compensate for the shortfall in performance, more nodes (CPUs) are required, so excellent scalability must be secured. Parallelization (for both TSUBAME and SR11000): between nodes, MPI is regarded as the de facto standard. Between CPUs that share a single memory, the options are MPI as well, or SMP (thread parallelization): automatic parallelization (only on the SR11000), or OpenMP (with a parallel code generator by which the OpenMP directives can be inserted automatically). A sketch of this hybrid layout follows below.
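
A minimal sketch of the hybrid layout just described (MPI between nodes, OpenMP threads across the CPUs sharing a node's memory). It is illustrative only: the array size and the placeholder loops are assumptions, not code from GSM or JMA-NHM.

```c
/* Hedged sketch of the hybrid layout described above: one MPI rank per
 * node, OpenMP threads across the CPUs that share that node's memory. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

enum { NLOCAL = 1000000 };            /* grid points owned by one rank */
static double field[NLOCAL];

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED: only the master thread issues MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* thread-parallel work inside the node (placeholder computation) */
    #pragma omp parallel for
    for (int i = 0; i < NLOCAL; i++)
        field[i] = 1.0 / (i + rank + 1);

    /* inter-node step: a global diagnostic as a stand-in for real MPI work */
    double local_sum = 0.0, global_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < NLOCAL; i++)
        local_sum += field[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%g\n",
               nranks, omp_get_max_threads(), global_sum);
    MPI_Finalize();
    return 0;
}
```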

Which is better for parallelizing over CPUs that share memory? [Chart: running time (s) versus number of nodes (0-150), comparing "CPUs that share memory used with MPI" against OpenMP, with fitted Amdahl curves T = a + b/n, where T is the running time and n the number of nodes.] While MPI gives better performance with a smaller number of nodes, increasing the number of MPI processes makes it worse. OpenMP shows such better scalability that it overtakes pure MPI at around 100 nodes in running time.
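
The fitted curve T = a + b/n on this slide can be reproduced from timing data with an ordinary least-squares fit in x = 1/n. The sketch below uses made-up sample timings; only the model itself comes from the slide.

```c
/* Least-squares fit of the slide's model T(n) = a + b/n to a set of
 * (n, T) timing measurements. The sample timings are placeholders,
 * not the numbers measured on TSUBAME. */
#include <stdio.h>

int main(void)
{
    const int    nodes[]  = { 8, 16, 32, 64, 128 };
    const double time_s[] = { 420.0, 230.0, 140.0, 95.0, 75.0 };  /* placeholders */
    const int m = 5;

    /* ordinary least squares with x = 1/n: T = a + b*x */
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < m; i++) {
        double x = 1.0 / nodes[i], y = time_s[i];
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    double a = (sy - b * sx) / m;

    printf("serial part a = %.1f s, parallel part b = %.1f node-s\n", a, b);
    printf("predicted T(256 nodes) = %.1f s\n", a + b / 256.0);
    return 0;
}
```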

SMP is more advantageous with plenty of nodes. With a larger number of nodes, it is more advantageous to apply SMP than MPI, so in what follows results with SMP (OpenMP) are presented. SMP still leaves room for improvement, however: because the parallel code generator that inserts the OpenMP directives tends to split the code into small parallelized blocks, the cost of synchronizing all threads at the end of each block cannot be neglected. One conceivable remedy is to treat each entire physical process as a single block to be parallelized, which requires some modification of the algorithms within the models.
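
A hedged sketch of the remedy suggested above: enclosing several small loops in one OpenMP parallel region and removing the intermediate barriers where the loops are independent. The arrays and loops are placeholders, not the output of the JMA parallel code generator.

```c
/* Many small parallel blocks, each ending in an implied barrier, are
 * replaced by one parallel region whose intermediate barriers are
 * dropped with nowait. */
#define N 100000
static double t[N], q[N], u[N];

void physics_step_fused(void)
{
    #pragma omp parallel
    {
        /* nowait removes the barrier after each loop; this is safe only
         * because the three loops touch independent arrays */
        #pragma omp for nowait
        for (int i = 0; i < N; i++) t[i] += 0.1;

        #pragma omp for nowait
        for (int i = 0; i < N; i++) q[i] *= 0.99;

        #pragma omp for            /* one barrier at the end of the region */
        for (int i = 0; i < N; i++) u[i] += 1.0;
    }
}
```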

Scalability depending on the model. The JMA NWP models whose scalability was measured: GSM, the Global Spectral Model (TL319L60), whose spatial discretization is the spectral method, so each node must communicate with all other nodes for the grid-wave transform; and JMA-NHM, the non-hydrostatic regional model (400x275L60), whose spatial discretization is the finite-difference method, so each node communicates only with adjacent nodes. This difference may produce a remarkable contrast in scalability between GSM and JMA-NHM; a sketch of the two communication patterns follows.
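
The two communication patterns named above can be sketched as follows; the MPI calls are illustrative stand-ins (a real spectral transform uses a more elaborate transpose, and the halo exchange here is one-dimensional for brevity).

```c
/* Illustrative contrast of the two communication patterns.
 * Buffer allocation and the surrounding model code are omitted. */
#include <mpi.h>

/* GSM-like spectral model: the grid <-> spectral transform needs a
 * global transpose, so every rank exchanges data with every other rank. */
void spectral_transpose(double *sendbuf, double *recvbuf, int chunk,
                        MPI_Comm comm)
{
    MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                 recvbuf, chunk, MPI_DOUBLE, comm);
}

/* JMA-NHM-like grid-point model: each rank refreshes its halo only with
 * its left and right neighbours. */
void halo_exchange(double *send_l, double *send_r,
                   double *recv_l, double *recv_r,
                   int halo, int left, int right, MPI_Comm comm)
{
    MPI_Sendrecv(send_l, halo, MPI_DOUBLE, left,  0,
                 recv_r, halo, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_r, halo, MPI_DOUBLE, right, 1,
                 recv_l, halo, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```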

Scalability on TSUBAME. [Chart: performance ratio (normalized by the running time on 8 nodes) versus number of nodes (0-100), showing the ideal curve, JMA-NHM, and GSM.] The scalability of JMA-NHM is quite good, while that of GSM is almost saturated even at a smaller number of nodes.

Proportion of time spent in MPI communication. [Chart: normalized proportion of communication versus number of nodes (0-40) for JMA-NHM and GSM, measured on the SR11000 and normalized by the proportion of MPI communication on 8 nodes.] The proportion of running time occupied by MPI communication increases more sharply with the number of nodes for GSM than for JMA-NHM. That is one reason why the scalability of JMA-NHM is better.

Dependency of the running time of JMA-NHM on the number of nodes. [Two charts: time (s) versus number of nodes (8, 16, 32), one for TSUBAME and one for the SR11000.] The time for MPI communication is almost independent of the number of nodes, because the cost of invoking MPI communication remains unchanged as the number of nodes changes. To gain performance, we should make an effort to reduce the time consumed by MPI communication.

How to reduce MPI costs (1). Exclude global communications: they are very expensive, so use only local exchanges (1:1, not 1:all or all:all). Reduce the number of MPI invocations: the cost of invoking MPI is roughly O(1), i.e. constant with the number of nodes, so exchanges between MPI processes should be gathered into as few communications as possible, which may require modifications of the algorithms (see the sketch below).
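
A hedged sketch of the aggregation idea: replacing one MPI call per model layer with a single call carrying all layers. The layer and halo sizes are illustrative; in a real model the halo data is usually strided, so the packing step is genuinely needed rather than a formality.

```c
/* Gathering many exchanges into one communication. */
#include <mpi.h>
#include <string.h>

#define NLEV  60                     /* model layers        (illustrative) */
#define NHALO 577                    /* halo points / layer (illustrative) */

/* naive version: NLEV messages, hence NLEV invocation overheads */
void exchange_per_layer(double halo[NLEV][NHALO], int nbr, MPI_Comm comm)
{
    for (int k = 0; k < NLEV; k++)
        MPI_Sendrecv_replace(halo[k], NHALO, MPI_DOUBLE,
                             nbr, k, nbr, k, comm, MPI_STATUS_IGNORE);
}

/* aggregated version: one message carrying all layers at once */
void exchange_packed(double halo[NLEV][NHALO], int nbr, MPI_Comm comm)
{
    static double buf[NLEV * NHALO];
    memcpy(buf, halo, sizeof(buf));                          /* pack   */
    MPI_Sendrecv_replace(buf, NLEV * NHALO, MPI_DOUBLE,
                         nbr, 0, nbr, 0, comm, MPI_STATUS_IGNORE);
    memcpy(halo, buf, sizeof(buf));                          /* unpack */
}
```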

How to reduce MPI costs (2). [Chart: running time of communication (s) versus number of MPI invocations (0-80,000), with the amount of data to communicate fixed (number of layers to communicate = 2,518,562; horizontal grid: 721 x 577).] We measured the running time of communication in JMA-NHM on the SR11000 with 64 MPI processes, varying the number of MPI invocations while keeping the amount of communication constant. The cost of communication decreases as the number of MPI invocations decreases. Here a is the time to invoke one MPI transfer and b is the time to purely transfer the whole data (without the invocation overhead). To reduce the time for communication, users need to reduce the number of MPI invocations (associated with a) and to reduce the amount of data transferred between MPI processes (associated with b).
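
The relationship measured on this slide can be stated compactly. The formula below is a hedged restatement: a and b are as defined on the slide, while the symbol k (the number of MPI invocations) is notation introduced here.

```latex
% Hedged restatement of the cost model behind the measurement above.
T_{\mathrm{comm}}(k) \;\approx\; a\,k + b,
\qquad a = \text{time to invoke one MPI transfer},
\quad  b = \text{time to transfer the whole data volume}.
```

With the data volume fixed, b is constant, so the communication time falls linearly as k is reduced, which matches the measured trend.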

How to reduce MPI costs (3). [Diagram: within one time step, the current scheduling alternates calculation and communication; the new (ideal) scheduling splits the work into boundary and inside regions and replaces the communication phases with asynchronous data transfer, reducing the time to be cut.] Only the values on the halos are exchanged by communication. The grid points near the boundary that are associated with the halos are calculated first, and then the calculation for the inside region and the asynchronous data transfer are executed simultaneously (see the sketch below).
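
A minimal sketch of the new scheduling, assuming a one-dimensional decomposition with left/right neighbours: the boundary points are computed first, the halo exchange is started with non-blocking MPI, and the interior computation overlaps the transfer. The routine names are placeholders.

```c
/* Overlapping halo communication with interior computation. */
#include <mpi.h>

#define NHALO 1000                   /* halo points per side (illustrative) */

static void compute_boundary(void) { /* placeholder: points next to halo */ }
static void compute_interior(void) { /* placeholder: rest of the domain  */ }

void timestep(double *send_l, double *send_r,
              double *recv_l, double *recv_r,
              int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    compute_boundary();              /* fills send_l / send_r first */

    /* start the halo exchange, but do not wait for it yet */
    MPI_Irecv(recv_l, NHALO, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_r, NHALO, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_l, NHALO, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(send_r, NHALO, MPI_DOUBLE, right, 0, comm, &req[3]);

    compute_interior();              /* overlaps the data transfer */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   /* halos now valid */
}
```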

Conclusions (1). TSUBAME has led us into a new generation of scalar computers and PC clusters. TSUBAME has specifications comparable with current supercomputers; its lessons feed into the T2K project and the Next-Generation Computer project in Japan, and such machines could become the mainstream of computing in the future. NWP centers like JMA should take an interest in them and look for ways to use them effectively.

Conclusions (2). The latest PC clusters could be employed in NWP systems, but with some limitations. Tuning for scalar processors is required, of course. The most problematic bottleneck is the communication between nodes, which makes such clusters less appropriate for models with a spectral scheme or a fully implicit scheme. To reduce the cost, algorithm modifications are essential, such as reducing the number of MPI communication invocations and concealing communications behind computation with asynchronous transfers.

Conclusions (3). Not only we users, but also the vendors have a part to play in making NWP models operational on such systems. We eagerly look forward to further improvements by the vendors, especially in memory throughput and the latency of MPI communication.