Performance Analysis of a Cardiac Simulation Code Using IPM


Peter Strazdins*, Computer Systems Group, Research School of Computer Science, and Markus Hegland, Mathematical Sciences Institute, The Australian National University
Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 14 Nov 2011
(slides available from http://cs.anu.edu.au/~Peter.Strazdins/seminars)

1. Overview
- the Chaste project
- cardiac modelling in Chaste
- the vayu cluster
- related work
- effect of NUMA affinity (exposed by IPM)
- benchmark configuration and methodology
- results: scalability and hardware event count analysis; IPM insights
- conclusions and future work

2. The Chaste Cardiac Simulation Project
- Chaste: software infrastructure for modelling the electro-mechanical properties of the heart
- a large system of C++ code with many dependencies
- the required resolution necessitates parallelization via MPI
- the most computationally intensive part is the solution of a large sparse linear system once per timestep
- the workload uses a high-resolution rabbit heart mesh (Oxford University): 2 x 1 GB files, 4 million nodes, 24 million elements

3. Cardiac Modelling in Chaste
- bi-domain equations: continuity of electrical charges within and between heart cells; chemical reaction kinetics describe ion transport at the cell membranes
- after spatial discretisation by the finite element method:

\[
  M_u \frac{du}{dt} = -A_i u_i - g(u,c), \qquad
  M_c \frac{dc}{dt} = f(u,c), \qquad
  A_i u_i + A_e u_e = 0
\]

- u_i and u_e are the electric potentials inside and outside the cells (u = u_i - u_e); c is the chemical composition at the cell membranes; g() is the current across the membranes, f() the kinetics of the ion channels
- this results in the following linear system at each timestep:

\[
  \begin{bmatrix} M_u & 0 & hA_i \\ 0 & M_c & 0 \\ -A_e & 0 & A_i + A_e \end{bmatrix}
  \begin{bmatrix} u^{k+1} \\ c^{k+1} \\ u_i^{k+1} \end{bmatrix}
  =
  \begin{bmatrix} M_u u^k - h\,g(u^k,c^k) \\ M_c c^k + h\,f(u^k,c^k) \\ 0 \end{bmatrix}
\]

- a separate set of equations models the mechanical properties
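The step from the semi-discrete equations to the block linear system above can be made explicit with a semi-implicit Euler step of size h; the sketch below is a reconstruction under that assumption (nonlinear membrane terms g and f taken at the old time level k), not text taken from the slides.

```latex
% Semi-implicit Euler step of size h applied to the semi-discrete bidomain system.
% The nonlinear membrane terms g(u,c) and f(u,c) are evaluated at the old level k,
% while the linear (stiffness) terms are taken at the new level k+1.
\begin{align*}
  M_u \frac{u^{k+1}-u^k}{h} &= -A_i u_i^{k+1} - g(u^k,c^k)
    &&\Rightarrow\; M_u u^{k+1} + h A_i u_i^{k+1} = M_u u^k - h\,g(u^k,c^k), \\
  M_c \frac{c^{k+1}-c^k}{h} &= f(u^k,c^k)
    &&\Rightarrow\; M_c c^{k+1} = M_c c^k + h\,f(u^k,c^k), \\
  A_i u_i^{k+1} + A_e u_e^{k+1} &= 0, \quad u_e^{k+1} = u_i^{k+1} - u^{k+1}
    &&\Rightarrow\; -A_e u^{k+1} + (A_i + A_e)\, u_i^{k+1} = 0 .
\end{align*}
% Stacking the three right-hand equations as rows gives the 3x3 block system above.
```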

4. The Vayu Cluster at the NCI National Facility
- 1492 nodes, each with two 2.93 GHz Intel X5570 quad-core Nehalem processors (commissioned Mar 2010)
- memory hierarchy: 32 KB L1 (per core) / 256 KB L2 (per core) / 8 MB L3 (per socket); 24 GB RAM per node
- single-plane QDR InfiniBand: latency of 2.0 µs and 2600 MB/s (unidirectional) bandwidth per node
- (parallel) job I/O via a Lustre filesystem
- jobs submitted via a locally modified PBS, which (by default) allocates 8 consecutively numbered MPI processes to each node
- typical snapshot: 1216 running jobs (465 suspended), 280 queued jobs, 11776 CPUs in use; estimating time and memory resources accurately is important!
- the allocation for this work was a few thousand CPU hours, with a maximum core count of 2048

5. Related Work
- little so far on the scalability of Chaste:
  - parallel efficiency of 53% at 64 processors (3.0 GHz Xeon cluster), with most time spent in the PETSc linear solver (Pitt-Francis et al., 2009)
  - 100 ms of bidomain activity took 9 minutes on 2048 cores (Fujitsu, 2010)
- several recent works on using IPM (beginning with Wright et al., 2009)
- efficient performance evaluation methodologies for large-scale atmosphere codes (Strazdins et al., 2011):
  - after warming, the first few timesteps accurately predict future performance
  - a (local version of) OpenMPI with process and NUMA affinity generally gives 20% better performance
  - reduction of the variability of a non-I/O-intensive benchmark from 12% to 3% in the average error of repeated measurements

6. Effect of NUMA Affinity Exposed by IPM Profiles
- profiling workloads already running at the limit of memory is hard!
- recent work with the Met Office UM (global atmosphere model) v7.8, no output, 32 x 32 process grid
- with no affinity: groups of 4 processes (socket 0) show spikes in compute times
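The affinity spikes above show up in IPM's per-rank compute times; a simple way to confirm where each MPI rank is actually running is to have every rank report its host and core. The sketch below is an illustrative stand-alone diagnostic (not part of Chaste or IPM), assuming Linux and an MPI such as OpenMPI.

```cpp
// affinity_check.cpp: each MPI rank reports the host and CPU core it is running on.
// Build, e.g.: mpicxx -o affinity_check affinity_check.cpp
// Run,   e.g.: mpirun -np 8 ./affinity_check
#include <mpi.h>
#include <sched.h>   // sched_getcpu() (Linux-specific; g++ defines _GNU_SOURCE)
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    // On a two-socket quad-core node, the core number hints at the socket the rank
    // landed on (the exact core-to-socket numbering varies between systems), so an
    // unbalanced placement like the "groups of 4 on socket 0" case becomes visible.
    int cpu = sched_getcpu();

    // Print in rank order to keep the output readable.
    for (int r = 0; r < nprocs; ++r) {
        if (r == rank)
            std::printf("rank %3d of %3d on %s, core %d\n", rank, nprocs, host, cpu);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```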

7. Benchmark Configuration and Methodology
- complex build process via scons; hard to get diagnostic information!
- large number of dependencies (Python, HDF5, ParMETIS and PETSc); problems traced to conflicts with the multiple versions installed on vayu
- used the gccopt build (with g++ 4.1.2), Boost 1.38.0 and MKL 10.1.2
- simulation parameters:

    duration              5.0 ms
    stimulation duration  0.25 ms (no delay)
    timesteps             0.005 ms (ODE), 0.01 ms (PDE), 1 ms (output)
    KSP tolerance         1e-8 (absolute)
    KSP solver            default: cg with bjacobi
    post-processing       none

- activated an internal profiler in the GenericEventHandler class, which enabled easy IPM profiling of the main sections (see the sketch below)
- benchmarking methodology: run each simulation 5 times and take the minimum; use the minimal number of timesteps that still gives accurate benchmarking (1 ms was too short: only 100 iterations)
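IPM can attribute time and MPI statistics to user-defined regions via MPI_Pcontrol calls carrying a region label, which is one way code sections such as those marked by Chaste's internal profiler can be mapped onto IPM sections. The following is a minimal sketch of that idea only, not Chaste's actual GenericEventHandler code; the region name "KSp" and the fake_solve routine are placeholders.

```cpp
// ipm_regions.cpp: bracketing a code section with IPM region markers.
// MPI_Pcontrol(1, "name") enters an IPM region and MPI_Pcontrol(-1, "name") leaves it;
// when the executable is linked against IPM, the region appears as a separate section
// in the profile, alongside the whole-program statistics. Without IPM, the calls are no-ops.
#include <mpi.h>
#include <numeric>
#include <vector>

// Placeholder for the real per-timestep work (in Chaste, one sparse linear solve).
static double fake_solve(const std::vector<double>& b) {
    double local = std::accumulate(b.begin(), b.end(), 0.0);
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    std::vector<double> b(1000, 1.0);
    const int nsteps = 500;            // illustrative number of timesteps
    double sum = 0.0;

    for (int step = 0; step < nsteps; ++step) {
        MPI_Pcontrol(1, "KSp");        // enter the "KSp" IPM region
        sum += fake_solve(b);
        MPI_Pcontrol(-1, "KSp");       // leave the "KSp" IPM region
    }

    MPI_Finalize();
    return (sum > 0.0) ? 0 : 1;        // use the result so the loop is not optimized away
}
```

Since IPM only reports, the repeat-5-times-and-take-the-minimum methodology above is applied on top of such instrumented runs; the minimum over repeats filters out system noise such as the Output-induced variability discussed later.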

8. Results: Scaling Behaviour
- t8 is the time in seconds on 8 cores (due to memory constraints, the code could not be run on fewer)
- [Figure: speedup over 8 cores versus number of cores (8 to 2048, log-log) for: total (t8=2143), total (long sim.), KSp (t8=1451), Ode (t8=281), InMesh (t8=54), Output (t8=2), rest (t8=92)]

9. Results: Scalability Analysis
- scalability of total time: max. 11.9 at 512 cores (relative to 8)
- scalability of Ode and KSp is quite high; the loss is due to the un-parallelized rest section and the inversely scaling Output section (which uses HDF5)
- execution time breakdown at 1024 cores:

    section  %t    %comm  main MPI call     comments
    rest     43    30     all-gather        25% of time in I/O
    Output   20    30     barrier           high load imbalance
    InMesh   19    41     broadcast
    KSp      14    25     all-reduce (8 B)
    Ode       1    18     all-reduce (4 B)
    AssSys    0.8   0                       slight imbalance; some I/O
    AssRHS    0.5   7     waitany           high load imbalance

- note: IPM dilated the time spent in the KSp section by 50% and the overall time by 10%
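As a sanity check on the "max. 11.9 at 512 cores" figure, the parallel efficiency relative to the 8-core baseline can be worked out directly; the arithmetic below uses only the numbers quoted on these slides.

```latex
% Speedup over the 8-core baseline, as plotted in the scaling figure: S(p) = t_8 / t_p.
% Efficiency relative to 8 cores divides the speedup by the growth in core count:
\[
  E(p) = \frac{S(p)}{p/8}, \qquad
  E(512) = \frac{11.9}{512/8} = \frac{11.9}{64} \approx 0.19 .
\]
% Roughly 19% efficiency at 512 cores, consistent with the losses attributed to the
% un-parallelized "rest" section and the inversely scaling "Output" section.
```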

10. Results: Hardware Event Count Analysis
- comparison of two KSp solvers (counts in units of 10^9; FLOPs, L2$A, cycles and instr are average events per process):

    KSp      cores  t_job  t_total  t_KSp  GFLOPS  FLOPs  L2$A  cycles  instr
    cg          64    528      474    269    27.9    155    36     854    763
    symmlq      64    549      495    292    24.7    144    38     924    688
    cg        1024    222      187   21.9    54.6    8.7   2.2     122    201
    symmlq    1024    194      176   19.1    62.3    8.3   2.3     101    161

- cg appears faster at 64 cores (despite more flops) but appears slightly slower at 1024, probably due to variability between runs, which is very high due to Output
- the number of L2$A (estimated average cost of 4 cycles each) indicates poor floating-point speeds
- at 1024 cores, actual job times are 30-50 s longer still; less than 10 s of this is attributable to MPI process startup/shutdown
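To see why the event counts point to poor floating-point speeds, the per-process columns of the table can be combined directly; the peak of 4 double-precision flops per cycle for a Nehalem (X5570) core is background knowledge, not a number from the slides.

```latex
% Achieved flops per cycle per process, from the FLOPs and cycles columns (both x 10^9):
\[
  \frac{\mathrm{FLOPs}}{\mathrm{cycles}} \approx \frac{155}{854} \approx 0.18
  \ \text{(cg, 64 cores)}, \qquad
  \frac{8.7}{122} \approx 0.07 \ \text{(cg, 1024 cores)},
\]
% both far below the nominal 4 double-precision flops/cycle of an X5570 core.
% The L2 access penalty alone accounts for a noticeable share of the cycles, e.g.
\[
  \frac{4 \times \mathrm{L2\$A}}{\mathrm{cycles}} \approx \frac{4 \times 36}{854} \approx 17\%
  \ \text{(cg, 64 cores)},
\]
% using the slides' estimated average cost of 4 cycles per L2 cache access.
```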

11. Results: IPM Load Balance and Communication Breakdown
- [Figure: load imbalance for Output; the above results are for 1024 cores]
- [Figure: breakdown of MPI calls for the KSp (symmlq) section]

12. Conclusions and Future Work
- working with such codes and systems is hard! bleeding-edge technology, variability effects, pushing memory limits, large code systems...
- the pre-installation of conflicting versions of dependencies was a pitfall!
- an efficient and accurate performance analysis methodology was applied to the major components via the internal profiler and IPM sections
- IPM was lightweight enough to be used (some dilation) and gave valuable insights: load imbalance, MPI breakdowns and event count statistics
- the KSp solver is not the issue at high core counts!
- future work:
  - investigate parallelizing the rest section
  - further analysis of the Output section; possibly system-wide optimization is needed
  - the high variability also needs to be looked into

13. Acknowledgements
- NCI NF staff: Margaret Kahn (initial setup), Judy Jenkinson, Jie Cai (IPM)
- James Southern (FLE) for advice on the benchmark configuration
- Fujitsu Laboratories Europe for supporting this work

Questions?