Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012




Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite

Peter Strazdins (Research School of Computer Science), Jie Cai, Muhammad Atif and Joseph Antony (National Computational Infrastructure), The Australian National University

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012
(slides available from http://cs.anu.edu.au/~Peter.Strazdins/seminars)

Overview
- motivating scenario
- experimental setup: systems (private and public cloud), software
- applications: Chaste cardiac and MetUM atmosphere simulations
- results: OSU communication microbenchmarks, NAS Parallel Benchmarks, applications
- conclusions and future work

Motivations: a Cloud-bursting Supercomputer Facility
- supercomputing facilities provide access to state-of-the-art cluster computers
  - they also provide comprehensive software stacks to support a diverse range of applications
- the supercomputing cluster is typically a highly contended resource
  - users may be restricted to limited resources and may face long turnaround times
- some workloads may not make good use of the cluster and may be better off using a private or even public cloud
  - this requires easy replication of the software stack on cloud resources
  - ideally, migration of jobs onto the cloud would be transparent
- recent frameworks, e.g. ARRIVE-F, can transparently profile HPC jobs for cloud suitability (and migrate VMs accordingly)

Experimental Setup: Systems

                      private cloud (DCC)    public cloud (EC2)    premier cluster (vayu)
  # of nodes          8                      4                     1492
  CPU model           Intel Xeon E5520       Intel Xeon X5570      Intel Xeon X5570
  clock speed         2.27 GHz               2.93 GHz              2.93 GHz
  # cores per node    8 (2 slots)            8 (2 slots)           8 (2 slots)
  L2 cache            8 MB (shared)          8 MB (shared)         8 MB (shared)
  memory per node     40 GB                  20 GB                 24 GB
  operating system    CentOS 5.7             CentOS 5.7            CentOS 5.7
  virtualization      VMware ESX 4.0         Xen                   none
  file system         NFS                    NFS                   Lustre
  interconnect        GigE (dual)            10 GigE               QDR IB

- DCC: 1 VM/node; filesystems mounted from an external cluster via two QLogic fibre channel HBAs
- vayu: QDR IB used for both compute and storage

Experimental Setup: Software
- EC2: a StarCluster instance automates the build, configuration & management of the HPC compute nodes
- vayu's /apps directory holds system-wide compilers, libraries, and application codes; the user environment is configured via the module package
- rsync /apps and the user home/project directories onto the VM to replicate the stack (a sketch of this step is given below)
  - minimizes interference from any existing stack on the clouds
  - only occasionally needed to recompile for the clouds
- benchmarking software:
  - OSU MPI communication micro-benchmarks: bandwidth and latency
  - NAS Parallel Benchmark MPI suite 3.3, class B: 5 kernels & 3 pseudo-applications derived from CFD applications
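The talk does not show the exact replication commands; the Python sketch below only illustrates, under assumptions, how the /apps tree and a home/project directory might be mirrored onto a cloud VM with rsync over ssh. The host name, user and paths are hypothetical placeholders, not details from the slides.

    # replicate_stack.py: minimal sketch of mirroring an HPC software stack onto
    # a cloud VM with rsync over ssh (host and paths are hypothetical).
    import subprocess

    CLOUD_VM = "user@cloud-vm.example.org"        # hypothetical target VM
    TREES = ["/apps/", "/home/user/project/"]     # hypothetical source directories

    def replicate(src: str, dest_host: str) -> None:
        """Mirror one directory tree onto the VM, keeping the same path layout."""
        cmd = [
            "rsync", "-az", "--delete",   # archive mode, compress, mirror deletions
            "-e", "ssh",                  # transfer over ssh
            src, f"{dest_host}:{src}",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        for tree in TREES:
            replicate(tree, CLOUD_VM)

Keeping the same paths on the VM lets module files and hard-coded library paths from the HPC system continue to resolve, which is what keeps recompilation for the clouds only occasional.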

Experimental Setup: Applications
- UK Met Office Unified Model (MetUM), version 7.8
  - used for operational weather forecasts in the UK, Australia, S. Korea, ...
  - benchmark: global atmosphere model on a 640 x 481 x 70 grid
- Chaste version 2.1 cardiac simulation
  - models electric field propagation and ion transport
  - benchmark: 4M node, 24M element rabbit heart mesh
  - more memory-intensive than the UM benchmark
- the dominant part of both is a linear system solve on each timestep (illustrated in the sketch below)
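Neither application's source is part of the talk; the toy Python loop below only illustrates the structure being described: a simulation that assembles a sparse operator once and then spends almost all of its time in a linear solve on every timestep. The operator, sizes and timestep count are arbitrary stand-ins, not Chaste or MetUM code.

    # timestep_solve.py: toy illustration of a timestep loop dominated by a
    # sparse linear solve (not actual Chaste or MetUM code).
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 10_000
    # simple SPD, diagonally dominant tridiagonal operator standing in for the
    # assembled system matrix
    A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    u = np.zeros(n)

    for step in range(100):                      # the timestep loop
        b = u.copy()
        b[n // 2] = 1.0                          # toy right-hand side for this step
        # the per-timestep solve: in the real codes this is where nearly all of
        # the runtime (and, in parallel, the MPI collectives) is spent
        u, info = spla.cg(A, b, x0=u, maxiter=500)   # info == 0 means converged

In a distributed-memory run each iteration of such a solver performs dot products, which become small MPI collectives; this is why solver scaling dominates the overall trends seen later for Chaste.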

Results: Communication Micro-benchmarks

[Plots: OSU MPI bandwidth (MB/s vs message size) and OSU MPI latency (microseconds vs message size) compared on the three platforms (dcc GigE, EC2 10 GigE, vayu QDR IB) for message sizes from 8 bytes to 2 MB]

- trends are as expected from the theoretical specifications: more than one order of magnitude better performance on vayu
- fluctuations on DCC are suspected to come from CPU scheduling by the hypervisor
- (a minimal ping-pong sketch in the spirit of these benchmarks is given below)
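The OSU benchmark source is not reproduced in the slides; the mpi4py ping-pong below is a minimal stand-in showing what such a micro-benchmark measures (one-way latency and bandwidth between two ranks). Message sizes and repetition counts here are arbitrary choices for illustration.

    # pingpong.py: run with e.g.  mpirun -np 2 python pingpong.py
    # Minimal latency/bandwidth ping-pong between ranks 0 and 1 (not the OSU code).
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    peer = 1 - rank                               # assumes exactly two ranks

    for size in (8, 64, 512, 4096, 32768, 262144, 2 * 1024 * 1024):
        buf = bytearray(size)
        reps = 1000 if size <= 4096 else 100
        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(reps):
            if rank == 0:
                comm.Send([buf, MPI.BYTE], dest=peer)
                comm.Recv([buf, MPI.BYTE], source=peer)
            else:
                comm.Recv([buf, MPI.BYTE], source=peer)
                comm.Send([buf, MPI.BYTE], dest=peer)
        t1 = MPI.Wtime()
        if rank == 0:
            rtt = (t1 - t0) / reps                # average round-trip time (s)
            print(f"{size:8d} B   latency {rtt / 2 * 1e6:9.2f} us   "
                  f"bandwidth {2 * size / rtt / 1e6:9.2f} MB/s")

On GigE small-message latency is tens of microseconds and is inflated further by hypervisor scheduling, whereas QDR InfiniBand sits in the low single-digit microseconds; that gap is the order-of-magnitude difference the slide refers to.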

Results: NAS Parallel Benchmarks (I)

[Plots: EP.B and IS.B speedup, up to 64 cores, on the three platforms (dcc GigE, EC2 10 GigE, vayu QDR IB)]

- EC2 fluctuations for EP.B are suspected to be due to jitter (CPU scheduling and hyperthreading)
- IS shows the poorest scaling of all benchmarks: IPM profiling shows the % communication at 64 cores is 98% (DCC), 85% (EC2) and 68% (vayu)
- (how the plotted speedups and communication percentages are derived is sketched below)
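For reference, the plotted quantities are simple ratios; the sketch below computes speedup relative to a base core count and an IPM-style communication percentage. All timing values here are placeholders, not data from the talk.

    # npb_metrics.py: how the plotted speedups and %communication are derived.
    # Timing values below are illustrative placeholders only.

    def speedup(times_by_cores: dict[int, float], base: int) -> dict[int, float]:
        """Speedup of each run relative to the run on `base` cores."""
        t_base = times_by_cores[base]
        return {p: t_base / t for p, t in times_by_cores.items()}

    def comm_percent(wall_time: float, mpi_time: float) -> float:
        """IPM-style share of wall time spent inside MPI, as a percentage."""
        return 100.0 * mpi_time / wall_time

    if __name__ == "__main__":
        times = {1: 12.0, 2: 6.4, 4: 3.6, 8: 2.2, 16: 1.8, 32: 1.7, 64: 1.9}
        print({p: round(s, 2) for p, s in speedup(times, base=1).items()})
        print(f"{comm_percent(wall_time=1.9, mpi_time=1.6):.0f}% communication at 64 cores")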

Results: NAS Parallel Benchmarks (II)

[Plots: FT.B and CG.B speedup, up to 64 cores, on the three platforms (dcc GigE, EC2 10 GigE, vayu QDR IB)]

- CG.B: IPM profiling shows the % communication at 64 cores is 90% (DCC), 58% (EC2) and 22% (vayu)
- drop-offs on the clouds occur when inter-node communication is required
- BT.B, MG.B, SP.B and LU.B showed similar scaling to FT.B
- single-core performance was consistently 20% (30%) faster on EC2 (vayu)

Results: Chaste Cardiac Simulation
(results not available on EC2 due to complex dependencies)

[Plot: speedup over 8 cores, up to 64 cores, of total time and KSp solver time on vayu and dcc]

- scaling of the KSp linear solver determines the overall trends
- note: the benchmark scales to 1024 cores on vayu
- input mesh: 1.4x faster on vayu (8 cores), scaled the same on both
- output: 2.6x faster on vayu (8 cores) but scaled better on dcc
- at 32 cores, 48% vs % of time spent in communication on dcc vs vayu
- 3x more time spent in the KSp solver on dcc (large numbers of collectives); a sketch of why many small collectives hurt on a high-latency interconnect is given below
- IPM profiles also indicated a greater degree and a higher irregularity of load imbalance on DCC
- dcc performance hurt by high message latencies & jitter
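The slide attributes the extra KSp time on dcc to large numbers of collectives meeting a high-latency interconnect. The hypothetical mpi4py loop below illustrates the pattern being blamed: each distributed dot product inside a Krylov solver is an 8-byte allreduce, so its cost is set almost entirely by network latency and jitter rather than bandwidth. This is an illustrative measurement loop, not Chaste's solver.

    # collectives_cost.py: run with e.g.  mpirun -np 32 python collectives_cost.py
    # Times many tiny allreduces, the communication pattern of dot products
    # inside Krylov (KSP-style) solvers; illustrative only.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local = np.random.rand(50_000)        # this rank's share of a distributed vector
    n_reductions = 10_000                 # stand-in for iterations x dot products

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(n_reductions):
        partial = np.array([local @ local])          # local part of the dot product
        total = np.empty(1)
        comm.Allreduce(partial, total, op=MPI.SUM)   # 8-byte, latency-bound collective
    t1 = MPI.Wtime()

    if rank == 0:
        per_call_us = (t1 - t0) / n_reductions * 1e6
        print(f"{n_reductions} allreduces: {t1 - t0:.2f} s total, {per_call_us:.1f} us each")

Over GigE with a hypervisor underneath, each call costs tens of microseconds and varies with jitter, while over QDR InfiniBand it takes a few microseconds; that is the kind of gap that lets collectives dominate the dcc solve time.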

Results: MetUM Global Atmosphere Simulation

[Plot: speedup over 8 cores, up to 64 cores, for vayu total, dcc total, EC2 total and EC2-4 total]

Details at 32 cores (EC2-4: 4 nodes used, uniformly 2x faster than EC2):

               Vayu    DCC     EC2     EC2-4
  time (s)     303     624     770     380
  r comp       1.0     1.37    2.39    1.7
  r comm       1.0     6.7     3.5     3.6
  %comm        3       42      8       8
  %imbal       3       4       8       9
  I/O (s)      4.5     37.8    9.1     7.6

- overall load imbalance was least on dcc, but generally higher & more irregular across individual sections (NUMA effects)
- EC2 shows similar imbalance and communication profiles to vayu
- dcc spent the most time in communication, particularly in sections with large numbers of collectives
- read-only I/O section: dcc much slower than vayu; EC2 similar to vayu
- (one plausible way to compute these derived metrics is sketched below)
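The slide does not spell out how the tabulated quantities are defined; the sketch below shows one plausible reading (an assumption, not the authors' stated definitions): r comp and r comm as compute and communication times relative to vayu, %comm as the communication share of total time, and %imbal as how far the slowest rank sits above the mean. All numbers in the example are placeholders.

    # metum_metrics.py: one plausible reading of the tabulated quantities
    # (assumed definitions and placeholder numbers, not the authors' data).
    from statistics import mean

    def ratio_to_baseline(times: dict[str, float], baseline: str = "vayu") -> dict[str, float]:
        """Per-platform time divided by the baseline platform's time (r comp, r comm)."""
        return {p: t / times[baseline] for p, t in times.items()}

    def comm_percent(total_time: float, comm_time: float) -> float:
        """Share of total wall time spent in communication (%comm)."""
        return 100.0 * comm_time / total_time

    def imbalance_percent(per_rank_busy: list[float]) -> float:
        """Load imbalance (%imbal): how far the slowest rank is above the mean."""
        slowest = max(per_rank_busy)
        return 100.0 * (slowest - mean(per_rank_busy)) / slowest

    if __name__ == "__main__":
        compute = {"vayu": 200.0, "dcc": 275.0, "ec2": 480.0, "ec2-4": 340.0}  # placeholders
        comm = {"vayu": 40.0, "dcc": 260.0, "ec2": 140.0, "ec2-4": 145.0}      # placeholders
        print("r comp:", ratio_to_baseline(compute))
        print("r comm:", ratio_to_baseline(comm))
        print("%comm:", comm_percent(total_time=600.0, comm_time=250.0))
        print("%imbal:", imbalance_percent([9.8, 10.1, 10.4, 12.0]))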

Conclusions and Future Work
- largely successful in creating x86-64 binaries on the HPC system & replicating all software dependencies into the VMs on the clouds
- MPI micro-benchmarks and NPB benchmarks showed communication-bound applications were disadvantaged on the virtualized platforms
  - large numbers of short messages were especially problematic
  - corroborated by the two applications
- over-subscription of cores and hidden hardware characteristics (e.g. NUMA) also affected cloud performance
- only minor effects (e.g. jitter) were directly attributable to virtualization
- the underlying filesystem affected application I/O performance
- future work:
  - use metrics from the ARRIVE-F framework to assess candidate workloads for private/public science clouds
  - using StarCluster, cloud-burst onto OpenStack-based resources locally & externally

Acknowledgements
Thanks to Michael Chapman, David Singleton, Robin Humble, Ahmed El Zein, Ben Evans and Lindsay Botten at the NCI-NF for their support and encouragement!

Questions?