VASP Performance Benchmark and Profiling. January 2013

Note

The following research was performed under the HPC Advisory Council activities
- Participating vendors: AMD, Dell, Mellanox
- Compute resource: HPC Advisory Council Cluster Center

For more information please refer to
- http://www.amd.com
- http://www.dell.com/hpc
- http://www.mellanox.com
- http://www.vasp.at

VASP

- VASP stands for Vienna Ab-initio Simulation Package
- Performs ab-initio quantum-mechanical molecular dynamics (MD) using pseudopotentials and a plane-wave basis set
- The code is written in FORTRAN 90 with MPI support
- Access to the code may be given by a request via the VASP website
- The approach used in VASP is based on two techniques:
  - A finite-temperature local-density approximation, and
  - An exact evaluation of the instantaneous electronic ground state at each MD step, using efficient matrix diagonalization schemes and an efficient Pulay mixing
- These techniques avoid the problems of the original Car-Parrinello method, which is based on the simultaneous integration of electronic and ionic equations of motion

Objectives

- The following was done to provide best practices:
  - VASP performance benchmarking
  - Understanding VASP communication patterns
  - Ways to increase VASP productivity
  - Compilers and network interconnects comparisons
- The presented results will demonstrate:
  - The scalability of the compute environment
  - The capability of VASP to achieve scalable productivity
  - Considerations for performance optimizations

Test Cluster Configuration

- Dell PowerEdge R815 11-node (704-core) cluster
- AMD Opteron 6276 (code name "Interlagos") 16-core 2.3 GHz CPUs, 4 CPU sockets per server node
- Mellanox ConnectX-3 FDR InfiniBand adapters
- Mellanox SwitchX 6036 36-port InfiniBand switch
- Memory: 128GB DDR3 1333MHz memory per node
- OS: RHEL 6.2, SLES 11.2 with MLNX-OFED 1.5.3 InfiniBand SW stack
- MPI: Intel MPI 4 Update 3, MVAPICH2 1.8.1, Open MPI 1.6.3 (with dell_affinity 0.85)
- Math libraries: ACML 5.2.0, Intel MKL 11.0, ScaLAPACK 2.0.2
- Compilers: Intel Compilers 13.0, Open64 4.5.2
- Application: VASP 5.2.7
- Benchmark workload: Pure Hydrogen (MD simulation, 10 ionic steps, 60 electronic steps, 264 bands, IALGO=48); a sketch of the corresponding INCAR tags is given below
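
For reference, a minimal INCAR sketch of the tags implied by the workload description above (IALGO=48, 264 bands, 10 ionic MD steps, 60 electronic steps, and the NPAR=8 setting noted on the result charts). The actual input files for the Pure Hydrogen case are not included in this document, so the lines below are an illustrative assumption rather than the benchmark's real input:

    SYSTEM = Pure Hydrogen MD benchmark   ! hypothetical sketch, not the actual input
    IALGO  = 48      ! RMM-DIIS electronic minimization, as listed in the workload
    NBANDS = 264     ! number of bands
    NPAR   = 8       ! band-group parallelization, matching the NPAR=8 chart annotation
    IBRION = 0       ! molecular dynamics
    NSW    = 10      ! ionic steps
    NELM   = 60      ! electronic steps per ionic step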

Dell PowerEdge R815 11-node cluster: HPC Advisory Council Test-bed System

- New 11-node, 704-core cluster featuring Dell PowerEdge R815 servers
  - Replacement system for the Dell PowerEdge SC1435 (192-core) cluster following 2 years of rigorous benchmarking and product EOL
  - System to be redirected to explore HPC-in-the-Cloud applications
- Workload profiling and benchmarking
  - Characterization for HPC and compute-intense environments
  - Optimization for scale, sizing and configuration, and workload performance
  - Test-bed benchmarks, RFPs, customers/prospects, etc.
- ISV and industry-standard application characterization
- Best practices and usage analysis

About the Dell PowerEdge Platform Advantages

- Best-of-breed technologies and partners
- The combination of the AMD Opteron 6200 series platform and Mellanox ConnectX-3 InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale
  - The Dell PowerEdge R815 system delivers 4-socket performance in a dense 2U form factor
  - Up to 64 cores / 32 DIMMs per server; 1,344 cores in a 42U enclosure
- Integrated stacks designed to deliver the best price/performance/watt
  - 2x more memory and processing power in half the space
  - Energy-optimized low-flow fans, improved power supplies and dual SD modules
- Optimized for long-term capital and operating investment protection
  - System expansion
  - Component upgrades and feature releases

VASP Performance: Interconnect

- QDR InfiniBand delivers the best performance for VASP
  - Up to 186% better performance than 40GbE on 8 nodes
  - Over 5 times (520%) better performance than 10GbE on 8 nodes
  - Over 8 times (893%) better performance than 1GbE on 8 nodes
- Scalability limitation seen with Ethernet networks
  - 1GbE, 10GbE and 40GbE performance starts to decline after 2 nodes

(Chart: NPAR=8, 32 cores per node; higher is better)

VASP Performance: Open MPI with SRQ

- Using SRQ enables better performance for VASP at high core counts
  - 24% higher performance than the untuned Open MPI run at 8 nodes
- Flag used for enabling SRQ in Open MPI (an example launch line is sketched below):
  -mca btl_openib_receive_queues S,9216,256,128,32:S,65536,256,128,32
- Processor binding is enabled for both cases

(Chart: NPAR=8, 32 cores per node; higher is better)
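
For context, a minimal sketch of how the SRQ flag above might be attached to an Open MPI 1.6.x launch; the hostfile name, rank count and VASP binary path are placeholders and are not taken from the benchmark runs:

    # Hypothetical launch line; only the -mca flag is quoted from the slide
    mpirun -np 256 --hostfile hosts.txt \
        -mca btl_openib_receive_queues S,9216,256,128,32:S,65536,256,128,32 \
        ./vasp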

VASP Performance: Open MPI with MXM

- Enabling MXM improves VASP performance at high core counts
  - 47% higher job productivity than the untuned Open MPI run at 8 nodes
  - 19% higher job productivity than the SRQ-enabled Open MPI at 8 nodes
- Flags used for enabling MXM in Open MPI (an example launch line is sketched below):
  -mca mtl mxm -mca btl_openib_free_list_num 8192 -mca btl_openib_free_list_inc 1024 -mca mpi_preconnect_mpi 1 -mca btl_openib_flags 9
- Processor binding using dell_affinity.exe for all 3 cases

(Chart: NPAR=8, 32 cores per node; higher is better)
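
Similarly, a minimal sketch of a launch line carrying the MXM-related flags listed above, with the same placeholder hostfile, rank count and binary path:

    # Hypothetical launch line; only the -mca flags are quoted from the slide
    mpirun -np 256 --hostfile hosts.txt \
        -mca mtl mxm \
        -mca btl_openib_free_list_num 8192 \
        -mca btl_openib_free_list_inc 1024 \
        -mca mpi_preconnect_mpi 1 \
        -mca btl_openib_flags 9 \
        ./vasp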

VASP Performance: Processor Binding

- Processor binding is crucial for achieving the best performance on AMD Interlagos
  - Allocating MPI processes on the most optimal cores allows Open MPI to perform well
- Options used between the 2 cases (an example launch line is sketched below):
  - bind-to-core: Open MPI parameter --bind-to-core (Open MPI compiled with hwloc support)
  - dell_affinity.exe: dell_affinity.exe -v -n 32 -t 1
- dell_affinity is described at the HPC Advisory Council Spain Conference 2012:
  http://www.hpcadvisorycouncil.com/events/2012/spain-workshop/pres/7_dell.pdf
  - Works with all open-source and commercial MPI libraries on Dell platforms

(Chart: 26%; NPAR=8, 32 cores per node; higher is better)
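
A minimal sketch of the first binding option as a full command line, again with placeholder hostfile, rank count and binary path; the slide only quotes the dell_affinity.exe arguments, so its exact integration with the MPI launch is not reproduced here:

    # Case 1 (hypothetical launch line): Open MPI core binding via hwloc
    mpirun -np 256 --hostfile hosts.txt --bind-to-core ./vasp

    # Case 2: the slide quotes "dell_affinity.exe -v -n 32 -t 1";
    # how it is combined with the MPI launch is not shown in the source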

VASP Performance: Processes Per Node

- Running 1 active core per core pair yields higher system utilization (a generic placement sketch follows below)
  - 42% gain in performance with 64 PPN versus 32 PPN (with 1 active core) for 8 nodes
  - 1 floating-point unit (FPU) is shared between the 2 CPU cores in a package
- Using 4P servers delivers higher performance than 2P servers
  - 30% gain with a 4P server (32 PPN with 1 active core per package) versus 32 PPN in a 2P server

(Chart: NPAR=8; higher is better)
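
As an aside, one generic way to realize the "1 active core per core pair" placement with Open MPI 1.6 is sketched below. This is not the method used in the study (which relied on dell_affinity); it assumes 64-core nodes and placeholder hostfile, rank count and binary path:

    # Hypothetical placement: 32 ranks per 64-core node, each rank bound to
    # both cores of a module, so only one core per FPU-sharing pair is active
    mpirun -np 256 --hostfile hosts.txt \
        --npernode 32 --cpus-per-proc 2 --bind-to-core ./vasp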

VASP Performance: Software Stack

- The free software stack performs comparably to the Intel software stack
  - Free software stack: MVAPICH2, Open64 compilers, ACML and ScaLAPACK
- Specifications regarding the runs:
  - Intel Compiler/MKL: default optimization in Makefile with minor modification to loc
  - MVAPICH2 runs with processor affinity: dell_affinity.exe -v -n32 -t 1
  - Open64 compiler flags: -march=bdver1 -mavx -mfma (an example compile line is sketched below)
  - Changes to interface blocks of certain modules to support Open64 can be accessed and acquired from the University of Vienna

(Chart: 32 cores per node; higher is better)
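
A minimal sketch of what an Open64 compile line with the flags above could look like; openf90 is the Open64 Fortran front end, the source-file name is a placeholder, and any additional flags from the actual VASP makefile are omitted:

    # Hypothetical compile line for an Open64 build (illustration only)
    openf90 -march=bdver1 -mavx -mfma -c some_vasp_module.f90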

VASP Profiling: MPI/User Time Ratio

- QDR InfiniBand reduces the amount of time spent in MPI communications
- MPI communication time increases gradually as the compute time decreases

(Chart: 32 cores per node)

VASP Profiling: Number of MPI Calls

- The most used MPI functions are for MPI collective operations
  - MPI_Bcast (34%), MPI_Allreduce (16%), MPI_Recv (11%), MPI_Alltoall (8%) at 8 nodes
- Collective operations cause communication time to grow at larger node counts

VASP Profiling: Time Spent in MPI Calls

- The communication time is spent in the following MPI functions:
  - MPI_Alltoallv (40%), MPI_Alltoall (17%), MPI_Bcast (15%) at 8 nodes

VASP Profiling: MPI Message Sizes

- Uneven communication is seen for MPI messages
  - More MPI collective communication takes place on certain MPI ranks

(Charts: 1 node / 32 processes, 4 nodes / 128 processes)

VASP Profiling: MPI Message Sizes

- The communication pattern changes as more processes are involved
  - 4 nodes: the majority of messages are concentrated at 64KB
  - 8 nodes: MPI_Alltoallv is the largest MPI time consumer and is largely concentrated at 64KB
  - 8 nodes: MPI_Bcast is the most frequently called MPI API and is largely concentrated at 4B

(Charts: 4 nodes / 128 processes, 8 nodes / 256 processes)

VASP Profiling: Point-to-Point Data Flow

- The point-to-point data flow shows the communication pattern of VASP
  - VASP mainly communicates on lower ranks
  - The pattern stays the same as the cluster scales

(Charts: 1 node / 32 processes, 4 nodes / 128 processes)

Summary

- Low latency in network communication is required to make VASP scalable
  - QDR InfiniBand delivers good scalability and provides the lowest latency among the interconnects tested: 186% better than 40GbE, over 5 times better than 10GbE and over 8 times better than 1GbE on 8 nodes
  - Ethernet does not scale and becomes inefficient beyond 2 nodes
  - Mellanox messaging accelerations (SRQ and MXM) provide a benefit for VASP at scale
  - Heavy MPI collective communication occurs in VASP
- CPU: running a single core per core pair performs 42% faster than running with both cores
- dell_affinity.exe ensures proper process allocation support in Open MPI and MVAPICH2
- Software stack: the free stack (Open64/MVAPICH2/ACML/ScaLAPACK) performs comparably to the Intel stack
- Additional performance is expected with source-code optimizations and tuning using the latest development tools (such as Open64 and ACML) that support the AMD Interlagos architecture

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.