Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. June 2015

Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: Intel, Dell, Mellanox
  - Compute resource: HPC Advisory Council Cluster Center
- The following was done to provide best practices
  - WRF performance overview
  - Understanding WRF communication patterns
  - Ways to increase WRF productivity
  - MPI libraries comparisons
- For more information please refer to http://www.dell.com, http://www.intel.com, http://www.mellanox.com, http://wrf-model.org

Weather Research and Forecasting (WRF)
- The Weather Research and Forecasting (WRF) Model
  - Numerical weather prediction system
  - Designed for operational forecasting and atmospheric research
- WRF was developed by
  - National Center for Atmospheric Research (NCAR)
  - The National Centers for Environmental Prediction (NCEP)
  - Forecast Systems Laboratory (FSL)
  - Air Force Weather Agency (AFWA)
  - Naval Research Laboratory
  - Oklahoma University
  - Federal Aviation Administration (FAA)

WRF Usage
- The WRF model includes
  - Real-data and idealized simulations
  - Various lateral boundary condition options
  - Full physics options
  - Non-hydrostatic and hydrostatic modes
  - One-way, two-way nesting and moving nest
  - Applications ranging from meters to thousands of kilometers

Objectives
- The presented research was done to provide best practices for WRF performance benchmarking
  - CPU performance comparison
  - MPI library performance comparison
  - Interconnect performance comparison
  - System generations comparison
- The presented results will demonstrate
  - The scalability of the compute environment/application
  - Considerations for higher productivity and efficiency

Test Cluster Configuration
- Dell PowerEdge R730 32-node (896-core) "Thor" cluster
  - Dual-socket 14-core Intel E5-2697 v3 @ 2.60 GHz CPUs (BIOS Power Management set to Maximum Performance)
  - Memory: 64GB DDR4 2133 MHz; BIOS Memory Snoop Mode set to Home Snoop; Turbo enabled
  - OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand software stack
  - Hard drives: 2x 1TB 7.2K RPM 2.5" SATA in RAID 1
- Mellanox ConnectX-4 100Gb/s EDR InfiniBand adapters
- Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand switch
- Mellanox Connect-IB FDR InfiniBand adapters, Mellanox ConnectX-3 VPI 40GbE Ethernet adapters
- Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet switch
- MPI: Mellanox HPC-X v1.2.0-326, Intel MPI 5.0.3
- Compilers: Intel Composer XE 2015.3.187 (WRF compiled with the "(SNB with AVX mods) ifort compiler with icc" configure option)
- Application: WRF 3.6.1 (see the build sketch below); Libraries: NetCDF 4.1.3
- Benchmark: CONUS-12km - a 48-hour, 12km-resolution case over the Continental US from October 24, 2001
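For reference, the following is a minimal sketch of how WRF 3.6.1 could be built with the toolchain listed above (Intel Composer XE compilers and NetCDF 4.1.3). The installation paths are assumptions, and WRF's configure step is interactive, so the "(SNB with AVX mods) ifort compiler with icc" entry (presumably its dmpar/MPI variant, given the flat 28-PPN MPI runs) must be selected from the menu:

    # Build sketch only - paths below are assumptions, not taken from the study
    export NETCDF=/opt/netcdf-4.1.3      # NetCDF 4.1.3 as listed in the configuration
    export CC=icc FC=ifort               # Intel Composer XE 2015.3.187 compilers
    cd WRFV3
    ./configure                          # select the "(SNB with AVX mods) ifort compiler with icc" option (dmpar variant assumed)
    ./compile em_real >& compile.log     # produces wrf.exe, used for the CONUS-12km benchmark runs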

PowerEdge R730 - Massive flexibility for data-intensive operations
- Performance and efficiency
  - Intelligent hardware-driven systems management with extensive power management features
  - Innovative tools including automation for parts replacement and lifecycle manageability
  - Broad choice of networking technologies from GigE to IB
  - Built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
- Benefits
  - Designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
  - High-performance scale-out compute and low-cost dense storage in one package
- Hardware capabilities
  - Flexible compute platform with dense storage capacity
  - 2S/2U server, 6 PCIe slots
  - Large memory footprint (up to 768GB / 24 DIMMs)
  - High I/O performance and optional storage configurations
    - HDD options: 12 x 3.5" or 24 x 2.5" plus 2 x 2.5" HDDs in the rear of the server
    - Up to 26 HDDs, with 2 hot-plug drives in the rear of the server for boot or scratch

WRF Performance - Network Interconnects
- InfiniBand is the only interconnect that delivers superior scalability performance
  - EDR IB provides higher performance than 1GbE, 10GbE or 40GbE
  - Ethernet stops scaling beyond 2 nodes
  - InfiniBand demonstrates continuous performance gain at scale
[Chart: WRF performance by interconnect and node count; annotated speedups of 28x, 51x and 28x; higher is better; 28 MPI processes per node]

WRF Performance - EDR vs FDR InfiniBand
- EDR InfiniBand delivers superior scalability in application performance
  - As the number of nodes scales, the performance gap in favor of EDR IB widens
  - The performance advantage of EDR InfiniBand increases at larger core counts
  - EDR InfiniBand provides a 28% advantage over FDR InfiniBand at 32 nodes (896 cores)
[Chart: EDR vs FDR InfiniBand by node count; annotated gains of 15% and 28%; higher is better; 28 MPI processes per node]

WRF Performance - System Generations
- The Thor cluster (based on Intel E5-2697 v3 "Haswell" CPUs) outperforms prior generations
  - Up to 34-40% higher performance than the Jupiter cluster (based on Intel "Ivy Bridge" CPUs)
- System components used:
  - Jupiter: 2-socket 10-core Intel E5-2680 v2 @ 2.8GHz, 1600MHz DIMMs, Connect-IB FDR IB
  - Thor: 2-socket 14-core Intel E5-2697 v3 @ 2.6GHz, 2133MHz DIMMs, ConnectX-4 EDR IB
[Chart: Thor vs Jupiter by node count; annotated gains of 34% and 40%; higher is better]

WRF Performance - Cores Per Node
- Running more CPU cores provides higher performance
  - ~39% higher productivity with 28 PPN compared to 20 PPN
  - ~13% higher productivity with 28 PPN compared to 24 PPN
- Higher demand on memory bandwidth and the network might limit performance as more cores are used
[Chart: performance by processes per node; annotated gains of 39% and 13%; higher is better; CPU @ 2.6GHz]

WRF Performance - MPI Libraries
- HPC-X delivers slightly higher scalability performance than Intel MPI at 32 nodes (896 cores)
  - HPC-X delivers 8% higher productivity than Intel MPI at 32 nodes (896 cores); the gap increases with node count
- Flags for HPC-X: -mca btl_sm_use_knem 1 -x MXM_SHM_KCOPY_MODE=knem -bind-to core -map-by node
- Flags for Intel MPI: -genv I_MPI_PIN on -genv I_MPI_PIN_PROCESSOR_LIST all -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_SCALABLE_PROGRESS 1
  (launch-line sketches using these flag sets are shown below)
[Chart: HPC-X vs Intel MPI by node count; annotated 8% gain; higher is better; 28 MPI processes per node]
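The flag sets above are launcher options; the hedged sketch below shows how they could be attached to the respective launchers. The flags themselves are copied verbatim from the slide, while the rank count (896 = 32 nodes x 28 PPN), the hostfile name, the -ppn 28 option and the ./wrf.exe path are illustrative assumptions:

    # HPC-X (Open MPI based) launch sketch - hostfile and binary path are assumptions
    mpirun -np 896 -hostfile hostfile \
        -mca btl_sm_use_knem 1 -x MXM_SHM_KCOPY_MODE=knem \
        -bind-to core -map-by node ./wrf.exe

    # Intel MPI 5.0.3 launch sketch - same assumptions, 28 processes per node
    mpirun -np 896 -ppn 28 -hostfile hostfile \
        -genv I_MPI_PIN on -genv I_MPI_PIN_PROCESSOR_LIST all \
        -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_SCALABLE_PROGRESS 1 ./wrf.exe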

WRF Profiling - Time Spent by MPI Calls
- The majority of MPI time is spent in MPI_Bcast
  - MPI_Bcast: 62% of runtime at 32 nodes (896 cores)
  - MPI_Scatter: 8%; MPI_Wait: 2% (waiting for pending non-blocking sends and receives to complete)
  (a sketch of how such a per-call profile can be collected is shown below)
[Chart: MPI time distribution by call at 32 nodes]
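The slides do not name the tool that produced this breakdown. As one hedged way to collect a similar per-call profile, a PMPI-based profiler such as IPM can be preloaded at launch time without recompiling WRF; the library path below is an assumption:

    # Profiling sketch - preload a PMPI-based profiler (library path is an assumption)
    mpirun -np 896 -hostfile hostfile \
        -x LD_PRELOAD=/opt/ipm/lib/libipm.so ./wrf.exe
    # IPM's post-run report summarizes wall-time share per MPI call (MPI_Bcast,
    # MPI_Scatter, MPI_Wait, ...) and message-size buckets, i.e. the same
    # categories shown on this slide and the next.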

WRF Profiling - MPI Message Sizes
- The majority of data transfer messages are of medium size, except:
  - MPI_Bcast has a large concentration (60% of wall time) in small sizes (e.g. 4-byte messages)
  - MPI_Scatter shows some concentration (~6% of wall time) at a 16KB buffer size
  - MPI_Wait: large concentration at 1 to less than 256 bytes
[Chart: MPI message size distribution at 32 nodes]

WRF Profiling - MPI Data Transfer
- As the cluster grows, less data is transferred between MPI processes
  - Decrease from a 523MB maximum per rank (8 nodes) to a 263MB maximum per rank (16 nodes)
- The majority of communications are between neighboring ranks
  - Non-blocking (point-to-point) data transfers are shown in the graph
  - Collective data communications are small compared to non-blocking communications
[Charts: per-rank MPI data transfer at 2 nodes and 32 nodes]

WRF Summary
- Using the best system components can greatly improve WRF scalability performance
- Compute: the Intel Haswell cluster outperforms previous-generation system architectures
  - The Haswell cluster outperforms the Ivy Bridge cluster by 34% at 32 nodes (896 CPU cores)
- Compute: running more CPU cores provides higher performance
  - ~39% higher productivity with 28 PPN compared to 20 PPN, and ~13% compared to 24 PPN
- Network: EDR InfiniBand and the HPC-X MPI library deliver superior scalability in application performance
  - EDR IB delivers 28 times higher performance than 10GbE/40GbE, and 41 times higher than 1GbE, at 32 nodes (896 CPU cores)
  - EDR InfiniBand delivers 28% higher performance than FDR InfiniBand at 32 nodes (896 cores)
  - Mellanox HPC-X provides an 8% performance benefit at scale
- WRF has a high dependency on both network latency and throughput
  - EDR InfiniBand delivers higher network throughput, which allows it to outperform FDR InfiniBand at high node counts
  - The MPI profile shows a large concentration of medium-size messages, which benefits from EDR InfiniBand
- The MPI profile shows that the majority of data transfer messages are of medium size, except:
  - MPI_Bcast has a large concentration in small sizes (e.g. 4-byte messages)
  - MPI_Wait: large concentration at 1 to less than 256 bytes

Thank You - HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.