Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster




Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster
Mahidhar Tatineni (mahidhar@sdsc.edu)
MVAPICH User Group Meeting, August 27, 2014
NSF grants: OCI #0910847 (Gordon: A Data Intensive Supercomputer) and CCF-1213084 (Unified Runtime for Supporting Hybrid Programming Models on Heterogeneous Architecture)

Gordon: A Data Intensive Supercomputer
Designed to accelerate access to massive amounts of data in areas such as genomics, earth science, engineering, and medicine.
- Appro-integrated 1,024-node Sandy Bridge cluster
- 300 TB of high-performance Intel flash
- Large-memory supernodes via vSMP Foundation from ScaleMP
- 3D torus interconnect from Mellanox
- In production operation since February 2012
- Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program

Subrack-Level Architecture: 16 compute nodes per subrack

3D Torus of Switches
- Linearly expandable, with a simple wiring pattern
- Short cables; fiber optic cables generally not required
- Lower cost: 40% as many switches, 25% to 50% fewer cables
- Works well for localized communication
- Fault tolerant within the mesh with 2QoS alternate routing
- Fault tolerant with dual rails for all routing algorithms
- (3rd-dimension wrap-around not shown for clarity)

Gordon System Specification

INTEL SANDY BRIDGE COMPUTE NODE
- Sockets: 2
- Cores: 16
- Clock speed: 2.6 GHz
- DIMM slots per socket: 4
- DRAM capacity: 64 GB

INTEL FLASH I/O NODE
- NAND flash SSD drives: 16
- SSD capacity per drive / per node / total: 300 GB / 4.8 TB / 300 TB
- Flash bandwidth per drive (read/write): 270 MB/s / 210 MB/s
- Flash bandwidth per node (read/write): 4.3 GB/s / 3.3 GB/s

SMP SUPER-NODE
- Compute nodes: 32
- I/O nodes: 2
- Addressable DRAM: 2 TB
- Addressable memory including flash: 12 TB

GORDON (FULL SYSTEM)
- Compute nodes: 1,024
- Total compute cores: 16,384
- Peak performance: 341 TF
- Aggregate memory: 64 TB

INFINIBAND INTERCONNECT
- Type: dual-rail QDR InfiniBand
- Link bandwidth: 8 GB/s (bidirectional)
- Latency (min-max): 1.25 µs - 2.5 µs
- Aggregate torus bandwidth: 9.2 TB/s

DISK I/O SUBSYSTEM
- File systems: /oasis/scratch (1.6 PB), /oasis/projects/nsf (1.5 PB)
- Total storage I/O bandwidth: 100 GB/s
- File system type: Lustre

OSU Micro-Benchmark Results

Software Environment Details
- MVAPICH2-X version 2.0 + Intel compilers; includes unified support for MPI, UPC, OpenSHMEM, and OpenMP
- UPC: berkeley_upc, version 2.18.2
- GASNet version 1.22.4 (ibv conduit)
- OpenSHMEM release: 1.0f

MVAPICH2 Dual-Rail Performance
Results from XSEDE14 paper*:
- Performance on OSU latency and bandwidth micro-benchmarks.
- Single-rail QDR, dual-rail QDR, and FDR InfiniBand compared.
- Evaluates the impact of rail-sharing, scheduling, and threshold parameters.
*"Performance of Applications using Dual-Rail InfiniBand 3D Torus Network on the Gordon Supercomputer," Dong Ju Choi, Glenn K. Lockwood, Robert S. Sinkovits, and Mahidhar Tatineni

OSU Bandwidth Test Results for Single-Rail QDR, FDR, and Dual-Rail QDR Network Configurations
- Single-rail FDR performance is much better than single-rail QDR for message sizes larger than 4K bytes.
- Dual-rail QDR performance exceeds FDR performance at sizes greater than 32K bytes.
- FDR shows better performance between 4K and 32K byte sizes due to the rail-sharing threshold.
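For context on what this test measures: the OSU bandwidth benchmark streams a window of non-blocking sends per iteration and reports the sustained rate at each message size. Below is a minimal sketch in that spirit (not the actual OSU source); the maximum message size, window length, and iteration count are illustrative choices.

```c
/* Minimal sketch of a windowed point-to-point bandwidth test in the
 * spirit of osu_bw (not the actual OSU source). Run with 2 MPI ranks,
 * one per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_MSG (4 * 1024 * 1024)  /* largest message tested, in bytes */
#define WINDOW  64                 /* outstanding sends per iteration  */
#define ITERS   20

int main(int argc, char **argv) {
    int rank;
    MPI_Request reqs[WINDOW];
    char *buf = malloc(MAX_MSG);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int len = 1; len <= MAX_MSG; len *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Isend(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* ack */
            } else if (rank == 1) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Irecv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); /* ack */
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%10d bytes  %10.2f MB/s\n", len,
                   (double)len * WINDOW * ITERS / 1e6 / (t1 - t0));
    }

    MPI_Finalize();
    free(buf);
    return 0;
}
```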

OSU Bandwidth Test Performance with MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8K
- Lowering the rail-sharing threshold closes the dual-rail QDR vs. FDR performance gap down to 8K bytes.

OSU Bandwidth Test Performance with MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8K and MV2_RAIL_SHARING_POLICY=ROUND_ROBIN
- Adding the explicit round-robin policy causes tasks to communicate over different rails.
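MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD and MV2_RAIL_SHARING_POLICY are MVAPICH2 runtime parameters that are normally exported in the job script or passed on the mpirun_rsh command line. Purely as an illustration, and assuming (as is usual for MV2_* variables) that they are read from the process environment during MPI_Init, a program could also place them in its own environment before initialization:

```c
/* Illustration only: MV2_* parameters are normally set in the job script
 * or on the launcher command line, not in application source code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Rail-sharing settings evaluated on the slides above. */
    setenv("MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD", "8K", 1);
    setenv("MV2_RAIL_SHARING_POLICY", "ROUND_ROBIN", 1);

    MPI_Init(&argc, &argv);
    /* ... benchmark or application code ... */
    MPI_Finalize();
    return 0;
}
```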

OSU Latency Benchmark Results for QDR, Dual-Rail QDR with the Round-Robin Option, and FDR
- Distributing messages across HCAs using the round-robin option increases latency at small message sizes.
- Still, the latency results are better than in the FDR case.

UPC memput latency, MVAPICH2-X vs. Berkeley UPC (two nodes, 1 task/node). [Plot: latency in µs vs. message size, 1 byte to 4 MB, for MVAPICH2-X and Berkeley UPC.]

UPC memget latency, MVAPICH2-X vs. Berkeley UPC (two nodes, 1 task/node). [Plot: latency in µs vs. message size, 1 byte to 4 MB, for MVAPICH2-X and Berkeley UPC.]
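For reference, a minimal UPC latency loop in the spirit of these memput/memget tests could look like the sketch below (this is not the actual OSU or Berkeley UPC benchmark source; the buffer size and iteration count are illustrative). It assumes 2 UPC threads, one per node, as in the plots above; timing upc_memget instead of upc_memput gives the corresponding get latency.

```c
/* Minimal sketch of a upc_memput latency test (run with 2 UPC threads,
 * one per node). Not the actual benchmark source. */
#include <upc.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define MAX_MSG (4 * 1024 * 1024)
#define ITERS   1000

/* Shared pointer to a buffer that lives entirely in thread 1's partition. */
shared [] char * shared remote_buf;

int main(void) {
    if (MYTHREAD == 1)
        remote_buf = upc_alloc(MAX_MSG);   /* affinity: thread 1 */
    upc_barrier;

    if (MYTHREAD == 0) {
        char *lbuf = malloc(MAX_MSG);
        for (size_t len = 1; len <= MAX_MSG; len *= 2) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            for (int i = 0; i < ITERS; i++) {
                upc_memput(remote_buf, lbuf, len);   /* put into thread 1 */
            }
            gettimeofday(&t1, NULL);
            double us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (double)(t1.tv_usec - t0.tv_usec);
            printf("%10zu bytes  %8.2f us per put\n", len, us / ITERS);
        }
        free(lbuf);
    }
    upc_barrier;
    return 0;
}
```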

OpenSHMEM put latency, MVAPICH2-X vs. OpenSHMEM V1.0f (UH) (two tasks, 1 task/node). [Plot: latency in µs vs. message size, 1 byte to 256 KB, for MVAPICH2-X and UH.]
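A corresponding OpenSHMEM put latency sketch, written against the OpenSHMEM 1.0-era API used in these runs (not the actual OSU benchmark source; sizes and iteration counts are illustrative), assuming 2 PEs with one task per node:

```c
/* Minimal sketch of a shmem_putmem latency test (run with 2 PEs,
 * one per node). Uses the OpenSHMEM 1.0-era API (start_pes/_my_pe). */
#include <shmem.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_MSG (256 * 1024)
#define ITERS   1000

int main(void) {
    start_pes(0);
    int me = _my_pe();

    /* Symmetric buffers: same address on every PE (static storage). */
    static char src[MAX_MSG], dst[MAX_MSG];
    memset(src, 'a', MAX_MSG);

    shmem_barrier_all();

    if (me == 0) {
        for (size_t len = 1; len <= MAX_MSG; len *= 8) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            for (int i = 0; i < ITERS; i++)
                shmem_putmem(dst, src, len, 1);   /* put into PE 1 */
            gettimeofday(&t1, NULL);
            double us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (double)(t1.tv_usec - t0.tv_usec);
            printf("%10zu bytes  %8.2f us per put\n", len, us / ITERS);
        }
    }
    shmem_barrier_all();
    return 0;
}
```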

OpenSHMEM - OSU Atomics Benchmarks (MV2X = MVAPICH2-X, UH = reference OpenSHMEM; rates in million ops/s, latency in µs)

Operation              MV2X Mops/s   MV2X Latency   UH Mops/s   UH Latency
shmem_int_fadd         0.31          3.19           0.18        5.50
shmem_int_finc         0.40          2.53           0.20        5.04
shmem_int_add          0.42          2.36           0.22        4.60
shmem_int_inc          0.41          2.44           0.01        69.22
shmem_int_cswap        0.38          2.66           0.21        4.83
shmem_int_swap         0.40          2.49           0.22        4.53
shmem_longlong_fadd    0.38          2.61           0.22        4.58
shmem_longlong_finc    0.42          2.41           0.01        71.51
shmem_longlong_add     0.42          2.38           0.23        4.42
shmem_longlong_inc     0.42          2.38           0.22        4.62
shmem_longlong_cswap   0.42          2.39           0.21        4.75
shmem_longlong_swap    0.40          2.50           0.03        33.10
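As a reminder of the semantics behind these numbers, shmem_int_fadd is an atomic fetch-and-add on a symmetric variable residing on a remote PE. A hedged sketch of the calling pattern (OpenSHMEM 1.0 API; run with 2 PEs):

```c
/* Hedged sketch of the shmem_int_fadd calling pattern (OpenSHMEM 1.0 API).
 * Run with 2 PEs; PE 0 atomically increments a counter that lives on PE 1. */
#include <shmem.h>
#include <stdio.h>

int counter = 0;   /* symmetric: exists at the same address on every PE */

int main(void) {
    start_pes(0);
    int me = _my_pe();

    shmem_barrier_all();
    if (me == 0) {
        /* Atomically add 1 to PE 1's copy of 'counter' and fetch the old value. */
        int old = shmem_int_fadd(&counter, 1, 1);
        printf("PE 0 saw old value %d on PE 1\n", old);
    }
    shmem_barrier_all();

    if (me == 1)
        printf("PE 1 counter is now %d\n", counter);
    return 0;
}
```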

OpenSHMEM Barrier: MVAPICH2-X vs. OpenSHMEM V1.0f (two nodes)

Tasks   MVAPICH2-X latency (µs)   Release OpenSHMEM (UH) latency (µs)*
2       1.87                      15.85
4       2.44                      77.97
8       2.71                      191.90
16      3.29                      430.99
32      7.01                      1009.97

*Release OpenSHMEM version run with the GASNet ibv conduit. No optimizations attempted; further work may be needed.

OpenSHMEM broadcast latency, MVAPICH2-X vs. UH (2 nodes, 8 tasks/node, 16 tasks total). [Plot: latency in µs vs. message size, 1 byte to 1 MB, for MVAPICH2-X and UH.]

OpenSHMEM broadcast latency, OSU benchmark, MVAPICH2-X. [Plot: latency in µs vs. message size, 1 byte to 1 MB, for 32, 64, 128, 256, and 512 cores.]
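For reference on the interface being measured: shmem_broadcast32 copies a block of 32-bit words from a root PE to the other PEs in an active set and requires a symmetric pSync work array initialized to _SHMEM_SYNC_VALUE. A hedged sketch of the calling pattern (OpenSHMEM 1.0 API; NELEMS is an illustrative size; run with at least 2 PEs):

```c
/* Hedged sketch of a shmem_broadcast32 call (OpenSHMEM 1.0 API).
 * Broadcasts NELEMS 32-bit words from PE 0 to all other PEs. */
#include <shmem.h>
#include <stdio.h>

#define NELEMS 1024

int src[NELEMS], dst[NELEMS];            /* symmetric data buffers */
long pSync[_SHMEM_BCAST_SYNC_SIZE];      /* symmetric work array   */

int main(void) {
    start_pes(0);
    int me = _my_pe();

    for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = _SHMEM_SYNC_VALUE;
    if (me == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i;
    shmem_barrier_all();                 /* pSync must be initialized on all PEs */

    /* Root = PE 0; active set = all PEs (start 0, log2 stride 0, size num_pes). */
    shmem_broadcast32(dst, src, NELEMS, 0, 0, 0, _num_pes(), pSync);

    shmem_barrier_all();
    if (me == 1)
        printf("PE 1 received dst[NELEMS-1] = %d\n", dst[NELEMS - 1]);
    return 0;
}
```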

Applications:
#1 NAS Parallel Benchmarks:
  (a) NPB 2.4 UPC version - GWU/HPCL, http://threads.hpcl.gwu.edu/sites/npb-upc
  (b) NPB 3.2 OpenSHMEM version - University of Houston, https://github.com/jeffhammond/openshmem-npbs
#2 CP2K (hybrid MPI + OpenMP) [ongoing]
#3 PSDNS (hybrid MPI + SHMEM) [ongoing]

NPB Class D CG, UPC version, MVAPICH2-X

Cores (Nodes)   Time (s)
32 (2)          385.54
64 (4)          229.46
128 (8)         100.08
256 (16)        64.39

CG: Conjugate Gradient; irregular memory access and communication.

NPB Class C IS, UPC version, MVAPICH2-X

Cores (Nodes)   Time (s)
16 (1)          1.52
32 (2)          1.11
64 (4)          0.90
128 (8)         0.74
256 (16)        0.46

IS: Integer Sort; random memory access.

NPB Class D MG, UPC version, MVAPICH2-X

Cores (Nodes)   Time (s)
32 (2)          46.89
64 (4)          34.77
128 (8)         18.14
256 (16)        8.85
512 (32)        6.07

MG: Multi-Grid on a sequence of meshes; long- and short-distance communication; memory intensive.

NPB Class D SP, OpenSHMEM version, MVAPICH2-X

Cores   Time (s)
16      2535.42
64      699.16
256     144.68

SP: Scalar Penta-diagonal solver.

NPB Class D MG, OpenSHMEM version, MVAPICH2-X

Cores (Nodes)   Time (s)
16 (1)          169.95
32 (2)          93.97
64 (4)          35.90
128 (8)         19.48
256 (16)        9.81

MG: Multi-Grid on a sequence of meshes; long- and short-distance communication; memory intensive.

Ongoing/Future Work
- Further investigate the performance results; look into the OpenSHMEM v1.0f aspects.
- Testing at larger scales (Gordon, Stampede, Comet).
- Hybrid code performance: PSDNS MPI + OpenSHMEM version (Dmitry Pekurovsky); CP2K MPI + OpenMP.
- Big thanks to Dr. Panda's team for their excellent work and support for MVAPICH2 and MVAPICH2-X!