Performance Characteristics of Large SMP Machines



Performance Characteristics of Large SMP Machines
Dirk Schmidl, Dieter an Mey, Matthias S. Müller
schmidl@rz.rwth-aachen.de
Rechen- und Kommunikationszentrum (RZ)

Agenda
- Investigated Hardware
- Kernel Benchmark Results
  - Memory Bandwidth
  - NUMA Distances
  - Synchronization
- Applications
  - NestedCP
  - TrajSearch
- Conclusion

Hardware
HP ProLiant DL980 G7
- 8 x Intel Xeon X6550 @ 2 GHz
- 256 GB main memory
- internally consists of several boards

SGI Altix UltraViolet
- 14 x Intel Xeon E7-4870 @ 2.4 GHz
- about 2 TB main memory
- 2-socket boards connected with the NUMAlink network

Bull Coherence Switch System
- 16 x Intel Xeon X7550 @ 2 GHz
- 256 GB main memory
- four 4-socket boards externally connected with the Bull Coherence Switch (BCS)

Hardware
ScaleMP System
- 64 x Intel Xeon X7550 @ 2 GHz
- about 4 TB main memory
- 4-socket boards connected with InfiniBand
- vSMP Foundation software used to create a cache-coherent single system

Intel Xeon Phi
- 1 Intel Xeon Phi coprocessor @ 1.5 GHz, plugged into a PCIe slot
- 8 GB main memory

Serial Bandwidth
[Plots: serial write bandwidth in GB/s over block sizes from 1 B to 4 GB for the HP, BCS, Altix UV, ScaleMP, and Phi systems; curves for local, remote 1st-level, and remote 2nd-level accesses, and for standard vs. software prefetching.]

Distance Matrix
Measured bandwidth between sockets, with memory and threads placed with numactl, normalized to the value of the local socket.
[Tables: 16 x 16 socket distance matrix for the BCS machine and 8 x 8 matrix for the HP machine. On the BCS machine, remote accesses within a 4-socket board cost about 1.3 x the local value and accesses across the BCS about 5.5 - 6 x; on the HP machine the factors range from about 1.1 x to 1.9 x.]
- Remote accesses are much more expensive on the BCS machine.
- The HP machine internally also has several NUMA levels.
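A minimal sketch of how one entry of such a matrix could be measured is given below; it pins the measuring thread and the buffer to chosen NUMA nodes with libnuma (the slides use numactl for placement, so the libnuma calls, node arguments, and buffer size here are illustrative assumptions, not the benchmark actually used):

    /* Sketch: write bandwidth from one NUMA node to the memory of another.
     * Placement is done programmatically with libnuma here; the measurements
     * on the slide place threads and memory with numactl instead.
     * Build: gcc -O2 bw.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (256UL * 1024 * 1024)   /* 256 MB test buffer (assumption) */

    static double now(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(int argc, char **argv) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        int cpu_node = (argc > 1) ? atoi(argv[1]) : 0;   /* node running the thread */
        int mem_node = (argc > 2) ? atoi(argv[2]) : 0;   /* node holding the memory */

        numa_run_on_node(cpu_node);                      /* pin execution */
        char *buf = numa_alloc_onnode(SIZE, mem_node);   /* allocate on target node */
        memset(buf, 1, SIZE);                            /* touch pages once */

        double t = now();
        memset(buf, 2, SIZE);                            /* timed write sweep */
        t = now() - t;

        printf("cpu node %d -> mem node %d: %.2f GB/s\n",
               cpu_node, mem_node, SIZE / t / 1e9);
        numa_free(buf, SIZE);
        return 0;
    }

Running this for every pair of nodes and normalizing to the local case yields a matrix like the ones on the slide.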

Parallel Bandwidth
[Plot: aggregate read and write bandwidth in GB/s over the number of threads for the HP, Altix, BCS, ScaleMP, and Phi systems.]
- Read and write bandwidth on local data
- 16 MB memory footprint per thread

mem_go_around
- Investigates the slow-down when remote accesses occur.
- Every thread initializes local memory and measures the bandwidth.
- In step n, thread t uses the memory of thread (t + n) % nthreads (see the sketch below).
- This increases the number of remote accesses in every step.
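A minimal OpenMP sketch of this access pattern, assuming one buffer per thread, a plain write sweep as the bandwidth kernel, and an illustrative buffer size (these details are assumptions, not the benchmark's actual parameters):

    /* mem_go_around-style kernel: in turn n, thread t writes the buffer
     * that thread (t + n) % nthreads first-touched, so the share of
     * remote accesses grows from turn to turn. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N (64UL * 1024 * 1024)            /* 64 MB per thread (assumption) */

    int main(void) {
        int nthreads = omp_get_max_threads();
        char **buf = malloc(nthreads * sizeof(char *));

        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            buf[t] = malloc(N);
            memset(buf[t], 1, N);             /* first touch: pages become local to t */

            for (int n = 0; n < nthreads; n++) {
                #pragma omp barrier           /* start every turn together */
                double start = omp_get_wtime();
                memset(buf[(t + n) % nthreads], 2, N);   /* turn n: shifted target */
                double bw = N / (omp_get_wtime() - start) / 1e9;
                #pragma omp critical
                printf("turn %d, thread %d: %.2f GB/s\n", n, t, bw);
            }
        }
        return 0;
    }

On a flat SMP all turns would perform alike; on the NUMA machines the later turns expose the remote-access penalty, which is what the following plot shows.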

mem_go_around
[Plot: memory bandwidth in GB/s (logarithmic scale) over the turns of mem_go_around for the HP, Altix, BCS, ScaleMP, and Phi systems.]

Synchronization
Overhead in microseconds to acquire a lock (thread counts of the form a/b denote a threads on the x86 systems and b on the Xeon Phi):

#threads   BCS    SCALEMP   PHI    ALTIX   HP
1          0.6    0.7       0.4    0.5     0.93
8          0.27   0.29      1.89   0.21    0.26
32/30      0.62   0.99      1.77   3.29    0.97
64/60      1.4    24.36     1.94   3.72    1.7
128/120    1.64   35.78     2.1    2.99    -
240        -      -         2.26   -       -

- Synchronization overhead rises with the number of threads.
- ScaleMP introduces more overhead for large thread counts.
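A minimal OpenMP sketch of such a lock micro-benchmark; the iteration count and the use of omp_set_lock/omp_unset_lock as the measured primitive are assumptions, not necessarily the exact kernel from the talk:

    /* Sketch: average time per lock acquire/release with all threads
     * contending for one OpenMP lock. Iteration count is an assumption. */
    #include <omp.h>
    #include <stdio.h>

    #define ITER 100000

    int main(void) {
        omp_lock_t lock;
        omp_init_lock(&lock);
        volatile long counter = 0;

        double start = omp_get_wtime();
        #pragma omp parallel
        {
            for (int i = 0; i < ITER; i++) {
                omp_set_lock(&lock);
                counter++;                 /* tiny critical section */
                omp_unset_lock(&lock);
            }
        }
        double elapsed = omp_get_wtime() - start;

        long total = (long)ITER * omp_get_max_threads();
        printf("%d threads: %.2f us per lock acquisition\n",
               omp_get_max_threads(), elapsed / total * 1e6);
        omp_destroy_lock(&lock);
        return 0;
    }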

NestedCP: Parallel Critical Point Extraction
Virtual Reality Group of RWTH Aachen University:
- Analysis of large-scale flow simulations
- Feature extraction from raw data
- Interactive analysis in a virtual environment (e.g. a CAVE)
Critical point: a point in the vector field with zero velocity.
(Andreas Gerndt, Virtual Reality Center, RWTH Aachen)

NestedCP
- Parallelization done with OpenMP tasks (see the sketch below)
- Many independent tasks, only synchronized at the end
[Plot: runtime in seconds and speedup over 1 to 240 threads for the BCS, ScaleMP, Phi, Altix, and HP systems.]
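The critical-point search itself is not shown in the talk; the following is only a generic sketch of the task pattern described above (independent tasks created in one region and synchronized once at the end), with a hypothetical process_cell() standing in for the per-cell work:

    /* Sketch of the task pattern: one thread creates many independent
     * tasks, all threads execute them, and the implicit barrier at the
     * end of the single construct is the only synchronization point.
     * process_cell() is a hypothetical placeholder. */
    #include <omp.h>
    #include <stdio.h>

    #define NCELLS 100000

    static void process_cell(int c) {
        /* placeholder for the per-cell critical point test */
        (void)c;
    }

    int main(void) {
        double start = omp_get_wtime();

        #pragma omp parallel
        #pragma omp single                 /* one thread creates the tasks */
        {
            for (int c = 0; c < NCELLS; c++) {
                #pragma omp task firstprivate(c)
                process_cell(c);
            }
        }                                  /* implicit barrier: all tasks done */

        printf("runtime: %.3f s\n", omp_get_wtime() - start);
        return 0;
    }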

TrajSearch
- Direct numerical simulation of a three-dimensional turbulent flow field produces large output arrays.
- 16384 processes on a BlueGene and about half a year of computation produced a 2048³ output grid (32 GB).
- The trajectory analysis (TrajSearch) implemented with OpenMP was optimized for large NUMA machines.
- Here the 1024³ grid cell data set was used (~4 GB).
(Institute for Combustion Technology)

TrajSearch
Optimizations:
- reduced number of locks
- NUMA-aware data initialization (see the sketch below)
- data blocked into 8x8x8 blocks to load the nearest data on ScaleMP
- self-written NUMA-aware scheduler
[Plot: runtime in hours and speedup over 8 to 128 threads for the Altix, BCS, and ScaleMP systems.]
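NUMA-aware data initialization typically relies on first-touch page placement: each thread initializes the part of the data it will later work on, so the pages end up on its own NUMA node. The following is only a minimal sketch of that idea; the static schedule, array layout, and sizes are generic assumptions, not TrajSearch's actual data structures:

    /* Sketch of first-touch NUMA-aware initialization: the same static
     * schedule is used for initialization and computation, so each
     * thread's pages are placed on and later read from its own node. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (512UL * 1024 * 1024 / sizeof(double))   /* 512 MB of doubles */

    int main(void) {
        double *field = malloc(N * sizeof(double));

        /* first touch: pages are allocated on the node of the touching thread */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < N; i++)
            field[i] = 0.0;

        /* later compute phase with the same schedule hits mostly local pages */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (size_t i = 0; i < N; i++)
            sum += field[i];

        printf("sum = %f\n", sum);
        free(field);
        return 0;
    }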

Conclusion
- Larger systems provide a larger total memory bandwidth.
- The overhead for many remote accesses is also higher on larger systems, as seen in the mem_go_around test.
- The caching in the vSMP software can hide the remote latency, even when larger arrays are read or written remotely.
- Synchronization is a problem on all systems and it increases with the number of cores.
- The Xeon Phi system delivers good bandwidth and low synchronization overhead for a large number of threads.
- Applications can run well on large NUMA machines.

Remark: A revised version with newer performance measurements will soon be available on our website under publications: https://sharepoint.campus.rwth-aachen.de/units/rz/hpc/public/default.aspx

Thank you for your attention! Questions?