OpenMP Programming on ScaleMP



Dirk Schmidl (schmidl@rz.rwth-aachen.de)
Rechen- und Kommunikationszentrum (RZ), RWTH Aachen

MPI vs. OpenMP
MPI:
- distributed address space
- explicit message passing
- typically requires a code redesign
- e.g. the Jacobi example is parallelized with 35 modified lines
=> harder to program
OpenMP:
- shared address space
- read or write to shared variables
- incremental parallelization
- e.g. the Jacobi example is parallelized with 3 modified lines
=> easier for a first parallelization
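To make the line-count comparison concrete, here is a minimal sketch (not the actual Jacobi code referred to on the slide) of what the OpenMP side of such a change typically looks like: a plain 2D Jacobi sweep where the only OpenMP-specific addition is the pragma line.

    /* Minimal sketch, not the original example from the slide: a 2D Jacobi
     * sweep on an n x n grid. The only OpenMP-specific addition is the single
     * pragma; the serial loop nest stays untouched. */
    void jacobi_sweep(int n, const double *u, double *unew)
    {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                unew[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                        + u[i * n + j - 1] + u[i * n + j + 1]);
            }
        }
    }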

Cluster vs. SMP
Clusters:
- many nodes
- network interconnect
- distributed address space
- programmable with MPI
- largest cluster: Sequoia, IBM Blue Gene/Q, 1,572,864 compute cores, ~20 PFLOP/s peak performance
SMPs:
- one big shared address space
- one operating system instance
- programmable with OpenMP or MPI
- largest SMP: SGI Altix UV, up to 2048 cores, 16 TB shared memory, ~47 TFLOP/s peak performance (www.sgi.com/uv)

BCS systems @ RWTH Aachen
High-level overview:
- 4 boards with 4 sockets each
- 2 levels of NUMAness
- 128 cores
- up to 2 TB RAM
[Diagram: four boards, each with an IB adapter, I/O hub and BCS chip, coupled via XCSI cables. Source: D. Gutfreund, Bull GmbH]

ScaleMP
[Diagram: one Bullx s6030 board with four Intel Xeon X7550 sockets @ 2 GHz, 64 GB RAM per socket, coupled via QPI, plus two IB HCAs.]

ScaleMP
vSMP Foundation:
- creates the single system image
- internally uses prefetching and caching mechanisms to optimize memory accesses
- most of the mechanisms work on a 4 KB basis
[Diagram: one Linux instance running on top of vSMP Foundation, which couples 16 s6030 boards via 4x QDR InfiniBand.]

ScaleMP system @ RWTH Aachen
Resources of the ScaleMP system:
- CPUs: 64 Xeon X7550
- Cores: 512
- Clock rate: 2.00 GHz
- Memory: ~4 TB
- Memory bandwidth: 886 GB/s
- Accumulated L3 cache: 1.2 GB
- Peak performance: 4 TFLOP/s
- vSMP version: 4.0
Operated in batch mode under LSF:
- jobs are scheduled board-wise
- a binding script restricts execution to a set of boards to minimize interference between different jobs

Case Study: TrajSearch
- Direct Numerical Simulation of a three-dimensional turbulent flow field produces large output arrays
  - 16,384 processors on a BlueGene and ~½ year of computation produced a 2048³ output grid (30 GB)
- Trajectory analysis (TrajSearch) implemented with OpenMP
  - was running fine on a standard 8-core system for the 1024³ grid cell data
  - did not work for the 2048³ dataset
Institute for Combustion Technology, Chair for Operating Systems, Center for Computing and Communication

Case Study: TrajSearch
- Post-processing code for dissipation element analysis
- Follows trajectories starting at each grid cell in the direction of the ascending and descending gradient of a passive scalar field
- Trajectories lead to a local maximum or minimum, respectively
- The set of all grid cells whose trajectories end in the same pair of extremal points defines a dissipation element
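The following is only a simplified illustration of the trajectory idea, not the actual TrajSearch implementation: it assumes the scalar field is stored in a flat n³ array and approximates the ascending trajectory by repeatedly stepping to the largest of the 26 neighbors until a local maximum is reached; the descending case is analogous with the smallest neighbor.

    /* Simplified sketch of the trajectory idea (not the original TrajSearch
     * code): starting from a cell, repeatedly step to the neighboring cell
     * with the largest scalar value until no neighbor is larger, i.e. a local
     * maximum of the passive scalar field phi is reached. */
    #include <stddef.h>

    static size_t idx(size_t i, size_t j, size_t k, size_t n)
    {
        return (i * n + j) * n + k;
    }

    size_t follow_ascending(const double *phi, size_t n, size_t i, size_t j, size_t k)
    {
        for (;;) {
            size_t best = idx(i, j, k, n);
            size_t bi = i, bj = j, bk = k;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    for (int dk = -1; dk <= 1; dk++) {
                        size_t ni = i + di, nj = j + dj, nk = k + dk;
                        if (ni >= n || nj >= n || nk >= n)
                            continue;            /* unsigned wrap-around acts as bounds check */
                        if (phi[idx(ni, nj, nk, n)] > phi[best]) {
                            best = idx(ni, nj, nk, n);
                            bi = ni; bj = nj; bk = nk;
                        }
                    }
            if (bi == i && bj == j && bk == k)
                return best;                     /* local maximum reached */
            i = bi; j = bj; k = bk;
        }
    }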

Reducing synchronization
OpenMP synchronization:
- Global list of extremal points (exp)
  - use individual local copies
  - merge local copies only once
- Trajectory count field (trajcount)
  - locally buffered increment operations, continuously flushed
OS/runtime hidden synchronization:
- Random number generation
  - implement our own random number generator
- Memory allocation
  - switch to kmp_malloc
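As a sketch of the "local copies, merged once" idea (with made-up names and a generic dynamic array, not the original TrajSearch data structures): each thread collects its extremal points in a private list and the shared list is entered exactly once per thread instead of once per point.

    /* Sketch of the "local copies, merge once" pattern. Each thread appends to
     * its private list without synchronization; the global list is updated in
     * one critical section per thread. */
    #include <stdlib.h>

    typedef struct { double x, y, z; } point_t;

    typedef struct {
        point_t *data;
        size_t   size, cap;
    } list_t;

    static void list_push(list_t *l, point_t p)
    {
        if (l->size == l->cap) {
            l->cap  = l->cap ? 2 * l->cap : 64;
            l->data = realloc(l->data, l->cap * sizeof *l->data);
        }
        l->data[l->size++] = p;
    }

    void collect_extremal_points(list_t *global, size_t ncells)
    {
        #pragma omp parallel
        {
            list_t local = { NULL, 0, 0 };          /* private to this thread */

            #pragma omp for
            for (long c = 0; c < (long)ncells; c++) {
                point_t p = /* ... extremal point found for cell c ... */ (point_t){0, 0, 0};
                list_push(&local, p);               /* no synchronization needed */
            }

            #pragma omp critical                    /* merge once per thread, not once per point */
            {
                for (size_t i = 0; i < local.size; i++)
                    list_push(global, local.data[i]);
            }
            free(local.data);
        }
    }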

Reducing remote memory accesses
Column-major order arrays:
- trajectory searches need neighbors in 3D and move in a-priori unknown directions
- only data in one direction is loaded; nearly no reuse of data on a loaded remote page
3D memory blocks:
- cache-line locality in a three-dimensional fashion
- 8x8x8 double elements are combined into a 4 KByte block
- data has to be reordered in memory
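A sketch of the index mapping for such a blocked layout, assuming for simplicity a cubic grid whose dimension n is a multiple of the block size; the function name is illustrative.

    /* Blocked 3D layout: 8x8x8 doubles (4096 bytes, one 4 KB unit) are stored
     * contiguously, so a trajectory moving in any direction stays within a few
     * blocks instead of touching many remote pages. */
    #include <stddef.h>

    #define B 8   /* 8*8*8 doubles = 4 KByte per block */

    static inline size_t blocked_index(size_t i, size_t j, size_t k, size_t n)
    {
        size_t nb    = n / B;                                /* blocks per dimension */
        size_t block = (i / B * nb + j / B) * nb + k / B;    /* which block          */
        size_t inner = (i % B * B + j % B) * B + k % B;      /* offset inside block  */
        return block * (B * B * B) + inner;
    }

    /* usage: a[blocked_index(i, j, k, n)] instead of a[(i * n + j) * n + k] */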

Reducing remote memory accesses
Adaptive NUMA scheduler:
- combines the principles of static and dynamic scheduling
- a static layout defines how the work should be distributed across the nodes
- work is distributed dynamically between the threads on the same node
- idle nodes take work from the node with the least progress
- to reduce interference, threads working on a foreign node's share take iterations from the highest index backwards
[Diagram: example distribution of iterations 0-31 over nodes 0-3 with four cores each, showing dynamic chunks within a node and stealing from the back of a foreign node's range.]
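A much simplified sketch of this combined static/dynamic idea follows; the data structures, chunk size and the use of a named critical section are illustrative choices, not the original scheduler. Iterations are split statically into one range per node, threads of a node take chunks from the front of their own range, and idle threads steal chunks from the back of the range of the node with the most remaining work. How a thread learns its node id (my_node) is left to the caller, e.g. derived from the thread binding.

    #define NODES 4
    #define CHUNK 16

    typedef struct { long lo, hi; } range_t;   /* [lo, hi): remaining iterations */

    static range_t range[NODES];

    /* call once before the parallel region: static split across nodes */
    void init_ranges(long niter)
    {
        long per = (niter + NODES - 1) / NODES;
        for (int n = 0; n < NODES; n++) {
            range[n].lo = n * per;
            range[n].hi = (n + 1) * per < niter ? (n + 1) * per : niter;
        }
    }

    /* take up to CHUNK iterations from the front (own node) or back (stealing) */
    static int take(range_t *r, int from_back, long *begin, long *end)
    {
        int ok = 0;
        #pragma omp critical(range_update)
        if (r->lo < r->hi) {
            long left = r->hi - r->lo;
            long c    = left < CHUNK ? left : CHUNK;
            if (from_back) { *begin = r->hi - c; *end = r->hi;     r->hi -= c; }
            else           { *begin = r->lo;     *end = r->lo + c; r->lo += c; }
            ok = 1;
        }
        return ok;
    }

    /* call from inside a parallel region; each thread passes the node it is bound to */
    void run(int my_node, void (*body)(long))
    {
        long begin, end;

        /* own node's iterations first, taken dynamically in chunks */
        while (take(&range[my_node], 0, &begin, &end))
            for (long i = begin; i < end; i++) body(i);

        /* then steal from the node with the most remaining work */
        for (;;) {
            int victim = -1; long most = 0;
            #pragma omp critical(range_update)
            for (int n = 0; n < NODES; n++) {
                long left = range[n].hi - range[n].lo;
                if (left > most) { most = left; victim = n; }
            }
            if (victim < 0) break;                                   /* nothing left anywhere */
            if (!take(&range[victim], 1, &begin, &end)) continue;    /* lost the race, rescan */
            for (long i = begin; i < end; i++) body(i);
        }
    }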

Machine Comparison
Resources:

Resource          | ScaleMP: Nehalem EX | ScaleMP: SandyBridge EP | Bull BCS: Nehalem EX  | SGI Altix UV: Nehalem EX
Processors        | 64 Xeon X7550       | 16 Xeon E5-2670         | 16 Xeon X7550         | 32 Xeon X7550
Sockets per board | 4                   | 2                       | 4                     | 2
Cores             | 512                 | 128                     | 128                   | 256
Clock rate        | 2.00 GHz            | 2.60 GHz                | 2.00 GHz              | 2.00 GHz
RAM               | ~4 TB               | 512 GB                  | 256 GB - 2 TB         | 256 GB
vSMP version      | 4.0                 | 5.0                     | -                     | -
Interconnect      | InfiniBand          | InfiniBand              | Bull Coherence Switch | NUMAlink

Performance Comparison: medium dataset
[Chart: runtime in seconds and speedup over the number of threads (up to 512) for SGI Altix UV: Nehalem EX, SCALEMP: Nehalem EX, Bull BCS: Nehalem EX and SCALEMP: SandyBridge EP.]

Conclusion
- OpenMP programming is easy, but limited to smaller machines.
- Getting good performance can be hard.
- Big SMP machines like the ScaleMP machine can deliver good performance.
- An incremental approach can be used instead of completely rewriting the simulation code.