Scalability evaluation of barrier algorithms for OpenMP

Scalability evaluation of barrier algorithms for OpenMP
Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman, and Haoqiang H. Jin
High Performance Computing and Tools Group (HPCTools), Computer Science Department, Texas
Supported by NSF under grants CCF-0702775 and CCF-0833201

Agenda
- Motivation
- Overview of the different barrier implementations
- Methodology
- Experimental results
- Conclusion
- Future work

Motivation
- The number of cores per system continues to grow, so OpenMP application scalability is critical.
- Architectures, memory hierarchies, interconnects, and applications are diverse.
- This calls for an adaptive runtime.

OpenMP and Barriers
OpenMP relies heavily on barrier synchronization to coordinate the work of threads. Most OpenMP constructs involve implicit barriers and incur overhead because of them.
Barrier algorithm classification:
1. Centralized busy-wait: busy-wait on a shared variable.
2. Blocking: a pthread call de-schedules the waiting thread.
3. Distributed busy-wait: busy-wait on a separate, locally accessible variable. Variants:
   - Dissemination barrier
   - Tournament barrier
   - Tree barrier

Centralized barrier
A shared counter accessible by all threads is incremented atomically. Busy-waiting on a single flag causes hot spots and large amounts of memory and interconnect traffic. The effect becomes more pronounced as the application scales.

Blocking barrier algorithm
Improves upon the centralized shared-counter algorithm by de-scheduling waiting threads. Suitable when threads are frequently oversubscribed, as in multiuser environments or with task support. Scales poorly because of mutex and condition-variable overhead and context switching.

Overheads of the blocking barrier (EPCC benchmark) [chart comparing OpenUH 4.x and Intel 9.x]

Dissemination barrier
Originally developed for message-passing (MPI-style) environments. In each round k, thread i waits for a flag set by thread i-2^k (mod n) and sets the flag of thread i+2^k (mod n).
Results: its simplicity gives good results below 16 threads. It is not as good on 16+ threads (constant interconnect communication in each round) or on directory-based caches (Sun Niagara).

Tournament barrier
Based on the idea of a tournament game. In each round, the loser sets the flag of the winner and then spins on the root flag. Performs better at higher thread counts because of faster initialization compared to the tree algorithm. The interconnect traffic caused by the root flag can become a performance bottleneck, in which case the tree implementation would be better.

Tree barrier
Threads communicate along a binary tree. During the arrival phase (leaves to root), all non-root threads wait on local flags. During the wakeup phase (root to leaves), each parent sets the flags of its children. Better performance in some cases because it eliminates the shared root flag, at the cost of the complexity of building the tree structure.

Methodology
Platforms:
- SGI Altix 4700: 1024 dual-core Intel processors.
- Sun Fire X4600: 8 dual-core AMD Opteron processors.
- Fujitsu-Siemens RX600S4: 4 quad-core Intel Xeon processors.
- Sun T5120: 8 processor cores, each with 8 hardware threads.
Benchmarks:
- The EPCC microbenchmark suite.
- NAS Parallel Benchmarks 3.0, LU-HP benchmark.
- OpenMP applications: the ASPCG kernel and GenIDLEST.
- A hand-crafted pthread application.
Compilers and tools: OpenUH, OpenMP Collector API tool, GCC.

EPCC results (OMP_BARRIER)
The centralized and blocking barriers do not perform as well as the other algorithms. The dissemination barrier has the least overhead on 16 threads or fewer; the tournament barrier has the lowest overhead on more than 16 threads. The EPCC barrier kernel:

    for (k = 0; k <= outerreps; k++) {
        start = getclock();
        #pragma omp parallel private(j)
        {
            for (j = 0; j < innerreps; j++) {
                delay(delaylength);
                #pragma omp barrier
            }
        }
        end = getclock();
    }

GenIDLEST / ASPCG
GenIDLEST (Generalized Incompressible Direct and Large Eddy Simulations of Turbulence) solves the incompressible Navier-Stokes and energy equations, using an overlapping multi-block body-fitted structured mesh topology combined with an unstructured inter-block topology. ASPCG is a small conjugate-gradient (CG) kernel of GenIDLEST. OpenMP, MPI, and hybrid OpenMP/MPI versions of the code exist.
[Slide media: large-eddy simulation closeup video; the Altix 3700 512-processor system]

Application results: GenIDLEST
GenIDLEST solves the incompressible Navier-Stokes and energy equations. Total improvement of 35% in execution time on 32 threads. The blocking barrier performs equally well here because of the nature of the application, which is memory-intensive.

Application results (cont.): ASPCG
ASPCG solves a linear system of equations generated by a Laplace equation in Cartesian coordinates. All busy-wait algorithms scale, some with better performance than others: nearly a 12x improvement over the blocking barrier on 128 threads. The tree barrier incurs nearly 20% more execution time than the tournament and dissemination barriers.

Sun Niagara results
A hand-crafted pthread application was used for testing, because OpenUH is not available on Solaris. The cache-coherence protocol causes poor performance of the dissemination barrier.

NAS PB results
Explored the behavior of the LU-HP NAS Parallel Benchmark. LU-HP is a CFD application that invokes 739,426 implicit barriers on 4 threads (data obtained using the OpenMP Collector API tool). The results show a considerable performance benefit of the tree barrier over the others (10% better).

Conclusion
Achieved good scalability and performance using the new algorithms. In general, the performance of a given barrier implementation depends on:
- The number of threads used.
- The number of barriers.
- The application.
- The OpenMP directive.
- The architecture (memory layout/interconnect).
- Possibly the system load (on small-scale multicore).
An OpenMP runtime library should ideally adapt among barrier implementations at runtime.

Future Work
- Further testing of these algorithms on even larger systems (in particular, with 2048 cores).
- As the OpenMP 3.0 implementation becomes available in OpenUH, evaluate how these algorithms affect the tasking feature.
- Implement similar enhancements for the reduction operation.
- Develop a cost model for the adaptive runtime.
- Integrate this work with the adaptive OpenMP and dynamic compiler work.

Questions?

Backup slides

Dynamic Optimizer and OpenMP
- The dynamic optimizer (DO) checks whether the collector is present.
- The DO requests that the collector initialize.
- Thread states are kept by the OpenMP RTL.
- The DO registers thread events with callbacks to the OpenMP RTL.
- The DO requests parallel-region replacement and changes of schedulers and implementations.
[Diagram: Dynamic Optimizer <-> OpenMP application with Collector API. Is there a collector API? Yes/No. Initialize collector API: Success/Ready. Register event(s)/callback(s). Event notification/callback.]