Suitability of Performance Tools for OpenMP Task-parallel Programs Dirk Schmidl et al. schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ)
Agenda
- The OpenMP Tasking Model
- Task-Programming Patterns
- Investigated Performance Tools
- Application Tests
- Tasking in OpenMP 4.0: New Challenges for Performance Tools
The OpenMP Tasking Model
Sudoku for Lazy Computer Scientists
Let's solve Sudoku puzzles with brute multi-core force:
(1) Search an empty field
(2) Insert a number
(3) Check the Sudoku
(4a) If invalid: delete the number, insert the next number
(4b) If valid: create a task to check the rest of the field and go on
(5) Wait for completion
The OpenMP Task Construct

C/C++:
    #pragma omp task [clause]...
        structured block

Fortran:
    !$omp task [clause]...
        structured block
    !$omp end task

- Each encountering thread/task creates a new task
  - Code and data are packaged up
  - Tasks can be nested: into another task directive, or into a worksharing construct
- Data scoping clauses: shared(list), private(list), firstprivate(list), default(shared | none)
Barrier and Taskwait Constructs

- OpenMP barrier (implicit or explicit): all tasks created by any thread of the current team are guaranteed to be completed at barrier exit
  C/C++: #pragma omp barrier
- Task barrier: taskwait — the encountering task is suspended until its child tasks are complete
  - Applies only to direct children, not descendants!
  C/C++: #pragma omp taskwait
Parallel Brute-Force Sudoku (1/3)
This parallel algorithm finds all valid solutions.
The first call is contained in a
    #pragma omp parallel
    #pragma omp single
so that one task starts the execution of the algorithm.
(1) Search an empty field
(2) Insert a number
(3) Check the Sudoku
(4a) If invalid: delete the number, insert the next number
(4b) If valid: create a task (#pragma omp task) to check the rest of the field and go on — the task needs to work on a new copy of the Sudoku board
(5) Wait for completion (#pragma omp taskwait waits for all child tasks)
Parallel Brute-Force Sudoku (2/3)
The OpenMP parallel region creates a team of threads:

    #pragma omp parallel
    {
        #pragma omp single
        solve_parallel(0, 0, sudoku2, false);
    } // end omp parallel

Single construct: one thread starts the execution of solve_parallel; the other threads wait at the end of the single and are ready to pick up tasks from the work queue.
Parallel Brute-Force Sudoku (3/3)
The actual implementation:

    for (int i = 1; i <= sudoku->getfieldsize(); i++) {
        if (!sudoku->check(x, y, i)) {
            // #pragma omp task needs to work on a new copy of the Sudoku board
            #pragma omp task firstprivate(i,x,y,sudoku)
            {
                // create from copy constructor
                CSudokuBoard new_sudoku(*sudoku);
                new_sudoku.set(y, x, i);
                if (solve_parallel(x+1, y, &new_sudoku)) {
                    new_sudoku.printboard();
                }
            } // end omp task
        }
    }
    #pragma omp taskwait // wait for all child tasks
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz, Intel C++ 13.1, scatter binding.
[Chart: runtime in seconds for the 16x16 puzzle (left axis, 0–8 s) and speedup (right axis, 0–4.0) over 1–32 threads.]
Performance Evaluation (cont.)
[Same chart as before: runtime and speedup for 1–32 threads, with the speedup staying below 4x.]
Is this the best we can do?
Sudoku
How many tasks will be created for this puzzle?
Hint: 131 empty cells.
Between 131 and 16^131 tasks. Want to guess?
Task-Programming Patterns
Task-Programming Patterns

Application/Benchmark          Task creation   Nested tasks
Barcelona OpenMP Task Suite
  Alignment                    iterative       no
  FFT                          recursive       yes
  Fib                          recursive       yes
  Floorplan                    recursive       yes
  Health                       recursive       yes
  NQueens                      recursive       yes
  Sort                         recursive       yes
  SparseLU                     iterative       no
  Strassen                     recursive       yes
RWTH Aachen University
  Sudoku                       recursive       yes
  SparseCG                     iterative       no
  FIRE                         iterative       yes
  NestedCP                     iterative       yes
  KegelSpan                    recursive       yes
Let's look at what performance tools can do with these applications.
Investigated Performance Tools
Investigated Tools

Measurement Tool                                       Method                  Version                    Visualization/Analysis Tool
Intel VTune Amplifier XE                               Sampling                Update 10 (build 298370)   Intel VTune Amplifier XE
Oracle Solaris Studio Performance Analyzer (collect)   Sampling                7.9                        Oracle Solaris Studio Performance Analyzer (analyze)
Score-P (profiling mode)                               Event-based profiling   1.2                        Cube GUI
Score-P (tracing mode)                                 Event-based tracing     1.2                        Vampir
Intel VTune Amplifier XE
- The task region is our main hotspot.
- Overhead and spin time are reported for the task region, and even for individual source code lines.
- There is a long recursive call stack, and the amount of work per level declines.
Oracle Solaris Studio Performance Analyzer
- The function containing the hotspot is also found here.
- A call stack can be shown per timestamp, but no accumulated time per level is shown.
- More details about wait and overhead time are presented in an OpenMP task view and can also be shown for source lines.
Score-P (Profiling) / Cube GUI
Code profiled with OpenMP constructs enabled and function instrumentation disabled.
- The task region is identified as the hotspot; no mapping to source code is possible.
- Profiling also gives more details about individual task instances: every thread executes ~1.3 million tasks in ~5.7 seconds, so the average duration of a task is ~4.4 μs.
Score-P (Tracing) / Vampir
Tracing gives a detailed timeline view, and more detailed information on the call stack and the task durations can be shown.
Tasks get much smaller down the call stack:
  level 6:  0.16 s
  level 12: 0.047 s
  level 48: 0.001 s
  level 82: 2.2 μs
Performance Evaluation
Solution: stop creating tasks after 2 rows of the puzzle (cut-off); only 133k tasks are created (~100x fewer tasks).
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz, Intel C++ 13.1, scatter binding, with and without cut-off.
[Chart: runtime in seconds for the 16x16 puzzle (left axis, 0–8 s) and speedup (right axis, 0–18) over 1–32 threads, comparing the original and the cut-off version.]
Overhead, 16 Threads

Tool                  Runtime Overhead (setup routine)   Data Volume (complete program)
Oracle Analyzer       <1%                                28 MB
Intel VTune           7%                                 8.5 MB
Score-P (Profiling)   ~150%                              44 KB (OpenMP only)
Score-P (Tracing)     ~150%                              1.2 GB (OpenMP only)

- The overhead of direct instrumentation is enormous in this case.
- The amount of data stored for the trace is huge.
- The timing information provided by sampling tools might be more accurate due to their lower overhead.
- Still, the details seen only in the Score-P cases gave additional useful information.
Application Tests
Conjugate Gradient Method — Sparse Linear Algebra
- Sparse linear equation systems occur in many scientific disciplines.
- Sparse matrix-vector multiplications (SpMxV) are the dominant part of many iterative solvers (like CG) for such systems.
- Number of non-zeros << n*n
[Figure: Beijing Botanical Garden — top right: original building; bottom right: model; bottom left: matrix (source: Beijing Botanical Garden and University of Florida Sparse Matrix Collection).]
Conjugate Gradient Method
Sparse matrix-vector multiplication:

    // define chunk size
    #define CS 10
    for (i = 0; i < n; i += CS) {
        #pragma omp task firstprivate(i) private(k,j)
        for (k = i; (k < i + CS) && (k < n); k++) {
            y[k] = 0;
            for (j = ptr[k]; j < ptr[k+1]; j++) {
                y[k] += value[j] * x[index[j]];
            }
        }
    }

One task computes CS rows at once.
- Large chunk size: low overhead, but worse load balancing.
- Small chunk size: higher overhead, but good load balancing.
Conjugate Gradient Method

ChunkSize = 10
Tool                  Hotspot              OpenMP Overhead    Task Count   Task Size
Oracle Analyzer       matvec:61-65         95% (omp task)     n/a          n/a
Intel VTune           matvec:52-61         85% (omp single)   n/a          n/a
Score-P (Profiling)   matvec (single:54)   ~75% (barrier)     ~18,000      5.5 μs (avg)
Score-P (Tracing)     matvec               ~67.5%             ~18,000      0.7–12 μs

ChunkSize = 100
Tool                  Hotspot              OpenMP Overhead    Task Count   Task Size
Oracle Analyzer       matvec:61-65         0%                 n/a          n/a
Intel VTune           matvec:61            0%                 n/a          n/a
Score-P (Profiling)   matvec (single:54)   <10% (barrier)     1837         42 μs (avg)
Score-P (Tracing)     matvec               ~11%               1837         1–70 μs
KegelSpan
- Simulation of gearwheel cutting (from the Laboratory for Machine Tools and Production Engineering, WZL)
- Written in Fortran 90/95
- Polygon grid of the workpiece, adaptively refined in the work area
- An unbalanced BSP tree is used to handle the polygons efficiently
- OpenMP tasks are used to traverse the tree in parallel
- Tests with a routine that sets up a BSP tree
KegelSpan

Tool                  Runtime Overhead (setup routine)   Data Volume (complete program)
Oracle Analyzer       3.7%                               99 MB
Intel VTune           2.6%                               1.4 MB
Score-P (Profiling)   4.0%                               88 KB (OpenMP only)
Score-P (Tracing)     4.0%                               34 MB (OpenMP only)

The overhead and the amount of data are much more usable than with the Sudoku solver.
KegelSpan

Tool                  Hotspot                 OpenMP Overhead   Task Count   Task Size
Oracle Analyzer       BSPTrees.f:4717-4744    2.4%              n/a          n/a
Intel VTune           BSPTrees.f:4714-4745    ~2% + X           n/a          n/a
Score-P (Profiling)   BuildBSPTreeOMPTasks    n/a               124,471      0.2 ms (avg)
Score-P (Tracing)     BuildBSPTreeOMPTasks    n/a               124,471      0.8 μs – 3 ms

[Figure: BSP tree split planes per cutting depth — 1 task, 2 tasks, 4 tasks, ...]
OpenMP and Tasks (First Tests)
[Chart: runtime in seconds (0–120) over cutting depths 1–20, for opt1 and opt2.]
- Opt1: stop creating parallel tasks when the current task processes fewer than 100 points — slightly faster than the best fixed depth level, and independent of the input dataset.
- Opt2: serial optimization in the sorting of points — not recognized as performance-critical without measurements.
Tasking in OpenMP 4.0: New Challenges for Performance Tools
The depend Clause

C/C++:
    #pragma omp task depend(dependency-type: list)
        structured block

- A task dependence is fulfilled when the predecessor task has completed.
- in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
- out and inout dependency-types: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
- The list items in a depend clause may include array sections.
Concurrent Execution with Dependencies

    void process_in_parallel() {
        #pragma omp parallel
        #pragma omp single
        {
            int x = 1;
            ...
            for (int i = 0; i < T; ++i) {
                #pragma omp task shared(x,...) depend(out: x)   // T1
                preprocess_some_data(...);
                #pragma omp task shared(x,...) depend(in: x)    // T2
                do_something_with_data(...);
                #pragma omp task shared(x,...) depend(in: x)    // T3
                do_something_independent_with_data(...);
            }
        } // end omp single, omp parallel
    }

- T1 has to be completed before T2 and T3 can be executed.
- T2 and T3 can be executed in parallel.
- Note: the variables in the depend clause do not necessarily have to indicate the data flow.
The taskgroup Construct

C/C++:
    #pragma omp taskgroup
        structured block

Fortran:
    !$omp taskgroup
        structured block
    !$omp end taskgroup

- Specifies a wait on the completion of child tasks and their descendant tasks
- Deeper synchronization than taskwait, but with the option to restrict it to a subset of all tasks (as opposed to a barrier)
Cancellation of OpenMP Tasks
- Cancellation only acts on the tasks grouped by a taskgroup construct.
- The encountering task jumps to the end of its task region.
- Any already executing task will run to completion (or until it reaches a cancellation point region).
- Any task that has not yet begun execution may be discarded (and is then considered completed).
- Task cancellation also occurs if a parallel region is canceled, but not if cancellation affects a worksharing construct.
Conclusion
- Performance tools can help to understand the execution of task-parallel programs.
- Sampling tools give a good overview and provide helpful information on OpenMP overhead and waiting time.
- Call-path profiling can give more details on task counts and average runtimes, but it introduces more overhead.
- Tracing gives very detailed information on individual task instances, but the overhead and the amount of data can be large.
- Details on task counts and individual execution times are very useful for understanding the behavior of recursive tasking programs.
- There is no single best solution; all tools have advantages and disadvantages, and you need to choose the right one for your problem.
- OpenMP 4.0 provides new opportunities and challenges for application developers and tool developers.
Thank you for your attention! Questions?