Suitability of Performance Tools for OpenMP Task-parallel Programs Dirk Schmidl et al. schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ)
Agenda
- The OpenMP Tasking Model
- Task-Programming Patterns
- Investigated Performance Tools
- Application Tests
- Tasking in OpenMP 4.0: New Challenges for Performance Tools
The OpenMP Tasking Model
Sudoku for Lazy Computer Scientists
Let's solve Sudoku puzzles with brute multi-core force:
(1) Search an empty field
(2) Insert a number
(3) Check the Sudoku
(4a) If invalid: delete the number, insert the next number
(4b) If valid: create a task to check the rest of the field and go on
(5) Wait for completion
The OpenMP Task Construct

C/C++:
    #pragma omp task [clause]...
        structured block

Fortran:
    !$omp task [clause]...
        structured block
    !$omp end task

- Each encountering thread/task creates a new task
  - Code and data are packaged up
  - Tasks can be nested: into another task directive, or into a worksharing construct
- Data scoping clauses: shared(list), private(list), firstprivate(list), default(shared | none)
Barrier and Taskwait Constructs

- OpenMP barrier (implicit or explicit): all tasks created by any thread of the current team are guaranteed to be completed at barrier exit
  C/C++: #pragma omp barrier
- Task barrier: taskwait — the encountering task is suspended until its child tasks are complete
  - Applies only to direct children, not descendants!
  C/C++: #pragma omp taskwait
Parallel Brute-Force Sudoku (1/3)
This parallel algorithm finds all valid solutions.
The first call is contained in a
    #pragma omp parallel
    #pragma omp single
so that one task starts the execution of the algorithm.
(1) Search an empty field
(2) Insert a number
(3) Check the Sudoku
(4a) If invalid: delete the number, insert the next number
(4b) If valid: create a task (#pragma omp task) to check the rest of the field and go on — the task needs to work on a new copy of the Sudoku board
(5) Wait for completion (#pragma omp taskwait waits for all child tasks)
Parallel Brute-Force Sudoku (2/3)
The OpenMP parallel region creates a team of threads:

    #pragma omp parallel
    {
        #pragma omp single
        solve_parallel(0, 0, sudoku2, false);
    } // end omp parallel

Single construct: one thread starts the execution of solve_parallel; the other threads wait at the end of the single and are ready to pick up tasks from the work queue.
Parallel Brute-Force Sudoku (3/3)
The actual implementation:

    for (int i = 1; i <= sudoku->getfieldsize(); i++) {
        if (!sudoku->check(x, y, i)) {
            // #pragma omp task needs to work on a new copy of the Sudoku board
            #pragma omp task firstprivate(i,x,y,sudoku)
            {
                // create from copy constructor
                CSudokuBoard new_sudoku(*sudoku);
                new_sudoku.set(y, x, i);
                if (solve_parallel(x+1, y, &new_sudoku)) {
                    new_sudoku.printboard();
                }
            } // end omp task
        }
    }
    #pragma omp taskwait // wait for all child tasks
Performance Evaluation
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz, Intel C++ 13.1, scatter binding.
[Chart: runtime in seconds for the 16x16 puzzle (left axis, 0–8 s) and speedup (right axis, 0–4.0) over 1–32 threads.]
Performance Evaluation (cont.)
[Same chart as before: runtime and speedup for 1–32 threads, with the speedup staying below 4x.]
Is this the best we can do?
Sudoku
How many tasks will be created for this puzzle?
Hint: 131 empty cells.
Between 131 and 16^131 tasks. Want to guess?
Task-Programming Patterns
Task-Programming Patterns

Application/Benchmark          Task creation   Nested tasks
Barcelona OpenMP Task Suite
  Alignment                    iterative       no
  FFT                          recursive       yes
  Fib                          recursive       yes
  Floorplan                    recursive       yes
  Health                       recursive       yes
  NQueens                      recursive       yes
  Sort                         recursive       yes
  SparseLU                     iterative       no
  Strassen                     recursive       yes
RWTH Aachen University
  Sudoku                       recursive       yes
  SparseCG                     iterative       no
  FIRE                         iterative       yes
  NestedCP                     iterative       yes
  KegelSpan                    recursive       yes
Let's look at what performance tools can do with these applications.
Investigated Performance Tools
Investigated Tools

Measurement Tool                                       Method                  Version                    Visualization/Analysis Tool
Intel VTune Amplifier XE                               Sampling                Update 10 (build 298370)   Intel VTune Amplifier XE
Oracle Solaris Studio Performance Analyzer (collect)   Sampling                7.9                        Oracle Solaris Studio Performance Analyzer (analyze)
Score-P (profiling mode)                               Event-based profiling   1.2                        Cube GUI
Score-P (tracing mode)                                 Event-based tracing     1.2                        Vampir
Intel VTune Amplifier XE
- The task region is our main hotspot.
- Overhead and spin time are reported for the task region, and even for individual source code lines.
- There is a long recursive call stack, and the amount of work per level declines.
Oracle Solaris Studio Performance Analyzer
- The function containing the hotspot is also found here.
- A call stack can be shown per timestamp, but no accumulated time per level is shown.
- More details about wait and overhead time are presented in an OpenMP task view and can also be shown for source lines.
Score-P (Profiling) / Cube GUI
Code profiled with OpenMP constructs enabled and function instrumentation disabled.
- The task region is identified as the hotspot; no mapping to source code is possible.
- Profiling also gives more details about individual task instances: every thread executes ~1.3 million tasks in ~5.7 seconds, so the average duration of a task is ~4.4 μs.
Score-P (Tracing) / Vampir
Tracing gives a detailed timeline view, and more detailed information on the call stack and the task durations can be shown.
Tasks get much smaller down the call stack:
  level 6:  0.16 s
  level 12: 0.047 s
  level 48: 0.001 s
  level 82: 2.2 μs
Performance Evaluation
Solution: stop creating tasks after 2 rows of the puzzle (cut-off); only 133k tasks are created (~100x fewer tasks).
Sudoku on 2x Intel Xeon E5-2650 @ 2.0 GHz, Intel C++ 13.1, scatter binding, with and without cut-off.
[Chart: runtime in seconds for the 16x16 puzzle (left axis, 0–8 s) and speedup (right axis, 0–18) over 1–32 threads, comparing the original and the cut-off version.]
Overhead, 16 Threads

Tool                  Runtime Overhead (setup routine)   Data Volume (complete program)
Oracle Analyzer       <1%                                28 MB
Intel VTune           7%                                 8.5 MB
Score-P (Profiling)   ~150%                              44 KB (OpenMP only)
Score-P (Tracing)     ~150%                              1.2 GB (OpenMP only)

- The overhead of direct instrumentation is enormous in this case.
- The amount of data stored for the trace is huge.
- The timing information provided by sampling tools might be more accurate due to their lower overhead.
- Still, the details seen only in the Score-P cases gave additional useful information.
Application Tests
Conjugate Gradient Method — Sparse Linear Algebra
- Sparse linear equation systems occur in many scientific disciplines.
- Sparse matrix-vector multiplications (SpMxV) are the dominant part of many iterative solvers (like CG) for such systems.
- Number of non-zeros << n*n
[Figure: Beijing Botanical Garden — top right: original building; bottom right: model; bottom left: matrix (source: Beijing Botanical Garden and University of Florida Sparse Matrix Collection).]
Conjugate Gradient Method
Sparse matrix-vector multiplication:

    // define chunk size
    #define CS 10
    for (i = 0; i < n; i += CS) {
        #pragma omp task firstprivate(i) private(k,j)
        for (k = i; (k < i + CS) && (k < n); k++) {
            y[k] = 0;
            for (j = ptr[k]; j < ptr[k+1]; j++) {
                y[k] += value[j] * x[index[j]];
            }
        }
    }

One task computes CS rows at once.
- Large chunk size: low overhead, but worse load balancing.
- Small chunk size: higher overhead, but good load balancing.
Conjugate Gradient Method

ChunkSize = 10
Tool                  Hotspot              OpenMP Overhead    Task Count   Task Size
Oracle Analyzer       matvec:61-65         95% (omp task)     n/a          n/a
Intel VTune           matvec:52-61         85% (omp single)   n/a          n/a
Score-P (Profiling)   matvec (single:54)   ~75% (barrier)     ~18,000      5.5 μs (avg)
Score-P (Tracing)     matvec               ~67.5%             ~18,000      0.7–12 μs

ChunkSize = 100
Tool                  Hotspot              OpenMP Overhead    Task Count   Task Size
Oracle Analyzer       matvec:61-65         0%                 n/a          n/a
Intel VTune           matvec:61            0%                 n/a          n/a
Score-P (Profiling)   matvec (single:54)   <10% (barrier)     1837         42 μs (avg)
Score-P (Tracing)     matvec               ~11%               1837         1–70 μs
KegelSpan
- Simulation of gearwheel cutting (from the Laboratory for Machine Tools and Production Engineering, WZL)
- Written in Fortran 90/95
- Polygon grid of the workpiece, adaptively refined in the work area
- An unbalanced BSP tree is used to handle the polygons efficiently
- OpenMP tasks are used to traverse the tree in parallel
- Tests with a routine that sets up a BSP tree
KegelSpan

Tool                  Runtime Overhead (setup routine)   Data Volume (complete program)
Oracle Analyzer       3.7%                               99 MB
Intel VTune           2.6%                               1.4 MB
Score-P (Profiling)   4.0%                               88 KB (OpenMP only)
Score-P (Tracing)     4.0%                               34 MB (OpenMP only)

The overhead and the amount of data are much more usable than with the Sudoku solver.
KegelSpan

Tool                  Hotspot                 OpenMP Overhead   Task Count   Task Size
Oracle Analyzer       BSPTrees.f:4717-4744    2.4%              n/a          n/a
Intel VTune           BSPTrees.f:4714-4745    ~2% + X           n/a          n/a
Score-P (Profiling)   BuildBSPTreeOMPTasks    n/a               124,471      0.2 ms (avg)
Score-P (Tracing)     BuildBSPTreeOMPTasks    n/a               124,471      0.8 μs – 3 ms

[Figure: BSP tree split planes per cutting depth — 1 task, 2 tasks, 4 tasks, ...]
OpenMP and Tasks (First Tests)
[Chart: runtime in seconds (0–120) over cutting depths 1–20, for opt1 and opt2.]
- Opt1: stop creating parallel tasks when the current task processes fewer than 100 points — slightly faster than the best fixed depth level, and independent of the input dataset.
- Opt2: serial optimization in the sorting of points — not recognized as performance-critical without measurements.
Tasking in OpenMP 4.0: New Challenges for Performance Tools
The depend Clause

C/C++:
    #pragma omp task depend(dependency-type: list)
        structured block

- A task dependence is fulfilled when the predecessor task has completed.
- in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
- out and inout dependency-types: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
- The list items in a depend clause may include array sections.
Concurrent Execution with Dependencies

    void process_in_parallel() {
        #pragma omp parallel
        #pragma omp single
        {
            int x = 1;
            ...
            for (int i = 0; i < T; ++i) {
                #pragma omp task shared(x,...) depend(out: x)   // T1
                preprocess_some_data(...);
                #pragma omp task shared(x,...) depend(in: x)    // T2
                do_something_with_data(...);
                #pragma omp task shared(x,...) depend(in: x)    // T3
                do_something_independent_with_data(...);
            }
        } // end omp single, omp parallel
    }

- T1 has to be completed before T2 and T3 can be executed.
- T2 and T3 can be executed in parallel.
- Note: the variables in the depend clause do not necessarily have to indicate the data flow.
The taskgroup Construct

C/C++:
    #pragma omp taskgroup
        structured block

Fortran:
    !$omp taskgroup
        structured block
    !$omp end taskgroup

- Specifies a wait on the completion of child tasks and their descendant tasks
- Deeper synchronization than taskwait, but with the option to restrict it to a subset of all tasks (as opposed to a barrier)
Cancellation of OpenMP Tasks
- Cancellation only acts on the tasks grouped by a taskgroup construct.
- The encountering task jumps to the end of its task region.
- Any already executing task will run to completion (or until it reaches a cancellation point region).
- Any task that has not yet begun execution may be discarded (and is then considered completed).
- Task cancellation also occurs if a parallel region is canceled, but not if cancellation affects a worksharing construct.
Conclusion
- Performance tools can help to understand the execution of task-parallel programs.
- Sampling tools give a good overview and provide helpful information on OpenMP overhead and waiting time.
- Call-path profiling can give more details on task counts and average runtimes, but it introduces more overhead.
- Tracing gives very detailed information on individual task instances, but the overhead and the amount of data can be large.
- Details on task counts and individual execution times are very useful for understanding the behavior of recursive tasking programs.
- There is no single best solution; all tools have advantages and disadvantages, and you need to choose the right one for your problem.
- OpenMP 4.0 provides new opportunities and challenges for application developers and tool developers.
Thank you for your attention! Questions?