Suitability of Performance Tools for OpenMP Task-parallel Programs
|
|
- Kerrie Bridges
- 8 years ago
- Views:
Transcription
1 Suitability of Performance Tools for OpenMP Task-parallel Programs Dirk Schmidl et al. Rechen- und Kommunikationszentrum (RZ)
2 Agenda The OpenMP Tasking Model Task-Programming Patterns Investigated Performance Tools Application Tests Tasking in OpenMP 4.0: New Challenges for Performance Tools 2
3 The OpenMP Tasking Model 3
4 Sudoko for Lazy Computer Scientists Lets solve Sudoku puzzles with brute multi-core force (1) Search an empty field (2) Insert a number (3) Check Sudoku (4 a) If invalid: Delete number, Insert next number (4 b) If valid: Create a task to check the rest of the filed and go on (5) Wait for completion 4
5 The OpenMP Task Construct C/C++ #pragma omp task [clause]... structured block... Fortran!$omp task [clause]... structured block...!$omp end task Each encountering thread/task creates a new task 5 Code and data is being packaged up Tasks can be nested Into another task directive Into a Worksharing construct Data scoping clauses: shared(list) private(list) firstprivate(list) default(shared none)
6 Barrier and Taskwait Constructs OpenMP barrier (implicit or explicit) All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit C/C++ #pragma omp barrier Task barrier: taskwait Encountering task is suspended until child tasks are complete Applies only to direct childs, not descendants! C/C++ #pragma omp taskwait 6
7 Parallel Brute-force Sudoku (1/3) This parallel algorithm finds all valid solutions first call contained in a #pragma omp parallel #pragma omp single such that one tasks starts the execution of the algorithm (1) Search an empty field (2) Insert a number (3) Check Sudoku (4 a) If invalid: Delete number, Insert next number 7 #pragma omp task needs to work on a new copy of the Sudoku board #pragma omp taskwait wait for all child tasks (4 b) If valid: Create a task to check the rest of the filed and go on (5) Wait for completion
8 Parallel Brute-force Sudoku (2/3) OpenMP parallel region creates a team of threads #pragma omp parallel { #pragma omp single solve_parallel(0, 0, sudoku2,false); } // end omp parallel Single construct: One thread enters the execution of solve_parallel the other threads wait at the end of the single and are ready to pick up threads from the work queue 8
9 Parallel Brute-force Sudoku (3/3) The actual implementation for (int i = 1; i <= sudoku->getfieldsize(); i++) { if (!sudoku->check(x, y, i)) { #pragma omp task firstprivate(i,x,y,sudoku) { // create from copy constructor CSudokuBoard new_sudoku(*sudoku); of the Sudoku board new_sudoku.set(y, x, i); if (solve_parallel(x+1, y, &new_sudoku)) { new_sudoku.printboard(); } } // end omp task } } #pragma omp task needs to work on a new copy #pragma omp taskwait 9 #pragma omp taskwait wait for all child tasks
10 Runtime [sec] for 16x16 Speedup Performance Evaluation Sudoku on 2x Intel Xeon GHz Intel C , scatter binding speedup: Intel C , scatter binding #threads
11 Runtime [sec] for 16x16 Speedup Performance Evaluation Sudoku on 2x Intel Xeon GHz Intel C , scatter binding speedup: Intel C , scatter binding Is this the best we can can do? #threads
12 Sudoku How many tasks will be created for this puzzle? Hint: 131 empty cells Between 131 and tasks. You want to guess? 12
13 Task-Programming Patterns 13
14 Task-Programming Patterns 14 Application/Benchmark task creation nested tasks Barcelona OpenMP Task Suite Alignment iterative no FFT recursive yes Fib recursive yes Floorplan recursive yes Health recursive yes NQueens recursive yes Sort recursive yes SparseLU iterative no Strassen recursive yes RWTH Aachen University Sudoku recursive yes SparseCG iterative no FIRE iterative yes NestedCP iterative yes KegelSpan recursive yes
15 Task-Programming Patterns 15 Application/Benchmark task creation nested tasks Barcelona OpenMP Task Suite Alignment iterative no FFT recursive yes Fib recursive yes Floorplan recursive yes Health recursive yes NQueens recursive yes Sort recursive yes Let s look what performance tools can do with these applications. SparseLU iterative no Strassen recursive yes RWTH Aachen University Sudoku recursive yes SparseCG iterative no FIRE iterative yes NestedCP iterative yes KegelSpan recursive yes
16 Investigated Performance Tools 16
17 Investigated Tools Measurement Tool Intel VTune Amplifier XE Oracle Solaris Studio Performance Analyzer (collect) Score-P (Profiling mode) Score-P (Tracing mode) Measurement Method Version Sampling Update 10 (build ) Visualization / Analysis Tool Intel VTune Amplifier XE Sampling 7.9 Oracle Solaris Studio Performance Analyzer (analyze) Event Based Profiling 1.2 Cube GUI Event Based Tracing 1.2 Vampir 17
18 Intel VTune Amplifier XE The Task-region is our main Hotspot. Overhead and Spin time is found for the task region There is a long recursive call stack and the amount of work per level declines. and even for individual source code lines. 18
19 Oracle Solaris Studio Performance Analyzer The Function with the hotspot is also found here. A call stack can be shown per timestamp, but no accumulated time per level is shown. More details about wait and overhead time is presented in an OpenMP task view and can also been shown for source lines. 19
20 Score-P (Profiling)/ Cube GUI Profiling also gives more details Profiled Code with: OpenMP constructs enabled Function instrumentation disabled about individual task-instances: Every thread is executing ~1.3m tasks 20 The Task-region is identified as hotspot. No mapping to source code possible. in ~5.7 seconds. => average duration of a task is ~4.4 μs
21 Score-P (Tracing)/ Vampir Tracing gives a detailed timeline view lvl 6 Duration: 0.16 sec lvl 12 Duration: sec but also more detailed information on the call stack and the task durations can be shown. lvl 48 lvl 82 Duration: sec Duration: 2.2 μs Tasks get much smaller down the call stack. 21
22 Runtime [sec] for 16x16 Speedup Performance Evaluation Solution: stop creating tasks after 2 rows of the puzzle (cut-off) only 133k tasks are created (~ 100X less tasks) Sudoku on 2x Intel Xeon GHz Intel C , scatter binding speedup: Intel C , scatter binding Intel C , scatter binding, cutoff speedup: Intel C , scatter binding, cutoff #threads
23 Overhead 16 Threads Runtime Overhead (setup routine) Oracle Analyzer <1% 28 MB Intel VTune 7% 8.5 MB Overhead of direct instrumentation is enormous in this case Amount of stored data for the Trace is huge Data Volume (complete program) Score-P (Profiling) ~150% 44 KB (OpenMP only) Score-P (Tracing) ~150% 1.2 GB (OpenMP only) The timing information provided by sampling tools might be more accurate due to lower overhead. The details only seen in the Score-P cases still gave additional useful information. 23
24 Application Tests 24
25 Conjugate Gradient Method Sparse Linear Algebra Sparse Linear Equation Systems occur in many scientific disciplines. Sparse matrix-vector multiplications (SpMxV) are the dominant part in many iterative solvers (like the CG) for such systems. number of non-zeros << n*n Beijing Botanical Garden Oben Rechts: Unten Rechts: Unten Links: Orginal Gebäude Modell Matrix (Quelle: Beijing Botanical Garden and University of Florida, Sparse Matrix Collection) 25
26 Conjugate Gradient Method Sparse Matrix-Vector Multiplication: //define ChunkSize #define CS 10 for(i=0; i<n; i+= CS ){ #pragma omp task firstprivate(i) private(k,j) for (k=i; ((k < i+ CS ) && (k < n)) ; k++){ y[k]=0; for(j=ptr[k]; j<ptr[k+1]; j++){ y[k]+=value[j]*x[index[j]]; } } } One task computes CS rows at once. Large ChunkSize low overhead and less load balancing Small ChunkSize higher overhead and good load balancing 26
27 Conjugate Gradient Method ChunkSize=10 Hotspot OpenMP Task Count Task Size Tool Overhead Oracle Analyzer matvec: % (omp task) n/a n/a Intel VTune matvec: % (omp single) n/a n/a Score-P (Profiling) matvec (single:54) ~ 75 % (barrier) ~ 18, μs (avg) Score-P (Tracing) matvec ~ 67.5 % ~ 18, μs ChunkSize=100 Hotspot OpenMP Task Count Task Size Tool Overhead Oracle Analyzer matvec: % n/a n/a Intel VTune matvec:61 0% n/a n/a Score-P (Profiling) matvec (single:54) < 10% (barrier) μs (avg) Score-P (Tracing) matvec ~ 11% μs 27
28 KegelSpan Simulation of Gearwheel Cutting (from Laboratory for Machine Tools and Production Engineering - WZL) written in Fortran 90/95 Polygon grid of the workpiece adaptively refined in the work area an unbalanced BSP Tree used to handle the polygons efficiently OpenMP tasks used to traverse the tree in parallel Tests with a routine to setup a BSP tree 28
29 KegelSpan Runtime Overhead (setup routine) Oracle Analyzer 3.7% 99 MB Intel VTune 2.6% 1.4 MB Data Volume (complete program) Score-P (Profiling) 4.0% 88 KB (OpenMP only) Score-P (Tracing) 4.0% 34 MB (OpenMP only) Overhead and amount of data much more usable than with the Sudoku solver. 29
30 KegelSpan Tool Hotspot OpenMP Overhead Task Count Task Size Oracle Analyzer BSPTrees.f % n/a n/a Intel VTune BSPTrees.f: ~2% + X n/a n/a Score-P (Profiling) BuildBSPTreeOMPTasks n/a 124, ms (avg) Score-P (Tracing) BuildBSPTreeOMPTasks n/a 124, μs - 3 ms 1 Task 2 Tasks Split Plane 4 Tasks 4 Tasks Cutting Depth 30
31 opt1 opt2 Runtime in Sec. OpenMP and Tasks (first Tests) Opt1: - stop creating parallel tasks when the current task only processes <100 points. - slightly faster than best depth level + independent from input dataset Opt2: - serial optimization in sorting of points - not seen as performance critical without measurements Cutting Depth 31
32 Tasking in OpenMP 4.0: New Challenges for Performance Tools 32
33 The depend Clause C/C++ #pragma omp task depend(dependency-type: list)... structured block... The task dependence is fulfilled when the predecessor task has completed in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause. out and inout dependency-type: The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause. The list items in a depend clause may include array sections. 33
34 Concurrent Execution w/ Dep. Note: variables in the depend clause do not necessarily have to indicate the data flow T1 has to be completed before T2 and T3 can be #pragma omp parallel #pragma omp single executed. void process_in_parallel) { } 34 { int x = 1;... for (int i = 0; i < T; ++i) { } #pragma omp task shared(x,...) depend(out: x) // T1 preprocess_some_data(...); #pragma omp task shared(x,...) depend(in: x) do_something_with_data(...); #pragma omp task shared(x,...) depend(in: x) do_something_independent_with_data(...); } // end omp single, omp parallel T2 and T3 can be executed in parallel. // T2 // T3
35 The taskgroup Construct C/C++ #pragma omp taskgroup... structured block... Fortran!$omp taskgroup... structured block...!$omp end task Specifies a wait on completion of child tasks and their descendent tasks deeper sychronization than taskwait, but with the option to restrict to a subset of all tasks (as opposed to a barrier) 35
36 Cancellation of OpenMP Tasks Cancellation only acts on tasks of group by taskgroup The encountering task jumps to the end of its task region Any executing task will run to completion (or until they reach a cancellation point region) Any task that has not yet begun execution may be discarded (and is considered completed) Tasks cancellation also occurs, if a parallel region is canceled. But not if cancellation affects a worksharing construct. 36
37 Conclusion Performance Tools can help to understand the execution of task-parallel programs. Sampling Tools give a good overview and provide helpful information on OpenMP overhead and waiting time. Call-path profiling can give more details on task-count and average runtimes, but it introduces more overhead. Tracing gives very detailed information on individual task instances, but the overhead and the amount of data can be large. Details on task-count and individual execution time is very useful to understand the behavior of recursive tasking programs. There is no one best solution, all tools have advantages and disadvantages, you need to choose the right one for your problem. OpenMP 4.0 provides new opportunities and challenges for application developers and tool developers. 37
38 Conclusion Performance Tools can help to understand the execution of task-parallel programs. Sampling Tools give a good overview and provide helpful information on OpenMP overhead and waiting time. Call-path profiling can give more details on task-count and average runtimes, but it introduces more overhead. Tracing gives very detailed information on individual task instances, but the overhead and the amount of data can be large. Thank you for your attention! Questions? Details on task-count and individual execution time is very useful to understand the behavior of recursive tasking programs. There is no one best solution, all tools have advantages and disadvantages, you need to choose the right one for your problem. OpenMP 4.0 provides new opportunities and challenges for application developers and tool developers. 38
Advanced OpenMP Features
Advanced OpenMP Features Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Sudoku IT Center
More informationOpenMP and Performance
Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Tuning Cycle Performance Tuning aims to improve the runtime of an
More informationAssessing the Performance of OpenMP Programs on the Intel Xeon Phi
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum
More informationPerformance Characteristics of Large SMP Machines
Performance Characteristics of Large SMP Machines Dirk Schmidl, Dieter an Mey, Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Investigated Hardware Kernel Benchmark
More informationUsing the Intel Inspector XE
Using the Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Race Condition Data Race: the typical OpenMP programming error, when: two or more threads access the same memory
More informationScheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
More informationOpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
More informationA Pattern-Based Comparison of OpenACC & OpenMP for Accelerators
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen
More informationParallel Computing. Parallel shared memory computing with OpenMP
Parallel Computing Parallel shared memory computing with OpenMP Thorsten Grahs, 14.07.2014 Table of contents Introduction Directives Scope of data Synchronization OpenMP vs. MPI OpenMP & MPI 14.07.2014
More informationINTEL PARALLEL STUDIO XE EVALUATION GUIDE
Introduction This guide will illustrate how you use Intel Parallel Studio XE to find the hotspots (areas that are taking a lot of time) in your application and then recompiling those parts to improve overall
More informationMulticore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
More informationOpenACC Basics Directive-based GPGPU Programming
OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. wienke@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,
More informationHigh Performance Computing in Aachen
High Performance Computing in Aachen Christian Iwainsky iwainsky@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Produktivitätstools unter Linux Sep 16, RWTH Aachen University
More informationDAGViz: A DAG Visualization Tool for Analyzing Task Parallel Programs (DAGViz: DAG )
DAGViz: A DAG Visualization Tool for Analyzing Task Parallel Programs (DAGViz: DAG ) 27 2 5 48-136431 Huynh Ngoc An Abstract Task parallel programming model has been considered as a promising mean that
More information<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization
An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended
More informationScalability evaluation of barrier algorithms for OpenMP
Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science
More informationPARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP
PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP Sruthi Sankar CSE 633: Parallel Algorithms Spring 2014 Professor: Dr. Russ Miller Sudoku: the puzzle A standard Sudoku puzzles contains 81 grids :9 rows
More informationOpenMP* 4.0 for HPC in a Nutshell
OpenMP* 4.0 for HPC in a Nutshell Dr.-Ing. Michael Klemm Senior Application Engineer Software and Services Group (michael.klemm@intel.com) *Other brands and names are the property of their respective owners.
More informationAn HPC Application Deployment Model on Azure Cloud for SMEs
An HPC Application Deployment Model on Azure Cloud for SMEs Fan Ding CLOSER 2013, Aachen, Germany, May 9th,2013 Rechen- und Kommunikationszentrum (RZ) Agenda Motivation Windows Azure Relevant Technology
More informationOMPT and OMPD: OpenMP Tools Application Programming Interfaces for Performance Analysis and Debugging
OMPT and OMPD: OpenMP Tools Application Programming Interfaces for Performance Analysis and Debugging Alexandre Eichenberger, John Mellor-Crummey, Martin Schulz, Nawal Copty, John DelSignore, Robert Dietrich,
More informationCase Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke wienke@rz.rwth-aachen.de ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
More informationOpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
More informationDebugging with TotalView
Tim Cramer 17.03.2015 IT Center der RWTH Aachen University Why to use a Debugger? If your program goes haywire, you may... ( wand (... buy a magic... read the source code again and again and...... enrich
More informationUTS: An Unbalanced Tree Search Benchmark
UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks
More informationOpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware
OpenMP & MPI CISC 879 Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction OpenMP MPI Model Language extension: directives-based
More informationEvaluation of OpenMP Task Scheduling Strategies
Evaluation of OpenMP Task Scheduling Strategies Alejandro Duran, Julita Corbalán, and Eduard Ayguadé Barcelona Supercomputing Center Departament d Arquitectura de Computadors Universitat Politècnica de
More informationParallel Computing. Shared memory parallel programming with OpenMP
Parallel Computing Shared memory parallel programming with OpenMP Thorsten Grahs, 27.04.2015 Table of contents Introduction Directives Scope of data Synchronization 27.04.2015 Thorsten Grahs Parallel Computing
More informationExperiences with HPC on Windows
Experiences with on Christian Terboven terboven@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University Server Computing Summit 2008 April 7 11, HPI/Potsdam Experiences with on
More informationOMPT: OpenMP Tools Application Programming Interfaces for Performance Analysis
OMPT: OpenMP Tools Application Programming Interfaces for Performance Analysis Alexandre Eichenberger, John Mellor-Crummey, Martin Schulz, Michael Wong, Nawal Copty, John DelSignore, Robert Dietrich, Xu
More informationINTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism
Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and
More informationGet an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
More informationElemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus
Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva robert.geva@intel.com Introduction Intel
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationKrishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C
Tutorial#1 Q 1:- Explain the terms data, elementary item, entity, primary key, domain, attribute and information? Also give examples in support of your answer? Q 2:- What is a Data Type? Differentiate
More informationHIGH PERFORMANCE CONSULTING COURSE OFFERINGS
Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...
More informationTowards OpenMP Support in LLVM
Towards OpenMP Support in LLVM Alexey Bataev, Andrey Bokhanko, James Cownie Intel 1 Agenda What is the OpenMP * language? Who Can Benefit from the OpenMP language? OpenMP Language Support Early / Late
More informationImproving System Scalability of OpenMP Applications Using Large Page Support
Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline
More informationBinary Heap Algorithms
CS Data Structures and Algorithms Lecture Slides Wednesday, April 5, 2009 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks CHAPPELLG@member.ams.org 2005 2009 Glenn G. Chappell
More informationDAGViz: A DAG Visualization Tool for Analyzing Task-Parallel Program Traces
DAGViz: A DAG Visualization Tool for Analyzing Task-Parallel Program Traces An Huynh University of Tokyo Tokyo, Japan huynh@eidos.ic.i.utokyo.ac.jp Douglas Thain University of Notre Dame Notre Dame, IN,
More informationOpenMP Application Program Interface
OpenMP Application Program Interface Version.0 - July 01 Copyright 1-01 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material is granted, provided the OpenMP Architecture
More informationIN the last few decades, OpenMP has emerged as the de
404 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 3, MARCH 2009 The Design of OpenMP Tasks Eduard Ayguadé, Nawal Copty, Member, IEEE Computer Society, Alejandro Duran, Jay Hoeflinger,
More informationAdapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
More informationAn Introduction to Parallel Programming with OpenMP
An Introduction to Parallel Programming with OpenMP by Alina Kiessling E U N I V E R S I H T T Y O H F G R E D I N B U A Pedagogical Seminar April 2009 ii Contents 1 Parallel Programming with OpenMP 1
More informationRethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationSpring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
More informationGetting OpenMP Up To Speed
1 Getting OpenMP Up To Speed Ruud van der Pas Senior Staff Engineer Oracle Solaris Studio Oracle Menlo Park, CA, USA IWOMP 2010 CCS, University of Tsukuba Tsukuba, Japan June 14-16, 2010 2 Outline The
More informationOpenMP C and C++ Application Program Interface
OpenMP C and C++ Application Program Interface Version.0 March 00 Copyright 1-00 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material is granted, provided the OpenMP
More informationHybrid Programming with MPI and OpenMP
Hybrid Programming with and OpenMP Ricardo Rocha and Fernando Silva Computer Science Department Faculty of Sciences University of Porto Parallel Computing 2015/2016 R. Rocha and F. Silva (DCC-FCUP) Programming
More information5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.
1. The advantage of.. is that they solve the problem if sequential storage representation. But disadvantage in that is they are sequential lists. [A] Lists [B] Linked Lists [A] Trees [A] Queues 2. The
More informationCSE 326: Data Structures B-Trees and B+ Trees
Announcements (4//08) CSE 26: Data Structures B-Trees and B+ Trees Brian Curless Spring 2008 Midterm on Friday Special office hour: 4:-5: Thursday in Jaech Gallery (6 th floor of CSE building) This is
More informationAll ju The State of Software Development Today: A Parallel View. June 2012
All ju The State of Software Development Today: A Parallel View June 2012 2 What is Parallel Programming? When students study computer programming, the normal approach is to learn to program sequentially.
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationScalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures
Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures Dr.-Ing. Achim Basermann, Dr. Hans-Peter Kersken, Melven Zöllner** German Aerospace
More informationAuto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado
More informationParallelization of video compressing with FFmpeg and OpenMP in supercomputing environment
Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 231 237 doi: 10.14794/ICAI.9.2014.1.231 Parallelization of video compressing
More informationGPU Tools Sandra Wienke
Sandra Wienke Center for Computing and Communication, RWTH Aachen University MATSE HPC Battle 2012/13 Rechen- und Kommunikationszentrum (RZ) Agenda IDE Eclipse Debugging (CUDA) TotalView Profiling (CUDA
More informationParallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
More informationOpenMP Tools API (OMPT) and HPCToolkit
OpenMP Tools API (OMPT) and HPCToolkit John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu SC13 OpenMP Birds of a Feather Session, November 19, 2013 OpenMP Tools Subcommittee
More informationApplication Performance Analysis Tools and Techniques
Mitglied der Helmholtz-Gemeinschaft Application Performance Analysis Tools and Techniques 2012-06-27 Christian Rössel Jülich Supercomputing Centre c.roessel@fz-juelich.de EU-US HPC Summer School Dublin
More informationPoisson Equation Solver Parallelisation for Particle-in-Cell Model
WDS'14 Proceedings of Contributed Papers Physics, 233 237, 214. ISBN 978-8-7378-276-4 MATFYZPRESS Poisson Equation Solver Parallelisation for Particle-in-Cell Model A. Podolník, 1,2 M. Komm, 1 R. Dejarnac,
More informationProgramming the Intel Xeon Phi Coprocessor
Programming the Intel Xeon Phi Coprocessor Tim Cramer cramer@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Motivation Many Integrated Core (MIC) Architecture Programming Models Native
More informationHow To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power
A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems TADaaM Team - Nicolas Denoyelle - Brice Goglin - Emmanuel Jeannot August 24, 2015 1. Context/Motivations
More informationControl 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
More informationA Scalable Approach to MPI Application Performance Analysis
A Scalable Approach to MPI Application Performance Analysis Shirley Moore 1, Felix Wolf 1, Jack Dongarra 1, Sameer Shende 2, Allen Malony 2, and Bernd Mohr 3 1 Innovative Computing Laboratory, University
More informationPerformance Analysis for GPU Accelerated Applications
Center for Information Services and High Performance Computing (ZIH) Performance Analysis for GPU Accelerated Applications Working Together for more Insight Willersbau, Room A218 Tel. +49 351-463 - 39871
More informationMAQAO Performance Analysis and Optimization Tool
MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22
More information#pragma omp critical x = x + 1; !$OMP CRITICAL X = X + 1!$OMP END CRITICAL. (Very inefficiant) example using critical instead of reduction:
omp critical The code inside a CRITICAL region is executed by only one thread at a time. The order is not specified. This means that if a thread is currently executing inside a CRITICAL region and another
More informationPerformance Analysis and Optimization Tool
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop
More informationOpenACC Programming on GPUs
OpenACC Programming on GPUs Directive-based GPGPU Programming Sandra Wienke, M.Sc. wienke@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More information10CS35: Data Structures Using C
CS35: Data Structures Using C QUESTION BANK REVIEW OF STRUCTURES AND POINTERS, INTRODUCTION TO SPECIAL FEATURES OF C OBJECTIVE: Learn : Usage of structures, unions - a conventional tool for handling a
More informationUsing the Windows Cluster
Using the Windows Cluster Christian Terboven terboven@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University Windows HPC 2008 (II) September 17, RWTH Aachen Agenda o Windows Cluster
More informationXeon Phi Application Development on Windows OS
Chapter 12 Xeon Phi Application Development on Windows OS So far we have looked at application development on the Linux OS for the Xeon Phi coprocessor. This chapter looks at what types of support are
More informationCOMP/CS 605: Introduction to Parallel Computing Lecture 21: Shared Memory Programming with OpenMP
COMP/CS 605: Introduction to Parallel Computing Lecture 21: Shared Memory Programming with OpenMP Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State
More informationDATA STRUCTURES USING C
DATA STRUCTURES USING C QUESTION BANK UNIT I 1. Define data. 2. Define Entity. 3. Define information. 4. Define Array. 5. Define data structure. 6. Give any two applications of data structures. 7. Give
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
More informationHigh Performance Computing in Aachen
High Performance Computing in Aachen Samuel Sarholz sarholz@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University HPC unter Linux Sep 15, RWTH Aachen Agenda o Hardware o Development
More informationWhitepaper: performance of SqlBulkCopy
We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis
More informationPES Institute of Technology-BSC QUESTION BANK
PES Institute of Technology-BSC Faculty: Mrs. R.Bharathi CS35: Data Structures Using C QUESTION BANK UNIT I -BASIC CONCEPTS 1. What is an ADT? Briefly explain the categories that classify the functions
More informationYALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
More informationSymbol Tables. Introduction
Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The
More informationConcurrent Solutions to Linear Systems using Hybrid CPU/GPU Nodes
Concurrent Solutions to Linear Systems using Hybrid CPU/GPU Nodes Oluwapelumi Adenikinju1, Julian Gilyard2, Joshua Massey1, Thomas Stitt 1 Department of Computer Science and Electrical Engineering, UMBC
More informationMPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
More informationOpenMP Task Scheduling Strategies for Multicore NUMA Systems
OpenMP Task Scheduling Strategies for Multicore NUMA Systems Stephen L. Olivier University of North Carolina at Chapel Hill Campus Box 3175 Chapel Hill, NC 27599, USA olivier@cs.unc.edu Michael Spiegel
More informationFast Multipole Method for particle interactions: an open source parallel library component
Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,
More informationBinary Search Trees CMPSC 122
Binary Search Trees CMPSC 122 Note: This notes packet has significant overlap with the first set of trees notes I do in CMPSC 360, but goes into much greater depth on turning BSTs into pseudocode than
More informationA Comparison of Oracle Performance on Physical and VMware Servers
A Comparison of Oracle Performance on Physical and VMware Servers By Confio Software Confio Software 4772 Walnut Street, Suite 100 Boulder, CO 80301 303-938-8282 www.confio.com Comparison of Physical and
More informationVirtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationPerformance Evaluation and Optimization of A Custom Native Linux Threads Library
Center for Embedded Computer Systems University of California, Irvine Performance Evaluation and Optimization of A Custom Native Linux Threads Library Guantao Liu and Rainer Dömer Technical Report CECS-12-11
More informationSample Questions Csci 1112 A. Bellaachia
Sample Questions Csci 1112 A. Bellaachia Important Series : o S( N) 1 2 N N i N(1 N) / 2 i 1 o Sum of squares: N 2 N( N 1)(2N 1) N i for large N i 1 6 o Sum of exponents: N k 1 k N i for large N and k
More informationJVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra
JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks
More informationHigh Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
More informationCS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team
CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team Lecture Summary In this lecture, we learned about the ADT Priority Queue. A
More information11.1 inspectit. 11.1. inspectit
11.1. inspectit Figure 11.1. Overview on the inspectit components [Siegl and Bouillet 2011] 11.1 inspectit The inspectit monitoring tool (website: http://www.inspectit.eu/) has been developed by NovaTec.
More informationHigh Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem Kamil Rocki, PhD Department of Computer Science Graduate School of Information Science and Technology The University of
More informationSolution of Linear Systems
Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start
More informationA binary search tree or BST is a binary tree that is either empty or in which the data element of each node has a key, and:
Binary Search Trees 1 The general binary tree shown in the previous chapter is not terribly useful in practice. The chief use of binary trees is for providing rapid access to data (indexing, if you will)
More informationUnprecedented Performance and Scalability Demonstrated For Meter Data Management:
Unprecedented Performance and Scalability Demonstrated For Meter Data Management: Ten Million Meters Scalable to One Hundred Million Meters For Five Billion Daily Meter Readings Performance testing results
More information