OpenMP* 4.0 for HPC in a Nutshell
1 OpenMP* 4.0 for HPC in a Nutshell Dr.-Ing. Michael Klemm Senior Application Engineer Software and Services Group ([email protected]) *Other brands and names are the property of their respective owners.
2 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
3 OpenMP API
- De-facto standard; OpenMP 4.0 out since July 2013
- API for C/C++ and Fortran for shared-memory parallel programming
- Based on directives (pragmas in C/C++)
- Portable across vendors and platforms
- Supports various types of parallelism
4 Evolution of Hardware (at Intel) (die sizes not to scale, for illustration only)
The slide shows a table comparing Intel Xeon processor generations (64-bit Intel Xeon, Xeon 5100, 5500, and 5600 series, Xeon E5 code-named Sandy Bridge EP, Xeon E5 v2 code-named Ivy Bridge EP, Xeon E5 v3 code-named Haswell EP) with the Intel Xeon Phi coprocessor code-named Knights Corner and the Intel Xeon Phi processor & coprocessor code-named Knights Landing(1) by core count, thread count, and SIMD width (bits); core counts grow from 1 on the original Xeon up to 61 on Knights Corner. Product specifications for launched and shipped products are available on ark.intel.com. (1) Not launched or in planning.
5 Levels of Parallelism in OpenMP 4.0
- Cluster: group of computers communicating through a fast interconnect
- Coprocessors/Accelerators: special compute devices attached to the local node through a special interconnect
- Node: group of processors communicating through shared memory
- Socket: group of cores communicating through shared cache
- Core: group of functional units communicating through registers
- Hyper-Threads: group of thread contexts sharing functional units
- Superscalar: group of instructions sharing functional units
- Pipeline: sequence of instructions sharing functional units
- Vector: single instruction using multiple functional units
OpenMP 4.0 addresses these levels with three new feature areas: OpenMP 4.0 for devices (coprocessors/accelerators), OpenMP 4.0 affinity (node, socket, core, hyper-threads), and OpenMP 4.0 SIMD (vector).
6 OpenMP Intro in Three Slides (1)

#pragma omp parallel            // fork
{
    #pragma omp for             // distribute work
    for (i = 0; i < n; i++) {
        ...
    }                           // barrier

    #pragma omp for             // distribute work
    for (i = 0; i < N; i++) {
        ...
    }                           // barrier
}                               // join
7 OpenMP Intro in Three Slides (2)

double a[n];
double l, s = 0;
#pragma omp parallel for reduction(+:s) private(l) \
                         schedule(static,4)
for (i = 0; i < n; i++) {
    l = log(a[i]);
    s += l;
}

The diagram on the slide shows each thread starting with a private s = 0, the iterations being distributed in chunks of 4, and the private partial sums being combined back into the shared s at the barrier.
8 OpenMP Intro in Three Slides (3)

#pragma omp parallel            // fork
#pragma omp single
for (e = l->first; e; e = e->next)
    #pragma omp task
    process(e);
                                // join
9 OpenMP 4.0 SIMD
10 In a Time before OpenMP 4.0
Programmers had to rely on auto-vectorization or use vendor-specific extensions:
- Programming models (e.g., Intel Cilk Plus)
- Compiler pragmas (e.g., #pragma vector)
- Low-level constructs (e.g., _mm_add_pd())

#pragma omp parallel for
#pragma vector always
#pragma ivdep
for (int i = 0; i < N; i++) {
    a[i] = b[i] + ...;
}

You need to trust the compiler to do the right thing.
11 OpenMP SIMD Loop Construct
- Vectorize a loop nest
- Cut loop into chunks that fit a SIMD vector register
- No parallelization of the loop body

Syntax (C/C++):
#pragma omp simd [clause[[,] clause],...]
for-loops

Syntax (Fortran):
!$omp simd [clause[[,] clause],...]
do-loops
12 Example

float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // the loop below is vectorized
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}
13 Data Sharing Clauses
- private(var-list): uninitialized vectors for the variables in var-list (on the slide, a scalar x = 42 becomes a vector of undefined lane values)
- firstprivate(var-list): initialized vectors for the variables in var-list (x = 42 becomes a vector with 42 in every lane)
- reduction(op:var-list): create private variables for var-list and apply the reduction operator op at the end of the construct (the lane values are combined back into the single x)
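To make the clauses concrete, here is a minimal sketch (not from the slides; the function and variable names are illustrative). It uses the combined parallel for simd construct so that all three clauses are legal in OpenMP 4.0:

/* Sketch: 'tmp' gets an uninitialized private copy per thread/SIMD lane,
 * 'scale' is copied in with its initial value, and the per-lane partial
 * sums are combined with '+' when the construct ends. */
float sum_scaled_squares(const float *a, int n, float scale)
{
    float sum = 0.0f;
    float tmp;
    #pragma omp parallel for simd private(tmp) firstprivate(scale) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        tmp = scale * a[i];
        sum += tmp * tmp;
    }
    return sum;
}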
14 SIMD Loop Clauses
- safelen(length): maximum number of iterations that can run concurrently without breaking a dependence; in practice, the maximum vector length
- linear(list[:linear-step]): the variable's value is in a linear relationship with the iteration number: x_i = x_orig + i * linear-step
- aligned(list[:alignment]): specifies that the list items have a given alignment; the default is the alignment for the architecture
- collapse(n): combine the iteration spaces of n perfectly nested loops into one
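As an illustration (a sketch, not from the slides; the 64-byte alignment and the function name are assumptions), the clauses might be combined like this:

/* Sketch: the compiler may combine at most 16 iterations into one SIMD chunk,
 * 'j' is in a linear relationship with the loop index (step 1), and the
 * pointers are asserted to be 64-byte aligned. */
void scaled_copy(float *restrict dst, const float *restrict src, int n)
{
    int j = 0;
    #pragma omp simd safelen(16) linear(j:1) aligned(dst, src:64)
    for (int i = 0; i < n; i++) {
        dst[i] = 2.0f * src[j];
        j++;
    }
}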
15 Loop-Carried Dependencies
Dependencies may occur across loop iterations: a loop-carried dependency. The following code contains such a dependency:

void lcd_ex(float* a, float* b, size_t n, float c1, float c2) {
    size_t i;
    for (i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}

Some iterations of the loop have to complete before the next iteration can run. Simple trick: can you reverse the loop w/o getting wrong results?
16 Loop-Carried Dependencies
Can we parallelize or vectorize the loop?
- Parallelization: no (except for very specific loop schedules)
- Vectorization: yes (if the vector length is shorter than the distance of any dependency)
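A hedged sketch of how the SIMD construct could express this: the dependence distance in lcd_ex is 17 iterations, so a vector chunk of at most 16 iterations keeps every read of a[i + 17] ahead of the corresponding write.

#include <stddef.h>

/* Sketch: safelen(16) tells the compiler that iterations at most 16 apart may
 * run concurrently, which is below the dependence distance of 17, so the
 * loop can be vectorized safely. */
void lcd_ex_simd(float *a, float *b, size_t n, float c1, float c2)
{
    size_t i;
    #pragma omp simd safelen(16)
    for (i = 0; i < n; i++) {
        a[i] = c1 * a[i + 17] + c2 * b[i];
    }
}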
17 SIMD Worksharing Construct
- Parallelize and vectorize a loop nest
- Distribute a loop's iteration space across a thread team
- Subdivide loop chunks to fit a SIMD vector register

Syntax (C/C++):
#pragma omp for simd [clause[[,] clause],...]
for-loops

Syntax (Fortran):
!$omp do simd [clause[[,] clause],...]
do-loops
18 Example

float sprod(float *a, float *b, int n) {
    float sum = 0.0f;
    #pragma omp for simd reduction(+:sum)   // binds to an enclosing parallel region
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}

On the slide, the iteration space is first distributed across the threads (Thread 0, Thread 1, Thread 2, ...) and each thread's chunk is then vectorized.
19 SIMD Function Vectorization

float min(float a, float b) {
    return a < b ? a : b;
}

float distsq(float x, float y) {
    return (x - y) * (x - y);
}

void example() {
    #pragma omp parallel for simd
    for (i = 0; i < n; i++) {
        d[i] = min(distsq(a[i], b[i]), c[i]);
    }
}
20 SIMD Function Vectorization
Declare one or more functions to be compiled for calls from a SIMD-parallel loop.

Syntax (C/C++):
#pragma omp declare simd [clause[[,] clause],...]
[#pragma omp declare simd [clause[[,] clause],...]]
[...]
function-definition-or-declaration

Syntax (Fortran):
!$omp declare simd (proc-name-list)
21 SIMD Function Vectorization

#pragma omp declare simd
float min(float a, float b) {
    return a < b ? a : b;
}

#pragma omp declare simd
float distsq(float x, float y) {
    return (x - y) * (x - y);
}

void example() {
    #pragma omp parallel for simd
    for (i = 0; i < n; i++) {
        d[i] = min(distsq(a[i], b[i]), c[i]);
    }
}

The compiler generates vector variants of the annotated functions, sketched on the slide as pseudo-code:

vec8 min_v(vec8 a, vec8 b) {
    return a < b ? a : b;
}

vec8 distsq_v(vec8 x, vec8 y) {
    return (x - y) * (x - y);
}

so the loop body effectively becomes vd = min_v(distsq_v(va, vb), vc).
22 SIMD Function Vectorization
- simdlen(length): generate a function variant that supports a given vector length
- uniform(argument-list): the argument has a constant value across the iterations of a given loop
- inbranch: the function is always called from inside an if statement
- notinbranch: the function is never called from inside an if statement
- linear(argument-list[:linear-step]), aligned(argument-list[:alignment]), reduction(operator:list): same as before
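A minimal sketch of the uniform and linear clauses (illustrative names, not from the slides): the pointer is identical for every SIMD lane and the index advances by one per lane, so the vector variant can use a unit-stride load.

/* Sketch: 'x' is uniform (the same across lanes), 'i' is linear with step 1
 * (consecutive lanes access consecutive elements), and 'notinbranch' promises
 * the function is never called under a condition. */
#pragma omp declare simd uniform(x) linear(i:1) notinbranch
float scaled_load(const float *x, int i)
{
    return 2.0f * x[i];
}

void fill(float *y, const float *x, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        y[i] = scaled_load(x, i);
    }
}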
23 SIMD Constructs & Performance
The slide plots relative speed-up (higher is better) of ICC auto-vectorization versus the ICC SIMD directive on six kernels (Mandelbrot, Volume Rendering, BlackScholes, Fast Walsh, Perlin Noise, SGpp); the SIMD directive achieves speed-ups between roughly 1.5x and 4.3x and consistently outperforms auto-vectorization.
Source: M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, and X. Martorell. Extending OpenMP with Vector Constructs for Modern Multicore SIMD Architectures. In Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.
24 OpenMP 4.0 for Devices
25 Device Model
OpenMP 4.0 supports accelerators/coprocessors. Device model:
- One host
- Multiple accelerators/coprocessors of the same kind
(The slide shows one host connected to several coprocessors.)
26 OpenMP 4.0 for Devices: Constructs
Transfer control [and data] from the host to the device.

Syntax (C/C++):
#pragma omp target [data] [clause[[,] clause],...]
structured-block

Syntax (Fortran):
!$omp target [data] [clause[[,] clause],...]
structured-block
!$omp end target [data]

Clauses:
- device(scalar-integer-expression)
- map([alloc | to | from | tofrom:] list)
- if(scalar-expr)
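A hedged sketch combining the three clauses (the threshold and function name are illustrative, not from the slides): the region is offloaded to device 0 only when the problem is large enough to amortize the data transfer.

/* Sketch: map the array section to and from device 0; if n is small,
 * the if clause makes the region execute on the host instead. */
void scale_on_device(float *a, int n, float s)
{
    #pragma omp target device(0) map(tofrom: a[0:n]) if(n > 100000)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }
}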
27 Execution Model
- The target construct transfers the control flow to the target device
  - The transfer of control is sequential and synchronous
  - The transfer clauses control the direction of data flow
  - Array notation is used to describe array length
- The target data construct creates a scoped device data environment
  - Does not include a transfer of control
  - The transfer clauses control the direction of data flow
  - The device data environment is valid for the lifetime of the target data region
- Use target update to request data transfers from within a target data region
28 Execution Model
- The data environment is lexically scoped
- The data environment is destroyed at the closing curly brace
- Allocated buffers/data are automatically released

#pragma omp target \
    map(alloc:...) \
    map(to:...)    \
    map(from:...)
{
    ...
}

The slide's figure shows the steps for a pointer pa: (1) alloc on the device, (2) copy to the device, (3) execution of the region, (4) copy back from the device.
29 Example

#pragma omp target data device(0) map(alloc:tmp[:n]) map(to:input[:n]) map(from:res)
{
    #pragma omp target device(0)
    #pragma omp parallel for
    for (i = 0; i < n; i++)                               // target
        tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);                // host

    #pragma omp target update device(0) to(input[:n])     // host

    #pragma omp target device(0)
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < n; i++)                               // target
        res += final_computation(input[i], tmp[i], i);
}

The annotations on the slide mark which parts execute on the host and which on the target device.
30 teams Construct
Support multi-level parallel devices.

Syntax (C/C++):
#pragma omp teams [clause[[,] clause],...]
structured-block

Syntax (Fortran):
!$omp teams [clause[[,] clause],...]
structured-block

Clauses:
- num_teams(integer-expression)
- num_threads(integer-expression)
- default(shared | none)
- private(list), firstprivate(list)
- shared(list), reduction(operator : list)
31 Offloading SAXPY to a Coprocessor: SAXPY

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp target data map(to:x[0:n])
    {
        #pragma omp target map(tofrom:y)
        #pragma omp teams num_teams(num_blocks) num_threads(nthreads)
        // all teams do the same
        for (int i = 0; i < n; i += num_blocks) {
            for (int j = i; j < i + num_blocks; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }
    free(x); free(y);
    return 0;
}
32 Offloading SAXPY to a Coprocessor: SAXPY on the Coprocessor/Accelerator

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp target data map(to:x[0:n])
    {
        #pragma omp target map(tofrom:y)
        #pragma omp teams num_teams(num_blocks) num_threads(bsize)
        // all teams do the same
        #pragma omp distribute                  // workshare across teams (w/o barrier)
        for (int i = 0; i < n; i += num_blocks) {
            #pragma omp parallel for            // workshare across threads (w/ barrier)
            for (int j = i; j < i + num_blocks; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }
    free(x); free(y);
    return 0;
}
33 Offloading SAXPY to a Coprocessor: Combined Constructs

int main(int argc, const char* argv[]) {
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Define scalars n, a, b & initialize x, y

    #pragma omp target map(to:x[0:n]) map(tofrom:y)
    {
        #pragma omp teams distribute parallel for \
                num_teams(num_blocks) num_threads(bsize)
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}
34 OpenMP 4.0 Affinity
35 NUMA is here to Stay
- (Almost) all multi-socket compute servers are NUMA systems
- Different access latencies for different memory locations
- Different bandwidth observed for different memory locations
Example: Intel Xeon E5-2600v2 series processor (the slide shows a two-socket Xeon E5-2600v2 system).
36 Thread Affinity: Why It Matters
The slide plots STREAM Triad bandwidth (GB/sec, higher is better) on an Intel Xeon E5-2697v2 system over the number of threads/cores for four configurations: compact vs. scatter thread placement, each with parallel (par) and sequential (seq) data initialization. The measured bandwidth differs substantially between the configurations, which is the point of the slide.
37 Thread Affinity: Processor Binding
The binding strategy depends on the machine and the application.
Putting threads far apart, i.e., on different packages:
- (May) improve the aggregated memory bandwidth
- (May) improve the combined cache size
- (May) decrease the performance of synchronization constructs
Putting threads close together, i.e., on two adjacent cores that possibly share a cache:
- (May) improve the performance of synchronization constructs
- (May) decrease the available memory bandwidth and cache size (per thread)
38 Thread Affinity in OpenMP* 4.0
OpenMP 4.0 introduces the concept of places:
- A set of threads running on one or more processors
- Can be defined by the user
- Pre-defined places available: threads (one place per hyper-thread), cores (one place per physical core), sockets (one place per processor package)
and affinity policies:
- spread: spread OpenMP threads evenly among the places
- close: pack OpenMP threads near the master thread
- master: collocate OpenMP threads with the master thread
and means to control these settings:
- Environment variables OMP_PLACES and OMP_PROC_BIND
- Clause proc_bind for parallel regions
39 Thread Affinity Example
Example (Intel Xeon Phi coprocessor): distribute the outer region, keep the inner regions close.

OMP_PLACES=cores(8); OMP_NUM_THREADS=4,4

#pragma omp parallel proc_bind(spread)
#pragma omp parallel proc_bind(close)

The slide's diagram shows eight places (p0 ... p7): the four outer threads are spread evenly across the places, and each inner team of four threads is then placed close to its master thread.
40 We're Almost Through
This presentation focused on the HPC features, but OpenMP 4.0 has more to offer:
- Improved Fortran 2003 support
- User-defined reductions
- Task dependencies
- Cancellation
We can chat about these features during the Q&A (if you want to).
41 The Last Slide
- OpenMP 4.0 is a major leap for OpenMP
  - New kinds of parallelism have been introduced
  - Support for heterogeneous systems with coprocessor devices
- Support in Intel Composer XE 2013 SP1:
  - SIMD constructs (except combined constructs)
  - OpenMP for devices (except combined constructs)
  - OpenMP affinity