Introduction to Intel Xeon Phi Coprocessors. Slides by Johannes Henning
1 Introduction to Intel Xeon Phi Coprocessors Slides by Johannes Henning
2 Xeon Phi Coprocessor Product Lineup

7 Family - Highest performance, most memory (performance leadership):
  16 GB GDDR5, 352 GB/s, >1.2 TF DP, 300 W TDP
  Models: 7120P MM#, A MM#, D (dense form factor) MM#, X (no thermal solution) MM#
5 Family - Optimized for high-density environments (performance/watt leadership):
  8 GB GDDR5, >300 GB/s, >1 TF DP
  Models: 5110P MM#, D (no thermal) MM#
3 Family - Outstanding parallel computing solution (performance/$ leadership):
  6 GB GDDR5, 240 GB/s, >1 TF DP, 300 W TDP
  Models: 3120P MM#, A MM#

Optional 3-year warranty: extend to a 3-year warranty on any Intel Xeon Phi coprocessor. Product code: XPX100WRNTY, MM#

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright 2015, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Source:
3 Xeon Phi 5110P Specifications
60 cores based on the P54C (Pentium) architecture
>1.0 GHz clock speed; 64-bit x86 base instructions + SIMD
Not binary-compatible with standard x86 processors; uses its own MIC instruction set
30 MB coherent L2 cache (512 KB per core) + 64 KB L1 per core
8 GB of GDDR5 memory
In-order execution
4 hardware threads per core (240 logical cores)
  Think graphics-card hardware threads: only one runs at a time = memory latency hiding
  Switched after each instruction
512-bit wide VPU with the new KCi ISA
  No support for MMX, SSE or AVX
  Handles 8 double-precision or 16 single-precision floats
  Always structured in vectors with 16 elements
4 Xeon Phi Performance in Practice: Case Studies
Intel Xeon Phi coprocessor: increases application performance up to 7x.

Segment - Application/Code - Performance vs. 2S Xeon*
DCC: NEC / Video Transcoding (see case study: NEC Case Study) - up to 3.0x (2)
Energy: Seismic Imaging ISO3DFD proxy, 16th-order isotropic kernel RTM - up to 1.45x (3)
Energy: Seismic Imaging 3DFD TTI proxy, 8th-order RTM (complex structures) - up to 1.23x (3)
Energy: Petrobras Seismic ISO-3D RTM (with 1, 2, 3 or 4 Intel Xeon Phi coprocessors) - up to 2.2x, 3.4x, 4.6x or 5.6x (4)
Financial Services: BlackScholes SP/DP - SP up to 2.12x, DP up to 1.72x (3)
Financial Services: Monte Carlo European Option SP/DP - SP up to 7x, DP up to 3.13x (3)
Financial Services: Monte Carlo RNG European SP/DP - SP up to 1.58x, DP up to 1.17x (3)
Financial Services: Binomial Options SP/DP - SP up to 1.85x, DP up to 1.85x (3)
Life Science: BWA / Bio-Informatics - up to 1.5x (4)
Life Science: Wayne State University / MPI-HMMER - up to 1.56x (1)
Life Science: GROMACS / Molecular Dynamics - up to 1.36x (1)
Manufacturing: ANSYS / Mechanical SMP - up to 1.88x (5)
Manufacturing: Sandia Mantevo / miniFE (case study: software.intel.com/en-us/articles/running-minife-on-intel-xeon-phi-coprocessors) - up to 2.3x (4)
Physics: ZIB (Zuse-Institut Berlin) / Ising 3D (solid state physics) - up to 3.46x (1)
Physics: ASKAP tHogbomClean (astronomy) - up to 1.73x (3)
Physics: Princeton / GTC-P (gyrokinetic toroidal) turbulence simulation - up to 1.18x (6)
Weather: WRF

Configurations:
1. 2S Xeon E vs. 2S Xeon* E + Xeon Phi* coprocessor (symmetric)
2. 2S Xeon E vs. 2S Xeon E + Xeon Phi coprocessor
3. 2S Xeon E5-2697v2 vs. 1 Xeon Phi coprocessor (native mode)
4. 2S Xeon E5-2697v2 vs. 2S Xeon E5-2697v2 + 1 Xeon Phi coprocessor (symmetric mode) (for Petrobras: 1, 2, 3 or 4 Xeon Phis in the system)
5. 2S Xeon E vs. 2S Xeon* E + Xeon Phi* coprocessor (symmetric; only 2 Xeon cores used to optimize licensing costs)
6. 4 nodes of 2S E5-2697v2 vs. 4 nodes of E5-2697v2 + 1 Xeon Phi coprocessor (symmetric)

Xeon = Intel Xeon processor; Xeon Phi = Intel Xeon Phi coprocessor.
Source: Intel & customer measured. Configuration details: please reference slide speaker notes.
5 Operating System
Minimal embedded Linux
Implements the Linux Standard Base (LSB) core libraries
BusyBox minimal shell environment
6 Operating System (slide content is a figure not captured in this transcription)
7 Hello World (slide content is a figure not captured in this transcription)
8 Offload-Compiler
OpenMP-like approach to offload functions onto the MIC
Like OpenMP: a pragma-based library approach

    int p1[SIZE], p2[SIZE], res[SIZE];
    populate(p1);
    populate(p2);
    #pragma offload target(mic) in(p1, p2 : length(SIZE)) out(res)
    {
        for (int i = 0; i < SIZE; i++)
            res[i] = p1[i] + p2[i];
    }

A simple mechanism, but with questionable optimisation and very limited features
9 Offload-Compiler
Other keywords:
  __attribute__((target(mic))) int res[];
  #pragma offload_attribute(push, target(mic)) / #pragma offload_attribute(pop)
  nocopy(), alloc_if(), free_if()
  #pragma offload_transfer, #pragma offload_wait
  signal(), wait()
Further reading:
  effective-use-of-the-intel-compilers-offload-features
  /opt/intel/composerxe/samples/en_us/c++/mic_samples/intro_samplec/
10 OpenMP: Native Binaries and Offloading
Offloading mode:

    #pragma offload target(mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5f) / count);
        pi += 4.0f / (1.0f + t * t);
    }
    pi /= count;

You can also use OpenMP in native Phi binaries
OpenMP 4.1 will integrate offload clauses into omp pragmas
11 Intel TBB
Template-based C++ library
Task parallelism + task stealing
API overview:
  parallel_for, parallel_reduce, parallel_scan
  parallel_while, parallel_do, parallel_pipeline, parallel_sort
  concurrent_queue, mutex, spin_mutex, fetch_and_add, fetch_and_increment, ...
12 Intel TBB

    void SerialApplyFoo( float a[], size_t n ) {
        for( size_t i=0; i!=n; ++i )
            Foo(a[i]);
    }

    #include "tbb/tbb.h"
    using namespace tbb;

    class ApplyFoo {
        float *const my_a;
    public:
        ApplyFoo( float a[] ) : my_a(a) {}
        void operator()( const blocked_range<size_t>& r ) const {
            float *a = my_a;
            for( size_t i=r.begin(); i!=r.end(); ++i )
                Foo(a[i]);
        }
    };
13 Intel Cilk Plus
C++ extension providing:
  Fork-join task parallelism with work stealing
  Auto-vectorization
  Data-parallel language constructs for vectorization (array notation)
Can/should be combined with TBB or OpenMP
Ease of use bought with reduced fine-tuning capabilities
14 Intel Cilk Plus
Keywords: cilk_spawn, cilk_sync, cilk_for

    #include <cilk/cilk.h>
    void parallel_qsort(int *begin, int *end) {
        if (begin != end) {
            --end;  // exclude last element (pivot)
            int *middle = std::partition(begin, end,
                              std::bind2nd(std::less<int>(), *end));
            std::swap(*end, *middle);  // move pivot to middle
            cilk_spawn parallel_qsort(begin, middle);
            parallel_qsort(++middle, ++end);  // exclude pivot
            cilk_sync;
        }
    }
15 Intel Cilk Plus
Reducers:

    cilk::reducer_opadd<unsigned long long int> total(0);
    cilk_for(unsigned int i = 1; i <= n; ++i) {
        *total += compute(i);
    }

Holders:

    cilk::holder<hash_table<K, V> > m_holder;
16 Intel Cilk Plus
Extension for array notation:

    a[0:3][0:4]    // refers to 12 elements in the two-dimensional array a
    b[0:2:3]       // refers to elements 0 and 3 of the one-dimensional array b
    b[:]           // refers to the entire array b
    a[:] * b[:]    // element-wise multiplication
    a[3:2][3:2] + b[5:2][5:2]  // matrix addition of the 2x2 matrices
    a[0:4][1:2] + b[1:2][0:4]  // error, different rank sizes
    a[0:4][1:2] + b[0][1]      // ok, adds a scalar b[0][1] to an array section
17 OpenCL
What's different from GPUs?
  Thread granularity is bigger + no need for local shared memory
  Work groups are mapped to threads: 240 OpenCL hardware threads handle the work groups
    More than 1000 work groups recommended
    Each thread executes one work group
  Implicit vectorization by the compiler of the innermost loop
    16 elements per vector; dimension zero of the NDRange must be divisible by 16, otherwise scalar execution (good WG size = 16)

    Kernel ABC()
    for (int i = 0; i < get_local_size(2); i++)
        for (int j = 0; j < get_local_size(1); j++)
            for (int k = 0; k < get_local_size(0); k++)  // dimension zero of the NDRange
                Kernel_Body;
18 OpenCL
Non-uniform branching has significant overhead
  Between work groups it is okay, since each WG is executed by one thread
Memory access not aligned to the vector size has significant overhead
Non-linear access patterns have significant overhead
Manual prefetching can be advantageous
No hardware support for barriers, especially costly for dimension zero
However: Intel prefers
19 MPI
Each MIC is considered a stand-alone node
MICs can communicate directly via InfiniBand (hardware or software)
Build and launch process:

    $ source /opt/intel/composer_xe_ /bin/compilervars.sh intel64
    $ source /opt/intel/impi/ /bin64/mpivars.sh
    $ mpiicc -mmic mpi_test.c -o mpi_test.mic
    $ mpiicc mpi_test.c -o mpi_test.host
    $ export I_MPI_MIC=enable
    $ scp mpi_test.mic mic0:~/.
    $ sudo scp /opt/intel/impi/ /mic/bin/* mic0:/bin/
    $ sudo scp /opt/intel/impi/ /mic/lib/* mic0:/lib64/
    $ sudo scp /opt/intel/composer_xe_ /compiler/lib/mic/* mic0:/lib64/
    $ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>
20 Vectorization on Intel Compilers
From ease of use to fine control:
  Auto vectorization (compiler knobs)
  Guided vectorization (compiler hints/pragmas, array notation)
  Low-level vectorization (C/C++ vector classes, intrinsics/assembly)
Source:
21 Auto Vectorization: Not All Loops Will Vectorize
Data dependencies between iterations
  Proven read-after-write (i.e., loop-carried) dependencies
  Assumed data dependencies: aggressive optimizations (e.g., IPO) might help
Vectorization won't be efficient
  Compiler estimates how much better the vectorized version will be
  Affected by data alignment, data layout, etc.
Unsupported loop structure
  While loops, for loops with an unknown number of iterations
  Complex loops, unsupported data types, etc.
(Some) function calls within loop bodies
  Not the case for SVML functions

    // RaW dependency
    for (int i = 0; i < N; i++) a[i] = a[i-1] + b[i];
    // Inefficient vectorization
    for (int i = 0; i < N; i++) a[c[i]] = b[d[i]];
    // Function call within loop body
    for (int i = 0; i < N; i++) a[i] = foo(b[i]);

Source:
22 Guided Vectorization: Disambiguation Hints
Get rid of assumed vector dependencies
  Assume function arguments won't be aliased: compile with -fargument-noalias
  C99 restrict keyword for pointers (compile with -restrict otherwise)
  Ignore assumed vector dependencies (compiler directive): C/C++ #pragma ivdep, Fortran !dir$ ivdep

    void v_add(float *restrict c, float *restrict a, float *restrict b) {
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    void v_add(float *c, float *a, float *b) {
        #pragma ivdep
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

Source:
23 Guided Vectorization: #pragma simd
Force loop vectorization, ignoring all dependencies
Additional clauses for specifying reductions, etc.

    // SIMD loop
    void v_add(float *c, float *a, float *b) {
        #pragma simd
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    // SIMD (elemental) function
    __declspec(vector) float v_add(float a, float b) { return a + b; }
    for (int i = 0; i < N; i++) c[i] = v_add(A[i], B[i]);

Also supported in OpenMP with almost the same functionality/syntax:
  #pragma omp simd [clauses] for SIMD loops
  #pragma omp declare simd [clauses] for SIMD functions
  See the OpenMP 4.0 specification for more information
Source:
24 #pragma ivdep versus #pragma simd
#pragma ivdep
  Implicit vectorization: notifies the compiler about the absence of pointer aliasing
  Based on practicability and costs, the compiler decides about vectorization
#pragma simd
  Explicit: enforces vectorization regardless of the costs
  If no parameter is provided, the vector length of the SIMD unit is assumed
25 Improving Vectorization: Data Layout
Vectorization is more efficient with unit strides
  Non-unit strides generate gather/scatter
  Unit strides are also better for data locality
  The compiler might refuse to vectorize
Layout your data as Structure of Arrays (SoA)
Traverse matrices in the right direction: C/C++ a[i][:], Fortran a(:,i)
  Loop interchange might help; usually the compiler is smart enough to apply it
  Check the compiler optimization report

    // Array of Structures (AoS)
    struct coordinate { float x, y, z; } crd[n];
    for (int i = 0; i < N; i++)
        ... = f(crd[i].x, crd[i].y, crd[i].z);
    // consecutive in memory: x0 y0 z0 x1 y1 z1 ... x(n-1) y(n-1) z(n-1)

    // Structure of Arrays (SoA)
    struct coordinate { float x[n], y[n], z[n]; } crd;
    for (int i = 0; i < N; i++)
        ... = f(crd.x[i], crd.y[i], crd.z[i]);
    // consecutive in memory: x0 x1 ... x(n-1) y0 y1 ... y(n-1) z0 z1 ... z(n-1)

Source:
26 Improving Vectorization: Data Alignment
Unaligned accesses might cause significant performance degradation
  Two instructions per access on the current Intel Xeon Phi coprocessor
  Might cause false-sharing problems (consumer/producer threads on the same cache line)
Alignment is generally unknown at compile time
  Every vector access is potentially an unaligned access
  Vector access size = cache line size (64 bytes)
  The compiler might peel a few loop iterations; in general, only one array can be aligned, though
When possible, we have to
  Align our data
  Tell the compiler the data is aligned (might not always be the case)
Source:
27 Does the Xeon Phi Fit the Problem? (decision chart not captured in this transcription) Source:
28 What You Need to Start
An installation of Intel Parallel Studio XE (>= 2013)
Samples including build files at /opt/intel/composerxe/samples/
The following lines in your .bashrc:

    if [ -d "/opt/intel/composer_xe_13/compiler/lib/intel64" ] ; then
        LD_LIBRARY_PATH="/opt/intel/composer_xe_13/compiler/lib/intel64:$LD_LIBRARY_PATH"
    fi
    if [ -d "/opt/intel/bin" ] ; then
        PATH="/opt/intel/bin:$PATH"
    fi

MIC library files at /opt/intel/composer_xe_{version_number}/compiler/lib/mic/
29 Knights Landing: Next-Generation Intel Xeon Phi
Architectural enhancements = many-X performance
Based on the Intel Atom core (Silvermont microarchitecture) with enhancements for HPC:
  14 nm process technology
  4 threads/core
  Deep out-of-order buffers
  Gather/scatter
  Better branch prediction
  Higher cache bandwidth, and many more
Binary-compatible with Intel Xeon processors
60+ cores, 3+ teraflops (1), 3x single-thread performance (2)
2-D core mesh, cache coherency, integrated fabric
High-performance memory: over 5x STREAM vs. DDR4 (3), up to 16 GB at launch, NUMA support
DDR4 capacity comparable to Intel Xeon processors
Server processor
Copyright 2015, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Source:
30 Knights Landing: Some Details

Performance:
  3+ teraFLOPS of double-precision peak theoretical performance per single-socket node (0)

Integration:
  Intel Omni Scale fabric integration

High-performance on-package memory (MCDRAM):
  Over 5x STREAM vs. DDR4 (1); over 400 GB/s
  Up to 16 GB at launch; NUMA support
  Over 5x energy efficiency vs. GDDR5 (2); over 3x density vs. GDDR5 (2)
  In partnership with Micron Technology
  Flexible memory modes, including cache and flat

Server processor:
  Standalone bootable processor (running a host OS) and a PCIe coprocessor (PCIe endpoint device)
  Platform memory: up to 384 GB DDR4 using 6 channels
  Reliability: Intel server-class reliability
  Power efficiency: over 25% better than a discrete coprocessor (4); over 10 GF/W
  Density: 3+ KNL with fabric in 1U (5)
  Up to 36 lanes PCIe* Gen 3.0

Microarchitecture:
  Over 8 billion transistors per die, based on Intel's 14-nanometer manufacturing technology
  Binary-compatible with Intel Xeon processors, with support for Intel Advanced Vector Extensions 512 (Intel AVX-512) (6)
  3x single-thread performance compared to Knights Corner (7)
  60+ cores in a 2D mesh architecture; 2 cores per tile with 2 vector processing units (VPUs) per core
  1 MB L2 cache shared between the 2 cores of a tile (cache-coherent)
  4 threads/core; 2x out-of-order buffer depth (8)
  Gather/scatter in hardware; advanced branch prediction; high cache bandwidth
  Based on the Intel Atom core (Silvermont microarchitecture) with many HPC enhancements
  32 KB Icache and Dcache; 2 x 64B load ports in the Dcache; 46/48 physical/virtual address bits
  Most of today's parallel optimizations carry forward to KNL
  Multiple NUMA domains supported per socket

Availability:
  First commercial HPC systems in 2H'15
  Knights Corner to Knights Landing upgrade program available today
  The Intel Adams Pass board (1U half-width) is custom-designed for Knights Landing (KNL) and will be available to system integrators for the KNL launch; the board is OCP Open Rack 1.0 compliant and features 6-channel native DDR4 (1866/2133/2400 MHz) and 36 lanes of integrated PCIe* Gen 3 I/O

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
0 Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle.
1 Projected result based on internal Intel analysis of the STREAM benchmark using a Knights Landing processor with 16 GB of ultra-high-bandwidth memory versus DDR4 memory with all channels populated.
2 Projected result based on internal Intel analysis comparing 16 GB of ultra-high-bandwidth memory to the 16 GB of GDDR5 memory used in the Intel Xeon Phi coprocessor 7120P.
3 Compared to the 1st-generation Intel Xeon Phi 7120P coprocessor (formerly codenamed Knights Corner).
4 Projected result based on internal Intel analysis using estimated performance and power consumption of a rack-sized deployment of Intel Xeon processors and Knights Landing coprocessors as compared to a rack with KNL processors only.
5 Projected result based on internal Intel analysis comparing a discrete Knights Landing processor with integrated fabric to a discrete Intel fabric component card.
6 Binary-compatible with Intel Xeon processors v3 (Haswell), with the exception of Intel TSX (Transactional Synchronization Extensions).
7 Projected peak theoretical single-thread performance relative to the 1st-generation Intel Xeon Phi coprocessor 7120P.
8 Compared to the Intel Atom core (based on the Silvermont microarchitecture).
Copyright 2015, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Source:
31 AVX-512: A Common Instruction Set
KNL and the future Xeon architecture share a large set of instructions, but the sets are not identical
Subsets are represented by individual feature flags (CPUID)

  NHM: SSE*
  SNB: SSE*, AVX
  HSW: SSE*, AVX, AVX2
  Future Xeon Phi (KNL): SSE*, AVX, AVX2*, AVX-512F, AVX-512CD, AVX-512ER, AVX-512PR
  Future Xeon: SSE*, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL, MPX, SHA
  Common instruction set: SSE*, AVX, AVX2, AVX-512F, AVX-512CD

Copyright 2015, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Source:
32 Software Adaptation for Knights Landing
Key new features

Large impact: Intel AVX-512 instruction set
  Slightly different from the future Intel Xeon architecture's AVX-512 extensions
  Includes SSE, AVX, AVX2
  Apps built for HSW and earlier can run on KNL (few exceptions, like TSX)
  Incompatible with the 1st-generation Intel Xeon Phi (KNC)
Medium impact: new on-chip high-bandwidth memory (MCDRAM) creates heterogeneous (NUMA) memory access
  Can, however, also be used transparently
Minor impact: differences in floating-point execution/rounding due to FMA and new HW-accelerated transcendental functions, like exp()
Copyright 2015, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Source:
33 Sources
Intel Xeon Phi System Software Developer's Guide
Intel Xeon Phi Coprocessor Developer's Quick Start Guide
Intel Many Integrated Core Platform Software Stack
OpenCL Design and Programming Guide for the Intel Xeon Phi Coprocessor
Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual
An Overview of Programming for Intel Xeon Processors and Intel Xeon Phi Coprocessors
Using the Intel MPI Library on Intel Xeon Phi Coprocessor Systems
Understanding MPI Load Imbalance with Intel Trace Analyzer and Collector
A Guide to Vectorization with Intel C++ Compilers
Differences in Floating-Point Arithmetic between Intel Xeon Processors and the Intel Xeon Phi
Intel Xeon Phi Coprocessor Datasheet
Intel Many Integrated Core Symmetric Communications Interface (SCIF) User Guide
Intel HPC Code Modernization Workshop
Accelerate Discovery with Powerful New HPC Solutions
solution brief Accelerate Discovery with Powerful New HPC Solutions Achieve Extreme Performance across the Full Range of HPC Workloads Researchers, scientists, and engineers around the world are using
Towards OpenMP Support in LLVM
Towards OpenMP Support in LLVM Alexey Bataev, Andrey Bokhanko, James Cownie Intel 1 Agenda What is the OpenMP * language? Who Can Benefit from the OpenMP language? OpenMP Language Support Early / Late
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Kalray MPPA Massively Parallel Processing Array
Kalray MPPA Massively Parallel Processing Array Next-Generation Accelerated Computing February 2015 2015 Kalray, Inc. All Rights Reserved February 2015 1 Accelerated Computing 2015 Kalray, Inc. All Rights
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen
Xeon Phi Application Development on Windows OS
Chapter 12 Xeon Phi Application Development on Windows OS So far we have looked at application development on the Linux OS for the Xeon Phi coprocessor. This chapter looks at what types of support are
The Transition to PCI Express* for Client SSDs
The Transition to PCI Express* for Client SSDs Amber Huffman Senior Principal Engineer Intel Santa Clara, CA 1 *Other names and brands may be claimed as the property of others. Legal Notices and Disclaimers
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25
FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25 December 2014 FPGAs in the news» Catapult» Accelerate BING» 2x search acceleration:» ½ the number of servers»
Intel 64 and IA-32 Architectures Software Developer s Manual
Intel 64 and IA-32 Architectures Software Developer s Manual Volume 1: Basic Architecture NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of seven volumes: Basic Architecture,
Intel Xeon +FPGA Platform for the Data Center
Intel Xeon +FPGA Platform for the Data Center FPL 15 Workshop on Reconfigurable Computing for the Masses PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA
INTEL PARALLEL STUDIO XE EVALUATION GUIDE
Introduction This guide will illustrate how you use Intel Parallel Studio XE to find the hotspots (areas that are taking a lot of time) in your application and then recompiling those parts to improve overall
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 AGENDA The Kaveri Accelerated Processing Unit (APU) The Graphics Core Next Architecture and its Floating-Point Arithmetic
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
NVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Intel Threading Building Blocks (Intel TBB) 2.2. In-Depth
Intel Threading Building Blocks (Intel TBB) 2.2 In-Depth Contents Intel Threading Building Blocks (Intel TBB) 2.2........... 3 Features................................................ 3 New in This Release...5
Modernizing Servers and Software
SMB PLANNING GUIDE Modernizing Servers and Software Increase Performance with Intel Xeon Processor E3 v3 Family Servers and Windows Server* 2012 R2 Software Why You Should Read This Document This planning
OpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl [email protected] Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
Performance Analysis and Optimization Tool
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL [email protected] Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
Trends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
Using the Windows Cluster
Using the Windows Cluster Christian Terboven [email protected] aachen.de Center for Computing and Communication RWTH Aachen University Windows HPC 2008 (II) September 17, RWTH Aachen Agenda o Windows Cluster
Intel Xeon Processor E5-2600
Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset
Innovativste XEON Prozessortechnik für Cisco UCS
Innovativste XEON Prozessortechnik für Cisco UCS Stefanie Döhler Wien, 17. November 2010 1 Tick-Tock Development Model Sustained Microprocessor Leadership Tick Tock Tick 65nm Tock Tick 45nm Tock Tick 32nm
Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer?
Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer? Software Solutions Group Intel Corporation 2012 *Other brands and names are the
Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820
Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820 This white paper discusses the SQL server workload consolidation capabilities of Dell PowerEdge R820 using Virtualization.
This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
Infrastructure Matters: POWER8 vs. Xeon x86
Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report
The Engine for Digital Transformation in the Data Center
product brief The Engine for Digital Transformation in the Data Center Intel Xeon Processor E5-2600 v4 Product Family From individual servers and workstations to clusters, data centers, and clouds, IT
YALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu
Pexip Speeds Videoconferencing with Intel Parallel Studio XE
1 Pexip Speeds Videoconferencing with Intel Parallel Studio XE by Stephen Blair-Chappell, Technical Consulting Engineer, Intel Over the last 18 months, Pexip s software engineers have been optimizing Pexip
How To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Industry First X86-based Single Board Computer JaguarBoard Released
Industry First X86-based Single Board Computer JaguarBoard Released HongKong, China (May 12th, 2015) Jaguar Electronic HK Co., Ltd officially launched the first X86-based single board computer called JaguarBoard.
Parallel Computing. Parallel shared memory computing with OpenMP
Parallel Computing Parallel shared memory computing with OpenMP Thorsten Grahs, 14.07.2014 Table of contents Introduction Directives Scope of data Synchronization OpenMP vs. MPI OpenMP & MPI 14.07.2014
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
CMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing
