Programming the Intel Xeon Phi Coprocessor Tim Cramer cramer@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ)
Agenda
- Motivation
- Many Integrated Core (MIC) Architecture
- Programming Models: Native, Language Extension for Offload (LEO), Symmetric (with Message Passing)
- Debugging
- Optimization
- Case Study: CG Solver
- RWTH MIC Environment
Motivation
- Demand for more compute power
- Reach higher performance with more threads
- Power consumption: better performance/watt ratio
- GPUs are one alternative, but CUDA is hard to learn / program
- Intel Xeon Phi can be programmed with established programming paradigms like OpenMP, MPI and Pthreads
[Figure: peak performance over number of threads — Xeon Phi peak vs. Xeon peak. Source: James Reinders, Intel]
What is the Intel Xeon Phi?
When to Use the Intel Xeon Phi? Xeon Phi is not intended to replace Xeon -> choose the right vehicle
Architecture (1/2)
Intel Xeon Phi Coprocessor:
- 1 x Intel Xeon Phi @ 1090 MHz
- 60 cores (in-order), ~1 TFLOPS DP peak
- 4 hardware threads per core
- 8 GB GDDR5 memory
- 512-bit SIMD vectors (32 registers)
- Fully coherent L1 and L2 caches
- Plugged into the PCI Express bus
[Figure: core block diagram — instruction decode, scalar unit with scalar registers, vector unit with vector registers, 32k/32k L1 instruction/data cache, 512k L2 cache, ring interconnect. Source: Intel]
Architecture (2/2)
[Figure: ring interconnect — cores, each with private L1 and L2 caches, connected by a bidirectional ring network to each other and to the memory & I/O interface with its GDDR5 memory channels]
Xeon Phi Nodes at RWTH Aachen
[Figure: one compute node contains two host systems (Xeon, 8 cores @ 2 GHz, 16 GB DDR3, coupled via QPI) and two MIC systems (Xeon Phi, 60 cores @ 1 GHz, 8 GB GDDR5) attached via PCI Express; host and coprocessor use shared virtual memory]
Programming Models
- Host only: main(), foo(), MPI_*() run on the host CPU
- Language Extension for Offload (LEO): the host program offloads marked regions (#pragma offload) across the PCI Express bus to the Xeon Phi
- Symmetric: MPI processes run on both the host and the coprocessor
- Coprocessor only (native): the entire program runs on the Xeon Phi; additional compiler flag -mmic
Coprocessor only (native)
- Cross-compile for the coprocessor: the instruction set of the CPU and the coprocessor is similar, but not identical
- Easy: just add -mmic, log in with ssh and execute
- OpenMP, POSIX threads, OpenCL and MPI are usable
- Analyze the benefit for hotspots
- Drawbacks: very slow I/O, poor single-thread performance, and the host CPU will be idle
- Only suitable for highly parallelized / scalable codes
Language Extension for Offload (LEO)
- Add pragmas, similar to OpenMP or OpenACC
  C/C++: #pragma offload target(mic:device_id)
  Fortran: !dir$ offload target(mic:device_id)
  Statements in this scope can be executed on the MIC (not guaranteed!)
- Variable & function definitions
  C/C++: __attribute__((target(mic)))
  Fortran: !dir$ attributes offload:mic :: <func-/varname>
  Compiles the function for, or allocates the variable on, both the CPU and the MIC
- Mark entire files or large blocks of code (C/C++ only):
  #pragma offload_attribute(push, target(mic))
  ...
  #pragma offload_attribute(pop)
LEO - Data Transfer
- Host CPU and MIC share neither physical nor virtual memory
- Implicit copy: scalar variables, static arrays
- Explicit copy: the programmer designates variables to be copied between host and card; specify in, out, inout or nocopy for the data (nocopy is deprecated, use in(data:length(0)) instead)
- Data transfer with an offload region
  C/C++: #pragma offload target(mic) in(data:length(size))
  Fortran: !dir$ offload target(mic) in(data:length(size))
- Data transfer without an offload region
  C/C++: #pragma offload_transfer target(mic) in(data:length(size))
  Fortran: !dir$ offload_transfer target(mic) in(data:length(size))
LEO - First Simple Example
Data transfer for an offload region.
C/C++:
  #pragma offload target(mic) out(a:length(count)) in(b:length(count))
  for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
  }
Fortran:
  !dir$ offload begin target(mic) out(a) in(b)
  do i = 1, count
      a(i) = b(i) * c + d
  end do
  !dir$ end offload
LEO - Second Simple Example
Compile functions / subroutines for both the CPU and the coprocessor. The attribute/directive is needed both within the calling routine's scope and for the function definition itself.
C/C++:
  __attribute__((target(mic)))
  void foo() {
      printf("Hello MIC\n");
  }
  int main() {
      #pragma offload target(mic)
      foo();
      return 0;
  }
Fortran:
  !dir$ attributes offload:mic :: hello
  subroutine hello
      write(*,*) "Hello MIC"
  end subroutine
  program main
      !dir$ attributes offload:mic :: hello
      !dir$ offload begin target(mic)
      call hello()
      !dir$ end offload
  end program
LEO - Managing Memory Allocation (1/3)
- Memory management for pointer variables on the CPU is still up to the programmer; on the coprocessor it is done automatically for in, inout and out clauses
- Input / output pointers: by default, fresh memory is allocated on the coprocessor for each pointer variable and de-allocated after the offload region; use the alloc_if and free_if qualifiers to modify these defaults
- Data transfer into pre-allocated memory: retain the target memory by setting free_if(0) / free_if(.false.); reuse the data in a subsequent offload by setting alloc_if(0) / alloc_if(.false.)
- Important: on systems with multiple coprocessors, always specify the target number, e.g. #pragma offload target(mic:0)
LEO - Managing Memory Allocation (2/3)
Example in C:
  // Define macros to make the modifiers more understandable
  #define ALLOC  alloc_if(1)
  #define FREE   free_if(1)
  #define RETAIN free_if(0)
  #define REUSE  alloc_if(0)
  // Allocate (default) the memory, but do not de-allocate
  #pragma offload target(mic:0) in(a:length(8) ALLOC RETAIN)
  // Do not allocate or de-allocate the memory
  #pragma offload target(mic:0) in(a:length(8) REUSE RETAIN)
  // Do not allocate the memory, but de-allocate (default)
  #pragma offload target(mic:0) in(a:length(0) REUSE FREE)
LEO - Managing Memory Allocation (3/3)
Example in Fortran:
  ! Compiler allocates and frees data around the offload
  real, dimension(8) :: a
  ! Allocate (default) the memory, but do not de-allocate
  !dir$ offload target(mic:0) in(a:length(8) alloc_if(.true.) free_if(.false.))
  ! Do not allocate or de-allocate the memory
  !dir$ offload target(mic:0) in(a:length(8) alloc_if(.false.) free_if(.false.))
  ! Do not allocate the memory, but de-allocate (default)
  !dir$ offload target(mic:0) in(a:length(0) alloc_if(.false.) free_if(.true.))
LEO - Allocation for Parts of Arrays (1/3)
Allocation of array slices is possible.
C/C++:
  int *p;
  // 1000 elements allocated on the coprocessor; data transferred into p[10:100]
  #pragma offload in( p[10:100] : alloc(p[5:1000]) )
  { ... }
- The alloc(p[5:1000]) modifier allocates 1000 elements on the coprocessor (conceptual array length 1005; elements p[0:5] are not allocated); the first usable element has index 5, the last index 1004
- p[10:100] specifies 100 elements to transfer
LEO - Allocation for Parts of Arrays (2/3)
Fortran:
  integer :: p(1005)
  ! 1000 elements allocated on the coprocessor; data transferred into p(11:110)
  !dir$ offload in( p(11:110) : alloc(p(6:1005)) )
- The alloc(p(6:1005)) modifier allocates 1000 elements on the coprocessor (array length 1005; elements p(1:5) are not allocated); the first usable element has index 6, the last index 1005
- p(11:110) specifies 100 elements to transfer
LEO - Allocation for Parts of Arrays (3/3)
Also possible for multi-dimensional arrays.
C/C++ (Fortran analogous):
  int (*p)[4];
  #pragma offload in( p[2][:] : alloc(p[1:4][:]) )
  { ... }
- The alloc(p[1:4][:]) modifier allocates 16 elements on the coprocessor within a 5x4 shape; the first row is not allocated
- Only row 2 is transferred to the coprocessor
LEO - Partial Memory Copies
- Moving data from one variable to another: copying data from an array on the CPU into a different array on the MIC is possible
- An overlapping copy is undefined
  INTEGER :: P(1000), P1(2000)
  INTEGER :: RANK1(1000), RANK2(10,100)
  ! Partial copy
  !dir$ OFFLOAD IN (P(1:500) : INTO (P1(501:1000)))
  ! Overlapping copy; result undefined
  !dir$ OFFLOAD IN (P(1:600) : INTO (P1(1:600))) IN (P(601:1000) : INTO (P1(100:499)))
LEO and OpenMP
- Use OpenMP directives inside an offload region
- Fortran: the omp keyword on the offload directive is optional; when omp is present, the next line must be an OpenMP directive, otherwise the next line can also be a call or assignment statement
C/C++:
  #pragma offload target(mic)
  #pragma omp parallel for
  for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
  }
Fortran:
  !dir$ omp offload target(mic)
  !$omp parallel do
  do i = 1, count
      A(i) = B(i) * c + d
  end do
  !$omp end parallel do
- Number of threads: set with environment variables (OMP_NUM_THREADS=16, MIC_ENV_PREFIX=MIC, MIC_OMP_NUM_THREADS=120) or with the API omp_set_num_threads_target(TARGET_TYPE target_type, int target_number, int num_threads)
LEO - OpenMP Affinity (1/2)
- The coprocessor has 4 hardware threads per core; to saturate the coprocessor, at least 2 of them have to be utilized
- Affinity strategies: KMP_AFFINITY propagates to the coprocessor
- New placement strategy for MIC: balanced — the runtime places threads on separate cores until every core has at least one thread (similar to scatter), and ensures that threads are close to each other when multiple hardware threads per core are needed
- balanced is not supported on the CPU and might produce a warning there; to avoid the warning, use MIC_ENV_PREFIX=MIC and then set MIC_KMP_AFFINITY for the balanced placement strategy
LEO - OpenMP Affinity (2/2)
Example on a 3-core system (hardware threads HT0-HT3 per core):
  compact  (6 threads): core 0 -> 0,1,2,3; core 1 -> 4,5; core 2 -> idle
  scatter  (6 threads): core 0 -> 0,3; core 1 -> 1,4; core 2 -> 2,5
  balanced (6 threads): core 0 -> 0,1; core 1 -> 2,3; core 2 -> 4,5
  balanced (9 threads): core 0 -> 0,1,2; core 1 -> 3,4,5; core 2 -> 6,7,8
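The environment setup described above can be sketched as a small launch script. This is an illustrative sketch, not the course's official script: the thread counts are example values, and `./main` stands for any offload binary. With MIC_ENV_PREFIX=MIC, the runtime forwards the MIC_* variables to the coprocessor while the host keeps its own settings.

```shell
# Host-side OpenMP settings
export OMP_NUM_THREADS=16          # threads on the host CPUs
export KMP_AFFINITY=compact        # host placement strategy

# Coprocessor-side settings, forwarded via the MIC_ prefix
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=120     # threads on the Xeon Phi
export MIC_KMP_AFFINITY=balanced   # MIC-only strategy, avoids the host warning

# ./main                           # offload regions now pick up the MIC_* values
```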
Using Native Libraries on MIC
- Vendor / standard libraries: standard libraries are available with no need for special syntax or runtime features; some common libraries (MKL, OpenMP) are also available
- Using native libraries in offload regions: to use a shared library on the MIC, build it twice (for CPU and for MIC using -mmic), transfer the native library to the device (e.g., using ssh), and use MIC_LD_LIBRARY_PATH to point to your own target library on the MIC
LEO - Static Libraries in Offloaded Code
Creating your own libraries:
- Use xiar (or the equivalent xild -lib) to create a static archive library containing routines with offload code
- Specify -qoffload-build, which causes xiar to create both a library for the CPU (lib.a) and one for the coprocessor (libmic.a)
- The same options as for ar are available: xiar -qoffload-build <ar options> archive [member...]
- Use the linker options -Lpath and -llibname; the compiler will automatically incorporate the corresponding coprocessor library (libmic.a)
Example — create libsample.a and libsamplemic.a:
  xiar -qoffload-build rcs libsample.a obj1.o obj2.o
Linking:
  ifort myprogram.f90 libsample.a
  ifort myprogram.f90 -lsample
LEO - Compiler Command-Line Options
Offload-specific arguments for the Intel compiler:
- Activate compiler reports for the offload: -opt-report-phase:offload
- Deactivate offload support: -no-offload
- Build all functions and variables for both the host and the MIC (can be very useful, especially for big codes with many function calls): -offload-attribute-target=mic
- Set MIC-specific options for the compiler or linker: -offload-option,mic,tool,"option-list"
Example:
  icc -g -O2 -lmiclib -xavx -offload-option,mic,compiler,"-O3" -offload-option,mic,ld,"-L/home/user/miclib" foo.c
LEO - Dealing with Multiple MICs
- In source code: include offload.h (#include <offload.h>); determine the number of coprocessors in a system with _Offload_number_of_devices(); determine the coprocessor on which a program is running with _Offload_get_device_number()
- At runtime: restrict the devices that can be used for offload, e.g. OFFLOAD_DEVICES=1
- Memory management: when using alloc_if or free_if, always specify the target device
LEO - MIC-Specific Compiler Macros
The compiler defines macros for:
- Compiler-dependent code: __INTEL_OFFLOAD
- MIC-specific code: __MIC__
Example:
  #ifdef __INTEL_OFFLOAD
  #include <offload.h>
  #endif
  int main() {
  #ifdef __INTEL_OFFLOAD
      printf("%d MICs available\n", _Offload_number_of_devices());
  #endif
  #pragma offload target(mic)
      {
  #ifdef __MIC__
          printf("hello MIC number %d\n", _Offload_get_device_number());
  #else
          printf("hello HOST\n");
  #endif
      }
  }
LEO - I/O in Offloaded Regions
- stdout and stderr in offloaded code: writes (e.g., printf) performed in offloaded code may be buffered, so output data may be lost when directed to a file ($ ./a.out > log.txt); I/O to a file requires an additional fflush(0) on the coprocessor
- File I/O: all processes on the MIC are started as micuser on the device; at the moment no file I/O is possible within offloaded regions
LEO - Asynchronous Computation
- A default offload causes the CPU thread to wait for completion
- An asynchronous offload initiates the offload and continues immediately with the next statement: use the signal clause to initiate, and the offload_wait pragma to wait for completion
- These constructs always refer to a specific target; querying a signal before initiation will cause a runtime abort
Example:
  char signal_var;
  do {
      #pragma offload target(mic:0) signal(&signal_var)
      {
          long_running_mic_compute();
      }
      concurrent_cpu_activity();
      #pragma offload_wait target(mic:0) wait(&signal_var)
  } while (1);
Symmetric Execution (1/2)
Using MPI on the MIC in general:
- MPI is possible on MICs and on host CPUs (communication via PCIe and InfiniBand)
- Host and MIC binaries are not compatible: one binary for the coprocessor and one for the host CPU are needed (e.g., main and main.mic); compile the code twice, using -mmic for the coprocessor binary
- Be aware of load balancing
- Hybrid codes using OpenMP and MPI can be very useful to avoid hundreds of processes on the relatively small coprocessor
[Figure: hosts 0..n, each with MIC 0 and MIC 1, connected via PCIe and InfiniBand]
Symmetric Execution (2/2)
Using MPI on the MIC on the RWTH Aachen Compute Cluster:
- At the moment there is a special module in our environment (intelmpi/4.1mic, in BETA); ssh is wrapped automatically
- Environment variables are set: I_MPI_MIC=enabled, I_MPI_MIC_POSTFIX=.mic
- Execution expects a main and a main.mic
  $ mpirun -machinefile hostfile -n 512 ./main
  hostfile:
    linuxphic01:16
    linuxphic02:16
    linuxphic01-mic0:120
    linuxphic01-mic1:120
    linuxphic02-mic0:120
    linuxphic02-mic1:120
Debugging
The latest TotalView works (load totalview/8.11.0-2):
  icc -g -offload-option,mic,compiler,"-g" -o main main.c
Optimizing for LEO
- Measure timing using the offload report: set OFFLOAD_REPORT or use the Offload_report API to activate it
  OFFLOAD_REPORT=1: report about the time taken
  OFFLOAD_REPORT=2: additionally reports the amount of data transferred between the CPU and the coprocessor
  OFFLOAD_REPORT=3: additional details on offload activity
- Conditional offload for LEO: #pragma offload if(condition) — only offload if it is worth it
  #pragma offload target(mic) in(b:length(size)) out(a:length(size)) if(size > 100)
  for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
  }
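A minimal sketch of enabling the offload report from the shell before a run; `./main` is a placeholder for any offload binary, and level 2 is chosen here as an example value.

```shell
# Request level-2 offload reporting: timing plus the amount of data
# transferred for every offload region executed by the binary.
export OFFLOAD_REPORT=2
# ./main        # each offload region now prints host/MIC time and data volume
echo "OFFLOAD_REPORT=$OFFLOAD_REPORT"
```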
Vectorization (1/3) - SIMD Vector Capabilities
SIMD = Single Instruction, Multiple Data
- SSE (Streaming SIMD Extensions): 128-bit vectors — 2 x DP / 4 x SP
- AVX (Advanced Vector Extensions): 256-bit vectors — 4 x DP / 8 x SP
- MIC (Many Integrated Core architecture): 512-bit vectors — 8 x DP / 16 x SP
Vectorization (2/3) - SIMD Vector Basic Arithmetic
[Figure: 512-bit vector addition — eight DP lanes; destination[i] = a[i] + b[i] for i = 0..7, computed in a single instruction]
Vectorization (3/3) - SIMD Fused Multiply-Add
[Figure: 512-bit fused multiply-add — eight DP lanes; destination[i] = a[i] * b[i] + c[i] for i = 0..7, computed in a single instruction]
Compiler Support for SIMD Vectorization
- Intel auto-vectorizer: a combination of loop unrolling and SIMD instructions to get vectorized loops; no guarantee given, the compiler might need some hints
- Compiler feedback: use -vec-report[n] to control the diagnostic output of the vectorizer; n can be between 0 and 5 (3 or 5 recommended); concentrate on hotspots for optimization
- C/C++ aliasing: use the restrict keyword
- Intel-specific pragmas: #pragma vector (Fortran: !DIR$ VECTOR) indicates to the compiler that the loop should be vectorized; #pragma simd (Fortran: !DIR$ SIMD) enforces vectorization of the (innermost) loop
- SIMD support will be added in OpenMP 4.0
- Refer to the lab exercises from Monday
Alignment
Use efficient memory accesses: the MIC architecture requires all data accesses to be properly aligned according to their size. For better performance, align data to 16-byte boundaries for SSE instructions, 32-byte boundaries for AVX instructions, and 64-byte boundaries for MIC instructions. Use
  #pragma offload in(a:length(count) align(64))
SoA vs. AoS
Use Structure of Arrays (SoA) instead of Array of Structures (AoS).
Color structure:
  struct Color {   // AoS
      float r;
      float g;
      float b;
  };
  Color *c;        // memory layout: R G B R G B R G B ...
  struct Colors {  // SoA
      float *r;
      float *g;
      float *b;
  };               // memory layout: R R R ... G G G ... B B B ...
Detailed information: Intel Vectorization Guide, http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/
Performance Expectations
"WOW, 240 hardware threads on a single chip! My application will just rock!" — You really believe that? Remember the limitations:
- In-order cores, limited hardware prefetching
- Running at only ~1 GHz
- Small caches (2 levels)
- Poor single-thread performance
- Small main memory
- PCIe as bottleneck, plus offload overhead
Case Study: CG Solver
- The CG solver solves linear systems (A*x = b); the runtime is dominated by sparse matrix-vector multiplication
- OpenMP first-touch: optimal distribution (no static schedule)
- Test case: Fluorem/HV15R, N = 2,017,169, nnz = 283,073,458, 3.2 GB memory footprint; one Xeon Phi vs. 16 (!) Nehalem EX sockets
  System                #Threads  Serial time [s]  Parallel time [s]  Speedup
  Xeon Phi (61 cores)   244       2387.40          32.24              74
  BCS (128 cores)       128       1176.81          18.10              65
- As expected, BCS is faster, but the results on the Xeon Phi are pretty good
- Experiences with other real-world applications are worse at the moment
RWTH MIC Environment
- The MIC cluster is brand-new and in beta stage; all configurations might change in the future
- Filesystem: HOME and WORK mounted in /{home,work}/<timid>; local filesystem in /michome/<timid>; software mounted in /sw on the device
- Interactive usage: Linux is running on the device, but many features are missing; no modules are available on the device; only one compiler version and one MPI version are supported
- MPI: special module in BETA which wraps ssh and sets up the specific environment; uses ssh keys in $HOME/.ssh/mic; MPI will not be possible interactively in the production stage (only in batch mode)
The End
Exercises
Use linuxphic{01,03,04,05,06} (not linuxphic02!).
[Table: seating plan assigning rows 1-7 (front to back) to the hosts linuxphic01, linuxphic03, linuxphic04, linuxphic05 and linuxphic06]