A quick tutorial on Intel's Xeon Phi Coprocessor

1 A quick tutorial on Intel's Xeon Phi Coprocessor: Architecture, Setup, Programming

2 The beginning of wisdom is the definition of terms. *

  Name: is a...; as opposed to...; just like...
  Xeon Phi: product series; as opposed to Xeon; just like Tesla, Quadro, GeForce.
  MIC (Many Integrated Core Architecture): microprocessor architecture; as opposed to Itanium, Nehalem, Sandy-Bridge, Atom; just like Tesla, Fermi, Kepler.
  5110P: shippable product (SKU); as opposed to 3110D, 3110P, or E5-2620; just like C1060, C2075, M2090.
  Knights Corner: chip code name; as opposed to Nehalem, Westmere, Sandy-Bridge, Ivy-Bridge, Lincroft, Cedarview; just like GF110, GK110.
  MPSS (Manycore Platform Software Stack): set of software (drivers, kernel modules, etc.); as opposed to N.A.; just like CUDA toolkit, Visual [...]

  * Socrates

3 Architecture: 60 cores

4 Architecture and core definition

  Xeon Phi core, 1 to 1.3 GHz:
    1 SPU: 1 double op/cycle; in-order architecture; x86 + MIC extensions; 4 hardware threads
    1 VPU: 32 float op/cycle; 16 double op/cycle; supports fused multiply-add; supports transcendentals; 4 clock latency; 4 hardware threads

  NVIDIA Kepler SMX, 735 to 745 MHz:
    192 SP CUDA cores: 2 double op/cycle; supports fused multiply-add
    64 DP units: 2 double op/cycle; supports fused multiply-add
    32 SFU units: 1 double op/cycle; supports transcendentals
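The per-core figures on this slide can be turned into a rough chip-level peak estimate: cores x operations per cycle per core x clock frequency. The sketch below assumes the 60 cores mentioned on slide 3 and the VPU rates quoted above. The function names are hypothetical, and the result is a theoretical upper bound, not a measured figure.

```c
#include <assert.h>
#include <stdio.h>

/* Theoretical peak in GFLOP/s = cores * ops per cycle per core * clock (GHz). */
double peak_gflops(int cores, double ops_per_cycle, double clock_ghz)
{
    assert(cores > 0 && ops_per_cycle > 0.0 && clock_ghz > 0.0);
    return cores * ops_per_cycle * clock_ghz;
}

/* Print the estimate at both ends of the slide's 1 to 1.3 GHz clock range. */
void print_phi_peak(void)
{
    const int cores = 60;        /* assumption: the 60-core chip of slide 3 */
    const double dp_ops = 16.0;  /* VPU: 16 double op/cycle */
    const double sp_ops = 32.0;  /* VPU: 32 float op/cycle  */
    const double clocks[2] = {1.0, 1.3};

    for (int i = 0; i < 2; ++i) {
        printf("%.1f GHz: %.0f GFLOP/s double, %.0f GFLOP/s float\n",
               clocks[i],
               peak_gflops(cores, dp_ops, clocks[i]),
               peak_gflops(cores, sp_ops, clocks[i]));
    }
}
```

At 1.0 GHz this gives 60 x 16 x 1.0 = 960 GFLOP/s in double precision. Reaching anything close to it requires every core to keep its vector unit busy on every cycle.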

5 Architecture and core definition: same comparison as slide 4, with the conclusion: a Xeon Phi core is much more complex than a CUDA core.

6 Architecture and core definition Nehalem core

7 Architecture and core definition: Nehalem core. But a Xeon Phi core is still far less complex than a Xeon core.

8 Architecture and core definition

  Two specificities:
  (1) In-order architecture with hardware multithreading --> needs multithreaded/multiprocess code
  (2) Huge vector processing unit --> needs vectorized code
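As an illustration of both requirements at once, here is a minimal sketch of a loop that is shared among threads and vectorized within each thread. It assumes a compiler supporting the OpenMP 4.0 `simd` construct. The `saxpy` function is an illustrative example, not code from the slides. Without OpenMP enabled the pragma is ignored and the loop runs serially.

```c
#include <assert.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   "parallel for" spreads iterations over the hardware threads,
   "simd" asks the compiler to use the vector unit inside each thread.
   restrict promises that x and y do not overlap, which helps vectorization. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);

    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Each iteration is one multiply and one add, which maps onto the fused multiply-add supported by the VPU.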

9 When working with Xeon Phi cores: think multithreading, think vectorization.

10 Setup

11 Accelerator mode vs Cluster mode

  [Diagram: host and Xeon Phi connections. Labels: GbE*, GbE**, PCIe, br0, /dev/ttymic0]
  * Host must route packets from Xeon Phi
  ** Or Infiniband with RDMA (OFED)

12 Accelerator mode vs Cluster mode

  Our Xeon Phi is installed in node mback40 of cluster manneback in 'accelerator mode'.
  * Host must route packets from Xeon Phi
  ** Or Infiniband with RDMA (OFED)

13 Slurm integration

14 Slurm integration

  The Xeon Phi is available as a so-called 'generic resource'.
  Within a job allocation, users have SSH access to the Xeon Phi.
  Mback40's scratch space is available from the Xeon Phi.

15 Slurm integration

  The Xeon Phi is available as a so-called 'generic resource'.
  Within a job allocation, users have SSH access to the Xeon Phi.
  Mback40's scratch space is available from the Xeon Phi.
  Note: you need a pair of corresponding SSH keys (id_rsa / id_rsa.pub) in your .ssh directory for this to work. The public key is copied to the Xeon Phi upon job startup.

16 Programming

17 Intel's advice: first optimize on Xeon, then port to Xeon Phi.

18 Execution models: OpenCL, Offload OpenMP, Offload MPI, MKL Offload mode, Intel MPI, Native OpenMP, Native Intel MPI

19 Execution models: CUDA, OpenCL, CuBLAS, Intel MPI, Native OpenMP, Native Intel MPI

20 Four programming models

21 Execution models vs programming models

  Execution models: Offload, Hybrid (symmetric), Native
  Programming models: OpenMP, MPI, MKL, OpenCL
  Rating of each combination: Easy / Bit more complex / Truly complex / Impossible

22 Native OpenMP

23 Native OpenMP Simple OpenMP Hello world program
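The program shown on this slide is not reproduced in the text. A minimal sketch of what a simple OpenMP hello world typically looks like is given below. The function name `hello_openmp` and the message wording are assumptions. The `_OPENMP` guard lets the same source build without OpenMP support, in which case a single thread reports.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Every thread of the parallel region prints one line.
   Returns how many threads reported. */
int hello_openmp(void)
{
    int greeted = 0;

    #pragma omp parallel
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
#else
        int tid = 0, nthreads = 1;   /* serial fallback */
#endif
        printf("Hello world from thread %d of %d\n", tid, nthreads);

        #pragma omp atomic
        greeted++;
    }

    assert(greeted >= 1);  /* at least the initial thread must have reported */
    return greeted;
}
```

Calling `hello_openmp()` from `main` gives a complete program. In native mode the source itself does not change, only the compilation target does: with the Intel compiler of that period, native Xeon Phi binaries were built by adding the `-mmic` flag.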

24 Native OpenMP

  Classical compilation for Xeon
  Compilation for Xeon Phi
  Code transfer through micnativeloadex
  Code transfer through SSH
  Summary: compile on the host, run on the Xeon Phi.

25 Offload OpenMP

26 Offload OpenMP Same program with offload pragmas
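The slide's code is not reproduced in the text. As a minimal sketch of the offload style, assuming Intel's `#pragma offload target(mic)` language extension, the function below marks one block for execution on the coprocessor. The function `offload_sum` and its variables are illustrative, not the author's program. The `__MIC__` macro is defined only when the block is compiled for the coprocessor.

```c
#include <assert.h>
#include <stdio.h>

/* Sum n floats. With Intel's compiler and a coprocessor present, the marked
   block runs on the Xeon Phi. Compilers that do not know the offload pragma
   ignore it and the block runs on the host. */
float offload_sum(float *a, int n)
{
    float sum = 0.0f;
    int on_mic = 0;

    assert(n >= 0);

    /* in(): copy a[0..n-1] to the device; inout(): copy scalars both ways */
    #pragma offload target(mic) in(a : length(n)) inout(sum, on_mic)
    {
#ifdef __MIC__
        on_mic = 1;   /* only set in the coprocessor build of this block */
#endif
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += a[i];
    }

    printf("sum computed on %s\n", on_mic ? "the Xeon Phi" : "the host");
    return sum;
}
```

The program is compiled and launched on the host. Only the block under the pragma is shipped to the coprocessor, together with the data named in the `in` and `inout` clauses.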

27 Offload OpenMP

  Classical compilation
  Offloaded sections run on the Xeon Phi
  Same code runs on Xeon flawlessly when no Xeon Phi is available
  Summary: compile on the host, launch on the host, offload to Xeon Phi.

28 Hybrid OpenMP

29 Hybrid OpenMP This section will run on the host...

30 Hybrid OpenMP ... in parallel with that section, which will run on the Xeon Phi

31 Hybrid OpenMP

  You get some threads on the host and some others on the Phi.
  Summary: compile on the host, run some on the host, offload some to Phi.
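The code for slides 29 to 31 is not reproduced in the text. One way to obtain a host section running in parallel with an offloaded section is to combine OpenMP `sections` with Intel's offload pragma, as sketched below. The use of `sections`, the function `hybrid_sum`, and the half/half split are assumptions, not necessarily the author's approach.

```c
#include <assert.h>

/* Sum n floats: the first half on the host, the second half in an
   offloaded block, with the two sections running concurrently. */
float hybrid_sum(float *a, int n)
{
    assert(n >= 0);

    int half = n / 2;
    int rest = n - half;
    float *second = a + half;     /* start of the part sent to the Phi */
    float host_sum = 0.0f, phi_sum = 0.0f;

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            /* This section runs on the host... */
            for (int i = 0; i < half; ++i)
                host_sum += a[i];
        }

        #pragma omp section
        {
            /* ...in parallel with this one, offloaded to the Xeon Phi
               when Intel's compiler and a coprocessor are available. */
            #pragma offload target(mic) in(second : length(rest)) inout(phi_sum)
            {
                #pragma omp parallel for reduction(+ : phi_sum)
                for (int i = 0; i < rest; ++i)
                    phi_sum += second[i];
            }
        }
    }

    return host_sum + phi_sum;
}
```

Each section writes its own accumulator, so the two sides never touch the same variable. The partial results are combined only after both sections have finished.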

32 Native Intel MPI

33 Native Intel MPI: Simple MPI hello world program

34 Native Intel MPI: Compile on the host, run on Xeon Phi

35 Hybrid Intel MPI

36 Hybrid Intel MPI: Same MPI hello world program

37 Hybrid Intel MPI

  Compile once for the host and once for the Xeon Phi.
  Add the Xeon Phi to the machine file.
  You get 2 processes on the host and 2 others on the Phi.
  Summary: compile on the host, run some on the host, offload some to Phi.

38 Offload Intel MPI: Hybrid OpenMP/MPI hello world program with offload sections

39 Offload Intel MPI: You get 2 MPI processes on the host, each offloading 4 OMP threads to the Xeon Phi

40 Native MKL

41 Native MKL: Simple SGEMM usage (the remainder of the code is not shown; it handles parameter parsing, matrix creation, initialization, etc.)
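The SGEMM call itself is not reproduced in the text. The sketch below shows what such a call computes, C = alpha * A * B + beta * C in single precision. The `USE_MKL` build switch and the wrapper name `sgemm_rowmajor` are hypothetical. When the switch is defined the wrapper calls MKL's `cblas_sgemm`, otherwise it falls back to a naive triple loop that is only a reference, not an optimized kernel.

```c
#include <assert.h>
#ifdef USE_MKL            /* hypothetical build switch, not from the slides */
#include <mkl.h>
#endif

/* C = alpha * A * B + beta * C with row-major storage.
   A is m x k, B is k x n, C is m x n. */
void sgemm_rowmajor(int m, int n, int k, float alpha,
                    const float *A, const float *B, float beta, float *C)
{
    assert(m >= 0 && n >= 0 && k >= 0);
#ifdef USE_MKL
    /* Leading dimensions of row-major, non-transposed operands:
       lda = k, ldb = n, ldc = n. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
#else
    /* Naive reference implementation, used when MKL is not available. */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
#endif
}
```

Keeping the call behind a single wrapper means the calling code is the same whether the multiplication runs natively on the Xeon Phi, on the host, or is offloaded by the library.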

42 Native MKL Compile on the host, run on Xeon Phi

43 Automatic offload MKL

44 Automatic offload MKL: Same simple SGEMM usage (no change)

45 Automatic offload MKL

  Allow MKL to use the Xeon Phi and be verbose about offloading.
  Half the work is done by the host, the other half by the Phi.
  Summary: compile on the host, run some on the host, offload some to Phi.

46 Automatic offload MKL: When the data are too small, the Xeon Phi is not used (transfers would cost proportionally too much).

47 When working with them: porting should be easy; hybrid is doable.
