A quick tutorial on Intel's Xeon Phi Coprocessor
damien.francois@uclouvain.be, www.cism.ucl.ac.be
Outline: Architecture, Setup, Programming
"The beginning of wisdom is the definition of terms." Socrates (470-399 B.C.)

Name | Is a... | As opposed to... | Just like...
Xeon Phi | Product series | Xeon | Tesla, Quadro, GeForce
MIC (Many Integrated Core) | Microprocessor architecture | Itanium, Nehalem, Sandy-Bridge, Atom | Tesla, Fermi, Kepler
5110P | Shippable product (SKU) | 3110D, 3110P, or E5-2620 | C1060, C2075, M2090
Knights Corner | Chip code name | Nehalem, Westmere, Sandy-Bridge, Ivy-Bridge, Lincroft, Cedarview | GF110, GK110
MPSS (Manycore Platform Software Stack) | Set of software (drivers, kernel modules, etc.) | N.A. | CUDA toolkit
Architecture: 60 cores
Architecture and core definition

Xeon Phi core (1 to 1.3 GHz):
- In-order architecture, x86 + MIC extensions
- 4 hardware threads
- 1 SPU: 1 double op/cycle
- 1 VPU: 32 float op/cycle, 16 double op/cycle; supports fused mult-add and transcendentals; 4-clock latency

NVIDIA Kepler SMX (735 to 745 MHz):
- 192 SP CUDA cores: 2 op/cycle, support fused mult-add
- 64 DP units: 2 double op/cycle, support fused mult-add
- 32 SFU units: 1 op/cycle, support transcendentals

A Xeon Phi core is much more complex than a CUDA core.
Architecture and core definition

Compare with a Nehalem core: a Xeon Phi core is still far less complex than a Xeon core.
Architecture and core definition

Two distinguishing features:
(1) an in-order architecture with hardware multithreading --> you need multithreaded/multiprocessed code;
(2) a huge vector processing unit --> you need vectorized code.
When working with Xeon Phis: think multithreading, think vectorization.
Setup
Accelerator mode vs Cluster mode

[Network diagram: the host and the Xeon Phi are linked over PCIe (/dev/ttymic0) and a bridged virtual interface (host 10.10.10.40, coprocessor 10.10.10.41); the host's GbE interface (172.31.1.1) connects it to the cluster. In accelerator mode the host must route packets from the Xeon Phi; in cluster mode the coprocessor is reachable directly over GbE, or over InfiniBand with RDMA (OFED).]

Our Xeon Phi is installed in node mback40 of cluster manneback, in 'accelerator mode'.
Slurm integration
The Xeon Phi is exposed as a so-called 'generic resource'. Within a job allocation, users have SSH access to the Xeon Phi, and mback40's scratch space is available from the Xeon Phi. You need a pair of corresponding SSH keys (id_rsa / id_rsa.pub in your .ssh directory) for this to work: the public key is copied to the Xeon Phi upon job startup.
Programming
Intel's advice: first optimize on the Xeon, then port to the Xeon Phi.
Execution models

On the Xeon Phi: offload mode (OpenMP offload, MPI offload, MKL offload, OpenCL) and native mode (native OpenMP, native Intel MPI).

On a GPU, the counterparts (CUDA, OpenCL, CuBLAS) cover only the offload side; there is no native mode.
4 programming models, 3 execution models

Execution models: Offload, Hybrid (symmetric), Native.
Programming models: OpenMP, MPI, MKL, OpenCL.
Depending on the combination, the port is easy, a bit more complex, truly complex, or impossible.
Native OpenMP
A simple OpenMP hello world program:
Classical compilation for the Xeon; cross-compilation for the Xeon Phi, with the binary transferred through micnativeloadex or through SSH. Compile on the host, run on the Xeon Phi.
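The commands could look like the following sketch; the source file name `hello.c` and the coprocessor hostname `mic0` are assumptions, and `-mmic` is the Intel compiler's cross-compilation flag for the coprocessor:

```shell
# Cross-compile for the coprocessor with the Intel compiler
icc -openmp -mmic hello.c -o hello.mic

# Option 1: let micnativeloadex upload and run the binary
micnativeloadex ./hello.mic

# Option 2: copy the binary over and run it through SSH
scp hello.mic mic0:
ssh mic0 ./hello.mic
```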
Offload OpenMP
The same program, with offload pragmas added:
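With Intel's Language Extensions for Offload, the offload version could look like this sketch (compilers that do not know the `offload` pragma ignore it with a warning, which is why the same source also runs on a plain Xeon):

```c
#include <stdio.h>

int main(void)
{
    /* The offload pragma ships the block below to the coprocessor;
     * without a Xeon Phi, the block simply runs on the host. */
#pragma offload target(mic)
#pragma omp parallel
    printf("Hello world from an offloaded parallel region\n");
    return 0;
}
```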
Classical compilation; the offloaded sections run on the Xeon Phi. The same code runs flawlessly on a Xeon when no Xeon Phi is available. Compile on the host, launch on the host, offload to the Xeon Phi.
Hybrid OpenMP
One section runs on the host... in parallel with another section that runs on the Xeon Phi. You get some threads on the host and some others on the Phi. Compile on the host, run some on the host, offload some to the Phi.
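A sketch of that structure, using OpenMP sections with one of them offloaded (an illustration, not the slide's exact listing):

```c
#include <stdio.h>

int main(void)
{
#pragma omp parallel sections
    {
#pragma omp section
        {
            /* This section stays on the host */
            printf("Hello from the host\n");
        }
#pragma omp section
        {
            /* This section is shipped to the coprocessor; its inner
             * parallel region spawns its threads there */
#pragma offload target(mic)
#pragma omp parallel
            printf("Hello from the Xeon Phi\n");
        }
    }
    return 0;
}
```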
Native Intel MPI
A simple MPI hello world program:
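The listing itself is not in the extracted text; a minimal MPI hello world of the usual shape would be:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* The processor name shows whether the rank runs on the host
     * or on the coprocessor. */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```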
Compile on the host, run on the Xeon Phi.
Hybrid Intel MPI
The same MPI hello world program.
Compile once for the host and once for the Xeon Phi, and add the Xeon Phi to the machine file: you get 2 processes on the host and 2 others on the Phi. Compile on the host, run some on the host, run some on the Phi.
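One way to launch such a run is Intel MPI's MPMD syntax; this is a sketch in which the binary names and the coprocessor hostname `mic0` are assumptions (mback40 is the host node named earlier):

```shell
# One binary per architecture
mpiicc hello.c -o hello.host
mpiicc -mmic hello.c -o hello.mic

# Enable MPI on the coprocessor, then place 2 ranks on each side
export I_MPI_MIC=enable
mpirun -n 2 -host mback40 ./hello.host : -n 2 -host mic0 ./hello.mic
```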
Offload Intel MPI

A hybrid OpenMP/MPI hello world program with offload sections:
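A sketch of that combination, matching the 2-process / 4-thread layout described on the next slide (an illustration, not the original listing):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI rank runs on the host and offloads its OpenMP
     * parallel region to the coprocessor. */
#pragma offload target(mic)
#pragma omp parallel num_threads(4)
    printf("Rank %d says hello from the Xeon Phi\n", rank);

    MPI_Finalize();
    return 0;
}
```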
You get 2 MPI processes on the host, each offloading 4 OpenMP threads to the Xeon Phi.
Native MKL
Simple SGEMM usage (the rest of the code, not shown, handles parameter parsing, matrix creation, initialization, etc.).
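The core of such a program could look like the following sketch, using MKL's standard CBLAS interface; the matrix order `n` and the constant initialization stand in for the parameter parsing and setup the slide leaves out:

```c
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const int n = 2048;   /* matrix order; an arbitrary example size */
    float *A = malloc((size_t)n * n * sizeof *A);
    float *B = malloc((size_t)n * n * sizeof *B);
    float *C = malloc((size_t)n * n * sizeof *C);

    for (int i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

    printf("C[0][0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```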
Compile on the host, run on the Xeon Phi.
Automatic offload MKL
The same simple SGEMM usage (no change to the code).
Allow MKL to use the Xeon Phi and be verbose about offloading: half the work is done by the host, the other half by the Phi. Compile on the host, run some on the host, offload some to the Phi.
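This is controlled entirely through environment variables; a sketch (the source and binary names are assumptions):

```shell
# Compile exactly as before: no source change, no -mmic
icc -mkl sgemm.c -o sgemm

# Let MKL offload part of the work and report what it offloads
export MKL_MIC_ENABLE=1
export OFFLOAD_REPORT=2

# Optionally fix the split: here, half the work goes to the Phi
export MKL_MIC_WORKDIVISION=0.5

./sgemm
```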
When the data are too small, the Xeon Phi is not used (the transfers would cost proportionally too much).
When working with Xeon Phis: porting should be easy, and hybrid execution is doable.