Intel Xeon Phi Basic Tutorial

1 Intel Xeon Phi Basic Tutorial Evan Bollig and Brent Swartz 1pm, 12/19/2013

2 Overview: Intro to MSI; Intro to the MIC Architecture; Targeting the Xeon Phi; Examples (Automatic Offload, Offload Mode, Native Mode); Distributed Jobs (Symmetric MPI)

3 A Quick Introduction to MSI

4 MSI at a Glance
HPC Resources: Koronis, Itasca, Calhoun, Cascade, GPUT
Laboratories: Biomedical Modeling, Simulation and Design; Basic Sciences; Life Sciences; Scientific Development; Remote Visualization
Software: Chemical and Physical Sciences; Engineering; Graphics and Visualization; Life Sciences; Development Tools
User Services: Consulting; Tutorials; Code Porting; Parallelization; Visualization

5 HPC Resources
MSI's mission: provide researchers* access to and support for HPC resources to facilitate successful and cutting-edge research in all disciplines. (* UMN and other MN institutions)
Koronis: SGI Altix, 1140 Intel Nehalem cores, 2.96 TB of memory
Itasca: Hewlett-Packard 3000BL, 8728 Intel Nehalem cores, 26 TB of memory
Calhoun: SGI Altix XE 1300, 1440 Intel Xeon Clovertown cores, 2.8 TB of memory
Cascade: 15 Dell compute nodes, 32 Nvidia M2070s (4:1), 8 Nvidia Kepler K20s (2:1), 4 Intel Xeon Phi (1:1, 2:1)
GPUT: 4 Exxact Corp GPU blades, 16 Nvidia GeForce GTX 480 (4:1)

6 Tutorials/Workshops
Introductory: Unix, Linux, remote computing, job submission, queue policy
Programming & Scientific Computation: code parallelization, programming languages, math libraries
Computational Physics: fluid dynamics, space physics, structural mechanics, material science
Computational Chemistry: quantum chemistry, classical molecular modeling, drug design, cheminformatics
Computational Biology: structural biology, computational genomics, proteomics, bioinformatics

7 Introduction to the MIC Architecture

8 What's in a name? Fee-fi-fo-fum: Knights Corner; Many Integrated Core (MIC); Xeon Phi; Intel 5110P (B1)

9 PHI architecture PHI hardware is described here:

10 PHI Performance Briefly, PHI performance is described here: benchmarks/xeon-phi-product-familyperformance-brief.html

11 Phi vs GPU
Why the Phi? x86 instructions; Bandwidth: 320 GB/s; IP addressable; Code portability; Symmetric mode; MKL Auto Offload
Why the GPU? Massive following and literature; SIMT; Dynamic parallelism; OpenCL drivers; cuBLAS, cuRAND, cuSPARSE, etc.

12 MSI PHI description An MSI PHI quickstart guide is described here:

13 Roofline Model
Manage performance expectations with the roofline model: peak possible GFLOP/sec (DP) as a function of operational intensity (O.I., FLOPs:Byte).
[Figure: two roofline plots; x-axis: Operational Intensity (FLOPs:Byte); y-axis: Peak Possible GFLOP/sec (DP)]
NVidia K20: 1170 GFLOP/sec, 208 GByte/sec
NVidia M2070: 515 GFLOP/sec, 144 GByte/sec
Intel Xeon Phi 5110P (B1): 1011 GFLOP/sec, 320 GByte/sec
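
The roofline bound on this slide can be sketched as a one-line formula: attainable performance is the smaller of the compute peak and bandwidth times operational intensity. The sketch below uses the peak and bandwidth figures quoted on the slide; the pairing of each figure with a device and the 0.25 FLOPs:Byte sample intensity are assumptions made for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gbytes, intensity):
    """Roofline bound: min(compute roof, memory bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gbytes * intensity)


def ridge_point(peak_gflops, bandwidth_gbytes):
    """Operational intensity (FLOPs:Byte) at which a kernel stops being bandwidth bound."""
    return peak_gflops / bandwidth_gbytes


# (peak DP GFLOP/sec, GByte/sec) as quoted on the slide; device pairing is assumed.
DEVICES = {
    "Intel Xeon Phi 5110P": (1011.0, 320.0),
    "NVidia K20": (1170.0, 208.0),
    "NVidia M2070": (515.0, 144.0),
}

if __name__ == "__main__":
    sample_intensity = 0.25  # hypothetical memory-bound kernel
    for name, (peak, bw) in DEVICES.items():
        bound = attainable_gflops(peak, bw, sample_intensity)
        print(f"{name}: ridge at {ridge_point(peak, bw):.2f} FLOPs:Byte, "
              f"bound at O.I. {sample_intensity} = {bound:.1f} GFLOP/sec")
```

For low-intensity kernels the bound is set by bandwidth alone, which is why the Phi's 320 GByte/sec matters more than its peak GFLOP/sec in that regime.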

14 Targeting the Xeon Phi

15 MSI PHI demonstration At MSI, the only compiler that currently has OpenMP 4.0 support is the latest intel/cluster module, loaded using:
% module load intel/cluster

16 MSI PHI demonstration An interactive PHI node can be obtained using:
% qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb

17 MSI PHI demonstration Information about the Phi can be obtained using:
% /opt/intel/mic/bin/micinfo
As shown by the micinfo output, each of the current 2 Phi nodes has 1 attached Phi coprocessor containing 60 cores, with a frequency of GHz, for a peak of 1011 GFLOPS, and 7936 MB of memory.
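
The 1011 GFLOPS peak can be checked with simple arithmetic: cores times clock times double-precision FLOPs per cycle per core. The sketch below assumes the 5110P's nominal 1.053 GHz clock (the frequency is missing from the slide text, so this value is an assumption), 512-bit vectors holding 8 doubles, and a fused multiply-add counted as 2 FLOPs.

```python
def peak_dp_gflops(cores, clock_ghz, simd_doubles=8, flops_per_lane=2):
    """Theoretical double-precision peak in GFLOP/sec.

    simd_doubles: doubles per 512-bit vector register (8).
    flops_per_lane: 2 when a fused multiply-add is counted as two operations.
    """
    return cores * clock_ghz * simd_doubles * flops_per_lane


if __name__ == "__main__":
    # 60 cores from the slide; 1.053 GHz is an assumed nominal clock.
    print(f"{peak_dp_gflops(60, 1.053):.2f} GFLOP/sec")
```

Real kernels reach this figure only when every core issues a full-width FMA each cycle, so it is an upper bound rather than an expected rate.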

18 PHI Execution Mode Phi Execution mode figure: phi/pdfs/intel-xeon-phi- Coprocessor_ProductBrief.pdf

19 MKL PHI usage Intel Math Kernel Library Link Line Advisor (A web tool to help users to choose correct link line options.):

20 MKL PHI usage "Using Intel Math Kernel Library on Intel Xeon Phi Coprocessors" section in the User's Guide: documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm

21 MKL PHI code examples
$MKLROOT/examples/mic_ao
$MKLROOT/examples/mic_offload
- dexp: VML example (vdexp)
- dgaussian: double precision Gaussian RNG
- fft: complex-to-complex 1D FFT
- sexp: VML example (vsexp)
- sgaussian: single precision Gaussian RNG

22 MKL PHI code examples
- sgemm: SGEMM example
- sgemm_f: SGEMM example (Fortran 90)
- sgemm_reuse: SGEMM with data persistence
- sgeqrf: QR factorization
- sgetrf: LU factorization
- spotrf: Cholesky

23 MKL PHI usage Intel Math Kernel Library Link Line Advisor (a web tool to help users choose correct link line options). "Using Intel Math Kernel Library on Intel Xeon Phi Coprocessors" section in the User's Guide: doclib/mkl_sa/11/mkl_userguide_lnx/index.htm

24 PHI Optimization Tips Problem size considerations: Large problems have more parallelism. But not too large (8GB memory on a coprocessor). FFT prefers power-of-2 sizes.
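
As a rough illustration of the sizing advice above, the following sketch estimates whether a square double-precision matrix multiply fits in coprocessor memory and rounds an FFT length up to a power of 2. The three-matrix footprint, the 8 GiB capacity, and the 80% headroom factor are assumptions chosen for the example, not MSI or Intel recommendations.

```python
def is_power_of_two(n):
    """True when n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n):
    """Smallest power of 2 that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def gemm_bytes(n, elem_bytes=8):
    """Footprint of C = A * B with three n-by-n matrices."""
    return 3 * n * n * elem_bytes


def fits_on_card(nbytes, card_bytes=8 * 1024**3, headroom=0.8):
    """Leave some memory for the coprocessor OS and runtime (headroom is hypothetical)."""
    return nbytes <= headroom * card_bytes


if __name__ == "__main__":
    for n in (10000, 20000):
        print(n, gemm_bytes(n) / 1e9, "GB, fits:", fits_on_card(gemm_bytes(n)))
    print("FFT length 1000 padded to", next_power_of_two(1000))
```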

25 PHI Optimization Tips Data alignment consideration: 64-byte alignment for better vectorization.
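
One common way to keep every row of a 2D array on a 64-byte boundary is to pad the row length to a whole number of 64-byte lines. The helper names below are hypothetical, and the sketch only covers the arithmetic; the base address itself still has to come from an aligned allocator.

```python
def padded_length(n_elems, elem_bytes=8, align_bytes=64):
    """Round a row length up so each row occupies a whole number of aligned lines."""
    per_line = align_bytes // elem_bytes
    return -(-n_elems // per_line) * per_line  # ceiling division, then scale back


def is_aligned(address, align_bytes=64):
    """True when a byte address sits on an align_bytes boundary."""
    return address % align_bytes == 0


if __name__ == "__main__":
    print(padded_length(1001))      # doubles: 8 per 64-byte line
    print(padded_length(10, 4))     # floats: 16 per 64-byte line
```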

26 PHI Optimization Tips OpenMP thread count and thread affinity: Avoid thread migration for better data locality.

27 PHI Optimization Tips Large (2MB) pages for memory allocation: Reduce TLB misses and memory allocation overhead.
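
The effect of large pages on TLB pressure comes down to how many pages an array spans. A minimal sketch, assuming a 4 KiB default page size (typical on Linux, but an assumption here) against the 2 MB pages named on the slide:

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(nbytes, page_bytes):
    """Number of pages required to map nbytes (ceiling division)."""
    return -(-nbytes // page_bytes)


if __name__ == "__main__":
    array_bytes = 1 * GIB
    print("4 KiB pages:", pages_needed(array_bytes, 4 * KIB))
    print("2 MiB pages:", pages_needed(array_bytes, 2 * MIB))
```

Fewer pages means fewer distinct translations for the TLB to hold while streaming through the array.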

28 KMP_AFFINITY: pin threads to cores. Types: compact, scatter, balanced, explicit, none. (Source: StampedeJune2013/mic pdf, Slide 29)
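
To make the difference between three of the affinity types concrete, here is a deliberately simplified placement model that returns the core index for each thread. It is a sketch for intuition only, not the Intel OpenMP runtime's algorithm: it ignores hardware thread contexts within a core, granularity, and offsets, and the function name is hypothetical.

```python
def place_threads(n_threads, n_cores, threads_per_core, kind):
    """Return a list where entry t is the core index assigned to thread t."""
    if n_threads > n_cores * threads_per_core:
        raise ValueError("more threads than hardware thread contexts")
    if kind == "compact":
        # Fill each core completely before moving to the next one.
        return [t // threads_per_core for t in range(n_threads)]
    if kind == "scatter":
        # Deal threads round-robin across cores.
        return [t % n_cores for t in range(n_threads)]
    if kind == "balanced":
        # Spread evenly, but keep consecutive thread IDs on the same core.
        base, extra = divmod(n_threads, n_cores)
        placement = []
        for core in range(n_cores):
            placement.extend([core] * (base + (1 if core < extra else 0)))
        return placement
    raise ValueError(f"unsupported affinity type: {kind}")


if __name__ == "__main__":
    for kind in ("compact", "scatter", "balanced"):
        print(kind, place_threads(6, 3, 4, kind))
```

In this model, compact leaves one core idle with 6 threads on 3 cores, while scatter and balanced use all three; balanced additionally keeps neighbouring thread IDs together, which helps when adjacent threads share data.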

29 Native Mode (via mpirun)

30 SSH to cascade, Git Checkout
% module load cmake intel/cluster
% git clone /home/support/public/tutorials/phi_cmake_example.git

31 Build
% cd phi_cmake_example
% mkdir build
% cd build
% cmake ..
% make

32 Run
% cd mic_mpi
% cp ../../mic_mpi/job_simple.pbs .
% qsub job_simple.pbs

33 Interactive Mode
% qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi
% export I_MPI_MIC=enable
% export I_MPI_MIC_POSTFIX=.mic
% mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`

34 An OpenCL Example (Research in progress)

35 What is an RBF?

36 RBF-FD? Classical FD: Vandermonde System Substitute for each

37 RBF-FD?

38 RBF-FD Stencils

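39 Sparse Mat-Vec Multiply (SpMV)
$$\mathcal{L}u(x)\big|_{x=x_c} \approx \sum_{j=1}^{n} c_j\, u(x_j), \qquad \text{e.g. } \mathcal{L}u(x_c) = \frac{du(x_c)}{dx}$$
Applying the stencil weights $c_j$ at every center $x_c$ is a sparse matrix-vector product with the differentiation matrix $D_x$.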
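
Applying derivative weights at every stencil center amounts to a sparse matrix-vector product. The sketch below shows a plain CSR SpMV; to keep it verifiable it uses classical central-difference weights on a uniform 1D grid rather than RBF-FD weights, and the array names are assumptions.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Central differences (h = 1) for the three interior points of a 5-point grid:
# row i holds weights -1/2 and +1/2 on the left and right neighbours.
values = [-0.5, 0.5, -0.5, 0.5, -0.5, 0.5]
col_idx = [0, 2, 1, 3, 2, 4]
row_ptr = [0, 2, 4, 6]

if __name__ == "__main__":
    u = [float(x * x) for x in range(5)]  # u(x) = x^2, so du/dx = 2x
    print(csr_spmv(values, col_idx, row_ptr, u))
```

Each output entry reads only the few neighbours in its stencil, so the kernel performs little arithmetic per byte loaded and is typically limited by memory bandwidth.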

40 Sparse Formats
COO: Value, Row, Col
CSR: Value, Row Ptr, Col
ELL: Value, Col
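
The three layouts named on this slide hold the same nonzeros with different index arrays. A minimal sketch of converting COO triplets to CSR and to ELL follows; the function names and the choice of padding ELL rows with value 0.0 and column 0 are assumptions for illustration.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Return (values, col_idx, row_ptr) sorted by row, then column."""
    order = sorted(range(len(vals)), key=lambda k: (rows[k], cols[k]))
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1          # count nonzeros per row
    for r in range(n_rows):
        row_ptr[r + 1] += row_ptr[r]  # prefix sum gives row start offsets
    return [vals[k] for k in order], [cols[k] for k in order], row_ptr


def coo_to_ell(n_rows, rows, cols, vals, pad_col=0):
    """Return (values, col_idx) as n_rows-by-width lists, padded to the longest row."""
    per_row = [[] for _ in range(n_rows)]
    for k in sorted(range(len(vals)), key=lambda k: (rows[k], cols[k])):
        per_row[rows[k]].append((cols[k], vals[k]))
    width = max((len(r) for r in per_row), default=0)
    ell_vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in per_row]
    ell_cols = [[c for c, _ in r] + [pad_col] * (width - len(r)) for r in per_row]
    return ell_vals, ell_cols


if __name__ == "__main__":
    # Matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] given as unsorted COO triplets.
    rows, cols, vals = [2, 0, 1, 0, 2], [2, 0, 1, 2, 0], [5.0, 4.0, 3.0, 1.0, 2.0]
    print(coo_to_csr(3, rows, cols, vals))
    print(coo_to_ell(3, rows, cols, vals))
```

ELL trades padding for a regular, fixed-width layout, which suits stencil matrices where every row has about the same number of nonzeros.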

41 ViennaCL Performance GPU to Phi Performance is NOT portable. 1) OpenCL driver is still BETA! 2) Loops vectorize differently

42 SpMM with MIC Intrinsics (Content from submitted paper; slides kept separate)

43 Additional Items

44 Optimal Mapping of Work to Cores/Accelerators Which programming model is optimal is still an outstanding issue. Options for shared memory / accelerator programming include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, NVIDIA-specific CUDA, and OpenCL.

45 OpenACC OpenACC 2.0 was released this summer: Improvements include: procedure calls, nested parallelism, more dynamic data management support and more. OpenACC 2.0 additions described by PGI's Michael Wolfe at SC13:

46 OpenACC PGI will support OpenACC 2.0 starting in Jan 2014, with PGI Current MSI module pgi/13.9 supports OpenACC 1.0 directives. GCC will support OpenACC soon: OpenACC 2.0 expected in 2014

47 OpenMP 4.0 MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives. openmp-40-features-in-intel-fortrancomposer-xe-2013 For more information on OpenMP, see:

48 Knights Landing Information on the Intel PHI follow-on due out in 2014/2015, Knights Landing: intel_knights_landing_xeon_phi_fabric_interconnects/ Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.

49 MSI home page Software Password reset Questions? Tutorials FAQ

50 Questions? MSI help desk is staffed Monday through Friday from 8:30AM to 7:00PM. Walk-in help available in room 569 Walter. Phone

51 Thank You The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter library, 117 Pleasant St. SE, Minneapolis, Minnesota, 55455,
