Intel Xeon Phi Basic Tutorial
Evan Bollig and Brent Swartz
1pm, 12/19/2013
Overview
- Intro to MSI
- Intro to the MIC Architecture
- Targeting the Xeon Phi
- Examples:
  - Automatic Offload
  - Offload Mode
  - Native Mode
  - Distributed Jobs
  - Symmetric MPI
A Quick Introduction to MSI
MSI at a Glance
- HPC Resources: Koronis, Itasca, Calhoun, Cascade, GPUT
- Laboratories: Biomedical Modeling, Simulation and Design; Basic Sciences; Life Sciences; Scientific Development; Remote Visualization
- Software: Chemical and Physical Sciences; Engineering; Graphics and Visualization; Life Sciences; Development Tools
- User Services: Consulting, Tutorials, Code Porting, Parallelization, Visualization
HPC Resources
MSI's mission: provide researchers* access to and support for HPC resources to facilitate successful and cutting-edge research in all disciplines. (*UMN and other MN institutions)
- Koronis: SGI Altix, 1140 Intel Nehalem cores, 2.96 TB of memory
- Itasca: Hewlett-Packard 3000BL, 8728 Intel Nehalem cores, 26 TB of memory
- Calhoun: SGI Altix XE 1300, 1440 Intel Xeon Clovertown cores, 2.8 TB of memory
- Cascade: 15 Dell compute nodes, 32 NVIDIA M2070s (4:1), 8 NVIDIA Kepler K20s (2:1), 4 Intel Xeon Phi (1:1, 2:1)
- GPUT: 4 Exxact Corp GPU blades, 16 NVIDIA GeForce GTX 480 (4:1)
Tutorials/Workshops
- Introductory: Unix, Linux, remote computing, job submission, queue policy
- Programming & Scientific Computation: code parallelization, programming languages, math libraries
- Computational Physics: fluid dynamics, space physics, structural mechanics, material science
- Computational Chemistry: quantum chemistry, classical molecular modeling, drug design, cheminformatics
- Computational Biology: structural biology, computational genomics, proteomics, bioinformatics
www.msi.umn.edu/tutorial
Introduction to the MIC Architecture
What's in a name? Fee-fi-fo-fum
- Knights Corner
- Many Integrated Core (MIC)
- Xeon Phi
- Intel 5110P (B1)
PHI Architecture
PHI hardware is described here: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
PHI Performance
Briefly, PHI performance is described here: http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html
Phi vs GPU
Why the Phi?
- x86-64 instructions
- Bandwidth: 320 GB/s
- IP addressable
- Code portability
- Symmetric mode
- MKL automatic offload
Why the GPU?
- Massive following and literature
- SIMT
- Dynamic parallelism
- OpenCL drivers
- cublas, curand, cusparse, etc.
MSI PHI Description
An MSI PHI quickstart guide is available here: https://www.msi.umn.edu/content/intel-phi-quickstart
Roofline Model
Manage expectations of performance as a function of operational intensity (O.I.).
[Figure: roofline plots of peak possible GFLOP/sec (DP) vs. operational intensity (FLOPs:Byte)]
- NVIDIA K20: 1170 GFLOP/sec peak, 208 GByte/sec
- NVIDIA M2070: 515 GFLOP/sec peak, 144 GByte/sec
- Intel Xeon Phi 5110P (B1): 1011 GFLOP/sec peak, 320 GByte/sec
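For reference, the roofline bound plotted above takes the standard form (formula added here; it does not appear on the original slide):

$$\text{Attainable GFLOP/s} = \min\big(\text{peak GFLOP/s},\ \text{peak GB/s} \times \text{O.I.}\big)$$

For example, at O.I. = 1 FLOP/Byte the Phi 5110P is bandwidth-bound at roughly 320 GFLOP/s, well below its 1011 GFLOP/s peak.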
Targeting the Xeon Phi
MSI PHI demonstration
At MSI, the only compiler that currently has OpenMP 4.0 support is the latest intel/cluster module, loaded using:
% module load intel/cluster
MSI PHI demonstration
You can obtain an interactive PHI node using:
% qsub -I -l walltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb
MSI PHI demonstration
You can obtain info about the Phi using:
% /opt/intel/mic/bin/micinfo
As shown in this micinfo output, each of the two current Phi nodes has one attached Phi coprocessor containing 60 cores at 1.053 GHz, for a peak of 1011 GFLOPS, with 7936 MB of memory.
PHI Execution Modes
Phi execution mode figure: http://download.intel.com/newsroom/kits/xeon/phi/pdfs/intel-xeon-phi-Coprocessor_ProductBrief.pdf
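To make offload mode concrete, here is a minimal sketch using the Intel compiler's offload pragma; the array names and size are illustrative, and it assumes the intel/cluster compiler from the earlier slide:

#include <stdio.h>

#define N 4096

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* Copy a to coprocessor 0, run the loop there, copy b back. */
    #pragma offload target(mic:0) in(a) out(b)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i];

    printf("b[%d] = %f\n", N - 1, b[N - 1]);  /* expect 8190.0 */
    return 0;
}

Compile with something like icc -openmp offload.c; if no coprocessor is present, the offloaded region falls back to the host.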
MKL PHI usage
Intel Math Kernel Library Link Line Advisor (a web tool to help users choose the correct link-line options): http://software.intel.com/sites/products/mkl/
MKL PHI usage
"Using Intel Math Kernel Library on Intel Xeon Phi Coprocessors" section in the User's Guide: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm
MKL PHI code examples
$MKLROOT/examples/mic_ao and $MKLROOT/examples/mic_offload contain:
- dexp: VML example (vdexp)
- dgaussian: double precision Gaussian RNG
- fft: complex-to-complex 1D FFT
- sexp: VML example (vsexp)
- sgaussian: single precision Gaussian RNG
MKL PHI code examples sgemm SGEMM example sgemm_f SGEMM example(fortran 90) sgemm_reuse SGEMM with data persistence sgeqrf QR factorization sgetrf LU factorization spotrf Cholesky
PHI Optimization Tips Problem size considerations: Large problems have more parallelism. But not too large (8GB memory on a coprocessor). FFT prefers power-of-2 sizes.
PHI Optimization Tips Data alignment consideration: 64-byte alignment for better vectorization.
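A minimal sketch of 64-byte aligned allocation with the Intel compiler (names and size are illustrative):

#include <stdio.h>
#include <immintrin.h>   /* _mm_malloc / _mm_free */

int main(void) {
    int n = 1024;
    float *x = (float *)_mm_malloc(n * sizeof(float), 64);  /* 64-byte aligned */
    __assume_aligned(x, 64);  /* Intel compiler hint: loops over x may use aligned loads/stores */
    for (int i = 0; i < n; i++) x[i] = 2.0f * i;
    printf("x[10] = %f\n", x[10]);
    _mm_free(x);
    return 0;
}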
PHI Optimization Tips OpenMP thread count and thread affinity: Avoid thread migration for better data locality.
PHI Optimization Tips Large (2MB) pages for memory allocation: Reduce TLB misses and memory allocation overhead.
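For offloaded code, one way to get 2 MB pages is the offload runtime's buffer setting; the 16K threshold below is illustrative:

% export MIC_USE_2MB_BUFFERS=16K   # offload buffers >= 16 KB get 2 MB pages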
KMP_AFFINITY
Pin threads to cores: compact, scatter, balanced, explicit, none.
http://www.cac.cornell.edu/education/training/StampedeJune2013/mic-130618.pdf, Slide 29
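For a native run on the coprocessor, affinity is set via environment variables; the thread count below is illustrative for a 60-core card:

% export OMP_NUM_THREADS=120                      # e.g. 2 threads per core
% export KMP_AFFINITY=granularity=fine,balanced   # pin threads; avoids migration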
Native Mode (via mpirun)
SSH to Cascade, Git Checkout
% module load cmake intel/cluster
% git clone /home/support/public/tutorials/phi_cmake_example.git
Build
% cd phi_cmake_example
% mkdir build
% cd build
% cmake ..
% make
Run
% cd mic_mpi
% cp ../../mic_mpi/job_simple.pbs .
% qsub job_simple.pbs
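job_simple.pbs ships with the example; a script along these lines (illustrative, not the file's verbatim contents) is what it amounts to:

#!/bin/bash
#PBS -l walltime=0:30:00,nodes=1:ppn=16:phi,pmem=200mb
cd $PBS_O_WORKDIR
module load intel/cluster
export I_MPI_MIC=enable            # let Intel MPI launch ranks on the MIC
export I_MPI_MIC_POSTFIX=.mic      # pick up the .mic binary on the card
mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`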
Interactive Mode
% qsub -I -l walltime=4:00:00,nodes=1:ppn=16:phi
% export I_MPI_MIC=enable
% export I_MPI_MIC_POSTFIX=.mic
% mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`
An OpenCL Example (Research in progress)
What is an RBF?
A radial basis function $\phi$ depends only on the distance from its center: $\phi_j(x) = \phi(\|x - x_j\|)$. A common choice is the Gaussian $\phi(r) = e^{-(\varepsilon r)^2}$.
RBF-FD?
Classical FD: derivative weights come from solving a Vandermonde system built on polynomial interpolation. RBF-FD: substitute an RBF $\phi(\|x - x_j\|)$ for each polynomial basis function.
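In standard RBF-FD notation (added here for context; the original slide showed this as a figure), the weights $c_j$ for a stencil of $n$ nodes centered at $x_c$ solve

$$
\begin{pmatrix}
\phi(\|x_1 - x_1\|) & \cdots & \phi(\|x_1 - x_n\|) \\
\vdots & \ddots & \vdots \\
\phi(\|x_n - x_1\|) & \cdots & \phi(\|x_n - x_n\|)
\end{pmatrix}
\begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}
=
\begin{pmatrix}
\mathcal{L}\,\phi(\|x - x_1\|)\big|_{x = x_c} \\
\vdots \\
\mathcal{L}\,\phi(\|x - x_n\|)\big|_{x = x_c}
\end{pmatrix}
$$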
RBF-FD Stencils
Sparse Mat-Vec Multiply (SpMV)
An RBF-FD derivative approximates the operator at each stencil center,
$$\mathcal{L}u(x)\Big|_{x = x_c} \approx \sum_{j=1}^{n_x} c_j\, u(x_j), \qquad \mathcal{L} = \frac{\partial}{\partial x} \ \Rightarrow\ \sum_{j=1}^{n_x} c_j\, u(x_j) \approx \frac{du(x_c)}{dx},$$
and assembling the weights for all centers gives the sparse differentiation matrix $D_x$, so differentiation is an SpMV.
Sparse Formats
Example matrix:
[ 1 5 0 0 ]
[ 0 2 0 7 ]
[ 0 6 3 0 ]
[ 0 8 0 4 ]
COO:
  Value = [1 5 2 7 6 3 8 4]
  Row   = [0 0 1 1 2 2 3 3]
  Col   = [0 1 1 3 1 2 1 3]
CSR:
  Value   = [1 5 2 7 6 3 8 4]
  Row Ptr = [0 2 4 6 8]
  Col     = [0 1 1 3 1 2 1 3]
ELL (2 entries per row):
  Value = [1 5 | 2 7 | 6 3 | 8 4]
  Col   = [0 1 | 1 3 | 1 2 | 1 3]
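A minimal CSR SpMV kernel in C with OpenMP, using the example matrix above (a sketch, not the tutorial's actual code):

#include <stdio.h>

/* y = A*x for a CSR matrix; rows are independent, so parallelize over rows. */
void spmv_csr(int nrows, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* The 4x4 example matrix from the slide above. */
    int    row_ptr[] = {0, 2, 4, 6, 8};
    int    col[]     = {0, 1, 1, 3, 1, 2, 1, 3};
    double val[]     = {1, 5, 2, 7, 6, 3, 8, 4};
    double x[]       = {1, 1, 1, 1}, y[4];

    spmv_csr(4, row_ptr, col, val, x, y);
    for (int i = 0; i < 4; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}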
ViennaCL Performance
GPU-to-Phi performance is NOT portable:
1) The OpenCL driver is still BETA!
2) Loops vectorize differently
SpMM with MIC Intrinsics (content from a submitted paper; slides kept separate)
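The paper's kernels are not reproduced here, but to give a flavor of KNC (IMCI) intrinsics, here is a minimal dot-product sketch; names and structure are illustrative, and it assumes 64-byte-aligned inputs whose length is a multiple of 8 (compile natively with icc -mmic):

#include <immintrin.h>

double dot_knc(const double *a, const double *b, int n)
{
    __m512d acc = _mm512_setzero_pd();
    for (int i = 0; i < n; i += 8) {          /* 8 doubles per 512-bit register */
        __m512d va = _mm512_load_pd(&a[i]);   /* aligned loads */
        __m512d vb = _mm512_load_pd(&b[i]);
        acc = _mm512_fmadd_pd(va, vb, acc);   /* acc += va * vb */
    }
    return _mm512_reduce_add_pd(acc);         /* horizontal sum of the 8 lanes */
}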
Additional Items
Optimal Mapping of Work to Cores/Accelerators
Which programming model is optimal is still an outstanding issue. Options for shared-memory / accelerator programming include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, NVIDIA-specific CUDA, and OpenCL.
http://www.hpcwire.com/2013/12/03/compilers-accelerated-programming/
OpenACC
OpenACC 2.0 was released this summer: http://www.openacc-standard.org/
Improvements include procedure calls, nested parallelism, more dynamic data management support, and more.
The OpenACC 2.0 additions were described by PGI's Michael Wolfe at SC13: http://www.nvidia.com/object/sc13-technology-theater.html
OpenACC
PGI will support OpenACC 2.0 starting in Jan 2014, with PGI 14.1. The current MSI module pgi/13.9 supports OpenACC 1.0 directives.
GCC will support OpenACC soon, with OpenACC 2.0 expected in 2014: http://www.hpcwire.com/2013/11/14/openacc-broadens-appeal-gcc-compiler-support/
OpenMP 4.0
The MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives: http://software.intel.com/en-us/articles/openmp-40-features-in-intel-fortran-composer-xe-2013
For more information on OpenMP, see: http://openmp.org/wp/
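A minimal OpenMP 4.0 target-offload sketch (illustrative; assumes the intel/cluster compiler above, e.g. icc -openmp):

#include <stdio.h>

int main(void) {
    const int n = 1024;
    float x[1024];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    /* Map x to the device; the reduction runs on the coprocessor. */
    #pragma omp target map(to: x) map(tofrom: sum)
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];

    printf("sum = %f\n", sum);  /* expect 1024.0 */
    return 0;
}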
Knights Landing
Information on the Intel PHI follow-on, Knights Landing, due out in 2014/2015:
http://www.theregister.co.uk/2013/06/17/intel_knights_landing_xeon_phi_fabric_interconnects/
http://www.hpcwire.com/2013/11/23/intel-brings-knights-roundtable-sc13/
Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.
MSI home page: www.msi.umn.edu
Software: www.msi.umn.edu/sw
Password reset: www.msi.umn.edu/password
Tutorials: www.msi.umn.edu/tutorial
FAQ: www.msi.umn.edu/support/faq.html
Questions?
The MSI help desk is staffed Monday through Friday from 8:30 AM to 7:00 PM. Walk-in help is available in room 569 Walter.
Phone: 612.626.0802
Email: help@msi.umn.edu
Thank You
The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter Library, 117 Pleasant St. SE, Minneapolis, Minnesota, 55455, 612-624-0528.