Intel Xeon Phi Basic Tutorial Evan Bollig and Brent Swartz 1pm, 12/19/2013

Overview
- Intro to MSI
- Intro to the MIC architecture
- Targeting the Xeon Phi
- Examples: Automatic Offload, Offload Mode, Native Mode, Distributed Jobs (Symmetric MPI)

A Quick Introduction to MSI

MSI at a Glance
- HPC Resources: Koronis, Itasca, Calhoun, Cascade, GPUT
- Laboratories: Biomedical Modeling, Simulation and Design; Basic Sciences; Life Sciences; Scientific Development; Remote Visualization
- Software: Chemical and Physical Sciences; Engineering; Graphics and Visualization; Life Sciences; Development Tools
- User Services: Consulting, Tutorials, Code Porting, Parallelization, Visualization

HPC Resources
MSI's mission: provide researchers* access to and support for HPC resources to facilitate successful and cutting-edge research in all disciplines. (* UMN and other MN institutions)
- Koronis: SGI Altix, 1140 Intel Nehalem cores, 2.96 TB of memory
- Itasca: Hewlett-Packard 3000BL, 8728 Intel Nehalem cores, 26 TB of memory
- Calhoun: SGI Altix XE 1300, 1440 Intel Xeon Clovertown cores, 2.8 TB of memory
- Cascade: 15 Dell compute nodes, 32 NVIDIA M2070s (4:1), 8 NVIDIA Kepler K20s (2:1), 4 Intel Xeon Phi (1:1, 2:1)
- GPUT: 4 Exxact Corp GPU blades, 16 NVIDIA GeForce GTX 480 (4:1)

Tutorials/Workshops
- Introductory: Unix, Linux, remote computing, job submission, queue policy
- Programming & Scientific Computation: code parallelization, programming languages, math libraries
- Computational Physics: fluid dynamics, space physics, structural mechanics, material science
- Computational Chemistry: quantum chemistry, classical molecular modeling, drug design, cheminformatics
- Computational Biology: structural biology, computational genomics, proteomics, bioinformatics
www.msi.umn.edu/tutorial

Introduction to the MIC Architecture

What's in a name? Fee-fi-fo-fum: Knights Corner, Many Integrated Core (MIC), Xeon Phi, Intel 5110P (B1) are all names attached to the same coprocessor.

PHI architecture PHI hardware is described here: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

PHI Performance Briefly, PHI performance is described here: http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html

Phi vs GPU
Why the Phi?
- x86-64 instructions
- Bandwidth: 320 GB/s
- IP addressable
- Code portability
- Symmetric mode
- MKL Automatic Offload
Why the GPU?
- Massive following and literature
- SIMT
- Dynamic parallelism
- OpenCL drivers
- cuBLAS, cuRAND, cuSPARSE, etc.

MSI PHI description An MSI Phi quickstart guide is available here: https://www.msi.umn.edu/content/intel-phi-quickstart

Roofline Model
Manage expectations of performance as a function of operational intensity (O.I.).
[Figure: roofline plots of peak double-precision GFLOP/s vs. operational intensity (FLOPs:Byte). NVIDIA K20: 1170 GFLOP/s, 208 GB/s. NVIDIA M2070: 515 GFLOP/s, 144 GB/s. Intel Xeon Phi 5110P (B1): 1011 GFLOP/s, 320 GB/s.]
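The roofline bound itself is just min(peak FLOP rate, operational intensity × peak bandwidth). For example, a kernel running at 1 FLOP per byte on the 5110P is bandwidth-bound at roughly min(1011, 1 × 320) = 320 GFLOP/s; it would need an operational intensity above about 1011/320 ≈ 3.2 FLOPs per byte before the compute peak becomes the limiting roof.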

Targeting the Xeon Phi

MSI PHI demonstration At MSI, the only compiler that currently has OpenMP 4.0 support is the latest intel/cluster module, loaded using: % module load intel/cluster
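As a minimal sketch of what that OpenMP 4.0 support buys on the coprocessor (illustrative code, not part of the tutorial materials), a target region offloads a loop to the Phi; note the separate target and parallel for directives, since combined directives are not yet supported by this module:

/* Compile with the Intel compiler, e.g.: icc -openmp target_example.c */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* Offload the loop to the coprocessor; the map clauses name the
       data to copy in and out (falls back to the host if no device). */
    #pragma omp target map(to: a[0:N]) map(from: b[0:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i] + 1.0f;

    printf("b[42] = %f\n", b[42]);
    return 0;
}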

MSI PHI demonstration Obtain an interactive PHI node using: % qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb

MSI PHI demonstration Obtain information about the Phi using: % /opt/intel/mic/bin/micinfo As the micinfo output shows, each of the current two Phi nodes has one attached Phi coprocessor containing 60 cores at 1.053 GHz, for a peak of 1011 GFLOPS, with 7936 MB of memory.
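The quoted peak follows from the core count and clock, assuming one 512-bit fused multiply-add per core per cycle (8 doubles × 2 FLOPs): 60 cores × 1.053 GHz × 16 DP FLOPs/cycle ≈ 1011 GFLOP/s.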

PHI Execution Mode The Phi execution modes are illustrated in this figure: http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi-Coprocessor_ProductBrief.pdf
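In offload mode the host program ships designated regions to the card using the Intel compiler's offload pragmas; a minimal sketch (names and sizes are illustrative, not from the product brief):

/* Compile with the Intel compiler, e.g.: icc -openmp offload_example.c */
#include <stdio.h>

#define N 1000000

/* Mark the function so a coprocessor version is generated as well. */
__attribute__((target(mic)))
void scale(float *x, int n, float s) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) x[i] *= s;   /* runs on the card's cores */
}

int main(void) {
    static float x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0f;

    /* Copy x in, run scale() on coprocessor 0, copy x back out. */
    #pragma offload target(mic:0) inout(x : length(N))
    scale(x, N, 3.0f);

    printf("x[0] = %f\n", x[0]);
    return 0;
}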

MKL PHI usage Intel Math Kernel Library Link Line Advisor (a web tool that helps users choose the correct link-line options): http://software.intel.com/sites/products/mkl/

MKL PHI usage See the "Using Intel Math Kernel Library on Intel Xeon Phi Coprocessors" section in the User's Guide: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm

MKL PHI code examples
$MKLROOT/examples/mic_ao
$MKLROOT/examples/mic_offload
- dexp: VML example (vdexp)
- dgaussian: double-precision Gaussian RNG
- fft: complex-to-complex 1D FFT
- sexp: VML example (vsexp)
- sgaussian: single-precision Gaussian RNG

MKL PHI code examples sgemm SGEMM example sgemm_f SGEMM example(fortran 90) sgemm_reuse SGEMM with data persistence sgeqrf QR factorization sgetrf LU factorization spotrf Cholesky


PHI Optimization Tips Problem size considerations: large problems expose more parallelism, but not too large (8 GB of memory on a coprocessor). FFT prefers power-of-two sizes.

PHI Optimization Tips Data alignment considerations: use 64-byte alignment for better vectorization.
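A small sketch of the alignment advice (the __assume_aligned hint is an Intel compiler extension; the names and loop here are illustrative):

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1 << 20;
    float *a, *b;
    /* 64-byte alignment matches the Phi's 512-bit vector registers. */
    posix_memalign((void **)&a, 64, n * sizeof(float));
    posix_memalign((void **)&b, 64, n * sizeof(float));
    for (int i = 0; i < n; i++) a[i] = (float)i;

    /* Intel compiler hint: the pointers are 64-byte aligned, so the
       vectorizer can emit aligned loads/stores in the loop below. */
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];

    printf("b[7] = %f\n", b[7]);
    free(a); free(b);
    return 0;
}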

PHI Optimization Tips OpenMP thread count and thread affinity: Avoid thread migration for better data locality.

PHI Optimization Tips Large (2MB) pages for memory allocation: Reduce TLB misses and memory allocation overhead.
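One way to request 2 MB pages from ordinary Linux code is mmap with MAP_HUGETLB, sketched below. This assumes huge pages have actually been reserved on the host or coprocessor, and is only one of several ways (e.g. a preloaded hugepage allocator) to get large pages:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t bytes = 64UL << 20;   /* 64 MB, a multiple of the 2 MB page size */

    /* Ask the kernel for 2 MB (huge) pages; fails if none are reserved. */
    double *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    buf[0] = 42.0;               /* first touch allocates the pages */
    printf("buf[0] = %f\n", buf[0]);
    munmap(buf, bytes);
    return 0;
}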

KMP_AFFINITY
Pins threads to cores. Affinity types: compact, scatter, balanced, explicit, none.
http://www.cac.cornell.edu/education/training/StampedeJune2013/mic-130618.pdf, Slide 29
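A quick way to see what a given KMP_AFFINITY setting does is to print where each thread lands (a small sketch, not part of the tutorial materials):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    /* Run with e.g.  KMP_AFFINITY=balanced OMP_NUM_THREADS=8 ./a.out
       and compare against compact and scatter.                        */
    #pragma omp parallel
    {
        printf("thread %2d runs on cpu %3d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}

Roughly: compact packs consecutive threads onto the fewest cores (filling hardware threads first), scatter round-robins threads across cores, and balanced (specific to the Phi) spreads threads across cores while keeping consecutive thread numbers on the same core.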

Native Mode (via mpirun)

SSH to cascade; Git checkout:
module load cmake intel/cluster
git clone /home/support/public/tutorials/phi_cmake_example.git

Build:
cd phi_cmake_example
mkdir build
cd build
cmake ..
make

Run:
cd mic_mpi
cp ../../mic_mpi/job_simple.pbs .
qsub job_simple.pbs

Interactive Mode
qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi
export I_MPI_MIC=enable
export I_MPI_MIC_POSTFIX=.mic
mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`
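quad.x here is the tutorial's prebuilt test binary. Any MPI program cross-compiled for the card works the same way; a minimal, hypothetical stand-in:

#include <stdio.h>
#include <mpi.h>

/* Build a host and a MIC binary, e.g.:
     mpiicc       hello.c -o hello.x
     mpiicc -mmic hello.c -o hello.x.mic
   The .mic suffix matches I_MPI_MIC_POSTFIX above. */
int main(int argc, char **argv) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}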

An OpenCL Example (Research in progress)

What is an RBF?

RBF-FD? Classical FD weights come from solving a Vandermonde system; RBF-FD substitutes a radial basis function for each polynomial basis term.

RBF-FD?

RBF-FD Stencils

Sparse Mat-Vec Multiply (SpMV)
RBF-FD weights approximate a linear operator $\mathcal{L}$ (e.g. $\mathcal{L} = \partial/\partial x$) at each stencil center: $\mathcal{L}u(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_c} \approx \sum_{j=1}^{n} c_j\, u(\mathbf{x}_j)$. Collecting the weights $c_j$ for every center $\mathbf{x}_k$ as the rows of a sparse matrix $D_x$ gives $D_x u \approx \frac{\partial u}{\partial x}$, so applying the operator is a sparse matrix-vector multiply.

Sparse Formats
Example matrix A:
    [ 1 5 0 0 ]
    [ 0 2 0 7 ]
    [ 0 6 3 0 ]
    [ 0 8 0 4 ]
COO: Value [1 5 2 7 6 3 8 4], Row [0 0 1 1 2 2 3 3], Col [0 1 1 3 1 2 1 3]
CSR: Value [1 5 2 7 6 3 8 4], Row Ptr [0 2 4 6 8], Col [0 1 1 3 1 2 1 3]
ELL: Value [1 5 2 7 6 3 8 4], Col [0 1 1 3 1 2 1 3] (two entries per row, padded as needed)
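With CSR, SpMV reduces to one short loop nest; a minimal OpenMP sketch using the example matrix above:

#include <stdio.h>

/* y = A*x with A stored in CSR: val[], col[], and row_ptr[] as above. */
void spmv_csr(int nrows, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];
        y[i] = sum;
    }
}

int main(void) {
    /* The 4x4 example matrix from the slide. */
    int    row_ptr[] = {0, 2, 4, 6, 8};
    int    col[]     = {0, 1, 1, 3, 1, 2, 1, 3};
    double val[]     = {1, 5, 2, 7, 6, 3, 8, 4};
    double x[4]      = {1, 1, 1, 1}, y[4];

    spmv_csr(4, row_ptr, col, val, x, y);
    for (int i = 0; i < 4; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}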

ViennaCL Performance Performance is NOT portable from GPU to Phi: 1) the OpenCL driver is still beta; 2) loops vectorize differently.

SpMM with MIC Intrinsics (Content from submitted paper; slides kept separate)

Additional Items

Optimally Mapping Work to Cores/Accelerators It is still an open question which programming model is optimal. Options for shared-memory / accelerator programming include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, NVIDIA-specific CUDA, and OpenCL. http://www.hpcwire.com/2013/12/03/compilers-accelerated-programming/

OpenACC OpenACC 2.0 was released this summer: http://www.openacc-standard.org/ Improvements include procedure calls, nested parallelism, more dynamic data management support, and more. The OpenACC 2.0 additions were described by PGI's Michael Wolfe at SC13: http://www.nvidia.com/object/sc13-technology-theater.html

OpenACC PGI will support OpenACC 2.0 starting in January 2014 with PGI 14.1; the current MSI module pgi/13.9 supports OpenACC 1.0 directives. GCC support for OpenACC is coming, with OpenACC 2.0 support expected in 2014: http://www.hpcwire.com/2013/11/14/openacc-broadens-appeal-gcc-compiler-support/
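For a flavor of the directive style, a minimal OpenACC parallel loop in C looks like this (a sketch; with PGI, compile with something like pgcc -acc):

#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* The compiler generates accelerator code for this loop and
       manages the copies named in the data clauses. */
    #pragma acc parallel loop copyin(a[0:N]) copyout(b[0:N])
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i] + 1.0f;

    printf("b[42] = %f\n", b[42]);
    return 0;
}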

OpenMP 4.0 The MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives. http://software.intel.com/en-us/articles/openmp-40-features-in-intel-fortran-composer-xe-2013 For more information on OpenMP, see: http://openmp.org/wp/

Knights Landing Information on the Intel Phi follow-on, Knights Landing, due out in 2014/2015: http://www.theregister.co.uk/2013/06/17/intel_knights_landing_xeon_phi_fabric_interconnects/ http://www.hpcwire.com/2013/11/23/intel-brings-knights-roundtable-sc13/ Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.

MSI home page: www.msi.umn.edu
Software: www.msi.umn.edu/sw
Password reset: www.msi.umn.edu/password
Tutorials: www.msi.umn.edu/tutorial
FAQ: www.msi.umn.edu/support/faq.html

Questions? The MSI help desk is staffed Monday through Friday from 8:30 AM to 7:00 PM. Walk-in help is available in room 569 Walter. Phone: 612.626.0802. Email: help@msi.umn.edu

Thank You The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter Library, 117 Pleasant St. SE, Minneapolis, Minnesota, 55455, 612-624-0528.