A quick tutorial on Intel's Xeon Phi Coprocessor



A quick tutorial on Intel's Xeon Phi Coprocessor
damien.francois@uclouvain.be, www.cism.ucl.ac.be

Outline: Architecture, Setup, Programming

"The beginning of wisdom is the definition of terms." *

Name                                    | Is a...                                       | As opposed to...                                                 | Just like...
Xeon Phi                                | Product series                                | Xeon                                                             | Tesla, Quadro, GeForce
MIC (Many Integrated Core Architecture) | Microprocessor architecture                   | Itanium, Nehalem, Sandy-Bridge, Atom                             | Tesla, Fermi, Kepler
3110D, 3110P, 5110P                     | Shippable product (SKU)                       | E5-2620                                                          | C1060, C2075, M2090
Knights Corner                          | Chip code name                                | Nehalem, Westmere, Sandy-Bridge, Ivy-Bridge, Lincroft, Cedarview | GF110, GK110
MPSS (Manycore Platform Software Stack) | Set of software (drivers, kernel modules...)  | N.A.                                                             | CUDA toolkit

* Socrates (470-399 B.C.)

Architecture: a 60-core chip

Architecture and core definition

Xeon Phi core (1 to 1.3 GHz):
- In-order architecture, x86 + MIC extensions
- 4 hardware threads
- 1 SPU: 1 double op/cycle
- 1 VPU: 32 float op/cycle, 16 double op/cycle; supports fused mult-add and transcendentals; 4-clock latency

NVIDIA Kepler SMX (735 to 745 MHz):
- 192 SP CUDA cores: 2 float op/cycle each, support fused mult-add
- 64 DP units: 2 double op/cycle each, support fused mult-add
- 32 SFU units: 1 op/cycle, support transcendentals

A Xeon Phi core is much more complex than a CUDA core.

Architecture and core definition: the Nehalem core

A Xeon Phi core, however, is still far less complex than a Xeon (Nehalem) core.

Architecture and core definition

Two specificities:
(1) In-order architecture with hardware multithreading --> the code needs to be multithreaded/multiprocessed
(2) Huge vector processing unit --> the code needs to be vectorized

When working with Xeon Phis: think multithreading, think vectorization.

Setup

Accelerator mode vs Cluster mode

[Diagram: the host node connects to the cluster over GbE (or InfiniBand with RDMA, via OFED) and to the Xeon Phi over PCIe. The coprocessor gets its own IP addresses (e.g. 172.31.1.1 on the card side, 10.10.10.40/10.10.10.41 on the host-side bridge) and a console device, /dev/ttymic0. In cluster mode, the host must route packets from the Xeon Phi to the network.]

Our Xeon Phi is installed in node mback40 of the cluster manneback, in 'accelerator mode'.

Slurm integration

Slurm integration
- The Xeon Phi is exposed as a so-called 'generic resource'
- Within a job allocation, users have SSH access to the Xeon Phi
- mback40's scratch space is available from the Xeon Phi
- You need a pair of corresponding SSH keys (id_rsa / id_rsa.pub in your .ssh directory) for this to work; the public key is copied to the Xeon Phi upon job startup

Programming

Intel's advice: first optimize on the Xeon, then port to the Xeon Phi.

Execution models on the Xeon Phi:
- Offload mode: OpenMP offload, MPI offload, MKL offload, OpenCL
- Native mode: OpenMP, Intel MPI

Execution models, GPU equivalents: in offload mode, CUDA, OpenCL and cuBLAS; there is no GPU equivalent of native OpenMP or native Intel MPI.

4 programming models, 3 execution models

Execution models: Offload, Hybrid ('symmetric'), Native
Programming models: OpenMP, MPI, MKL, OpenCL
Depending on the combination, the result is easy, a bit more complex, truly complex, or impossible.

Native OpenMP

Native OpenMP: a simple OpenMP hello world program

Native OpenMP
- Classical compilation for the Xeon; cross-compilation for the Xeon Phi
- Code transfer through micnativeloadex, or through SSH
Compile on the host, run on the Xeon Phi.
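The corresponding commands might be (a hedged sketch based on Intel compiler conventions; `mic0` is an example hostname):

```shell
icc -openmp hello.c -o hello            # classical compilation, runs on the Xeon
icc -openmp -mmic hello.c -o hello.mic  # cross-compilation for the Xeon Phi

# Either let the runtime copy and launch the binary on the coprocessor...
micnativeloadex ./hello.mic
# ...or transfer it over SSH and run it there:
scp hello.mic mic0:~ && ssh mic0 ./hello.mic
```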

Offload OpenMP

Offload OpenMP: the same program with offload pragmas

Offload OpenMP
- Classical compilation; the offloaded sections run on the Xeon Phi
- The same code runs flawlessly on the Xeon when no Xeon Phi is available
Compile on the host, launch on the host, offload to the Xeon Phi.

Hybrid OpenMP

Hybrid OpenMP
One section will run on the host... in parallel with another section, which will run on the Xeon Phi.

Hybrid OpenMP
You get some threads on the host and others on the Phi.
Compile on the host, run some threads on the host, offload others to the Phi.

Native Intel MPI

Native Intel MPI: a simple MPI hello world program

Native Intel MPI: compile on the host, run on the Xeon Phi
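The hello world in question is not transcribed; a standard MPI version would be (a sketch requiring an MPI installation such as Intel MPI):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
```

A possible build/run sequence, assuming Intel MPI conventions: `mpiicc -mmic hello_mpi.c -o hello_mpi.mic`, then launch the ranks on the coprocessor, e.g. `mpirun -host mic0 -n 4 ./hello_mpi.mic`.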

Hybrid Intel MPI

Hybrid Intel MPI: the same MPI hello world program

Hybrid Intel MPI
- Compile once for the host and once for the Xeon Phi
- Add the Xeon Phi to the machine file
- You get 2 processes on the host and 2 others on the Phi
Compile on the host, run some processes on the host, some on the Phi.
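The recipe above might look like this in practice (a hedged sketch; the exact commands are not in the transcript, and the hostnames mback40/mic0 are examples):

```shell
# Compile once for the host and once for the Xeon Phi
mpiicc hello_mpi.c -o hello_mpi
mpiicc -mmic hello_mpi.c -o hello_mpi.mic

# Machine file listing the host and the coprocessor, 2 processes each
cat > machines <<EOF
mback40:2
mic0:2
EOF

# Enable MIC support in Intel MPI and let it pick the .mic binary on the Phi
export I_MPI_MIC=1
export I_MPI_MIC_POSTFIX=.mic
mpirun -machinefile machines -n 4 ./hello_mpi
```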

Offload Intel MPI: a hybrid OpenMP/MPI hello world program with offload sections

Offload Intel MPI: you get 2 MPI processes on the host, each offloading 4 OpenMP threads to the Xeon Phi

Native MKL

Native MKL: simple SGEMM usage (the remainder of the code, not shown, handles parameter parsing, matrix creation, initialization, etc.)
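The call at the heart of such a program presumably looks like this (a sketch assuming MKL's C/CBLAS interface; the slide's actual source is not transcribed, and allocation/initialization are omitted as in the slides):

```c
#include <mkl.h>

/* Single-precision matrix multiply C = A * B for n x n row-major
   matrices, via MKL's SGEMM. */
void multiply(int n, const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K           */
                1.0f, A, n,     /* alpha, A, lda     */
                B, n,           /* B, ldb            */
                0.0f, C, n);    /* beta, C, ldc      */
}
```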

Native MKL: compile on the host, run on the Xeon Phi

Automatic offload MKL

Automatic offload MKL: the same simple SGEMM usage (no change to the code)

Automatic offload MKL
- Allow MKL to use the Xeon Phi, and make it verbose about offloading
- Half the work is done by the host, the other half by the Phi
Compile on the host, run some of the work on the host, offload the rest to the Phi.
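The environment variables behind this are MKL's automatic-offload controls (the values and the binary name below are examples):

```shell
export MKL_MIC_ENABLE=1          # allow MKL to use the Xeon Phi
export OFFLOAD_REPORT=2          # be verbose about what gets offloaded
# Optionally pin the work division instead of letting MKL decide:
export MKL_HOST_WORKDIVISION=0.5
export MKL_MIC_WORKDIVISION=0.5
./sgemm_example
```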

Automatic offload MKL: when the data are too small, the Xeon Phi is not used (the transfers would cost proportionally too much).

When working with Xeon Phis: porting should be easy, and hybrid is doable.