Overview on Modern Accelerators and Programming Paradigms
Ivan Girotto, igirotto@ictp.it
Information & Communication Technology Section (ICTS)
International Centre for Theoretical Physics (ICTP)
Multiple Socket CPUs + Accelerators
Accelerated Co-Processors
A set of simplified execution units that can perform a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU, they can accelerate the nominal speed of a system.
[Diagram: CPU (single-thread performance) vs. accelerator (throughput), and their architectural/physical integration]
Main approaches to accelerators:
- Task Parallelism (MIMD) -> MIC
- Data Parallelism (SIMD) -> GPU
The General Concept of Accelerated Computing
Host memory (~30/40 GBytes) on the CPU side; device memory (~110/120 GBytes) on the GPU side.
1. Copy data from host memory to device memory
2. Launch the GPU kernel
3. Execute the GPU kernel
4. Copy the result back to host memory
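A minimal CUDA C sketch of this four-step flow, given only as an illustration (the kernel scale, the array size and the scaling factor are assumptions, not taken from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Illustrative kernel: scales each element of the array in device memory */
__global__ void scale(float *d_data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *) malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **) &d_data, bytes);

    /* 1. Copy data from host memory to device memory */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* 2.-3. Launch the kernel; it executes on the GPU */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);

    /* 4. Copy the result back to host memory */
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}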
NVIDIA GPU
Why Does a GPU Accelerate Computing?
- Highly scalable design
- Higher aggregate memory bandwidth
- Huge number of low-frequency cores
- Higher aggregate computational power
- Massively parallel processors for data processing
SMX Processor & Warp Scheduler & Core
Why Does a GPU Not Accelerate Computing?
- PCI bus bottleneck
- Synchronization weakness
- Extremely slow serialized execution
- High complexity: SPMD(T) + SIMD & memory model
- People forget about Amdahl's law: accelerating only 50% of the original code, the expected speedup can reach a value of at most 2!
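For reference, Amdahl's law behind that last point: if a fraction p of the original runtime is accelerated by a factor s, the overall speedup is bounded as below, so with p = 0.5 the speedup can never exceed 2, no matter how fast the accelerator is.

S(s) = \frac{1}{(1-p) + p/s} \;\le\; \frac{1}{1-p}, \qquad p = 0.5 \;\Rightarrow\; S \le 2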
What is CUDA?
- NVIDIA compute architecture
- Quickly maturing software development capability, provided free of charge by NVIDIA
- C and C++ programming language extensions that simplify the creation of efficient applications for CUDA-enabled GPGPUs
- Available for Linux, Windows and Mac OS X
INTEL MIC
TASK Parallelism (MIMD)
Xeon PHI Architecture
Core Architecture
The Increasing Parallelism
Execution Models: Offload Execution
The host system offloads part or all of the computation from one or multiple processes or threads running on the host. The application starts execution on the host; as the computation proceeds, it can decide to send data to the coprocessor and let it work on them, and the host and the coprocessor may or may not work in parallel.
The OpenMP 4.0 TR, being proposed and implemented in Intel Composer XE, provides directives to perform offload computations. Composer XE also provides some custom directives to perform offload operations (see the sketch below).
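A minimal sketch of the offload model using the OpenMP 4.0 target directives mentioned above (the array, its size and the loop body are illustrative assumptions; a compiler with offload support is assumed):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];

    for (int i = 0; i < N; ++i)
        a[i] = (double) i;

    /* Offload the loop to the coprocessor: data are mapped to the
       device, the loop runs there, and the results are mapped back. */
    #pragma omp target map(tofrom: a[0:N])
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 2.0 * a[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}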
Execution Models: Native Execution
A Xeon Phi hosts a Linux micro-OS and can appear as another machine connected to the host, like another node in a cluster. This execution environment allows the user to view the coprocessor as another compute node. In order to run natively, an application has to be cross-compiled for the Xeon Phi operating environment; Intel Composer XE provides a simple switch to generate cross-compiled code (see the sketch below).
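For native execution the code needs no offload directives at all: a plain OpenMP program like the sketch below is simply cross-compiled for the coprocessor and launched on the card itself. The compile line in the comment (icc -mmic) is given as an assumption about the Intel compiler switch, for illustration only:

#include <stdio.h>
#include <omp.h>

/* Cross-compile for the coprocessor, e.g. (assumed command line):
     icc -mmic -openmp hello_native.c -o hello.mic
   then copy the binary to the card and run it there.              */
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("Running natively with %d threads\n", omp_get_num_threads());
    }
    return 0;
}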
Execution Models: Symmetric Execution
The application processes run on both the host and the Phi coprocessor and communicate through some sort of message-passing interface, such as MPI. This execution environment treats the Xeon Phi card as just another node in a heterogeneous cluster (see the sketch below).
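A minimal MPI sketch of the symmetric model: the same source is built twice (once for the host, once for the coprocessor) and every rank simply reports where it runs. The idea that some ranks land on the host and others on the card is from the slide; the program itself is an illustrative assumption:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Some ranks run on the host CPU, others on the Xeon Phi card(s);
       MPI makes the heterogeneity transparent to the application.    */
    printf("Rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}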
Execution Models: Summary
Programming PHI
Heterogeneous Compiler
OpenCL - Open Computing Language
An open, royalty-free standard for cross-platform, heterogeneous parallel-computing systems.
Cross-platform: implementations exist for ATI GPUs, NVIDIA GPUs and x86 CPUs (see the sketch below).
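A small host-side sketch illustrating that cross-platform nature: the same query code enumerates whatever devices (ATI/NVIDIA GPUs, x86 CPUs) the installed OpenCL implementations expose. The fixed array sizes are illustrative assumptions:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint nplat = 0;

    clGetPlatformIDs(8, platforms, &nplat);

    for (cl_uint p = 0; p < nplat; ++p) {
        char pname[128];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);

        cl_device_id devices[8];
        cl_uint ndev = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

        for (cl_uint d = 0; d < ndev; ++d) {
            char dname[128];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
            /* The same binary can see CPUs and GPUs from different vendors */
            printf("Platform '%s' - device '%s'\n", pname, dname);
        }
    }
    return 0;
}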
[Figures: CPU & GPU, ~8 GBytes - the Intel Xeon E5-2665, Sandy Bridge-EP, 2.4 GHz]
Higher Aggregate Computational Power
Do we really... need it? ... have it available? Can we really exploit it?
Remember the key factors for performance: #operations per clock cycle x frequency x #cores. The DP power is drastically reduced if the compute capability is only partially exploited.
How much better is my GPU than my CPU? Can data movement from CPU to GPU and from GPU to CPU be reduced?
For general-purpose and scalable applications, both CPU and GPU must usually be exploited.
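Spelled out, the peak double-precision rate implied by those key factors, using the E5-2665 from the previous slides as an illustration (8 DP operations per cycle with AVX is an assumption about that core):

P_{\mathrm{peak}} = \frac{\mathrm{ops}}{\mathrm{cycle}} \times f \times N_{\mathrm{cores}} \approx 8 \times 2.4\,\mathrm{GHz} \times 8 \approx 153.6\ \mathrm{GFlop/s\ per\ socket}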
Conclusions
- A low number of applications and scientific codes are enabled for accelerators: some for GPU, few for the Intel Xeon Phi.
- For general DP-intensive applications the average speedup is a factor of between 2x and 3x, using two accelerators on top of the CPU platform.
- Fast GPU computing requires the technological background to exploit the available compute power and to manage the balance between CPU and GPU, along with the effort of system management.
25/05/2015 - 05/06/2015: Workshop on Accelerated High-Performance Computing in Computational Sciences (SMR 2760)