Overview on Modern Accelerators and Programming Paradigms, Ivan Girotto


1 Overview on Modern Accelerators and Programming Paradigms
Ivan Girotto
Information & Communication Technology Section (ICTS), International Centre for Theoretical Physics (ICTP)


2 Multiple Socket CPUs + Accelerators

3 Accelerated Co-Processors
- A set of simplified execution units that can perform few operations (with respect to a standard CPU) with very high efficiency.
- When combined with a full-featured CPU, they can accelerate the nominal speed of a system.
- [Diagram: CPU (single-thread performance) vs. ACC (throughput); CPU & ACC: architectural integration, physical integration]
- Main approaches to accelerators:
  - Task parallelism (MIMD) -> MIC
  - Data parallelism (SIMD) -> GPU

4 The General Concept of Accelerated Computing

5 [Diagram: data flow between host (CPU) and device (GPU)]
Host memory (CPU): ~ 30/40 GBytes. Device memory (GPU): ~ 110/120 GByte.
1. Copy data
2. Launch kernel
3. Execute GPU kernel
4. Copy result

6 NVIDIA GPU

7 Why Does GPU Accelerate Computing?
- Highly scalable design
- Higher aggregate memory bandwidth
- Huge number of low-frequency cores
- Higher aggregate computational power
- Massively parallel processors for data processing


9 SMX Processor & Warp Scheduler & Core

10 Why Does GPU Not Accelerate Computing?
- PCI bus bottleneck
- Synchronization weakness
- Extremely slow serialized execution
- High complexity: SPMD(T) + SIMD & memory model
- People forget about Amdahl's law: when only 50% of the original code is accelerated, the expected speedup can be at most 2!
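
The Amdahl's law bound quoted on this slide can be checked with a few lines of C++. This is only a minimal sketch: the function names `amdahl_speedup` and `amdahl_limit` are hypothetical and not part of the original slides. Here `p` is the fraction of the original run time that is accelerated and `s` is the speedup obtained on that fraction.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `p` of the original run time
// is accelerated by a factor `s`, while the remaining (1 - p) is unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0);  // p is a fraction of the run time
    assert(s > 0.0);               // s is the speedup of the accelerated part
    return 1.0 / ((1.0 - p) + p / s);
}

// Upper bound on the overall speedup as `s` grows without limit: 1 / (1 - p).
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);   // p == 1 would mean an unbounded speedup
    return 1.0 / (1.0 - p);
}
```

With `p = 0.5` the limit is exactly 2, which matches the slide: even an accelerator that makes the offloaded half 10 times faster gives an overall speedup of about 1.82.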


11 What is CUDA?
- NVIDIA compute architecture
- Quickly maturing software development capability provided free of charge by NVIDIA
- C and C++ programming language extension that simplifies creation of efficient applications for CUDA-enabled GPGPUs
- Available for Linux, Windows and Mac OS X


13 INTEL MIC

14 TASK Parallelism (MIMD)

15 Xeon PHI Architecture

16 Core Architecture

17 The Increasing Parallelism

18 Execution Models: Offload Execution
- The host system offloads part or all of the computation from one or multiple processes or threads running on the host.
- The application starts execution on the host.
- As the computation proceeds, it can decide to send data to the coprocessor and let the coprocessor work on it; the host and the coprocessor may or may not work in parallel.
- The OpenMP 4.0 TR, being proposed and implemented in Intel Composer XE, provides directives to perform offload computations.
- Composer XE also provides some custom directives to perform offload operations.
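
The offload model described above can be sketched with the standard OpenMP 4.0 `target` construct. The following is an illustrative example, not code from the original slides: the function name `saxpy_offload` is hypothetical, and the sketch assumes a compiler with OpenMP 4.0 offload support. Without such support the pragma is ignored or the region falls back to the host, and the loop still gives the same result.

```cpp
#include <cassert>
#include <vector>

// y = a * x + y, with the loop marked for offload to an accelerator.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();  // raw pointers are needed for array sections
    float* yp = y.data();

    // map(to: ...)     copies x from host to device before the region starts
    // map(tofrom: ...) copies y to the device and back to the host afterwards
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The `map` clauses make the data movement between host and coprocessor explicit, which is where the transfer cost of the offload model shows up.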


19 Execution Models: Native Execution
- A Xeon Phi hosts a Linux micro OS and can appear as another machine connected to the host, like another node in a cluster.
- This execution environment allows users to view the coprocessor as another compute node.
- In order to run natively, an application has to be cross-compiled for the Xeon Phi operating environment.
- Intel Composer XE provides a simple switch to generate cross-compiled code.


20 Execution Models: Symmetric Execution
- The application processes run on both the host and the Phi coprocessor and communicate through a message-passing interface such as MPI.
- This execution environment treats the Xeon Phi card as another node in a heterogeneous cluster environment.


21 Execution Models: Summary

22 Programming PHI

23 Heterogeneous Compiler

24 OpenCL: Open Computing Language
- Open, royalty-free standard for heterogeneous parallel-computing systems
- Cross-platform: implementations for ATI GPUs, NVIDIA GPUs, x86 CPUs

25 CPU & GPU ~ 8 GBytes. The Intel Xeon E Sandy Bridge-EP 2.4 GHz

26 CPU & GPU ~ 8 GBytes. The Intel Xeon E Sandy Bridge-EP 2.4 GHz

27 CPU & GPU ~ 8 GBytes. The Intel Xeon E Sandy Bridge-EP 2.4 GHz

28 Higher aggregate computational power
- Do we really... need it? ... have it available?
- Can we really exploit it?
- Remember the key factors for performance: #operations per clock cycle x frequency x #cores
- The DP power is drastically reduced if the compute capability is only partially exploited
- How much is my GPU better than my CPU?
- Can data movement from CPU to GPU and from GPU to CPU be reduced?
- For general-purpose and scalable applications, both CPU and GPU must usually be exploited
17/07/2014, Ivan Girotto, Indian Institute of Technology (IIT) Bombay, Mumbai (India)
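
The performance formula on this slide (#operations per clock cycle x frequency x #cores) can be turned into a small calculation. This is a sketch with a hypothetical function name, `peak_gflops`, and illustrative inputs: the 2.4 GHz clock comes from the slides above, while 8 double-precision operations per cycle (full AVX use) and 8 cores are assumptions, since the slides do not state the exact CPU model.

```cpp
#include <cassert>

// Theoretical peak in GFLOP/s:
//   (floating-point operations per clock cycle per core)
//   x (clock frequency in GHz) x (number of cores)
double peak_gflops(double flops_per_cycle, double freq_ghz, int cores) {
    assert(flops_per_cycle > 0.0);
    assert(freq_ghz > 0.0);
    assert(cores > 0);
    return flops_per_cycle * freq_ghz * cores;
}
```

Under these assumptions, `peak_gflops(8.0, 2.4, 8)` gives about 153.6 GFLOP/s, while a code that only sustains 2 operations per cycle (no vectorization) reaches about 38.4 GFLOP/s on the same hardware. That factor of 4 is an example of the double-precision power being drastically reduced when the compute capability is only partially exploited.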

29 Conclusions
- A low number of applications and scientific codes are enabled for accelerators: some for GPU, few for Intel Xeon Phi.
- For general DP-intensive applications, the average speedup is between 2x and 3x using two accelerators on top of the CPU platform.
- Fast GPU computing requires the technological background to exploit the available compute power and to manage the balance between CPU and GPU, along with the effort for system management.

30 25/05/ /06/2015 WORKSHOP ON ACCELERATED HIGH-PERFORMANCE COMPUTING IN COMPUTATIONAL SCIENCES (SMR 2760)
