GPU System Architecture EPCC The University of Edinburgh
Outline
- Why do we want/need accelerators such as GPUs?
- GPU-CPU comparison: architectural reasons for GPU performance advantages
- GPU accelerated systems: from desktop to supercomputer
- Accelerator Technology: NVIDIA, AMD, Intel
- Example GPU machines
The power used by a CPU core is proportional to Clock Frequency × Voltage². In the past, computers got faster by increasing the frequency, while voltage was decreased to keep power reasonable. Now, voltage cannot be decreased any further: 1s and 0s in a system are represented by different voltage levels, and reducing the overall voltage further would shrink this difference to the point where 0s and 1s could no longer be reliably distinguished.
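The scaling argument above can be written out explicitly using the standard model of dynamic power in CMOS logic (the switched capacitance C is part of that standard model, not something stated on the slide):

```latex
P_{\mathrm{dynamic}} \;\propto\; C \cdot f \cdot V^{2}
```

Historically, raising the frequency f while lowering the voltage V kept the product bounded; once V is pinned at its minimum, power grows linearly with any further increase in f.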
Instead, performance increases can be achieved by exploiting parallelism. We need a chip which can perform many parallel operations every clock cycle (many cores and/or many operations per core), while keeping the power per core as low as possible. Much of the power expended by CPU cores goes on functionality that is not generally useful for HPC, such as branch prediction and out-of-order execution.
So, for HPC, we want chips with simple, low-power, number-crunching cores. But we need our machine to do other things as well as the number crunching: run an operating system, perform I/O, set up the calculation etc. Solution: a hybrid system containing both CPU and accelerator chips.
It costs a huge amount of money to design and fabricate new chips, which is not feasible for the relatively small HPC market. Luckily, over the last few years, Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market, and largely possess the right characteristics for HPC: many number-crunching cores. GPU vendors have tailored existing GPU architectures to the HPC market.
GPUs in HPC: in the November 2011 Top500 list, several of the leading systems were GPU-accelerated (and more planned GPUs for their next generation). GPUs are now firmly established in the HPC industry.
Outline
- Why do we want/need accelerators such as GPUs?
- GPU-CPU comparison: architectural reasons for GPU performance advantages
- GPU accelerated systems: from desktop to supercomputer
- Accelerator Technology: NVIDIA, AMD, Intel
- Example GPU machines
AMD 12-core CPU (die photo): one compute unit = one core. Not much of the CPU's die area is dedicated to compute.
NVIDIA Fermi GPU (die photo): one compute unit = one SM (= 32 CUDA cores). The GPU dedicates much more of its die area to compute, at the expense of caches, controllers, sophistication etc.
Memory: for many applications, performance is very sensitive to memory bandwidth. CPUs use DRAM; GPUs use graphics DRAM, which has much higher bandwidth than standard CPU memory.
GPU vs CPU: theoretical peak capabilities

                              NVIDIA Fermi     AMD Magny-Cours (6172)
  Cores                       448 (1.15 GHz)   12 (2.1 GHz)
  Operations/cycle (per core) 1                4
  DP performance (peak)       515 GFlops       101 GFlops
  Memory bandwidth (peak)     144 GB/s         27.5 GB/s

For these particular products, the GPU's theoretical advantage is ~5x for both compute and main memory bandwidth. Application performance very much depends on the application: it is typically a small fraction of peak, and depends on how well the application is suited to, and tuned for, the architecture.
Outline
- Why do we want/need accelerators such as GPUs?
- GPU-CPU comparison: architectural reasons for GPU performance advantages
- GPU accelerated systems: from desktop to supercomputer
- Accelerator Technology: NVIDIA, AMD, Intel
- Example GPU machines
GPUs cannot be used instead of CPUs; they must be used together, with the GPU acting as an accelerator responsible for the computationally expensive parts of the code. (Diagram: a CPU with DRAM and I/O, connected via PCIe to a GPU with GDRAM.)
Scaling to larger systems: an interconnect allows multiple nodes to be connected. Each workstation or shared-memory node can contain multiple CPUs and GPUs, e.g. 2 CPUs + 2 GPUs. The CPUs share memory, but the GPUs do not. (Diagram: two nodes, each containing CPUs with DRAM and I/O plus PCIe-attached GPUs with GDRAM, linked by the interconnect.)
GPU Accelerated Supercomputer (diagram: many GPU/CPU nodes connected together).
To utilise a GPU, programs must:
- contain parts targeted at the host CPU (most lines of source code)
- contain parts targeted at the GPU (the key computational kernels)
- manage data transfers between the distinct CPU and GPU memory spaces

Traditional languages (e.g. C/Fortran) do not provide these facilities. To run on multiple GPUs in parallel, one normally uses one host CPU core (thread) per GPU; the program manages communication between host CPUs in the same fashion as traditional parallel programs, e.g. MPI and/or OpenMP (the latter within a shared-memory node only).
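The three requirements above map directly onto the structure of a CUDA C program. A minimal hedged sketch (the kernel, array names and sizes are illustrative, not from the slides; it needs an NVIDIA GPU and the CUDA toolkit to run):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Part targeted at the GPU: the computational kernel */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                /* guard against excess threads */
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Part targeted at the host CPU: setup and orchestration */
    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    /* Manage the distinct GPU memory space explicitly */
    float *d_x;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    /* Launch the kernel: enough 256-thread blocks to cover n */
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);

    /* Copy the result back to the host memory space */
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    printf("h_x[0] = %f\n", h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}
```

For multiple GPUs, each MPI rank (or host thread) would run this pattern against its own device, selected with cudaSetDevice.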
Outline
- Why do we want/need accelerators such as GPUs?
- GPU-CPU comparison: architectural reasons for GPU performance advantages
- GPU accelerated systems: from desktop to supercomputer
- Accelerator Technology: NVIDIA, AMD, Intel
- Example GPU machines
Latest Technology
- NVIDIA Tesla: HPC-specific GPUs which have evolved from the GeForce series
- AMD FireStream: HPC-specific GPUs which have evolved from the (ATI) Radeon series
- Intel Knights Corner: a many-core x86 chip, like a hybrid between a GPU and a many-core CPU
NVIDIA Tesla Series GPU: the chip is partitioned into Streaming Multiprocessors (SMs), each containing multiple cores, with a sophisticated thread engine and a shared L2 cache.
NVIDIA GPU SM: fewer scheduling units than cores. Threads are scheduled in groups of 32, called a warp; threads within a warp always execute the same instruction in lock-step (on different data elements). Each SM has a configurable L1 cache / shared memory.
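One practical consequence of this lock-step execution: if threads within a warp take different branches, the warp runs both paths in turn with the non-participating threads masked off. A hedged CUDA C kernel fragment to illustrate (names and condition are illustrative only):

```c
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    /* If this condition differs between threads of the same 32-thread
       warp, the warp serialises: it executes the "then" path with some
       threads masked off, then the "else" path with the others masked. */
    if (x[i] > 0.0f)
        x[i] *= 2.0f;
    else
        x[i] = 0.0f;
}
```

Keeping threads of a warp on the same path (or making branches depend on the warp index rather than the thread index) avoids this serialisation.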
NVIDIA Tesla Series

                    Tesla 1060   Fermi 2050   Fermi 2070   Fermi 2090   Kepler K20*
  CUDA cores        240          448          448          512          1536
  SP performance    933 GFlops   1030 GFlops  1030 GFlops  1330 GFlops  ??
  DP performance    78 GFlops    515 GFlops   515 GFlops   665 GFlops   ??
  Memory bandwidth  102 GB/s     144 GB/s     144 GB/s     178 GB/s     ??
  Memory            4 GB         3 GB         6 GB         6 GB         ??

* not yet released; figures unconfirmed

Variety of packaging solutions.
NVIDIA Roadmap
Programming NVIDIA Tesla
- CUDA C: proprietary interface to the architecture; extensions to the C language which allow interfacing to the hardware. Now relatively mature, and supported by comprehensive documentation, user forums, etc. Fortran support is provided by the third-party PGI CUDA Fortran compiler.
- OpenCL: a cross-platform API, similar in design to CUDA but lower level and not as mature.
- Directives-based approaches: OpenMP-style directives abstract the complexities away from the programmer. Several products exist, with varying syntax etc., and they still lack maturity. The OpenMP committee is currently considering adopting accelerator support as part of the OpenMP standard, with participation from all main vendors plus others such as EPCC.
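As one representative of the directive-based approach, the sketch below uses OpenACC-style syntax (one of the products in this space; the slides note that other products used different but similar directives, so treat the exact spelling as illustrative):

```c
/* Directive-based GPU offload: the compiler generates the device
   kernel and the host<->device data movement from the annotation.
   The copy clause moves x to the GPU before the loop and back after. */
void scale(float *x, float a, int n)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] *= a;
}
```

The appeal is that the same source compiles unchanged on a plain CPU, with the directive simply ignored; the cost is less control over the generated kernel than hand-written CUDA.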
AMD FirePro: AMD acquired ATI in 2006; the AMD FirePro series is a derivative of the Radeon chips with HPC enhancements. FirePro W9000 (released Aug 2012): peak 1 TFLOPS double precision, 4 TFLOPS single precision; 6 GB GDDR5 SDRAM; now with full ECC support. Programming: OpenCL. Packaging: just cards for workstations at the moment; server solutions may follow.
Intel Xeon Phi: Intel Larrabee was "A Many-Core x86 Architecture for Visual Computing". Its release was delayed such that the chip missed its competitive window of opportunity, so Larrabee was not released as a competitive product but instead became a platform for research and development (Knights Ferry). The derivative Knights Corner chip uses the Many Integrated Cores (MIC) architecture, no longer aimed at the graphics market but instead at "Accelerating Science and Discovery", with more than 50 cores. Release expected in 2012; now branded as Intel Xeon Phi. A hybrid between a GPU and a many-core CPU.
Intel Xeon Phi
- CPU-like: x86 cores with the same instruction set as standard CPUs; a relatively small number of cores per chip; fully cache coherent; in principle could support an OS.
- GPU-like: simple cores lacking sophistication, e.g. no out-of-order execution; each core contains a 512-bit vector processing unit (16 SP or 8 DP numbers); supports multithreading; not expected to run an OS, but used as an accelerator (at least initially).
- Programming: possible to use standard models such as OpenMP, but additional directives are still needed to manage data.
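The "standard models plus extra data directives" combination looked roughly like the sketch below, using Intel's compiler-specific offload pragma together with ordinary OpenMP (the exact syntax varied across compiler versions, so treat this as an illustration rather than a definitive recipe):

```c
#include <omp.h>

void scale(float *x, float a, int n)
{
    /* Intel-specific directive ships the enclosed block to the
       coprocessor; the inout clause manages the distinct memory
       space by copying x over and back. */
    #pragma offload target(mic) inout(x : length(n))
    {
        /* Standard OpenMP parallelism then runs across the Phi's cores */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }
}
```

Note that only the data-management pragma is non-standard; the computational loop itself is the same OpenMP code that would run on a host CPU.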
Outline
- Why do we want/need accelerators such as GPUs?
- GPU-CPU comparison: architectural reasons for GPU performance advantages
- GPU accelerated systems: from desktop to supercomputer
- Accelerator Technology: NVIDIA, AMD, Intel
- Example GPU machines
DIY GPU Workstation: just slot a GPU card into a PCIe slot, making sure there is enough space and power in the workstation.
GPU Servers: several vendors offer GPU servers. Example configuration: 4 GPUs plus 2 (multi-core) CPUs. Multiple servers can be connected via an interconnect.
EPCC HECToR GPU Machine: EPCC owns a GPU system which is partly available to UK researchers, as part of the HECToR UK national service. It comprises 3 GPU servers, each with 4 NVIDIA Fermi GPUs, connected via an InfiniBand interconnect, plus another node with a single AMD FireStream GPU, and a front end with a single NVIDIA Fermi. More details at http://www.hector.ac.uk/howcan/admin/apply/HECToRGPU.php
Cray XK6: "The Cray XK6 supercomputer is a trifecta of scalar, network and many-core innovation. It combines Cray's proven Gemini interconnect, AMD's leading multi-core scalar processors and NVIDIA's powerful many-core GPU processors to create a true, productive hybrid supercomputer." Each compute node contains 1 CPU + 1 GPU; the system can scale up to 500,000 nodes. http://www.cray.com/products/xk6/kx6.aspx
Cray XK6 Compute Node characteristics:
- AMD Series 6200 (Interlagos) CPU
- NVIDIA Tesla X2090 GPU, attached via PCIe Gen2
- Host memory: 16 or 32 GB of 1600 MHz DDR3
- Tesla X2090 memory: 6 GB GDDR5 capacity
- Gemini high-speed interconnect
- Upgradeable to future GPUs
(Diagram labels: HT3 links; X, Y, Z interconnect directions.)
Cray XK6 Compute Blade: 4 compute nodes, i.e. 4 CPUs (middle) + 4 GPUs (right) + 2 interconnect chips (left); 2 compute nodes share a single interconnect chip.
Summary
- GPUs have higher compute and memory bandwidth capabilities than CPUs.
- GPUs are not used alone, but work in tandem with CPUs, leading to complications with task decomposition and memory management.
- GPU accelerated systems scale from simple workstations to large-scale supercomputers.
- Traditional languages are not enough to program such systems.
- Current/imminent accelerator products come from NVIDIA, AMD and Intel. NVIDIA are the market leaders at the moment, with the proprietary CUDA the most common programming method; AMD largely rely on the cross-platform OpenCL; Intel will enter the market later this year with a hybrid CPU/GPU product.