GPU System Architecture. Alan Gray EPCC The University of Edinburgh



Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems From desktop to supercomputer Accelerator Technology NVIDIA AMD Intel Example GPU machines


The power used by a CPU core is proportional to Clock Frequency × Voltage². In the past, computers got faster by increasing the frequency, while voltage was decreased to keep power reasonable. Now, voltage cannot be decreased any further: 1s and 0s in a system are represented by different voltages, and reducing the overall voltage further would shrink this difference to the point where 0s and 1s could no longer be reliably distinguished.
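The trade-off above can be made concrete with a small sketch (the function and scale factors are illustrative, not from the slides): with the P ∝ f × V² relation, raising frequency used to be almost free as long as voltage could drop, but with voltage fixed, power now grows linearly with frequency.

```python
# Illustrative sketch of the dynamic-power relation P ~ f * V^2.
def relative_power(freq_scale, voltage_scale):
    """Power relative to a baseline core, for scaled frequency and voltage."""
    return freq_scale * voltage_scale ** 2

# Historically, more frequency at lower voltage could even reduce power:
# 1.5x the clock at 0.8x the voltage costs 1.5 * 0.64 = 0.96x the power.
print(relative_power(1.5, 0.8))   # 0.96
# With voltage now fixed, 1.5x the clock simply costs 1.5x the power:
print(relative_power(1.5, 1.0))   # 1.5
```

This is why vendors stopped chasing clock speed and turned to parallelism instead, as the next slide argues.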

Instead, performance increases can be achieved by exploiting parallelism. We need a chip which can perform many parallel operations every clock cycle (many cores and/or many operations per core), while keeping the power per core as low as possible. Much of the power expended by CPU cores goes on functionality that is not generally useful for HPC, such as branch prediction and out-of-order execution.

So, for HPC, we want chips with simple, low-power, number-crunching cores. But we also need the machine to do other things besides number crunching: run an operating system, perform I/O, set up the calculation, etc. Solution: a hybrid system containing both CPU and accelerator chips.

It costs a huge amount of money to design and fabricate new chips, which is not feasible for the relatively small HPC market. Luckily, over the last few years, Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market, and they largely possess the right characteristics for HPC: many number-crunching cores. GPU vendors have since tailored existing GPU architectures to the HPC market.

GPUs in HPC. In the November 2011 Top500 list, a number of the top systems already use GPUs, and further next-generation systems will have them. GPUs are now firmly established in the HPC industry.

Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems From desktop to supercomputer Accelerator Technology NVIDIA AMD Intel Example GPU machines

AMD 12-core CPU (each compute unit = one core). Not much space on the CPU die is dedicated to compute.

NVIDIA Fermi GPU (each compute unit = one SM = 32 CUDA cores). The GPU dedicates much more die space to compute, at the expense of caches, controllers, and other sophistication.

Memory. For many applications, performance is very sensitive to memory bandwidth. CPUs use DRAM; GPUs use graphics DRAM (GDRAM), which has much higher bandwidth than standard CPU memory.

GPU vs CPU: theoretical peak capabilities

                          NVIDIA Fermi     AMD Magny-Cours (6172)
Cores                     448 (1.15 GHz)   12 (2.1 GHz)
Operations/cycle          1                4
DP performance (peak)     515 GFlops       101 GFlops
Memory bandwidth (peak)   144 GB/s         27.5 GB/s

For these particular products, the GPU's theoretical advantage is roughly 5x for both compute and main memory bandwidth. Actual application performance very much depends on the application: it is typically a small fraction of peak, and depends on how well the application is suited to, and tuned for, the architecture.
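The peak figures in the table follow directly from cores × clock × operations per cycle, which makes them easy to sanity-check (the helper function below is just for illustration):

```python
# Sanity check of the peak figures quoted in the table (values from the slide).
def peak_gflops(cores, ghz, ops_per_cycle):
    return cores * ghz * ops_per_cycle

fermi = peak_gflops(448, 1.15, 1)     # 448 cores * 1.15 GHz * 1 op/cycle
magny = peak_gflops(12, 2.1, 4)       # 12 cores * 2.1 GHz * 4 ops/cycle
print(round(fermi), round(magny))     # 515 101

# Both compute and memory bandwidth favour the GPU by roughly 5x:
print(round(fermi / magny, 1))        # ~5.1
print(round(144 / 27.5, 1))           # ~5.2
```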

Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems From desktop to supercomputer Accelerator Technology NVIDIA AMD Intel Example GPU machines

GPUs cannot be used instead of CPUs; they must be used together. The GPU acts as an accelerator, responsible for the computationally expensive parts of the code. [Diagram: CPU with attached DRAM, connected over PCIe to a GPU with attached GDRAM.]

Scaling to larger systems. An interconnect allows multiple nodes to be connected. Each workstation or shared-memory node can contain multiple CPUs and GPUs, e.g. 2 CPUs + 2 GPUs. The CPUs share memory, but the GPUs do not. [Diagram: two CPU+GPU nodes, each GPU with its own GDRAM attached over PCIe, linked by the interconnect.]

GPU Accelerated Supercomputer. [Diagram: many GPU/CPU nodes connected together.]

To utilise a GPU, programs must contain parts targeted at the host CPU (most lines of source code), contain parts targeted at the GPU (the key computational kernels), and manage data transfers between the distinct CPU and GPU memory spaces. Traditional languages (e.g. C or Fortran) do not provide these facilities. To run on multiple GPUs in parallel, one host CPU core (thread) is normally used per GPU, and the program manages communication between the host CPUs in the same fashion as traditional parallel programs, e.g. with MPI and/or OpenMP (the latter within a shared-memory node only).
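The offload pattern described above can be sketched as a toy model. This is not a real GPU API: the dictionary and the three helper functions are hypothetical stand-ins that merely mimic the typical sequence of allocate/copy-in, kernel launch, and copy-out across the separate memory spaces.

```python
# Toy model of the CPU-GPU offload pattern: "device_mem" stands in for GDRAM.
device_mem = {}

def copy_to_device(name, host_data):
    device_mem[name] = list(host_data)                    # host -> GDRAM over PCIe

def launch_kernel(fn, name):
    device_mem[name] = [fn(x) for x in device_mem[name]]  # runs "on the GPU"

def copy_from_device(name):
    return list(device_mem[name])                         # GDRAM -> host

# Host code: most lines run on the CPU; only the kernel is offloaded.
host_data = [1.0, 2.0, 3.0]
copy_to_device("a", host_data)
launch_kernel(lambda x: 2.0 * x, "a")                     # computational kernel
result = copy_from_device("a")
print(result)                                             # [2.0, 4.0, 6.0]
```

In a real CUDA or OpenCL program the same three phases appear, but the transfers are explicit API calls and the kernel runs on thousands of threads at once.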

Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems From desktop to supercomputer Accelerator Technology NVIDIA AMD Intel Example GPU machines

Latest Technology. NVIDIA: Tesla HPC-specific GPUs have evolved from the GeForce series. AMD: FireStream HPC-specific GPUs have evolved from the (ATI) Radeon series. Intel: the Knights Corner many-core x86 chip is like a hybrid between a GPU and a many-core CPU.

NVIDIA Tesla Series GPU. The chip is partitioned into Streaming Multiprocessors (SMs), with multiple cores per SM, a sophisticated thread engine, and a shared L2 cache.

NVIDIA GPU SM. There are fewer scheduling units than cores: threads are scheduled in groups of 32, called a warp. Threads within a warp always execute the same instruction in lock-step (on different data elements). Each SM has a configurable L1 cache / shared memory.
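The lock-step (SIMT) behaviour can be illustrated with a simplified model, assuming one instruction issued per warp per step (the function and names below are illustrative, not NVIDIA's API):

```python
# Simplified SIMT model: a warp of 32 threads executes the same
# instruction in lock-step, each thread on its own data element.
WARP_SIZE = 32

def execute_warp(instruction, data):
    """Apply one instruction across all 32 lanes of a warp at once."""
    assert len(data) == WARP_SIZE
    return [instruction(x) for x in data]   # one instruction, 32 data elements

data = list(range(WARP_SIZE))
result = execute_warp(lambda x: x * x, data)
print(result[:4])   # [0, 1, 4, 9] -- the same square instruction on every lane
```

This is also why per-thread branching is costly on a GPU: if threads in a warp take different branches, the warp must execute both paths, masking lanes off in turn.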

NVIDIA Tesla Series

                   Tesla 1060   Fermi 2050    Fermi 2070    Fermi 2090    Kepler K20 (not yet released)
CUDA cores         240          448           448           512           1536
SP performance     933 GFlops   1030 GFlops   1030 GFlops   1330 GFlops   ??
DP performance     78 GFlops    515 GFlops    515 GFlops    665 GFlops    ??
Memory bandwidth   102 GB/s     144 GB/s      144 GB/s      178 GB/s      ??
Memory             4 GB         3 GB          6 GB          6 GB          ??

A variety of packaging solutions is available.

NVIDIA Roadmap

Programming NVIDIA Tesla.
CUDA C: proprietary interface to the architecture; extensions to the C language which allow interfacing to the hardware. Now relatively mature, and supported by comprehensive documentation, user forums, etc. Fortran support is provided by the third-party PGI CUDA Fortran compiler.
OpenCL: cross-platform API, similar in design to CUDA, but lower level and not as mature.
Directives-based approaches: OpenMP-style directives abstract the complexities away from the programmer. Several products exist with varying syntax, and they still lack maturity. The OpenMP committee is currently considering adopting accelerators as part of the OpenMP standard, with participation from all the main vendors plus others such as EPCC.

AMD FirePro. AMD acquired ATI in 2006. The AMD FirePro series is a derivative of the Radeon chips with HPC enhancements. The FirePro W9000, released in August 2012: peak 1 TFlops (double precision), peak 4 TFlops (single precision), 6 GB GDDR5 SDRAM, now with full ECC support. Programming: OpenCL. Packaging: just cards for workstations at the moment; server solutions may follow.

Intel Xeon Phi. Intel Larrabee was "A Many-Core x86 Architecture for Visual Computing". Its release was delayed such that the chip missed its competitive window of opportunity, so Larrabee was not released as a competitive product, but instead became a platform for research and development (Knights Ferry). The Knights Corner derivative chip uses the Many Integrated Cores (MIC) architecture. It is no longer aimed at the graphics market, but instead at "Accelerating Science and Discovery", with more than 50 cores. Release is expected in 2012, now branded as Intel Xeon Phi: a hybrid between a GPU and a many-core CPU.

Intel Xeon Phi. CPU-like: x86 cores with the same instruction set as standard CPUs, a relatively small number of cores per chip, fully cache coherent, and in principle able to support an OS. GPU-like: simple cores lacking sophistication (e.g. no out-of-order execution), each containing a 512-bit vector processing unit (16 SP or 8 DP numbers), with support for multithreading; not expected to run an OS, but used as an accelerator (at least initially). Programming: it is possible to use standard models such as OpenMP, but additional directives are still needed to manage data.
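The 16 SP / 8 DP figures follow directly from the vector width: a 512-bit register holds 512/32 single-precision or 512/64 double-precision values, as this one-liner check shows:

```python
# Lane counts implied by a 512-bit vector processing unit.
VECTOR_BITS = 512
sp_lanes = VECTOR_BITS // 32   # single precision floats are 32 bits wide
dp_lanes = VECTOR_BITS // 64   # double precision floats are 64 bits wide
print(sp_lanes, dp_lanes)      # 16 8
```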

Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems From desktop to supercomputer Accelerator Technology NVIDIA AMD Intel Example GPU machines

DIY GPU Workstation. Just slot the GPU card into a PCIe slot, making sure there is enough space and power in the workstation.

GPU Servers. Several vendors offer GPU servers. Example configuration: 4 GPUs plus 2 (multi-core) CPUs. Multiple servers can be connected via an interconnect.

EPCC HECToR GPU Machine. EPCC owns a GPU system which is partly available to UK researchers as part of the HECToR UK National Service. It comprises 3 GPU servers, each with 4 NVIDIA Fermi GPUs, connected via an InfiniBand interconnect, plus another node with a single AMD FireStream GPU, plus a front end with a single NVIDIA Fermi. More details at http://www.hector.ac.uk/howcan/admin/apply/HECToRGPU.php

Cray XK6. "The Cray XK6 supercomputer is a trifecta of scalar, network and many-core innovation. It combines Cray's proven Gemini interconnect, AMD's leading multi-core scalar processors and NVIDIA's powerful many-core GPU processors to create a true, productive hybrid supercomputer." Each compute node contains 1 CPU + 1 GPU, and the system can scale up to 500,000 nodes. http://www.cray.com/products/xk6/kx6.aspx

Cray XK6 Compute Node Characteristics: AMD Series 6200 (Interlagos) CPU; NVIDIA Tesla X2090 GPU attached via PCIe Gen2; 16 or 32 GB of 1600 MHz DDR3 host memory; 6 GB GDDR5 NVIDIA Tesla X2090 memory; Gemini high-speed interconnect; upgradeable to future GPUs.

Cray XK6 Compute Blade. A compute blade holds 4 compute nodes: 4 CPUs (middle) + 4 GPUs (right) + 2 interconnect chips (left), with 2 compute nodes sharing a single interconnect chip.

Summary. GPUs have higher compute and memory bandwidth capabilities than CPUs. GPUs are not used alone, but work in tandem with CPUs, leading to complications with task decomposition and memory management. GPU-accelerated systems scale from simple workstations to large-scale supercomputers, and traditional languages are not enough to program such systems. Current and imminent accelerator products come from NVIDIA, AMD and Intel: NVIDIA are the market leaders at the moment, with the proprietary CUDA the most common programming method; AMD largely rely on the cross-platform OpenCL; Intel will enter the market later this year with a hybrid CPU/GPU product.