GPUs for Scientific Computing




Mike Giles
mike.giles@maths.ox.ac.uk

Oxford-Man Institute of Quantitative Finance
Oxford University Mathematical Institute
Oxford e-Research Centre

NAG AGM, Sept 25th, 2009

Computing: Recent Past

- driven by the cost benefits of massive economies of scale, specialised chips (e.g. CRAY vector chips) died out, leaving Intel/AMD dominant
- Intel/AMD chips designed for office/domestic use, not for high performance computing
- increased speed through higher clock frequencies, and complex parallelism within each CPU
- PC clusters provided the high-end compute power, initially in universities and then in industry
- at the same time, NVIDIA and ATI grew big on graphics chip sales driven by computer games

Computing: Present/Future

- move to faster clock frequencies stopped due to high power consumption
- big push now is to multicore at (slightly) reduced clock frequencies
- graphics chips have even more cores (up to 240)
- big new development here is a more general purpose programming environment

Why? Initially, because computer games do increasing amounts of physics simulation on the GPU, but branching out into other areas.

CPUs vs GPUs

- CPUs have up to 6 cores on a single chip, and 10-20 GB/s bandwidth to main system memory
- NVIDIA GPUs have up to 240 cores on a single chip, and 100+ GB/s bandwidth to graphics memory
  - offer 50-100× speedup relative to a single CPU core
  - roughly 10× speedup relative to two quad-core Xeons (if not using SSE instructions)
  - also 10× improvement in price/performance and energy efficiency

How is this possible? Logically simpler cores (SIMD, no out-of-order execution, minimal caching): suitable for vector computing, not for general purpose computing.
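
The two speedup figures on this slide can be checked against each other. The sketch below assumes, purely for illustration, that the CPU code scales perfectly across all eight cores of two quad-core Xeons; the slide does not say how well the CPU code actually scales.

```python
# Figures quoted on the slide.
gpu_vs_one_core = (50, 100)   # GPU speedup relative to a single CPU core
cpu_cores = 2 * 4             # two quad-core Xeons


def speedup_vs_multicore(speedup_vs_one_core, cores):
    """GPU speedup over `cores` CPU cores, ASSUMING perfect CPU scaling."""
    return speedup_vs_one_core / cores


low, high = (speedup_vs_multicore(s, cpu_cores) for s in gpu_vs_one_core)
print(f"GPU vs {cpu_cores} CPU cores: {low:.2f}x to {high:.2f}x")
```

Under that assumption the range is 6.25× to 12.5×, which brackets the "roughly 10×" quoted for two quad-core Xeons.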

GPUs

Is this sustainable? Yes!

- IBM, AMD and Intel all producing GPUs
- NVIDIA has good head start on software side with CUDA environment
- new OpenCL software standard (based on CUDA and pushed by Apple) will probably run on all platforms
- driving applications are:
  - computer games physics
  - video (e.g. HD video decoding)
  - computational science
  - computational finance
  - oil and gas

GPUs

- NVIDIA:
  - GTX graphics cards for consumers, £100-400
  - C1060 for scientific computing: no graphics output, but 4GB of graphics memory, and costs about £800
  - next-generation chip due soon
- IBM:
  - Cell processor is expensive and hard to program
  - future development may have been cancelled
- AMD:
  - has good hardware, but poor software
  - will support the new OpenCL standard
- Intel:
  - developing its own GPU, code-named Larrabee
  - 2nd-gen in 2011 will have compute capability

NVIDIA GPUs

- basic building block is a multiprocessor with 8 cores, 16384 registers and a small amount of shared memory
- different chips have different numbers of these:

  product    multiprocs   bandwidth     cost
  GTX 260    27           112GB/s       £140
  GTX 295    2×30         2×112GB/s     £360

- each card also has 1-4GB graphics memory for:
  - global memory accessible by all multiprocessors
  - special read-only constant memory
  - additional local memory for each multiprocessor
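
As a quick sanity check, the multiprocessor counts on this slide can be turned into core counts. This is a minimal sketch using only the figures quoted above; the dictionary layout and function names are illustrative.

```python
CORES_PER_MULTIPROCESSOR = 8
REGISTERS_PER_MULTIPROCESSOR = 16384

# product -> (GPUs on the card, multiprocessors per GPU), as quoted on the slide
products = {"GTX 260": (1, 27), "GTX 295": (2, 30)}


def cores_per_gpu(multiprocessors):
    """Number of cores on one GPU chip."""
    return multiprocessors * CORES_PER_MULTIPROCESSOR


def cores_per_card(gpus, multiprocessors):
    """Number of cores on the whole card (a GTX 295 carries two GPUs)."""
    return gpus * cores_per_gpu(multiprocessors)


for name, (gpus, mps) in products.items():
    print(name, cores_per_gpu(mps), "cores per GPU,",
          cores_per_card(gpus, mps), "cores per card")

# Registers available per core if they were shared out evenly.
registers_per_core = REGISTERS_PER_MULTIPROCESSOR // CORES_PER_MULTIPROCESSOR
```

The 30-multiprocessor chip gives 30 × 8 = 240 cores, matching the "up to 240 cores" quoted earlier in the talk.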

NVIDIA GPUs

The 8 cores in a multiprocessor are SIMD (Single Instruction Multiple Data) cores:

- all cores execute the same instructions simultaneously
- vector style of programming harks back to CRAY vector supercomputing
- natural for graphics processing and much scientific computing
- SIMD is also a natural choice for massively multicore, to simplify each core
- requires specialised programming (no standard)
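
The SIMD idea can be illustrated with a small pure-Python emulation: one instruction is applied to every lane's data in lockstep. This is only a conceptual sketch, not GPU code; `simd_execute` and the lane count of 8 (one lane per core) are illustrative choices.

```python
LANES = 8  # one lane per core in a multiprocessor


def simd_execute(instruction, *operand_vectors):
    """Apply the SAME instruction to each lane's operands.

    On real SIMD hardware the lanes run simultaneously; here they are
    evaluated one after another, which gives the same result.
    """
    return [instruction(*operands) for operands in zip(*operand_vectors)]


x = list(range(LANES))   # different data in each lane
y = [10] * LANES

# Every lane executes the identical operation 2*a + b on its own data.
result = simd_execute(lambda a, b: 2 * a + b, x, y)
print(result)  # [10, 12, 14, 16, 18, 20, 22, 24]
```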

CUDA programming

CUDA is NVIDIA's program development environment:

- based on C with some extensions
- lots of example code and good documentation
- 2-4 week learning curve for those with experience of OpenMP and MPI programming
- growing user community active on NVIDIA forum
- main process runs on host system (Intel/AMD CPU) and launches multiple copies of kernel process on graphics card
- communication is through data transfers to/from graphics memory
- minimum of 4 threads per core, but more is better
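
The host/kernel split described above can be emulated in plain Python. This is a sketch of the execution model only, not CUDA code and not the CUDA API: `launch`, `saxpy_kernel` and the `device_*` lists are hypothetical stand-ins, and the kernel copies run sequentially here rather than concurrently.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """One copy of the kernel: handles the single element owned by this thread."""
    if thread_id < len(x):          # guard: more threads than data is allowed
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, n_threads, *args):
    """Host-side stand-in for a kernel launch: one kernel copy per thread id."""
    for thread_id in range(n_threads):   # sequential here, concurrent on a GPU
        kernel(thread_id, *args)


# Host code: copy inputs "to graphics memory", launch, copy the result back.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * len(x)
device_x, device_y = list(x), list(y)    # stand-in for host-to-device transfer
device_out = [0.0] * len(x)

launch(saxpy_kernel, 8, 2.0, device_x, device_y, device_out)

result = list(device_out)                # stand-in for device-to-host transfer
print(result)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```

Each kernel copy uses its thread index to pick the data it works on, which is why the same code can be launched many times over.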

CUDA programming

How hard is it to program? Needs combination of skills:

- splitting the application between the multiple multiprocessors is similar to MPI programming, but no need to split data: it all resides in main graphics memory
- SIMD CUDA programming within each multiprocessor is a bit like OpenMP programming
- needs a good understanding of memory operation (but that's becoming less critical over time)
- difficulty also depends a lot on application

NVIDIA multithreading

Lots of active threads is the key to high performance:

- no "context switching"; each thread has its own registers, which limits the number of active threads
- threads execute in "warps" of 32 threads per multiprocessor (4 per core)
- execution alternates between "active" warps, with warps becoming temporarily "inactive" when waiting for data
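
Because each thread keeps its own registers, the register file bounds how many threads can be active at once. The sketch below works that bound out from the 16384 registers per multiprocessor quoted earlier. The registers-per-thread values are hypothetical inputs, and real hardware imposes further limits that the slides do not cover.

```python
REGISTERS_PER_MULTIPROCESSOR = 16384
WARP_SIZE = 32
CORES_PER_MULTIPROCESSOR = 8


def max_active_threads(registers_per_thread):
    """Register-limited bound on active threads, rounded down to whole warps."""
    threads = REGISTERS_PER_MULTIPROCESSOR // registers_per_thread
    return (threads // WARP_SIZE) * WARP_SIZE


# Threads of one warp handled by each core: 32 / 8 = 4, as on the slide.
threads_per_core_per_warp = WARP_SIZE // CORES_PER_MULTIPROCESSOR

for regs in (16, 32, 64):   # hypothetical register usage per thread
    threads = max_active_threads(regs)
    print(f"{regs} registers/thread -> {threads} threads "
          f"({threads // WARP_SIZE} warps)")
```

A kernel that needs fewer registers per thread leaves room for more active warps, which is what lets the hardware keep working while other warps wait for data.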

NVIDIA multithreading

- for each thread, one operation completes long before the next starts: avoids the complexity of pipeline overlaps which can limit the performance of modern processors

  [Figure: operations 1 2 3 4 5, repeated three times, laid out along a time axis]

- memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations and can be managed by the compiler
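
The "10-15 operations" figure follows from dividing the memory delay by the number of threads taking turns. The sketch below assumes, for illustration, that one operation is issued per cycle and that the active threads are served in strict rotation.

```python
def ops_per_thread_during_latency(latency_cycles, active_threads):
    """Operations one thread would issue while a memory access is outstanding.

    ASSUMES one operation issued per cycle, shared round-robin among the
    active threads, so each thread gets one slot every `active_threads` cycles.
    """
    return latency_cycles / active_threads


low = ops_per_thread_during_latency(400, 40)
high = ops_per_thread_during_latency(600, 40)
print(f"{low:.0f} to {high:.0f} operations per thread")
```

So a thread needs only 10 to 15 independent operations to cover the delay, rather than 400 to 600.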

My experience

- Random number generation (mrg32k3a/normal):
  - 2500M values/sec on GTX 280
  - 70M values/sec/core on Xeon using Intel's VSL
- LIBOR Monte Carlo testcase:
  - 180x speedup on GTX 280 compared to single thread on Xeon
- 3D PDE application:
  - factor 50x speedup on GTX 280 compared to single thread on Xeon
  - factor 10x speedup compared to two quad-core Xeons

GPU results are all single precision; double precision is currently 2-4 times slower (no more than a factor of 2 in future?)
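
The benchmark figures above can be combined into a few derived ratios. The sketch below uses only numbers from the slide; the implied CPU scaling and efficiency are inferences from those numbers, not results reported in the talk.

```python
# Random number generation throughput, values per second.
gpu_rng_rate = 2500e6            # GTX 280
cpu_rng_rate_per_core = 70e6     # one Xeon core with Intel's VSL
rng_speedup_per_core = gpu_rng_rate / cpu_rng_rate_per_core

# 3D PDE application speedups on GTX 280.
pde_speedup_vs_one_thread = 50
pde_speedup_vs_eight_cores = 10
cpu_cores = 2 * 4                # two quad-core Xeons

# Inference: how much faster the 8-core CPU run must be than one thread
# for both quoted speedups to hold.
implied_cpu_scaling = pde_speedup_vs_one_thread / pde_speedup_vs_eight_cores
implied_cpu_efficiency = implied_cpu_scaling / cpu_cores

print(f"RNG: GPU is about {rng_speedup_per_core:.1f}x one Xeon core")
print(f"PDE: 8 CPU cores give about {implied_cpu_scaling:.0f}x over one thread "
      f"({implied_cpu_efficiency:.1%} parallel efficiency)")
```

For the random number generator the GPU is worth roughly 36 Xeon cores. For the PDE code, the two quoted speedups together imply that the eight-core CPU run was about 5× faster than a single thread.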

NAG Numerical Routines for GPUs

Random number generation:

- pseudo-random (L'Ecuyer's mrg32k3a) and quasi-random (Sobol) generation
- uniform, exponential, Normal and gamma output distributions
- available free to academics under a development license
- being assessed by a number of banks
- lots of interest at conferences and industry events

Other capabilities will be added in the future.

What is needed now?

- Skilled manpower, training:
  - 50+ on Oxford CUDA mailing list: students and post-docs in almost all science departments
  - 1-week CUDA course this summer
  - in 3 years' time, many PhDs in computational science will have these skills
- More development of libraries, high-level packages:
  - Monte Carlo simulation
  - PDE solvers
  - ...

Further information

- LIBOR and finite difference test codes: www.maths.ox.ac.uk/~gilesm/hpc/
- NAG Numerical Routines for GPUs: www.nag.co.uk/numeric/gpus/
- NVIDIA's CUDA homepage: www.nvidia.com/object/cuda_home.html