HPC with Multicore and GPUs



HPC with Multicore and GPUs
Stan Tomov
Electrical Engineering and Computer Science Department, University of Tennessee, Knoxville
CS 594 Lecture Notes, March 4, 2015 1/18

Outline
- Introduction - hardware trends
- Challenges of using multicore+GPUs
- How to code for GPUs and multicore - an approach that we will study
- Introduction to CUDA and the cs594 project/library
- Conclusions 2/18

Speeding up Computer Simulations
- Better numerical methods, e.g. a posteriori error analysis: solving for far fewer DOF but achieving the same accuracy (http://www.cs.utk.edu/~tomov/cflow/)
- Exploit advances in hardware: manage to use hardware efficiently for real-world HPC applications; match the LU benchmark in performance! 3/18

Why multicore and GPUs? Hardware trends
Power is the root cause of all this!
- Multicore (Source: slide from Kathy Yelick)
- GPU Accelerators (Source: NVIDIA's GT200: Inside a Parallel Processor) 4/18

Main Issues
- Increase in parallelism [*1]: How to code (programming model, language, productivity, etc.)? Despite the issues, high speedups on HPC applications are reported using GPUs (from the NVIDIA CUDA Zone homepage)
- Increase in communication cost (vs computation) [*2]: How to redesign algorithms?
- Hybrid computing [*3]: How to split and schedule the computation between hybrid hardware components?
[*1] CUDA architecture & programming: a data-parallel approach that scales; similar amounts of effort on using CPUs vs GPUs by domain scientists demonstrate the GPUs' potential
[*2] Processor speed improves 59%/year, but memory bandwidth only by 23% and latency by 5.5%
[*3] e.g., schedule small non-parallelizable tasks on the CPU, and large, parallelizable ones on the GPU 5/18

Evolution of GPUs
GPUs excel in graphics rendering: scene model -> streams of data -> graphics pipelined computation -> final image, repeated fast over and over (e.g. TV refresh rate is 30 fps; the limit is 60 fps). This type of computation:
- Requires enormous computational power
- Allows for high parallelism
- Needs high bandwidth vs low latency
- Currently, GPUs can be viewed as multithreaded multicore vector units (as low latencies can be compensated for with a deep graphics pipeline)
Obviously, this pattern of computation is common to many other applications 6/18

Challenges of using multicore+GPUs
- Massive parallelism: many GPU cores, serial kernel execution [e.g. 240 cores in the GTX 280; up to 512 in Fermi, which adds concurrent kernel execution]
- Hybrid/heterogeneous architectures: match algorithmic requirements to architectural strengths [e.g. small, non-parallelizable tasks run on the CPU; large, parallelizable ones on the GPU]
- Compute vs communication gap: exponentially growing; a persistent challenge [processor speed improves 59%/year, memory bandwidth 23%, latency 5.5%] [on all levels, e.g. a Tesla S1070 (4 x C1060) has compute power of O(1,000) Gflop/s, but the GPUs communicate through the CPU over an O(1) GB/s connection] 7/18

How to Code for GPUs?
- Complex question: language, programming model, user productivity, etc.
- Recommendations:
  - Use CUDA / OpenCL [already demonstrated benefits in many areas; data-based parallelism; moving to support task-based]
  - Use GPU BLAS [high level; available since the introduction of shared memory enabled data reuse; leverage existing developments]
  - Use hybrid algorithms [currently GPUs offer massive parallelism but serial kernel execution; the hybrid approach runs small non-parallelizable tasks on the CPU and large parallelizable tasks on the GPU]
[Figures: GPU vs CPU GEMM and GPU vs CPU GEMV performance (SGEMM/DGEMM, SGEMV/DGEMV), GFlop/s vs matrix size]
Typical order of acceleration: dense matrix-matrix O(1)x, dense matrix-vector O(10)x, sparse matrix-vector O(100)x 8/18

An approach for multicore+GPUs
- Split algorithms into tasks and dependencies between them, e.g., represented as DAGs
- Schedule the execution in parallel without violating data dependencies
Algorithms as DAGs (small tasks/tiles for homogeneous multicore), e.g., in the PLASMA library for Dense Linear Algebra: http://icl.cs.utk.edu/plasma/
Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs), e.g., in the MAGMA library for Dense Linear Algebra: http://icl.cs.utk.edu/magma/ 9/18

An approach for multicore+GPUs
- Split algorithms into tasks and dependencies between them, e.g., represented as DAGs
- Schedule the execution in parallel without violating data dependencies
Want to develop libcs594: a framework for parallel programming of linear algebra algorithms for systems of multicores accelerated with GPUs
Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs), e.g., in the MAGMA library for Dense Linear Algebra: http://icl.cs.utk.edu/magma/ 10/18

A Parallel Programming Framework
Inspired by CUDA / OpenCL:
- To experiment with techniques
- To try to develop hybrid algorithms
- To be built as part of homeworks (maybe final projects?)
- Data-parallel tasks (on GPUs, multicores, or both)
- To be started asynchronously from a master going over the critical path
- The critical path to be overlapped with the data-parallel tasks 11/18

How to program in parallel?
- There are many parallel programming paradigms (to be covered with Prof. George Bosilca), e.g., master/worker, SPMD, divide and conquer, pipeline, work pool, data parallel
- In reality, applications usually combine different paradigms
- CUDA and OpenCL have roots in the data-parallel approach (now adding support for task parallelism)
http://developer.download.nvidia.com/compute/cuda/2_0/docs/nvidia_cuda_programming_guide_2.0.pdf
http://developer.download.nvidia.com/opencl/nvidia_opencl_jumpstart_guide.pdf 12/18

Compute Unified Device Architecture (CUDA) Software Stack
CUBLAS, CUFFT, MAGMA, ...
C-like API
(Source: NVIDIA CUDA Programming Guide) 13/18

CUDA Memory Model (Source: NVIDIA CUDA Programming Guide) 14/18

CUDA Hardware Model (Source: NVIDIA CUDA Programming Guide) 15/18

CUDA Programming Model
- Grid of thread blocks (blocks of the same dimension, grouped together to execute the same kernel)
- Thread block (a batch of threads with fast shared memory; executes a kernel)
- Sequential code launches GPU kernels asynchronously

// set the grid and thread configuration
dim3 dimBlock(3,5);
dim3 dimGrid(2,3);
// Launch the device computation
MatVec<<<dimGrid, dimBlock>>>( ... );

__global__ void MatVec( ... )
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    ...
}

[Figure: C program sequential execution - serial code on the host interleaved with parallel kernel launches]
(Source: NVIDIA CUDA Programming Guide) 16/18

Jetson TK1
- A full-featured platform for embedded applications
- It allows you to unleash the power of 192 CUDA cores to develop solutions in computer vision, robotics, medicine, security, and automotive
- You have accounts on astro.icl.utk.edu (for Ozgur Cekmer and Yasser Gandomi) and rudi.icl.utk.edu (for Yuping Lu and Eduardo Ponce)
- Board features:
  - Tegra K1 SOC: Kepler GPU with 192 CUDA cores; 4-Plus-1 quad-core ARM Cortex-A15 CPU
  - 2 GB x16 memory with 64-bit width
  - 16 GB eMMC 4.51 memory
  - https://developer.nvidia.com/jetson-tk1 17/18

Conclusions
Hybrid multicore+GPU computing:
- Architecture trends: towards heterogeneous/hybrid designs
- Can significantly accelerate linear algebra [vs just multicores]
- Can significantly accelerate algorithms that are slow on homogeneous architectures 18/18