E6895 Advanced Big Data Analytics, Lecture 14: NVIDIA GPU Examples and GPU on iOS Devices
Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
IBM Chief Scientist, Graph Computing Research
CY Lin, 2015 Columbia University
Outline
- GPU Architecture
- Execution Model
- Examples of tuning grid and thread-block configurations
- Resource Allocation
- Examples examining the effect of grid and thread-block configuration settings
- Memory Architecture
- Concurrent Processing
- cuBLAS Library for Neural Network
GPU Architecture
A GPU is built from several streaming multiprocessors (SMs).
In each SM:
- CUDA cores
- Shared memory / L1 cache
- Register file
- Load/store units
- Special function units
- Warp scheduler
In each device:
- L2 cache
(Figure: Kepler architecture, K20X)
Execution Model - Warp Concept
NVIDIA GPUs group 32 threads into a warp; the warp is the unit the hardware schedules and executes.
The number of warps that can issue concurrently depends on the number of warp schedulers (Kepler has 4 warp schedulers per SM).
From the hardware's point of view, all threads are arranged one-dimensionally.
(Figures: relationship between the logical view and the hardware view; an inefficient way to allocate a thread block.)
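Because the hardware always schedules whole warps, a block size that is not a multiple of 32 leaves warp lanes idle. The following is a minimal sketch (not taken from the lecture code; the kernel name and sizes are illustrative) contrasting an 80-thread block, whose third warp is only half populated, with a 128-thread block made of four full warps.

#include <cuda_runtime.h>

__global__ void fill(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // 1-D global thread index
    if (idx < n) out[idx] = 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // 80 threads per block -> 3 warps per block, the third only half populated.
    fill<<<(n + 79) / 80, 80>>>(d_out, n);

    // 128 threads per block -> 4 fully populated warps per block.
    fill<<<(n + 127) / 128, 128>>>(d_out, n);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}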
Execution Model - Warp Divergence
The GPU has only lightweight control logic, so complicated control flow can hurt GPU performance.
Within a single warp, if, for example, 16 threads are set to do A and the other 16 threads to do B, then A and B are executed serially.
Example: simpledivergence.cu
Test with and without compiler optimization.
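The sketch below is a guess at what simpledivergence.cu exercises (the actual file from the textbook may differ): mathKernel1 branches on the thread index, so threads within one warp diverge and the two paths are serialized; mathKernel2 branches on the warp index, so every thread of a warp takes the same path and branch efficiency stays high.

__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if (tid % 2 == 0) a = 100.0f;   // even/odd alternates inside a warp: divergent
    else              b = 200.0f;
    c[tid] = a + b;
}

__global__ void mathKernel2(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if ((tid / warpSize) % 2 == 0) a = 100.0f;  // whole warps agree: no divergence
    else                           b = 200.0f;
    c[tid] = a + b;
}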
Execution Model - Warp Divergence
Compile with optimization turned off:
nvcc -g -G -o simpledivergence simpledivergence.cu
Run through nvprof to extract profiling information:
nvprof --metrics branch_efficiency --events branch,divergent_branch ./simpledivergence
Resource Allocation
The execution context of a warp consists of:
- Program counters
- Registers
- Shared memory
If each thread uses a lot of these resources, fewer threads can be resident on one SM.
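As a sketch of one way to manage this trade-off (the kernel is illustrative, not from the lecture): the __launch_bounds__ qualifier tells the compiler the intended block size and a target number of resident blocks per SM, so it can limit register use per thread; the --maxrregcount flag shown on a later slide imposes a similar cap from the command line.

__global__ void
__launch_bounds__(256, 4)   // at most 256 threads per block, aim for 4 resident blocks per SM
boundedKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}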
Occupancy
Toolkit: the CUDA Occupancy Calculator, in /usr/local/cuda/tools/cuda_occupancy_calculator.xls
It can assist in estimating the occupancy of your configuration.
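Besides the spreadsheet, CUDA 6.5 and later expose occupancy queries directly in the runtime API. A minimal sketch (the kernel name and settings are assumptions):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = (float)idx;
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Block size that maximizes theoretical occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int numBlocks = 0;
    // Resident blocks per SM at that block size (0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, blockSize, 0);

    printf("suggested block size: %d, active blocks per SM: %d\n", blockSize, numBlocks);
    return 0;
}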
Occupancy, Memory Load Efficiency, Memory Load Throughput
View the number of registers used and constrain the number of registers per thread:
nvcc -g -G -arch=sm_30 --ptxas-options=-v --maxrregcount=31 -o summatrix summatrix.cu
Check occupancy, memory load efficiency, and memory load throughput to explore suitable thread-block sizes:
nvprof --metrics gld_throughput,gld_efficiency,achieved_occupancy ./summatrix dimx dimy
Run the example summatrix.cu with thread-block dimensions {4,4}, {4,8}, {8,4}, {8,8}, {16,16}, {32,32}; {16,16} is the fastest.
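A hedged sketch of the kind of kernel summatrix.cu exercises (the real file may differ): each thread adds one matrix element, and the block dimensions passed on the command line are what change gld_efficiency and achieved_occupancy.

__global__ void sumMatrixOnGPU2D(const float *A, const float *B, float *C,
                                 int nx, int ny)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int iy = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (ix < nx && iy < ny) {
        int idx = iy * nx + ix;                      // row-major linear index
        C[idx] = A[idx] + B[idx];
    }
}

// Launch, e.g. with a 16x16 block:
// dim3 block(16, 16);
// dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
// sumMatrixOnGPU2D<<<grid, block>>>(d_A, d_B, d_C, nx, ny);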
Profiling Results (figure)
Memory Type
1. Registers: per SM; automatic variables in a kernel function; low latency, high bandwidth.
2. Local memory: variables in a kernel that cannot fit in registers; cached in the per-SM L1 cache and the per-device L2 cache.
3. Shared memory (__shared__): per SM; faster than local and global memory; shared among the threads of a block and used for inter-thread communication; 64 KB, shared with the L1 cache.
4. Constant memory (__constant__): per device; read-only memory.
5. Texture memory: per SM; read-only cache, optimized for 2D spatial locality.
6. Global memory: per device; the largest and slowest memory space.
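An illustrative kernel touching several of these spaces (the names and sizes are made up for the sketch):

__constant__ float coeff[16];              // constant memory: per device, read-only in kernels,
                                           // set from the host with cudaMemcpyToSymbol

__global__ void memoryDemo(const float *gIn, float *gOut, int n)
{
    __shared__ float tile[256];            // shared memory: per block, on-chip, for inter-thread communication
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // idx lives in a register

    // Stage one element per thread into shared memory (assumes blockDim.x <= 256).
    tile[threadIdx.x] = (idx < n) ? gIn[idx] : 0.0f;   // global memory -> shared memory
    __syncthreads();                                   // all threads of the block reach the barrier

    float local = tile[threadIdx.x] * coeff[0];        // automatic variable: a register (local memory only if it spills)
    if (idx < n) gOut[idx] = local;                    // write back to global memory
}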
Concurrent Processing
Concurrency 1: overlap data transfer with computation.
- The NVIDIA GT 650M (a laptop GPU) has one copy engine; the NVIDIA Tesla K40 (a high-end GPU) has two copy engines.
Concurrency 2: overlap computation with computation, if enough CUDA cores are available.
Check the examples summatrixongpustream.cu and summatrixongpunostream.cu.
The latency of data transfer can be hidden behind computation.
Scenario: handle two tasks, both matrix multiplications; each copies two inputs to the GPU and one output back from the GPU.
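A hedged sketch of the stream version (the lecture's summatrixongpustream.cu may be organized differently): splitting the work across two streams lets the copy engine move one chunk while the CUDA cores compute on the other. Host buffers must be pinned (cudaMallocHost) for cudaMemcpyAsync to actually overlap.

#include <cuda_runtime.h>

__global__ void sumArrays(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void sumWithStreams(const float *h_a, const float *h_b, float *h_c, int n)
{
    const int nStreams = 2;
    const int chunk = n / nStreams;            // assumes n is divisible by nStreams
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        size_t bytes = chunk * sizeof(float);
        // Copies and the kernel issued in stream s can overlap with work in the other stream.
        cudaMemcpyAsync(d_a + off, h_a + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(d_b + off, h_b + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        sumArrays<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a + off, d_b + off, d_c + off, chunk);
        cudaMemcpyAsync(h_c + off, d_c + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();                   // wait for both streams to finish
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}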
Concurrent Processing No concurrent processing Concurrent processing (concurrent computation) 14 Concurrent processing (transfer + computation)
cuBLAS Library for Neural Network
In a neural network, the most important operation is the inner product a = f(x^T w + b), where
- x is a matrix recording the input fed to the neural network,
- w is a matrix recording the weights of the network connections,
- b is a matrix recording the biases of the network connections,
- f is an activation function used to activate the neurons,
- a is the output.
The GPU is well suited to such compute-intensive operations.
Example: compute x^T w + b with cuBLAS (GPU) vs. OpenBLAS (CPU); the GPU timing includes data transfer between host and device.
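A hedged sketch of the GPU side with cuBLAS (the dimension names and the helper are assumptions, not the lecture code): X is stored column-major as d x n, W as d x h, and the bias B is a full n x h matrix; loading B into the output and calling cublasSgemm with beta = 1 folds the bias add into the same call. The activation f would be applied afterwards by a separate element-wise kernel.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Computes A = X^T * W + B, with all buffers already resident on the device.
void affine(cublasHandle_t handle,
            const float *d_X, const float *d_W, const float *d_B, float *d_A,
            int d, int n, int h)
{
    const float alpha = 1.0f, beta = 1.0f;

    // Start from the bias: A <- B, then A <- 1 * (X^T W) + 1 * A.
    cudaMemcpy(d_A, d_B, (size_t)n * h * sizeof(float), cudaMemcpyDeviceToDevice);
    cublasSgemm(handle,
                CUBLAS_OP_T, CUBLAS_OP_N,   // use X transposed, W as stored
                n, h, d,                    // (n x d) * (d x h) -> (n x h)
                &alpha, d_X, d,             // X stored d x n, leading dimension d
                        d_W, d,             // W stored d x h, leading dimension d
                &beta,  d_A, n);            // result A is n x h, leading dimension n
}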
Reference
Check all metrics and events supported by nvprof (the output also explains the meaning of each option):
nvprof --query-metrics
nvprof --query-events
Professional CUDA C Programming: http://www.wrox.com/wileycda/wroxtitle/professional-cuda-c-Programming.productCd-1118739329,descCd-DOWNLOAD.html (source code is available on this website)
GTC On-Demand: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
Developer Zone: http://www.gputechconf.com/resource/developer-zone
NVIDIA Parallel Programming Blog: http://devblogs.nvidia.com/parallelforall
NVIDIA Developer Zone Forums: http://devtalk.nvidia.com
Metal Programming Model
Metal integrates support for both graphics and compute operations.
Three command encoders:
- Graphics rendering: Render Command Encoder
- Data-parallel compute processing: Compute Command Encoder
- Transferring data between resources: Blit Command Encoder
Multi-threaded command encoding is supported.
Typical flow with the compute command encoder:
1. Prepare data
2. Put your function into a pipeline
3. Encode the command
4. Put the command into a command buffer
5. Commit it to the command queue
6. Execute the command
7. Get the result back
Metal Programming, Kernel Function
Compute command: two parameters, the threads per threadgroup and the number of threadgroups, determine the total number of threads; they are equivalent to the thread block and grid in CUDA, and both are 3-D values.
The total of all threadgroup memory allocations must not exceed 16 KB.
Kernel function example: the sigmoid function (see the CUDA analogue sketched below).
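For reference only (this is a CUDA analogue, not the Metal source shown in the lecture): the same element-wise sigmoid written as a CUDA kernel, where <<<numBlocks, threadsPerBlock>>> plays the role of Metal's threadgroup counts.

__global__ void sigmoid(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f / (1.0f + expf(-in[i]));  // a = 1 / (1 + e^-x)
}

// Launch: sigmoid<<<(n + 255) / 256, 256>>>(d_in, d_out, n);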
Example: deviceQuery
Understand the hardware constraints via deviceQuery (included in the CUDA Toolkit sample code).
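The same constraints can also be queried programmatically; a minimal sketch using cudaGetDeviceProperties (deviceQuery prints a superset of these fields):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    printf("Device %d: %s\n", dev, prop.name);
    printf("  SM count:                %d\n", prop.multiProcessorCount);
    printf("  Warp size:               %d\n", prop.warpSize);
    printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("  Copy (async) engines:    %d\n", prop.asyncEngineCount);
    return 0;
}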
Homework #3 (due May 22)
Choose 2 of the algorithms you use in your final project. Convert them into a GPU version.
Show your code and a performance measurement comparing it to your non-GPU version.
The more innovative/complex your algorithms are, the higher the score you will get on Homework #3.