Radar signal processing on graphics processors (Nvidia/CUDA)


 Dylan Sharp
Transcription
1 Radar signal processing on graphics processors (Nvidia/CUDA). Jimmy Pettersson & Ian Wainwright. The full master's thesis presentation will be held in January.
2 CUDA is a framework for accessing the GPU for non-graphics computing, known as GPGPU computing.
3 Outline: Why GPGPU?; CUDA Hardware; CUDA Programming; Results; OpenCL; Final thoughts; Questions!
4-5 Why GPGPU? (Figure: theoretical peak performance in GFLOPS.)
6-7 Why GPGPU? (Table comparing theoretical GFLOPS, theoretical bandwidth in GB/s, power in watts, and GFLOPS/W for a Quadro FX, two GeForce GTS cards, an AMD/ATI card, an Intel Core i and an Intel Core i7-820QM.)
8-11 CUDA Hardware environment (diagrams; slide 10 is labelled On-chip / Off-chip)
12-13 CUDA Hardware environment: 24 * 8 * 1.2 GHz * 2 = 460 GFLOPS. 24 * (16 KB + 16K * 4 B) = 1920 KB of instant on-chip memory.
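The two figures on these slides can be unpacked step by step. The slides give only the numbers, so the reading of each factor here is an assumption: 24 multiprocessors, 8 streaming processors per multiprocessor, a 1.2 GHz clock, 2 floating-point operations per cycle (a multiply-add), and per multiprocessor 16 KB of shared memory plus 16K registers of 4 bytes each.

```latex
\begin{align*}
24 \times 8 \times 1.2\,\text{GHz} \times 2 &= 460.8\ \text{GFLOPS} \approx 460\ \text{GFLOPS} \\
16\text{K} \times 4\,\text{B} &= 64\,\text{KB} \\
24 \times (16\,\text{KB} + 64\,\text{KB}) &= 24 \times 80\,\text{KB} = 1920\,\text{KB}
\end{align*}
```

Under this reading the register file (64 KB) is four times the size of the shared memory (16 KB) on each multiprocessor.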
14-16 CUDA Hardware environment (diagrams)
17-18 CUDA programming: Programming is done in C with some C++ extensions, such as templates. This allows high-level programming while staying close to the hardware.
19 CUDA programming Code execution: CPU / GPU
20-21 CUDA programming: Threads & Warps. A warp consists of 32 consecutive threads.
22 CUDA programming Warp execution ~ SIMD similarities
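As a minimal sketch of how consecutive threads fall into warps, the kernel below lets each thread of a one-dimensional block record its warp index. The names `warp_ids` and `warp_of_thread` are hypothetical and not taken from the presentation; `warpSize` is the built-in CUDA variable, which is 32 on the hardware discussed here.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Each thread of a 1D block records the warp it belongs to.
// Its lane within the warp would be threadIdx.x % warpSize.
__global__ void warp_ids(int *warp)
{
    int t = threadIdx.x;      // index of this thread within the block
    warp[t] = t / warpSize;   // threads 0..31 -> warp 0, 32..63 -> warp 1, ...
}

// Hypothetical host helper: launches one block of n threads and returns
// the warp index of thread t, or -1 if a CUDA call fails.
int warp_of_thread(int n, int t)
{
    assert(n > 0 && n <= 512 && t >= 0 && t < n);  // 512: conservative block size limit
    std::vector<int> h_warp(n);
    int *d_warp = nullptr;
    if (cudaMalloc(&d_warp, n * sizeof(int)) != cudaSuccess) return -1;
    warp_ids<<<1, n>>>(d_warp);
    // cudaMemcpy waits for the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(h_warp.data(), d_warp, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_warp);
    return err == cudaSuccess ? h_warp[t] : -1;
}
```

With a block of 64 threads, `warp_of_thread(64, 31)` falls in the first warp and `warp_of_thread(64, 32)` in the second.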
23 CUDA programming Warps are then grouped into work groups (blocks) for execution:
24 CUDA programming Work group (block) scheduling:
25-27 CUDA programming: Communication within a work group is managed through fast on-chip shared memory. Threads can be synchronized to avoid read/write hazards.
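A small sketch of this pattern, assuming a single block of 64 threads: each thread writes one value to shared memory, the block synchronizes, and each thread then reads a value written by a different thread. The names `reverse_in_block`, `reversed_element` and the constant `BLOCK` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cassert>

#define BLOCK 64  // threads per block, chosen only for illustration

// Reverses BLOCK floats in place via on-chip shared memory.
__global__ void reverse_in_block(float *data)
{
    __shared__ float tile[BLOCK];   // visible to all threads of the block
    int t = threadIdx.x;
    tile[t] = data[t];              // each thread writes one element
    __syncthreads();                // wait until every write to tile[] has finished
    data[t] = tile[BLOCK - 1 - t];  // now safe to read another thread's element
}

// Hypothetical host helper: fills 0..BLOCK-1, reverses it on the GPU and
// returns the element at index i, or -1 if a CUDA call fails.
float reversed_element(int i)
{
    assert(i >= 0 && i < BLOCK);
    float h[BLOCK];
    for (int k = 0; k < BLOCK; ++k) h[k] = (float)k;
    float *d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return -1.0f;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_in_block<<<1, BLOCK>>>(d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? h[i] : -1.0f;
}
```

Without the `__syncthreads()` call, a thread could read `tile[BLOCK - 1 - t]` before the owning thread has written it, which is the kind of read/write dependency the slide refers to.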
28-29 CUDA programming: A CUDA programming example: Space-Time Adaptive Processing (STAP) with covariance matrix estimation.
30-35 CUDA programming: STAP covariance matrix estimation. A sliding volume in a 3D data set. For each Doppler channel we iterate over the range, read more data, and construct a new covariance matrix from the beams. The covariance matrix is given by x*transpose(x). These are summed in each step: A(i,j,r) += A(i,j,r-1) for every 'i' and 'j'. The sum is finally normalized at the end of each range.
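Written out, the accumulation described on these slides is a running sum of outer products. The normalization constant is not given in the slides, so dividing by the number of range samples N is an assumption here; for complex-valued radar data the transpose would normally be a conjugate transpose.

```latex
\begin{align*}
A_0 &= 0, \\
A_r &= A_{r-1} + x_r x_r^{T}, \qquad r = 1, \dots, N, \\
\hat{R} &= \frac{1}{N} A_N = \frac{1}{N} \sum_{r=1}^{N} x_r x_r^{T}.
\end{align*}
```

Element by element, the middle line reads $A(i,j,r) = A(i,j,r-1) + x_i(r)\,x_j(r)$, which matches the update given for every 'i' and 'j'.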
36-38 CUDA programming: STAP covariance matrix estimation. Difficult to implement on the CUDA architecture? Nope! The CUDA architecture can parallelize over the Doppler channels quite easily. We assigned work groups to the Doppler channels.
39-42 CUDA programming: STAP covariance matrix estimation. We assign one thread per x element and let each thread compute one column of the covariance matrix. All the data is stored in on-chip shared memory. This is then iterated over the whole range and finally normalized.
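The scheme can be sketched as follows. This is not the authors' code: it assumes real-valued data (actual radar samples are complex), one block per Doppler channel, a data layout of `x[doppler][range][beam]`, one matrix per channel over the whole range, and normalization by the number of range samples. The names `stap_covariance`, `estimate_covariance` and the constant `BEAMS` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BEAMS 8  // hypothetical number of beams (elements of x)

// One block per Doppler channel, one thread per element of x.
// Thread j accumulates column j of the BEAMS x BEAMS sum of x_r * x_r^T.
__global__ void stap_covariance(const float *x, float *cov, int n_range)
{
    __shared__ float snap[BEAMS];         // current snapshot x_r
    __shared__ float acc[BEAMS][BEAMS];   // running sum A(i,j)

    int d = blockIdx.x;    // Doppler channel handled by this block
    int j = threadIdx.x;   // column owned by this thread

    for (int i = 0; i < BEAMS; ++i) acc[i][j] = 0.0f;

    for (int r = 0; r < n_range; ++r) {
        snap[j] = x[(d * n_range + r) * BEAMS + j];  // read more data
        __syncthreads();                             // snapshot is complete
        for (int i = 0; i < BEAMS; ++i)
            acc[i][j] += snap[i] * snap[j];          // A(i,j,r) = A(i,j,r-1) + x_i * x_j
        __syncthreads();                             // all reads done before the next overwrite
    }

    for (int i = 0; i < BEAMS; ++i)                  // normalize and write out
        cov[(d * BEAMS + i) * BEAMS + j] = acc[i][j] / n_range;
}

// Hypothetical host helper: x holds n_doppler * n_range * BEAMS floats.
// Returns n_doppler * BEAMS * BEAMS floats, or an empty vector on a CUDA error.
std::vector<float> estimate_covariance(const std::vector<float> &x,
                                       int n_doppler, int n_range)
{
    assert(x.size() == (size_t)n_doppler * n_range * BEAMS);
    std::vector<float> cov((size_t)n_doppler * BEAMS * BEAMS);
    float *d_x = nullptr, *d_cov = nullptr;
    if (cudaMalloc(&d_x, x.size() * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&d_cov, cov.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(d_x);
        return {};
    }
    cudaMemcpy(d_x, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    stap_covariance<<<n_doppler, BEAMS>>>(d_x, d_cov, n_range);
    cudaError_t err = cudaMemcpy(cov.data(), d_cov, cov.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_cov);
    if (err != cudaSuccess) cov.clear();
    return cov;
}
```

Each thread only ever touches its own column of `acc`, so the two barriers are needed solely to protect the shared snapshot `snap`.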
43-45 CUDA programming: STAP covariance matrix estimation. We used 2 different data sets. (Table: Doppler, Range, Beams and Range blocks for the MITRE and Extended data sets.) The GPU results were benchmarked against a CPU implementation: an Intel Core 2 Duo at 3 GHz, running on one core.
46 CUDA programming STAP covariance matrix estimation Results for the first implementation
47 CUDA programming: STAP covariance matrix estimation. Optimization: store the matrices in thread-local registers instead. The register file is 4 times larger than the on-chip shared memory.
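One way this optimization might look is sketched below, under the same assumptions as before (real-valued data, one block per Doppler channel, layout `x[doppler][range][beam]`, normalization by the number of range samples). Each thread keeps its column in a small local array; because the array size is a compile-time constant and the loops are unrolled, the compiler can place it in registers. The names `stap_covariance_reg`, `estimate_covariance_reg` and `BEAMS` are hypothetical, and this is not the authors' implementation.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BEAMS 8  // hypothetical number of beams; must be a compile-time constant

// Thread j keeps column j of the running sum in registers.
// Only the current snapshot is shared between threads.
__global__ void stap_covariance_reg(const float *x, float *cov, int n_range)
{
    __shared__ float snap[BEAMS];  // current snapshot x_r
    float col[BEAMS];              // column j; register-resident once loops are unrolled

    int d = blockIdx.x;            // Doppler channel
    int j = threadIdx.x;           // column owned by this thread

    #pragma unroll
    for (int i = 0; i < BEAMS; ++i) col[i] = 0.0f;

    for (int r = 0; r < n_range; ++r) {
        snap[j] = x[(d * n_range + r) * BEAMS + j];
        __syncthreads();           // snapshot is complete
        float xj = snap[j];
        #pragma unroll
        for (int i = 0; i < BEAMS; ++i)
            col[i] += snap[i] * xj;
        __syncthreads();           // all reads done before the next overwrite
    }

    #pragma unroll
    for (int i = 0; i < BEAMS; ++i)
        cov[(d * BEAMS + i) * BEAMS + j] = col[i] / n_range;
}

// Hypothetical host helper: returns n_doppler * BEAMS * BEAMS floats,
// or an empty vector on a CUDA error.
std::vector<float> estimate_covariance_reg(const std::vector<float> &x,
                                           int n_doppler, int n_range)
{
    assert(x.size() == (size_t)n_doppler * n_range * BEAMS);
    std::vector<float> cov((size_t)n_doppler * BEAMS * BEAMS);
    float *d_x = nullptr, *d_cov = nullptr;
    if (cudaMalloc(&d_x, x.size() * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&d_cov, cov.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(d_x);
        return {};
    }
    cudaMemcpy(d_x, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    stap_covariance_reg<<<n_doppler, BEAMS>>>(d_x, d_cov, n_range);
    cudaError_t err = cudaMemcpy(cov.data(), d_cov, cov.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_cov);
    if (err != cudaSuccess) cov.clear();
    return cov;
}
```

If the array were indexed with values unknown at compile time, the compiler would typically spill it to slower local memory, which is why the loops carry `#pragma unroll`.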
48 CUDA programming: STAP, Results: Comparison, shared memory vs. register memory.
49-50 CUDA programming: STAP: Conclusions. More data and higher arithmetic intensity benefit the GPU tremendously. The game is to keep the streaming processors fed with data and never have them idle waiting for more.
51-64 Results (two tables; most numeric values did not survive the transcription)
Columns: Benchmark, GFLOPS, Speedup (HPEC), Speedup (MKL)
TDFIR: (value lost) times
QR: (value lost) times, 9 times
CT: 16 GB/s, 100 times
SVD: (value lost) times, 3 times
CFAR: (value lost) times
FFT: (value lost) times
Columns: Benchmark, GFLOPS, Speedup (SMW), Speedup (MKL)
STAP: Covariance matrix: (value lost) times
SAR: Tilted matrix addition: 55 GB/s, 80 times
Cubic interpolation: (value lost) times
Bicubic interpolation: (value lost) times
Picture correlation: (value lost) times
These are reasonably big data sets.
65 OpenCL: A new, standardized, open, cross-platform API for heterogeneous hardware. Aim: code once, compile on various hardware, such as CPUs, GPUs, Cell Broadband Engine, DSPs... Problems: only AMD/Nvidia GPUs and Cell have supported hardware; still immature. CUDA and OpenCL are very similar.
66-69 What to remember! When to use the GPU: large data sets and/or high arithmetic intensity. When not to use the GPU: when the host-to-device data transfer is a significant part of the computation, or when the computation is entirely serial. Easy: basically C with good hardware knowledge. Future: New architecture called Fermi in Q brings lots of nice features, such as caches, C++ support, IEEE compliance, and more!
70 Questions! We are sitting right across from the project boards at Atrappan!
71-73 What typical CUDA code might look like:
// vector addition
A[threadIdx.x] = B[threadIdx.x] + C[threadIdx.x];
// A = B.*C, element-wise matrix multiplication
int i = threadIdx.y + blockIdx.y * blockDim.y;
int j = threadIdx.x + blockIdx.x * blockDim.x;
A[i][j] = B[i][j] * C[i][j];
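The element-wise multiplication above can be turned into a complete, compilable sketch. The slide shows only the kernel body, so everything around it is an assumption: the matrices are stored as flat row-major arrays, the block size is 16 x 16, and the names `multiply_elements` and `elementwise_multiply` are hypothetical. A bounds check is added because the grid is rounded up to whole blocks.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// A = B .* C for row-major rows x cols matrices; one thread per element.
__global__ void multiply_elements(float *A, const float *B, const float *C,
                                  int rows, int cols)
{
    int i = threadIdx.y + blockIdx.y * blockDim.y;  // row index
    int j = threadIdx.x + blockIdx.x * blockDim.x;  // column index
    if (i < rows && j < cols)                       // skip threads past the matrix edge
        A[i * cols + j] = B[i * cols + j] * C[i * cols + j];
}

// Hypothetical host helper: returns B .* C, or an empty vector on a CUDA error.
std::vector<float> elementwise_multiply(const std::vector<float> &B,
                                        const std::vector<float> &C,
                                        int rows, int cols)
{
    size_t n = (size_t)rows * cols;
    assert(B.size() == n && C.size() == n);
    std::vector<float> A(n);
    size_t bytes = n * sizeof(float);

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    if (cudaMalloc(&d_A, bytes) != cudaSuccess ||
        cudaMalloc(&d_B, bytes) != cudaSuccess ||
        cudaMalloc(&d_C, bytes) != cudaSuccess) {
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);  // freeing a null pointer is a no-op
        return {};
    }
    cudaMemcpy(d_B, B.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, C.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x,   // enough blocks to cover every column
              (rows + block.y - 1) / block.y);  // and every row
    multiply_elements<<<grid, block>>>(d_A, d_B, d_C, rows, cols);

    cudaError_t err = cudaMemcpy(A.data(), d_A, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    if (err != cudaSuccess) A.clear();
    return A;
}
```

The two host-to-device copies and the copy back are the data transfers that the "when not to use the GPU" advice refers to: for an operation this cheap they can easily dominate the total time.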
More informationGPU Programming Strategies and Trends in GPU Computing
GPU Programming Strategies and Trends in GPU Computing André R. Brodtkorb 1 Trond R. Hagen 1,2 Martin L. Sætra 2 1 SINTEF, Dept. Appl. Math., P.O. Box 124, Blindern, NO0314 Oslo, Norway 2 Center of Mathematics
More informationCUDA Programming. Week 4. Shared memory and register
CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrixmatrix multiplication Homework SHARED MEMORY AND BANK
More informationThe High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
More informationMAGMA: Matrix Algebra on GPU and Multicore Architectures
MAGMA: Matrix Algebra on GPU and Multicore Architectures Presented by Scott Wells Assistant Director Innovative Computing Laboratory (ICL) College of Engineering University of Tennessee, Knoxville Overview
More informationOptimization. NVIDIA OpenCL Best Practices Guide. Version 1.0
Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What
More informationGPU multiprocessing. Manuel Ujaldón Martínez Computer Architecture Department University of Malaga (Spain)
GPU multiprocessing Manuel Ujaldón Martínez Computer Architecture Department University of Malaga (Spain) Outline 1. Multichip solutions [10 slides] 2. Multicard solutions [2 slides] 3. Multichip + multicard
More informationCS 152 Computer Architecture and Engineering. Lecture 16: Graphics Processing Units (GPUs)
CS 152 Computer Architecture and Engineering Lecture 16: Graphics Processing Units (GPUs) Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationA New, HighPerformance, LowPower, FloatingPoint Embedded Processor for Scientific Computing and DSP Applications
1 A New, HighPerformance, LowPower, FloatingPoint Embedded Processor for Scientific Computing and DSP Applications Simon McIntoshSmith Director of Architecture 2 MultiThreaded Array Processing Architecture
More informationGPU Accelerated Monte Carlo Simulations and Time Series Analysis
GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes GutenbergUniversity of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis
More informationGuided Performance Analysis with the NVIDIA Visual Profiler
Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof commandline profiler Guided
More informationRadar Processing: FPGAs or GPUs?
Radar Processing: FPGAs or GPUs? WP011972.0 White Paper While generalpurpose graphics processing units (GPGPUs) offer high rates of peak floatingpoint operations per second (FLOPs), FPGAs now offer competing
More informationRetargeting PLAPACK to Clusters with Hardware Accelerators
Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. QuintanaOrtí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.
More informationParallel Firewalls on GeneralPurpose Graphics Processing Units
Parallel Firewalls on GeneralPurpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
More informationOpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
More informationTurbomachinery CFD on manycore platforms experiences and strategies
Turbomachinery CFD on manycore platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 2729
More informationRWTH GPU Cluster. Sandra Wienke wienke@rz.rwthaachen.de November 2012. Rechen und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
RWTH GPU Cluster Fotos: Christian Iwainsky Sandra Wienke wienke@rz.rwthaachen.de November 2012 Rechen und Kommunikationszentrum (RZ) The RWTH GPU Cluster GPU Cluster: 57 Nvidia Quadro 6000 (Fermi) innovative
More informationClustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate
More information