1. If we need to use each thread to calculate one output element of a vector addition, what would

Size: px
Start display at page:

Download "1. If we need to use each thread to calculate one output element of a vector addition, what would"

Transcription

1 Quiz questions Lecture 2: 1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to data index: (A) i=threadidx.x + threadidx.y; (B) i=blockidx.x + threadidx.x; (C) i=blockidx.x*blockdim.x + threadidx.x; (D) i=blockidx.x * threadidx.x; 2. We want to use each thread to calculate two (adjacent) elements of a vector addition, Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to data index? (A) i=blockidx.x*blockdim.x + threadidx.x +2; (B) i=blockidx.x*threadidx.x*2 (C) i=(blockidx.x*blockdim.x + threadidx.x)*2 (D) i=blockidx.x*blockdim.x*2 + threadidx.x 3. If a CUDA device s SM (streaming multiprocessor) can take up to 1536 threads and up to 4 thread blocks. Which of the following block configuration would result in the most number of threads in the SM? 28 threads per block (B) 256 threads per block (C) 512 threads per block (D) 1024 threads per block 4. For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid? (A) 2000 (B) 2024 (C) 2048 (D) 2096

2 5. If the previous question, how many warps do you expect to have divergence due to the boundary check on vector length? (B) 2 (C) 3 (D) 6 Answer: (A) Quiz Questions Lecture 3: 1. For our tiled matrix matrix multiplication kernel, if we use a 32X32 tile, what is the reduction of memory bandwidth usage for input matrices M and N? /8 of the original usage (B) 1/16 of the original usage (C) 1/32 of the original usage (D) 1/64 of the original usage 2. Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel? (B) 1000 (C) 512 (D) In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel? (B) 1000 (C) 512 (D) 51200

3 4. For the simple matrix matrix multiplication (MxN) based on row major layout, which input matrix will have coalesced accesses? (A) M (B) N (C) M, N (D) Neither 5. For the tiled matrix matrix multiplication (MxN) based on row major layout, which input matrix will have coalesced accesses? (A) M (B) N (C) M, N (D) Neither Quiz Questions: Lecture 4 1. For the simple reduction kernel, if the block size is 1024 and warp size is 32, how many warps in a block will have divergence during the 5 th iteration? (A) 0 (B) 1 (C) 16 (D) 32, All warps will have divergence throughout the execution. 2. For the improved reduction kernel, if the block size is 1024 and warp size is 32, how many warps will have divergence during the 5 th iteration? (A) 0 (B) 1 (C) 16 (D) 32 Answer: (A), There are 64 consecutive active threads, more than warp size.

4 3. For the work efficient scan kernel, assume that we have 2048 elements, how many add operations will be performed in both the reduction tree phase and the inverse reduction tree phase? (A) (2048 1)*2 (B) (1024 1)*2 (C) 1024*1024 (D) 10*1024 Answer: (A) 4. For the work inefficient scan kernel based on reduction trees, assume that we have 2048 elements, which of the following gives the closest approximation on how many add operations will be performed? (A) (2048 1)*2 (B) (1024 1)*2 (C) 1024*1024 (D) 10* For the vector addition example where input vectors are read from disk, if the GPU kernel runs at 190GFLOPS, and the PCIe is able to deliver a bandwidth of 6GBps, which of the following is the closest approximation of the minimum time it would take to add two 190 mega element vectors stored in the host memory and get the result back to the host memory? 90 / 190 ms (B) 190 / 6 ms (C) 8 * 190 / 6 ms (D) 2 * 190 / 6 ms Lecture 5 1. What is the CUDA API call that make sure that all previous kernel executions and memory copies have been completed? (A) syncthreads() (B) cudadevicesynchronize() (C) cudastreamsynchronize() (D) barrier()

5 2. Which of the following statements is true? (A) The data transfer between device and host is done by DMA hardware using virtual addresses. (B) The OS automatically guarantees that any memory being used by a DMA device is not swapped out. (C) If a swapped page is to be transferred by cudymemcpy(), it needs to be first copied to a pinned memory buffer before transferred. (D) Pinned memory is allocated with cudamalloc() function. Lecture 6 1. For vector addition, if there are 100,000 elements in each vector and we are using 3 compute processes. How many elements are we sending to the last compute process? (A) 5 (B) 300 (C) 333 (D) If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a data transfer of bytes, what is the size of each data element being sent? byte (B) 2 bytes (C) 4 bytes (D) 8 bytes 3. Which of the following statements is true? (A) MPI_send() is blocking by default. (B) MPI_recv() is blocking by default. (C) MPI messages must be at least 128 bytes. (D) MPI processes can access the same variable through shared memory.

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Program Optimization Study on a 128-Core GPU

Program Optimization Study on a 128-Core GPU Program Optimization Study on a 128-Core GPU Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, and Wen-mei W. Hwu Yu, Xuan Dept of Computer & Information Sciences University

More information

Optimizing Application Performance with CUDA Profiling Tools

Optimizing Application Performance with CUDA Profiling Tools Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary

Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary OpenCL Optimization Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary 2 Overall Optimization Strategies Maximize parallel

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

GPU Performance Analysis and Optimisation

GPU Performance Analysis and Optimisation GPU Performance Analysis and Optimisation Thomas Bradley, NVIDIA Corporation Outline What limits performance? Analysing performance: GPU profiling Exposing sufficient parallelism Optimising for Kepler

More information

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Learn CUDA in an Afternoon: Hands-on Practical Exercises Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

More information

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 2 Shared memory in detail

More information

CUDA Basics. Murphy Stein New York University

CUDA Basics. Murphy Stein New York University CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture

More information

GPU Memory. Memory access:100 times more time to access local/global memory. Maximize shared/ register memory.

GPU Memory. Memory access:100 times more time to access local/global memory. Maximize shared/ register memory. GPU Memory Local registers per thread. A parallel data cache or shared memory that is shared by all the threads. A read-only constant cache that is shared by all the threads. A read-only texture cache

More information

CUDA Programming. Week 4. Shared memory and register

CUDA Programming. Week 4. Shared memory and register CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK

More information

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization Michael Bauer Stanford University mebauer@cs.stanford.edu Henry Cook UC Berkeley hcook@cs.berkeley.edu Brucek Khailany NVIDIA Research bkhailany@nvidia.com

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

Hands-on CUDA exercises

Hands-on CUDA exercises Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

More information

Guided Performance Analysis with the NVIDIA Visual Profiler

Guided Performance Analysis with the NVIDIA Visual Profiler Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided

More information

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Image Processing & Video Algorithms with CUDA

Image Processing & Video Algorithms with CUDA Image Processing & Video Algorithms with CUDA Eric Young & Frank Jargstorff 8 NVIDIA Corporation. introduction Image processing is a natural fit for data parallel processing Pixels can be mapped directly

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

CUDA STREAMS BEST PRACTICES AND COMMON PITFALLS. Justin Luitjens - NVIDIA

CUDA STREAMS BEST PRACTICES AND COMMON PITFALLS. Justin Luitjens - NVIDIA CUDA STREAMS BEST PRACTICES AND COMMON PITFALLS Justin Luitjens - NVIDIA Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Launch a GPU Kernel 3. Copy results from GPU

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Hands On CUDA Tools and Performance-Optimization

Hands On CUDA Tools and Performance-Optimization Mitglied der Helmholtz-Gemeinschaft Hands On CUDA Tools and Performance-Optimization JSC GPU Programming Course 26. März 2011 Dominic Eschweiler Outline of This Talk Introduction Setup CUDA-GDB Profiling

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University

CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

More information

Parallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com

Parallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

GPU Accelerated Pathfinding

GPU Accelerated Pathfinding GPU Accelerated Pathfinding By: Avi Bleiweiss NVIDIA Corporation Graphics Hardware (2008) Editors: David Luebke and John D. Owens NTNU, TDT24 Presentation by Lars Espen Nordhus http://delivery.acm.org/10.1145/1420000/1413968/p65-bleiweiss.pdf?ip=129.241.138.231&acc=active

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

More information

Advanced CUDA Webinar. Memory Optimizations

Advanced CUDA Webinar. Memory Optimizations Advanced CUDA Webinar Memory Optimizations Outline Overview Hardware Memory Optimizations Data transfers between host and device Device memory optimizations Summary Measuring performance effective bandwidth

More information

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0 Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What

More information

NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring. David Goodwin NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

An Introduction to GPU Computing and CUDA Architecture

An Introduction to GPU Computing and CUDA Architecture An Introduction to GPU Computing and CUDA Architecture Sarah Tariq, NVIDIA Corporation GPU Computing GPU: Graphics Processing Unit Traditionally used for real-time rendering High computational density

More information

AUDIO ON THE GPU: REAL-TIME TIME DOMAIN CONVOLUTION ON GRAPHICS CARDS. A Thesis by ANDREW KEITH LACHANCE May 2011

AUDIO ON THE GPU: REAL-TIME TIME DOMAIN CONVOLUTION ON GRAPHICS CARDS. A Thesis by ANDREW KEITH LACHANCE May 2011 AUDIO ON THE GPU: REAL-TIME TIME DOMAIN CONVOLUTION ON GRAPHICS CARDS A Thesis by ANDREW KEITH LACHANCE May 2011 Submitted to the Graduate School Appalachian State University in partial fulfillment of

More information

Introduction to CUDA C

Introduction to CUDA C Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard

More information

Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL

Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL Michał Wójcik, Tomasz Boiński Katedra Architektury Systemów Komputerowych Wydział Elektroniki, Telekomunikacji i Informatyki Politechnika

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

More information

GTC 2014 San Jose, California

GTC 2014 San Jose, California GTC 2014 San Jose, California An Approach to Parallel Processing of Big Data in Finance for Alpha Generation and Risk Management Yigal Jhirad and Blay Tarnoff March 26, 2014 GTC 2014: Table of Contents

More information

Parallel Computing for Data Science

Parallel Computing for Data Science Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

OpenCL. Administrivia. From Monday. Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011. Assignment 5 Posted. Project

OpenCL. Administrivia. From Monday. Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011. Assignment 5 Posted. Project Administrivia OpenCL Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 5 Posted Due Friday, 03/25, at 11:59pm Project One page pitch due Sunday, 03/20, at 11:59pm 10 minute pitch

More information

GPU Tools Sandra Wienke

GPU Tools Sandra Wienke Sandra Wienke Center for Computing and Communication, RWTH Aachen University MATSE HPC Battle 2012/13 Rechen- und Kommunikationszentrum (RZ) Agenda IDE Eclipse Debugging (CUDA) TotalView Profiling (CUDA

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

First In Vivo Medical Images Using Photon- Counting, Real-Time GPU Reconstruction

First In Vivo Medical Images Using Photon- Counting, Real-Time GPU Reconstruction First In Vivo Medical Images Using Photon- Counting, Real-Time GPU Reconstruction A.P. Lowell P. Kahn J. Ku 25 March 2014 Overview Application Algorithms History and Limitations of Traditional Processors

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

Sélection adaptative de codes polyédriques pour GPU/CPU

Sélection adaptative de codes polyédriques pour GPU/CPU Sélection adaptative de codes polyédriques pour GPU/CPU Jean-François DOLLINGER, Vincent LOECHNER, Philippe CLAUSS INRIA - Équipe CAMUS Université de Strasbourg Saint-Hippolyte - Le 6 décembre 2011 1 Sommaire

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

Interconnect. Jesús Labarta. Index

Interconnect. Jesús Labarta. Index Interconnect Jesús Labarta Index 1 Interconnection networks Need to send messages (commands/responses, message passing) Processors Memory Node Node Interconnection networks Components Links Switches Network

More information

Operating Systems. Design and Implementation. Andrew S. Tanenbaum Melanie Rieback Arno Bakker. Vrije Universiteit Amsterdam

Operating Systems. Design and Implementation. Andrew S. Tanenbaum Melanie Rieback Arno Bakker. Vrije Universiteit Amsterdam Operating Systems Design and Implementation Andrew S. Tanenbaum Melanie Rieback Arno Bakker Vrije Universiteit Amsterdam Operating Systems - Winter 2012 Outline Introduction What is an OS? Concepts Processes

More information

Outline. Operating Systems Design and Implementation. Chap 1 - Overview. What is an OS? 28/10/2014. Introduction

Outline. Operating Systems Design and Implementation. Chap 1 - Overview. What is an OS? 28/10/2014. Introduction Operating Systems Design and Implementation Andrew S. Tanenbaum Melanie Rieback Arno Bakker Outline Introduction What is an OS? Concepts Processes and Threads Memory Management File Systems Vrije Universiteit

More information

The Yin and Yang of Processing Data Warehousing Queries on GPU Devices

The Yin and Yang of Processing Data Warehousing Queries on GPU Devices The Yin and Yang of Processing Data Warehousing Queries on GPU Devices Yuan Yuan Rubao Lee Xiaodong Zhang Department of Computer Science and Engineering The Ohio State University {yuanyu, liru, zhang}@cse.ohio-state.edu

More information

Best Practice mini-guide accelerated clusters

Best Practice mini-guide accelerated clusters Using General Purpose GPUs Alan Gray, EPCC Anders Sjöström, LUNARC Nevena Ilieva-Litova, NCSA Partial content by CINECA: http://www.hpc.cineca.it/content/gpgpu-general-purpose-graphics-processing-unit

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

CUDA C/C++ Basics Supercomputing 2011 Tutorial Cyril Zeller, NVIDIA Corporation

CUDA C/C++ Basics Supercomputing 2011 Tutorial Cyril Zeller, NVIDIA Corporation CUDA C/C++ Basics Supercomputing 2011 Tutorial Cyril Zeller, NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU computing for general purpose Retain performance CUDA C/C++ Based on industry-standard

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com

Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com Black-Scholes option pricing Victor Podlozhnyuk vpodlozhnyuk@nvidia.com June 007 Document Change History Version Date Responsible Reason for Change 0.9 007/03/19 vpodlozhnyuk Initial release 1.0 007/04/06

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Operating Systems: Internals and Design Principles. Chapter 12 File Management Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 12 File Management Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 12 File Management Seventh Edition By William Stallings Operating Systems: Internals and Design Principles If there is one singular characteristic

More information

Pricing of cross-currency interest rate derivatives on Graphics Processing Units

Pricing of cross-currency interest rate derivatives on Graphics Processing Units Pricing of cross-currency interest rate derivatives on Graphics Processing Units Duy Minh Dang Department of Computer Science University of Toronto Toronto, Canada dmdang@cs.toronto.edu Joint work with

More information

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Jonatan Ward Sergey Andreev Francisco Heredia Bogdan Lazar Zlatka Manevska Eindhoven University of Technology,

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

www.quilogic.com SQL/XML-IMDBg GPU boosted In-Memory Database for ultra fast data management Harald Frick CEO QuiLogic In-Memory DB Technology

www.quilogic.com SQL/XML-IMDBg GPU boosted In-Memory Database for ultra fast data management Harald Frick CEO QuiLogic In-Memory DB Technology SQL/XML-IMDBg GPU boosted In-Memory Database for ultra fast data management Harald Frick CEO QuiLogic In-Memory DB Technology The parallel revolution Future computing systems are parallel, but Programmers

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

GPU Acceleration of the SENSEI CFD Code Suite

GPU Acceleration of the SENSEI CFD Code Suite GPU Acceleration of the SENSEI CFD Code Suite Chris Roy, Brent Pickering, Chip Jackson, Joe Derlaga, Xiao Xu Aerospace and Ocean Engineering Primary Collaborators: Tom Scogland, Wu Feng (Computer Science)

More information

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices Sawmill Log Analyzer Best Practices!! Page 1 of 6 Sawmill Log Analyzer Best Practices! Sawmill Log Analyzer Best Practices!! Page 2 of 6 This document describes best practices for the Sawmill universal

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

Heavy Parallelization of Alternating Direction Schemes in Multi-Factor Option Valuation Models. Cris Doloc, Ph.D.

Heavy Parallelization of Alternating Direction Schemes in Multi-Factor Option Valuation Models. Cris Doloc, Ph.D. Heavy Parallelization of Alternating Direction Schemes in Multi-Factor Option Valuation Models Cris Doloc, Ph.D. WHO INTRO Ex-physicist Ph.D. in Computational Physics - Applied TN Plasma (10 yrs) Working

More information

Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems

Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems Lecture Outline Operating Systems Objectives Describe the functions and layers of an operating system List the resources allocated by the operating system and describe the allocation process Explain how

More information

Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development

Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development Official Use Only Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development Christian Trott Unclassified, Unlimited release Sandia National Laboratories is a multi-program

More information

Multi-GPU Programming Supercomputing 2011

Multi-GPU Programming Supercomputing 2011 Multi-GPU Programming Supercomputing 2011 Paulius Micikevicius NVIDIA November 14, 2011 Outline Usecases and a taxonomy of scenarios Inter-GPU communication: Single host, multiple GPUs Multiple hosts Case

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

The GPU Hardware and Software Model: The GPU is not a PRAM (but it s not far off)

The GPU Hardware and Software Model: The GPU is not a PRAM (but it s not far off) The GPU Hardware and Software Model: The GPU is not a PRAM (but it s not far off) CS 223 Guest Lecture John Owens Electrical and Computer Engineering, UC Davis Credits This lecture is originally derived

More information

Rootbeer: Seamlessly using GPUs from Java

Rootbeer: Seamlessly using GPUs from Java Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University. Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java

More information

Graphics Processing Unit (GPU) Memory Hierarchy. Presented by Vu Dinh and Donald MacIntyre

Graphics Processing Unit (GPU) Memory Hierarchy. Presented by Vu Dinh and Donald MacIntyre Graphics Processing Unit (GPU) Memory Hierarchy Presented by Vu Dinh and Donald MacIntyre 1 Agenda Introduction to Graphics Processing CPU Memory Hierarchy GPU Memory Hierarchy GPU Architecture Comparison

More information

CUDA. Multicore machines

CUDA. Multicore machines CUDA GPU vs Multicore computers Multicore machines Emphasize multiple full-blown processor cores, implementing the complete instruction set of the CPU The cores are out-of-order implying that they could

More information

Optimisation and SPH Tricks

Optimisation and SPH Tricks Optimisation and SPH Tricks + José Manuel Domínguez Alonso jmdominguez@uvigo.es EPHYSLAB, Universidade de Vigo, Spain Outline DualSPHysics Users Workshop 2015, 8-9 September 2015, Manchester (United Kingdom)

More information

Configuring Apache Derby for Performance and Durability Olav Sandstå

Configuring Apache Derby for Performance and Durability Olav Sandstå Configuring Apache Derby for Performance and Durability Olav Sandstå Database Technology Group Sun Microsystems Trondheim, Norway Overview Background > Transactions, Failure Classes, Derby Architecture

More information

CUDA for Real Time Multigrid Finite Element Simulation of

CUDA for Real Time Multigrid Finite Element Simulation of CUDA for Real Time Multigrid Finite Element Simulation of SoftTissue Deformations Christian Dick Computer Graphics and Visualization Group Technische Universität München, Germany Motivation Real time physics

More information

JPEG Compression Algorithm Using CUDA Abstract 1. Introduction

JPEG Compression Algorithm Using CUDA Abstract 1. Introduction JPEG Compression Algorithm Using CUDA Course Project for ECE 1724 Pranit Patel, Jeff Wong, Manisha Tatikonda, and Jarek Marczewski Department of Computer Engineering University of Toronto Abstract The

More information

Efficient Sparse Matrix-Vector Multiplication on CUDA

Efficient Sparse Matrix-Vector Multiplication on CUDA Efficient Sparse -Vector Multiplication on CUDA Nathan Bell and Michael Garland December 11, 2008 Abstract The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many

More information

Chapter 12 File Management. Roadmap

Chapter 12 File Management. Roadmap Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access

More information

Chapter 12 File Management

Chapter 12 File Management Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Technical Note. Enhancing Burst Performance on Micron and Crucial SSDs With Momentum Cache. Introduction

Technical Note. Enhancing Burst Performance on Micron and Crucial SSDs With Momentum Cache. Introduction Technical Note Enhancing Burst Performance on Micron and Crucial SSDs With Momentum Cache Introduction What is Momentum Cache? How Does Momentum Cache Work? Introduction Micron's Momentum Cache is an intelligent

More information

Device Management Functions

Device Management Functions REAL TIME OPERATING SYSTEMS Lesson-6: Device Management Functions 1 1. Device manager functions 2 Device Driver ISRs Number of device driver ISRs in a system, Each device or device function having s a

More information

ELEC 377 Operating Systems. Thomas R. Dean

ELEC 377 Operating Systems. Thomas R. Dean ELEC 377 Operating Systems Thomas R. Dean Instructor Tom Dean Office:! WLH 421 Email:! tom.dean@queensu.ca Hours:! Wed 14:30 16:00 (Tentative)! and by appointment! 6 years industrial experience ECE Rep

More information

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach Beniamino Di Martino, Antonio Esposito and Andrea Barbato Department of Industrial and Information Engineering Second University of Naples

More information

COS 318: Operating Systems. I/O Device and Drivers. Input and Output. Definitions and General Method. Revisit Hardware

COS 318: Operating Systems. I/O Device and Drivers. Input and Output. Definitions and General Method. Revisit Hardware COS 318: Operating Systems I/O and Drivers Input and Output A computer s job is to process data Computation (, cache, and memory) Move data into and out of a system (between I/O devices and memory) Challenges

More information