# 1. If we need to use each thread to calculate one output element of a vector addition, what would

## Transcription

5. In the previous question, how many warps do you expect to have divergence due to the boundary check on vector length?
   (B) 2 (C) 3 (D) 6
   Answer: (A)

Quiz Questions Lecture 3:

1. For our tiled matrix-matrix multiplication kernel, if we use a 32x32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
   (A) 1/8 of the original usage (B) 1/16 of the original usage (C) 1/32 of the original usage (D) 1/64 of the original usage
2. Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
   (B) 1000 (C) 512 (D)
3. In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
   (B) 1000 (C) 512 (D) 51200
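
One way to reason about the tiled multiplication question above (Lecture 3, question 1) is to count global-memory loads directly. The sketch below is a back-of-the-envelope model, not the lecture's kernel. It assumes square n x n matrices whose width is a multiple of the tile width, one thread per output element, and that each thread of a tiled block loads one element of M and one of N per phase. The function names and the choice n = 1024 are illustrative.

```python
def global_loads_naive(n):
    # n*n threads, each reading a full row of M and a full column of N.
    return n * n * 2 * n


def global_loads_tiled(n, tile):
    # Assumption: n is a multiple of the tile width.
    assert n % tile == 0
    blocks = (n // tile) ** 2          # one block per output tile
    phases = n // tile                 # tiles swept along the shared dimension
    loads_per_phase = 2 * tile * tile  # one M and one N element per thread
    return blocks * phases * loads_per_phase


n, tile = 1024, 32
ratio = global_loads_tiled(n, tile) / global_loads_naive(n)
print(ratio)  # 0.03125
```

Under these assumptions the ratio equals 1/tile for any n that the tile width divides, because every element brought into shared memory is reused by `tile` threads.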

4. For the simple matrix-matrix multiplication (MxN) based on row-major layout, which input matrix will have coalesced accesses?
   (A) M (B) N (C) M, N (D) Neither
5. For the tiled matrix-matrix multiplication (MxN) based on row-major layout, which input matrix will have coalesced accesses?
   (A) M (B) N (C) M, N (D) Neither

Quiz Questions Lecture 4:

1. For the simple reduction kernel, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?
   (A) 0 (B) 1 (C) 16 (D) 32
   All warps will have divergence throughout the execution.
2. For the improved reduction kernel, if the block size is 1024 and the warp size is 32, how many warps will have divergence during the 5th iteration?
   (A) 0 (B) 1 (C) 16 (D) 32
   Answer: (A). There are 64 consecutive active threads, more than the warp size.
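
The two reduction questions above can be checked by enumerating which threads are active in a given iteration. The sketch below is a model under assumptions, since the kernels themselves are not shown here. It assumes the simple kernel activates threads with `t % stride == 0` while the stride doubles from 1, and that the improved kernel activates threads with `t < stride` while the stride halves from the block size. A warp counts as divergent when some, but not all, of its threads are active. The helper name `divergent_warps` is hypothetical.

```python
WARP_SIZE = 32


def divergent_warps(is_active, block_size, warp_size=WARP_SIZE):
    """Count warps in which some, but not all, threads take the branch."""
    count = 0
    for w in range(block_size // warp_size):
        lanes = [is_active(t) for t in range(w * warp_size, (w + 1) * warp_size)]
        if any(lanes) and not all(lanes):
            count += 1
    return count


block_size = 1024
iteration = 5

# Assumed stride schedules: 1, 2, 4, 8, 16 and 1024, 512, 256, 128, 64.
simple_stride = 2 ** (iteration - 1)
improved_stride = block_size >> (iteration - 1)

simple = divergent_warps(lambda t: t % simple_stride == 0, block_size)
improved = divergent_warps(lambda t: t < improved_stride, block_size)
print(simple, improved)  # 32 0
```

In this model the improved kernel keeps its 64 active threads in two fully active warps during the 5th iteration, which is why no warp diverges.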

3. For the work-efficient scan kernel, assume that we have 2048 elements. How many add operations will be performed in both the reduction tree phase and the inverse reduction tree phase?
   (A) (2048 - 1)*2 (B) (1024 - 1)*2 (C) 1024*1024 (D) 10*1024
   Answer: (A)
4. For the work-inefficient scan kernel based on reduction trees, assume that we have 2048 elements. Which of the following gives the closest approximation of how many add operations will be performed?
   (A) (2048 - 1)*2 (B) (1024 - 1)*2 (C) 1024*1024 (D) 10*
5. For the vector addition example where input vectors are read from disk, if the GPU kernel runs at 190 GFLOPS and the PCIe bus is able to deliver a bandwidth of 6 GB/s, which of the following is the closest approximation of the minimum time it would take to add two 190-mega-element vectors stored in host memory and get the result back to host memory?
   90 / 190 ms (B) 190 / 6 ms (C) 8 * 190 / 6 ms (D) 2 * 190 / 6 ms

Lecture 5:

1. What is the CUDA API call that makes sure that all previous kernel executions and memory copies have been completed?
   (A) __syncthreads() (B) cudaDeviceSynchronize() (C) cudaStreamSynchronize() (D) barrier()
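
For the two scan questions above (Lecture 4, questions 3 and 4), the add counts can be tallied stride by stride. This is a sketch under assumptions rather than the lecture's code: the work-efficient version is modelled as a reduction tree followed by an inverse tree over a power-of-two number of elements, and the work-inefficient version as a loop in which every element at index >= stride is updated at each doubling stride. Both function names are hypothetical.

```python
def work_efficient_adds(n):
    """Adds in the reduction tree plus the inverse (distribution) tree."""
    adds = 0
    stride = 1
    while stride < n:            # reduction tree: strides 1, 2, ..., n/2
        adds += n // (2 * stride)
        stride *= 2
    stride = n // 4
    while stride >= 1:           # inverse tree: strides n/4, ..., 2, 1
        adds += n // (2 * stride) - 1
        stride //= 2
    return adds


def work_inefficient_adds(n):
    """Adds when each step updates every element at index >= stride."""
    adds = 0
    stride = 1
    while stride < n:
        adds += n - stride
        stride *= 2
    return adds


n = 2048  # assumed to be a power of two
efficient = work_efficient_adds(n)
inefficient = work_inefficient_adds(n)
print(efficient, 2 * (n - 1))  # 4083 4094
print(inefficient)             # 20481
```

In this model the work-efficient count stays just under the (n - 1)*2 bound, while the stride-per-step version grows like n*log2(n).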

2. Which of the following statements is true?
   (A) The data transfer between device and host is done by DMA hardware using virtual addresses.
   (B) The OS automatically guarantees that any memory being used by a DMA device is not swapped out.
   (C) If a swapped page is to be transferred by cudaMemcpy(), it needs to be first copied to a pinned memory buffer before being transferred.
   (D) Pinned memory is allocated with the cudaMalloc() function.

Lecture 6:

1. For vector addition, if there are 100,000 elements in each vector and we are using 3 compute processes, how many elements are we sending to the last compute process?
   (A) 5 (B) 300 (C) 333 (D)
2. If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a data transfer of bytes, what is the size of each data element being sent?
   (A) 1 byte (B) 2 bytes (C) 4 bytes (D) 8 bytes
3. Which of the following statements is true?
   (A) MPI_Send() is blocking by default.
   (B) MPI_Recv() is blocking by default.
   (C) MPI messages must be at least 128 bytes.
   (D) MPI processes can access the same variable through shared memory.
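
The partitioning question above (Lecture 6, question 1) depends on how the remainder is handled when the vector length is not a multiple of the process count. The sketch below assumes one common scheme, in which every compute process receives the floor of the even share and the last process also absorbs the remainder. Other schemes spread the remainder across several processes, so treat this as an illustration only. The function name `partition` is hypothetical.

```python
def partition(total_elements, num_compute_processes):
    """Split elements evenly; the last process also takes the remainder."""
    base = total_elements // num_compute_processes
    counts = [base] * num_compute_processes
    counts[-1] += total_elements - base * num_compute_processes
    return counts


counts = partition(100_000, 3)
print(counts)  # [33333, 33333, 33334]
```

With this scheme the counts always sum to the vector length, and the last process never receives fewer elements than the others.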

### CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Reference solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @

### Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

### Program Optimization Study on a 128-Core GPU

Program Optimization Study on a 128-Core GPU Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, and Wen-mei W. Hwu Yu, Xuan Dept of Computer & Information Sciences University

### GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

### Optimizing Application Performance with CUDA Profiling Tools

Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100s of cores 10,000s of threads Great memory

### Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

### Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary

OpenCL Optimization Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary 2 Overall Optimization Strategies Maximize parallel

### GPU Performance Analysis and Optimisation

GPU Performance Analysis and Optimisation Thomas Bradley, NVIDIA Corporation Outline What limits performance? Analysing performance: GPU profiling Exposing sufficient parallelism Optimising for Kepler

### Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

### Learn CUDA in an Afternoon: Hands-on Practical Exercises

Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA

### CUDA Basics. Murphy Stein New York University

CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture

### E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on iOS devices

E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on iOS devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

### A Model-Driven Partitioning and Auto-tuning Integrated Framework for Sparse Matrix-Vector Multiplication on GPUs

A Model-Driven Partitioning and Auto-tuning Integrated Framework for Sparse Matrix-Vector Multiplication on GPUs HUANG, He Department of Computer Science, University of Wyoming Ping Guo, He Huang, Qichang

### GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

### GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 2 Shared memory in detail

### CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

### GPU Memory. Memory access:100 times more time to access local/global memory. Maximize shared/ register memory.

GPU Memory Local registers per thread. A parallel data cache or shared memory that is shared by all the threads. A read-only constant cache that is shared by all the threads. A read-only texture cache

### CUDA Programming. Week 4. Shared memory and register

CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK

### OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

### Hands-on CUDA exercises

Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished

### Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

### CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization Michael Bauer Stanford University mebauer@cs.stanford.edu Henry Cook UC Berkeley hcook@cs.berkeley.edu Brucek Khailany NVIDIA Research bkhailany@nvidia.com

### CS 179 Lecture 13. Host-Device Data Transfer

CS 179 Lecture 13 Host-Device Data Transfer 1 Moving data is slow So far we've only considered performance when the data is already on the GPU This neglects the slowest part of GPU programming: getting

### Guided Performance Analysis with the NVIDIA Visual Profiler

Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided

### MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

### Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

### CS 179 Lecture 6. Synchronization, Matrix Transpose, Profiling, AWS Cluster

CS 179 Lecture 6 Synchronization, Matrix Transpose, Profiling, AWS Cluster Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many algorithms

### Image Processing & Video Algorithms with CUDA

Image Processing & Video Algorithms with CUDA Eric Young & Frank Jargstorff 8 NVIDIA Corporation. introduction Image processing is a natural fit for data parallel processing Pixels can be mapped directly

### Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

### CUDA STREAMS BEST PRACTICES AND COMMON PITFALLS. Justin Luitjens - NVIDIA

CUDA STREAMS BEST PRACTICES AND COMMON PITFALLS Justin Luitjens - NVIDIA Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Launch a GPU Kernel 3. Copy results from GPU

### GPU Accelerated Pathfinding

GPU Accelerated Pathfinding By: Avi Bleiweiss NVIDIA Corporation Graphics Hardware (2008) Editors: David Luebke and John D. Owens NTNU, TDT24 Presentation by Lars Espen Nordhus http://delivery.acm.org/10.1145/1420000/1413968/p65-bleiweiss.pdf?ip=129.241.138.231&acc=active

### GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

### Hands On CUDA Tools and Performance-Optimization

Mitglied der Helmholtz-Gemeinschaft Hands On CUDA Tools and Performance-Optimization JSC GPU Programming Course 26. März 2011 Dominic Eschweiler Outline of This Talk Introduction Setup CUDA-GDB Profiling

### ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Paweł Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

### CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University

CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum

### CUDA programming on NVIDIA GPUs

p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

### HPC with Multicore and GPUs

HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline Introduction - Hardware

### OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

### Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

### Parallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com

Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007

### Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

### NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

### Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What

### Improving SIMD Efficiency for Parallel Monte Carlo Light Transport on the GPU. by Dietger van Antwerpen

Improving SIMD Efficiency for Parallel Monte Carlo Light Transport on the GPU by Dietger van Antwerpen Outline Introduction Path Tracing Bidirectional Path Tracing Metropolis Light Transport Results Demo

### Parallel Computing for Data Science

Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint

### An Introduction to GPU Computing and CUDA Architecture

An Introduction to GPU Computing and CUDA Architecture Sarah Tariq, NVIDIA Corporation GPU Computing GPU: Graphics Processing Unit Traditionally used for real-time rendering High computational density

### GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

### ~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

### AUDIO ON THE GPU: REAL-TIME TIME DOMAIN CONVOLUTION ON GRAPHICS CARDS. A Thesis by ANDREW KEITH LACHANCE May 2011

AUDIO ON THE GPU: REAL-TIME TIME DOMAIN CONVOLUTION ON GRAPHICS CARDS A Thesis by ANDREW KEITH LACHANCE May 2011 Submitted to the Graduate School Appalachian State University in partial fulfillment of

### First In Vivo Medical Images Using Photon-Counting, Real-Time GPU Reconstruction

First In Vivo Medical Images Using Photon-Counting, Real-Time GPU Reconstruction A.P. Lowell P. Kahn J. Ku 25 March 2014 Overview Application Algorithms History and Limitations of Traditional Processors

### Introduction to CUDA C

Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard

### HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

### Operating Systems. Design and Implementation. Andrew S. Tanenbaum Melanie Rieback Arno Bakker. Vrije Universiteit Amsterdam

Operating Systems Design and Implementation Andrew S. Tanenbaum Melanie Rieback Arno Bakker Vrije Universiteit Amsterdam Operating Systems - Winter 2012 Outline Introduction What is an OS? Concepts Processes

### Outline. Operating Systems Design and Implementation. Chap 1 - Overview. What is an OS? 28/10/2014. Introduction

Operating Systems Design and Implementation Andrew S. Tanenbaum Melanie Rieback Arno Bakker Outline Introduction What is an OS? Concepts Processes and Threads Memory Management File Systems Vrije Universiteit

### Advanced CUDA Webinar. Memory Optimizations

Advanced CUDA Webinar Memory Optimizations Outline Overview Hardware Memory Optimizations Data transfers between host and device Device memory optimizations Summary Measuring performance effective bandwidth

### Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL

Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL Michał Wójcik, Tomasz Boiński Katedra Architektury Systemów Komputerowych Wydział Elektroniki, Telekomunikacji i Informatyki Politechnika

### GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

### GTC 2014 San Jose, California

GTC 2014 San Jose, California An Approach to Parallel Processing of Big Data in Finance for Alpha Generation and Risk Management Yigal Jhirad and Blay Tarnoff March 26, 2014 GTC 2014: Table of Contents

### 8 GPU & Cuda. Quellen: ents.pdf

8 GPU & Cuda Quellen: http://courses.cs.washington.edu/courses/cse471/13sp/lectures/gpusstudents.pdf http://on-demand.gputechconf.com/gtcexpress/2011/presentations/gtc_express_sarah_tariq_june2011.pdf

### OpenCL. Administrivia. From Monday. Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011. Assignment 5 Posted. Project

Administrivia OpenCL Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 5 Posted Due Friday, 03/25, at 11:59pm Project One page pitch due Sunday, 03/20, at 11:59pm 10 minute pitch

### Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today Why care about F#? Just another fashion? Three success stories How Alea.cuBase

### GPU Tools Sandra Wienke

Sandra Wienke Center for Computing and Communication, RWTH Aachen University MATSE HPC Battle 2012/13 Rechen- und Kommunikationszentrum (RZ) Agenda IDE Eclipse Debugging (CUDA) TotalView Profiling (CUDA

### GOJAN SCHOOL OF BUSINESS AND TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY CS2411-OPERATING SYSTEM QUESTION BANK UNIT-I (PROCESSES AND THREADS)

GOJAN SCHOOL OF BUSINESS AND TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY CS2411-OPERATING SYSTEM QUESTION BANK UNIT-I (PROCESSES AND THREADS) 1. What is an Operating system? What are the various OS

### Interconnect. Jesús Labarta. Index

Interconnect Jesús Labarta Index 1 Interconnection networks Need to send messages (commands/responses, message passing) Processors Memory Node Node Interconnection networks Components Links Switches Network

### Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

### OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

### Sélection adaptative de codes polyédriques pour GPU/CPU

Sélection adaptative de codes polyédriques pour GPU/CPU Jean-François DOLLINGER, Vincent LOECHNER, Philippe CLAUSS INRIA - Équipe CAMUS Université de Strasbourg Saint-Hippolyte - Le 6 décembre 2011 1 Sommaire

### Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

### IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA. A Thesis by. Deepthi Gummadi

IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA A Thesis by Deepthi Gummadi Bachelor of Engineering and Technology, Jawaharlal Nehru Technological University, 2009 Submitted to the Department of

### Lecture 5. An improved matrix multiply Occupancy Thread divergence

Lecture 5 An improved matrix multiply Occupancy Thread divergence Today's lecture Occupancy and latency Further Improvements to Matrix Multiply Thread divergence Scott B. Baden / CSE 262 / UCSD, Wi '15

### NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

### GPUs: Doing More Than Just Games. Mark Gahagan CSE 141 November 29, 2012

GPUs: Doing More Than Just Games Mark Gahagan CSE 141 November 29, 2012 Outline Introduction: Why multicore at all? Background: What is a GPU? Quick Look: Warps and Threads (SIMD) NVIDIA Tesla: The First

### Operating Systems: Internals and Design Principles. Chapter 12 File Management Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles Chapter 12 File Management Seventh Edition By William Stallings Operating Systems: Internals and Design Principles If there is one singular characteristic

### Section I Section Real Time Systems. Processes. 1.7 Memory Management. (Textbook: A. S. Tanenbaum Modern OS - ch. 3) Memory Management Introduction

EE206: Software Engineering IV 1.7 Memory Management page 1 of 28 Section I Section Real Time Systems. Processes 1.7 Memory Management (Textbook: A. S. Tanenbaum Modern OS - ch. 3) Memory Management Introduction

### Last Class: File System Abstraction. Today: File System Implementation

Last Class: File System Abstraction Lecture 19, page 1 Today: File System Implementation Disk management Brief review of how disks work. How to organize data on to disks. Lecture 19, page 2 How Disks

### Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

### Page 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems

Lecture Outline Operating Systems Objectives Describe the functions and layers of an operating system List the resources allocated by the operating system and describe the allocation process Explain how

### High-Performance Software Rasterization on GPUs. NVIDIA Research

High-Performance Software Rasterization on GPUs Samuli Laine Tero Karras NVIDIA Research Graphics and Programmability Graphics pipeline (OpenGL/D3D) Driven by dedicated hardware Executes user code in shaders

### Sawmill Log Analyzer Best Practices

Sawmill Log Analyzer Best Practices This document describes best practices for the Sawmill universal

### Accelerating Wavelet-Based Video Coding on Graphics Hardware

Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink. Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA. In Proc. 6th International Symposium on Image and Signal Processing

### Device Management Functions

REAL TIME OPERATING SYSTEMS Lesson-6: Device Management Functions 1 1. Device manager functions 2 Device Driver ISRs Number of device driver ISRs in a system, Each device or device function having s a

### GPU Computing - CUDA

GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

### CUDA C/C++ Basics Supercomputing 2011 Tutorial Cyril Zeller, NVIDIA Corporation

CUDA C/C++ Basics Supercomputing 2011 Tutorial Cyril Zeller, NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU computing for general purpose Retain performance CUDA C/C++ Based on industry-standard

### The Yin and Yang of Processing Data Warehousing Queries on GPU Devices

The Yin and Yang of Processing Data Warehousing Queries on GPU Devices Yuan Yuan Rubao Lee Xiaodong Zhang Department of Computer Science and Engineering The Ohio State University {yuanyu, liru, zhang}@cse.ohio-state.edu

### Configuring Apache Derby for Performance and Durability Olav Sandstå

Configuring Apache Derby for Performance and Durability Olav Sandstå Database Technology Group Sun Microsystems Trondheim, Norway Overview Background > Transactions, Failure Classes, Derby Architecture

### Best Practice mini-guide accelerated clusters

Using General Purpose GPUs Alan Gray, EPCC Anders Sjöström, LUNARC Nevena Ilieva-Litova, NCSA Partial content by CINECA: http://www.hpc.cineca.it/content/gpgpu-general-purpose-graphics-processing-unit

### LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa-lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

### Chapter 12 File Management. Roadmap

Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access

### Chapter 12 File Management

Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access

### Multi-GPU Programming Supercomputing 2011

Multi-GPU Programming Supercomputing 2011 Paulius Micikevicius NVIDIA November 14, 2011 Outline Usecases and a taxonomy of scenarios Inter-GPU communication: Single host, multiple GPUs Multiple hosts Case

### Last Class: File System Abstraction. Protection

Last Class: File System Abstraction Naming Protection Persistence Fast access Lecture 17, page 1 Protection The OS must allow users to control sharing of their files => control access to files Grant or

### Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Jonatan Ward Sergey Andreev Francisco Heredia Bogdan Lazar Zlatka Manevska Eindhoven University of Technology,

### Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com

Black-Scholes option pricing Victor Podlozhnyuk vpodlozhnyuk@nvidia.com June 2007 Document Change History Version Date Responsible Reason for Change 0.9 2007/03/19 vpodlozhnyuk Initial release 1.0 2007/04/06

### Multi-core Programming System Overview

Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

### ELEC 377 Operating Systems. Thomas R. Dean

ELEC 377 Operating Systems Thomas R. Dean Instructor Tom Dean Office: WLH 421 Email: tom.dean@queensu.ca Hours: Wed 14:30-16:00 (Tentative) and by appointment 6 years industrial experience ECE Rep

### Pricing of cross-currency interest rate derivatives on Graphics Processing Units

Pricing of cross-currency interest rate derivatives on Graphics Processing Units Duy Minh Dang Department of Computer Science University of Toronto Toronto, Canada dmdang@cs.toronto.edu Joint work with

### High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach Beniamino Di Martino, Antonio Esposito and Andrea Barbato Department of Industrial and Information Engineering Second University of Naples

### GPU Acceleration of the SENSEI CFD Code Suite

GPU Acceleration of the SENSEI CFD Code Suite Chris Roy, Brent Pickering, Chip Jackson, Joe Derlaga, Xiao Xu Aerospace and Ocean Engineering Primary Collaborators: Tom Scogland, Wu Feng (Computer Science)