# GPU Programming Strategies and Trends in GPU Computing

Save this PDF as:

Size: px
Start display at page:

## Transcription

9 Math Memory Total Math Memory Total Math Memory Total Math Memory Total Figure 7: Normalized run-time of modified kernels which are used to identify bottlenecks: (top left) a well balanced kernel, (top right) a latency bound kernel, (bottom left) a memory bound kernel, and (bottom right) an arithmetic bound kernel. Total refers to the total kernel time, whilst Memory refers to a kernel stripped of arithmetic operations, and Math refers to a kernel stripped of memory operations. It is important to note that latencies are part of the measured run-times for all kernel versions. To get the most accurate figures, we can follow another strategy which is to compare the run-time of three modified versions of the kernel: The original kernel, one Math version in which all memory loads and stores are removed, and one Memory version in which all arithmetic operations are removed (see Figure 7). If the Math version is significantly faster than the original and Memory kernels, we know that the kernel is memory bound, and conversely for arithmetics. This method has the added benefit of showing how well memory operations and arithmetic operations overlap. To create the Math kernel, we simply comment out all load operations, and move every store operation inside conditionals that will always evaluate to false. We do this to fool the compiler so that it does not optimize away the parts we want to profile, since the compiler will strip away all code not contributing to the final output to global memory. However, to make sure that the compiler does not move the computations inside the conditional as well, the result of the computations must also be used in the condition as shown in Listing 1. Creating the Memory kernel, on the other hand, is much simpler. Here, we can simply comment out all arithmetic operations, and instead add all data used by the kernel, and write out the sum as the result. Note that if control flow or addressing is dependent on data in memory the method becomes less straightforward and requires special care. A further issue with modifying the source code is that register count can change, which again can increase the occupancy and thereby invalidate the measured run-time. This, however, can be is solved by increasing the shared memory parameter in the launch configuration of the kernel, somekernel<<<grid_size, block_size, shared_mem_size,...>>>(...), until the occupancy of the unmodified version is matched. The occupancy can easily be examined using the profiler or the CUDA Occupancy Calculator. If a kernel appears to be well balanced (i.e., neither memory nor arithmetics appear to be the bottleneck), we must still check whether or not our kernel operates close to the theoretical performance numbers since it can suffer from latencies. These latencies are typically caused by problematic data dependencies or the inherent latencies of arithmetic operations. Thus, if your kernel is well balanced, but operates at only a fraction of the theoretical peak, it is probably bound by latencies. In this case, a reorganization of memory requests and arithmetic operations is often required. The goal should be to have many outstanding memory requests that can overlap with arithmetic operations. 9

13 gives a single rating (gigaflops) from a single benchmark (solving a linear system of equations). For GPUs, we want to see an end to the escalating and misleading speedup-race, and rather see detailed profiling results. This will give a much better view of how well the algorithm exploits the hardware, both on the CPU and on the GPU. Furthermore, reporting how well your implementation utilizes the hardware, and what the bottlenecks are, will give insight into how well it will perform on similar and even future hardware. Another benefit here, is that it becomes transparent what the bottleneck of the algorithm is, meaning it will become clear what hardware vendors and researchers should focus on for improving performance. 5 Debugging As an ever increasing portion of the C++ standard is supported by CUDA and more advanced debugging tools emerge, debugging GPU codes becomes more and more like debugging CPU codes. Many CUDA programmers have encountered the unspecified launch failure, which could be notoriously hard to debug. Such errors were typically only found by either modification and experimenting, or by careful examination of the source code. Today, however, there are powerful CUDA debugging tools for all the most commonly used operating systems. CUDA-GDB, available for Linux and Mac, can step through a kernel line by line at the granularity of a warp, e.g., identifying where an out-of-bounds memory access occurs, in a similar fashion to debugging a CPU program with GDB. In addition to stepping, CUDA-GDB also supports breakpoints, variable watches, and switching between blocks and threads. Other useful features include reports on the currently active CUDA threads on the GPU, reports on current hardware and memory utilization, and in-place substitution of changed code in running CUDA application. The tool enables debugging on hardware in real-time, and the only requirement for using CUDA-GDB is that the kernel is compiled with the -g -G flags. These flags make the compiler add debugging information into the executable, and the executable to spill all variables to memory. On Microsoft Windows, Parallel NSight is a plug-in for Microsoft Visual Studio which offers conditional breakpoints, assembly level debugging, and memory checking directly in the Visual Studio IDE. It furthermore offers an excellent profiling tool, and is freely available to developers. However, debugging requires two distinct GPUs; one for display, and one for running the actual code to be debugged. 6 Trends in GPU Computing GPUs have truly been a disruptive technology. Starting out as academic examples, they were shunned in scientific and high-performance communities: they were inflexible, inaccurate, and required a complete redesign of existing software. However, with time, there has been a low-end disruption, in which GPUs have slowly conquered a large portion of the high-performance computing segment, and three of five of todays fastest supercomputers are powered mainly by GPUs [23]. GPUs were never designed for high-performance computing in mind, but have nevertheless evolved into powerful processors that fit this market perfectly in less than ten years. The software and the hardware has developed in harmony, both due to the hardware vendors seeing new possibilities and due to researchers and industry identifying points of improvement. It is impossible to predict the future of GPUs, but by examining its history and current state, we might be able to identify some trends that are worth noting. The success of GPUs has partly been due to its price: GPUs are ridiculously inexpensive in terms of performance per dollar. This again comes from the mass production and target market. Essentially, GPUs are inexpensive because NVIDIA and AMD sell a lot of GPUs to the entertainment market. After researchers started to exploit GPUs, these hardware vendors have eventually developed functionality that is tailored for general-purpose computing and started selling GPUs intended for computing alone. Backed by the mass market, this means that the cost for the vendors to target this emerging market is very low, and the profits high. There are great similarities between the vector machines of the 1990ies and todays use of GPUs. The interest in vector machines eventually died out, as the x86 market took over. Will the same happen to GPUs once we conquer the power wall? In the short term, the answer is no for several reasons: First of all, conquering the power wall is not even on the horizon, meaning that for the foreseeable future, parallelism will be the key ingredient that will increase performance. Also, while vector machines were reserved for supercomputers alone, todays GPUs are available in 13

### Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

### Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

### GPUs: Doing More Than Just Games. Mark Gahagan CSE 141 November 29, 2012

GPUs: Doing More Than Just Games Mark Gahagan CSE 141 November 29, 2012 Outline Introduction: Why multicore at all? Background: What is a GPU? Quick Look: Warps and Threads (SIMD) NVIDIA Tesla: The First

### Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

### Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

### CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

### Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

### Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

### GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

### Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

### OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

### GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

### Program Optimization Study on a 128-Core GPU

Program Optimization Study on a 128-Core GPU Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, and Wen-mei W. Hwu Yu, Xuan Dept of Computer & Information Sciences University

### GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

### Building Blocks. CPUs, Memory and Accelerators

Building Blocks CPUs, Memory and Accelerators Outline Computer layout CPU and Memory What does performance depend on? Limits to performance Silicon-level parallelism Single Instruction Multiple Data (SIMD/Vector)

### GPU Performance Analysis and Optimisation

GPU Performance Analysis and Optimisation Thomas Bradley, NVIDIA Corporation Outline What limits performance? Analysing performance: GPU profiling Exposing sufficient parallelism Optimising for Kepler

### NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

### White Paper COMPUTE CORES

White Paper COMPUTE CORES TABLE OF CONTENTS A NEW ERA OF COMPUTING 3 3 HISTORY OF PROCESSORS 3 3 THE COMPUTE CORE NOMENCLATURE 5 3 AMD S HETEROGENEOUS PLATFORM 5 3 SUMMARY 6 4 WHITE PAPER: COMPUTE CORES

### Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

### Lecture 3: Single processor architecture and memory

Lecture 3: Single processor architecture and memory David Bindel 30 Jan 2014 Logistics Raised enrollment from 75 to 94 last Friday. Current enrollment is 90; C4 and CMS should be current? HW 0 (getting

### Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary

OpenCL Optimization Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary 2 Overall Optimization Strategies Maximize parallel

### GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

### Windows Server Performance Monitoring

Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

### Parallel Programming Survey

Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

### GPU programming using C++ AMP

GPU programming using C++ AMP Petrika Manika petrika.manika@fshn.edu.al Elda Xhumari elda.xhumari@fshn.edu.al Julian Fejzaj julian.fejzaj@fshn.edu.al Abstract Nowadays, a challenge for programmers is to

### Graphics Processing Unit (GPU) Memory Hierarchy. Presented by Vu Dinh and Donald MacIntyre

Graphics Processing Unit (GPU) Memory Hierarchy Presented by Vu Dinh and Donald MacIntyre 1 Agenda Introduction to Graphics Processing CPU Memory Hierarchy GPU Memory Hierarchy GPU Architecture Comparison

### Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

### Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

### GPU performance prediction using parametrized models

Utrecht University GPU performance prediction using parametrized models Andreas Resios A thesis submited in partial fulfillment of the requirements for the degree of Master of Science Supervisor: Utrecht

### GPU Architecture Overview. John Owens UC Davis

GPU Architecture Overview John Owens UC Davis The Right-Hand Turn [H&P Figure 1.1] Why? [Architecture Reasons] ILP increasingly difficult to extract from instruction stream Control hardware dominates µprocessors

### Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

### Direct GPU/FPGA Communication Via PCI Express

Direct GPU/FPGA Communication Via PCI Express Ray Bittner, Erik Ruf Microsoft Research Redmond, USA {raybit,erikruf}@microsoft.com Abstract Parallel processing has hit mainstream computing in the form

### L20: GPU Architecture and Models

L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

### Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

### Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

### Introduction to Cloud Computing

Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

### Whitepaper NVIDIA s Next Generation CUDA TM Compute Architecture: Fermi V1.1

Whitepaper NVIDIA s Next Generation CUDA TM Compute Architecture: TM Fermi V1.1 Table of Contents A Brief History of GPU Computing...3 The G80 Architecture...4 NVIDIA s Next Generation CUDA Compute and

### AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

### Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

### GPUs for Scientific Computing

GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

### Guided Performance Analysis with the NVIDIA Visual Profiler

Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided

### Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

### Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

### Ľudmila Jánošíková. Department of Transportation Networks Faculty of Management Science and Informatics University of Žilina

Assembly Language Ľudmila Jánošíková Department of Transportation Networks Faculty of Management Science and Informatics University of Žilina Ludmila.Janosikova@fri.uniza.sk 041/5134 220 Recommended texts

### GPU Accelerated Pathfinding

GPU Accelerated Pathfinding By: Avi Bleiweiss NVIDIA Corporation Graphics Hardware (2008) Editors: David Luebke and John D. Owens NTNU, TDT24 Presentation by Lars Espen Nordhus http://delivery.acm.org/10.1145/1420000/1413968/p65-bleiweiss.pdf?ip=129.241.138.231&acc=active

### OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

### 3DES ECB Optimized for Massively Parallel CUDA GPU Architecture

3DES ECB Optimized for Massively Parallel CUDA GPU Architecture Lukasz Swierczewski Computer Science and Automation Institute College of Computer Science and Business Administration in Łomża Lomza, Poland

### NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

### Infrastructure Matters: POWER8 vs. Xeon x86

Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

### SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

### ~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

### The ARM11 Architecture

The ARM11 Architecture Ian Davey Payton Oliveri Spring 2009 CS433 Why ARM Matters Over 90% of the embedded market is based on the ARM architecture ARM Ltd. makes over \$100 million USD annually in royalties

### INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

### Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

### Radar signal processing on graphics processors (Nvidia/CUDA)

Radar signal processing on graphics processors (Nvidia/CUDA) Jimmy Pettersson & Ian Wainwright Full master thesis presentation will be held in january CUDA is a framework to access the GPU for non-graphics

### Performance Basics; Computer Architectures

8 Performance Basics; Computer Architectures 8.1 Speed and limiting factors of computations Basic floating-point operations, such as addition and multiplication, are carried out directly on the central

### Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

### This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

### Michael Fried GPGPU Business Unit Manager Microway, Inc. Updated June, 2010

Michael Fried GPGPU Business Unit Manager Microway, Inc. Updated June, 2010 http://microway.com/gpu.html Up to 1600 SCs @ 725-850MHz Up to 512 CUDA cores @ 1.15-1.4GHz 1600 SP, 320, 320 SF 512 SP, 256,

### Benchmarking Hadoop & HBase on Violin

Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

### Module 2: "Parallel Computer Architecture: Today and Tomorrow" Lecture 4: "Shared Memory Multiprocessors" The Lecture Contains: Technology trends

The Lecture Contains: Technology trends Architectural trends Exploiting TLP: NOW Supercomputers Exploiting TLP: Shared memory Shared memory MPs Bus-based MPs Scaling: DSMs On-chip TLP Economics Summary

### Heterogeneous Computing -> Fusion

Heterogeneous Computing -> Fusion Norm Rubin AMD Fellow 1 Heterogeneous Computing -> Fusion saahpc 2010 Definitions Heterogenous Computing A system comprised of two or more compute engines with signficant

### Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

### Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

### Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

### GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

### CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012

1 CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012 1. [ 12 Points ] Datapath & Pipelining. (a) Consider a simple in-order five-stage pipeline with a two-cycle

### NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

### In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

### Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

### The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming

The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming Kamran Karimi Neak Solutions Calgary, Alberta, Canada kamran@neak-solutions.com Abstract OpenCL, along with CUDA, is one of

### Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

### Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

### CUDA programming on NVIDIA GPUs

p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

### Which ARM Cortex Core Is Right for Your Application: A, R or M?

Which ARM Cortex Core Is Right for Your Application: A, R or M? Introduction The ARM Cortex series of cores encompasses a very wide range of scalable performance options offering designers a great deal

### GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

### Generations of the computer. processors.

. Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

### Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

### Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

### Introduction to GPU Architecture

Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

### GPU Computing - CUDA

GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

### IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

### what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the

### Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What

### Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

### CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @

### Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

### GPU Architecture. Michael Doggett ATI

GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super

### Operating Systems 4 th Class

Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

### ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications CUDA Optimization Tiling as a Design Pattern in CUDA Vector Reduction Example March 13, 2012 Dan Negrut, 2012 ME964 UW-Madison The inside of

### Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

### Introduction to GPU Programming Languages

CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### Lecture 2: Modern Multi-Core Architecture: (multiple cores + SIMD + threads)

Lecture 2: Modern Multi-Core Architecture: (multiple cores + SIMD + threads) (and their performance characteristics) CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements

### ultra fast SOM using CUDA

ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

### ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit