L20: GPU Architecture and Models

scribe(s): Abdul Khalifa

20.1 Overview

GPUs (Graphics Processing Units) are large parallel arrays of processing cores capable of rendering graphics to displays efficiently. The original intent when designing GPUs was to use them exclusively for graphics rendering. It was later realized that GPUs can in fact process highly parallel data, and they became popular for solving problems in other scientific domains that require intensive computational power. These notes give a short history of graphics, then discuss the GPU pipeline, and conclude with a section on the GPU hierarchy and an example of performance measurement.

20.2 Graphics History

How is a pixel drawn on the screen? The process by which a computer transforms shapes stored in memory into actual objects drawn on the screen is known as rendering, and the most widely known and used technique in rendering is called rasterization.

Figure 20.1: How pixels are drawn on display.

The basic concept is illustrated in figure 20.1: a pixel is drawn on the screen if there exists a trace of light (the arrow) that intersects one of the triangle shapes stored in memory. This simplistic view also shows that if a triangle sits further back than the surrounding triangles, this should be reflected on the display, and if one triangle lies on top of another, the lower one will be partly obscured by the upper one. It is this approach that forms the foundation for 3D graphics support on computer screens. Figure 20.2 shows a few examples of images rendered with this approach. Ultimately, GPUs are the hardware responsible for transforming the representation in memory into actual objects on the screen, and since this needs to be done for every pixel, GPUs were developed to do it in parallel.
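To make the occlusion logic concrete, here is a toy MATLAB sketch of a z-buffer, one standard mechanism behind this behavior. It is not from the original notes, and the random depth maps merely stand in for real per-pixel triangle depths:

% Toy z-buffer: nearer triangles obscure farther ones (illustrative only).
W = 64; H = 48; num_tris = 10;
depths = rand(H, W, num_tris);   % stand-in per-pixel depths of each triangle
depths(depths > 0.5) = Inf;      % Inf marks pixels a triangle does not cover
zbuf = inf(H, W);                % depth of the nearest surface seen so far
vis  = zeros(H, W);              % which triangle is visible at each pixel
for k = 1:num_tris
    d = depths(:,:,k);
    closer = d < zbuf;           % pixels where triangle k is nearest so far
    zbuf(closer) = d(closer);
    vis(closer)  = k;
end

Since this same small test runs independently for every pixel, it is a natural fit for the massively parallel hardware described in these notes.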

Figure 20.2: Two computer graphics images created by drawing each pixel through tracing triangular shapes.

Early GPUs in the 1990s offered a fixed-function pipeline that was exposed mainly through two APIs, OpenGL and DirectX. This meant that GPUs were programmable only with a low-level set of operations that interacted directly with the hardware, similar to assembly languages. At that time most computer graphics looked similar because they used one of these two available APIs. OpenGL in particular was not a programming language but rather a vendor specification of how the hardware interface should be implemented. DirectX, on the other hand, offered by Microsoft, was essentially a library that enabled the development of graphics in computer games. The pipeline later evolved to a more advanced level, and shading capabilities were added to the APIs. The two main vendors manufacturing GPUs were NVIDIA and ATI. In the next section, we shed more light on the GPU pipeline.

20.3 GPU Pipeline

Figure 20.3 shows what an early GPU pipeline looked like. The input to the pipeline was the description of the graphics in memory in terms of vertices and triangular shapes. The input was transformed by a vertex shader, which processes every vertex individually, and then rasterized into pixels on the 2D screen. A fragment shader then takes the rasterized pixels and produces the final output.

Figure 20.3: Early GPU pipeline.

Figure 20.4: Improved GPU pipeline.

Figure 20.4 shows how the GPU pipeline looks now. Unlike early GPUs, which used different hardware for different shaders, the new pipeline uses unified hardware that combines all shaders under one type of core with shared memory. In addition, NVIDIA introduced CUDA (Compute Unified Device Architecture), which added the ability to write general-purpose C code with some restrictions. This means that a programmer has access to thousands of cores which can be instructed to carry out similar operations in parallel.
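The notes do not include CUDA code itself; as a flavor of the same data-parallel style, here is a small sketch in the MATLAB setting used later in these notes (it assumes the Parallel Computing Toolbox and a supported GPU). arrayfun applies one scalar function to every element of a gpuArray, and the runtime spreads that work across the GPU's cores:

x = gpuArray(rand(1, 1e6));   % data resident in GPU memory
f = @(v) v.^2 + sin(v);       % the same scalar operation for every element
y = arrayfun(f, x);           % applied element-wise, in parallel, on the GPU
result = gather(y);           % copy the result back to host memory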

20.4 GPU Hierarchy & Programming

In this section we zoom out to look at the hierarchical structure of GPUs and give a simple example of performance measurement. Top-of-the-line GPUs are now claimed to deliver a few teraflops of compute and 100+ GB/sec of internal memory bandwidth. They have also become easier to program, with support for C++ (as in NVIDIA's Fermi architecture) and MATLAB integration. This is essentially why GPUs became so popular: it became possible to carry out expensive computational tasks without an expensive hardware cluster setup. Many modern applications, such as Photoshop and Mathematica, have therefore taken advantage of GPUs. However, to take full advantage of a GPU, a program must be highly parallel and consist of fine-grained, simple operations that can be distributed easily across, say, thousands of cores.

Figure 20.5 shows the hierarchy of a GPU. At the bottom of the diagram is the computer's system memory (RAM and disk), and on top of that is the GPU's internal memory. Above that are two levels of cache, L2 and L1, and each L1 cache is attached to a processor known as a streaming multiprocessor (SM). An L1 cache is small, on the order of 16 to 64 KB. Similarly, the GPU's L2 cache is significantly smaller than a typical CPU L2 cache: the former is on the order of a few hundred KB, the latter a few MB. The trend is the same at the memory level, with GPU memory reaching up to about 12 GB while a computer's main memory ranges from 4 GB up to 128 GB. Furthermore, the GPU's memory is not coherent, meaning there is no support for concurrent read, concurrent write (CRCW).

Figure 20.5: GPU Hierarchy

Figure 20.6 shows an example of the structure of each streaming multiprocessor inside the GPU. As can be seen at the bottom of the diagram, there is an L1 cache. There is also register memory (top of the diagram) shared among the 32 cores, with each core having access to a fixed set of registers.

Figure 20.6: GPU's streaming multiprocessor.

There has been some excitement in the research community about claimed 100-200x GPU speed-ups, claims which often overlook some aspects of GPUs and are driven mainly by unfair comparisons to general-purpose CPUs. Key issues to consider when evaluating a GPU-versus-CPU speed-up include, for instance, whether similarly optimized code was used on both the GPU and the CPU, and whether single-precision or double-precision numbers were used; double precision is known to run considerably slower on GPUs. Further, the transfer time between the computer's memory and the GPU's memory is often neglected. As it turns out, in some cases this cost is nontrivial, and it is a hardware cost that cannot be hidden by software. Taking this transfer time into account when measuring performance and making comparisons is essential.

To illustrate this final point about memory transfer time, we now show a simple experiment measuring GPU performance using a MATLAB program. Consider the following lines of MATLAB code:

cpu_x = rand(1,100000000)*10*pi;
gpu_x = gpuArray(cpu_x);
gpu_y = sin(gpu_x);
cpu_y = gather(gpu_y);

The first line creates a large array of a hundred million decimal numbers. The second line loads this large array into the GPU's memory. The third line executes the sin function on each individual number of the array inside the GPU. The last line transfers the results back into the computer's main memory.

The plot at the top of figure 20.7 shows the runtime comparison that results from executing this program with increasing input size. The middle line is the baseline performance of the computer's CPU, the bottom line is the GPU's runtime excluding memory transfer, and the line at the top is the GPU's runtime when memory transfer is taken into account. This shows that the GPU's runtime is better by about a factor of 3 when transfer time is excluded, but worse than the CPU's runtime when transfer time is included. However, now consider changing the third line of the MATLAB program to use a more expensive function, such as big-trig-function, instead of sin. The plot at the bottom of figure 20.7 shows the resulting runtimes. The top line is the CPU's performance, while the two lines below it are the runtimes, including transfer time, of two different brands of GPU. This time the GPUs outperform the CPU by about a factor of 3 to 6 even when transfer time is included. Therefore, one has to be careful when measuring GPU speed-up gains for a given task and not overlook important aspects that may affect a fair comparison.

Figure 20.7: GPU runtime performance measurements.
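As a final sketch (not from the original notes; it assumes MATLAB's Parallel Computing Toolbox), the distinction between the curves in figure 20.7 can be reproduced by timing the computation with and without the host-device transfers. GPU calls are asynchronous, so wait(gpuDevice) must be called before stopping the compute-only timer:

cpu_x = rand(1, 1e8) * 10 * pi;

tic;                              % total time, including transfers
gpu_y = sin(gpuArray(cpu_x));     % host-to-device transfer plus compute
cpu_y = gather(gpu_y);            % device-to-host transfer
t_total = toc;

gpu_x = gpuArray(cpu_x);          % move the data once, outside the timed region
tic;                              % compute-only time
gpu_y = sin(gpu_x);
wait(gpuDevice);                  % block until the GPU actually finishes
t_compute = toc;

Comparing t_total with t_compute separates the cost of the computation itself from the cost of moving the data, which is exactly the distinction drawn in figure 20.7.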