CS 563 Advanced Topics in Computer Graphics Chapter 15: Graphics Hardware by Dan Adams

Topics Buffers: Color, Z, W, Stereo, Stencil, and Accumulation Buffers; Hardware Architecture; Pipelining and Parallelization; Implementing the Stages in Hardware; Memory and Bandwidth; Case Studies: Xbox, InfiniteReality, KYRO

Frame Buffer Display Basics Can be in host memory, dedicated memory, or memory shared by buffers and textures Connected to a video controller

Video Controller Display Basics AKA Digital-to-Analog Converter (DAC) Converts digital pixel values to analog signals for the monitor Monitor has a fixed refresh rate (60 to 120 Hz) Sends color data to monitor in sync with the monitor beam

Display Basics The monitor beam moves left to right, top to bottom Horizontal retrace The beam moves from the end of one line to the beginning of the next Does not actually set any colors

Display Basics Horizontal refresh rate (aka line rate) The rate at which the beam can draw a line Vertical retrace The beam returns to the top left after the entire screen is drawn Vertical refresh rate How many times per second the entire screen can be refreshed Flicker is noticeable below 72 Hz for most people

Display Example 1280x1024 screen with a 75 Hz refresh rate Updates every 13.3 ms (1 / 75 = 0.0133 s) The screen has a specified frame size, which is 1688 x 1066 for this resolution Pixel clock (rate at which pixels are refreshed): 1688 * 1066 * 75 Hz = 134,955,600 Hz, about 135 MHz, so one pixel every 1 / 135 MHz = 7.4 x 10^-9 s = 7.4 nanoseconds Line rate: 1066 * 75 Hz = 79,950 lines/sec Vertical retrace time: 1066 - 1024 = 42 lines, and 42 * 1688 * 7.4 ns is about 525 microseconds
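
As a quick check of the numbers, here is a small sketch (names purely illustrative) that reproduces the arithmetic above from the 1688 x 1066 frame size and 75 Hz refresh rate:

```cpp
#include <cstdio>

// Hedged sketch reproducing the display-timing example above
// (1688 x 1066 frame size, 75 Hz refresh, 1280x1024 visible).
int main() {
    const double frameW = 1688, frameH = 1066, visibleH = 1024, hz = 75;
    double pixelClock  = frameW * frameH * hz;                      // ~135 MHz
    double pixelTime   = 1.0 / pixelClock;                          // ~7.4 ns per pixel
    double lineRate    = frameH * hz;                               // ~79,950 lines/s
    double retraceTime = (frameH - visibleH) * frameW * pixelTime;  // ~525 us
    printf("pixel clock %.1f MHz, pixel time %.1f ns\n", pixelClock / 1e6, pixelTime * 1e9);
    printf("line rate %.0f lines/s, vertical retrace %.0f us\n", lineRate, retraceTime * 1e6);
}
```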

Monitor Types Noninterlaced (aka progressive scan) Most common for computer monitors Interlaced Found in TVs Horizontal lines are interlaced (evens first, then odds) Converting is non-trivial

High Color Color Modes 2 bytes per pixel (15 or 16 bits for color), giving either 32,768 or 65,536 colors Uneven division between channels (16 / 3 = 5.3 bits per channel) Green is given the extra bit because it has the most effect on the eye: 32 * 64 * 32 = 65,536 In the 15-bit mode the extra bit is left unused or used for an alpha channel: 32 * 32 * 32 = 32,768
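
A minimal sketch of how such a 16-bit pixel might be packed, assuming the common 5-6-5 layout implied above; this is illustrative, not any particular card's format:

```cpp
#include <cstdint>

// Hedged sketch: pack 8-bit-per-channel RGB into a 16-bit "high color"
// 5-6-5 pixel (5 bits red, 6 bits green, 5 bits blue). Illustrative only.
uint16_t PackRGB565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(((r >> 3) << 11) |  // top 5 bits of red
                                 ((g >> 2) << 5)  |  // top 6 bits of green
                                  (b >> 3));         // top 5 bits of blue
}
```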

Color Modes Can cause a Mach banding effect Differences in color are noticeable to the eye A similar problem occurs with Gouraud shading Can use dithering to lessen the effect http://www.grafx-design.com/01gen.html

True Color Color Modes 3 or 4 bytes per pixel (24 bits for color) 16.8 million colors: 8 bits per channel, 256 * 256 * 256 = 16,777,216 colors The 24-bit format is called the packed pixel format Saves frame buffer space The 32-bit format Some hardware is optimized for groups of 4 bytes The extra 8 bits can be used for an alpha channel Using 24 bits corrects some problems found with high color Quantization effects due to low precision (e.g. when using multipass)

Z-buffer Normally stores 24 bits / pixel Orthographic viewing Depth resolution is uniform Ex: Near and far planes are 100 meters apart and the Z-buffer stores 16 bits per pixel: 100 meters / 2^16 = about 1.5 mm Perspective viewing Depth resolution is non-uniform The farther away you go, the less precision is available Can create artifacts or popping effects

Z-buffer After applying the perspective transformation to a point we get a vector v = (x, y, z, w) v is then divided by w so that v = (x/w, y/w, z/w, 1) z/w is mapped to the range [0, 2^b - 1] and stored in the Z-buffer The farther away the point, the less z/w changes between nearby depths, so distant points get less precision.
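
The following sketch assumes an OpenGL-style projection (a convention not stated in the slides) and maps a few eye-space depths into a 16-bit Z-buffer; it illustrates how widely separated distant depths collapse onto nearly the same stored value:

```cpp
#include <cstdio>

// Hedged sketch: map eye-space depth z to a b-bit Z-buffer value, assuming
// an OpenGL-style perspective mapping. A fixed step in the stored integer
// covers a much larger depth range far from the camera than near it.
unsigned MapDepth(double z, double n, double f, int bits) {
    double zNdc = (f + n) / (f - n) - 2.0 * f * n / ((f - n) * z);  // in [-1, 1]
    double z01  = 0.5 * zNdc + 0.5;                                 // in [0, 1]
    return static_cast<unsigned>(z01 * ((1u << bits) - 1));
}

int main() {
    double n = 1.0, f = 100.0;                           // example near/far planes
    printf("z=1.1m  -> %u\n", MapDepth(1.1, n, f, 16));
    printf("z=1.2m  -> %u\n", MapDepth(1.2, n, f, 16));
    printf("z=90m   -> %u\n", MapDepth(90.0, n, f, 16));
    printf("z=95m   -> %u\n", MapDepth(95.0, n, f, 16)); // barely differs from 90 m
}
```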

W-buffer Alternative to storing a Z-buffer Stores the w value Results in uniform precision Don't have to fool around with the near and far planes Becoming deprecated... (?)

Single Buffering Only uses one buffer, which is used for drawing You can see primitives being drawn on the screen Can result in tearing The user can see part of a primitive as it's being drawn Not very useful for real-time graphics Can be used if you don't update very often, e.g. windows in a workspace

Double Buffering Overcomes the problems with single buffering The front buffer is displayed while the back buffer is drawn to Buffers are swapped during vertical retrace, which avoids tearing (though swapping at that time is not required)

Double Buffering Swapping methods Page flipping The address of the front buffer is stored in a register and points to pixel (0,0) Swap by writing the address of the back buffer to the register Also an easy method for doing screen panning Blitting (or BLT swapping) The back buffer is simply copied over the front buffer
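
A rough sketch of the two swap methods; frontAddressRegister and the buffer layout are hypothetical stand-ins for whatever the hardware actually exposes:

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Hedged sketch of page flipping versus blit swapping; not a real driver interface.
struct Display {
    uint32_t* frontAddressRegister;   // which buffer the video controller scans out
};

void PageFlip(Display& d, uint32_t*& front, uint32_t*& back) {
    d.frontAddressRegister = back;    // re-point scanout at the finished back buffer
    std::swap(front, back);           // the buffers exchange roles; nothing is copied
}

void BlitSwap(uint32_t* front, const uint32_t* back, std::size_t pixelCount) {
    std::memcpy(front, back, pixelCount * sizeof(uint32_t));  // copy back over front
}
```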

Triple Buffering Adds a second back buffer, the pending buffer Do not have to wait for vertical retrace to start drawing the next frame Do not have to wait for the back buffer to be cleared

Triple Buffering Can increase the frame rate: suppose the monitor is 60 Hz If image generation takes < 1/60th of a second then double and triple buffering will both get 60 fps If it takes > 1/60th of a second, double buffering drops to 30 fps while triple buffering gets (almost) 60 fps
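
A back-of-the-envelope model (an assumption of mine, not from the slides) of those effective frame rates: double buffering rounds the frame time up to whole refresh periods, while triple buffering is limited only by the render time and the refresh rate.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Double buffering: the swap waits for the next vertical retrace, so the
// frame time is the render time rounded up to whole refresh periods.
double DoubleBufferedFps(double renderTime, double refreshHz) {
    double period = 1.0 / refreshHz;
    return 1.0 / (std::ceil(renderTime / period) * period);
}

// Triple buffering: rendering never stalls, so the rate is capped only by
// the render time and the refresh rate.
double TripleBufferedFps(double renderTime, double refreshHz) {
    return std::min(1.0 / renderTime, refreshHz);
}

int main() {
    printf("render 15 ms @60Hz: double %.0f fps, triple %.0f fps\n",
           DoubleBufferedFps(0.015, 60.0), TripleBufferedFps(0.015, 60.0));
    printf("render 18 ms @60Hz: double %.0f fps, triple %.0f fps\n",
           DoubleBufferedFps(0.018, 60.0), TripleBufferedFps(0.018, 60.0));
}
```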

Increases latency Triple Buffering Response time for user is basically 2 frames behind Pending buffer requires another color buffer Works well if the hardware supports it

Triple Buffering DirectX supports it, OpenGL does not Can (in theory) use as many buffers as you want... Good when image generation time varies a lot Increases latency

Stereo Buffers Called stereopsis or stereo vision Two images (one for each eye) are rendered to fool the eyes into perceiving real depth Can be really convincing! Not everyone is able to do it (magic 3d image?)

Hardware Stereo Buffers Old-school paper 3d glasses HMD (head mounted display) Shutter glasses (cheap and work well) The shutters are synchronized with the monitor refresh rate Now supported in the display itself Doubles the amount of buffer memory needed http://www.cgg.cvut.cz/local/glasses/

Stencil & Accumulation Buffers Normally the same size as the color buffer Stencil buffer Used to mask off regions of the color buffer 1 bit allows simple masking 8 bits allow complex effects like shadow volumes Accumulation buffer Used to add and subtract images from the color buffer Needs much higher precision than the color buffer 48 bits for a 24-bit color buffer would allow 256 composed images Useful for effects like depth of field, antialiasing, motion blur
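
For reference, the legacy OpenGL accumulation-buffer calls express this directly; the sketch below averages several jittered renderings, with drawScene standing in for the application's own rendering:

```cpp
#include <GL/gl.h>

// Hedged sketch using the legacy OpenGL accumulation buffer to average
// several rendered images (e.g. jittered views for antialiasing, or
// successive frames for motion blur).
void AccumulateFrames(void (*drawScene)(int pass), int passes) {
    glClear(GL_ACCUM_BUFFER_BIT);
    for (int i = 0; i < passes; ++i) {
        drawScene(i);                       // render one slightly offset image
        glAccum(GL_ACCUM, 1.0f / passes);   // add a scaled copy to the accumulation buffer
    }
    glAccum(GL_RETURN, 1.0f);               // write the average back to the color buffer
}
```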

T-buffer Useful for supporting fast antialiasing in hardware Contains a set of 2, 4, or more image and Z-buffers Send down a triangle once and it goes to multiple buffers in parallel with a different offset in each Images are recombined to do AA Does not require multiple passes to do AA Raises hardware cost; much of the pipeline has to be duplicated in each parallel unit 3dfx was the only one to ever implement it

Buffer Memory How much memory do we need for all these buffers? Ex: The color buffer is 1280 * 1024 * 3 bytes = 3.75 MB; double buffering doubles this to 7.5 MB The Z-buffer at 24 bpp = 3.75 MB A stencil buffer of 8 bpp and an accumulation buffer of 48 bpp = 8.75 MB Total: 7.5 + 3.75 + 8.75 = 20 MB If you are using stereo, this doubles the color buffer and adds another 1.25 MB Only one Z-buffer is ever needed for the current color buffer
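
The totals above can be reproduced with a few lines; the byte counts per pixel follow the figures given on this slide:

```cpp
#include <cstdio>

// Hedged sketch reproducing the slide's buffer-memory arithmetic for a
// 1280x1024 display (1 MB = 2^20 bytes here).
int main() {
    const double pixels = 1280.0 * 1024.0;
    const double MB = 1024.0 * 1024.0;
    double color = pixels * 3 * 2 / MB;  // 24-bit color, double buffered: 7.5 MB
    double zbuf  = pixels * 3 / MB;      // 24-bit Z-buffer: 3.75 MB
    double sten  = pixels * 1 / MB;      // 8-bit stencil buffer: 1.25 MB
    double accum = pixels * 6 / MB;      // 48-bit accumulation buffer: 7.5 MB
    printf("total: %.2f MB\n", color + zbuf + sten + accum);  // 20.00 MB
}
```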

Perspective-Correct Interpolation Implemented by the rasterizer in hardware Vertex position can be lerped (linear interpolation) Cannot do this for colors and texture coordinates

Perspective-Correct Interpolation To correct this: linearly interpolate both the texture coordinates divided by w, i.e. (u/w, v/w), and 1/w Then recover (u, v) = (u/w, v/w) / (1/w) at each pixel Called hyperbolic interpolation or rational linear interpolation
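
A small sketch of hyperbolic interpolation between two projected vertices; the Vert structure and the screen-space parameter t are illustrative only:

```cpp
// Hedged sketch of hyperbolic (perspective-correct) interpolation of a
// texture coordinate between two projected vertices.
struct Vert { float u, v, w; };   // texture coordinates and clip-space w

void InterpUV(const Vert& a, const Vert& b, float t, float& u, float& v) {
    // Linearly interpolate u/w, v/w, and 1/w in screen space...
    float uw  = (1 - t) * (a.u / a.w) + t * (b.u / b.w);
    float vw  = (1 - t) * (a.v / a.w) + t * (b.v / b.w);
    float inv = (1 - t) * (1.0f / a.w) + t * (1.0f / b.w);
    // ...then divide by the interpolated 1/w to recover (u, v).
    u = uw / inv;
    v = vw / inv;
}
```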

A little history... Graphics Architecture Early processors just interpolated/textured spans 1996 3Dfx Voodoo 1 introduced triangle setup 1999 NVIDIA GeForce256 introduced geometry stage acceleration in hardware (fixed-function pipeline) Today Accelerated geometry and rasterization plus programmable shaders Trying to put as much on the card as possible Still some support on the CPU The Pentium 4 has SSE2, SIMD extensions for parallel processing of vectors

It's Wicked Fast Two approaches to getting super-high performance: Pipelining and Parallelizing Pipelining N stages implemented in the hardware that are pipelined together Gives a speedup of N A GPU can be 100 times faster than its equivalent CPU because of pipelining, even though the CPU actually has a higher clock speed Hardware is customized for one area (i.e. graphics rendering) More operations implemented in hardware (20 pipeline stages in the Pentium 4 and 600-800 in the GeForce3)

Parallelization Divide processing into N parallel processing units and merge results later Normally done for the geometry and rasterizer stages Results must be sorted at some point

Sort-First Primitives are sorted before geometry stage Screen is divided into a number of regions, or tiles Each processor responsible for a region Not used much among normal implementations

Stanford WireGL project Sort-First Is now in the Chromium project (on SourceForge) Rendering is done with a cluster of machines and displayed using a projector for each tile Created as an OpenGL implementation (Quake3 anyone?) Pretty awesome if you need a HUGE display with RIDICULOUS resolution The system has also added motion tracking (!)

http://chromium.sf.net I Need One of These Maybe my wife can get me one for Christmas

Sort-Middle Distribute over the geometry units and then sort the results Used in the InfiniteReality and KYRO systems After the geometry stage the primitive's location is known, so it can be sorted to the right fragment generator (FG) Each FG is responsible for a region If a triangle overlaps more than one region it may be sent to multiple FGs (which is bad)

Sort-Last Fragment Sorts fragments after fragment generation and before fragment merge Used by the Xbox No overlap of work as there is in Sort-Middle Imbalance can occur if a set of large triangles is sent to one FG

Sort-Last Image Sorts after all rasterization is done Each pipeline renders with depth and then the results are composed based on z values Implemented in the PixelFlow architecture Used deferred shading: only visible fragments are textured and shaded Cannot fully implement OpenGL because it does not render primitives in the order they are sent in

It's All About the Textures GPU computation speed is growing exponentially Memory speed and bandwidth are not Yet textures are the way to go for fast real-time graphics Caching and prefetching are used to speed up texture access

Texture Caching The rasterizer produces fragments A request queue fetches texture blocks from memory A reorder buffer returns the blocks in the order they were requested Performs at 97% of the ideal (zero-latency) case

Memory Architectures Unified Memory Architecture (UMA) The GPU can use any of the host memory CPU and GPU share the bus Used by the Xbox and the SGI O2

Memory Architectures Non-unified memory The GPU has dedicated memory not accessible by the CPU Does not have to share bus bandwidth Used by the KYRO and InfiniteReality

Buses and Bandwidth Two methods for sending data to the GPU: Pull and Push Pull Data is written to system memory The GPU then pulls the data from memory, aka Direct Memory Access (DMA) The GPU can lock a region of memory that the CPU cannot use The GPU works faster and CPU time is saved AGP uses pull A dedicated port (i.e. bus) that only the CPU and GPU use AGP 4x is 1067 Mbytes/sec; AGP 8x (3.0) is 2.1 Gbytes/sec

Push Method Data is written directly to the GPU once per frame More bandwidth is left over for the application The graphics card needs to have a large FIFO buffer The pull method is better for data that is static, since it can just stay in memory

Ex: How Much Bandwidth? For each pixel we need to read the z-buffer, write back a z-value and the color buffer, and do one or more texture reads, totaling 60 bytes per pixel Assume 60 fps with 1280x1024 resolution and a depth complexity of 4: 4 * 60 * 1280 * 1024 * 60 bytes/s = about 18 Gbytes/s

Ex: How Much Bandwidth? Assume the bus speed is 300 MHz with DDR RAM (256 bits per clock): 300 MHz * 256/8 = about 9.6 Gbytes/s < 18 Gbytes/s Memory bandwidth becomes a big bottleneck This can be reduced with texture caching, prefetching, compression, etc.
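
The two estimates side by side, using the figures from the last two slides (the 60 bytes/pixel and the 256-bit bus are taken as given):

```cpp
#include <cstdio>

// Hedged sketch reproducing the bandwidth estimate above.
int main() {
    const double bytesPerPixel   = 60.0;          // z read/write, color, textures (slide figure)
    const double depthComplexity = 4.0;
    const double pixels          = 1280.0 * 1024.0;
    const double fps             = 60.0;
    double needed = depthComplexity * bytesPerPixel * pixels * fps;  // ~18.9 GB/s

    const double busHz      = 300e6;               // 300 MHz
    const double bitsPerClk = 256.0;
    double available = busHz * bitsPerClk / 8.0;   // ~9.6 GB/s

    printf("needed %.1f GB/s, available %.1f GB/s\n", needed / 1e9, available / 1e9);
}
```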

Reading from Buffers Reading from buffers (Z-buffer, etc) can be slow Often writing is done over AGP while reading is done over PCI Reading from the GPU should probably be avoided

Hardware Z-buffer Status memory stores the state of each 8x8-pixel tile in the frame buffer and a zmax for each tile The state can be compressed, uncompressed, or cleared

Hardware Z-buffer To do a fast clear, just set all states to cleared

Hardware Z-buffer The ATI Radeon uses fast Z-clear and Z-compression for a 25% frame rate increase Differential Pulse Code Modulation (DPCM) is used Good when there is high coherence in the data Reduces memory bandwidth by 75%

Hardware Occlusion Culling Zmax is used to check whether a tile is occluded, so the pipeline can be exited early Saves on bandwidth Different methods Check the minimum z-value of the triangle vertices against zmax: not very accurate but very fast. Test the corners of the entire tile against zmax: if all are larger, the tile can be skipped. Test each pixel against zmax. Implemented in both the GeForce3 and Radeons
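
A sketch of the cheapest of the three tests, comparing the triangle's minimum vertex depth against a tile's stored zmax; the Tile structure is purely illustrative of the idea, not any vendor's hardware:

```cpp
#include <algorithm>

// Hedged sketch of the coarse zmax occlusion test described above.
struct Tile { float zmax; };   // farthest depth currently stored in the tile

bool TriangleMayBeVisible(float z0, float z1, float z2, const Tile& tile) {
    float triZmin = std::min({z0, z1, z2});
    // If even the closest vertex is behind everything in the tile, the
    // triangle is occluded there and can skip depth testing and shading.
    return triZmin <= tile.zmax;
}
```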

Case Study: Xbox Built by NVIDIA and Microsoft Uses the UMA Memory is divided into blocks and can be accessed in parallel The CPU is an Intel Pentium III at 733 MHz The GPU is an extended GeForce3 Supports programmable vertex and fragment shaders http://collegehumor.com

Case Study: Xbox Has dual vertex shaders, which doubles throughput A pre-T&L cache (4 kB) avoids redundant memory fetches Caches vertices (each vertex is used by 6 triangles on average)

Case Study: Xbox A post-T&L cache (16 vertices) avoids processing the same vertex more than once with the shader A Primitive Assembly Cache (3 fully shaded vertices) avoids redundant fetches to the post-T&L cache, e.g. a vertex may be needed multiple times in a triangle strip

Case Study: Xbox The rasterizer has 4 parallel, programmable pixel shaders TX (texture unit) processes fragments as they come in RC (register combiner) computes the final fragment color (after fog, etc.) Fragment merge computes the final pixel color (z-buffer, etc.)

Case Study: Xbox Uses texture swizzling to increase cache performance and page locality

Case Study: InfiniteReality A Sort-Middle architecture produced by SGI Host Interface Processor Responsible for bringing work into the system Can get display lists from memory with DMA Also has a 16 MB cache for display lists Per-vertex info can be sent directly from the host

Case Study: InfiniteReality Geometry Distributor Passes work to the least busy of the geometry engines Each work item is given a number so the items can be sorted later into the order they came in (compatible with OpenGL)

Case Study: InfiniteReality Geometry Engine Contains 3 Floating-point Cores (FPCs) so the elements of a vertex can be processed in parallel (SIMD) Each FPC is like a mini-CPU for processing vertices Four-stage pipeline On-chip memory (2560 words)

Case Study: InfiniteReality Geometry-Raster FIFO Reorders vertices using the numbers given to them Feeds the correct stream of vertices onto the Vertex Bus Can hold 65,536 vertices

Case Study: InfiniteReality Raster Memory Board A fragment generator Contains a copy of the entire texture memory Scan conversion is done by evaluating the plane equations rather than doing interpolation Better for small triangles Contains 80 image engines and distributes work to them

Case Study: KYRO Implements a tile-based algorithm in hardware The screen is divided into equal-sized rectangular regions The back buffer, Z-buffer, and stencil buffer only need to be the size of a tile and are stored on chip Uses about 1/3 the bandwidth of the normal approach Geometry is currently done on the CPU but could be added to the chip

Case Study: KYRO The TA (Tile Accelerator) queues all incoming vertices and arranges them by tile Creates triangle strips on the fly for each tile The ISP (Image Synthesis Processor) processes complete tiles while the TA processes incoming vertices in parallel The ISP handles the Z-buffer, stencil buffer, and occlusion culling Eliminates occluded triangles early to save bandwidth
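
A software sketch of the binning step such a tile accelerator performs: each triangle's screen-space bounding box determines which tiles reference it. The tile size and data structures here are assumptions for illustration, not the KYRO's actual hardware layout.

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch: bin triangles into screen-space tiles by bounding box.
struct Tri { float x[3], y[3]; };

constexpr int kTileSize = 32;   // pixels per tile side (assumed)

void BinTriangles(const std::vector<Tri>& tris, int screenW, int screenH,
                  std::vector<std::vector<int>>& tileLists) {
    int tilesX = (screenW + kTileSize - 1) / kTileSize;
    int tilesY = (screenH + kTileSize - 1) / kTileSize;
    tileLists.assign(tilesX * tilesY, {});
    for (int i = 0; i < (int)tris.size(); ++i) {
        const Tri& t = tris[i];
        float minX = std::min({t.x[0], t.x[1], t.x[2]});
        float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        float minY = std::min({t.y[0], t.y[1], t.y[2]});
        float maxY = std::max({t.y[0], t.y[1], t.y[2]});
        int tx0 = std::max(0, (int)minX / kTileSize);
        int tx1 = std::min(tilesX - 1, (int)maxX / kTileSize);
        int ty0 = std::max(0, (int)minY / kTileSize);
        int ty1 = std::min(tilesY - 1, (int)maxY / kTileSize);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                tileLists[ty * tilesX + tx].push_back(i);  // record triangle index
    }
}
```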

Case Study: KYRO The TSP (Texture and Shading Processor) does deferred shading, done after the ISP has already done depth testing Spans of pixels are grouped by what texture they use before being sent to the TSP, since swapping textures is expensive Can texture 2 pixels simultaneously The KYRO is designed to be pipelined/parallelized: the existing pipeline could be duplicated and parallelized

The End (questions?)