3D Graphics Hardware Graphics II Spring 1999

Similar documents
Computer Graphics Hardware An Overview

Touchstone -A Fresh Approach to Multimedia for the PC

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg

Scan-Line Fill. Scan-Line Algorithm. Sort by scan line Fill each span vertex order generated by vertex list

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

Radeon HD 2900 and Geometry Generation. Michael Doggett

Introduction to GPGPU. Tiziano Diamanti

Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Sample Exam Questions 2007

High-speed image processing algorithms using MMX hardware

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

GPU Architecture. Michael Doggett ATI

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Writing Applications for the GPU Using the RapidMind Development Platform

Lecture Notes, CEng 477

B2.53-R3: COMPUTER GRAPHICS. NOTE: 1. There are TWO PARTS in this Module/Paper. PART ONE contains FOUR questions and PART TWO contains FIVE questions.

Lecture 2 Parallel Programming Platforms

Real-Time Graphics Architecture

Binary search tree with SIMD bandwidth optimization using SSE

Introduction to Cloud Computing

Rethinking SIMD Vectorization for In-Memory Databases

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Comp 410/510. Computer Graphics Spring Introduction to Graphics Systems

Introduction to Computer Graphics

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

Introduction to Game Programming. Steven Osman

OpenGL Performance Tuning

GPGPU Computing. Yong Cao

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

L20: GPU Architecture and Models

Solomon Systech Image Processor for Car Entertainment Application

OpenEXR Image Viewing Software

Next Generation GPU Architecture Code-named Fermi

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

İSTANBUL AYDIN UNIVERSITY

Triangle Scan Conversion using 2D Homogeneous Coordinates

LSN 2 Computer Processors

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Low power GPUs a view from the industry. Edvard Sørgård

Performance Optimization and Debug Tools for mobile games with PlayCanvas

Advanced Visual Effects with Direct3D

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller


Intel Graphics Media Accelerator 900

EE361: Digital Computer Organization Course Syllabus

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Dynamic Resolution Rendering

Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

GPUs Under the Hood. Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology

Systolic Computing. Fundamentals

Vector storage and access; algorithms in GIS. This is lecture 6

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Hardware design for ray tracing

CS1112 Spring 2014 Project 4. Objectives. 3 Pixelation for Identity Protection. due Thursday, 3/27, at 11pm

DATA VISUALIZATION OF THE GRAPHICS PIPELINE: TRACKING STATE WITH THE STATEVIEWER

Chapter 2 Parallel Architecture, Software And Performance

CMSC 611: Advanced Computer Architecture

Getting Started with RemoteFX in Windows Embedded Compact 7

Basler. Line Scan Cameras

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

SAN Conceptual and Design Basics

Computer Organization

Parallel Simplification of Large Meshes on PC Clusters

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Lecture 23: Multiprocessors

Scalability and Classifications

QCD as a Video Game?

ASSEMBLY PROGRAMMING ON A VIRTUAL COMPUTER

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Computer Applications in Textile Engineering. Computer Applications in Textile Engineering

Lesson 10: Video-Out Interface

Motivation: Smartphone Market

A Crash Course on Programmable Graphics Hardware

Data Storage 3.1. Foundations of Computer Science Cengage Learning

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Local Area Networks transmission system private speedy and secure kilometres shared transmission medium hardware & software

Technical Brief. Quadro FX 5600 SDI and Quadro FX 4600 SDI Graphics to SDI Video Output. April 2008 TB _v01

Computer Architecture TDTS10

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

2: Introducing image synthesis. Some orientation how did we get here? Graphics system architecture Overview of OpenGL / GLU / GLUT

Virtuoso and Database Scalability

A Short Introduction to Computer Graphics

GPU Shading and Rendering: Introduction & Graphics Hardware

Texture Cache Approximation on GPUs

Real-Time BC6H Compression on GPU. Krzysztof Narkowicz Lead Engine Programmer Flying Wild Hog

RevoScaleR Speed and Scalability

Visualization of 2D Domains

Computer Organization & Architecture Lecture #19

Transcription:

3D Graphics Hardware 15-463 Graphics II Spring 1999

Topics Graphics Architecture Uniprocessor Acceleration Front-End Multiprocessing Pipelined Parallel Back-End Multiprocessing Pipelined Parallel

Graphics Architecture

Programming Interface Database contains primitives from model Immediate mode Program calls library with individual primitives Need to enmumerate entire database for each frame No temporal coherence between frames Retained mode Primitives are remembered between frames Program calls library to make incremental changes Can potentially exploit temporal coherence

Frame Buffer Organization Dedicated bank of RAM for pixel values Components are rgb, alpha, z, etc. CPU or display processor writes pixels Video controller reads pixels Generates video signal for monitor Contention between CPU/DP and video controller

DRAM Organization Read or write just a few bits at a time Multiple chips in parallel for more bits Bits organized internally in a 2D grid Row select and column select identify particular bits Row select has extra set-up latency No penalty for column select Page mode chips exploit this

Video RAM (VRAM) (1983) Extra bit port for video signal generation Avoids contention with video controller Buffer loaded during stolen memory cycle Whole row can be copied in a single cycle Multiple banks needed for big frame buffers Single chip can t keep up with video signal

Uniprocessor Acceleration

Peripheral Display Processor Reads/writes frame buffer in place of CPU Acceleration for 2D operations Read/write many pixels in parallel Built-in support for lines, circles, bitblt, etc. No transformations, floating-point, etc. Now limited only by speed of memory

CPU Extensions Additional instructions in CPU for 3D graphics Pioneered by i860 (1989) 64-bit pixel operations: linear interpolation, z-buffer compare, conditional update MMX (1995?) Reinterprets x86 floating-point registers Up to 8 simultaneous integer operations 8x8bit, 4x16bit, 2x32bit, 1x64bit (Clamped) arithmetic, logical, pack/unpack

3DNow (1998) Reinterprets MMX registers Two simultaneous floating-point operations Add, subtract, multiply, min, max, etc. Reciprocal and reciprocal square root approximation Three instructions implement Newton s method On-chip table lookup for initial approximation Two successive iterations to refine approximation Superscalar execution permits 4 operations/cycle

Performance Barriers Floating-point geometry processing Transformation, clipping, lighting Integer pixel processing Scan conversion Frame-buffer memory bandwidth Z-buffer comparison, shading, alpha blending What can we do about these?

Front-End Multiprocessing

Pipelining Extra processors for distinct tasks Primitives are processed in discrete stages Need result of stage i to compute stage i+1 Process stage i of primitive j concurrently with stage i+1 of primitive j+1 Increase in throughput, (small) increase in latency Individual latency usually isn t important

Parallelism Extra processors for uniform tasks SIMD One instruction affects multiple operands Disable bit for conditional operations Advantage: no extra control logic Disadvantage: inflexible for non-uniform tasks Examples: MMX, 3DNow MIMD Each processor has its own instruction stream Synchronization/interconnect can be difficult

Pipeline Front End Many distinct stages naturally lead to pipelining CPU usually handles display traversal Geometry subsystem is floating-point intensive Transformations can be done in parallel By component By vector (vertex or normal) Transformation bundled with trivial accept/reject One processor per clip plane Division by w is expensive Compute 1/w and multiply x, y, and z by that

Geometry Engine (1982) Used in early SGI machines ~One chip per stage of front-end pipeline Configuration word determines which stage Each chip reads and writes command stream Vertices inserted and deleted during clipping Replaced by commodity chips in later SGIs Weitek 3332 and Intel i860

Parallel Front End Significantly harder than pipelining front end Need to recombine streams at back end for ordered rendering algorithms Processor contention over shared database RealityEngine processes geometry in parallel Command processor sends primitives to geometry engines in round robin order Geometry engine is not pipelined (single i860) Triangle bus broadcasts transformed triangles to backend processors

Back-End Multiprocessing

Taxonomy Object order: outer loop over objects Z-buffer, depth-sort, and BSP-tree algorithms Image order: outer loop over pixels Image parallel = parallel object order Parallel inner loop over image pixels Partitioned image, logic-enhanced memory Object parallel = parallel image order Parallel inner loop over objects Processor-per-primitive, tree-structured

Pipelined Object Order Polygon-edge-span processor pipeline Polygon unit finds x, z, and rgb deltas for each edge Edge unit computes left/right x, z, and rgb bounds/deltas for each span Span unit interpolates z and rgb for each pixel Implements z buffer compare Poor load balance: span processor is bottleneck Polygons*edges*spans*pixels Pixel cache to increase memory bandwidth Excellent locality

Pipelined Image Order Scan-line pipeline Y sorter finds first scan line of each polygon Active segment sorter sorts segments on scan line by x A segment is a span of a single polygon Visible span generator compares z values of segments Shader computes rgb values of visible spans No frame buffer if hardware is fast enough (!) Again, poor load balance: shader is bottleneck Need more than 1 processor/stage: parallel processing

Parallel Object Order/Image Parallel Multiple processors render pixels in parallel Predominant back-end architecture Contiguous partition Each processor handles a specific block of pixels Don t look at primitives outside my region Interleaved partition Each processor handles every nth pixel Must look at every primitive Worst case isn t as bad, best case isn t as good Predominant architecture

Parallel Object Order/Image Parallel SIMD All processors work on same nxn pixel block Single control logic for all processors Many conditional branches (complex algorithms) result in poor utilization MIMD Processors can work on pixels in different blocks Requires control logic for each processor Better utilization, but needs fast interconnect

RealityEngine (1993) Third-generation SGI graphics architecture Processing units Command processor breaks commands into primitives Geometry engines (i860s) transform, light, and clip Triangle bus broadcasts transformed triangles Fragment generators scan convert triangles into subsample masks and single rgb and z Traverses vertically: every 5th column Image engines combine subsamples in local memory Display generator assembles colors from subsamples

Logic Enhanced Memory Massively parallel approach Processing elements built into memory chips Overcomes memory bandwidth limitations Interconnect between chips can be difficult Custom logic increases cost of chips Ordinary DRAM has vast economy of scale

Pixel Planes (1981) 1-bit processor associated with each pixel Enable bit for pixels inside polygon Linear expression tree evaluates linear equations in parallel F(x, y) = Ax + By + C Edges of convex polygon Depth value of points in a triangle Color components for Gouraud shading Constant rendering time for any size polygon!

Parallel Polygon Rasterization Used by RealityEngine fragment generators Edge functions similar to Pixel Planes linear expressions nxn block of pixel logic traverses frame buffer Traversal algorithm affects performance Vertex bounding box (simplest) Edge-to-edge: search for left edge at each scan line Center line: search left and right from interior Can clip just by adding edges

Parallel Image Order/Object Parallel Specific primitives assigned to each processor Usually 1 primitive per processor Fragments combined from each processor output Processor-per-primitive pipeline My z > previous z: pass my rgb and z to next My z < previous z: pass rgb and z unchanged Processor-per-primitive tree Merge rgb and z streams using binary tree Performance drops if #primitives > #processors

Talisman (1996) Microsoft design for PC 3D hardware Retained mode API Distinct image layers for independent objects Image layers are composited during video signal generation Layer transformations for simple effects Layers are compressed by hardware

References Foley and van Dam, Chapter 18 Kurt Akeley, RealityEngine Graphics, SIGGRAPH 93 Kurt Akeley, High-Performance Polygon Rendering, SIGGRAPH 88 James H. Clark, The Geometry Engine: A VLSI Geometry System for Graphics, SIGGRAPH 82 Juan Pineda, A Parallel Algorithm for Polygon Rasterization, SIGGRAPH 88 Jay Torborg and James T. Kajiya, Talisman: Commodity Realtime 3D Graphics for the PC, SIGGRAPH 96