Radeon HD 2900 and Geometry Generation. Michael Doggett

Similar documents

GPU Architecture. Michael Doggett ATI

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Computer Graphics Hardware An Overview

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

GPGPU Computing. Yong Cao

Introduction to GPGPU. Tiziano Diamanti

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

Console Architecture. By: Peter Hood & Adelia Wong

L20: GPU Architecture and Models

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

GPU architecture II: Scheduling the graphics pipeline

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

DATA VISUALIZATION OF THE GRAPHICS PIPELINE: TRACKING STATE WITH THE STATEVIEWER

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

NVIDIA GeForce GTX 580 GPU Datasheet

Introduction to GPU Architecture

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Midgard GPU Architecture. October 2014

Touchstone -A Fresh Approach to Multimedia for the PC

How To Use An Amd Ramfire R7 With A 4Gb Memory Card With A 2Gb Memory Chip With A 3D Graphics Card With An 8Gb Card With 2Gb Graphics Card (With 2D) And A 2D Video Card With

Optimizing AAA Games for Mobile Platforms

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GRAPHICS CARDS IN RADIO RECONNAISSANCE: THE GPGPU TECHNOLOGY

White Paper AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE

SAPPHIRE HD GB GDDR5 PCIE.

An Introduction to Modern GPU Architecture Ashu Rege Director of Developer Technology

Introduction to GPU hardware and to CUDA

Awards News. GDDR5 memory provides twice the bandwidth per pin of GDDR3 memory, delivering more speed and higher bandwidth.

General Purpose Computation on Graphics Processors (GPGPU) Mike Houston, Stanford University

Next Generation GPU Architecture Code-named Fermi

Dynamic Resolution Rendering

Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

GPU Parallel Computing Architecture and CUDA Programming Model

SAPPHIRE TOXIC R9 270X 2GB GDDR5 WITH BOOST

2: Introducing image synthesis. Some orientation how did we get here? Graphics system architecture Overview of OpenGL / GLU / GLUT

Introduction to GPU Programming Languages

Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

GPUs Under the Hood. Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology

SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing

OpenGL Performance Tuning

Interactive Rendering In The Post-GPU Era. Matt Pharr Graphics Hardware 2006

Low power GPUs a view from the industry. Edvard Sørgård

SAPPHIRE VAPOR-X R9 270X 2GB GDDR5 OC WITH BOOST

Advanced Visual Effects with Direct3D

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

Close to the Metal. Justin Hensley Graphics Product Group

Image Synthesis. Transparency. computer graphics & visualization

Getting Started with RemoteFX in Windows Embedded Compact 7

Intel Pentium 4 Processor on 90nm Technology

Lecture Notes, CEng 477

Hardware design for ray tracing

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

QCD as a Video Game?

Texture Cache Approximation on GPUs

1. INTRODUCTION Graphics 2

A Crash Course on Programmable Graphics Hardware

The Future Of Animation Is Games

A Computer Vision System on a Chip: a case study from the automotive domain

Msystems Ltd. SAPPHIRE HD GB GDDR5 PCIE

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Interactive Level-Set Deformation On the GPU

Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Chapter 1 Computer System Overview

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Getting Started with CodeXL

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Water Flow in. Alex Vlachos, Valve July 28, 2010

Developer Tools. Tim Purcell NVIDIA

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

A Performance-Oriented Data Parallel Virtual Machine for GPUs

Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Sample Exam Questions 2007

Choosing a Computer for Running SLX, P3D, and P5

In the early 1990s, ubiquitous

GPU-Driven Rendering Pipelines

GPU Computing - CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture

Configuring Memory on the HP Business Desktop dx5150

Optimization for DirectX9 Graphics. Ashu Rege

Introduction to Game Programming. Steven Osman

Let s put together a Manual Processor

GPGPU: General-Purpose Computation on GPUs

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

Transcription:

Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007

Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command Processor Shader Setup Engine Ultra Threaded Dispatch Processor Shader Core Texture Render Backend Memory Controller Dirext3D 10 Geometry Shader Conclusion 2 September 2007

Introduction to 3D Graphics 3 September 2007

Rendering Pipeline Combination of devices and methods for creating image from 3D scene description Common pipeline structure: Application (runs on CPU) 3D rendering GPU Geometry/vertex processor Rasterizer GPU Application Geometry Rasterizer Output image 4 September 2007

Polygons The most common polygon type used in 3D graphics is a triangle More polygons better approximation of the surface curvature 5 September 2007

Geometry Definition Triangles are defined by points in corners vertices Operations on geometry is called vertex processing Vertex position is relative to some arbitrary point Other descriptions of vertex attributes Color Texture coordinates Normal Other custom attributes Vertices 6 September 2007

Rasterization Break up triangles into raster elements (pixels) Perform scan conversion Scan line Later perform other per-pixel operations 7 September 2007

Pixel Processing Each screen pixel of the rendered 3D surface is evaluated Light can be added to control the lighting of the scene Texturing adds details Various effects can be applied to enhance the scene appearance 8 September 2007

Texturing One of the simplest ways to define material Adds details to objects without extra geometry All details are in painted textures Many different parameters control the texture appearance 9 September 2007

Texturing Think of textures as thin stretchy painted film used for shrink-wrapping Texture coordinates define texture application (0, 0) (0.5, 0) (1, 1) (1, 1) (0, 1) 10 September 2007

Radeon 2900 11 September 2007

Starting point Combine the best of existing technology R5xx series Heavily threaded shader cores Hides latency of memory fetch Vec4+1 Vertex and Vec3+1 Pixel shaders Ringbus memory subsystem XBOX 360 GPU Unified shader architecture Vertex and Pixel Vec4+1 Stream Out Unified L1 texture cache Introduced Tessellator 12 September 2007

Requirements DirectX10 compatible Support new driver model Vista driver model Scalable family Number of shader cores, texture units, render back-ends. Shader scalable in number of pipes, SIMDs. Target specific cost, feature set and performance levels for each part 13 September 2007

Top Level Red Compute Yellow Cache Unified shader Hierarchical Z Stream Out Rasterizer Interpolators Command Processor Vertex Index Fetch Setup Unit Tessellator Geometry Vertex Assembler Assembler Ultra-Threaded Dispatch Processor Shader Caches Instruction & Constant Shader R/W Instr./Const. cache Unified texture cache Compression Z/Stencil Cache Memory Read/Write Cache Unified Shader Processors Shader Export Texture Units L1 Texture Cache L2 Texture Cache Render Back-Ends Color Cache 14 September 2007

Command Processor GPU interface with host A custom RISC based Micro-Coded engine First class memory client with Read/Write access State management Command Processor Vertex Index Fetch Hierarchical Z Stream Out Rasterizer Interpolators Setup Unit Tessellator Geometry Vertex Assembler Assembler Ultra-Threaded Dispatch Processor Shader Caches Instruction & Constant 15 September 2007

Shader Setup Engine Draw/State Vertex Index Fetch Shader Feedback Hierarchiacal and EarlyZ Rasterizer Interpolators Setup Engine Geometry Assembler Programmable Tessellator Vertex Assembler Thread Workload Pixel Geometry Vertex 3 groups of blocks feeding 3 data streams Each group feeding 16 elements (Vertices/Geometry/Pixels)/cycle 16 September 2007

Shader Setup Engine Draw/State Vertex Index Fetch Shader Feedback Hierarchiacal and EarlyZ Rasterizer Interpolators Setup Engine Geometry Assembler Programmable Tessellator Vertex Assembler Thread Workload Vertex Vertex blocks Primitive Tessellation Inputs Index & instancing Sends vertex addresses to shader core 17 September 2007

Shader Setup Engine Draw/State Vertex Index Fetch Shader Feedback Hierarchiacal and EarlyZ Rasterizer Interpolators Setup Engine Geometry Assembler Programmable Tessellator Vertex Assembler Thread Workload Geometry Geometry blocks Uses on/off chip staging Sends processed vertex addresses, near neighbor addresses and topological information to shader core 18 September 2007

Shader Setup Engine Draw/State Vertex Index Fetch Shader Feedback Hierarchiacal and EarlyZ Rasterizer Interpolators Setup Engine Geometry Assembler Programmable Tessellator Vertex Assembler Pixel Thread Workload Pixel blocks Triangle setup, Rasterization and Interpolation Interfaces to depth to perform HiZ/Early Z checks 19 September 2007

Ultra-Threaded Dispatch Processor Main control for the shader core All workloads have threads of 64 elements 100 s of threads in flight Threads are put to sleep when they request a slow responding resource Arbitration policy Age/Need/Availability When in doubt favor pixels Programmable Hiera Stream O Interpolators Geometry Vertex Assembler Assembler Ultra-Threaded Dispatch Processor Caches ction & stant che L 20 September 2007

Ultra-Threaded Dispatch Processor Setup Engine Vertex Shader Command Queue Geometry Shader Command Queue PIxel Shader Command Queue VS Thread 3 VS Thread 2 VS Thread 1 GS Thread 2 GS Thread 1 Arbiter Arbiter Sequencer Interpolators Arbiter Arbiter Sequencer Sequencer Sequencer Arbiter Sequencer Sequencer PS PS PS PS Arbiter Arbiter Sequencer Sequencer Thread Thread Thread Thread 4 3 2 1 Texture Fetch Arbiter Vertex Fetch Arbiter Texture Fetch Sequencer Vertex Fetch Sequencer Instructions and Control SIMD Array SIMD Array 80 Stream Processing Units 80 Stream Processing Units 80 Stream Processing Units 80 Stream Processing Units September 2007 Vertex / Texture Cache SIMD Array Vertex / Texture Units 21 SIMD Array Shader Constant Cache Arbiter Geometry Assembler Shader Instruction Cache UltraUltra-Threaded Dispatch Processor Vertex Assembler

Shader Core 4 parallel SIMD units Each unit receives independent ALU instruction Very Long Instruction Word (VLIW) ALU Instruction (1 to 7 64-bit words) 5 scalar ops 64 bits for src/dst/cntrls/op 2 additional for literal constant(s) 22 September 2007

Stream Processing Units 5 Scalar Units Each scalar unit does FP Multiply- Add (MAD) and integer operations One also handles transcendental instructions IEEE 32-bit floating point precision Branch Execution Unit Flow control instructions Up to 6 operations co-issued General Purpose Registers Branch Execution Unit 23 September 2007

Memory Read/Write Cache Virtualizes register space Allows overflow to graphics memory Can be read from or written to by any SIMD (texture & vertex caches are read-only) 8KB Fully associative cache, write combining Stream Out Allows shader output to bypass render back-ends and color buffer Outputs sequential stream of data instead of bitmaps Used for Inter-thread communication 24 September 2007

Texture Units Texture Filter Units Vertex Address Processors Texture Address Processors FP16 Texture Samples Vertex Fetches Fetch Units 8 Fetch Address Processors each 4 filtered and 4 unfiltered 20 Texture Samplers each Can fetch a single data value per clock 4 filtered texels (with BW) Bilinear filter one 64-bit FP color value per clock, 128b FP per 2 clocks for each pixel Fetch Caches Unified caches across all SIMDs Vertex/Unfiltered cache 4kb L1, 32Kb L2 Texture cache 32KB L1, 256KB L2 Texture 25 September 2007

Render Back-Ends Double rate depth/stencil test 32 pixels per clock for HD 2900 New HiStencil Programmable MSAA resolve Allows Custom AA Filters New blend-able DX10 surface formats 128-bit and 11:11:10 floating point format Z/Stencil Cache Decompress Compress Programmable MSAA Resolve Decompress Alpha/Fog Depth/Stencil Blend Compress Up to 8 Mulptiple Render Targets with MSAA support Color Cache 26 September 2007

Memory Interface and Controller y 512-bit Interface Compact, stacked I/O pad design More bandwidth with existing memory technology E P Ri xp CI ng re St ss op Ring Stop Improved cost:bandwidth ratio 8 x 64 bit memory channels Ring Stop Ring Stop y Double ringbus 512 bit read and write Ring Stop 27 September 2007

Radeon HD 2000 Series Radeon 2900 2600 2400 Stream Processors 320 120 40 SIMDs 4 3 2 Pipelines 16 8 4 Texture Units 16 8 4 Render Backends 16 4 4 L2 texture cache (KB) 256 128 0 Technology (nm) 80 65 65 Area (mm2) 420 153 82 Transistors (Millions) 720 390 180 Memory Bandwidth 512 128 64 28 September 2007

Direct3D 10 Geometry Shader 29 September 2007

Direct3D 10 New Stages 30 September 2007

Direct3D 10 Geometry Shader Input Primitive Line, Point, Triangle Neighbouring vertices 2 for Line, 3 for Triangle Output Rasterizer Vertex buffer in memory (Stream Out) 31 September 2007

Geometry Shader example Per-pixel Displacement Mapping Hirche, Ehlert, Guthe, Doggett, GI2004 Similar to DirectX SDK Displacement Mapping 32 September 2007

Where next? Move fixed functions blocks to shader Improve programmability, reduce area, improve reuse, maintain/target performance Enhancements for GPGPU Improved precision and compliance New APIs, new functions New technologies such as 65, 55, 45, 32 Graphics and gaming keeps on evolving DX-next is already being discussed We are well into next generation and next-next generations 33 September 2007

Radeon HD 2900 Unified shader Vertex, Geometry and Pixel Multiple SIMD 5-way scalar Shader cached memory read/write Geometry shader on/off chip storage 512 bit stacked I/O Memory Interface Full DX10 functionality 34 September 2007

Questions and Demo Thanks to Eric Demers and Mike Mantor 35 September 2007

Performance Improvements 35300 30300 25300 GPU Performance Moore's Law (doubling every 24 months) 20300 15300 10300 5300 300 2000 2001 2002 2003 2004 2005 2006 2007 7500 8500 9700 9800 X800 X1800 X1950 2900 36 September 2007

GPU vs. Size/Perf./Features 37 September 2007

The ATI Radeon HD 2900 developed by AMD is a Graphics Processing Unit (GPU) capable of massively parallel computation for high performance 3D graphics and general purpose algorithms. The unified shader architecture consists of a combination of MIMD and SIMD architectures of 5 way scalar arithmetic units running in parallel. The shader uses multi-threading to hide latency of memory access so that compute units are keep busy. The threads consist of vertex, geometry and pixel threads that represent different programmable stages of a traditional 3D graphics pipeline mapped onto a single scheduled shader unit. Varied distributed and unified caches are used for data, instructions, read only texture reads and vertex data. A ring based memory subsystem allows multiple clients to access multiple memory channels. The Radeon HD 2900 includes a tessellator for generating highly detailed surfaces using the GPU in a fixed function pipeline. More complex surface operations can be performed using the Direct3D10 API's new geometry shader. This shader stage allows programmable operations on the geometry in a 3D scene, not just the individual vertices or pixels. This geometry shader gives access to neighbouring vertices which allows computation of surface properties. These surface properties can then be used to generate new vertices to represent a more complex geometry then was originally sent to the GPU enabling algorithms to be run that previously needed to be run on the CPU. 38 September 2007