QCD as a Video Game?



Similar documents
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Computer Graphics Hardware An Overview

Introduction to GPGPU. Tiziano Diamanti

GPGPU Computing. Yong Cao

L20: GPU Architecture and Models

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

GPU Architecture. Michael Doggett ATI

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Introduction to GPU Computing

Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Next Generation GPU Architecture Code-named Fermi

General Purpose Computation on Graphics Processors (GPGPU) Mike Houston, Stanford University

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Developer Tools. Tim Purcell NVIDIA

Writing Applications for the GPU Using the RapidMind Development Platform

NVIDIA GeForce GTX 580 GPU Datasheet

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

Turbomachinery CFD on many-core platforms experiences and strategies

GPUs for Scientific Computing

Accelerating CFD using OpenFOAM with GPUs

Binary search tree with SIMD bandwidth optimization using SSE

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Introduction to GPU Programming Languages

GPGPU accelerated Computational Fluid Dynamics

Introduction to GPU hardware and to CUDA

GPU Parallel Computing Architecture and CUDA Programming Model

Performance Optimization and Debug Tools for mobile games with PlayCanvas

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

GOLD20TH-GTX980-P-4GD5

How To Use An Amd Ramfire R7 With A 4Gb Memory Card With A 2Gb Memory Chip With A 3D Graphics Card With An 8Gb Card With 2Gb Graphics Card (With 2D) And A 2D Video Card With

Hardware Acceleration for CST MICROWAVE STUDIO

CFD Implementation with In-Socket FPGA Accelerators

Chapter 2 Parallel Architecture, Software And Performance

Optimizing AAA Games for Mobile Platforms

In the early 1990s, ubiquitous

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

1. INTRODUCTION Graphics 2

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Introduction to GPU Architecture

Several tips on how to choose a suitable computer

Lattice QCD Performance. on Multi core Linux Servers

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Radeon HD 2900 and Geometry Generation. Michael Doggett

Interactive Level-Set Deformation On the GPU

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

A Crash Course on Programmable Graphics Hardware

Multi-Threading Performance on Commodity Multi-Core Processors

System requirements for Autodesk Building Design Suite 2017

SAPPHIRE TOXIC R9 270X 2GB GDDR5 WITH BOOST

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

Low power GPUs a view from the industry. Edvard Sørgård

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Web Based 3D Visualization for COMSOL Multiphysics

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

Stream Processing on GPUs Using Distributed Multimedia Middleware

Advanced Rendering for Engineering & Styling

Parallel Programming Survey

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Comp 410/510. Computer Graphics Spring Introduction to Graphics Systems

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

Several tips on how to choose a suitable computer

GeoImaging Accelerator Pansharp Test Results

Introduction to Computer Graphics

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Lecture Notes, CEng 477

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

How To Teach Computer Graphics

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Tekla Structures 18 Hardware Recommendation

Texture Cache Approximation on GPUs

¹ Autodesk Showcase 2016 and Autodesk ReCap 2016 are not supported in 32-Bit.

NVIDIA workstation 3D graphics card upgrade options deliver productivity improvements and superior image quality

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS

How to choose a suitable computer

Clustering Billions of Data Points Using GPUs

Communicating with devices

Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015

GPU Point List Generation through Histogram Pyramids

Hardware design for ray tracing

CUDA programming on NVIDIA GPUs

HPC Deployment of OpenFOAM in an Industrial Setting

Parallel Algorithm Engineering

GPUs Under the Hood. Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology

QuickSpecs. NVIDIA Quadro M GB Graphics INTRODUCTION. NVIDIA Quadro M GB Graphics. Overview

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Transcription:

QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó

Outline 1. Introduction 2. GPU architecture 3. Software 4. Staggered and Wilson matrix implementation 5. Summary

Why use Graphical Processing Units? Ian Buck, Stanford Gamerz market is big $$$: top model costs $500

How much power can be utilized for the lattice? Performance for Wilson multiplication 12 GPU 7900 GTX GPU 7800 GTX CPU P4 SSE 10 Gflops 8 6 4 2 0 8 4 8 3 16 16 3 8 16 3 24 16 3 64 Volume Conjugate gradient: 90% of Wilson 10 GFlops

How is this possible? CPU: lot of transistors do non-computational tasks GPU: transistors (almost) only calculate GPU architecture Native data: 2D arrays, textures (for games its content ends up on the screen). Each pixel (or fragment): 4 floating point numbers: RGBA color channels. Each pixel computation: incoming textures outgoing textures: same operation on each pixel massively parallel. Like a stream: pipeline should be full performance is better for large textures (lattices).

GPU hardware architecture Textures Vertex processor Rasterizer Fragment processor Screen Elements of the pipeline: Vertex processor: basic transformations (e.g. rotations) Rasterizer: generate texture coordinates Fragment processor: shade a region using texture inputs

Main specifications (Nvidia 7900GTX): Computation: 24 fragment processing units 800MHz clock 8 floating point operations/clock 154 GFlops peak performance! Memory: 512 MB (today s cards have up to 1GB) 256 bit wide bus 1600 MHz DDR3 memory 51.2 GB/s peak bandwidth 5 top CPUs

Native data types: textures (2D data streams) Texture elements: fragments/pixels different number of fields/fragment (R, RGB, RGBA) different data types: fixed, half (16 bit float), float General purpose computations typically RGBA float is used similar to SSE

Fragment programs input textures output (screen/textures) x, y coordinates Fragment program y x One fragment per output texture is processed at a time The 24 fragment processor units work simultaneously on different fragments

Input textures: maximum number is 32 read only can have different size can be addressed indirectly e.g. (2x+1,3y+2) Output textures: maximum number is 4 write only fixed size (for all 4) direct addressing, x, y location is written

Software environment Bad news: GPU hardware is undocumented Only possibility to program it is through graphics drivers Direct3D or OpenGL environments OpenGL also under Linux Main steps: - allocate textures - upload textures to video memory - upload fragment program - set target texture(s) - run fragment program - download output texture to main memory Upload/download is slow many runs without download needed e.g. full Conjugate Gradient on GPU, not only Dirac operator

Programming languages: Assembler: not well documented code can be uploaded from a string with an OpenGL call Cg (C for graphics): similar to C, very low level almost one-to-one compilation into assembler either pre-compilation or runtime compilation powerful instruction set multiple-add instructions on the assembler level component shuffling is trivial: e.g. a.rgba=b.rrgb+c.grab compiles to one assembler instruction

Real life example x i = y i + z i i = 1...4nm Fragment program in Cg: struct FragmentOut { float4 color0:color0; } FragmentOut example(in float2 mytexcoord:wpos, uniform samplerrect y, uniform samplerrect z) { FragmentOut c; c.color0 = texrect( y, mytexcoord ) + texrect( z, mytexcoord ); return c; }

Create texture in OpenGL (like malloc) GLuint X; glgentextures(1, &X); glbindtexture(..., X); gltexparameteri(...); glteximage2d(..., n,m,...,0); /* Texture parameters, e.g. boundary conditions */... /* do the same for Y, Z as well */ Transfer data CPU GPU in OpenGL glbindtexture(...,y); glframebuffertexture2dext(...,y,...); gltexsubimage2d(...,n,m,...,y);... /* do the same for Z */

Do the actual computation, run the fragment program cgglsettextureparameter(...,y); glframebuffertexture2dext(..., X, 0 ); /* set X as target * gldrawbuffer(...); /* tell opengl which targets to write */ cgglbindprogram(...); /* choose program */ glbegin(...); /* send vertices to the vertex processor */ { glvertex2f(0,0); glvertex2f(n,0); glvertex2f(n,m); glvertex2f(0,m); } glend();

Transfer data GPU CPU glframebuffertexture2dext(...,x,...); glreadbuffer(...); /* read from texture X */ glreadpixels(, n, m,..., x ); Same in C: for( i = 0; i < 4 * n * m; i++ ) x[i] = y[i] + z[i]; GPU version is much longer but much faster (not counting up- and downloads)

Even more real life example: QCD Staggered QCD on the GPU 4 components of a fragment (RGBA) are not optimal for SU(3) two possibilities: 1. Use only RGB fields 2. store 2 lattice sites per fragment (2 complex numbers) F0 F1 F5

texture size for a N x N y N z N t lattice is (N x N y /2) (N z N t ) Vectors: color components to different textures 3 texture per staggered vector Gauge fields: Keep only first two rows of an SU(3) matrix third row can be reconstructed 6 complex numbers 6 textures for one direction altogether 24 textures for gauge fields Index texture: store the coordinates of a neighbor for a given direction

Further ingredients for a (multishift) Conjugate Gradient solver: linear combination of vectors trivial scalar product: - multiplication is trivial - no native support for summation Reduction step: for a 2n 2m input texture take an n m output texture for an (x;y output coordinate add the elements (2x;2y),(2x+1;2y),(2x;2y+1),(2x+1;2y+1) of the input texture repeat as many times as possible (can also use 3x3 reduction instead of 2x2) download the remaining small texture and finish on the CPU

Wilson QCD on the GPU Due to Dirac indices Wilson vectors perfectly fit to the RGBA structure of the fragments texture size for a N x N y N z N t lattice is (N x N y ) (N z N t ) Vectors: half Wilson vectors are needed two Dirac components fit into the RGBA fields 3 color components to 3 textures altogether 3 textures are needed for a half Wilson vector Gauge fields: third row reconstruction not implemented yet 18 real components for a given direction Need 5 textures and there is room for 2 = 5 4 18 more real numbers. Store 2D texture location of neighbor there. Altogether 20 textures for the gauge fields

Performance of the staggered Dirac operator performance on large lattices is about 5-6 faster than on a 3 GHz p4 CPU contrary to CPU large lattices are faster multishift CG performance depends on the number of shifts 10% lower (9 GFlops on a 7900GTX card for large lattices)

Performance of the Wilson Dirac operator Performance for Wilson multiplication Performance of GPU 7900 GTX 12 GPU 7900 GTX GPU 7800 GTX CPU P4 SSE 12 Wilson CG 10 10 8 6 Gflops 8 6 4 4 2 2 0 8 4 8 3 16 16 3 8 16 3 24 16 3 64 Volume 0 8 4 8 3 16 16 3 8 16 3 24 16 3 64 Volume Similar to the staggered case

The precision problem: the GPU has 32 bit float precision no proper rounding, but truncation error propagation is worse not a serious problem for two reasons 1. Conjugate gradient solver can be run in two steps each time with lower precision and obtain high precision result 2. Most of the computations are along the trajectories where lower precision is satisfactory High precision is only needed for the accept/reject step

Summary GPUs are massively parallel stream based processors General purpose applications are possible Staggered and Wilson Dirac operator and CG solver implemented factor 3-6 speedup compared to CPUs Only single precision with truncation, no problem for CG Future prospects Communication between the cards (in one PC or different PC s) Software side: unknown and unpredictable progress made by vendors resulting in new OpenGL extensions and improved drivers. Hardware side: faster, larger, more!