Reduced Precision Hardware for Ray Tracing. Sean Keely University of Texas, Austin
|
|
|
- Martha Atkins
- 9 years ago
- Views:
Transcription
1 Reduced Precision Hardware for Ray Tracing Sean Keely University of Texas, Austin
2 Question Why don t GPU s accelerate ray tracing?
3 Real time ray tracing needs very high ray rate Example Scene: 3 area lights + AO/GI 64 rays/pixel, 1080p, 30Hz -> 4 Billion Rays/s
4 Real time ray tracing needs very high ray rate Example Scene: 3 area lights + AO/GI 64 rays/pixel, 1080p, 30Hz -> 4 Billion Rays/s Software approaches give M Rays/s
5 Single precision ASIC s will be large Approx. 4 Tflops Trace + 2 Tflops Intersect +?? Shading Will need roughly the die area of a high end GPU for a trace & intersect co-processor
6 Previous work competes with current GPU s STRaTA, D. Kopta 2013 MIMD configurable pipeline, avoids divergence penalty Repurpose L2 as ray cache SGRT, W. Lee 2014 MIMD co-processor design Most recent work addresses the performance cost of co-processor design
7 Goal: Accelerate ray tracing in a GPU, not next to one Add high performance BVH traversal acceleration to current GPU architecture MIMD traversal SIMD programs Constraints: low impact, not a co-processor small die area low power low bandwidth
8 How: Reduced Precision and Integration 4 Billion RPS needs ~24W just for multiplies in traversal. Can be reduced to ~1W with reduced precision. One off-chip data access is ~100x more energy than a FMA. Reduced precision saves here too! Don t build new registers, cache, wires Integrate to a GPU and get this for free!
9 Reduced Precision BVH Traversal
10 Robust Ray-AABB Test T F T F >T N : Intersect Box T N T N >T F : Miss Box T F T N HIT MISS
11 Robust Ray-AABB Test T F T F >T N : Intersect Box T N T N >T F : Miss Box T F T N HIT MISS T F T N?? Ambiguity if T F T N. Need T N >T F +Ɛ, Ɛ=2*ULP(T F ) to declare a miss [T. Ize, 13].
12 Hit or Miss?
13 Hit or Miss? High Precision Intersection
14 Hit or Miss? Low Precision Intersection
15 Hit or Miss?
16 Hit or Miss?
17 Hit or Miss?
18 Minimize the effective box size Seek P such that
19 Traversal Point Update Gap due to finite precision. Problems Computing T N needs multiplies Unbounded box size Multiple applications
20 Traversal Point Update Gap due to finite precision. f Precision of update procedure S Maximum parent-child edge ratio C Maximum internal parent-child offset
21 Traversal Point Update 5 bit child boxes -> S=32 Need 8 bit arithmetic to guarantee Can actually use just 1 bit! Incorrectly taken paths are quickly aborted Much smaller than even 8 bit arithmetic Reduces precision of the adds and divides as well No need to share divider, replaced with LUT.
22 Reduced Precision Traversal Unit Fixed function traversal unit implements: 1. Two 1 bit precision traversal point updates 2. Two 5 bit precision robust Ray-AABB tests 3. Near child detection 4. Single bit traversal stack 5.
23 Reduced Precision Traversal Unit Fixed function traversal unit implements: 1. Two 1 bit precision traversal point updates 2. Two 5 bit precision robust Ray-AABB tests 3. Near child detection 4. Single bit traversal stack 5.
24 Reduced Precision Traversal Unit Fixed function traversal unit implements: 1. Two 1 bit precision traversal point updates 2. Two 5 bit precision robust Ray-AABB tests 3. Near child detection 4. Single bit traversal stack 5. A fully pipelined traversal unit is roughly 6% the area of a single SIMD FPU in a current GPU.
25 BVH Compression Traversal inputs drop to 5 bits early on Can t distinguish all boxes Opportunity to save bandwidth! BVH boxes are quantized to 5 bits 12 byte node format holding two boxes
26 GPU Integration & Scheduling
27 Schedule to minimize off chip traffic Treelet scheduling [T. Aila,... 10] Smaller is better but for queuing traffic. Stay out of memory! On chip queuing. Lots of rays Improve odds of reuse.
28 We can use everything here
29 Registers are great for rays 64 byte SIMD registers Random access Ultra high bandwidth
30 Registers are great for rays 64 byte ray state Store one ray across one register All queues are linked lists of registers
31 MIMD traversal shares with SIMD core.
32 Queue data stays on chip Single global resource Has remote logic capability already
33 Queue data stays on chip
34 Queue data stays on chip One per core 40% used for queues
35 Queue rays when data is expensive Add Hit-Only load type Cache reserved for scene data
36 Queue based intersect and shading
37 Queue based intersect and shading Register Transpose units ease transitioning between MIMD and SIMD operation
38 Ray tracing fits in current GPUs
39 Evaluation
40 Experimental setup Two simulators measure key metrics Crown: 4.9M Triangles Hairball: 2.9M Triangles Powerplant: 12.8M Triangles Vegetation: 1.1M Triangles
41 Normalized traversal steps per ray Reduced precision has low costs Compressed BVH: 7% more work (usually) Hairball Vegetation Powerplant Crown Baseline Compressed BVH 8-bit Updates 1-bit Updates
42 Normalized traversal steps per ray Reduced precision has low costs Reduced Precision Traversal: 3% more work Hairball Vegetation Powerplant Crown Baseline Compressed BVH 8-bit Updates 1-bit Updates
43 Normalized traversal steps per ray Reduced precision has low costs 1-bit Traversal Point Update: 1-2% more work Hairball Vegetation Powerplant Crown Baseline Compressed BVH 8-bit Updates 1-bit Updates
44 Normalized traversal steps per ray Total overhead is small 10-15% (usually) Hairball Vegetation Powerplant Crown Baseline Compressed BVH 8-bit Updates 1-bit Updates
45 GB off chip traffic Off chip traffic is in the real time range Simulated workload set to 30 frames per second GB/s 219 GB/s 231 GB/s 123 GB/s Hairball Vegetation Powerplant Crown Traversal Limited Bandwidth Limited
46 Million Rays Per Second Simulated ray rate Average ray rate: 3.4 Billion rays/s Current Software Performance Hairball Vegetation Powerplant Crown
47 Conclusion Reduced precision yields surprising performance benefits roughly 20x. Hardware ray tracing acceleration can be a lightweight feature of modern GPUs.
48 Acknowledgements Samuli Laine for use of Vegetation and Hairball Martin Lubich & Intel for use of Crown Daniel Kopta for assistance with comparisons in the paper UT Graphics group and others for many useful discussions
49 Questions?
50 Ground Truth
51 Incorrectly taken paths abort quickly
52 Incorrectly taken paths abort quickly
53 Incorrectly taken paths abort quickly
54 Incorrectly taken paths abort quickly
55 Incorrectly taken paths abort quickly
SGRT: A Mobile GPU Architecture for Real-Time Ray Tracing
SGRT: A Mobile GPU Architecture for Real-Time Ray Tracing Won-Jong Lee 1, Youngsam Shin 1, Jaedon Lee 1, Jin-Woo Kim 2, Jae-Ho Nah 3, Seokyoon Jung 1, Shihwa Lee 1, Hyun-Sang Park 4, Tack-Don Han 2 1 SAMSUNG
Motivation: Smartphone Market
Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics
SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah *, Jin-Woo Kim *, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung SAIT, SAMSUNG Electronics, Yonsei Univ. *,
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
Lab Testing Summary Report
Lab Testing Summary Report July 2006 Report 060725 Product Category: WAN Optimization Appliances Vendors Tested: Citrix Systems TM Riverbed Technology TM Key findings and conclusions: The WANScaler 8800
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
Hardware design for ray tracing
Hardware design for ray tracing Jae-sung Yoon Introduction Realtime ray tracing performance has recently been achieved even on single CPU. [Wald et al. 2001, 2002, 2004] However, higher resolutions, complex
SPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University
Data Parallel Computing on Graphics Hardware Ian Buck Stanford University Brook General purpose Streaming language DARPA Polymorphous Computing Architectures Stanford - Smart Memories UT Austin - TRIPS
2) Based on the information in the table which choice BEST shows the answer to 1 906? 906 899 904 909
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ) Multiplying a number by results in what type of. even. 0. even.,0. odd..,0. even ) Based on the information in the table which choice BEST shows the answer to 0? 0 0 0 )
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
GPU Architecture. Michael Doggett ATI
GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super
ARM Microprocessor and ARM-Based Microcontrollers
ARM Microprocessor and ARM-Based Microcontrollers Nguatem William 24th May 2006 A Microcontroller-Based Embedded System Roadmap 1 Introduction ARM ARM Basics 2 ARM Extensions Thumb Jazelle NEON & DSP Enhancement
FAST 11. Yongseok Oh <[email protected]> University of Seoul. Mobile Embedded System Laboratory
CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of flash Memory based Solid State Drives FAST 11 Yongseok Oh University of Seoul Mobile Embedded System Laboratory
A Cross-Platform Framework for Interactive Ray Tracing
A Cross-Platform Framework for Interactive Ray Tracing Markus Geimer Stefan Müller Institut für Computervisualistik Universität Koblenz-Landau Abstract: Recent research has shown that it is possible to
A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey
A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1 Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets:
Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005
Recent Advances and Future Trends in Graphics Hardware Michael Doggett Architect November 23, 2005 Overview XBOX360 GPU : Xenos Rendering performance GPU architecture Unified shader Memory Export Texture/Vertex
Radeon HD 2900 and Geometry Generation. Michael Doggett
Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
L20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
SharePoint Performance Optimization
White Paper AX Series SharePoint Performance Optimization September 2011 WP_SharePoint_091511.1 TABLE OF CONTENTS 1 Introduction... 2 2 Executive Overview... 2 3 SSL Offload... 4 4 Connection Reuse...
In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches
In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches Xi Chen 1, Zheng Xu 1, Hyungjun Kim 1, Paul V. Gratz 1, Jiang Hu 1, Michael Kishinevsky 2 and Umit Ogras 2
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
ClearCube White Paper Best Practices Pairing Virtualization and Centralization Increasing Performance for Power Users with Zero Client End Points
ClearCube White Paper Best Practices Pairing Virtualization and Centralization Increasing Performance for Power Users with Zero Client End Points Introduction Centralization and virtualization initiatives
Web Server Software Architectures
Web Server Software Architectures Author: Daniel A. Menascé Presenter: Noshaba Bakht Web Site performance and scalability 1.workload characteristics. 2.security mechanisms. 3. Web cluster architectures.
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
Exploring RAID Configurations
Exploring RAID Configurations J. Ryan Fishel Florida State University August 6, 2008 Abstract To address the limits of today s slow mechanical disks, we explored a number of data layouts to improve RAID
Embedded Parallel Computing
Embedded Parallel Computing Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors Tomas Nordström Course webpage:: Course responsible and examiner: Tomas
Let s put together a Manual Processor
Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce
Infrastructure Matters: POWER8 vs. Xeon x86
Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report
Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1
Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System
Course Description. Students Will Learn
Course Description The next generation of telecommunications networks will deliver broadband data and multimedia services to users. The Ethernet interface is becoming the interface of preference for user
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
Touchstone -A Fresh Approach to Multimedia for the PC
Touchstone -A Fresh Approach to Multimedia for the PC Emmett Kilgariff Martin Randall Silicon Engineering, Inc Presentation Outline Touchstone Background Chipset Overview Sprite Chip Tiler Chip Compressed
Virtual vs Physical Addresses
Virtual vs Physical Addresses Physical addresses refer to hardware addresses of physical memory. Virtual addresses refer to the virtual store viewed by the process. virtual addresses might be the same
EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi
EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Applied Technology Abstract Microsoft SQL Server includes a powerful capability to protect active databases by using either
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Path Tracing Overview
Path Tracing Overview Cast a ray from the camera through the pixel At hit point, evaluate material Determine new incoming direction Update path throughput Cast a shadow ray towards a light source Cast
Windows Server 2008 R2 Hyper-V Live Migration
Windows Server 2008 R2 Hyper-V Live Migration Table of Contents Overview of Windows Server 2008 R2 Hyper-V Features... 3 Dynamic VM storage... 3 Enhanced Processor Support... 3 Enhanced Networking Support...
3ds Max 8 & mental ray. Autodesk Media and Entertainment
3ds Max 8 & mental ray Autodesk Media and Entertainment 3ds Max 8 and mental ray 3.4 Audience: Public Autodesk 3ds Max 8 software supports mental ray 3.4.4 rendering software using various methods for
Introduction to RISC Processor. ni logic Pvt. Ltd., Pune
Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC
Cisco Application Networking for Citrix Presentation Server
Cisco Application Networking for Citrix Presentation Server Faster Site Navigation, Less Bandwidth and Server Processing, and Greater Availability for Global Deployments What You Will Learn To address
GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2
GPGPU for Real-Time Data Analytics: Introduction Bingsheng He 1, Huynh Phung Huynh 2, Rick Siow Mong Goh 2 1 Nanyang Technological University, Singapore 2 A*STAR Institute of High Performance Computing,
Annotation to the assignments and the solution sheet. Note the following points
Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed
Top 10 Performance Tips for OBI-EE
Top 10 Performance Tips for OBI-EE Narasimha Rao Madhuvarsu L V Bharath Terala October 2011 Apps Associates LLC Boston New York Atlanta Germany India Premier IT Professional Service and Solution Provider
Advanced Core Operating System (ACOS): Experience the Performance
WHITE PAPER Advanced Core Operating System (ACOS): Experience the Performance Table of Contents Trends Affecting Application Networking...3 The Era of Multicore...3 Multicore System Design Challenges...3
Hardware/Software Guidelines
There are many things to consider when preparing for a TRAVERSE v11 installation. The number of users, application modules and transactional volume are only a few. Reliable performance of the system is
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
Project Report on Traffic Engineering and QoS with MPLS and its applications
Project Report on Traffic Engineering and QoS with MPLS and its applications Brief Overview Multiprotocol Label Switching (MPLS) is an Internet based technology that uses short, fixed-length labels to
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
Appendix L. General-purpose GPU Radiative Solver. Andrea Tosetto Marco Giardino Matteo Gorlani (Blue Engineering & Design, Italy)
141 Appendix L General-purpose GPU Radiative Solver Andrea Tosetto Marco Giardino Matteo Gorlani (Blue Engineering & Design, Italy) 14 15 October 2014 142 General-purpose GPU Radiative Solver Abstract
How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server)
Scalability Results Select the right hardware configuration for your organization to optimize performance Table of Contents Introduction... 1 Scalability... 2 Definition... 2 CPU and Memory Usage... 2
Asynchronous Bypass Channels
Asynchronous Bypass Channels Improving Performance for Multi-Synchronous NoCs T. Jain, P. Gratz, A. Sprintson, G. Choi, Department of Electrical and Computer Engineering, Texas A&M University, USA Table
Datacenter Operating Systems
Datacenter Operating Systems CSE451 Simon Peter With thanks to Timothy Roscoe (ETH Zurich) Autumn 2015 This Lecture What s a datacenter Why datacenters Types of datacenters Hyperscale datacenters Major
INTRODUCTION TO RENDERING TECHNIQUES
INTRODUCTION TO RENDERING TECHNIQUES 22 Mar. 212 Yanir Kleiman What is 3D Graphics? Why 3D? Draw one frame at a time Model only once X 24 frames per second Color / texture only once 15, frames for a feature
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
WhitePaper: XipLink Real-Time Optimizations
WhitePaper: XipLink Real-Time Optimizations XipLink Real Time Optimizations Header Compression, Packet Coalescing and Packet Prioritization Overview XipLink Real Time ( XRT ) is a new optimization capability
Networking Virtualization Using FPGAs
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,
CHAPTER 4 PERFORMANCE ANALYSIS OF CDN IN ACADEMICS
CHAPTER 4 PERFORMANCE ANALYSIS OF CDN IN ACADEMICS The web content providers sharing the content over the Internet during the past did not bother about the users, especially in terms of response time,
Improving Grid Processing Efficiency through Compute-Data Confluence
Solution Brief GemFire* Symphony* Intel Xeon processor Improving Grid Processing Efficiency through Compute-Data Confluence A benchmark report featuring GemStone Systems, Intel Corporation and Platform
Programmable Networking with Open vswitch
Programmable Networking with Open vswitch Jesse Gross LinuxCon September, 2013 2009 VMware Inc. All rights reserved Background: The Evolution of Data Centers Virtualization has created data center workloads
Dragon NaturallySpeaking and citrix. A White Paper from Nuance Communications March 2009
Dragon NaturallySpeaking and citrix A White Paper from Nuance Communications March 2009 Introduction As the number of deployed enterprise applications increases, organizations are seeking solutions that
CMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing
Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak
Cray Gemini Interconnect Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak Outline 1. Introduction 2. Overview 3. Architecture 4. Gemini Blocks 5. FMA & BTA 6. Fault tolerance
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect
OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE Guillène Ribière, CEO, System Architect Problem Statement Low Performances on Hardware Accelerated Encryption: Max Measured 10MBps Expectations: 90 MBps
Windows Server 2008 R2 Hyper-V Live Migration
Windows Server 2008 R2 Hyper-V Live Migration White Paper Published: August 09 This is a preliminary document and may be changed substantially prior to final commercial release of the software described
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
FREE-p: Protecting Non-Volatile Memory against both Hard and Soft Errors
FREE-p: Protecting Non-Volatile Memory against both Hard and Soft Errors Doe Hyun Yoon Naveen Muralimanohar Jichuan Chang Parthasarathy Ranganathan Norman P. Jouppi Mattan Erez Electrical and Computer
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
Maximizing VMware ESX Performance Through Defragmentation of Guest Systems. Presented by
Maximizing VMware ESX Performance Through Defragmentation of Guest Systems Presented by July, 2010 Table of Contents EXECUTIVE OVERVIEW 3 TEST EQUIPMENT AND METHODS 4 TESTING OVERVIEW 5 Fragmentation in
Latency on a Switched Ethernet Network
Application Note 8 Latency on a Switched Ethernet Network Introduction: This document serves to explain the sources of latency on a switched Ethernet network and describe how to calculate cumulative latency
The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000)
The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000) IntelliMagic, Inc. 558 Silicon Drive Ste 101 Southlake, Texas 76092 USA Tel: 214-432-7920
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT:
Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT: In view of the fast-growing Internet traffic, this paper propose a distributed traffic management
Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX
Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy
Vorlesung Rechnerarchitektur 2 Seite 178 DASH
Vorlesung Rechnerarchitektur 2 Seite 178 Architecture for Shared () The -architecture is a cache coherent, NUMA multiprocessor system, developed at CSL-Stanford by John Hennessy, Daniel Lenoski, Monica
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University [email protected] COMP
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
Choosing a Computer for Running SLX, P3D, and P5
Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Intel DPDK Boosts Server Appliance Performance White Paper
Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks
PCI Express IO Virtualization Overview
Ron Emerick, Oracle Corporation Author: Ron Emerick, Oracle Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and
Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009
Performance Study Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Introduction With more and more mission critical networking intensive workloads being virtualized
Hardware acceleration enhancing network security
Hardware acceleration enhancing network security Petr Kaštovský [email protected] High-Speed Networking Technology Partner Threats Number of attacks grows together with damage caused Source: McAfee
Management and Orchestration of Virtualized Network Functions
Management and Orchestration of Virtualized Network Functions Elisa Maini Dep. of Electrical Engineering and Information Technology, University of Naples Federico II AIMS 2014, 30 th June 2014 Outline
0408 - Avoid Paying The Virtualization Tax: Deploying Virtualized BI 4.0 The Right Way. Ashish C. Morzaria, SAP
0408 - Avoid Paying The Virtualization Tax: Deploying Virtualized BI 4.0 The Right Way Ashish C. Morzaria, SAP LEARNING POINTS Understanding the Virtualization Tax : What is it, how it affects you How
CHAPTER 5 WLDMA: A NEW LOAD BALANCING STRATEGY FOR WAN ENVIRONMENT
81 CHAPTER 5 WLDMA: A NEW LOAD BALANCING STRATEGY FOR WAN ENVIRONMENT 5.1 INTRODUCTION Distributed Web servers on the Internet require high scalability and availability to provide efficient services to
Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays
Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Database Solutions Engineering By Murali Krishnan.K Dell Product Group October 2009
