GPU multiprocessing. Manuel Ujaldón Martínez Computer Architecture Department University of Malaga (Spain)
Outline
1. Multichip solutions [10 slides]
2. Multicard solutions [2 slides]
3. Multichip + multicard [3 slides]
4. Performance on matrix decompositions [2 slides]
5. CUDA programming [5 slides]
6. Scalability on 3DFD [4 slides]
A world of possibilities
From lower to higher cost, we have:
1. Multichip: Voodoo5 (3Dfx), 3D1 (Gigabyte).
2. Multicard: SLI (Nvidia) / CrossFire (ATI).
3. Combination: Two chips/card and/or two cards/connector.
[Figure labels: NVIDIA (2007), ATI (2007), Gigabyte (2005), NVIDIA (2008), Evans & Sutherland (2004)]
I. Multichip solutions
First choice: Multichip. A retrospective:
- Voodoo5, 3Dfx (1999)
- Volari V8 Duo, XGI (2002)
- Rage Fury Maxx, ATI (2000)
- 2 Rad9800 (prototype), Sapphire (2003)
First choice: Multichip. Example 1: 3D1 (Gigabyte). A double GeForce 6600GT GPU on the same card (December 2005). Each GPU is endowed with 128 MB of memory and a 128-bit bus.
First choice: Multichip. Example 2: GeForce 7950 GX2 (Nvidia, 2006)
First choice: Multichip. Example 3: GeForce 9800 GX2 (Nvidia). Double GeForce 8800 GPU, double printed circuit board and double video memory of 512 MB. A single PCI-express connector.
First choice: Multichip. 3D1 (Gigabyte). Cost and performance
[Table: 3DMark scores (including 3DMark 2005) per card and resolution, starting at 1024x...; the numeric values are not recoverable. Rows:]
1. GeForce 6600 GT
2. 3D1 using a single GPU
3. GeForce 6800 GT
4. GeForce 6600 GT SLI
5. 3D1 using two GPUs
Cost: row 3 > row 4 > row 5 > row 1 > row 1
First choice: Multichip. 3D1 (Gigabyte). Analysis.
As compared to a single GeForce 6800 GT, 3D1 has:
- Lower cost.
- Higher arithmetic performance. Better at lower resolutions and with software innovations (shaders).
- Similar bandwidth.
- Lower memory space and usability: Vertices and textures must be replicated. A GPU cannot see the memory of its twin.
As compared to two GeForce 6600 GT connected through SLI:
- Slightly lower cost.
- Greater performance without demanding CPU bandwidth.
- Less versatile: Future expansion and/or single-card use.
First choice: Multichip. GeForce 7950 GX2 (2006)
GPU developed by Nvidia in June.
The GPU has a twin soul (duality affects the design).
Clocks are slower than in the single-GPU model:
- GPU: 500 MHz (twin) versus 650 MHz (stand-alone).
- Memory: 2x600 MHz (twin) versus 2x800 MHz (stand-alone).
Drivers were released almost a year later, which initially hurt the popularity of this card.
It provides 48 pixel processors (24 on each GPU) and 1 GB of video memory (512 MB connected to each GPU through a couple of 256-bit buses).
First choice: Multichip (2006). Transistors. A smaller chip with smaller transistors allows growth through GPU replication.
First choice: Multichip (2006). Frequency. A double GPU allows clocks to be relaxed, with less heat and power consumption.
First choice: Multichip (2006). Bandwidth. Two GPUs placed on parallel planes make it easier to double the bus width to 512 bits.
II. Multicard solutions
Second choice: Multicard. A couple of GPUs: SLI (Nvidia on GeForces), CrossFire (ATI on Radeons).
Second choice: Multicard. SLI (Nvidia). Elements.
- The motherboard must have several PCI-express 2.0 and PCI-express x16 slots.
- The power supply must deliver at least 700 Watts.
- Performance issues: A twin card may increase performance by 60-80%. A new generation of GPUs may increase it even more. The time frame becomes crucial!
III. Multichip + multicard
1+2 choice: Multichip+multicard. First solution available on the marketplace: Gigabyte (2005), based on GeForce 6 GPUs. It allows heterogeneous graphics cards, but workload balance gets complicated.
1+2 choice: Multichip+multicard. Implementation details
1+2 choice: Multichip+multicard. Newer designs. It combines a number of GeForce 9800 GX2 GPUs and a multi-socket motherboard to configure up to quad-SLI: 2 GPUs/card x up to 4 cards = 8 GPUs. [Figure labels: 2 GPUs, 4 GPUs, 8 GPUs]
IV. Performance on matrix decompositions
Multicard performance versus a newer generation (LU decomposition). A second (twin) GPU improves performance by 1.6x, but does not reach the performance of a single card from the next generation.
CPU+GPU performance versus a single quad-core CPU (more on this later). The benchmark is composed of three popular matrix decompositions used in linear algebra.
V. CUDA programming for multi-GPU applications
Device Management
CPU can query and select GPU devices:
- cudaGetDeviceCount( int *count )
- cudaSetDevice( int device )
- cudaGetDevice( int *current_device )
- cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
- cudaChooseDevice( int *device, cudaDeviceProp* prop )
Multi-GPU setup:
- Device 0 is used by default.
- One CPU thread can control only one GPU.
- Multiple CPU threads can control the same GPU; calls are serialized by the driver.
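The query calls listed on this slide can be sketched as a small helper. This is a minimal illustration, not code from the original deck: the function name `listDevices` and the printed format are assumptions, and the helper simply reports zero devices if the runtime returns an error (for example, when no driver is installed).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the name and global memory size of every
// visible GPU and returns the device count (0 if the runtime reports an error).
int listDevices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices whose properties cannot be read
        }
        std::printf("device %d: %s, %zu MB of global memory\n",
                    dev, prop.name, prop.totalGlobalMem >> 20);
    }
    return count;
}
```

A CPU thread that wants a GPU other than device 0 would then call `cudaSetDevice(i)` with `0 <= i < listDevices()` before its first allocation or kernel launch.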
Multiple CPU Threads and CUDA
CUDA resources allocated by a CPU thread can be consumed only by CUDA calls from the same CPU thread.
Violation example: CPU thread 2 allocates GPU memory and stores the address in p; thread 3 issues a CUDA call that accesses memory via p.
When using several GPUs, the implementation gets complicated
GPUs don't share video memory, so the programmer must move data across PCI-express (even when the GPUs belong to the same graphics card, as in the GeForce 9800 GX2). Steps to follow:
1. Copy data from GPU A to CPU thread A.
2. Copy data from CPU thread A to CPU thread B using MPI.
3. Copy data from CPU thread B to GPU B.
We can use asynchronous copies to overlap kernel execution on the GPU with data copies, and pinned memory to share copies among CPU threads (use cudaHostAlloc()).
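The staging steps above can be sketched in a simplified form. This is an illustration under stated assumptions rather than the deck's code: it collapses CPU threads A and B into a single host thread (so the MPI step disappears), which relies on CUDA 4.0 or later, where one thread may switch devices with `cudaSetDevice`. The function name `copyViaHost` is hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copies `bytes` from srcPtr (allocated on device srcDev)
// to dstPtr (allocated on device dstDev) by staging through page-locked host
// memory. Returns cudaSuccess or the first error encountered.
cudaError_t copyViaHost(void* dstPtr, int dstDev,
                        const void* srcPtr, int srcDev, size_t bytes) {
    void* staging = nullptr;
    // Portable pinned memory is visible to all CUDA contexts of the process.
    cudaError_t err = cudaHostAlloc(&staging, bytes, cudaHostAllocPortable);
    if (err != cudaSuccess) return err;

    // Step 1: GPU A -> host buffer.
    err = cudaSetDevice(srcDev);
    if (err == cudaSuccess)
        err = cudaMemcpy(staging, srcPtr, bytes, cudaMemcpyDeviceToHost);

    // Step 2 (the MPI exchange between CPU threads) is omitted in this sketch.

    // Step 3: host buffer -> GPU B.
    if (err == cudaSuccess) err = cudaSetDevice(dstDev);
    if (err == cudaSuccess)
        err = cudaMemcpy(dstPtr, staging, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(staging);
    return err;
}
```

In a multi-process setting, each rank would keep its own pinned buffer and an MPI send/receive pair would replace the omitted step.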
Host Synchronization
All kernel launches are asynchronous:
- Control returns to the CPU immediately.
- The kernel executes after all previous CUDA calls have completed.
cudaMemcpy is synchronous:
- Control returns to the CPU after the copy completes.
- The copy starts after all previous CUDA calls have completed.
cudaThreadSynchronize() blocks until all previous CUDA calls complete.
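The launch and copy semantics described on this slide can be illustrated with a short sketch. The kernel `scale` and the wrapper `scaleOnGpu` are hypothetical names, and the sketch uses `cudaDeviceSynchronize()`, the name that replaced the now-deprecated `cudaThreadSynchronize()` from CUDA 4.0 onwards. It assumes `n > 0`.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Multiplies each of the n elements by `factor`.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical wrapper: scales n host floats on the current GPU.
// Returns cudaSuccess or the first error encountered.
cudaError_t scaleOnGpu(float* host, int n, float factor) {
    float* dev = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;

    // Synchronous copy: control returns only after the data is on the GPU.
    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Asynchronous launch: control returns to the CPU immediately.
        scale<<<(n + 255) / 256, 256>>>(dev, n, factor);
        err = cudaGetLastError();  // reports launch configuration errors
    }
    // Explicit barrier: blocks until the kernel has finished. The copy below
    // would also wait for it; the barrier makes the wait visible and surfaces
    // kernel execution errors at this point.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

Any CPU work that does not depend on the kernel's result can be placed between the launch and the barrier, where it runs concurrently with the GPU.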
CPU-GPU interactions: Conclusions
CPU-GPU memory bandwidth is much lower than GPU memory bandwidth.
Use page-locked host memory (cudaMallocHost()) for maximum CPU-GPU bandwidth:
- 3.2 GB/s is common on PCI-e x16.
- ~4 GB/s measured on nForce 680i chipsets (8 GB/s for PCI-e 2.0).
- Be cautious, however, since allocating too much page-locked memory can reduce overall system performance.
Minimize CPU-GPU data transfers by moving more code from CPU to GPU:
- Even if that means running kernels with low parallelism.
- Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to CPU memory.
Group data transfers: One large transfer is much better than many small ones.
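One way to check the pinned-memory advice on a given machine is to time a single host-to-device copy with CUDA events. The sketch below is an assumption-laden illustration: the function name `hostToDeviceGBps` is hypothetical, a single untimed-warm-up-free copy gives only a rough figure, and the result depends on the chipset and PCI-e generation.

```cuda
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: returns the measured host-to-device bandwidth in GB/s
// for one copy of `bytes`, using page-locked (pinned = true) or pageable host
// memory. Returns -1 if any step fails.
float hostToDeviceGBps(size_t bytes, bool pinned) {
    void* host = nullptr;
    void* dev = nullptr;
    if (pinned) {
        if (cudaMallocHost(&host, bytes) != cudaSuccess) return -1.0f;
    } else {
        host = std::malloc(bytes);
        if (host == nullptr) return -1.0f;
    }
    std::memset(host, 0, bytes);

    float gbps = -1.0f;
    if (cudaMalloc(&dev, bytes) == cudaSuccess) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaError_t err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // wait until the stop event is reached

        float ms = 0.0f;
        if (err == cudaSuccess &&
            cudaEventElapsedTime(&ms, start, stop) == cudaSuccess && ms > 0.0f) {
            gbps = (bytes / 1.0e9f) / (ms / 1.0e3f);
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dev);
    }

    if (pinned) cudaFreeHost(host); else std::free(host);
    return gbps;
}
```

Calling it with the same total size split into many small copies versus one large copy is a simple way to observe the "group data transfers" recommendation as well.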
VI. Scalability for 3DFD (Nvidia code)
Example: Multi-GPU implementation for 3DFD
3DFD is a finite differences code for the discretization of the seismic wave equation:
- 8th order in space, 2nd order in time.
- Using a regular mesh.
Fixed X and Y dimensions, varying Z.
Data is partitioned among GPUs along the Z axis:
- Computation increases with Z; communication (per node) stays constant.
- A GPU has to exchange 4 xy-planes (ghost nodes) with each of its neighbors.
Executed on a cluster with 2 GPUs per node and an Infiniband SDR network.
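The Z-axis partitioning described above can be sketched with host-side code. This is an illustration only: the names `Slab`, `partitionZ`, and `haloBytes` are hypothetical, the even split with remainder is one possible choice, and single-precision values are assumed for the byte count.

```cuda
#include <cstddef>

// Planes exchanged with each neighbor, as stated on the slide
// (half the width of the 8th-order stencil).
const int kGhostPlanes = 4;

// Hypothetical description of the slab of xy-planes owned by one GPU.
struct Slab {
    int zBegin;      // first owned plane (inclusive)
    int zEnd;        // last owned plane (exclusive)
    int ghostBelow;  // ghost planes received from the lower neighbor
    int ghostAbove;  // ghost planes received from the upper neighbor
};

// Splits nz planes among numGpus as evenly as possible; the first
// (nz % numGpus) ranks receive one extra plane.
Slab partitionZ(int nz, int numGpus, int rank) {
    int base = nz / numGpus;
    int extra = nz % numGpus;
    Slab s;
    s.zBegin = rank * base + (rank < extra ? rank : extra);
    s.zEnd = s.zBegin + base + (rank < extra ? 1 : 0);
    s.ghostBelow = (rank > 0) ? kGhostPlanes : 0;
    s.ghostAbove = (rank < numGpus - 1) ? kGhostPlanes : 0;
    return s;
}

// Bytes sent to ONE neighbor per time step, assuming float data.
// The result does not depend on nz.
size_t haloBytes(int nx, int ny) {
    return static_cast<size_t>(kGhostPlanes) * nx * ny * sizeof(float);
}
```

Because `haloBytes` depends only on the X and Y dimensions, growing Z adds computation per GPU while the per-neighbor communication volume stays fixed, which is the property the slide relies on.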
Performance for a couple of GPUs. Linear scaling is achieved when computation time exceeds communication time.
Three or more cluster nodes. Times are per cluster node. At least one cluster node needs two MPI communications, one with each of its neighbors.
Performance with 8 GPUs. An 8x improvement factor is sustained at Z > 1300, exactly where computation exceeds communication.
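The scaling rule stated in this section can be expressed as a simple timing model. This is a sketch under explicit assumptions that are not in the original deck: communication is fully overlapped with computation, computation time is proportional to the number of planes owned by a GPU, and communication time per step is constant. All function and parameter names are hypothetical, and the inputs are in arbitrary time units.

```cuda
#include <cmath>

// Time of one step on numGpus GPUs: the longer of the per-GPU computation
// and the (overlapped) communication.
double stepTime(double tCompPerPlane, int nz, int numGpus, double tComm) {
    double comp = tCompPerPlane * nz / numGpus;
    return comp > tComm ? comp : tComm;
}

// Speedup over a single GPU, which computes all nz planes and does not
// communicate.
double speedup(double tCompPerPlane, int nz, int numGpus, double tComm) {
    return (tCompPerPlane * nz) / stepTime(tCompPerPlane, nz, numGpus, tComm);
}

// Smallest Z dimension at which per-GPU computation covers communication,
// so that the model predicts linear scaling.
int minZForLinearScaling(double tCompPerPlane, int numGpus, double tComm) {
    return static_cast<int>(std::ceil(tComm * numGpus / tCompPerPlane));
}
```

In this model the speedup equals the GPU count once Z passes the threshold and falls off below it, which matches the qualitative behavior reported for the 8-GPU run.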
Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards eddie.edwards@imperial.ac.uk At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components
More informationChapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
More informationv1 System Requirements 7/11/07
v1 System Requirements 7/11/07 Core System Core-001: Windows Home Server must not exceed specified sound pressure level Overall Sound Pressure level (noise emissions) must not exceed 33 db (A) SPL at ambient
More informationOpenCL Programming for the CUDA Architecture. Version 2.3
OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different
More informationAccelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
More informationBLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
More information22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
More informationATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group
ATI Radeon 4800 series Graphics Michael Doggett Graphics Architecture Group Graphics Product Group Graphics Processing Units ATI Radeon HD 4870 AMD Stream Computing Next Generation GPUs 2 Radeon 4800 series
More information