Evolution of the NVIDIA GPU Architecture. Jason Lowden, Advanced Computer Architecture, November 7, 2012
1 Evolution of the NVIDIA GPU Architecture Jason Lowden Advanced Computer Architecture November 7, 2012
2 Agenda
- Introduction of the NVIDIA GPU
- Graphics Pipeline
- GPU Terminology
- Architecture of a GPU
- Computing Elements
- Memory Types
- Fermi Architecture
- Kepler Architecture
- GPUs as a Computational Device
- CUDA Programming
- Performance Comparison
- Relation to SMT, Vector Processors, and DSPs
- Summary
3 NVIDIA GPU History
- First GPU released in 1999
  - Used for graphics processing
  - GeForce and Quadro
- CUDA architecture released in 2006
  - Designed for use by industry and academia as a computing device
  - Move towards commodity parallel processing
- Tesla GPU series released in 2007
- Fermi architecture released in 2009
- Kepler architecture released in 2012
4 Graphics Pipeline
5 Terminology
- Thread: the smallest grain of the hierarchy of device computation
- Block: a group of threads
- Grid: a group of blocks
- Warp: a group of 32 threads that are executed simultaneously on the device
- Kernel: a function executed on the GPU; launching a kernel creates a grid
6 Architecture of a GPU
- Same components as a typical CPU; however:
  - More computing elements
  - More types of memory
- Original GPUs had vertex and pixel shaders
  - Specifically for graphics
- Modern GPUs are slightly different
  - CUDA: Compute Unified Device Architecture
7 Computational Elements of a GPU
- Streaming Processor (SP)
  - Core of the design
  - Place where all of the computation takes place
- Streaming Multiprocessor (SM)
  - Groups of streaming processors
  - In addition to the SPs, these also contain the Special Function Units and Load/Store Units
- Instruction Schedulers
- Complex Control Logic
8 Streaming Multiprocessor Architecture
9 Types of GPU Memory
- Global
  - DRAM
  - Slowest performance
- Texture
  - Cached global memory
  - Bound at runtime
- Constant
  - Cached global memory
- Shared
  - Local to a block of threads
10 Architectural Memory Hierarchy
11 Fermi Architecture
12 Fermi Improvements
- Increase in the number of SPs per SM
- Unified request path for load/store instructions
- Implementation of a cache hierarchy
  - L1 cache per SM
    - Configurable with shared memory
  - L2 cache is shared globally
- Register spilling
  - Occurs when the register requirements of a thread exceed what is available on the device
  - Previous generation: spill to DRAM (global memory)
  - Fermi: use of the L1 cache
13 Summary
14 Kepler SM Overview
- Goal: improve GPU performance and power efficiency
  - Improved to 3 times the performance per watt of Fermi
- Increased to 192 SPs per SM
- 32 Special Function Units
- Improved warp scheduling
15 Kepler SM Design
16 Warp Scheduler
- 4 warp schedulers
- Each scheduler can issue up to 2 independent instructions from a warp once that warp is ready to issue.
17 Kepler Memory Architecture
- Shared memory and L1 are still physically shared
  - New configuration: 32 KB L1, 32 KB shared
  - Shared memory bandwidth is doubled compared with Fermi
- Increased the size of L2
  - Doubled relative to Fermi, to 1536 KB
- Introduction of a read-only cache
  - In Fermi, this storage was used as the texture cache
  - 48 KB of storage
18 Warp Shuffle Instructions
- In Fermi, data could only be exchanged between threads using shared memory.
  - Resulted in additional synchronization time
- Kepler adds shuffle functions, which
  - Exchange data between threads without using shared memory
  - Handle the store and load operation as a single step
- Data can only be shared within the same warp
- In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction.
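The shuffle exchange described on this slide can be sketched as a warp-level sum. This is a minimal illustration, not code from the presentation: the names `warpSum`, `warpSumKernel`, and `runWarpSum` are hypothetical, the kernel is assumed to be launched with exactly one warp of 32 threads, and error checking is omitted. It uses `__shfl_down_sync`, the form required since CUDA 9; Kepler-era toolkits exposed the same operation as `__shfl_down` without the mask argument.

```cuda
#include <cuda_runtime.h>

// Sum one value per lane across a warp using shuffle instructions only
// (no shared memory, no __syncthreads()).
__device__ float warpSum(float v) {
    const unsigned mask = 0xffffffffu;  // all 32 lanes participate
    // Each step adds the value held by the lane `offset` positions higher.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(mask, v, offset);
    }
    return v;  // lane 0 now holds the sum over the whole warp
}

// Assumes a launch of exactly one warp: <<<1, 32>>>.
__global__ void warpSumKernel(const float* in, float* out) {
    float total = warpSum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = total;
    }
}

// Hypothetical host helper: sums the values 1..32 on the GPU.
float runWarpSum() {
    float hIn[32];
    for (int i = 0; i < 32; ++i) {
        hIn[i] = float(i + 1);
    }

    float* dIn = nullptr;
    float* dOut = nullptr;
    float hOut = 0.0f;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    warpSumKernel<<<1, 32>>>(dIn, dOut);

    // This blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

Calling `runWarpSum()` from a host `main()` should return 528, the sum of 1 through 32. A shared-memory version of the same reduction would need a store, a barrier, and a load at every step.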
19 Kepler Hardware Features
- Dynamic Parallelism
  - Any kernel can launch more kernels from within itself
  - Takes additional load off of the CPU
- Hyper-Q
  - 32 hardware-managed work queues
  - Fermi had 1 queue
- Grid Management Unit (GMU)
  - Needed to manage the number of grids that are executed
  - Introduced to handle all of the grids that can be active at one time
- NVIDIA GPUDirect™
  - Ability for CUDA-enabled GPUs to interact without the need for CPU intervention
  - The GPU can interact directly with the NIC
20 Comparison of Kepler and Fermi
21 Use for Computation
- Historically, GPUs were used for graphics to offload CPU work
- Current trend
  - Combine CPU and GPU on a single chip
- With their large number of processing cores, GPUs are well suited to massively parallel computations.
- However, they are only ideal when there are few data dependencies.
- Introduction of CUDA and the Tesla GPUs
22 CUDA Programming
- Extensions to the C language
  - With some C++ support
- Programming support
  - Windows: Visual Studio
  - Linux/Mac: Eclipse
- Programming paradigm where each computation takes place on a separate thread
- Requires an NVIDIA GPU for acceleration
  - Simulators are used for research purposes
23 Example: Vector Addition
C
    for( int i = 0; i < SIZE; ++i ) {
        c[ i ] = a[ i ] + b[ i ];
    }
CUDA
    __global__ void addvectors( float* a, float* b, float* c ) {
        // One thread per element; the guard skips threads beyond SIZE.
        int id = threadIdx.x;
        if( id < SIZE ) {
            c[ id ] = a[ id ] + b[ id ];
        }
    }
24 Programming Requirements
- Explicit memory operations to allocate memory and copy data from the CPU to the GPU
  - Some exceptions do apply
- All kernels execute asynchronously with respect to the CPU
  - Explicit synchronization barriers between the processors
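The requirements above can be illustrated with a host-side sketch around the vector addition example. It is a sketch under stated assumptions rather than the presenter's code: `addVectorsN` and `runVectorAdd` are hypothetical names, the kernel takes the length as a parameter and uses a multi-block index instead of the `SIZE` constant, `n` is assumed to be positive, and only the synchronization call is checked for errors.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical multi-block variant of the vector addition kernel.
__global__ void addVectorsN(const float* a, const float* b, float* c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}

// Hypothetical host helper: returns the number of wrong results,
// or -1 if the kernel reported an error. Assumes n > 0.
int runVectorAdd(int n) {
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        a[i] = float(i);
        b[i] = 2.0f * float(i);
    }

    // Explicit allocation of device (GPU) memory.
    const size_t bytes = size_t(n) * sizeof(float);
    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Explicit copies from the CPU to the GPU.
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // The launch returns to the CPU immediately.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    addVectorsN<<<blocks, threads>>>(dA, dB, dC, n);

    // Explicit barrier: wait for the GPU before using its results.
    cudaError_t status = cudaDeviceSynchronize();

    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (status != cudaSuccess) {
        return -1;
    }

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != 3.0f * float(i)) {
            ++wrong;
        }
    }
    return wrong;
}
```

Calling `runVectorAdd(1000)` from a host `main()` should return 0 on a working CUDA device. The blocking `cudaMemcpy` on the default stream would also wait for the kernel; the separate `cudaDeviceSynchronize()` is kept only to make the barrier visible.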
25 Synchronization and Performance
- To meet data dependencies:
  - Synchronization primitives
    - __syncthreads(): synchronizes all threads in a block
  - Atomic operations
    - Depending on the compute capability and CUDA version, these are possible on global and shared memory
- Performance is dictated by memory operations and synchronization cost
  - Memory coalescing
  - Warp divergence
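Both primitives can be seen together in a block-level sum: threads of a block cooperate through shared memory with `__syncthreads()`, and blocks combine their partial results with `atomicAdd` on global memory. This is a hedged sketch, not code from the presentation: `sumKernel`, `runBlockSum`, and `kThreads` are hypothetical names, the block size is assumed to equal `kThreads` (a power of two), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

__global__ void sumKernel(const int* in, int* total, int n) {
    __shared__ int partial[kThreads];  // visible to all threads of this block

    // Consecutive threads read consecutive elements (coalesced access).
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (id < n) ? in[id] : 0;
    __syncthreads();  // every load must finish before any thread reads partial[]

    // Tree reduction: the number of active threads halves at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();  // reached by all threads, outside the divergent branch
    }

    // One atomic update per block combines the per-block sums.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host helper: sums 1..n on the GPU. Assumes n > 0.
int runBlockSum(int n) {
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) {
        h[i] = i + 1;
    }

    int* dIn = nullptr;
    int* dTotal = nullptr;
    cudaMalloc(&dIn, size_t(n) * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dIn, h.data(), size_t(n) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));

    const int blocks = (n + kThreads - 1) / kThreads;
    sumKernel<<<blocks, kThreads>>>(dIn, dTotal, n);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotal);
    return total;
}
```

Calling `runBlockSum(1000)` from a host `main()` should return 500500. Integer data is used so that the result does not depend on the order in which the atomic updates happen to arrive.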
26 Performance Comparison
27 Relation to Other Architectures
- SMT
  - Many smaller cores, with less functionality, to compute results
  - Each core has a hardware context for a thread that can be switched out
- Vector Processors
  - Computation of results in parallel that could be done sequentially by a CPU
  - Ability to access large chunks of data from memory at a given time
  - Banks of shared memory could lead to bank conflicts
- Digital Signal Processors
  - As with DSP algorithms, many applications can use multiply-accumulate (MAC) elements; these are built into the GPU by design
28 Conclusions
- GPUs are massively parallel devices that can be used for general-purpose computing, in addition to graphics processing.
- As the cost continues to decrease, these devices become off-the-shelf components that can be used to build larger systems.
- In addition to compute capabilities, Kepler offers the benefit of additional performance per watt, making for a more power-efficient design.
- When used with other technologies, like OpenCL, GPUs can be used in heterogeneous platforms.
29 References
- S. L. Alarcon, CUDA Memories, unpublished.
- NVIDIA. (2012, April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: amming_guide.pdf.
- NVIDIA. (2009). NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi. [Online]. Available: cture_whitepaper.pdf.
- NVIDIA. (2012). NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110. [Online]. Available: Kepler GK110 Architecture Whitepaper.pdf.
- NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: GTX 680 Whitepaper FINAL.pdf