CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
|
|
|
- Darren Stephens
- 10 years ago
- Views:
Transcription
1 CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA
2 What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2
3 What Does the Application Do? It does not matter!!! We care about memory accesses, instructions, latency, Companion code: 3
4 4
5 Our Method Trace the application Identify the hot spot and profile it Identify the performance limiter Memory Bandwidth Instruction Throughput Latency Optimize the code Iterate 5
6 Our Environment We use Nvidia Tesla K20c (GK110, SM 3.5), ECC OFF, Microsoft Windows 7 x64, Microsoft Visual Studio 2012, CUDA 5.5, Nvidia Nsight
7 ITERATION 1 7
8 Trace the Application (Nsight VSE)
9 CUDA Launch Summary (Nsight VSE) spmv_kernel_v0 is a hot spot, let s start here!!! Kernel Time Speedup Original version ms 9
10 Trace the Application (NVVP)
11 Trace the Application (nvprof)
12 Profile the Most Expensive Kernel (Nsight VSE)
13 CUDA Launches (Nsight VSE) 13
14 Profile the Most Expensive Kernel (NVVP)
15 Profile the Most Expensive Kernel (nvprof)
16 Identify the Main Limiter Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? 16
17 Memory Bandwidth SM SM Registers SMEM/L1$ Registers SMEM/L1$ L2$ DRAM (Global Memory)
18 Memory Bandwidth Utilization of DRAM Bandwidth: 29.14% We are not limited by the memory bandwidth (< 70-80%) 18
19 Memory Bandwidth (nvprof) Utilization of DRAM Bandwidth: 31.86% Read Write Total Bandwidth (GB/s) Utilization (%) Peak BW (K20c): 208GB/s We are not limited by the memory bandwidth 19
20 Instruction Throughput SM Registers Instructions go to the pipes Load Store Control Flow ALU Texture Issue 1 or 2 instructions every cycle SMEM/L1$ We cannot if a pipe is saturated
21 Instruction Throughput All pipes are underutilized: <70-75% We are not limited by instruction throughput 21
22 Instruction Throughput All pipes have Low/Mid utilization Load/Store Control Flow ALU Texture Utilization Mid Low Low Idle We are not limited by instruction throughput
23 Guided Analysis (Nvvp)
24 Latency First two things to check: Occupancy Memory accesses (coalesced/uncoalesced accesses) Other things to check (if needed): Control flow efficiency (branching, idle threads) Divergence Bank conflicts in shared memory 24
25 Latency (Occupancy) Occupancy: 55.98% Achieved / 62.50% Theoretical It s not too high but not too low: Hard to say
26 Latency (Occupancy) Guided Analysis (Nvvp): Theoretical occupancy is less than 100% but is large enough that increasing occupancy may not improve performance
27 Latency (Occupancy) SM Eligible Warps per Active Cycle: Registers Sched Sched Sched Sched Units Units Units Units SMEM/L1$ Occupancy is not an issue (> 4)
28 Memory Transactions Warps of threads (32 threads) L1 transaction: 128B Alignment: 128B (0, 128, 256, ) L2 transaction: 32B Alignment: 32B (0, 32, 64, 96, )
29 Memory Transactions (fp32) Ideal case: 32 aligned and consecutive fp32 numbers 1x L1 transaction: 128B needed / 128B transferred 4x L2 transactions: 128B needed / 128B transferred
30 Memory Transactions (fp64) Ideal case: 32 aligned and consecutive fp64 numbers 2x L1 transactions: 256B needed / 256B transferred 8x L2 transactions: 256B needed / 256B transferred
31 Memory Transactions (fp64) Worst case: 32 fp64 with a stride of 128B (16x fp64) 32x L1 transactions: 256B needed / 32x128B transferred 32x L2 transactions: 256B needed / 32x32B transferred
32 Memory Transactions (fp64) Misaligned: 32 fp64 3x L1 transactions: 256B needed / 384B transferred 9x L2 transactions: 256B needed / 288B transferred
33 Memory Transactions Broadcast: 1 fp64 32 threads 1 L1 transaction: 8B needed / 128B transferred 1 L2 transaction: 8B needed / 32B transferred
34 Replays A Memory Request: LD/ST instruction The 1 st transaction is issued Other transactions induce replays Note: For each fp64 request, we have at least 1 replay
35 Latency (Memory Accesses) Transactions per Request: loads / 8 stores We have too many uncoalesced accesses!!! 35
36 Where Do Those Accesses Happen? (Nsight VSE) CUDA Source Profiler (Nsight VSE): Where are uncoalesced requests (need to compile with lineinfo) Tip: Sort L2 Global Transactions Executed 36
37 Where Do Those Accesses Happen? (Nvvp) 1/ Select a run 2/ Click on Unguided Analysis 3/ Run Memory Access Pattern 4/ Something goes wrong at those lines
38 Access Pattern Double precision numbers: 64-bit Thread 0 Thread 1 Thread 2 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) Per Warp: Up to 32 L1 Transactions / Ideal case: 2 Transactions Up to 32 L2 Transactions / Ideal case: 8 Transactions 38
39 Access Pattern Next iteration: Thread 0 Thread 1 Thread 2 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) Idea: Use the Read-only cache (LDG load) On Fermi: Use a texture or Use 48KB for L1 39
40 First Modification: Use ldg LD (DRAM-L2$-L1$-Reg) LDG (DRAM-L2$-TEX$-Reg) LD SM SM LDG Registers Registers LD/ST Ctrl Flow ALU TEX LD/ST Ctrl Flow ALU TEX L1$ TEX$ L1$ TEX$ L2$ 40
41 First Modification: Use ldg We change the source code: It is slower: 625.8ms Kernel Time Speedup Original version ms LDG to load A ms 0.83x 41
42 First Modification: Use ldg No benefit from the read-only cache: Hit rate at 3.3% Worse hit rate in L2$: 15.8% compared to 36.7% 42
43 First Modification: Use ldg Eligible Warps per Active Cycle has dropped to
44 First Modification: Use ldg Warps cannot issue because they have to wait 44
45 First Modification: Use ldg The loads compete for the cache too much Low hit rate: 3.3% Texture requests introduce too much latency (in that case) Things to check in those cases: Texture Hit Rate: Low means no reuse Issue Efficiency and Stall Reasons It was actually expected: GPU caches are not CPU caches!!! 45
46 First Modification: Use ldg Other accesses may benefit from LDGs Memory blocks accessed several times by several threads How can we detect it? Source code analysis There is no way to detect it from Nsight 46
47 First Modification: Use ldg We change the source code In y = Ax, we use ldg when loading x It s faster: 98.30ms Kernel Time Speedup Original version ms LDG to load A ms 0.83x LDG to load X 98.30ms 1.07x 47
48 First Modification: Use ldg Good hit rate in Texture Cache: 82% Slightly less data transferred from L2 (94MB vs 110MB) 48
49 ITERATION 2 49
50 CUDA Launch Summary spmv_kernel_v2 is still a hot spot, so we profile it 50
51 Identify the Main Limiter Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? 51
52 Identify the Main Limiter We are still limited by latency Low DRAM utilization: 29.95% Pipe utilization is Low/Mid: <70-75% 52
53 Identify the Main Limiter We are not limited by the Occupancy We have > 4 Eligible Warps per Active Cycle (8.16) Too many uncoalesced accesses: 40.79% of Replay Overhead 53
54 Second Strategy: Change Memory Accesses 4 consecutive threads load 4 consecutive elements Threads 0, 1, 2, 3 Threads 4, 5, 6, 7 Threads 8, 9, 10, 11 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) Per Warp: Up to 8 L1 Transactions / Ideal case: 2 Transactions Up to 8 L2 Transactions / Ideal case: 8 Transactions 54
55 Second Strategy: Change Memory Accesses It s much faster: 45.61ms Kernel Time Speedup Original version ms LDG to load A ms 0.83x LDG to load X 98.30ms 1.07x Coalescing with 4 Threads 45.61ms 2.30x 55
56 Second Strategy: Change Memory Accesses We have much fewer Transactions per Request: 5.16 (LD) 56
57 Second Strategy: Change Memory Accesses Much less traffic from L2: 28.27MB (it was MB) Much less DRAM traffic: 25.93MB (it was 69.59MB)
58 ITERATION 3 58
59 CUDA Launch Summary spmv_kernel_v3 is still a hot spot, so we profile it 59
60 Identify the Main Limiter Is it limited by the memory bandwidth? Is it limited by the instruction throughput? Is it limited by latency? 60
61 Identify the Main Limiter We are still limited by latency Low DRAM utilization: 36.01% Pipe Utilization is still Low/Mid 61
62 Latency Eligible Warps per Active Cycle: 6.70 on average We are not limited by occupancy 62
63 Latency Memory Accesses: Load: 5.16 Transactions per Request Store: 2 Transactions per Request We still have too many uncoalesced accesses 63
64 Latency We still have too many uncoalesced accesses Nearly 68.44% of Instruction Serialization (Replays) Stall Reasons: 43.14% due to Data Requests 64
65 Latency Serialization: (Inst. Issued Inst. Executed) / Inst. Issued Inst. Replay Overhead: Avg. Number of replays per Inst. Inst. Issued = 1 + Avg. Number of Replays Inst. Replay Overhead Inst. Replay Overhead / (1 + Inst. Replay Overhead ) %
66 Latency Issue Stall Reasons Stall Reasons Instruction Fetch 0.58% Execution Dependency 32.51% Data Request 37.85% Texture 0.41% Sync 0.00% Other 15.13%
67 Latency
68 Where Do Those Accesses Happen? Same lines of code as before 68
69 What Can We Do? In our kernel: 4 threads per row of the matrix A Threads 0, 1, 2, 3 Threads 4, 5, 6, 7 Threads 8, 9, 10, 11 L2 Transaction (32B) L1 Transaction (128B) L2 Transaction (32B) L2 Transaction (32B) New approach: 1 warp of threads per row of the matrix A Threads 0, 1, 2, 3,, 31 (some possibly idle) L2 Transaction (32B) L2 Transaction (32B) L2 Transaction (32B) 69
70 One Warp Per Row It s faster: 37.50ms Kernel Time Speedup Original version ms LDG to load A ms 0.83x LDG to load X 98.30ms 1.07x Coalescing with 4 Threads 45.61ms 2.30x 1 Warp per Row 37.50ms 2.79x 70
71 One Warp Per Row Much fewer Transactions Per Request: 1.37 (LD) / 1 (ST) 71
72 ITERATION 4 72
73 One Warp Per Row spmv_kernel_v4 is the hot spot 73
74 One Warp Per Row DRAM utilization: 37.36% Pipe Utilization is Low/Mid We are still limited by latency
75 One Warp Per Row Occupancy and memory accesses are OK (not shown) Control Flow Efficiency: 87.31% 75
76 Control Flow Efficiency All threads work: 100% Some threads do nothing: Less efficiency if( threadidx.x % 32 < 24 ) { // do some long computation } Efficiency = 24/32 = 75%
77 Control Flow Efficiency Low efficiency in one the key loop: 69.9%
78 One Half Warp Per Row It is faster: 35.81ms Kernel Time Speedup Original version ms LDG to load A ms 0.73x LDG to load X 98.30ms 1.07x Coalescing with 4 Threads 45.61ms 2.30x 1 Warp per Row 37.50ms 2.79x ½ Warp per Row 35.81ms 2.93x 78
79 ITERATION 5 79
80 One Half Warp Per Row DRAM utilization: 48.18% Pipe Utilization is Low/Mid We are still limited by latency 80
81 One Half Warp Per Row Occupancy is not an issue Memory accesses are good enough 81
82 One Half Warp Per Row Branch divergence induce latency We have 16.32% of divergent branches 82
83 Branch Divergence
84 Branch Divergence Execution Time = Time of If branch + Time of Else branch if( threadidx.x % 32 < 24 ) { // do some long computation } else { // do some long computation }
85 One Half Warp Per Row We fix branch divergence It is faster: 29.60ms Kernel Time Speedup Original version ms LDG to load A ms 0.83x LDG to load X 98.30ms 1.07x Coalescing with 4 Threads 45.61ms 2.30x 1 Warp per Row 37.50ms 2.79x ½ Warp per Row 35.81ms 2.93x No divergence 29.60ms 3.54x 85
86 One Half Warp Per Row DRAM utilization: 60.80% We are still far from the peak
87 So Far We have consecutively: Improved caching using ldg (use with care) Improved coalescing Improved control flow efficiency Improved branching Our new kernel is 3.5x faster than our first implementation Tools helped us a lot 87
88 88
89 ITERATION 6 89
90 Next Kernel We are satisfied with the performance of spmv_kernel We move to the next kernel: jacobi_smooth 90
91 91
92 What Have You Seen? An iterative method to optimize your GPU code Trace your application Identify the hot spot and profile it Identify the performance limiter Optimize the code Iterate A way to conduct that method with Nvidia tools 92
Guided Performance Analysis with the NVIDIA Visual Profiler
Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided
GPU Performance Analysis and Optimisation
GPU Performance Analysis and Optimisation Thomas Bradley, NVIDIA Corporation Outline What limits performance? Analysing performance: GPU profiling Exposing sufficient parallelism Optimising for Kepler
Optimizing Application Performance with CUDA Profiling Tools
Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing
CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
NVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
GPU Tools Sandra Wienke
Sandra Wienke Center for Computing and Communication, RWTH Aachen University MATSE HPC Battle 2012/13 Rechen- und Kommunikationszentrum (RZ) Agenda IDE Eclipse Debugging (CUDA) TotalView Profiling (CUDA
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA)
Mitglied der Helmholtz-Gemeinschaft CUDA Tools for Debugging and Profiling Jiri Kraus (NVIDIA) GPU Programming@Jülich Supercomputing Centre Jülich 7-9 April 2014 What you will learn How to use cuda-memcheck
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
1. If we need to use each thread to calculate one output element of a vector addition, what would
Quiz questions Lecture 2: 1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to data index: (A) i=threadidx.x
MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA
MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
OpenCL Programming for the CUDA Architecture. Version 2.3
OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
GPU Hardware Performance. Fall 2015
Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using
GPU Computing. The GPU Advantage. To ExaScale and Beyond. The GPU is the Computer
GU Computing 1 2 3 The GU Advantage To ExaScale and Beyond The GU is the Computer The GU Advantage The GU Advantage A Tale of Two Machines Tianhe-1A at NSC Tianjin Tianhe-1A at NSC Tianjin The World s
Hands-on CUDA exercises
Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished
CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization
CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization Michael Bauer Stanford University [email protected] Henry Cook UC Berkeley [email protected] Brucek Khailany NVIDIA Research [email protected]
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
The Fastest, Most Efficient HPC Architecture Ever Built
Whitepaper NVIDIA s Next Generation TM CUDA Compute Architecture: TM Kepler GK110 The Fastest, Most Efficient HPC Architecture Ever Built V1.0 Table of Contents Kepler GK110 The Next Generation GPU Computing
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
GPU Programming Strategies and Trends in GPU Computing
GPU Programming Strategies and Trends in GPU Computing André R. Brodtkorb 1 Trond R. Hagen 1,2 Martin L. Sætra 2 1 SINTEF, Dept. Appl. Math., P.O. Box 124, Blindern, NO-0314 Oslo, Norway 2 Center of Mathematics
How To Test Your Code On A Cuda Gdb (Gdb) On A Linux Computer With A Gbd (Gbd) And Gbbd Gbdu (Gdb) (Gdu) (Cuda
Mitglied der Helmholtz-Gemeinschaft Hands On CUDA Tools and Performance-Optimization JSC GPU Programming Course 26. März 2011 Dominic Eschweiler Outline of This Talk Introduction Setup CUDA-GDB Profiling
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0
Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What
TEGRA X1 DEVELOPER TOOLS SEBASTIEN DOMINE, SR. DIRECTOR SW ENGINEERING
TEGRA X1 DEVELOPER TOOLS SEBASTIEN DOMINE, SR. DIRECTOR SW ENGINEERING NVIDIA DEVELOPER TOOLS BUILD. DEBUG. PROFILE. C/C++ IDE INTEGRATION STANDALONE TOOLS HARDWARE SUPPORT CPU AND GPU DEBUGGING & PROFILING
GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 2 Shared memory in detail
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
Texture Cache Approximation on GPUs
Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, [email protected] 1 Our Contribution GPU Core Cache Cache
AMD GPU Architecture. OpenCL Tutorial, PPAM 2009. Dominik Behr September 13th, 2009
AMD GPU Architecture OpenCL Tutorial, PPAM 2009 Dominik Behr September 13th, 2009 Overview AMD GPU architecture How OpenCL maps on GPU and CPU How to optimize for AMD GPUs and CPUs in OpenCL 2 AMD GPU
HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
CUDA Programming. Week 4. Shared memory and register
CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK
Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup
Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1
Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University
CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set
Intel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
Architecture of Hitachi SR-8000
Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data
RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29
RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for redundant data storage Provides fault tolerant
Memory Hierarchy. Arquitectura de Computadoras. Centro de Investigación n y de Estudios Avanzados del IPN. [email protected]. MemoryHierarchy- 1
Hierarchy Arturo Díaz D PérezP Centro de Investigación n y de Estudios Avanzados del IPN [email protected] Hierarchy- 1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor
ANDROID DEVELOPER TOOLS TRAINING GTC 2014. Sébastien Dominé, NVIDIA
ANDROID DEVELOPER TOOLS TRAINING GTC 2014 Sébastien Dominé, NVIDIA AGENDA NVIDIA Developer Tools Introduction Multi-core CPU tools Graphics Developer Tools Compute Developer Tools NVIDIA Developer Tools
The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy
The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008
Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller [email protected] Rechen- und Kommunikationszentrum
Advanced CUDA Webinar. Memory Optimizations
Advanced CUDA Webinar Memory Optimizations Outline Overview Hardware Memory Optimizations Data transfers between host and device Device memory optimizations Summary Measuring performance effective bandwidth
COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING
COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.
Programming GPUs with CUDA
Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University [email protected] COMP 422 Lecture 23 12 April 2016 Why GPUs? Two major trends GPU performance is pulling away from
Performance Counters. Microsoft SQL. Technical Data Sheet. Overview:
Performance Counters Technical Data Sheet Microsoft SQL Overview: Key Features and Benefits: Key Definitions: Performance counters are used by the Operations Management Architecture (OMA) to collect data
CUDA Basics. Murphy Stein New York University
CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture
Profiler User's Guide
Version 2016 www.pgroup.com TABLE OF CONTENTS Profiling Overview... iv What's New... iv Terminology... v Chapter 1. Preparing An Application For Profiling...1 1.1. Focused Profiling...1 1.2. Marking Regions
Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
Path Tracing Overview
Path Tracing Overview Cast a ray from the camera through the pixel At hit point, evaluate material Determine new incoming direction Update path throughput Cast a shadow ray towards a light source Cast
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010
GPU Architecture An OpenCL Programmer s Introduction Lee Howes November 3, 2010 The aim of this webinar To provide a general background to modern GPU architectures To place the AMD GPU designs in context:
The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
L20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng! Linköping University
Automated Software Testing of Memory Performance in Embedded GPUs Sudipta Chattopadhyay, Petru Eles and Zebo Peng! Linköping University 1 State-of-the-art in Detecting Performance Loss Input Program profiling
NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS
NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS DU-05349-001_v6.0 February 2014 Installation and Verification on TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2.
Learn CUDA in an Afternoon: Hands-on Practical Exercises
Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
GPU-based Decompression for Medical Imaging Applications
GPU-based Decompression for Medical Imaging Applications Al Wegener, CTO Samplify Systems 160 Saratoga Ave. Suite 150 Santa Clara, CA 95051 [email protected] (888) LESS-BITS +1 (408) 249-1500 1 Outline
Pipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida [email protected] [email protected] October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem Kamil Rocki, PhD Department of Computer Science Graduate School of Information Science and Technology The University of
Lecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles [email protected] Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
COMPUTER HARDWARE. Input- Output and Communication Memory Systems
COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)
SIDN Server Measurements
SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources
One of the database administrators
THE ESSENTIAL GUIDE TO Database Monitoring By Michael Otey SPONSORED BY One of the database administrators (DBAs) most important jobs is to keep the database running smoothly, which includes quickly troubleshooting
Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?
Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and
