The GPU Hardware and Software Model: The GPU is not a PRAM (but it's not far off)
1 The GPU Hardware and Software Model: The GPU is not a PRAM (but it's not far off) CS 223 Guest Lecture John Owens Electrical and Computer Engineering, UC Davis
2 Credits This lecture is originally derived from a tutorial at ASPLOS 2008 (March 2008) by David Luebke (NVIDIA Research), Michael Garland (NVIDIA Research), John Owens (UC Davis), and Kevin Skadron (NVIDIA Research/University of Virginia), with additional material from Mark Harris (NVIDIA Ltd.).
3 If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens? Seymour Cray
4 Why is data-parallel computing fast? The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about) So, more transistors can be devoted to data processing rather than data caching and flow control [Figure: CPU vs. GPU die area; the CPU spends transistors on control and cache, the GPU on rows of ALUs, both backed by DRAM]
5 Recent GPU Performance Trends [Figure: historical single- and double-precision peak compute rates (GFLOPS) and memory bandwidth (GB/s; the chart marks 288 GB/s and 250 GB/s) over time for AMD GPUs, NVIDIA GPUs, Intel CPUs, and Intel Xeon Phi; example parts and prices: $388 Radeon HD 7970, $452 GTX 680, $3998 Tesla K20X, $1844 Xeon E5-2687W] Early data courtesy Ian Buck; from Owens et al. [CGF]; thanks to Mike Houston (AMD), Ronak Singhal (Intel), Sumit Gupta (NVIDIA)
6 [Figure: roofline-style plot of double-precision Gflop/s vs. arithmetic intensity (flop : byte) for Fermi, C1060, and Nehalem x 2 platforms, with marked points at (3.6, 515), (0.8, 78), (1.7, 86), and (1.7, 43); the GPU advantage is roughly 3x for case study 1 and roughly 6x for case studies 2 & 3] Courtesy Rich Vuduc, Georgia Tech
7 What I Know About PRAM
8 GPU Truths There is no limit on the number of processors in the machine. True! The programming model abstracts this. Any memory location is uniformly accessible from any processor. For global memory, true, but there's a memory hierarchy. Also, memory coalescing is important. There is no limit on the amount of shared memory in the system. Practically, if an implementation goes out of core memory, it's probably not worth pursuing anyway. Resource contention is absent. Er, not so much. The programs written on these machines are, in general, of type MIMD. Certain special cases such as SIMD may also be handled in such a framework. GPUs are SIMD within a warp, otherwise MIMD.
9 GPU Memory Model The read/write conflicts in accessing the same shared memory location simultaneously are resolved by one of the following strategies: Exclusive Read Exclusive Write (EREW): every memory cell can be read or written to by only one processor at a time. Concurrent Read Exclusive Write (CREW): multiple processors can read a memory cell but only one can write at a time (you'd better program this way). Exclusive Read Concurrent Write (ERCW): never considered. Concurrent Read Concurrent Write (CRCW): multiple processors can read and write (I guess you could program this way too; we have a hash table code where multiple processors write to the same location at the same time and all we care about is that one of them wins; 'is my sequence sorted?' can also be written like this).
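The benign-write (CRCW-style) pattern just mentioned can be made concrete with a short sketch. This is an illustration, not the lecture's hash-table code: every thread that finds an out-of-order pair writes the same value to one flag, and correctness only needs at least one of those writes to land, so no atomics are required.

    // Sketch (assumed example): "is my sequence sorted?" as a benign concurrent write.
    // Many threads may write 0 to the same flag; we only care that one of them wins.
    __global__ void isSorted(const int *x, int n, int *sortedFlag)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - 1 && x[i] > x[i + 1])
            *sortedFlag = 0;
    }

The host would initialize sortedFlag to 1, launch enough blocks to cover the n - 1 comparisons, and copy the flag back with cudaMemcpy.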
10 Programming Model: A Massively Multi-threaded Processor Move data-parallel application portions to the GPU Differences between GPU and CPU threads Lightweight threads GPU supports 1000s of threads Today: GPU hardware CUDA programming environment
11 Big Idea #1 One thread per data element. Doesn't this mean that large problems will have millions of threads?
12 Big Idea #2 Write one program. That program runs on ALL threads in parallel. Terminology here is 'SIMT': single-instruction, multiple-thread. Roughly: SIMD means many threads run in lockstep; SIMT means that some divergence is allowed and handled by the hardware. This is not so much like the PRAM's MIMD model. Since the GPU has SIMD at its core, the question becomes what the hw/sw do to ease the difficulty of SIMD programming.
13 CUDA Kernels and Threads Parallel portions of an application are executed on the device as kernels One SIMT kernel is executed at a time Many threads execute each kernel Differences between CUDA and CPU threads CUDA threads are extremely lightweight Very little creation overhead Instant switching Definitions: Device = GPU; Host = CPU Kernel = function that runs on the device CUDA must use 1000s of threads to achieve efficiency Multi-core CPUs can use only a few
14 What sort of features do you have to put in the hardware to make it possible to support thousands (or millions) of threads?
15 Graphics Programs Features: millions of instructions, full integer and bit instructions, no limits on branching or looping [Diagram: a thread program reading texture (read only) and constants, with registers and output registers]
16 General-Purpose Programs Features: 1D, 2D, or 3D thread ID allocation; fully general load/store to GPU memory (scatter/gather); programmer flexibility in how memory is accessed; untyped, not limited to fixed texture types; pointer support [Diagram: a thread program with its thread number, texture, constants, and registers, reading and writing global memory and output registers]
17 Parallel Data Cache Features: dedicated on-chip memory; shared between threads for inter-thread communication; explicitly managed; as fast as registers [Diagram: the same thread program with a parallel data cache added alongside registers, global memory, and output registers]
18 Now, if every thread runs its own copy of a program, and has its own registers, how do two threads communicate?
19 Parallel Data Cache Addresses a fundamental problem of stream computing: bring the data closer to the ALU. Stage computation for the parallel data cache. Minimize trips to external memory. Share values to minimize overfetch and computation. Increases arithmetic intensity by keeping data close to the processors. User-managed generic memory; threads read/write arbitrarily. [Diagram: several control + ALU units each computing P_n = P_1 + P_2 + P_3 + P_4; with the parallel data cache, the shared values P_1 .. P_5 are staged on chip rather than refetched from DRAM, giving parallel execution through the cache]
20 This parallel data cache seems pretty cool. Since we can have thousands of threads active at any time, why can t all of them communicate through it?
21 Realities of modern DRAM DRAM {throughput, latency} not increasing as fast as processor capability Vendor response: the DRAM access grain is getting larger Consequence: accessing 1 byte of DRAM costs the same as accessing 128 consecutive bytes Consequence: structure your memory accesses to be contiguous (NVIDIA: 'coalesced')
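A hedged sketch of what contiguous ('coalesced') access looks like in code; the kernels and the struct layout below are illustrative, not from the lecture. Consecutive threads should touch consecutive addresses, so a structure-of-arrays layout usually coalesces while striding through an array-of-structures does not.

    // Illustrative sketch: coalesced vs. strided access.
    struct Particle { float x, y, z, w; };   // array-of-structures layout (16-byte stride)

    // Coalesced: thread i reads element i of a contiguous float array, so one warp
    // touches one contiguous region and few memory transactions are needed.
    __global__ void scaleSoA(float *x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    // Strided: consecutive threads are 16 bytes apart, so a warp's accesses spread
    // across several transactions and waste most of each 128-byte fetch.
    __global__ void scaleAoS(Particle *p, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x *= s;
    }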
22 GPU Computing (G80 GPUs) Processors execute computing threads Thread Execution Manager issues threads 128 Thread Processors Parallel Data Cache accelerates processing [Diagram: the host and input assembler feed the thread execution manager; the 128 thread processors are grouped, each group with a parallel data cache, and connect through a load/store path to global memory]
23 SM Multithreaded Multiprocessor Each SM runs a block of threads SM has 8 SP Thread Processors 32 GFLOPS peak at 1.35 GHz IEEE 754 32-bit floating point Scalar ISA Up to 768 threads, hardware multithreaded 16KB Shared Memory Concurrent threads share data Low latency load/store [Diagram: an SM containing its multithreaded instruction unit (MT IU), SP thread processors, and shared memory]
24 Memory hierarchy [Diagram: each thread has registers plus per-thread local memory; each thread block has per-block shared memory; grids of thread blocks (Grid 0, Grid 1) share global memory, which is cached on Fermi and later GPUs]
25 Big Idea #3 Latency hiding. It takes a long time to go to memory. So while one set of threads is waiting for memory, run another set of threads during the wait. In practice, 32 threads run in a SIMD warp, and an efficient program usually has several warps' worth of threads in a block.
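A hedged sketch of the launch arithmetic this implies (the 256-thread block size is a typical choice, not a number from the lecture): pick a block size that is a multiple of the 32-thread warp and launch enough blocks to cover the data, so each SM has several warps to switch between while others wait on memory.

    // Sketch: a launch configuration with enough resident warps to hide latency.
    // 256 threads = 8 warps per block is an assumed, typical value.
    int threadsPerBlock = 256;                                 // multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n elements
    someKernel<<<blocks, threadsPerBlock>>>(d_data, n);        // someKernel and d_data are placeholders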
26 Scaling the Architecture Same program Scalable performance [Diagram: two GPUs of different sizes; each has a host, input assembler, and thread execution manager, but a different number of thread processors and parallel data caches, all with load/store access to global memory]
27 There are lots of dimensions along which to scale this processor with more resources. What are some of those dimensions? If you had twice as many transistors, what could you do with them?
28 What should you do with more resources? In what dimension do you think NVIDIA will scale future GPUs?
29 HW Goal: Scalability Scalable execution Program must be insensitive to the number of cores Write one program for any number of SM cores Program runs on any size GPU without recompiling Hierarchical execution model Decompose problem into sequential steps (kernels) Decompose kernel into computing parallel blocks Decompose block into computing parallel threads This is very important. Hardware distributes independent blocks to SMs as available
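A hedged sketch of this hierarchical decomposition (the kernel names below are placeholders, not the lecture's code): the problem becomes a sequence of kernel launches, each kernel a grid of independent blocks, and each block a set of threads. The same launch runs unchanged on a GPU with 2 SMs or 30 SMs because the hardware hands blocks to SMs as they become free.

    // Sketch: sequential steps as kernels, each decomposed into blocks of threads.
    // stepOne and stepTwo are hypothetical kernels standing in for real work.
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    stepOne<<<blocks, threads>>>(d_data, n);   // hardware distributes blocks to SMs as available
    stepTwo<<<blocks, threads>>>(d_data, n);   // dependent kernels are ordered on the default stream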
30 Programming Model: A Highly Multi-threaded Coprocessor The GPU is viewed as a compute device that: Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel Data-parallel portions of an application execute on the device as kernels that run many cooperative threads in parallel Differences between GPU and CPU threads GPU threads are extremely lightweight Very little creation overhead Multi-core CPU needs only a few GPU needs 1000s of threads for full efficiency
31 CUDA Software Development Kit [Diagram: CUDA toolchain; integrated CPU + GPU C source code goes through the NVIDIA C compiler, producing NVIDIA assembly for computing (PTX) for the GPU, handled by the CUDA driver with debugger and profiler, and CPU host code for a standard C compiler; CUDA optimized libraries include math.h, FFT, BLAS, ...]
32 Compiling CUDA for GPUs [Diagram: NVCC splits a C/C++ CUDA application into CPU code and generic PTX code; a PTX-to-target translator specializes the PTX into device code for each target GPU]
33 Programming Model (SPMD + SIMD): Thread Batching A kernel is executed as a grid of thread blocks A thread block is a batch of threads that can cooperate with each other by: efficiently sharing data through shared memory; synchronizing their execution for hazard-free shared memory accesses Two threads from two different blocks cannot cooperate Blocks are independent [Diagram: the host launches Kernel 1 over Grid 1, a 3x2 array of blocks, and Kernel 2 over Grid 2; Block (1, 1) expands into a 5x3 array of threads]
34 Execution Model Kernels are launched in grids One kernel executes at a time A block executes on one multiprocessor Does not migrate, does not release resources until exit Several blocks can reside concurrently on one multiprocessor (SM) Control limitations (of compute capability 3.0+ GPUs): At most 16 concurrent blocks per SM At most 2048 concurrent threads per SM Number is further limited by SM resources Register file is partitioned among all resident threads Shared memory is partitioned among all resident thread blocks
35 Execution Model Thread: identified by threadIdx Thread block: identified by blockIdx Grid of thread blocks: result data array Multiple levels of parallelism Thread block: up to 1024 threads per block (CC 3.0+); __syncthreads(); communicate through shared memory; threads guaranteed to be resident; threadIdx, blockIdx Grid of thread blocks: f<<<nblocks, nthreads>>>(a, b, c); global & shared memory atomics
36 Divergence in Parallel Computing Removing divergence pain from parallel programming SIMD: pain; the user is required to SIMD-ify and suffers when computation goes divergent SIMT: threads can diverge freely; divergence is hardware managed; scalable GPUs decouple execution width from the programming model: inefficiency appears only when divergence exceeds the native machine width, so managing divergence becomes a performance optimization
37 CUDA Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel programming language Enable heterogeneous systems (i.e., CPU+GPU) CPU & GPU are separate devices with separate DRAMs
38 Key Parallel Abstractions in CUDA Hierarchy of concurrent threads Lightweight synchronization primitives Shared memory model for cooperating threads
39 Hierarchy of concurrent threads Parallel kernels composed of many threads all threads execute the same sequential program (this is 'SIMT') Threads are grouped into thread blocks threads in the same block can cooperate Threads/blocks have unique IDs Each thread knows its address (thread/block ID) [Diagram: a single thread t; a block b containing threads t0, t1, ..., tB]
40 CUDA: Programming GPU in C Philosophy: provide minimal set of extensions necessary to expose power Declaration specifiers to indicate where things live
    __global__ void KernelFunc(...);  // kernel callable from host
    __device__ void DeviceFunc(...);  // function callable on device
    __device__ int GlobalVar;         // variable in device memory
    __shared__ int SharedVar;         // shared within thread block
Extend function invocation syntax for parallel kernel launch
    KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each
Special variables for thread identification in kernels
    dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
Intrinsics that expose specific operations in kernel code
    __syncthreads();                  // barrier synchronization within kernel
41 CUDA: Features available on GPU Standard mathematical functions: sinf, powf, atanf, ceil, min, sqrtf, etc. Atomic memory operations (not in the class hw): atomicAdd, atomicMin, atomicAnd, atomicCAS, etc. Texture accesses in kernels:
    texture<float,2> my_texture;               // declare texture reference
    float4 texel = texfetch(my_texture, u, v);
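As a hedged illustration of the atomic operations listed above (the histogram kernel is an invented example, not from the lecture): atomicAdd lets many threads safely update the same memory location, which is how CRCW-style accumulation is written without races.

    // Sketch: histogram of byte values with atomicAdd.
    // bins has 256 entries and is zero-initialized by the host (assumptions of this example).
    __global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);   // safe concurrent increment of one shared bin
    }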
42 Example: Vector Addition Kernel Compute vector sum C = A+B means: n = length(C) for i = 0 to n-1: C[i] = A[i] + B[i] So C[0] = A[0] + B[0], C[1] = A[1] + B[1], etc.
43 Example: Vector Addition Kernel (Device Code)
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
44 Example: Vector Addition Kernel (Host Code)
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
45 Synchronization of blocks Threads within a block may synchronize with barriers
    ... Step 1 ...
    __syncthreads();
    ... Step 2 ...
Blocks coordinate via atomic memory operations, e.g., increment a shared queue pointer with atomicInc() Implicit barrier between dependent kernels
    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
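A hedged sketch of the 'shared queue pointer' coordination mentioned above; the kernel and variable names are invented, not the lecture's. Blocks never wait on each other, but each block can atomically claim the next chunk of work from a global counter.

    // Sketch: blocks claim chunks of work through a global queue pointer.
    // workQueueHead is a device counter the host initializes to 0; data has
    // totalChunks * blockDim.x elements (an assumption of this example).
    __global__ void processQueue(int *workQueueHead, int totalChunks, float *data)
    {
        __shared__ int chunk;
        if (threadIdx.x == 0)
            chunk = atomicAdd(workQueueHead, 1);   // claim the next chunk for this block
        __syncthreads();                           // broadcast the chunk index within the block
        if (chunk < totalChunks) {
            int i = chunk * blockDim.x + threadIdx.x;
            data[i] *= 2.0f;                       // placeholder per-element work
        }
    }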
46 What is a thread? Independent thread of execution has its own PC, variables (registers), processor state, etc. no implication about how threads are scheduled CUDA threads might be physical threads as on NVIDIA GPUs CUDA threads might be virtual threads might pick 1 block = 1 physical thread on multicore CPU Very interesting recent research on this topic
47 What is a thread block? Thread block = virtualized multiprocessor freely choose processors to fit data freely customize for each kernel launch Thread block = a (data) parallel task all blocks in kernel have the same entry point but may execute any code they want Thread blocks of kernel must be independent tasks program valid for any interleaving of block executions
48 Blocks must be independent Any possible interleaving of blocks should be valid presumed to run to completion without pre-emption can run in any order can run concurrently OR sequentially Blocks may coordinate but not synchronize shared queue pointer: OK shared lock: BAD can easily deadlock Independence requirement gives scalability
49 Big Idea #4 Organization into independent blocks allows scalability / different hardware instantiations If you organize your kernels to run over many blocks the same code will be efficient on hardware that runs one block at once and on hardware that runs many blocks at once
50 Summary: CUDA exposes parallelism Thread parallelism each thread is an independent thread of execution Data parallelism across threads in a block across blocks in a kernel Task parallelism different blocks are independent independent kernels
51 GPUs are not PRAMs Computational hierarchy, motivated by memory hierarchy Memory layout (coalescing) Cost of execution divergence (SIMD vs. MIMD) Synchronization (not a bulk synchronous model)
52
53 How would you program (in CUDA) two separate (parallel) tasks (to run in parallel)? How would that map to the GPU?
54 Memory model Thread Per-thread Local Memory Block Per-block Shared Memory
55 Using per-block shared memory Variables shared across block
    __shared__ int *begin, *end;
Scratchpad memory
    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // compute on scratch values
    begin[threadIdx.x] = scratch[threadIdx.x];
Communicating values between threads
    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];
56 Memory model Kernel 0 Kernel 1... Per-device Global Memory... Sequential Kernels
57 Memory model [Diagram: host memory connected to Device 0 memory and Device 1 memory via cudaMemcpy()]
58 CUDA: Runtime support Explicit memory allocation returns pointers to GPU memory: cudaMalloc(), cudaFree() Explicit memory copy for host to/from device and device to device: cudaMemcpy(), cudaMemcpy2D(), ... Texture management: cudaBindTexture(), cudaBindTextureToArray(), ... OpenGL & DirectX interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...
59 Example: Vector Addition Kernel
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    }
60 Example: Host code for vecAdd
    // allocate and initialize host (CPU) memory
    float *h_A = ..., *h_B = ...;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, N * sizeof(float));
    cudaMalloc((void**) &d_B, N * sizeof(float));
    cudaMalloc((void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
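The slide stops at the kernel launch; a hedged continuation (h_C is an assumed host buffer, not shown on the slide) would copy the result back and release the device memory:

    // Continuation sketch: retrieve the result and clean up.
    float *h_C = (float*) malloc(N * sizeof(float));
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);  // copy result to host
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);                      // release device memory
    free(h_C);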
61 Example: Parallel Reduction Summing up a sequence with 1 thread: int sum = 0; for(int i=0; i<n; ++i) sum += x[i]; Parallel reduction builds a summation tree each thread holds 1 element stepwise partial sums n threads need log n steps one possible approach: Butterfly pattern
62 Example: Parallel Reduction Summing up a sequence with 1 thread: int sum = 0; for(int i=0; i<n; ++i) sum += x[i]; Parallel reduction builds a summation tree each thread holds 1 element stepwise partial sums n threads need log n steps one possible approach: Butterfly pattern
63 Parallel Reduction for 1 Block
    // INPUT: Thread i holds value x_i
    int i = threadIdx.x;
    __shared__ int sum[blocksize];

    // One thread per element
    sum[i] = x_i;
    __syncthreads();

    for(int bit = blocksize/2; bit > 0; bit /= 2)
    {
        int t = sum[i] + sum[i^bit];
        __syncthreads();
        sum[i] = t;
        __syncthreads();
    }
    // OUTPUT: Every thread now holds sum in sum[i]
64 Example: Serial SAXPY routine Serial program: compute y = α x + y with a loop
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for(int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
Serial execution: call a function
    saxpy_serial(n, 2.0, x, y);
65 Example: Parallel SAXPY routine Parallel program: compute with 1 thread per element
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if( i < n )
            y[i] = a*x[i] + y[i];
    }
Parallel execution: launch a kernel
    uint size = 256;                       // threads per block
    uint blocks = (n + size - 1) / size;   // blocks needed
    saxpy_parallel<<<blocks, size>>>(n, 2.0, x, y);
66 SAXPY in PTX 1.0 ISA
    cvt.u32.u16 $blockid, %ctaid.x;     // Calculate i from thread/block IDs
    cvt.u32.u16 $blocksize, %ntid.x;
    cvt.u32.u16 $tid, %tid.x;
    mad24.lo.u32 $i, $blockid, $blocksize, $tid;
    ld.param.u32 $n, [N];               // Nothing to do if n <= i
    setp.le.u32 $p1, $n, $i;
    @$p1 bra $L_finish;
    mul.lo.u32 $offset, $i, 4;          // Load y[i]
    ld.param.u32 $yaddr, [Y];
    add.u32 $yaddr, $yaddr, $offset;
    ld.global.f32 $y_i, [$yaddr+0];
    ld.param.u32 $xaddr, [X];           // Load x[i]
    add.u32 $xaddr, $xaddr, $offset;
    ld.global.f32 $x_i, [$xaddr+0];
    ld.param.f32 $alpha, [ALPHA];       // Compute and store alpha*x[i] + y[i]
    mad.f32 $y_i, $alpha, $x_i, $y_i;
    st.global.f32 [$yaddr+0], $y_i;
$L_finish: exit;
67 Sparse matrix-vector multiplication Sparse matrices have relatively few non-zero entries Frequently O(n) rather than O(n^2) Only store & operate on these non-zero entries Example: Compressed Sparse Row (CSR) Format
    Non-zero values: Av[7] = { 3, 1, 2, 4, 1, 1, 1 };
    Column indices:  Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };
    Row pointers:    Ap[5] = { 0, 2, 2, 5, 7 };
68 Sparse matrix-vector multiplication
    float multiply_row(uint rowsize,  // number of non-zeros in row
                       uint *Aj,      // column indices for row
                       float *Av,     // non-zero entries for row
                       float *x)      // the RHS vector
    {
        float sum = 0;
        for(uint column = 0; column < rowsize; ++column)
            sum += Av[column] * x[Aj[column]];
        return sum;
    }
    Non-zero values: Av[7] = { 3, 1, 2, 4, 1, 1, 1 };
    Column indices:  Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };
    Row pointers:    Ap[5] = { 0, 2, 2, 5, 7 };
69 Sparse matrix-vector multiplication
    float multiply_row(uint size, uint *Aj, float *Av, float *x);

    void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                       uint num_rows, float *x, float *y)
    {
        for(uint row = 0; row < num_rows; ++row)
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(row_end - row_begin,
                                  Aj + row_begin, Av + row_begin, x);
        }
    }
70 Sparse matrix-vector multiplication
    float multiply_row(uint size, uint *Aj, float *Av, float *x);

    __global__ void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                                  uint num_rows, float *x, float *y)
    {
        uint row = blockIdx.x*blockDim.x + threadIdx.x;
        if( row < num_rows )
        {
            uint row_begin = Ap[row];
            uint row_end   = Ap[row+1];
            y[row] = multiply_row(row_end - row_begin,
                                  Aj + row_begin, Av + row_begin, x);
        }
    }
71 Adding a simple caching scheme
    __global__ void csrmul_cached( ... )
    {
        uint begin = blockIdx.x*blockDim.x, end = begin + blockDim.x;
        uint row = begin + threadIdx.x;

        __shared__ float cache[blocksize];    // array to cache rows
        if( row < num_rows )
            cache[threadIdx.x] = x[row];      // fetch to cache
        __syncthreads();

        if( row < num_rows )
        {
            uint row_begin = Ap[row], row_end = Ap[row+1];
            float sum = 0;
            for(uint col = row_begin; col < row_end; ++col)
            {
                uint j = Aj[col];
                // Fetch from cached rows when possible
                float x_j = (j >= begin && j < end) ? cache[j-begin] : x[j];
                sum += Av[col] * x_j;
            }
            y[row] = sum;
        }
    }
72 Basic Efficiency Rules Develop algorithms with a data parallel mindset Minimize divergence of execution within blocks Maximize locality of global memory accesses Exploit per-block shared memory as scratchpad Expose enough parallelism
73 Summing Up CUDA = C + a few simple extensions makes it easy to start writing basic parallel programs Three key abstractions: 1. hierarchy of parallel threads 2. corresponding levels of synchronization 3. corresponding memory spaces Supports massive parallelism of manycore GPUs