Intro to GPU computing. Spring 2015, Mark Silberstein, Technion.
Slide 2: Serial vs. parallel program. A serial program executes one instruction at a time; a parallel program executes multiple instructions in parallel.
Slide 3: GPUs are parallel processors: throughput-oriented hardware. Example problem: I have 100 apples to eat. (1) High performance: finish one apple faster. (2) High throughput: finish all the apples faster. Performance = parallel hardware + a scalable parallel program!
Slide 4: Simplified GPU model (see the introductory paper on my home page: «GPUs: High-performance Accelerators for Parallel Applications»).
Slide 5: GPU 101. [Diagram: a CPU with its own memory and a GPU with its own memory.]
Slides 6-9: GPU is a co-processor. [Diagram build-up: the CPU runs the computation and offloads work to the GPU as a GPU kernel; CPU and GPU each have their own memory.]
Slide 10: Simple GPU program. Idea: the same set of operations is applied to different data chunks in parallel. Implementation: every thread runs the same code on a different data chunk, and the GPU concurrently runs many parallel threads.
Slides 11-14: Vector sum A = A + B.
Sequential algorithm: for every element i < A.size: A[i] = A[i] + B[i].
Parallel algorithm: in parallel, for every i < A.size: A[i] = A[i] + B[i]. The loop body is instantiated and executed on the GPU in thousands of hardware threads.
Execution on GPU: in parallel, for every threadId < total GPU kernel threads: A[threadId] = A[threadId] + B[threadId].
Slide 15: GPU programming basics.
Slide 16: GPU execution model. A thread is a sequence of sequentially-executed instructions; each thread has private state (registers, stack). Single Instruction Multiple Threads (SIMT) model: the same program, called a kernel, is instantiated and invoked for each thread, and each thread gets a unique ID. There are many threads (tens of thousands); threads wait in a hardware queue until resources become available. All threads share access to GPU memory.
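The "unique ID" above is not handed to the kernel directly; a minimal sketch of how a CUDA kernel typically derives it from the built-in block and thread indices (the kernel and buffer names here are illustrative, not from the slides):

```cuda
__global__ void record_my_id(int* out)
{
    // Global thread ID = block index * threads per block + index within the block.
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    out[id] = id;  // each thread touches only its own element, so there are no races
}
```

Because every thread computes a different id, the same kernel code operates on a different data element in each of its thousands of instances.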
Slide 17: CPU control of the GPU. The GPU* and CPU have different memory spaces, and the GPU is fully managed by the CPU: it cannot* access CPU RAM, cannot* reallocate its own memory, and cannot* start computations on its own. The CPU must allocate memory and transfer the input before kernel invocation, and must transfer the output and clean up memory after kernel termination.
Slide 18: Vector sum for A.size = 1024. GPU: A[threadId] = A[threadId] + B[threadId]. CPU: (1) allocate arrays in GPU memory; (2) copy data CPU -> GPU; (3) invoke the kernel with 1024 threads; (4) wait until it completes and copy data GPU -> CPU.
Slide 19: Vector sum kernel code in CUDA. The __global__ modifier signifies a GPU kernel; each thread obtains its unique ID and performs one operation (fine-grain parallelism: one element per thread).

    __global__ void sum(int* A, int* B) {
        int my = get_hw_thread_id();  // get unique thread ID (pseudocode)
        A[my] = A[my] + B[my];
    }
Slides 20-22: CPU code for GPU management. There are two sets of pointers: one for the CPU arrays and one for GPU memory, which is allocated by the CPU. The input is copied from CPU to GPU, pointers to device memory are passed at kernel invocation, the CPU waits until the kernel is over, and the result is copied back to the CPU. This is the typical code structure: the kernel is the GPU programming part; everything in main() is GPU integration.

    __global__ void sum(int* A, int* B) {
        int my = get_hw_thread_id();  // get unique thread ID (pseudocode)
        A[my] = A[my] + B[my];
    }

    int main(int argc, char** argv) {
        int *d_a, *d_b;          // GPU pointers
        int A[SIZE], B[SIZE];    // CPU arrays
        // GPU memory is allocated by the CPU
        cudaMalloc((void**)&d_a, size_bytes);
        cudaMalloc((void**)&d_b, size_bytes);
        // input copied from CPU to GPU
        cudaMemcpy(d_a, A, size_bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, B, size_bytes, cudaMemcpyHostToDevice);
        // pointers to device memory are passed at kernel invocation
        dim3 threads_in_block(512), blocks(2);
        sum<<<blocks, threads_in_block>>>(d_a, d_b);
        // wait until the kernel is over
        cudaDeviceSynchronize();
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess) {
            fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(error));
            return 1;
        }
        // copy data back to CPU
        cudaMemcpy(A, d_a, size_bytes, cudaMemcpyDeviceToHost);
        printf("Success");
        return 0;
    }
Slide 23: BUT! Vector sum is simple: purely data-parallel. What if we need coordination between tasks?
Slide 24: Inter-thread communication. A programming model's usability is determined by its means of inter-thread communication. No communication: easy for hardware, hard for software. Fine-grained communication: hard to make efficient in hardware, easy for software.
Slides 25-26: Communications: inner product. [Diagram: the pairwise products of the two vectors are summed into a single value.] Eventually all-to-all thread communication is required. How can we make it efficient?
Slide 27: Compromise: hierarchy. All threads are split into fixed-size groups: threadblocks. Threads in a threadblock can synchronize and communicate efficiently: they share fast memory and can use barriers.
Slide 28: Execution model: a stream of threadblocks. Threadblock scheduling is non-deterministic, so you cannot rely on a particular schedule for correctness; there is no inter-threadblock global barrier in the programming model.
Slides 29-30: Hierarchical dot product. [Diagram: the pairwise products are partitioned between Threadblock 1 and Threadblock 2.] Slow coordination should be rare: efficient communication handles the fine-grain task within each threadblock, while slow coordination is reserved for the coarse-grain task across threadblocks.
Slides 31-32: Dot product. The kernel uses fast local (shared) memory, two levels of indexing (block and thread), and fast synchronization within the threadblock.

    __global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout) {
        __shared__ float l_res[tbsize];  // fast local in-core memory
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        int offset = bid * tbsize + tid;  // two levels of indexing
        l_res[tid] = ga[offset] * gb[offset];
        __syncthreads();  // fast sync: wait for all products in a threadblock
        // parallel reduction
        for (int i = tbsize / 2; i > 0; i /= 2) {
            if (tid < i) l_res[tid] = l_res[tid] + l_res[i + tid];
            __syncthreads();  // wait for all partial sums
        }
        if (tid == 0) gout[bid] = l_res[0];
    }
Slide 33: How to finalize the parallel reduction. Option 1: use the CPU. Option 2: use atomics. Option 3: use multiple kernel invocations (kernel completion = global barrier).
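A sketch of Option 3, assuming a second helper kernel (vector_sum_kernel, hypothetical, not from the slides) that adds up the per-block partial sums; the point is that kernel completion acts as the global barrier between the two passes:

```cuda
// Pass 1: each threadblock reduces its chunk to one partial sum in d_partial.
int n_blocks = SIZE / tbsize;
vector_dotproduct_kernel<<<n_blocks, tbsize>>>(d_a, d_b, d_partial);
// Kernel completion = global barrier: pass 2 can safely read every partial sum.
// Pass 2: a single threadblock reduces the n_blocks partial sums to one value.
vector_sum_kernel<<<1, n_blocks>>>(d_partial, d_out);
cudaDeviceSynchronize();
```

This works around the lack of an inter-threadblock barrier inside a kernel: synchronization across threadblocks happens only at kernel boundaries.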
Slide 34: Dot product with atomics. All threadblocks atomically update the global-memory result gout[0].

    __global__ void vector_dotproduct_kernel(float* ga, float* gb, float* gout) {
        __shared__ float l_res[tbsize];  // fast local in-core memory
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        int offset = bid * tbsize + tid;
        l_res[tid] = ga[offset] * gb[offset];
        __syncthreads();  // wait for all products in a threadblock
        // parallel reduction
        for (int i = tbsize / 2; i > 0; i /= 2) {
            if (tid < i) l_res[tid] = l_res[tid] + l_res[i + tid];
            __syncthreads();  // wait for all partial sums
        }
        if (tid == 0) atomicAdd(gout, l_res[0]);
    }
Slide 35: GPU hardware in more detail.
Slide 36: BEGIN: terminology break.
Slide 37: Flynn's HARDWARE taxonomy.
Slide 38: SIMD: lock-step execution. Example: vector operations (SSE).
Slide 39: MIMD. Example: multicores.
Slide 40: END: terminology break.
Slide 41: GPU hardware characteristics. Massive parallelism: the GPU runs thousands of concurrent execution flows (threads), identified and explicitly programmed by the developer.
Slide 42: GPU hardware parallelism, level 1: MIMD = multi-core. [Diagram: a GPU with several cores (multiprocessors, MPs) sharing GPU memory.]
Slide 43: Level 2: SIMD = vector. [Diagram: each core contains SIMD vector units.]
Slides 44-48: Level 3: fine-grain multithreading, i.e. parallelism for latency hiding. [Diagram build-up: a core holds the execution state of several threads T1, T2, T3; as each thread issues a memory read (R 0x01, R 0x04, R 0x08), the core switches to another ready thread while the read is in flight.]
Slide 49: Putting it all together: 3 levels of hardware parallelism. [Diagram: multiple cores (MIMD), each with SIMD vector units and stored thread contexts Ctx 1..Ctx k for fine-grain multithreading.]
Slide 50: What is a GPU thread? [Diagram: a thread is one lane of a SIMD vector together with its stored thread context.]
Slide 51: What is a thread block? [Diagram: a thread block is the set of threads that runs on one core (MP) and shares its scratchpad memory.]
Slide 52: What is a GPU kernel? [Diagram: a kernel is a stream of thread blocks fed to the cores through a hardware queue.]
Slide 53: 10,000s of concurrent threads! NVIDIA K20x GPU: 64 x 14 x 32 = 28,672 concurrent threads.
Slide 54: Using shared memory for efficient matrix product computations (using the CUDA samples).
Slides 55-57: Walk-through of the matrix product code.
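The idea behind the CUDA-samples walk-through can be sketched as a tiled kernel; this is a simplified version (assuming square n x n matrices with n a multiple of the tile size, and illustrative names), not the exact samples code:

```cuda
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    // Tiles of A and B staged in fast shared (scratchpad) memory.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of each tile; barrier before use.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // keep fast threads from overwriting the tiles early
    }
    C[row * n + col] = acc;
}
```

Each global-memory element is loaded into shared memory once per tile pass and then reused TILE times by the threadblock, which is where the speedup over the naive kernel comes from.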
Slide 58: More info about CUDA. See the NVIDIA programming manual and the UDACITY interactive online course "Intro to parallel computing"; it is highly recommended to skim through the first two lectures before the next class.
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 wzhang4@vcu.edu
More informationL20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationGPGPU Parallel Merge Sort Algorithm
GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led
More informationCSE 6040 Computing for Data Analytics: Methods and Tools
CSE 6040 Computing for Data Analytics: Methods and Tools Lecture 12 Computer Architecture Overview and Why it Matters DA KUANG, POLO CHAU GEORGIA TECH FALL 2014 Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationOperating System: Scheduling
Process Management Operating System: Scheduling OS maintains a data structure for each process called Process Control Block (PCB) Information associated with each PCB: Process state: e.g. ready, or waiting
More informationParallel Prefix Sum (Scan) with CUDA. Mark Harris mharris@nvidia.com
Parallel Prefix Sum (Scan) with CUDA Mark Harris mharris@nvidia.com April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007
More informationOperating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest
Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest 1. Introduction Few years ago, parallel computers could
More informationAPPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, typeng2001@yahoo.com
More informationTowards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Please Note IBM s statements regarding its plans, directions, and intent are subject to
More informationReal-time Visual Tracker by Stream Processing
Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol
More informationUnleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
More informationGPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
More informationMulti-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
More informationPerformance Analysis for GPU Accelerated Applications
Center for Information Services and High Performance Computing (ZIH) Performance Analysis for GPU Accelerated Applications Working Together for more Insight Willersbau, Room A218 Tel. +49 351-463 - 39871
More informationTrends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
More informationFirst-class User Level Threads
First-class User Level Threads based on paper: First-Class User Level Threads by Marsh, Scott, LeBlanc, and Markatos research paper, not merely an implementation report User-level Threads Threads managed
More informationME964 High Performance Computing for Engineering Applications
ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationSGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah *, Jin-Woo Kim *, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung SAIT, SAMSUNG Electronics, Yonsei Univ. *,
More informationFrog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
X. SHI ET AL. 1 Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model Xuanhua Shi 1, Xuan Luo 1, Junling Liang 1, Peng Zhao 1, Sheng Di 2, Bingsheng He 3, and Hai Jin 1 1 Services Computing
More informationCOSCO 2015 Heterogeneous Computing Programming
COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationComputer Systems II. Unix system calls. fork( ) wait( ) exit( ) How To Create New Processes? Creating and Executing Processes
Computer Systems II Creating and Executing Processes 1 Unix system calls fork( ) wait( ) exit( ) 2 How To Create New Processes? Underlying mechanism - A process runs fork to create a child process - Parent
More informationLearning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design
Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards eddie.edwards@imperial.ac.uk At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components
More information