GPGPU Parallel Merge Sort Algorithm
Jim Kukunas and James Devine
May 4, 2009

Abstract

The increasingly high data throughput and computational power of today's Graphics Processing Units (GPUs) have led to the development of General Purpose computation on the GPU (GPGPU). NVIDIA has created a unified GPGPU architecture, known as CUDA, which can be utilized through language extensions to the C programming language. This work seeks to explore CUDA through the implementation of a parallel merge sort algorithm.

1 Introduction

GPGPU allows for the parallel execution of code on the GPU. The GPU is essentially used as a co-processor to handle the execution of highly parallel code. Large performance improvements can be realized by outsourcing highly parallel, computationally intense tasks to the GPU.

Typically, Central Processing Units (CPUs) achieve computational parallelism through the use of an out-of-order execution core, which contains only a few Arithmetic Logic Units (ALUs). ALUs are execution units dedicated to integer and floating-point computations. For example, three of the six ports within the execution core of the Intel Core 2 Duo are ALUs. This means that, assuming a full pipeline, three arithmetic computations can execute at once.

Graphics Processing Units (GPUs), unlike CPUs, are specialized for intense floating-point calculations, and thus contain many stream processors. Stream processors are ALUs which specialize in a limited form of parallelism, similar to SIMD vector instructions, in which there exists a stream of data and a kernel function which is applied to each element in the stream. These streams are ideal for intensive, highly specialized, parallel floating-point operations. The NVIDIA GTX 260, for example, contains 192 on-board stream processors (216 in the later Core 216 revision). Compared with a CPU, the GPU has a much greater computational throughput, and thus it is ideal to outsource heavy, embarrassingly parallel computations onto the GPU. This outsourcing onto the graphics card is known as General Purpose computation on Graphics Processing Units (GPGPU).

Historically, GPGPU architectures were highly specific to certain graphics cards and required the developer to write all GPGPU code in assembly language, assembled specifically for the card in use. However, with the increase in the number of stream processors, the increase in the amount of on-board memory, and increased pipeline speeds, the demand for GPGPU tools has grown, and both NVIDIA and ATI have released unified GPGPU architectures for their latest cards. NVIDIA's architecture, CUDA, is a set of language extensions to the C programming language. The CUDA architecture can be seen in Figure 1.

Figure 1: CUDA Architecture

2 Extended CUDA C

To allow for simple integration with existing applications, NVIDIA implemented their GPGPU architecture through extensions to the C programming language. The NVIDIA CUDA compiler supports full ISO C++ for code running on the host, but for functions executed on the device it supports only CUDA C.
Function Qualifiers

The CUDA extensions for C contain three function type qualifiers: __device__, __global__ and __host__.

The __device__ qualifier declares a function that is executed entirely on the GPU, and hence this kind of function is the most restricted. Such functions cannot have their address taken, cannot be recursive, cannot contain static variable declarations and cannot take a variable number of arguments. They can also only be called from the device.

The __global__ qualifier declares a kernel function, the function which will be executed on each element in the stream of data. __global__ functions are executed on the GPU, but are only callable from the host. They must also return void.

The third qualifier, __host__, declares a function which is executed on the host. These functions are only callable from the host.

Once a __global__ function has been declared, it can be launched from the host using the following syntax:

    FUNC<<<Dg, Db, Ns>>>(Parameters);

Here, Dg is a three-element tuple specifying the dimensions and size of the grid to be launched. Db is also a three-element tuple, specifying the dimensions and size of each block. Finally, Ns is of type size_t and represents the number of bytes of shared memory that are dynamically allocated per block. The number of threads spawned per block is given by [1]:

$$\mathit{threadsPerBlock} = Db.x \times Db.y \times Db.z$$

Variable Qualifiers

The CUDA extensions for C contain three variable qualifiers: __device__, __constant__ and __shared__. Variables declared with the __device__ qualifier reside within the memory of the device. These variables are in the global memory space and have the lifetime of the application. The __constant__ qualifier declares a variable similar to a __device__ variable, but it resides in constant memory and its value cannot be modified by device code. The third qualifier, __shared__, declares a variable which resides in the shared memory of a thread block and is accessible from all threads within that block [1].
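The execution-configuration arithmetic described above can be illustrated with a small host-side sketch in plain C. It is not CUDA code: the Dim3 struct is a hypothetical stand-in for CUDA's three-element dimension type, and the helper names threads_per_block, total_threads and shared_bytes are invented for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for CUDA's three-element dimension tuple,
   so that the sketch compiles as ordinary C. */
typedef struct {
    unsigned int x, y, z;
} Dim3;

/* Threads per block: the product of the three block dimensions (Db). */
static unsigned int threads_per_block(Dim3 db)
{
    assert(db.x > 0 && db.y > 0 && db.z > 0);
    return db.x * db.y * db.z;
}

/* Total threads in a launch: blocks in the grid (Dg) times threads per block. */
static unsigned int total_threads(Dim3 dg, Dim3 db)
{
    assert(dg.x > 0 && dg.y > 0 && dg.z > 0);
    return dg.x * dg.y * dg.z * threads_per_block(db);
}

/* Dynamic shared memory request (Ns), in bytes, for a block that needs
   the given number of int elements. */
static size_t shared_bytes(size_t ints_per_block)
{
    return ints_per_block * sizeof(int);
}
```

For a launch with a single block of 64 threads, such as the one used by the driver in the appendix, Dg would be (1, 1, 1) and Db would be (64, 1, 1), so threads_per_block returns 64 and total_threads returns 64. A request for twice that many int elements of shared memory corresponds to shared_bytes(128).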
Learning About CUDA and GPGPU

The starting point for the project was to learn about GPGPU and CUDA. The Internet served as our best resource. We started with the Wikipedia entry for general information and then followed the links to more authoritative websites. These links can be found in the References section.

Install the CUDA SDK and Driver

The CUDA SDK and drivers are available for free from NVIDIA's CUDA Zone website [2]. After downloading the software, we first installed the CUDA drivers. These replace the previously installed video drivers so that the CUDA extensions can run. Once the CUDA drivers were installed, the SDK installer was run. This installed the files required to write code that uses the CUDA extensions. Along with the SDK, a package of sample CUDA programs was installed.

Integrate the CUDA SDK into Visual Studio 2008

After the CUDA driver is installed, it is necessary to set the directory paths in Visual Studio to allow proper integration. Inside Visual Studio, you must set the bin, include, lib and source paths to those installed by the driver and those installed by the SDK. Once that is completed, you must create a custom build rule so that the CUDA application is built by the NVCC compiler.

Porting an Iterative Merge Sort to the CUDA Architecture

Since CUDA does not support recursion, we had to implement an iterative version of merge sort. After a few attempts, we settled on an implementation we found at [4]. Using existing CUDA examples and the CUDA Programming Guide [1], we created a kernel (__global__) function which is passed two arrays: the first is the list of numbers to be sorted and the second is the destination array. Both of these arrays were first created and initialized in host memory. Then, using cudaMalloc and cudaMemcpy, they were copied into the GPU's dedicated memory. Inside the kernel function, the merge function is called. The merge function is a __device__ function which performs the actual merging on the GPU.

3 Challenges

The two main challenges involved in this project were setting up Visual Studio 2008 and writing the algorithm. After several hours of work we were able to get Visual Studio to build applications using the CUDA API. It was quite frustrating at the time, but after the correct directories were added to the Windows PATH variable, programs began to compile without any problems. Once we were able to use the API, we were faced with the challenge of learning CUDA and then coding the merge sort algorithm. One of the more difficult tasks in writing the algorithm was deciding how to convert the recursive merge sort into an iterative one. After finding some sample C code for an iterative merge sort online, we were able to replicate it using the CUDA API.

4 Future Work

If more time had been allotted for this project, it would have been interesting to measure the speed-up gained by performing merge sort on the GPU. There is much further work that can be done with GPGPU. The increasing speed and decreasing price of GPUs make them a promising platform for future research. Undoubtedly, better tools and easier extensions for implementing GPGPU are not far down the road.

5 Conclusion

The project was both fun and educational. We got a chance to learn how to write parallel code that runs on the GPU.
6 Appendix

Iterative Merge Sort [4]

    // This is an iterative version of merge sort.
    // It merges pairs of two consecutive lists one after another.
    // After scanning the whole array to do the above job,
    // it goes to the next stage. Variable k controls the size
    // of lists to be merged. k doubles each time the main loop
    // is executed.
    #include <stdio.h>
    #include <stdlib.h>   /* random() */
    #include <time.h>     /* clock() */
    #include <math.h>

    int i, n, t, k;
    /* The array bounds are missing from the source listing.
       Indices 1..n are used, so each array must hold at least n+1 ints. */
    int a[], b[];

    /* Merge the sorted runs a[l..r-1] and a[r..u-1] through b, back into a. */
    void merge(int l, int r, int u)
    {
        int i, j, k;
        i = l; j = r; k = l;
        while (i < r && j < u) {
            if (a[i] <= a[j]) { b[k] = a[i]; i++; }
            else              { b[k] = a[j]; j++; }
            k++;
        }
        while (i < r) { b[k] = a[i]; i++; k++; }
        while (j < u) { b[k] = a[j]; j++; k++; }
        for (k = l; k < u; k++) { a[k] = b[k]; }
    }

    void sort(void)
    {
        int k, u;
        k = 1;
        while (k < n) {
            i = 1;
            while (i + k <= n) {
                u = i + k * 2;
                if (u > n) u = n + 1;
                merge(i, i + k, u);
                i = i + k * 2;
            }
            k = k * 2;
        }
    }

    int main(void)
    {
        printf("input size \n");
        scanf("%d", &n);
        /* for (i = 1; i <= n; i++) scanf("%d", &a[i]); */
        for (i = 1; i <= n; i++) a[i] = random() % 1000;
        t = clock();
        sort();
        for (i = 1; i <= 10; i++) printf("%d ", a[i]);
        printf("\n");
        /* The divisor is missing from the source listing. */
        printf("time= %d millisec\n", (clock() - t) / );
        return 0;
    }

GPGPU Merge Sort Kernel

    /*
     * merge_kernel.cu
     * Merge sort implementation run in parallel threads on the GPU
     * through the NVIDIA CUDA Framework
     *
     * Jim Kukunas and James Devine
     */
    #ifndef MERGE_KERNEL_CU
    #define MERGE_KERNEL_CU

    #define NUM 64

    /* Merge the sorted runs values[l..r-1] and values[r..u-1] into results,
       then copy the merged range back into values. */
    __device__ inline void Merge(int *values, int *results, int l, int r, int u)
    {
        int i, j, k;
        i = l; j = r; k = l;
        while (i < r && j < u) {
            if (values[i] <= values[j]) { results[k] = values[i]; i++; }
            else                        { results[k] = values[j]; j++; }
            k++;
        }
        while (i < r) { results[k] = values[i]; i++; k++; }
        while (j < u) { results[k] = values[j]; j++; k++; }
        for (k = l; k < u; k++) { values[k] = results[k]; }
    }

    __global__ static void MergeSort(int *values, int *results)
    {
        extern __shared__ int shared[];
        const unsigned int tid = threadIdx.x;
        int k, u, i;

        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();

        // Every thread in the block runs this loop over the whole shared array.
        // Indices are 0-based here (the listing in [4] is 1-based) so that
        // element 0 is sorted and index NUM is never touched.
        k = 1;
        while (k < NUM) {
            i = 0;
            while (i + k < NUM) {
                u = i + k * 2;
                if (u > NUM) { u = NUM; }
                Merge(shared, results, i, i + k, u);
                i = i + k * 2;
            }
            k = k * 2;
            __syncthreads();
        }

        values[tid] = shared[tid];
    }
    #endif

GPGPU Merge Sort Driver Application

    /*
     * merge.cu
     * Tests the merge sort implementation run in parallel threads on the GPU
     * through the NVIDIA CUDA Framework
     *
     * Jim Kukunas and James Devine
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cutil_inline.h>
    #include "merge_kernel.cu"

    int main(int argc, char **argv)
    {
        if (cutCheckCmdLineFlag(argc, (const char **)argv, "device"))
            cutilDeviceInit(argc, argv);
        else
            cudaSetDevice(cutGetMaxGflopsDeviceId());

        int values[NUM];

        /* initialize a random dataset */
        for (int i = 0; i < NUM; i++) {
            values[i] = rand();
        }

        int *dvalues, *results;
        cutilSafeCall(cudaMalloc((void **)&dvalues, sizeof(int) * NUM));
        cutilSafeCall(cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice));
        cutilSafeCall(cudaMalloc((void **)&results, sizeof(int) * NUM));
        cutilSafeCall(cudaMemcpy(results, values, sizeof(int) * NUM, cudaMemcpyHostToDevice));

        MergeSort<<<1, NUM, sizeof(int) * NUM * 2>>>(dvalues, results);

        // check for any errors
        cutilCheckMsg("Kernel execution failed");

        cutilSafeCall(cudaFree(dvalues));
        cutilSafeCall(cudaMemcpy(values, results, sizeof(int) * NUM, cudaMemcpyDeviceToHost));
        cutilSafeCall(cudaFree(results));

        bool passed = true;
        for (int i = 1; i < NUM; i++) {
            if (values[i - 1] > values[i]) {
                passed = false;
            }
        }
        printf("Test %s\n", passed ? "PASSED" : "FAILED");

        cudaThreadExit();
        cutilExit(argc, argv);
    }
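As a point of comparison for the listings above, the bottom-up strategy can be sketched as a small CPU-only reference in plain C. This is a minimal illustration and not the authors' code: it assumes 0-based indexing and a caller-supplied scratch buffer, and the names merge_runs, merge_sort_iterative and is_sorted are hypothetical. A reference like this could be used on the host to cross-check the output copied back from the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge the sorted runs a[l..r) and a[r..u) into b[l..u),
   then copy the merged range back into a. */
static void merge_runs(int *a, int *b, size_t l, size_t r, size_t u)
{
    size_t i = l, j = r, k = l;
    while (i < r && j < u) {
        if (a[i] <= a[j]) b[k++] = a[i++];
        else              b[k++] = a[j++];
    }
    while (i < r) b[k++] = a[i++];
    while (j < u) b[k++] = a[j++];
    memcpy(a + l, b + l, (u - l) * sizeof(int));
}

/* Bottom-up merge sort of a[0..n). The run width k doubles on each pass,
   so no recursion is needed. b is scratch space of at least n ints. */
static void merge_sort_iterative(int *a, int *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    for (size_t k = 1; k < n; k *= 2) {
        /* A pair is merged only if its right-hand run is non-empty. */
        for (size_t i = 0; i + k < n; i += 2 * k) {
            size_t u = i + 2 * k;
            if (u > n) u = n;   /* clip the last run to the array end */
            merge_runs(a, b, i, i + k, u);
        }
    }
}

/* Returns 1 if a[0..n) is in non-decreasing order, using the same
   adjacent-pair comparison as the driver's pass/fail loop. */
static int is_sorted(const int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (a[i - 1] > a[i]) return 0;
    }
    return 1;
}
```

Calling merge_sort_iterative(a, b, n) sorts a in place. Because each pass only reads and writes disjoint index ranges, the merges within one pass are independent of each other, which is the property a parallel version would try to exploit.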
References

[1] NVIDIA Corporation. NVIDIA CUDA Programming Guide. NVIDIA, NVIDIA_CUDA_Programming_Guide_2.1.pdf
[2] NVIDIA Corporation. CUDA Zone. NVIDIA, cuda_home.html
[3] Mark Harris. General-purpose computation on graphics hardware. GPGPU.org,
[4] Tadao Takaoka. Iterative merge sort. takaoka/cosc229/imerge.c
CUDA Basics. Murphy Stein New York University
CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Introduction to CUDA C
Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL
Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL Michał Wójcik, Tomasz Boiński Katedra Architektury Systemów Komputerowych Wydział Elektroniki, Telekomunikacji i Informatyki Politechnika
CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University
CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
CUDA. Multicore machines
CUDA GPU vs Multicore computers Multicore machines Emphasize multiple full-blown processor cores, implementing the complete instruction set of the CPU The cores are out-of-order implying that they could
CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing
CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach
High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach Beniamino Di Martino, Antonio Esposito and Andrea Barbato Department of Industrial and Information Engineering Second University of Naples
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
OpenACC Basics Directive-based GPGPU Programming
OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. [email protected] Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,
IMAGE PROCESSING WITH CUDA
IMAGE PROCESSING WITH CUDA by Jia Tse Bachelor of Science, University of Nevada, Las Vegas 2006 A thesis submitted in partial fulfillment of the requirements for the Master of Science Degree in Computer
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Learn CUDA in an Afternoon: Hands-on Practical Exercises
Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA
Spatial BIM Queries: A Comparison between CPU and GPU based Approaches
Technische Universität München Faculty of Civil, Geo and Environmental Engineering Chair of Computational Modeling and Simulation Prof. Dr.-Ing. André Borrmann Spatial BIM Queries: A Comparison between
GPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
Lecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles [email protected] Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
5 Arrays and Pointers
5 Arrays and Pointers 5.1 One-dimensional arrays Arrays offer a convenient way to store and access blocks of data. Think of arrays as a sequential list that offers indexed access. For example, a list of
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
GPU Computing - CUDA
GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
Informatica e Sistemi in Tempo Reale
Informatica e Sistemi in Tempo Reale Introduction to C programming Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa October 25, 2010 G. Lipari (Scuola Superiore Sant Anna)
L20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
Visual Basic Programming. An Introduction
Visual Basic Programming An Introduction Why Visual Basic? Programming for the Windows User Interface is extremely complicated. Other Graphical User Interfaces (GUI) are no better. Visual Basic provides
Programming GPUs with CUDA
Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University [email protected] COMP 422 Lecture 23 12 April 2016 Why GPUs? Two major trends GPU performance is pulling away from
Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C
Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection
Optimizing Application Performance with CUDA Profiling Tools
Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
Parallel Prefix Sum (Scan) with CUDA. Mark Harris [email protected]
Parallel Prefix Sum (Scan) with CUDA Mark Harris [email protected] April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007
Writing Applications for the GPU Using the RapidMind Development Platform
Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
OpenCL Programming for the CUDA Architecture. Version 2.3
OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different
Phys4051: C Lecture 2 & 3. Comment Statements. C Data Types. Functions (Review) Comment Statements Variables & Operators Branching Instructions
Phys4051: C Lecture 2 & 3 Functions (Review) Comment Statements Variables & Operators Branching Instructions Comment Statements! Method 1: /* */! Method 2: // /* Single Line */ //Single Line /* This comment
Lecture 22: C Programming 4 Embedded Systems
Lecture 22: C Programming 4 Embedded Systems Today s Goals Basic C programming process Variables and constants in C Pointers to access addresses Using a High Level Language High-level languages More human
GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding
Le langage OCaml et la programmation des GPU
Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline
Hands-on CUDA exercises
Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
