GPGPU Parallel Merge Sort Algorithm

Jim Kukunas and James Devine

May 4, 2009

Abstract

The increasingly high data throughput and computational power of today's Graphics Processing Units (GPUs) has led to the development of General Purpose computation on the GPU (GPGPU). nvidia has created a unified GPGPU architecture, known as CUDA, which can be utilized through language extensions to the C programming language. This work explores CUDA through the implementation of a parallel merge sort algorithm.

1 Introduction

GPGPU allows for the parallel execution of code on the GPU. The GPU is essentially used as a co-processor to handle the execution of highly parallel code, and very large performance improvements can be realized by outsourcing highly parallel, computationally intense tasks to it. Typically, Central Processing Units (CPUs) achieve computational parallelism through an out-of-order execution core, which contains only a few Arithmetic Logic Units (ALUs), the execution units dedicated to integer and floating point computations. For example, three of the six ports within the execution core of the Intel Core 2 Duo are ALUs. This means that, assuming a full pipeline, three arithmetic computations can execute at once.

Graphics Processing Units (GPUs), unlike CPUs, are specialized for intense floating point calculations and thus contain many stream processors. Stream processors are ALUs that support a limited form of parallelism, similar to SIMD vector instructions, in which there is a stream of data and a kernel function that is applied to each element in the stream. These streams are ideal for intensive, highly specialized, parallel floating point operations. The nvidia GTX 260, for example, contains 250 on-board stream processors. Compared with a CPU, the GPU has a much greater computational throughput, and thus it is ideal to outsource heavy, embarrassingly parallel computations onto the GPU. This outsourcing onto the graphics card is known as General Purpose computation on Graphics Processing Units (GPGPU).

Historically, GPGPU architectures were highly specific to certain graphics cards and required the developer to write all GPGPU code in assembly language assembled for the particular card in use. However, with the increase in the number of stream processors, the growth in on-board memory, and faster pipelines, the demand for GPGPU tools has grown, and both nvidia and ATI have released unified GPGPU architectures for their latest cards. nvidia's architecture, CUDA, is a set of language extensions to the C programming language. The CUDA architecture is shown in Figure 1.

Figure 1: CUDA Architecture
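As a rough illustration of this stream-and-kernel model (a minimal sketch, not drawn from the report's own code; the kernel name square and the launch dimensions are illustrative), a CUDA C kernel that squares every element of an array looks like the following, with each GPU thread applying the kernel body to one element of the stream:

    // Each thread handles exactly one element of the input stream.
    __global__ void square(float *data)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = data[idx] * data[idx];
    }

    // Launched from the host with one thread per element,
    // e.g. for 4096 elements already resident in device memory:
    //     square<<<16, 256>>>(d_data);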

2 Extended CUDA C

To allow simple integration with existing applications, nvidia implemented its GPGPU architecture through extensions to the C programming language. The nvidia CUDA compiler supports full ISO C++ for code running on the host, but only CUDA C for functions executed on the device.

Function Qualifiers

The CUDA extensions to C provide three function type qualifiers: __device__, __global__ and __host__. The __device__ qualifier declares a function executed entirely on the GPU, and hence these functions are the most restricted: their address cannot be taken, they cannot be recursive, cannot contain static variable declarations, and cannot take a variable number of arguments. They can also only be called from the device. Functions declared with the __global__ qualifier define kernel functions, the functions that are executed on each element in the stream of data. __global__ functions are executed on the GPU but are only callable from the host, and they must return void. The third qualifier, __host__, declares a function which is executed on the host and is only callable from the host. Once a __global__ kernel function has been declared, it can be launched from the host using the following syntax:

    Func<<<Dg, Db, Ns>>>(parameters);

Here, Dg is a three-component value (of type dim3) specifying the dimensions and size of the grid to be launched, and Db is a three-component value specifying the dimensions and size of each block. Finally, Ns is of type size_t and gives the number of bytes of shared memory that is dynamically allocated per block. The number of threads spawned per block is [1]:

    threadsPerBlock = Db.x * Db.y * Db.z

Variable Qualifiers

The CUDA extensions to C also provide three variable qualifiers: __device__, __constant__, and __shared__. Variables declared with the __device__ qualifier reside in the memory of the device; they live in the global memory space and have the lifetime of the application. The __constant__ qualifier declares a variable similar to __device__, but whose value is constant. The third qualifier, __shared__, declares a variable that resides in the shared memory of a thread block and is accessible from all threads of that block [1].
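A minimal sketch combining the three function qualifiers and the launch syntax (the names addOne, scale and runOnDevice, and the block size N, are illustrative placeholders, not part of the CUDA SDK):

    #define N 256

    // __device__: callable only from code already running on the GPU.
    __device__ int addOne(int x)
    {
        return x + 1;
    }

    // __global__: a kernel, launched from the host, must return void.
    __global__ void scale(int *data)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = addOne(data[idx]);
    }

    // __host__: ordinary CPU code; launches the kernel with <<<Dg, Db, Ns>>>.
    __host__ void runOnDevice(int *d_data)
    {
        dim3   Dg(1, 1, 1);   // grid dimensions
        dim3   Db(N, 1, 1);   // block dimensions: Db.x * Db.y * Db.z threads per block
        size_t Ns = 0;        // dynamically allocated shared memory per block, in bytes
        scale<<<Dg, Db, Ns>>>(d_data);
    }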

Learning About CUDA and GPGPU

The starting point for the project was to learn about GPGPU and CUDA. The Internet served as our best resource: we started with the Wikipedia entry for general information and then followed the links to more authoritative websites. These links can be found in the References section.

Install the CUDA SDK and Driver

The CUDA SDK and drivers are available for free from nvidia's CUDA Zone website [2]. After downloading the software, we first installed the CUDA drivers, which replace the existing video drivers and allow the CUDA extensions to run. Once the CUDA drivers were installed, the SDK installer was run; it installed the files required to write code that uses the CUDA extensions, along with a package of sample CUDA programs.

Integrate the CUDA SDK into Visual Studio 2008

After the CUDA driver is installed, it is necessary to set the environment variables in Visual Studio to allow proper integration. Inside Visual Studio, the bin, include, lib and source paths must be set to those installed by the driver and by the SDK. Once that is completed, a custom build rule must be created so that the CUDA application is built by the NVCC compiler.

Porting an Iterative Merge Sort to the CUDA Architecture

Since CUDA does not support recursion, we had to implement an iterative version of merge sort. After a few attempts, we settled on an implementation we found at [4]. Using existing CUDA examples and the CUDA Programming Guide [1], we created a host function which was passed two arrays: the first a list of the numbers to be sorted and the second the destination array. Both arrays were first created and initialized in host memory. Then, using cudaMalloc and cudaMemcpy, they were shuttled onto the GPU's dedicated memory. From within the host function, the merge routine is invoked; it is a device function which performs the actual merge sort algorithm on the GPU.
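Condensed to its essentials, this host-side pattern might look roughly as follows (a sketch using plain CUDA runtime calls rather than the SDK's cutil wrappers; MergeSort and NUM refer to the kernel and constant reproduced in the appendix, and the full driver appears there as well):

    int values[NUM];            // input data, created and initialized on the host
    int *d_values, *d_results;

    // Allocate device memory and shuttle the input onto the GPU.
    cudaMalloc((void **)&d_values,  sizeof(int) * NUM);
    cudaMalloc((void **)&d_results, sizeof(int) * NUM);
    cudaMemcpy(d_values, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

    // Launch the kernel; the __device__ merge routine runs on the GPU.
    MergeSort<<<1, NUM, sizeof(int) * NUM * 2>>>(d_values, d_results);

    // Copy the sorted output back to the host (the appendix driver reads
    // it from the results buffer) and release device memory.
    cudaMemcpy(values, d_results, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_results);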

3 Challenges

The two main challenges in this project were setting up Visual Studio 2008 and writing the algorithm. After several hours of work we were able to get Visual Studio to build applications using the CUDA API. It was quite frustrating at the time, but after the correct files were added to the Windows PATH variable, programs began to compile without any problems. Once we were able to use the API, we were faced with the challenge of learning CUDA and then coding the merge sort algorithm. One of the more difficult tasks in writing the algorithm was deciding how to convert the recursive merge sort into an iterative one. After finding some sample C code for an iterative merge sort online, we were able to adapt it to the CUDA API.

4 Future Work

If more time had been allotted for this project, it would have been interesting to measure the speedup gained by performing merge sort on the GPU. There is much further work that can be done with GPGPU. The increasing speed and decreasing price of GPUs make them a promising platform for future research, and better tools and easier extensions for GPGPU programming are undoubtedly not far down the road.

5 Conclusion

The project was both fun and educational. We got a chance to learn how to write parallel code that runs on the GPU.

6 Appendix

Iterative Merge Sort [4]

    // This is an iterative version of merge sort.
    // It merges pairs of two consecutive lists one after another.
    // After scanning the whole array to do the above job,
    // it goes to the next stage. Variable k controls the size
    // of lists to be merged. k doubles each time the main loop
    // is executed.
    #include <stdio.h>
    #include <math.h>

    int i, n, t, k;
    int a[100000], b[100000];

    int merge(l, r, u)
    int l, r, u;
    {
        int i, j, k;
        i = l; j = r; k = l;
        while (i < r && j < u) {
            if (a[i] <= a[j]) { b[k] = a[i]; i++; }
            else              { b[k] = a[j]; j++; }
            k++;
        }
        while (i < r) { b[k] = a[i]; i++; k++; }
        while (j < u) { b[k] = a[j]; j++; k++; }
        for (k = l; k < u; k++) a[k] = b[k];
    }

    sort()
    {
        int k, u;
        k = 1;
        while (k < n) {
            i = 1;
            while (i + k <= n) {
                u = i + k * 2;
                if (u > n) u = n + 1;
                merge(i, i + k, u);
                i = i + k * 2;
            }
            k = k * 2;
        }
    }

    main()
    {
        printf("input size\n");
        scanf("%d", &n);
        /* for (i = 1; i <= n; i++) scanf("%d", &a[i]); */
        for (i = 1; i <= n; i++) a[i] = random() % 1000;
        t = clock();
        sort();
        for (i = 1; i <= 10; i++) printf("%d ", a[i]);
        printf("\n");
        printf("time= %d millisec\n", (clock() - t) / 1000);
    }

GPGPU Merge Sort Kernel

    /* merge_kernel.cu
       merge sort implementation run in parallel threads on the GPU
       through the nvidia CUDA Framework
       Jim Kukunas and James Devine */

    #ifndef MERGE_KERNEL_CU
    #define MERGE_KERNEL_CU

    #define NUM 64

    __device__ inline void Merge(int *values, int *results, int l, int r, int u)
    {
        int i, j, k;
        i = l; j = r; k = l;
        while (i < r && j < u) {
            if (values[i] <= values[j]) { results[k] = values[i]; i++; }
            else                        { results[k] = values[j]; j++; }
            k++;
        }
        while (i < r) { results[k] = values[i]; i++; k++; }
        while (j < u) { results[k] = values[j]; j++; k++; }
        for (k = l; k < u; k++) values[k] = results[k];
    }

    __global__ static void MergeSort(int *values, int *results)
    {
        extern __shared__ int shared[];
        const unsigned int tid = threadIdx.x;
        int k, u, i;

        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();

        k = 1;
        while (k < NUM) {
            i = 1;
            while (i + k <= NUM) {
                u = i + k * 2;
                if (u > NUM) u = NUM + 1;
                Merge(shared, results, i, i + k, u);
                i = i + k * 2;
            }
            k = k * 2;
        }
        __syncthreads();

        values[tid] = shared[tid];
    }

    #endif

GPGPU Merge Sort Driver Application

    /* merge.cu
       Tests the merge sort implementation run in parallel threads on the GPU
       through the nvidia CUDA Framework
       Jim Kukunas and James Devine */

    #include <stdio.h>
    #include <stdlib.h>
    #include <cutil_inline.h>
    #include "merge_kernel.cu"

    int main(int argc, char **argv)
    {
        if (cutCheckCmdLineFlag(argc, (const char **)argv, "device"))
            cutilDeviceInit(argc, argv);
        else
            cudaSetDevice(cutGetMaxGflopsDeviceId());

        int values[NUM];

        /* initialize a random data set */
        for (int i = 0; i < NUM; i++)
            values[i] = rand();

        int *dvalues, *results;
        cutilSafeCall(cudaMalloc((void **)&dvalues, sizeof(int) * NUM));
        cutilSafeCall(cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice));
        cutilSafeCall(cudaMalloc((void **)&results, sizeof(int) * NUM));
        cutilSafeCall(cudaMemcpy(results, values, sizeof(int) * NUM, cudaMemcpyHostToDevice));

        MergeSort<<<1, NUM, sizeof(int) * NUM * 2>>>(dvalues, results);

        // check for any errors
        cutilCheckMsg("Kernel execution failed");

        cutilSafeCall(cudaFree(dvalues));
        cutilSafeCall(cudaMemcpy(values, results, sizeof(int) * NUM, cudaMemcpyDeviceToHost));
        cutilSafeCall(cudaFree(results));

        bool passed = true;
        for (int i = 1; i < NUM; i++)
            if (values[i - 1] > values[i])
                passed = false;

        printf("Test %s\n", passed ? "PASSED" : "FAILED");

        cudaThreadExit();
        cutilExit(argc, argv);
    }

References

[1] NVIDIA Corporation. NVIDIA CUDA Programming Guide. NVIDIA, 2008. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf

[2] NVIDIA Corporation. CUDA Zone. NVIDIA, 2009. http://www.nvidia.com/object/cuda_home.html

[3] Mark Harris. General-purpose computation on graphics hardware. GPGPU.org, 2009. http://gpgpu.org

[4] Tadao Takaoka. Iterative merge sort. http://www.cosc.canterbury.ac.nz/tad.takaoka/cosc229/imerge.c