GPUMemSort: A High-Performance Graphics Co-processor Sorting Algorithm for Large-Scale In-Memory Data




Yin Ye 1, Zhihui Du 1+, David A. Bader 2

1 National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
+ Corresponding Author's Email: duzh@tsinghua.edu.cn
2 College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332, USA

Abstract

In this paper, we present a GPU-based sorting algorithm, GPUMemSort, which achieves high performance in sorting large-scale in-memory data by exploiting highly parallel GPU processors. It consists of two algorithms: an in-core algorithm, which is responsible for sorting data in GPU global memory efficiently, and an out-of-core algorithm, which is responsible for dividing large-scale data into multiple chunks that fit GPU global memory. GPUMemSort is implemented on the NVIDIA CUDA framework, and some critical, detailed optimization methods are also presented. Tests of different algorithms have been run on multiple data sets. The experimental results show that our in-core sorting can outperform other comparison-based algorithms and that GPUMemSort is highly effective in sorting large-scale in-memory data.

Keywords: Parallel Sorting Algorithm, GPU, CUDA

1. Introduction

With the improvement of CPU performance and multi-core CPUs, the bandwidth between CPU and memory has become the bottleneck of large-scale computing. Many hardware vendors, such as AMD, IBM, and NVIDIA, integrate co-processors to offload tasks from the CPU, which can alleviate the effect of the low CPU-memory bandwidth. Meanwhile, high-performance computers now have much larger memory than before, so it is very important to develop efficient co-processor algorithms to deal with large-scale in-memory data. Recently, the GPU has become the best-known co-processor, and it has been utilized in many different sorts of general-purpose applications.
GPUs are suitable for highly parallel, compute-intensive workloads because of their higher memory bandwidth and thousands of hardware thread contexts with hundreds of parallel compute pipelines executing programs in a SIMD fashion. The peak performance of GPUs has been increasing at a rate of 2.5-3 times a year, much faster than the performance of CPUs under Moore's law. Nowadays, several GPGPU (General-Purpose computing on GPUs) languages, such as OpenCL [2] and NVIDIA CUDA [1], have been proposed so that developers can program GPUs with an extended C programming language instead of graphics APIs. In CUDA, threads are organized in a hierarchy of grids, blocks, and threads, which are executed in a SIMT (single-instruction, multiple-thread) manner; threads are virtually mapped to an arbitrary number of streaming multiprocessors (SMs) through warps. There exist several types of memory, such as registers, local memory, shared memory, global memory, and constant memory, each with different characteristics, so how the memory access hierarchy is organized is very important to program performance. In this paper, the GPU part of our algorithm is implemented with CUDA, and we show in detail how we design and optimize its memory access patterns.

(This research is supported in part by the National Natural Science Foundation of China (No. 61738, 6773148 and No. 65339), Beijing Natural Science Foundation (No. 48216), NSF Grants CNS-7837, IIP-934114, OCI-9446 (Bader), and the Center for Adaptive Supercomputing Software for Multithreaded Architectures (CASS-MT).)

Main Contribution: We propose a novel graphics co-processor sorting algorithm to sort large-scale in-memory data. Our idea is to split a large-scale sorting task into a number of disjoint ones which can fit GPU memory. In general, our contributions are as follows: (1) We provide the design, detailed implementation, and tests of a graphics co-processor sorting algorithm which can sort large-scale in-memory data. (2) We enhance the GPU Sample Sort [12] algorithm, the fastest comparison-based GPU sorting algorithm.
The enhanced algorithm outperforms the others because it can achieve better load balancing. The notations in this paper are summarized in Table 1. The paper is organized as follows. Section 2 introduces the background and related work. In Section 3, the proposed algorithm is introduced. Detailed implementation and optimization are presented in Section 4. Our experimental results are shown in Section 5. In Section 6, we give the conclusion and future work.

Table 1. NOTATIONS

NOTATION   DESCRIPTION
N          number of elements in the input data set
Θ          size of elements which can fit into the global memory
d          number of chunks
s          number of sample points
s[i]       the i-th sample point
e[i]       the i-th input element
list[i]    the i-th sorted list

2. Background and Related Work

2.1. Parallel Sorting Algorithms

Parallel sorting has been studied extensively during the past 30 years. Generally, parallel sorting algorithms can be divided into two categories [3]:

Partition-based Sorting: First, use partition keys to split the data into disjoint buckets. Second, sort each bucket independently, then concatenate the sorted buckets.

Merge-based Sorting: First, partition the input data into data chunks of approximately equal size and sort these chunks on different processors. Second, merge the data across all the processors.

Each category has its own potential bottleneck. Partition-based algorithms have to deal with the problem of keeping the load balanced among all processors. Merge-based sorting algorithms perform well only for a small number of processors. To solve the load-balance problem, Parallel Sorting by Regular Sampling (PSRS) [5] guarantees that the size of the data chunk assigned to each processor is less than (2n/p - n/p^2 - p + 1). A newer algorithm [4] can guarantee that each processor will have at most (n/p + n/s - p) elements, where p <= s <= n/p and s is a parameter.

2.2. GPU Programming with CUDA

The NVIDIA CUDA programming model is created for developing applications on GPUs. Some major principles [6] on this platform are: (1) Leverage zero-overhead thread scheduling to hide memory latency. (2) Optimize the use of on-chip memory to reduce bandwidth usage and redundant execution. (3) Group threads to avoid SIMD penalties and memory port/bank conflicts. (4) Threads within a thread block can communicate via synchronization, but there is no built-in global communication mechanism for all threads.

2.3. Parallel Sorting Algorithms based on GPUs

Since most sorting algorithms are bounded by memory bandwidth, sorting on the high-bandwidth GPUs has become a popular topic.
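As a concrete host-side illustration of the partition-based category in Section 2.1, a minimal sketch in plain C++ (not the paper's CUDA code; the splitter keys are assumed to be given):

```cpp
#include <algorithm>
#include <vector>

// Partition-based sort: scatter elements into disjoint buckets by splitter
// keys, sort each bucket independently, then concatenate the sorted buckets.
std::vector<int> partitionSort(const std::vector<int>& data,
                               const std::vector<int>& splitters) {
    std::vector<std::vector<int>> buckets(splitters.size() + 1);
    for (int v : data) {
        // Index of the first splitter >= v selects the bucket.
        size_t b = std::lower_bound(splitters.begin(), splitters.end(), v)
                   - splitters.begin();
        buckets[b].push_back(v);
    }
    std::vector<int> out;
    for (auto& bucket : buckets) {               // buckets are already ordered
        std::sort(bucket.begin(), bucket.end()); // sort each independently
        out.insert(out.end(), bucket.begin(), bucket.end());
    }
    return out;
}
```

Because the buckets are disjoint and ordered by the splitters, each bucket can be sorted by a different processor with no merge phase afterward; the load-balance question is then entirely a question of how good the splitters are.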
Purcell [7] introduced bitonic merge sort, while Kipfer and Westermann [8] improved it to odd-even merge sort. Greß and Zachmann [9] introduced GPU-ABiSort based on adaptive bitonic sorting. Naga K. Govindaraju [3] presented the GPUTeraSort algorithm to sort billion-record wide-key databases. Also, some CUDA-based sorting algorithms have been proposed recently. Erik Sintorn [10] introduced a hybrid algorithm combining bucket sort and merge sort, but it can only sort floats as it uses a float4 in the merge sort. Cederman [11] proposed Quicksort in CUDA, which is sensitive to the distribution of the input data. The comparison-based Thrust Merge method by Nadathur Satish et al. combines odd-even merge and two-way merge to balance the load. Satish et al. [13] presented GPU radix sort for integers. [12] is a randomized sample sort that significantly outperforms Thrust Merge, but because of its random selection, its load balancing is poor. However, most of these are designed for small-scale data sorting and are ineffective when the data size is larger than the global memory size.

3. GPUMemSort Algorithm

In this section, we present the two parts of GPUMemSort. The out-of-core sorting divides large-scale data into multiple disjoint subsets and assigns them to the GPU. The in-core sorting sorts the subsets efficiently.

3.1 Out-of-core Algorithm

We adopt the idea of Deterministic Sample-based Parallel Sorting (DSPS) in the out-of-core sorting algorithm. The idea behind the DSPS algorithm is to find s-1 samples to partition the input data set into several data chunks. Elements in the (i+1)-th chunk are no less than those in the i-th chunk. The sizes of these chunks have a deterministic upper bound, and the chunks can be made to fit into GPU global memory by adjusting the parameter d in the algorithm according to the value of N. The out-of-core algorithm can be described as follows:

Step 1: Divide the input data set into d chunks, each containing N/d elements. Without loss of generality, we assume that d divides N evenly.

Step 2: Copy the chunks to GPU global memory one by one and sort them with the in-core algorithm.
Then split each chunk into d buckets; bucket j of chunk i is called Bin[i][j]. The x-th element of chunk i is put into bucket Bin[i][floor(x*d^2/N)], so each bucket holds N/d^2 consecutive elements of the sorted chunk. Copy these buckets back to main memory.

Step 3: Swap buckets among chunks: for i in [0, d-1] and j in (i, d-1], switch Bin[i][j] and Bin[j][i], so that the new chunk i consists of {Bin[0][i], Bin[1][i], ..., Bin[d-1][i]}.

Step 4: In the (d-1)-th chunk, for each i in [0, d-1], bucket Bin[i][d-1] selects its ((x+1)*N/(d^2*s))-th element, 0 <= x <= s-1, as a sample candidate. The sample candidate list then contains s*d sample points.

Step 5: Sort the sample candidate list and select every ((k+1)*s)-th sample point, k in [0, d-2], as s[k]; s[d-1] is the largest element. Copy the sample point array from main memory to GPU global memory.

Step 6: Copy each chunk to GPU global memory again and split it into d buckets based on the d sample points; bucket j of chunk i is called NS[i][j], 0 <= j <= d-1. After splitting, all elements in NS[i][j] are no larger than s[j]. At last, copy these buckets back to main memory.

Step 7: Swap buckets among chunks again: the new chunk i consists of {NS[0][i], NS[1][i], ..., NS[d-1][i]}, i in [0, d-1]. All elements in chunk i are no larger than s[i].

Step 8: For each i in [0, d-1], calculate the total length of chunk i. If the length is less than the threshold Θ, copy the whole chunk to GPU global memory and sort it with the in-core algorithm. Otherwise, copy NS[0][i], NS[1][i], ..., NS[d-1][i] to the GPU one by one. For NS[j][i], split it into two parts, part[j][i][0] and part[j][i][1]: part[j][i][0] contains the elements equal to s[i] while part[j][i][1] contains the rest. Copy part[j][i][1] back to main memory, then merge all the part[j][i][1], 0 <= j <= d-1, into one array. At last, sort this array on the GPU, write it back to the result set, and fill the rest of the result set with s[i].

In Step 8, the threshold Θ is the maximum size of an array that can be sorted on the GPU at once. Following the derivation in [5], requiring every chunk produced in Step 7 to be smaller than Θ imposes a lower bound on the number of chunks d chosen in Step 1, as a function of N, s, and Θ. Suppose the GPU is able to sort a 128 MB data set at once and the sample number is s = 64; then for N = 1 GB the bound evaluates to d >= 8.47, so d must be at least 9.

3.2 In-core Algorithm

The in-core algorithm is based on GPU Sample Sort [12], currently the fastest comparison-based GPU sorting algorithm, which however encounters a load-balancing problem. The key to making subsets well-balanced in a sample sorting algorithm is to find appropriate splitters, as PSRS (Parallel Sorting by Regular Sampling) and DSPS do. However, if these schemes are ported to the GPU directly, the overhead of generating splitters becomes even larger than that of sorting imbalanced subsets.
It is therefore important to find the tradeoff point between splitter quality and splitter-generation overhead. Let us review the procedure of PSRS. Suppose the size of the data set is n. First, split the data set into p subsets. Then, for each subset, select s-1 equidistant points as sample candidate points. Finally, merge the (s-1)*p sample candidate points, sort them, and select s-1 equidistant points as splitters. The overhead brought by splitter generation in PSRS is splitting the whole data set and sorting all the subsets, and it is proportional to the data size. The in-core sorting algorithm uses an innovative strategy to select sample points. First, pick a set from the whole data set randomly; the size of this set equals (s-1)*k*M (k <= p), where M is the maximum size of an array that can be sorted in the shared memory of one SM. Then, split the set into k subsets and assign k blocks to sort these subsets in parallel. Afterward, for each subset, select s-1 equidistant points as sample candidate points. At last, merge the (s-1)*k samples, sort them, and select s-1 equidistant points as splitters. The parameter k should be assigned at runtime depending on the data size.

4. Detailed Implementation and Optimization

Here we present the detailed implementation and optimization of GPUMemSort. We first describe the task execution engine, which can overlap data transfer with GPU computation based on a pipeline. We then show how to swap buckets between chunks. At last, we present the compensation algorithm based on an optimistic mechanism.

4.1 Task Execution Engine based on a Pipeline

Data transfer between CPU and GPU is a significant overhead in the GPUMemSort algorithm. Without optimization, more than 30% of the time would be spent on data transfer operations between CPU and GPU. On one hand, the GPU cannot be fully used because it remains idle while data transfer operations are performed. On the other hand, since the bandwidth between CPU and GPU is full-duplex, at most 50% of the bandwidth resource can be used at any moment. So overlapping the data transfer from CPU to GPU, the GPU computation, and the data transfer from GPU to CPU brings remarkable performance improvement.
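The intended three-stage overlap can be sketched schematically (a CPU-only toy of the schedule; on the GPU the stages are asynchronous copies and kernel launches on CUDA streams):

```cpp
#include <string>
#include <vector>

// Software-pipeline schedule for a number of chunks over the three subtasks
// H2D (host-to-device copy), SORT (kernel), D2H (device-to-host copy).
// At time step t, chunk i is in stage t - i, so up to three chunks are
// active at once instead of running H2D/SORT/D2H serially per chunk.
std::vector<std::string> pipelineSchedule(int chunks) {
    const char* stage[3] = {"H2D", "SORT", "D2H"};
    std::vector<std::string> timeline;
    for (int t = 0; t < chunks + 2; ++t) {   // chunks + 2 steps in total
        std::string step;
        for (int i = 0; i < chunks; ++i) {
            int s = t - i;
            if (s >= 0 && s < 3) {
                if (!step.empty()) step += " | ";
                step += "chunk" + std::to_string(i) + ":" + stage[s];
            }
        }
        timeline.push_back(step);
    }
    return timeline;
}
```

With c chunks the pipeline finishes in c + 2 stage-steps rather than 3c, which is where the overlap gain comes from.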
Thus, a task execution engine is implemented based on a pipeline mechanism. First, divide a sorting task into three subtasks: CPU-GPU data transfer, kernel sorting, and GPU-CPU data transfer. Then, pipeline these three types of subtasks using streams with the asynchronous memory copy technology provided by CUDA. Streaming maintains the dependency, while asynchronous memory copy parallelizes the data transfer operations and the sorting operation. Fig. 1 shows the comparison between the classic GPU computation pattern and the pipeline-based one.

4.2 Implementation of Bucket Swap

In the implementation of DSPS, different buckets are swapped through network communication because the chunks are scattered in distributed memory. We use pointers to avoid hard memory copies. In Algorithm 1, we present the data structure of the pointer array used to swap buckets, and the function for data transfer from main memory to GPU global memory. Each data chunk is assigned a TransposeChunk structure, which includes a vector of TransposeBlock records holding the start address and the size of each bucket. To swap two buckets, we swap the start addresses and sizes in the corresponding TransposeBlock structures. In the subsequent data transfer, we traverse the buckets and copy them from main memory to GPU global memory; thus hard memory copies are avoided.

Algorithm 1. Data structures for bucket swap and the subsequent data transfer

struct TransposeBlock {
    int* block_ptr;
    long size;
};

struct TransposeChunk {
    TransposeBlock blocks[d];
};

procedure memcpyFromHostToDevice(TransposeChunk& chunk, int* dValue)
    offset = 0
    for q = 0 to d-1 do
        TransposeBlock& tmpBlock = chunk.blocks[q]
        cudaMemcpyHostToDevice(dValue + offset, tmpBlock.block_ptr, sizeof(int) * tmpBlock.size)
        offset = offset + tmpBlock.size
    end for

Figure 1. Comparison between the classic GPU computation pattern and the pipeline-based computation pattern

4.3 Compensation Algorithm based on an Optimistic Mechanism

In Step 4 of DSPS, Est[i] is calculated to record the number of elements equal to s[i] in the sample candidate list. In the subsequent splitting operation for the chunks, it should be guaranteed that the number of elements in NS[i] equal to s[i] does not exceed the bound derived from Est[i]; if it does, such elements should be shifted to the adjacent buckets during splitting. To add this comparison logic into the chunk-splitting module, a global variable would have to be maintained for each bucket to record the number of elements equal to the corresponding splitter, and the atomic FAA (Fetch-And-Add) method would be called repeatedly to keep it consistent, deteriorating performance. Without the logic, however, the size of the chunks in the last step may exceed the threshold Θ. To solve this problem, we propose an innovative compensation algorithm based on an optimistic mechanism. We assume that it is a small-probability event that, for i in [0, d-1], the number of elements in NS[i] equal to s[i] exceeds the bound derived from Est[i], and we add compensation logic in Step 8. First, judge whether the size of each chunk is no larger than the given Θ. If it is, copy the chunk to GPU global memory and sort it with the in-core algorithm. Otherwise, copy NS[0][i], NS[1][i], ..., NS[d-1][i] to the GPU one by one. For NS[j][i], split it into two parts, part[j][i][0] and part[j][i][1]: the former contains the elements equal to s[i] while the latter contains the rest.
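The per-bucket split just described can be sketched on the host (plain C++; on the GPU this work is done by counting and scatter kernels):

```cpp
#include <utility>
#include <vector>

// Split one bucket NS[j][i] into the elements equal to the splitter s[i]
// (first part) and the rest (second part). Only the rest still needs
// sorting; the equal part is later emitted as padding with s[i].
std::pair<std::vector<int>, std::vector<int>>
splitBySplitter(const std::vector<int>& bucket, int splitter) {
    std::pair<std::vector<int>, std::vector<int>> parts;
    for (int v : bucket)
        (v == splitter ? parts.first : parts.second).push_back(v);
    return parts;
}
```

Separating out the splitter-equal elements shrinks the array that actually has to be sorted, which is what lets an oversized chunk fit back under the threshold Θ.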
Copy part[j][i][1] back to main memory, merge all the part[j][i][1], 0 <= j <= d-1, into one array, then sort this array on the GPU and write it back to the result set. At last, fill the rest of the result set with s[i]. The pseudo code of the compensation algorithm is presented in Algorithm 2.

Algorithm 2. Compensation algorithm on the CPU side

// chunk: [input] TransposeChunk of the chunk to be processed
// splitter: [input] the corresponding splitter value
// outputBlock: [output] pointer to the array into which results are written back
// splitterSize: [output] the number of elements equal to splitter in the chunk

procedure handleLongArrayException(const TransposeChunk& chunk, const int splitter,
                                   int*& outputBlock, int& splitterSize)
    int boundary[d];                  // splitter count of each bucket
    struct TransposeChunk m_chunk;
    allocate a length-d array dBoundary on the device and copy boundary to dBoundary
    for q = 0 to d-1 do               // handle the blocks in the chunk one by one
        int* dBucketValue = NULL
        int* dBucketOutputValue = NULL
        const TransposeBlock& tmpBlock = chunk.blocks[q]
        allocate tmpBlock.size elements in dBucketValue and copy tmpBlock.block_ptr to device memory
        allocate a tmpBlock.size-length array for dBucketOutputValue
        splitEquality_kernel<<<BLOCK_NUM, THREADS_NUM>>>(dBucketValue, tmpBlock.size, splitter, dBoundary)
        boundary[q] = sum of dBoundary[i], i in [0, BLOCK_NUM)
        prefixSum(dBoundary)
        divide_kernel<<<BLOCK_NUM, THREADS_NUM>>>(dBucketValue, tmpBlock.size, splitter, dBoundary)
        copy dBucketOutputValue back to outputBlock in main memory
    end for
    copy all buckets in m_chunk to global memory
    employ the in-core sorting algorithm to sort them
    copy the sorted buckets back to outputBlock
    pad the rest of outputBlock with splitter
    free memory on the device and in main memory

5. Experimental Results

In this section, we introduce our hardware environment, compare our in-core sorting with GPU Quicksort, GPU Sample Sort, and Thrust Merge Sort on six different data sets, and show the performance and scalability of GPUMemSort.

5.1 Hardware Environment

Our system consists of two GTX 260 GPU co-processors, 16 GB of DDR3 main memory, and an Intel quad-core i5-750 CPU. Each GPU connects to main memory through an exclusive PCIe 16X data bus, providing 4 GB/s bandwidth in full duplex. Experiments have shown that data transmissions between each GPU and main memory do not disturb each other much. Also, the time consumed by data transmission between GPU and main memory can be almost completely overlapped by GPU computation. Table 2 shows the bandwidth measurement results in different scenarios.

Table 2. GPU memory bandwidth measurement results

Test Cases           Single GPU       Two GPUs
Device to Host       3380.5 MB/s      2785.1 MB/s
Host to Device       3285.5 MB/s      2820.1 MB/s
Device to Device     16481.5 MB/s     16377.1 MB/s

The GTX 260 with CUDA consists of 16 SMs (Streaming Multiprocessors), each with 8 processors executing the same instruction on different data. In CUDA, each SM supports up to 768 threads, owns 16 KB of shared memory, and has 8192 available registers. Threads are logically divided into blocks assigned to a specific SM.

Figure 2. Performance comparison between in-core sort and other existing sorting algorithms (one panel per distribution: Bucket, Gaussian, Sorted, Uniform, Staggered, Zero; x-axis: array size, 16-64 MB; y-axis: sorting time in ms)
Depending on how many registers and how much local memory a block of threads requires, multiple blocks can be assigned to one SM. GPU data is stored in a 512 MB global memory. Each block can use shared memory as a cache. The hardware can coalesce several read or write operations into one larger operation, so it is necessary to keep threads accessing memory consecutively.

5.2 Performance Evaluation

In this section, we first compare the performance of in-core sort, GPU Quicksort, GPU Sample Sort, and Thrust Merge Sort on different data sets of unsigned integers. Six different types of data sets are used: Uniform, Sorted, Zero, Bucket, Gaussian, and Staggered [11]. Fig. 2 shows the results for different array sizes: in-core sorting outperforms the others because it achieves good load balancing at small cost. The performance evaluation of the out-of-core algorithm on a single GPU is shown in Fig. 3, indicating that the out-of-core algorithm is robust and capable of handling data with different distributions and sizes efficiently. At last, the comparison of out-of-core algorithm performance between a single GPU and two GPUs is shown in Fig. 4. It is clear that the out-of-core sorting algorithm reaches near-linear speedup on two GPUs, showing that our out-of-core algorithm has good scalability when the bandwidth between main memory and GPU memory is not a bottleneck.

Figure 3. Performance of the out-of-core algorithm for different data distributions (Bucket, Gaussian, Sorted, Staggered, Uniform, Zero; array sizes 256 MB to 2048 MB)

Figure 4. Performance comparison of out-of-core sorting between a single GPU and two GPUs (array sizes 256 MB to 2048 MB)

6. Conclusion and Future Work

In this paper, we present GPUMemSort, a high-performance graphics co-processor sorting framework for large-scale in-memory data that exploits highly parallel GPU processors. We test the performance of the algorithm on multiple data sets, and the results show that GPUMemSort outperforms other multi-core based parallel sorting algorithms. In the future, we will try to extend our algorithm to GPU-augmented clusters, and to implement and optimize it on distributed heterogeneous architectures. In addition, we will try to enhance the in-core sorting algorithm to help GPUMemSort reach even higher performance. A significant conclusion drawn from this work is that GPUMemSort can break through the limitation of GPU global memory and can sort large-scale in-memory data efficiently.

References

[1] NVIDIA CUDA (Compute Unified Device Architecture). http://developer.nvidia.com/object/cuda.html
[2] OpenCL. http://www.khronos.org/opencl/
[3] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics co-processor sorting for large database management. SIGMOD, 2006.
[4] D. R. Helman, J. JaJa, and D. A. Bader. A New Deterministic Parallel Sorting Algorithm with an Experimental Evaluation. ACM Journal of Experimental Algorithmics (JEA), Volume 3, September 1998.
[5] H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14, pp. 361-372, 1992.
[6] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA.
PPoPP 2008: 73-82.
[7] T. Purcell, C. Donner, M. Cammarano, H. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 41-50, 2003.
[8] P. Kipfer, M. Segal, and R. Westermann. UberFlow: A GPU-based particle engine. SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2004.
[9] A. Greß and G. Zachmann. GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. The 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, pp. 1-10, 2006.
[10] E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. Journal of Parallel and Distributed Computing, pp. 1381-1388, 2007.
[11] D. Cederman and P. Tsigas. A Practical Quicksort Algorithm for Graphics Processors. Technical Report, Gothenburg, Sweden, pp. 246-258, 2008.
[12] N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. IEEE International Parallel and Distributed Processing Symposium, 2010 (available at http://arxiv1.library.cornell.edu/abs/0909.5649).
[13] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for many-core GPUs. IEEE International Parallel and Distributed Processing Symposium, 2009.
[14] F. Putze, P. Sanders, and J. Singler. The Multi-Core Standard Template Library (Extended Poster Abstract). Symposium on Principles and Practice of Parallel Programming (PPoPP), 2007.