High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

Size: px

Start display at page:

Download "High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach"

Flora Hodge
10 years ago
Views:

1 High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach Beniamino Di Martino, Antonio Esposito and Andrea Barbato Department of Industrial and Information Engineering Second University of Naples Aversa, Italy 7th Internation Conference on Internet and Distributed Cloud Computing- IDCS'4

Industrial and Information Engineering Second University of Naples Aversa,

2 Motivations High Performance Computing requires expensive machines Leverage the virtually unlimited pool of resources offered by the Cloud The pay-as-you-go service model reduces initial investments Clouds' elasticity reduces computing power waste Ease applications' porting from on-premises environment to the Cloud Reusing existing sequential code A number of environments\technique\languages already exist for development of sequential programs Lack of shared programming interfaces can hamper the porting process Exploit the naturally distributed characteristics of Cloud solutions

$on-premises environment to the Cloud Reusing existing sequential code A number of environments\technique\languages already exist for development$

3 Objective Realize the automatic transformation of a class of sequential algorithms into a corresponding parallel version Make the parallel version compatible with a target Cloud environment Apply two levels of parallelization st level Use parallel skeletons to port to the Cloud nd level Use GPU simulation Serial Code Code Analyser Translator Parallel Code Parallel Skeletons MapReduce + GPGPU

environment Apply two levels of parallelization st level Use parallel skeletons to port to the

4 Employed technologies\ Parallel Skeletons There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as combination of such patterns Functional point of view Skeletons are Higher-Order Functions Skeletons support a compositional semantic Applications become composition of state-less functions Orchestration and synchronization of the parallel activities are implicitly defined and hidden to the programmer

Skeletons are Higher-Order Functions Skeletons support a compositional semantic Applications become composition of

5 Employed technologies\ Map Reduce Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster

6 Employed technologies\3 GPGPU General-purpose computing on graphics processing units OpenCL is the currently dominant open general-purpose GPU computing language. The dominant proprietary framework is Nvidia's CUDA Single-Program Multiple Data (SPMD) CUDA programming use keywords provided as extensions to high-level programming languages like C/C++ A kernel is organized as a hierarchy structure in which threads are grouped into blocks, and blocks into a grid

The dominant proprietary framework is Nvidia's CUDA Single-Program Multiple Data (SPMD) CUDA programming use

7 Analysis of the source code Analysis of the AST through ROSE compiler Recognition of data structures Vectors, Matrices, Queues, Stacks, Lists... Recognition of computation algorithms Matrix multiplication The user is shown the PDG graph Control and data dependency Each node reports an ID which can be used to trace the code line and the relative control or data structure corresponding to it.

.. Recognition of computation algorithms Matrix multiplication The user is shown the PDG graph

8 Examples of recognized expressions. Matrix multiplication for (int i=0; i<n; i++) for (int j=0; j<m; j++) { C[i][j] = 0; for (int k=0; k<p; k++) C[i][j] = C[i][j] + A[i][k] * B[k][j]; }. Algebraic expressions involving matrices and vectors for (int i=0; i<n; i++) for (int j=0; j<m; j++)... c[i][j] = alfa * a[i][j] + beta * b[i][j]; c[i][j] = alfa * a[i][j] + beta * b[i][j] +gamma*d[i][j]+... c[i][j] = alfa*a[i][j]^...

9 Selection of the Skeleton Skeleton selection Users can tweak the dimension of the sub-block in which the matrices will be divided If CUDA is selected, options to determine grid and block dimensions are available A preview of the data distribution is shown

divided If CUDA is selected, options to determine grid and

10 Matrix sub-block Multiplication p m X N n M P = M m Distribution of blocks will be handled by a Map function Calculations are executed by Reduce function First round: execute sub-matrix multiplication Second round: sum the partial results of the sub-block multiplication

are executed by Reduce function First round: execute sub-matrix

11 K0=0,,0 V0=A,0, K0=0,,0 V0=A,0,0 K0=0,,0 V0=A,0,0 5 K0=0,,0 V0=A,0,0 4 K0=0,0,0 V0=A,0,0 K0=0,,0 V0=A,0,0 K0=0,0,0 V0=A,0,0 3 K0=0,,0 V0=A,0,0 K0=0,0,0 V0=A,0,0 K0=0,0,0 V0=A,0,0 A 0 K0=0,0,0 V0=A,0,0 K0=0,0,0 V0=A,0,0 Matrix sub-block Multiplication st round Map Function B K0=0,0,0 V0=A,0,0 0 3 K0=0,0,0 V0=A,0,0 K0=0,0,0 V0=A,0, K0=0,0,0 V0=A,0,0 K0=0,0,0 V0=A,0, K0=0,0,0 V0=A,0,0 K0=0,,0 V0=A,0, K0=0,,0 V0=A,0,0 K0=0,,0 V0=A,0, K0=0,,0 V0=A,0,0 K0=0,,0 V0=A,0,0 0 3 K0=0,,0 V0=A,0,0 Distribution of blocks will be handled by a Map function Calculations are executed by Reduce function

K0=0,0,0 V0=A,0,0 K0=0,0,0 V0=A,0,0 4 5 6 7 K0=0,0,0 V0=A,0,0 K0=0,0,0 V0=A,0,0 8 9 0 K0=0,0,0 V0=A,0,0 K0=0,,0 V0=A,0,0 3 4 5 K0=0,,0 V0=A,0,0 K0=0,,0 V0=A,0,0 6

12 Matrix sub-block Multiplication = st round Reduce Function

13 Matrix sub-block Multiplication nd round functions K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 K0=0,0 V0=A,0,0.0 Map Function Tecniche di trasformazione automatica del codice per l High Reduce Function + = Barbato -

0 Map Function Tecniche di trasformazione automatica del codice per l High

14 Matrix sub-block Multiplication Code produced for st round

15 Use of GPGPU Added CUDA code is in charge of: Allocating data structures on the GPU Copying data onto the GPU Kernel execution Copying data back from the GPU De-allocation of data structures GPGPU parallelization applied to Reduce function Used on the code produced in the second round Users can set the number of GPU threads Default value depends on matrices' dimensions

structures GPGPU parallelization applied to Reduce function Used on the code produced in the

16 Adding CUDA code class MyReducerCUDA : public Reducer { public: MyReducerCUDA(TaskContext& context) { } void reduce(reducecontext& context) { float *A_h = (float *) malloc((n)*(m)*sizeof(float)); float *B_h = (float *) malloc((m)*(p)*sizeof(float)); float *C_h = (float *) malloc((n)*(p)*sizeof(float)); while ( context.nextvalue() ) { string line = context.getinputvalue(); vector<string> indicesandvalue = splitstring(line, ","); int i = toint(indicesandvalue[]); int j = toint(indicesandvalue[]); float value = tofloat(indicesandvalue[3]); if(indicesandvalue[0].compare("a")==0) A_h[i*m+j] = value; else B_h[i*p+j] = value; } string key = context.getinputkey(); vector<string> blockindices = splitstring(key, ","); for(int row=0; row<n; row++) for(int col=0; col<p; col++) { int i = toint(blockindices[0])*n + row; int j = toint(blockindices[])*p + col; string ii = tostring(i); string jj = tostring(j); string value = tostring(c_h[row*p+col]); context.emit(ii+","+jj+",", value); } } }; // Device Allocation float *A_d; float *B_d; float *C_d; cudamalloc( (void**)&a_d, (n)*(m)*sizeof(float) ); cudamalloc( (void**)&b_d, (m)*(p)*sizeof(float) ); cudamalloc( (void**)&c_d, (n)*(p)*sizeof(float) ); // Move data to device cudamemcpy( A_d, A_h, (n)*(m)*sizeof(float),cudamemcpyhosttodevice ); cudamemcpy( B_d, B_h, (m)*(p)*sizeof(float),cudamemcpyhosttodevice ); // Launch the kernel dim3 dimblock( DIM_BLOCK_X, DIM_BLOCK_Y ); dim3 dimgrid( DIM_GRID_X, DIM_GRID_Y ); multiply_matrix<<<dimgrid, dimblock>>>(a_d, B_d, C_d, n, m, p); // Move data from device cudamemcpy( C_h, C_d, (n)*(p)*sizeof(float), cudamemcpydevicetohost ); // Device De-allocation cudafree( A_d ); cudafree( B_d ); cudafree( C_d );

getinputvalue(); vector<string> indicesandvalue = splitstring(line, ","); int i = toint(indicesandvalue[]); int j = toint(indicesandvalue[]); float value = tofloat(indicesandvalue[3]);

17 Quick Demo

18 Quick Demo

19 Quick Demo

20 Quick Demo

21 Quick Demo

22 Quick Demo

23 Quick Demo

24 Conclusions and Future Work We are still at a preliminary stage Need skeletons for different computation algorithms Need to specialize skeletons for different programming paradigms Need skeletons for different Cloud platforms A performance evaluation of the produced code is missing Overhead of the recognition and transformation process has to be checked Matrices of important dimension are needed for the evaluation Time needed to transfer data to the cloud has to be considered When GPU parallelization is used, time needed to transfer data onto it has to be considered

25 Thanks for your attention

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel