High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach
Beniamino Di Martino, Antonio Esposito and Andrea Barbato
Department of Industrial and Information Engineering, Second University of Naples, Aversa, Italy
7th International Conference on Internet and Distributed Cloud Computing - IDCS'14
Motivations
- High Performance Computing requires expensive machines
- Leverage the virtually unlimited pool of resources offered by the Cloud
  - The pay-as-you-go service model reduces initial investments
  - Clouds' elasticity reduces computing power waste
- Ease applications' porting from on-premises environments to the Cloud
  - Reuse existing sequential code
  - A number of environments, techniques and languages already exist for the development of sequential programs
  - Lack of shared programming interfaces can hamper the porting process
- Exploit the naturally distributed characteristics of Cloud solutions
Objective
- Realize the automatic transformation of a class of sequential algorithms into a corresponding parallel version
- Make the parallel version compatible with a target Cloud environment
- Apply two levels of parallelization
  - 1st level: use parallel skeletons to port the code to the Cloud
  - 2nd level: use GPU simulation

Pipeline: Serial Code -> Code Analyser -> Translator -> Parallel Code (Parallel Skeletons: MapReduce + GPGPU)
Employed technologies (1): Parallel Skeletons
- There are recurring patterns in parallel applications
- Those patterns can be generalized into Skeletons
- Applications are assembled as combinations of such patterns
- Functional point of view:
  - Skeletons are Higher-Order Functions
  - Skeletons support a compositional semantics
  - Applications become compositions of state-less functions
- Orchestration and synchronization of the parallel activities are implicitly defined and hidden from the programmer
Employed technologies (2): MapReduce
- A programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster
Employed technologies (3): GPGPU
- General-Purpose computing on Graphics Processing Units
- OpenCL is the currently dominant open general-purpose GPU computing language; the dominant proprietary framework is Nvidia's CUDA
- Single-Program Multiple Data (SPMD) model
- CUDA programming uses keywords provided as extensions to high-level programming languages such as C/C++
- A kernel is organized as a hierarchical structure in which threads are grouped into blocks, and blocks into a grid
Analysis of the source code
- Analysis of the AST through the ROSE compiler
- Recognition of data structures: vectors, matrices, queues, stacks, lists...
- Recognition of computation algorithms: e.g. matrix multiplication
- The user is shown the PDG (control and data dependencies)
  - Each node reports an ID which can be used to trace the corresponding code line and the related control or data structure
Examples of recognized expressions

Matrix multiplication:

    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++) {
            C[i][j] = 0;
            for (int k=0; k<p; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }

Algebraic expressions involving matrices and vectors:

    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++)
            ...
            c[i][j] = alfa * a[i][j] + beta * b[i][j];
            c[i][j] = alfa * a[i][j] + beta * b[i][j] + gamma * d[i][j] + ...
            c[i][j] = alfa * a[i][j] ^ ...
Selection of the Skeleton
- Users can tweak the dimensions of the sub-blocks into which the matrices will be divided
- If CUDA is selected, options to determine grid and block dimensions are available
- A preview of the data distribution is shown
Matrix sub-block Multiplication
[Figure: an N x M matrix times an M x P matrix, decomposed into n x m and m x p sub-blocks whose products are summed]
- Distribution of blocks will be handled by a Map function
- Calculations are executed by a Reduce function
  - First round: execute the sub-matrix multiplications
  - Second round: sum the partial results of the sub-block multiplications
Matrix sub-block Multiplication: 1st round Map Function
[Figure: matrices A and B split into numbered sub-blocks; for each element the Map function emits a key/value pair of the form K=(i,j,k), V=(matrix, row, column, value)]
- Distribution of blocks will be handled by a Map function
- Calculations are executed by a Reduce function
Matrix sub-block Multiplication: 1st round Reduce Function
[Figure: the Reduce function multiplies matching sub-blocks of A and B, producing the partial result blocks]
Matrix sub-block Multiplication: 2nd round functions
[Figure: the 2nd round Map function groups the partial products by destination block; the Reduce function sums them into the final sub-blocks of the result matrix]
Matrix sub-block Multiplication: code produced for the 1st round
Use of GPGPU
- Added CUDA code is in charge of:
  - Allocating data structures on the GPU
  - Copying data onto the GPU
  - Kernel execution
  - Copying data back from the GPU
  - De-allocation of data structures
- GPGPU parallelization is applied to the Reduce function
  - Used on the code produced in the second round
- Users can set the number of GPU threads
  - The default value depends on the matrices' dimensions
Adding CUDA code

    class MyReducerCUDA : public Reducer {
    public:
        MyReducerCUDA(TaskContext& context) { }

        void reduce(ReduceContext& context) {
            float *A_h = (float *) malloc((n)*(m)*sizeof(float));
            float *B_h = (float *) malloc((m)*(p)*sizeof(float));
            float *C_h = (float *) malloc((n)*(p)*sizeof(float));
            // Rebuild the host-side sub-blocks from the input key/value pairs
            while ( context.nextValue() ) {
                string line = context.getInputValue();
                vector<string> indicesAndValue = splitString(line, ",");
                int i = toInt(indicesAndValue[1]);
                int j = toInt(indicesAndValue[2]);
                float value = toFloat(indicesAndValue[3]);
                if (indicesAndValue[0].compare("A") == 0)
                    A_h[i*m+j] = value;
                else
                    B_h[i*p+j] = value;
            }

            // Device allocation
            float *A_d; float *B_d; float *C_d;
            cudaMalloc( (void**)&A_d, (n)*(m)*sizeof(float) );
            cudaMalloc( (void**)&B_d, (m)*(p)*sizeof(float) );
            cudaMalloc( (void**)&C_d, (n)*(p)*sizeof(float) );

            // Move data to the device
            cudaMemcpy( A_d, A_h, (n)*(m)*sizeof(float), cudaMemcpyHostToDevice );
            cudaMemcpy( B_d, B_h, (m)*(p)*sizeof(float), cudaMemcpyHostToDevice );

            // Launch the kernel
            dim3 dimBlock( DIM_BLOCK_X, DIM_BLOCK_Y );
            dim3 dimGrid( DIM_GRID_X, DIM_GRID_Y );
            multiply_matrix<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n, m, p);

            // Move results back from the device
            cudaMemcpy( C_h, C_d, (n)*(p)*sizeof(float), cudaMemcpyDeviceToHost );

            // Device de-allocation
            cudaFree( A_d );
            cudaFree( B_d );
            cudaFree( C_d );

            // Emit the result block, translated to global matrix coordinates
            string key = context.getInputKey();
            vector<string> blockIndices = splitString(key, ",");
            for (int row=0; row<n; row++)
                for (int col=0; col<p; col++) {
                    int i = toInt(blockIndices[0])*n + row;
                    int j = toInt(blockIndices[1])*p + col;
                    string ii = toString(i);
                    string jj = toString(j);
                    string value = toString(C_h[row*p+col]);
                    context.emit(ii+","+jj+",", value);
                }
        }
    };
Quick Demo
Conclusions and Future Work
- We are still at a preliminary stage
  - Need skeletons for different computation algorithms
  - Need to specialize skeletons for different programming paradigms
  - Need skeletons for different Cloud platforms
- A performance evaluation of the produced code is still missing
  - The overhead of the recognition and transformation process has to be measured
  - Matrices of significant dimensions are needed for the evaluation
  - The time needed to transfer data to the Cloud has to be considered
  - When GPU parallelization is used, the time needed to transfer data onto the GPU has to be considered as well
Thanks for your attention