High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach
Beniamino Di Martino, Antonio Esposito and Andrea Barbato
Department of Industrial and Information Engineering, Second University of Naples, Aversa, Italy
7th International Conference on Internet and Distributed Cloud Computing - IDCS'14
Motivations
- High Performance Computing requires expensive machines
- Leverage the virtually unlimited pool of resources offered by the Cloud
  - The pay-as-you-go service model reduces initial investments
  - Clouds' elasticity reduces computing power waste
- Ease applications' porting from on-premises environments to the Cloud
  - Reuse existing sequential code
  - A number of environments, techniques and languages already exist for the development of sequential programs
  - Lack of shared programming interfaces can hamper the porting process
- Exploit the naturally distributed characteristics of Cloud solutions
Objective
- Realize the automatic transformation of a class of sequential algorithms into a corresponding parallel version
- Make the parallel version compatible with a target Cloud environment
- Apply two levels of parallelization
  - 1st level: use parallel skeletons to port the code to the Cloud
  - 2nd level: use GPU simulation

Pipeline: Serial Code -> Code Analyser -> Translator -> Parallel Code (Parallel Skeletons: MapReduce + GPGPU)
Employed technologies (1): Parallel Skeletons
- There are recurring patterns in parallel applications
- Those patterns can be generalized into Skeletons
- Applications are assembled as combinations of such patterns
- Functional point of view:
  - Skeletons are Higher-Order Functions
  - Skeletons support a compositional semantics
  - Applications become compositions of state-less functions
- Orchestration and synchronization of the parallel activities are implicitly defined and hidden from the programmer
Employed technologies (2): MapReduce
- A programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster
Employed technologies (3): GPGPU
- General-Purpose computing on Graphics Processing Units
- OpenCL is the currently dominant open general-purpose GPU computing language; the dominant proprietary framework is Nvidia's CUDA
- Single-Program Multiple Data (SPMD) model
- CUDA programming uses keywords provided as extensions to high-level programming languages such as C/C++
- A kernel is organized as a hierarchical structure in which threads are grouped into blocks, and blocks into a grid
Analysis of the source code
- Analysis of the AST through the ROSE compiler
- Recognition of data structures: vectors, matrices, queues, stacks, lists...
- Recognition of computation algorithms: e.g. matrix multiplication
- The user is shown the PDG (control and data dependencies)
  - Each node reports an ID which can be used to trace the corresponding code line and the related control or data structure
Examples of recognized expressions

Matrix multiplication:

    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++) {
            C[i][j] = 0;
            for (int k=0; k<p; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }

Algebraic expressions involving matrices and vectors:

    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++)
            ...
            c[i][j] = alfa * a[i][j] + beta * b[i][j];
            c[i][j] = alfa * a[i][j] + beta * b[i][j] + gamma * d[i][j] + ...
            c[i][j] = alfa * a[i][j] ^ ...
Selection of the Skeleton
- Users can tweak the dimensions of the sub-blocks into which the matrices will be divided
- If CUDA is selected, options to determine grid and block dimensions are available
- A preview of the data distribution is shown
Matrix sub-block Multiplication
[Figure: an N x M matrix times an M x P matrix, decomposed into n x m and m x p sub-blocks whose products are summed]
- Distribution of blocks will be handled by a Map function
- Calculations are executed by a Reduce function
  - First round: execute the sub-matrix multiplications
  - Second round: sum the partial results of the sub-block multiplications
Matrix sub-block Multiplication: 1st round Map Function
[Figure: matrices A and B split into numbered sub-blocks; for each element the Map function emits a key/value pair of the form K=(i,j,k), V=(matrix, row, column, value)]
- Distribution of blocks will be handled by a Map function
- Calculations are executed by a Reduce function
Matrix sub-block Multiplication: 1st round Reduce Function
[Figure: the Reduce function multiplies matching sub-blocks of A and B, producing the partial result blocks]
Matrix sub-block Multiplication: 2nd round functions
[Figure: the 2nd round Map function groups the partial products by destination block; the Reduce function sums them into the final sub-blocks of the result matrix]
Matrix sub-block Multiplication: code produced for the 1st round
Use of GPGPU
- Added CUDA code is in charge of:
  - Allocating data structures on the GPU
  - Copying data onto the GPU
  - Kernel execution
  - Copying data back from the GPU
  - De-allocation of data structures
- GPGPU parallelization is applied to the Reduce function
  - Used on the code produced in the second round
- Users can set the number of GPU threads
  - The default value depends on the matrices' dimensions
Adding CUDA code

    class MyReducerCUDA : public Reducer {
    public:
        MyReducerCUDA(TaskContext& context) { }

        void reduce(ReduceContext& context) {
            float *A_h = (float *) malloc((n)*(m)*sizeof(float));
            float *B_h = (float *) malloc((m)*(p)*sizeof(float));
            float *C_h = (float *) malloc((n)*(p)*sizeof(float));
            // Rebuild the host-side sub-blocks from the input key/value pairs
            while ( context.nextValue() ) {
                string line = context.getInputValue();
                vector<string> indicesAndValue = splitString(line, ",");
                int i = toInt(indicesAndValue[1]);
                int j = toInt(indicesAndValue[2]);
                float value = toFloat(indicesAndValue[3]);
                if (indicesAndValue[0].compare("A") == 0)
                    A_h[i*m+j] = value;
                else
                    B_h[i*p+j] = value;
            }

            // Device allocation
            float *A_d; float *B_d; float *C_d;
            cudaMalloc( (void**)&A_d, (n)*(m)*sizeof(float) );
            cudaMalloc( (void**)&B_d, (m)*(p)*sizeof(float) );
            cudaMalloc( (void**)&C_d, (n)*(p)*sizeof(float) );

            // Move data to the device
            cudaMemcpy( A_d, A_h, (n)*(m)*sizeof(float), cudaMemcpyHostToDevice );
            cudaMemcpy( B_d, B_h, (m)*(p)*sizeof(float), cudaMemcpyHostToDevice );

            // Launch the kernel
            dim3 dimBlock( DIM_BLOCK_X, DIM_BLOCK_Y );
            dim3 dimGrid( DIM_GRID_X, DIM_GRID_Y );
            multiply_matrix<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n, m, p);

            // Move results back from the device
            cudaMemcpy( C_h, C_d, (n)*(p)*sizeof(float), cudaMemcpyDeviceToHost );

            // Device de-allocation
            cudaFree( A_d );
            cudaFree( B_d );
            cudaFree( C_d );

            // Emit the result block, translated to global matrix coordinates
            string key = context.getInputKey();
            vector<string> blockIndices = splitString(key, ",");
            for (int row=0; row<n; row++)
                for (int col=0; col<p; col++) {
                    int i = toInt(blockIndices[0])*n + row;
                    int j = toInt(blockIndices[1])*p + col;
                    string ii = toString(i);
                    string jj = toString(j);
                    string value = toString(C_h[row*p+col]);
                    context.emit(ii+","+jj+",", value);
                }
        }
    };
Quick Demo
Conclusions and Future Work
- We are still at a preliminary stage
  - Need skeletons for different computation algorithms
  - Need to specialize skeletons for different programming paradigms
  - Need skeletons for different Cloud platforms
- A performance evaluation of the produced code is still missing
  - The overhead of the recognition and transformation process has to be measured
  - Matrices of significant dimensions are needed for the evaluation
  - The time needed to transfer data to the Cloud has to be considered
  - When GPU parallelization is used, the time needed to transfer data onto the GPU has to be considered as well
Thanks for your attention