High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach


1 High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

Beniamino Di Martino, Antonio Esposito and Andrea Barbato
Department of Industrial and Information Engineering, Second University of Naples, Aversa, Italy
7th International Conference on Internet and Distributed Cloud Computing (IDCS'14)

2 Motivations

- High Performance Computing requires expensive machines
  - Leverage the virtually unlimited pool of resources offered by the Cloud
  - The pay-as-you-go service model reduces initial investments
  - The Cloud's elasticity reduces wasted computing power
- Ease the porting of applications from on-premises environments to the Cloud
  - Reuse existing sequential code
  - A number of environments/techniques/languages already exist for the development of sequential programs
  - The lack of shared programming interfaces can hamper the porting process
- Exploit the naturally distributed characteristics of Cloud solutions

3 Objective

- Realize the automatic transformation of a class of sequential algorithms into a corresponding parallel version
- Make the parallel version compatible with a target Cloud environment
- Apply two levels of parallelization
  - 1st level: use parallel skeletons to port to the Cloud
  - 2nd level: use GPU simulation

[Figure: transformation pipeline with the labels Serial Code, Code Analyser, Translator, Parallel Code, Parallel Skeletons, MapReduce + GPGPU]

4 Employed technologies\1 Parallel Skeletons

- There are recurring patterns in parallel applications
- Those patterns can be generalized into skeletons
- Applications are assembled as combinations of such patterns
- Functional point of view
  - Skeletons are higher-order functions
  - Skeletons support compositional semantics
  - Applications become compositions of stateless functions
- Orchestration and synchronization of the parallel activities are implicitly defined and hidden from the programmer
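As a minimal sketch of the "skeletons are higher-order functions" idea, the C++ fragment below defines a map skeleton and a reduce skeleton and composes them. The names `map_skeleton`, `reduce_skeleton` and `sum_of_squares` are hypothetical and do not come from the slides; the loops run sequentially here, whereas a real skeleton library would hide the parallel orchestration behind the same interface.

```cpp
#include <numeric>
#include <vector>

// Map skeleton (hypothetical name): applies a stateless function f to every
// element. Each application is independent, so a parallel back end could run
// them concurrently; this sketch runs them one after another.
template <typename T, typename F>
auto map_skeleton(const std::vector<T>& in, F f)
    -> std::vector<decltype(f(in[0]))> {
    std::vector<decltype(f(in[0]))> out;
    out.reserve(in.size());
    for (const T& x : in) out.push_back(f(x));
    return out;
}

// Reduce skeleton (hypothetical name): folds the elements with a binary
// operator, starting from init.
template <typename T, typename Op>
T reduce_skeleton(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}

// An "application" built purely by composing the two skeletons.
inline double sum_of_squares(const std::vector<double>& v) {
    const std::vector<double> squares =
        map_skeleton(v, [](double x) { return x * x; });
    return reduce_skeleton(squares, 0.0,
                           [](double a, double b) { return a + b; });
}
```

Calling `sum_of_squares` on the vector (1, 2, 3) squares each element and then sums the squares; the caller never sees how the work inside the skeletons is scheduled.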

5 Employed technologies\2 Map Reduce

A programming model, with an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
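The programming model can be illustrated with a small in-memory sketch: a map phase emits key/value pairs, the pairs are grouped by key, and a reduce phase folds each group. Word counting is used only as a familiar example and is not taken from the slides; all function names are hypothetical, and a `std::map` stands in for the distributed shuffle that a real MapReduce framework performs across a cluster.

```cpp
#include <map>
#include <numeric>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Map phase: emit one (word, 1) pair per whitespace-separated word.
std::vector<KeyValue> map_phase(const std::string& line) {
    std::vector<KeyValue> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.emplace_back(w, 1);
    return out;
}

// Shuffle stand-in: collect all values that share the same key.
std::map<std::string, std::vector<int>> group_by_key(
        const std::vector<KeyValue>& pairs) {
    std::map<std::string, std::vector<int>> groups;
    for (const KeyValue& kv : pairs) groups[kv.first].push_back(kv.second);
    return groups;
}

// Reduce phase: sum the values collected for one key.
int reduce_phase(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0);
}

// Driver: map every input line, group the emitted pairs, reduce each group.
std::map<std::string, int> word_count(const std::vector<std::string>& lines) {
    std::vector<KeyValue> emitted;
    for (const std::string& line : lines) {
        const std::vector<KeyValue> pairs = map_phase(line);
        emitted.insert(emitted.end(), pairs.begin(), pairs.end());
    }
    std::map<std::string, int> result;
    for (const auto& group : group_by_key(emitted))
        result[group.first] = reduce_phase(group.second);
    return result;
}
```

Because `map_phase` and `reduce_phase` keep no state between calls, a framework is free to run them on different machines, which is the property the model relies on.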

6 Employed technologies\3 GPGPU

- General-purpose computing on graphics processing units
- OpenCL is the currently dominant open general-purpose GPU computing language; the dominant proprietary framework is Nvidia's CUDA
- Single-Program Multiple-Data (SPMD)
- CUDA programming uses keywords provided as extensions to high-level programming languages like C/C++
- A kernel is organized as a hierarchy in which threads are grouped into blocks, and blocks into a grid
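The thread/block/grid hierarchy can be sketched on the host in plain C++. The fragment below is an emulation, not CUDA code: it walks every (block, thread) pair of a one-dimensional launch and computes the same global index a CUDA kernel would obtain from `blockIdx.x * blockDim.x + threadIdx.x`. The function names and the scaling operation are illustrative assumptions, and `block_dim` is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements with block_dim threads per
// block (ceiling division). Assumes block_dim > 0.
inline std::size_t blocks_for(std::size_t n, std::size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}

// Host-side emulation of a 1-D launch: every (block, thread) pair executes
// the same "kernel" body on the element selected by its global index (SPMD).
inline std::vector<float> scale_on_emulated_grid(const std::vector<float>& in,
                                                 float factor,
                                                 std::size_t block_dim) {
    std::vector<float> out(in.size());
    const std::size_t grid_dim = blocks_for(in.size(), block_dim);
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            // Global index, as in blockIdx.x * blockDim.x + threadIdx.x.
            const std::size_t i = block * block_dim + thread;
            // Guard: the last block may have threads past the end of the data.
            if (i < in.size()) out[i] = in[i] * factor;
        }
    }
    return out;
}
```

On a GPU the two loops disappear: each thread runs the body once, and only the index computation and the bounds guard remain in the kernel.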

7 Analysis of the source code

- Analysis of the AST through the ROSE compiler
- Recognition of data structures: vectors, matrices, queues, stacks, lists...
- Recognition of computation algorithms: matrix multiplication
- The user is shown the program dependence graph (PDG), with control and data dependencies
- Each node reports an ID, which can be used to trace the code line and the control or data structure it corresponds to

8 Examples of recognized expressions

Matrix multiplication:

    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++) {
            C[i][j] = 0;
            for (int k=0; k<p; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }

Algebraic expressions involving matrices and vectors:

    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++)
            ...
            c[i][j] = alfa * a[i][j] + beta * b[i][j];
            c[i][j] = alfa * a[i][j] + beta * b[i][j] + gamma*d[i][j] + ...
            c[i][j] = alfa*a[i][j]^...

9 Selection of the Skeleton

- Users can tweak the dimension of the sub-blocks into which the matrices will be divided
- If CUDA is selected, options to determine grid and block dimensions are available
- A preview of the data distribution is shown

10 Matrix sub-block Multiplication

[Figure: sub-block decomposition of the matrix product, with dimension labels N, M, P and n, m, p]

- Distribution of blocks will be handled by a Map function
- Calculations are executed by a Reduce function
- First round: execute the sub-matrix multiplications
- Second round: sum the partial results of the sub-block multiplications
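The two rounds can be sketched sequentially in C++. This is an illustration under stated assumptions, not the generated code: the block size `bs` is assumed to divide every matrix dimension, the key layout (I, J, K) and the names `round1`, `round2` and `block_multiply` are hypothetical, and a `std::map` stands in for the key-based grouping that MapReduce performs between Map and Reduce.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // rectangular, row by row
using PartialKey = std::tuple<std::size_t, std::size_t, std::size_t>;

inline Matrix zeros(std::size_t rows, std::size_t cols) {
    return Matrix(rows, std::vector<double>(cols, 0.0));
}

// Round 1: for every block triple (I, J, K), multiply block (I, K) of A by
// block (K, J) of B. Each triple is independent of all the others.
inline std::map<PartialKey, Matrix> round1(const Matrix& A, const Matrix& B,
                                           std::size_t bs) {
    const std::size_t n = A.size(), m = B.size(), p = B[0].size();
    assert(n % bs == 0 && m % bs == 0 && p % bs == 0);
    std::map<PartialKey, Matrix> partials;
    for (std::size_t I = 0; I < n / bs; ++I)
        for (std::size_t J = 0; J < p / bs; ++J)
            for (std::size_t K = 0; K < m / bs; ++K) {
                Matrix part = zeros(bs, bs);
                for (std::size_t r = 0; r < bs; ++r)
                    for (std::size_t c = 0; c < bs; ++c)
                        for (std::size_t k = 0; k < bs; ++k)
                            part[r][c] += A[I * bs + r][K * bs + k] *
                                          B[K * bs + k][J * bs + c];
                partials[std::make_tuple(I, J, K)] = part;
            }
    return partials;
}

// Round 2: sum all partial blocks that share the same (I, J) into C.
inline Matrix round2(const std::map<PartialKey, Matrix>& partials,
                     std::size_t n, std::size_t p, std::size_t bs) {
    Matrix C = zeros(n, p);
    for (const auto& entry : partials) {
        const std::size_t I = std::get<0>(entry.first);
        const std::size_t J = std::get<1>(entry.first);
        for (std::size_t r = 0; r < bs; ++r)
            for (std::size_t c = 0; c < bs; ++c)
                C[I * bs + r][J * bs + c] += entry.second[r][c];
    }
    return C;
}

inline Matrix block_multiply(const Matrix& A, const Matrix& B, std::size_t bs) {
    return round2(round1(A, B, bs), A.size(), B[0].size(), bs);
}
```

Splitting the work this way keeps round 1 free of shared state: every partial product can be computed on a different node, and only round 2 needs the partial results that belong to the same output block.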

11 Matrix sub-block Multiplication: 1st round, Map function

[Figure: the blocks of matrices A and B, each annotated with the key/value pair (K, V) emitted by the Map function]

- Distribution of blocks will be handled by a Map function
- Calculations are executed by a Reduce function

12 Matrix sub-block Multiplication: 1st round, Reduce function

[Figure: sub-block multiplication performed by the Reduce function]

13 Matrix sub-block Multiplication: 2nd round functions

[Figure: the Map function emits key/value pairs (K, V) for the partial results; the Reduce function sums them]

14 Matrix sub-block Multiplication: code produced for the 1st round

15 Use of GPGPU

- The added CUDA code is in charge of:
  - Allocating data structures on the GPU
  - Copying data onto the GPU
  - Kernel execution
  - Copying data back from the GPU
  - De-allocating the data structures
- GPGPU parallelization is applied to the Reduce function
  - Used on the code produced in the second round
- Users can set the number of GPU threads
  - The default value depends on the matrices' dimensions

16 Adding CUDA code

Reducer:

    class MyReducerCUDA : public Reducer {
    public:
        MyReducerCUDA(TaskContext& context) { }

        void reduce(ReduceContext& context) {
            // Host buffers, row-major: A is n x m, B is m x p, C is n x p
            float *A_h = (float *) malloc((n)*(m)*sizeof(float));
            float *B_h = (float *) malloc((m)*(p)*sizeof(float));
            float *C_h = (float *) malloc((n)*(p)*sizeof(float));

            // Each input value carries the matrix name, two indices and the element
            while ( context.nextValue() ) {
                string line = context.getInputValue();
                vector<string> indicesAndValue = splitString(line, ",");
                int i = toInt(indicesAndValue[1]);
                int j = toInt(indicesAndValue[2]);
                float value = toFloat(indicesAndValue[3]);
                if (indicesAndValue[0].compare("A") == 0)
                    A_h[i*m+j] = value;
                else
                    B_h[i*p+j] = value;
            }

            // The CUDA section listed below runs here and fills C_h

            // The key holds the indices of the output block
            string key = context.getInputKey();
            vector<string> blockIndices = splitString(key, ",");
            for (int row=0; row<n; row++)
                for (int col=0; col<p; col++) {
                    int i = toInt(blockIndices[0])*n + row;
                    int j = toInt(blockIndices[1])*p + col;
                    string ii = toString(i);
                    string jj = toString(j);
                    string value = toString(C_h[row*p+col]);
                    context.emit(ii+","+jj+",", value);
                }

            // Release the host buffers
            free(A_h); free(B_h); free(C_h);
        }
    };

CUDA section:

    // Device allocation
    float *A_d; float *B_d; float *C_d;
    cudaMalloc( (void**)&A_d, (n)*(m)*sizeof(float) );
    cudaMalloc( (void**)&B_d, (m)*(p)*sizeof(float) );
    cudaMalloc( (void**)&C_d, (n)*(p)*sizeof(float) );

    // Move data to device
    cudaMemcpy( A_d, A_h, (n)*(m)*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( B_d, B_h, (m)*(p)*sizeof(float), cudaMemcpyHostToDevice );

    // Launch the kernel
    dim3 dimBlock( DIM_BLOCK_X, DIM_BLOCK_Y );
    dim3 dimGrid( DIM_GRID_X, DIM_GRID_Y );
    multiply_matrix<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n, m, p);

    // Move data from device
    cudaMemcpy( C_h, C_d, (n)*(p)*sizeof(float), cudaMemcpyDeviceToHost );

    // Device de-allocation
    cudaFree( A_d ); cudaFree( B_d ); cudaFree( C_d );
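The body of the `multiply_matrix` kernel does not appear on the slide. As a hedged reference for what such a kernel would be expected to compute, the sequential C++ sketch below multiplies two row-major one-dimensional buffers using the same index layout as the host buffers `A_h`, `B_h` and `C_h`. The function name `multiply_matrix_reference` is hypothetical, and the code is a CPU stand-in, not the authors' kernel.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: C (n x p) = A (n x m) * B (m x p).
// All matrices are row-major 1-D buffers, so element (r, c) of a matrix with
// `cols` columns lives at index r * cols + c.
inline std::vector<float> multiply_matrix_reference(const std::vector<float>& A,
                                                    const std::vector<float>& B,
                                                    std::size_t n,
                                                    std::size_t m,
                                                    std::size_t p) {
    std::vector<float> C(n * p, 0.0f);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < p; ++col) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < m; ++k)
                acc += A[row * m + k] * B[k * p + col];
            C[row * p + col] = acc;  // same layout as C_h[row*p+col]
        }
    return C;
}
```

A reference like this is mainly useful for checking the GPU result on small inputs: in a kernel, the two outer loops would typically be replaced by one thread per (row, col) pair.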

17 Quick Demo

24 Conclusions and Future Work

- We are still at a preliminary stage
  - Need skeletons for different computation algorithms
  - Need to specialize skeletons for different programming paradigms
  - Need skeletons for different Cloud platforms
- A performance evaluation of the produced code is missing
  - The overhead of the recognition and transformation process has to be checked
  - Large matrices are needed for the evaluation
  - The time needed to transfer data to the Cloud has to be considered
  - When GPU parallelization is used, the time needed to transfer data onto the GPU has to be considered

25 Thanks for your attention

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

IMAGE PROCESSING WITH CUDA

IMAGE PROCESSING WITH CUDA IMAGE PROCESSING WITH CUDA by Jia Tse Bachelor of Science, University of Nevada, Las Vegas 2006 A thesis submitted in partial fulfillment of the requirements for the Master of Science Degree in Computer

More information

Hybrid Programming with MPI and OpenMP

Hybrid Programming with MPI and OpenMP Hybrid Programming with and OpenMP Ricardo Rocha and Fernando Silva Computer Science Department Faculty of Sciences University of Porto Parallel Computing 2015/2016 R. Rocha and F. Silva (DCC-FCUP) Programming

More information

Introduction to CUDA C

Introduction to CUDA C Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

More information

Retour d expérience : portage d une application haute-performance vers un langage de haut niveau

Retour d expérience : portage d une application haute-performance vers un langage de haut niveau Retour d expérience : portage d une application haute-performance vers un langage de haut niveau ComPAS/RenPar 2013 Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte 16 Janvier 2013 Our Goals Globally

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

GPGPU Parallel Merge Sort Algorithm

GPGPU Parallel Merge Sort Algorithm GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

Parallel Algorithm for Dense Matrix Multiplication

Parallel Algorithm for Dense Matrix Multiplication Parallel Algorithm for Dense Matrix Multiplication CSE633 Parallel Algorithms Fall 2012 Ortega, Patricia Outline Problem definition Assumptions Implementation Test Results Future work Conclusions Problem

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2016 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Programming GPUs with CUDA

Programming GPUs with CUDA Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University johnmc@rice.edu COMP 422 Lecture 23 12 April 2016 Why GPUs? Two major trends GPU performance is pulling away from

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Le langage OCaml et la programmation des GPU

Le langage OCaml et la programmation des GPU Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline

More information

Parallel Computing for Data Science

Parallel Computing for Data Science Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Lecture 3. Optimising OpenCL performance

Lecture 3. Optimising OpenCL performance Lecture 3 Optimising OpenCL performance Based on material by Benedict Gaster and Lee Howes (AMD), Tim Mattson (Intel) and several others. - Page 1 Agenda Heterogeneous computing and the origins of OpenCL

More information

NVIDIA CUDA. NVIDIA CUDA C Programming Guide. Version 4.2

NVIDIA CUDA. NVIDIA CUDA C Programming Guide. Version 4.2 NVIDIA CUDA NVIDIA CUDA C Programming Guide Version 4.2 4/16/2012 Changes from Version 4.1 Updated Chapter 4, Chapter 5, and Appendix F to include information on devices of compute capability 3.0. Replaced

More information

Part I Courses Syllabus

Part I Courses Syllabus Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Introduction to Programming (in C++) Multi-dimensional vectors. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC

Introduction to Programming (in C++) Multi-dimensional vectors. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC Introduction to Programming (in C++) Multi-dimensional vectors Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC Matrices A matrix can be considered a two-dimensional vector,

More information

CUDA Basics. Murphy Stein New York University

CUDA Basics. Murphy Stein New York University CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture

More information

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @

More information

OpenACC Basics Directive-based GPGPU Programming

OpenACC Basics Directive-based GPGPU Programming OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. wienke@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,

More information

Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL

Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL Debugging CUDA Applications Przetwarzanie Równoległe CUDA/CELL Michał Wójcik, Tomasz Boiński Katedra Architektury Systemów Komputerowych Wydział Elektroniki, Telekomunikacji i Informatyki Politechnika

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Amazon EC2 Product Details Page 1 of 5

Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

CUDA Programming. Week 4. Shared memory and register

CUDA Programming. Week 4. Shared memory and register CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring. David Goodwin NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

More information

What is Multi Core Architecture?

What is Multi Core Architecture? What is Multi Core Architecture? When a processor has more than one core to execute all the necessary functions of a computer, it s processor is known to be a multi core architecture. In other words, a

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures 1 Hanwoong Jung, and 2 Youngmin Yi, 1 Soonhoi Ha 1 School of EECS, Seoul National University, Seoul, Korea {jhw7884, sha}@iris.snu.ac.kr

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

HIGH PERFORMANCE BIG DATA ANALYTICS

HIGH PERFORMANCE BIG DATA ANALYTICS HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University

CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum

More information

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

More information

Chapter 6: Programming Languages

Chapter 6: Programming Languages Chapter 6: Programming Languages Computer Science: An Overview Eleventh Edition by J. Glenn Brookshear Copyright 2012 Pearson Education, Inc. Chapter 6: Programming Languages 6.1 Historical Perspective

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

Introduction to GPU Computing

Introduction to GPU Computing Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture

More information

The Stratosphere Big Data Analytics Platform

The Stratosphere Big Data Analytics Platform The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science amir@sics.se June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data

More information

Report: Declarative Machine Learning on MapReduce (SystemML)
