Cloud-based OpenMP Parallelization Using a MapReduce Runtime. Rodolfo Wottrich, Rodolfo Azevedo and Guido Araujo University of Campinas



Motivating example (MPI): even a simple "greetings" program requires explicit rank management and message passing:

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank != 0) {
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }
    MPI_Finalize();

OpenMP: directive-annotated code

    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++) {
        a[i] = 2*a[i];
    }

MapReduce: a framework for distributed processing of big data, built on functional programming concepts.

    map({1,2,3,4}, (*2))  =  {2,4,6,8}
    reduce({1,2,3,4}, (*))  =  {24}

    for (i = 0; i < 16; i++) {
        a[i] = 2*a[i];
    }


OpenMR Syntax

    #pragma omp mapreduce for input(a[]) \
        output(sum) reduction(+:sum)
    for (i = 0; i < N; i++) {
        sum += 2*a[i];
    }

    #pragma omp mapreduce data
    int a[10000];
    #pragma omp mapreduce data
    int b[10000:nodes][10000];
    #pragma omp mapreduce data
    int c[10000:nodes][10000:nodes];

OpenMR Preparing the MR job - before

    #pragma omp mapreduce for input(a[]) \
        output(sum) reduction(+:sum)
    for (i = 0; i < N; i++) {
        sum += 2*a[i];
    }

OpenMR Preparing the MR job - after

    for (i = 0; i < N; i++) {
        file_write(omr_input, &i, 1);
    }
    file_write(omr_data, &a, N);
    omr_push_files();
    omr_trigger_and_wait();
    omr_retrieve_output();
    sum += file_read(omr_output);

OpenMR Preparing the MR job - mapper

    int main() {
        sum = 0;
        file_read(omr_data, &a, N);
        while (buffer = getline()) {
            i = getiteration(buffer);
            sum = 2*a[i];               // The actual loop body
            printf("0\t%d\n", sum);     // Print output
        }
    }

OpenMR Preparing the MR job - reducer

    int main() {
        sum = 0;
        while (buffer = getline()) {
            tmp = getvalue(buffer);
            sum += tmp;
        }
        printf("%d\n", sum);            // Print output
    }

OpenMR Classes of applications
- DOALL loops (no cross-iteration dependences)
- No explicit synchronization constructs

Benchmarks: experiments use benchmarks with code equivalent to OpenMR:
- SPEC OMP2012 (compute-bound)
- Rodinia (compute-bound)
- Synthetic benchmarks (I/O-bound)

SPEC: 358.botsalgn, 372.smithwa
Rodinia: b+tree, lavamd, myocyte
Synthetic: Vector Add, Dot Product, Matrix-vector Multiplication, Matrix-matrix Multiplication

Experimental Setup
1. Baseline: 3 x Intel Xeon (8-core)
2. Cloud experiments: Amazon AWS (S3 + EC2 + EMR), 235 EC2 instances (Intel Xeon): 1 m1.small + 234 c1.medium (468 vCPUs)

Results: 358.botsalgn on Amazon EMR (chart)

Results: 372.smithwa on Amazon EMR (chart)

Results: lavamd on Amazon EMR (chart)

Results: Synthetic benchmarks on Amazon EMR (chart)

Related work

                    Based on             Target              Elasticity        Fault tolerance  Programmability
    MPI             ---                  Cloud CPUs          No                No               -
    OpenMP          Pthreads             Local CPUs          ---               ---              ++
    MapReduce       ---                  Cloud CPUs          Yes               Yes              -/+
    OpenACC         OpenMP/OpenCL/CUDA   Local accelerators  ---               ---              ++
    SnuCL           OpenCL               Cloud accelerators  No                No               -/+
    PGAS            ---                  Cloud CPUs          No                No (X10)         -/+
    Elastic OpenMP  OpenMP               Cloud CPUs          Yes (vertical)    No               +
    OpenMR          OpenMP + MapReduce   Cloud CPUs          Yes (horizontal)  Yes              ++
