Cloud-based OpenMP Parallelization Using a MapReduce Runtime Rodolfo Wottrich, Rodolfo Azevedo and Guido Araujo University of Campinas 1
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#define MAX_STRING 100

int main(void) {
    char greeting[MAX_STRING];
    int comm_sz, my_rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank != 0) {
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }
    MPI_Finalize();
    return 0;
}
2
OpenMR 4
OpenMP
Directive-annotated code:
#pragma omp parallel for private(i)
for (i = 0; i < N; i++) {
    a[i] = 2*a[i];
}
5
MapReduce
- Framework for distributed processing
- Big Data
- Functional programming concepts
6
map({1,2,3,4}, (*2)) → {2,4,6,8}
reduce({1,2,3,4}, (*)) → {24}
7
for (i = 0; i < 16; i++) {
    a[i] = 2*a[i];
}
8
OpenMR 9
OpenMR 10
OpenMR
Syntax:
#pragma omp mapreduce for input(a[]) \
        output(sum) reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += 2*a[i];
}
11
#pragma omp mapreduce data int a[10000];
#pragma omp mapreduce data int b[10000:nodes][10000];
#pragma omp mapreduce data int c[10000:nodes][10000:nodes];
12
OpenMR
Preparing the MR job - before
#pragma omp mapreduce for input(a[]) \
        output(sum) reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += 2*a[i];
}
13
OpenMR
Preparing the MR job - after
for (i = 0; i < N; i++) {
    file_write(omr_input, &i, 1);
}
file_write(omr_data, &a, N);
omr_push_files();
omr_trigger_and_wait();
omr_retrieve_output();
sum += file_read(omr_output);
16
OpenMR
Preparing the MR job - mapper
int main() {
    sum = 0;
    file_read(omr_data, &a, N);
    while (buffer = getline()) {
        i = getiteration(buffer);
        sum = 2*a[i];            // The actual loop body
        printf("0\t%d\n", sum);  // Emit (key, value) pair
    }
}
17
OpenMR
Preparing the MR job - reducer
int main() {
    sum = 0;
    while (buffer = getline()) {
        tmp = getvalue(buffer);
        sum += tmp;
    }
    printf("%d\n", sum);  // Print final output
}
18
OpenMR
Classes of applications:
- DOALL loops
- No explicit synchronization constructs
19
Benchmarks
Experiments: benchmarks whose OpenMP code was translated to equivalent OpenMR code
- SPEC OMP2012 (compute-bound)
- Rodinia (compute-bound)
- Synthetic benchmarks (I/O-bound)
20
SPEC OMP2012: 358.botsalgn, 372.smithwa
Rodinia: b+tree, lavaMD, myocyte
Synthetic: Vector Add, Dot Product, Matrix-vector Multiplication, Matrix-matrix Multiplication
21
Experimental Setup
1. Baseline: 3x Intel Xeon (8-core)
2. Cloud experiments: Amazon AWS (S3 + EC2 + EMR)
   - 235 EC2 instances (Intel Xeon)
   - 1 m1.small
   - 234 c1.medium (468 vCPUs)
22
Results: 358.botsalgn + Amazon EMR 23
Results: 372.smithwa + Amazon EMR 24
Results: lavaMD + Amazon EMR 25
Results Synthetic + Amazon EMR 26
Related work

| Approach       | Based on             | Target             | Elasticity       | Fault tolerance | Programmability |
|----------------|----------------------|--------------------|------------------|-----------------|-----------------|
| MPI            | ---                  | Cloud CPUs         | No               | No              | -               |
| OpenMP         | Pthreads             | Local CPUs         | ---              | ---             | ++              |
| MapReduce      | ---                  | Cloud CPUs         | Yes              | Yes             | -/+             |
| OpenACC        | OpenMP, OpenCL, CUDA | Local accelerators | ---              | ---             | ++              |
| SnuCL          | OpenCL               | Cloud accelerators | No               | No              | -/+             |
| PGAS           | ---                  | Cloud CPUs         | No               | No (X10)        | -/+             |
| Elastic OpenMP | OpenMP               | Cloud CPUs         | Yes (vertical)   | No              | +               |
| OpenMR         | OpenMP, MapReduce    | Cloud CPUs         | Yes (horizontal) | Yes             | ++              |
27