Cloud-based OpenMP Parallelization Using a MapReduce Runtime
Rodolfo Wottrich, Rodolfo Azevedo and Guido Araujo, University of Campinas

2
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank != 0) {
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }
    MPI_Finalize();

4 OpenMR

5 OpenMP
Directive-annotated code
    #pragma omp parallel for private(i)
    for(i = 0; i < N; i++) {
        a[i] = 2*a[i];
    }

6 MapReduce
- Framework for distributed processing
- BigData
- Functional programming concepts

7
    map({1,2,3,4}, (*2)) -> {2,4,6,8}
    reduce({1,2,3,4}, (*)) -> {24}

8
    for(i = 0; i < 16; i++) {
        a[i] = 2*a[i];
    }

9 OpenMR

10 OpenMR

11 OpenMR Syntax
    #pragma omp mapreduce for input(a[]) \
                              output(sum) reduction(+:sum)
    for(i = 0; i < N; i++) {
        sum += 2*a[i];
    }

12
    #pragma omp mapreduce data
    int a[10000];
    #pragma omp mapreduce data
    int b[10000:nodes][10000];
    #pragma omp mapreduce data
    int c[10000:nodes][10000:nodes];

13 OpenMR Preparing the MR job - before
    #pragma omp mapreduce for input(a[]) \
                              output(sum) reduction(+:sum)
    for(i = 0; i < N; i++) {
        sum += 2*a[i];
    }

16 OpenMR Preparing the MR job - after
    for(i = 0; i < N; i++) {
        file_write(omr_input, &i, 1);
    }
    file_write(omr_data, &a, N);
    omr_push_files();
    omr_trigger_and_wait();
    omr_retrieve_output();
    sum += file_read(omr_output);

17 OpenMR Preparing the MR job - mapper
    int main() {
        sum = 0;
        file_read(omr_data, &a, N);
        while(buffer = getline()) {
            i = getiteration(buffer);
            sum = 2*a[i];            // The actual loop body
            printf("0\t%d\n", sum);  // Print output
        }
    }

18 OpenMR Preparing the MR job - reducer
    int main() {
        sum = 0;
        while(buffer = getline()) {
            tmp = getvalue(buffer);
            sum += tmp;
        }
        printf("%d\n", sum);  // Print output
    }

19 OpenMR Classes of applications
- DOALL loops
- No explicit synchronization constructs

20 Benchmarks
Experiments: benchmarks with code equivalent to OpenMR
- SPEC OMP2012 (compute-bound)
- Rodinia (compute-bound)
- Synthetic benchmarks (I/O-bound)

21
SPEC: 358.botsalgn, 372.smithwa
Rodinia: b+tree, lavaMD, myocyte
Synthetic: Vector Add, Dot Product, Matrix-vector Multiplication, Matrix-matrix Multiplication

22 Experimental Setup
1. Baseline: Intel Xeon (8-core) x 3
2. Cloud experiments: Amazon AWS: S3 + EC2 + EMR
   235 EC2 instances (Intel Xeon)
   - 1 m1.small
   - 234 c1.medium (468 vCPUs)

23 Results: 358.botsalgn + Amazon EMR

24 Results: 372.smithwa + Amazon EMR

25 Results: lavaMD + Amazon EMR

26 Results: Synthetic + Amazon EMR

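27 Related work

               | Based on             | Target             | Elasticity       | Fault tolerance | Programmability
---------------+----------------------+--------------------+------------------+-----------------+----------------
MPI            | ---                  | Cloud CPUs         | No               | No              | -
OpenMP         | Pthreads             | Local CPUs         |                  |                 |
MapReduce      | ---                  | Cloud CPUs         | Yes              | Yes             | -/+
OpenACC        | OpenMP, OpenCL, CUDA | Local Accelerators |                  |                 |
SnuCL          | OpenCL               | Cloud Accelerators | No               | No              | -/+
PGAS           | ---                  | Cloud CPUs         | No               | No (X10)        | -/+
Elastic OpenMP | OpenMP               | Cloud CPUs         | Yes (vertical)   | No              | +
OpenMR         | OpenMP, MapReduce    | Cloud CPUs         | Yes (horizontal) | Yes             | ++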
