Parallel and Distributed Computing Programming Assignment 1



Due Monday, February 7

For programming assignment 1, you should write two C programs. One should provide an estimate of the performance of ping-pong on the Penguin cluster using MPI. The second program should provide an estimate of the cost of a floating point operation.

1 Ping-Pong

A ping-pong program uses two processes and simply sends a message from (say) process 0 to process 1, and then process 1 sends a message back to process 0. We'll use this approach to evaluate the performance of message-passing using MPI. In its simplest form, the ping-pong algorithm can be described by the following pseudocode.

   /* Use two processes */
   if (my_rank == 0) {
      start timer;
      Send message to process 1;
      Receive message from process 1;
      stop timer;
      print time;
   } else /* my_rank == 1 */ {
      Receive message from process 0;
      Send message to process 0;
   }

Since the times are taken on a single node, ping-pong avoids the problems we might encounter with different clocks on different nodes.

Since the elapsed time will depend on the length of the message, you'll need to take timings for messages of various sizes. At a minimum you should take timings for messages with lengths 0 bytes, 1024 bytes, ..., 131,072 bytes. After taking your timings you can use least squares to fit a line to your data. The reciprocal of the slope is often called the bandwidth, and the intercept the latency. (You can use a software package such as Matlab or Excel to do the least-squares calculations.)
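
If you would rather stay in C for the fit, the slope and intercept of an ordinary least-squares line can be computed directly from the standard closed-form formulas. The sketch below is only illustrative: the helper name fit_line is hypothetical and is not part of any code provided with the assignment.

   /* Hedged sketch: ordinary least-squares fit of time = intercept + slope*length.
    * fit_line is a hypothetical helper name, not part of the assignment's code. */
   void fit_line(const double length[], const double time[], int n,
                 double* slope, double* intercept) {
      double sum_x = 0.0, sum_y = 0.0, sum_xx = 0.0, sum_xy = 0.0;
      int i;

      for (i = 0; i < n; i++) {
         sum_x  += length[i];
         sum_y  += time[i];
         sum_xx += length[i]*length[i];
         sum_xy += length[i]*time[i];
      }
      *slope     = (n*sum_xy - sum_x*sum_y)/(n*sum_xx - sum_x*sum_x);
      *intercept = (sum_y - (*slope)*sum_x)/n;
      /* Bandwidth is 1/(*slope) bytes per second; latency is *intercept seconds. */
   }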

The default timer provided by our MPI implementation (MPI_Wtime()) provides a resolution of 1 microsecond, which may not be much less than the cost of a single ping-pong for small messages. So instead of sending a single message, you should do something like this:

   if (my_rank == 0) {
      start timer;
      for (i = 0; i < MESSAGE_COUNT; i++) {
         Send message to process 1;
         Receive message from process 1;
      }
      stop timer;
      print Average message time = elapsed time/(2*MESSAGE_COUNT);
   } else {
      for (i = 0; i < MESSAGE_COUNT; i++) {
         Receive message from process 0;
         Send message to process 0;
      }
   }

This formulation adds in the loop overhead, but it's still better than the single ping-pong. A MESSAGE_COUNT of 100 isn't too big.

2 Cost of Floating Point Operations

The performance of floating point operations is highly dependent on the code that's executing them. For example, if most of the operations are carried out on operands in level one cache, then the average performance will be far superior to that of code in which many operations are carried out on operands that need to be loaded from main memory. So we can't give a definitive answer such as "on machine X, the average runtime of a floating point operation is Y seconds." Undeterred by such practical considerations, though, we'll persevere.

For our purposes, let's define the cost of a floating point operation to be the average run time of one of the operations in the multiplication of two 1000 × 1000 matrices. That is, suppose we compute the elapsed time of the following calculation:

   for (i = 0; i < 1000; i++)
      for (j = 0; j < 1000; j++) {
         C[i][j] = 0.0;
         for (k = 0; k < 1000; k++)
            C[i][j] += A[i][k]*B[k][j];
      }

Then, since this code executes 1000^3 floating point multiplies and 1000^3 floating point adds, we'll define the cost of a floating point operation to be

   Cost of floating point operation = (Elapsed time)/(2 × 10^9)
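
As a point of reference, a timing harness for this computation might look like the sketch below. This is only a sketch under stated assumptions: it calls gettimeofday directly as a stand-in for the GET_TIME macro described in Section 3, and it allocates the matrices from the heap with flattened indexing for the reasons also given in Section 3.

   /* Hedged sketch of timing the 1000 x 1000 matrix multiplication. */
   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/time.h>

   #define N 1000

   int main(void) {
      double* A = malloc(N*N*sizeof(double));
      double* B = malloc(N*N*sizeof(double));
      double* C = malloc(N*N*sizeof(double));
      struct timeval t0, t1;
      double elapsed;
      int i, j, k;

      /* Initialize the operands so the multiply works on defined values. */
      for (i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 1.0; }

      gettimeofday(&t0, NULL);
      for (i = 0; i < N; i++)
         for (j = 0; j < N; j++) {
            C[i*N + j] = 0.0;
            for (k = 0; k < N; k++)
               C[i*N + j] += A[i*N + k]*B[k*N + j];
         }
      gettimeofday(&t1, NULL);

      /* Elapsed wall-clock time in seconds, then cost per floating point op. */
      elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)/1.0e6;
      printf("Seconds per floating point operation: %e\n", elapsed/2.0e9);

      free(A); free(B); free(C);
      return 0;
   }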

3 Caveats and Details

The two programs should take no input. The ping-pong program should print out average times in seconds for the various message sizes: 0 bytes, 1024 bytes, etc. The matrix-multiplication program should print out the average time in seconds per floating-point operation.

In the ping-pong program be sure to initialize your buffers. There are implementations of MPI that crash when an uninitialized buffer is sent.

For MPI, you should test messages sent across Infiniband. So you should use two nodes for your final data collection. If you use two nodes, by default the MPI implementation will place one process on one node and one process on the other.

For MPI, use MPI_Wtime() for timing:

   double start, finish, elapsed;
   start = MPI_Wtime();
   /* loop of ping-pongs */
   finish = MPI_Wtime();
   elapsed = (finish - start)/(2*number of iterations);   /* average time */

Many implementations of MPI do some on-the-fly setup when messages are initially sent. These first messages may cause your program to report somewhat slower times. This doesn't seem to be an issue on the Penguin cluster, but if you're going to be running the program on other systems it may be useful to add 10-15 ping-pongs before starting the timed ping-pongs.

Use the macro GET_TIME() defined in timer.h (on the class website) for timing matrix multiplication. It takes a double (not a pointer to a double) as an argument, and it returns the number of seconds since some time in the past as a double.

It's very likely that the runtime system on the cluster will give a segmentation fault if you try using a declaration such as

   double A[1000][1000];

The matrix is too big. You're more likely to have success if you allocate the matrices from the heap:

   double* A;
   ...
   A = malloc(1000*1000*sizeof(double));
   ...
   free(A);

If you do this, you'll need to take care of converting a subscript with two indexes into a single subscript. An easy way to do this is to use A[i*1000 + j] instead of A[i][j].
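
Putting the pseudocode from Section 1 together with the caveats above, a minimal, hedged sketch of the timed ping-pong loop might look like the following. The message size, tag, and MESSAGE_COUNT values are illustrative only, and your program still needs to loop over all of the required message sizes.

   /* Hedged sketch of the timed ping-pong loop; names and sizes are illustrative. */
   #include <stdio.h>
   #include <string.h>
   #include <mpi.h>

   #define MESSAGE_COUNT 100
   #define MSG_SIZE 1024   /* one of the required sizes; loop over 0, 1024, ... in your program */

   int main(int argc, char* argv[]) {
      int my_rank, comm_sz, i;
      char buffer[MSG_SIZE];
      double start, finish;

      memset(buffer, 0, sizeof(buffer));   /* initialize the buffer, as required above */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

      if (comm_sz < 2) {
         if (my_rank == 0) fprintf(stderr, "Run with two processes\n");
         MPI_Finalize();
         return 1;
      }

      /* Optionally do 10-15 untimed ping-pongs here before starting the timer. */

      if (my_rank == 0) {
         start = MPI_Wtime();
         for (i = 0; i < MESSAGE_COUNT; i++) {
            MPI_Send(buffer, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buffer, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
         }
         finish = MPI_Wtime();
         printf("Average time for %d-byte message = %e seconds\n",
                MSG_SIZE, (finish - start)/(2.0*MESSAGE_COUNT));
      } else if (my_rank == 1) {
         for (i = 0; i < MESSAGE_COUNT; i++) {
            MPI_Recv(buffer, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buffer, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
         }
      }

      MPI_Finalize();
      return 0;
   }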

4 Analysis

You should include an analysis of your results with your source code. The discussion should include estimates of the slope and intercept for the least-squares regression line approximating your data for the MPI program. Here are some other issues you should consider.

1. How well did the least-squares regression line fit the data? What was the correlation coefficient?

2. Instead of using elapsed time, we might have used CPU time. Would this have made a difference to the times? Would it have given more accurate or less accurate times? Would it have had different effects on the reported times for the two different programs?

3. Was there much variance in the ping-pong times for messages of a fixed size? If so, can you offer a suggestion about why this might be the case? (You may need to generate more data to answer this question.)

4. Assuming your estimates are correct, what (if anything) can be said about the relative costs of communication and computation on the Penguin cluster?

5 Coding and Debugging

When you write a parallel program, you should be extremely careful to follow all the good programming practices you learned in your previous classes: top-down design, modular code, incremental development. Parallel programs are far more complex than serial ones, so good program design is even more important now.

Probably the simplest approach to debugging parallel programs is to add printf statements at critical points in the code. Note that in order to be sure you're seeing the output as it occurs, you should follow each debugging printf with a call to fflush(stdout). Also, since multiple processes can produce output, it's essential that each statement be preceded by its process rank. For example, your debug output statements might have the form

   printf("proc %d > ... (message) ...\n", my_rank, ...);
   fflush(stdout);

The version of MPI that we're running on the cluster does have some support for using gdb (the GNU debugger). To use this, you need to compile your program with the -g option and execute it with the -gdb option:

   $ mpiexec -gdb -n <number of processes> <executable>

For details on this, take a look at the MPICH2 User's Guide. It can be downloaded from http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.3.2-userguide.pdf.
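
For reference, assuming the MPICH compiler wrapper mpicc is what's installed on the cluster (check your local setup) and a source file named, say, pingpong.c, a typical build-and-debug sequence might look like this:

   $ mpicc -g -Wall -o pingpong pingpong.c
   $ mpiexec -gdb -n 2 ./pingpong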

6 Documentation

Follow the standard rules for documentation. Describe the purpose and algorithm, the parameters and variables, and the input and output of each routine. You should also include the analysis with the documentation.

7 Subversion

You should put copies of your source files and any makefiles in your prog1 subversion directory by 10 am on Monday, February 7. (See the document http://www.cs.usfca.edu/~peter/cs625/svn.html on the class website.) You should also turn in a print-out of your source to me before 2 pm on the 7th.

8 Grading

Correctness will be 60% of your grade. Does your program correctly ping-pong for the various required message sizes? Is the matrix multiplication correct? Are the numbers the programs generate reasonable?

Static features will be 10% of your grade. Is the source code nicely formatted and easy to read? Is the source well-documented?

Your analysis will be 30% of your grade. Did you answer the required questions for the analysis? Does your data support your conclusions?

9 Collaboration

You may discuss all aspects of the program with your classmates. However, you should never show any of your code to another student, and you should never copy anyone else's code without an explicit acknowledgment.