Parallel Algorithm for Dense Matrix Multiplication

Similar documents

Hybrid Programming with MPI and OpenMP

What is Multi Core Architecture?

Matrix Multiplication

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3

Cellular Computing on a Linux Cluster

6. Cholesky factorization

OpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware

SOLVING LINEAR SYSTEMS

Spring 2011 Prof. Hyesoon Kim

DATA ANALYSIS II. Matrix Algorithms

FPGA area allocation for parallel C applications

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

Workshare Process of Thread Programming and MPI Model on Multicore Architecture

Here are some examples of combining elements and the operations used:

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP

1.2 Solving a System of Linear Equations

Reliable Systolic Computing through Redundancy

MATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix.

Lecture Notes 2: Matrices as Systems of Linear Equations

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

Performance metrics for parallelism

Elementary Matrices and The LU Factorization

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Optimization on Huygens

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays

Performance Analysis and Optimization Tool

Advances in Natural and Applied Sciences

Introduction to Matrix Algebra

Lecture 4: Partitioned Matrices and Determinants

A New Digital Encryption Scheme: Binary Matrix Rotations Encryption Algorithm

Advances in Smart Systems Research : ISSN : Vol. 3. No. 3 : pp.

7 Gaussian Elimination and LU Factorization

Fast Multipole Method for particle interactions: an open source parallel library component

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Math 312 Homework 1 Solutions

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

1 Review of Least Squares Solutions to Overdetermined Systems

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Contributions to Gang Scheduling

HPC enabling of OpenFOAM R for CFD applications

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture

Introduction to Programming (in C++) Multi-dimensional vectors. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC

solution flow update flow

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

Factoring Algorithms

3 P0 P0 P3 P3 8 P1 P0 P2 P3 P1 P2

Linear Algebra Notes

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

Least-Squares Intersection of Lines

Data Mining: Algorithms and Applications Matrix Math Review

PeerMon: A Peer-to-Peer Network Monitoring System

A Case Study - Scaling Legacy Code on Next Generation Platforms

3.2 Matrix Multiplication

CUDAMat: a CUDA-based matrix class for Python

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Systems of Linear Equations

Report: Declarative Machine Learning on MapReduce (SystemML)

Shared Memory Abstractions for Heterogeneous Multicore Processors

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line.

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Name: Section Registered In:

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

CS3220 Lecture Notes: QR factorization and orthogonal transformations

Distributed communication-aware load balancing with TreeMatch in Charm++

Lecture 2 Matrix Operations

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

F Matrix Calculus F 1

By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

1 Determinants and the Solvability of Linear Systems

Unit 1. Today I am going to discuss about Transportation problem. First question that comes in our mind is what is a transportation problem?

Further Analysis Of A Framework To Analyze Network Performance Based On Information Quality

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

is in plane V. However, it may be more convenient to introduce a plane coordinate system in V.

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Scientific Computing Programming with Parallel Objects

Parallel and Distributed Computing Programming Assignment 1

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

Linear Algebra: Vectors

Direct Methods for Solving Linear Systems. Matrix Factorization

Performance of Scientific Processing in Networks of Workstations: Matrix Multiplication Example

PARALLEL PROGRAMMING

Map-Reduce for Machine Learning on Multicore

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

A Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization

13 MATH FACTS a = The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

Problem set on Cross Product

Functional-Repair-by-Transfer Regenerating Codes

Load Balancing Techniques

Mesh Generation and Load Balancing

Notes on Cholesky Factorization

Transcription:

Parallel Algorithm for Dense Matrix Multiplication CSE633 Parallel Algorithms Fall 2012 Ortega, Patricia

Outline Problem definition Assumptions Implementation Test Results Future work Conclusions

Problem definition Given a matrix A(m r) m rows and r columns, where each of its elements is denoted a ij with 1 i m and 1 j r, and a matrix B(r n) of r rows and n columns, where each of its elements is denoted b ij with 1 i r, and 1 j n, the matrix C resulting from the operation of multiplication of matrices A and B, C = A B, is such that each of its elements is denoted ij with 1 i m and 1 j n, and is calculated follows

Problem definition (Cont.) m x r a 11 a 12 a 13 b 11 b 12 a 21 a 22 a 23 b 21 b 22 b 31 b 32 r x n The simplest way of multiplying two matrices takes about n 3 steps.

Problem definition (Cont.) The number of operation required to multiply A x B is: m n (2r-1) For simplicity, usually it is analyzed in terms of square matrices of order n. So that the quantity of basic operations between scalars is : 2n 3 - n 2 = O(n 3 )

Sequential algorithm for (i = 0; i < n; i++) for (j = 0; i < n; j++) c[i][j] = 0; for (k = 0; k < n; k++) c[i][j] += a[i][k] * b[k][j] end for end for end for

Assumptions For simplicity, we will work with square matrices of size n x n. Considered the number of processors available in parallel machines as p. The matrixes to multiply will be A and B. Both will be treated as dense matrices (with few 0's), the result will be stored it in the matrix C. It is assumed that the processing nodes are homogeneous, due this homogeneity it is possible achieve load balancing.

Implementation Consider two square matrices A and B of size n that have to be multiplied: 1. Partition these matrices in square blocks p, where p is the number of processes available. 2. Create a matrix of processes of size p 1/2 x p 1/2 so that each process can maintain a block of A matrix and a block of B matrix. 3. Each block is sent to each process, and the copied sub blocks are multiplied together and the results added to the partial results in the C sub-blocks. 4. The A sub-blocks are rolled one step to the left and the B sub-blocks are rolled one step upward. 5. Repeat steps 3 y 4 sqrt(p) times.

Implementation (Cont.)

Example Matrices to be multiplied

Example: These matrices are divided into 4 square blocks as follows: P 0,0 P 0,1 P 0,0 P 0,1 2 1 0 7 5 3 1 6 6 1 4 5 2 3 6 5 P 1,0 P 1,1 P 1,0 P 1,1 9 2 5 3 4 4 7 2 1 9 4 0 8-8 -8 5

Example: Matrices A and B after the initial alignment. 2 1 0 7 5 3 1 6 6 1 4 5 8-8 -8 5 4 4 7 2 9 2 5 3 1 9 4 0 2 3 6 5

Example: Local matrix multiplication. C 0,0 = X = 2 1 0 7 6 1 4 5 16 7 28 35 5 3 1 6 8-8 -8 5 C 0,1 = X = 16-25 -40 22 4 4 7 2 1 9 4 0 C 1,0 = X = 20 36 15 63 9 2 5 3 2 3 6 5 C 1,1 = X = 30 37 42 39

Example: Shift A one step to left, shift B one step up 5 3 1 6 2 1 0 7 1 9 4 0 2 3 6 5 9 2 5 3 4 4 7 2 6 1 4 5 8-8 -8 5

Example: Local matrix multiplication. C 0,0 = C 0,0 + 2 1 X 6 1 = + = 0 7 4 5 16 7 28 35 17 45 25 9 33 52 53 44 5 3 1 6 8-8 16-25 -40 22 10 11 42 35 C 0,1 = C 0,1 + X -8 5 = + + = 26-14 2 57 4 4 7 2 1 9 4 0 20 36 15 63 62 19 42 33 C 1,0 = C 1,0 + X = + + = 82 55 57 96 9 2 2 3 30 37 0-12 C 1,1 = C 1,1 + 5 3 X 6 5 = 42 39 + 40-46 + = 30 25 82-7

Test Objective: Analyze speedup achieved by the parallel algorithm when increases the size of the input data and the number of cores of the architecture. Constraints: The number of cores used must be a exact square root. Must be possible the exact distribution to the total amount of data into the available cores.

Test Execution: Run sequential algorithm on a single processor/ core. For test the parallel algorithm were used the following number of cores: 4,9,16,25,36,49,64,100 The results were obtained from the average over three tests of the algorithms. Test performed in matrices with dimensions up 1000x1000, increasing with steps of 100.

Running Time (sec.) Test Results - Sequential 9 Sequential 8 7 6 5 4 Sequential 3 2 1 0 100 200 300 400 500 600 700 800 900 1000

Time sec. Test Results Sequential vs. Parallel 9 Sequential vs Parallel Procs # 100 8 7 6 5 4 Procs. 1 Procs. 100 3 2 1 0 100 200 300 400 500 600 700 800 900 1000

Time (sec) Test Results - Parallel 4 Size:600x600 3.5 3 2.5 2 Size:600x600 1.5 1 0.5 0 Procs. 1 Procs. 25 Procs. 36 Procs. 64 Procs. 100

Test Results - Parallel 16 Speedup 14 12 10 8 Speedup 6 4 2 0 Procs. 25 Procs. 36 Procs. 64 Procs. 100

Time (sec.) Test Results Parallel (Counter Intuitive in small test data) 0.06 Size:100 x 100 0.05 0.04 0.03 Size:100 x 100 0.02 0.01 0 Procs. 1 Procs. 4 Procs. 16 Procs. 25 Procs. 100

Further work Implement the algorithm in OpenMP to compare the performance of the two solutions. Implementation in MPI platform using threads when a processor receives more than one piece of data.

Conclusions The distribution of data and computing division across multiple processors offers many advantages: With MPI it is required less effort in terms of the timing required for data handling, since each process has its own portion. MPI offers flexibility for data exchange.

References [1] V. Vassilevska Williams, "Breaking the Coppersmith- Winograd barrier, [Online]. Available: http://www.cs.berkeley.edu/~virgi/matrixmult.pdf. [Accessed 18 09 2012]. [2] Gupta, Anshul; Kumar, Vipin;, "Scalability of Parallel Algorithms for Matrix Multiplication," Parallel Processing, 1993. ICPP 1993. International Conference on, vol.3, no., pp.115-123, 16-20 Aug. 1993 doi: 10.1109/ICPP.1993.160 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber =4134256&isnumber=4134231

Questions