The PRAM Model and Algorithms
Advanced Topics, Spring 2008
Prof. Robert van Engelen

Overview
- The PRAM model of parallel computation
- Simulations between PRAM models
- Work-time presentation framework of parallel algorithms
- Example algorithms

The PRAM Model of Parallel Computation
- Parallel Random Access Machine (PRAM)
- Natural extension of the RAM: each processor is a RAM
- Processors operate synchronously
- Earliest and best-known model of parallel computation
- Shared memory with m locations
- p processors P_1, P_2, P_3, ..., P_p, each with private memory
- All processors operate synchronously, by executing load, store, and other operations on data

Synchronous PRAM
- Synchronous PRAM is a SIMD-style model
  - All processors execute the same program
  - All processors execute the same PRAM step (instruction stream) in lock-step
  - Effect of an operation depends on local data
  - Instructions can be selectively disabled (if-then-else flow)
- Asynchronous PRAM
  - Several competing models
  - No lock-step

Classification of the PRAM Model
A PRAM step ("clock cycle") consists of three phases:
1. Read: each processor may read a value from shared memory
2. Compute: each processor may perform operations on local data
3. Write: each processor may write a value to shared memory
The model is refined for concurrent read/write capability:
- Exclusive Read Exclusive Write (EREW)
- Concurrent Read Exclusive Write (CREW)
- Concurrent Read Concurrent Write (CRCW)
CRCW PRAM variants:
- Common CRCW: all processors must write the same value
- Arbitrary CRCW: one of the processors succeeds in writing
- Priority CRCW: the processor with the highest priority succeeds in writing

Comparison of PRAM Models
A model A is less powerful than a model B if either
- the time complexity for solving a problem is asymptotically lower in model B than in A, or
- the time complexity is the same and the work complexity is asymptotically lower in model B than in A.
From weakest to strongest: EREW, CREW, Common CRCW, Arbitrary CRCW, Priority CRCW

Simulations Between PRAM Models
- An algorithm designed for a weaker model can be executed within the same time complexity and work complexity on a stronger model
- An algorithm designed for a stronger model can be simulated on a weaker model with either
  - asymptotically more processors (more work), or
  - asymptotically more time

Simulating a Priority CRCW on an EREW PRAM
Theorem: An algorithm that runs in T time on the p-processor Priority CRCW PRAM can be simulated on the p-processor EREW PRAM to run in O(T log p) time.
A concurrent read or write of a p-processor CRCW PRAM can be implemented on a p-processor EREW PRAM to execute in O(log p) time:
- Q_1, ..., Q_p are the CRCW processors, such that Q_i has to read (write) M[j_i]
- P_1, ..., P_p are the EREW processors
- M_1, ..., M_p denote shared memory locations for special use
- P_i stores <j_i, i> in M_i
- Sort the pairs in lexicographically non-decreasing order in O(log p) time using the EREW merge sort algorithm
- Pick a representative from each block of pairs that have the same first component in O(1) time
- The representative P_i reads (writes) from M[k], with <k,_> in M_i, and copies the data to each M_i in the block in O(log p) time using the EREW segmented parallel prefix algorithm
- P_i reads the data from M_i
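
The sort-and-pick-a-representative idea can be illustrated with a short sequential sketch. The code below is not an EREW execution (the sort and the writes are ordinary Python loops) and the helper name priority_crcw_write and the example data are hypothetical, but it shows how sorting the <j_i, i> pairs turns p concurrent writes into exclusive ones, with the highest-priority (lowest-index) processor winning each location.

# Minimal sequential sketch (not an actual EREW execution) of how one
# Priority CRCW write step can be resolved by sorting, as on the slide.
# Assumed input: writes[i] = (j_i, value_i) for processor i
# (lower i = higher priority).

def priority_crcw_write(memory, writes):
    # Form the pairs <j_i, i> and sort them lexicographically,
    # mimicking the O(log p) EREW merge sort step.
    pairs = sorted((j, i) for i, (j, _) in enumerate(writes))
    # The representative of each block of pairs with the same first
    # component holds the highest-priority (smallest) processor index.
    for pos, (j, i) in enumerate(pairs):
        if pos == 0 or pairs[pos - 1][0] != j:   # first pair of its block
            memory[j] = writes[i][1]             # representative performs the write
    return memory

# Example: processors 0..3 try to write to locations 2, 2, 5, 2.
mem = [0] * 8
print(priority_crcw_write(mem, [(2, 'a'), (2, 'b'), (5, 'c'), (2, 'd')]))
# Location 2 receives 'a' (processor 0 has priority), location 5 receives 'c'.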

Reduction on the EREW PRAM
- Reduce p values on the p-processor EREW PRAM in O(log p) time
- The reduction algorithm uses exclusive reads and writes
- The algorithm is the basis of other EREW algorithms

Sum on the EREW PRAM
Sum of n values using n processors (processor index i)
Input: A[1..n], n = 2^k
Output: S
begin
  B[i] := A[i]
  for h = 1 to log n do
    if i ≤ n/2^h then B[i] := B[2i-1] + B[2i]
  if i = 1 then S := B[i]
end
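
As a sanity check, the balanced-tree summation can be replayed sequentially; the sketch below (a hypothetical helper pram_sum, assuming n is a power of two) simulates each synchronous PRAM step with an ordinary loop over the processor index i.

# Minimal sequential simulation of the EREW PRAM sum algorithm above:
# the inner loop over i plays the role of the n processors in lock-step.
import math

def pram_sum(A):
    n = len(A)                      # assumes n = 2**k
    B = [0] + list(A)               # 1-based array B[1..n]
    for h in range(1, int(math.log2(n)) + 1):
        # one synchronous PRAM step: processors i = 1..n/2^h act in parallel
        for i in range(1, n // 2**h + 1):
            B[i] = B[2*i - 1] + B[2*i]
    return B[1]                     # processor 1 writes S := B[1]

print(pram_sum([5, 3, -6, 2, 7, 10, -2, 8]))   # 27, matching the later example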

Matrix Multiplication
- Consider n × n matrix multiplication with n^3 processors
- Each c_ij = Σ_{k=1..n} a_ik b_kj can be computed on the CREW PRAM in parallel using n processors in O(log n) time
- On the EREW PRAM, the exclusive reads of the a_ij and b_ij values can be satisfied by making n copies of a and b, which takes O(log n) time with n processors (broadcast tree)
- Total time is still O(log n)
- Memory requirement is huge

Matrix Multiplication on the CREW PRAM
Matrix multiply with n^3 processors (processor index (i,j,l))
Input: n × n matrices A and B, n = 2^k
Output: C = AB
begin
  C'[i,j,l] := A[i,l]·B[l,j]
  for h = 1 to log n do
    if l ≤ n/2^h then C'[i,j,l] := C'[i,j,2l-1] + C'[i,j,2l]
  if l = 1 then C[i,j] := C'[i,j,1]
end
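
A sequential sketch of the same computation is shown below. The hypothetical helper pram_matmul uses 0-based indices (the slide uses 1-based), replaces the n^3 processors with nested loops, and assumes n is a power of two.

# Minimal sequential simulation of the CREW PRAM matrix multiply above:
# the loops over (i, j, l) stand in for the n^3 processors.
import math

def pram_matmul(A, B):
    n = len(A)                                   # assumes n = 2**k
    # C1[i][j][l] holds the partial products (C' on the slide), 0-based
    C1 = [[[A[i][l] * B[l][j] for l in range(n)] for j in range(n)]
          for i in range(n)]
    for h in range(1, int(math.log2(n)) + 1):
        for i in range(n):
            for j in range(n):
                # processors with l <= n/2^h sum adjacent pairs
                for l in range(n // 2**h):
                    C1[i][j][l] = C1[i][j][2*l] + C1[i][j][2*l + 1]
    return [[C1[i][j][0] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(pram_matmul(A, B))    # [[19, 22], [43, 50]]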

The WT Scheduling Principle
- The work-time (WT) scheduling principle schedules p processors to execute an algorithm
- The algorithm has T(n) time steps
- A time step can be parallel, i.e. pardo
- Let W_i(n) be the number of operations (work) performed in time unit i, 1 ≤ i ≤ T(n)
- Simulate each set of W_i(n) operations in ⌈W_i(n)/p⌉ parallel steps, for each 1 ≤ i ≤ T(n)
- The p-processor PRAM takes Σ_i ⌈W_i(n)/p⌉ ≤ Σ_i (W_i(n)/p + 1) ≤ W(n)/p + T(n) steps, where W(n) is the total number of operations
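
The bound is easy to check numerically. The sketch below uses a hypothetical helper scheduled_steps and a made-up work profile (a halving reduction) to compare Σ_i ⌈W_i(n)/p⌉ against W(n)/p + T(n).

# Toy check of the WT scheduling bound: with p processors the simulation
# takes sum(ceil(W_i/p)) steps, never more than W(n)/p + T(n).
import math

def scheduled_steps(work_per_step, p):
    return sum(math.ceil(w / p) for w in work_per_step)

work = [64, 32, 16, 8, 4, 2, 1]     # hypothetical profile: W(n) = 127, T(n) = 7
p = 8
print(scheduled_steps(work, p))     # 18 simulated steps
print(sum(work) / p + len(work))    # bound: 127/8 + 7 = 22.875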

Work-Time Presentation
- The WT presentation can be used to determine the computation and communication requirements of an algorithm
- The upper-level WT presentation framework describes the algorithm in terms of a sequence of time units
- The lower level follows the WT scheduling principle

Matrix Multiplication on the CREW PRAM, WT-Presentation
Input: n × n matrices A and B, n = 2^k
Output: C = AB
begin
  for 1 ≤ i, j, l ≤ n pardo
    C'[i,j,l] := A[i,l]·B[l,j]
  for h = 1 to log n do
    for 1 ≤ i, j ≤ n, 1 ≤ l ≤ n/2^h pardo
      C'[i,j,l] := C'[i,j,2l-1] + C'[i,j,2l]
  for 1 ≤ i, j ≤ n pardo
    C[i,j] := C'[i,j,1]
end
WT scheduling principle: O(n^3/p + log n) time

PRAM Recursive Prefix Sum Algorithm
Input: Array of elements (x_1, x_2, ..., x_n), n = 2^k
Output: Prefix sums s_i, 1 ≤ i ≤ n
begin
  if n = 1 then s_1 := x_1; exit
  for 1 ≤ i ≤ n/2 pardo
    y_i := x_{2i-1} + x_{2i}
  Recursively compute the prefix sums of y and store them in z
  for 1 ≤ i ≤ n pardo
    if i is even then s_i := z_{i/2}
    else if i = 1 then s_1 := x_1
    else s_i := z_{(i-1)/2} + x_i
end
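
A direct sequential transcription of this recursion is shown below (hypothetical helper prefix_sums, 0-based Python lists, n assumed a power of two); on the eight-element array used later in the slides it reproduces the prefix sums 5, 8, 2, 4, 11, 21, 19, 27.

# Minimal sequential sketch of the recursive PRAM prefix-sum algorithm above
# (each pardo loop is simulated by an ordinary Python loop).

def prefix_sums(x):
    n = len(x)                       # assumes n = 2**k
    if n == 1:
        return [x[0]]
    # pair up adjacent elements: y_i = x_{2i-1} + x_{2i}
    y = [x[2*i] + x[2*i + 1] for i in range(n // 2)]
    z = prefix_sums(y)               # recursive call on the half-size array
    s = [0] * n
    for i in range(1, n + 1):        # 1-based index i, as on the slide
        if i % 2 == 0:
            s[i - 1] = z[i // 2 - 1]
        elif i == 1:
            s[0] = x[0]
        else:
            s[i - 1] = z[(i - 1) // 2 - 1] + x[i - 1]
    return s

print(prefix_sums([5, 3, -6, 2, 7, 10, -2, 8]))
# [5, 8, 2, 4, 11, 21, 19, 27], matching the C[0,j] row of the later example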

Proof of Work Optimality
Theorem: The PRAM prefix sum algorithm correctly computes the prefix sums and takes T(n) = O(log n) time using a total of W(n) = O(n) operations.
Proof by induction on k, where the input size is n = 2^k:
- Base case k = 0: s_1 = x_1
- Assume correctness for n = 2^k
- For n = 2^{k+1}:
  - For all 1 ≤ j ≤ n/2 we have z_j = y_1 + y_2 + ... + y_j = (x_1 + x_2) + (x_3 + x_4) + ... + (x_{2j-1} + x_{2j})
  - Hence, for i = 2j ≤ n we have s_i = s_{2j} = z_j = z_{i/2}
  - And for i = 2j+1 ≤ n we have s_i = s_{2j+1} = s_{2j} + x_{2j+1} = z_j + x_{2j+1} = z_{(i-1)/2} + x_i
- T(n) = T(n/2) + a, so T(n) = O(log n)
- W(n) = W(n/2) + bn, so W(n) = O(n)

PRAM Nonrecursive Prefix Sum
Input: Array A of size n = 2^k
Output: Prefix sums in C[0,j], 1 ≤ j ≤ n
begin
  for 1 ≤ j ≤ n pardo
    B[0,j] := A[j]
  for h = 1 to log n do
    for 1 ≤ j ≤ n/2^h pardo
      B[h,j] := B[h-1,2j-1] + B[h-1,2j]
  for h = log n downto 0 do
    for 1 ≤ j ≤ n/2^h pardo
      if j is even then C[h,j] := C[h+1,j/2]
      else if j = 1 then C[h,1] := B[h,1]
      else C[h,j] := C[h+1,(j-1)/2] + B[h,j]
end
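
The two passes can be replayed sequentially as follows. The hypothetical helper nonrecursive_prefix_sums keeps B and C in dictionaries keyed by (h, j), assumes n is a power of two, and reproduces the example worked through on the next two slides.

# Minimal sequential simulation of the nonrecursive PRAM prefix-sum algorithm:
# B and C are indexed by (h, j) with 1-based j, as on the slide.
import math

def nonrecursive_prefix_sums(A):
    n = len(A)                                   # assumes n = 2**k
    logn = int(math.log2(n))
    B, C = {}, {}
    for j in range(1, n + 1):                    # leaves of the tree
        B[0, j] = A[j - 1]
    for h in range(1, logn + 1):                 # first pass: bottom-up sums
        for j in range(1, n // 2**h + 1):
            B[h, j] = B[h - 1, 2*j - 1] + B[h - 1, 2*j]
    for h in range(logn, -1, -1):                # second pass: top-down prefixes
        for j in range(1, n // 2**h + 1):
            if j == 1:
                C[h, 1] = B[h, 1]
            elif j % 2 == 0:
                C[h, j] = C[h + 1, j // 2]
            else:
                C[h, j] = C[h + 1, (j - 1) // 2] + B[h, j]
    return [C[0, j] for j in range(1, n + 1)]

print(nonrecursive_prefix_sums([5, 3, -6, 2, 7, 10, -2, 8]))
# [5, 8, 2, 4, 11, 21, 19, 27], as in the two-pass example below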

First Pass: Bottom-Up (example)
B[3,j] = 27
B[2,j] = 4  23
B[1,j] = 8  -4  17  6
B[0,j] = 5  3  -6  2  7  10  -2  8
A[j]   = 5  3  -6  2  7  10  -2  8

Second Pass: Top-Down (example)
B[3,j] = 27                    C[3,j] = 27
B[2,j] = 4  23                 C[2,j] = 4  27
B[1,j] = 8  -4  17  6          C[1,j] = 8  4  21  27
B[0,j] = 5  3  -6  2  7  10  -2  8
C[0,j] = 5  8  2  4  11  21  19  27
A[j]   = 5  3  -6  2  7  10  -2  8

Pointer Jumping
Finding the roots of a forest using pointer jumping

Pointer Jumping on the CREW PRAM
Input: A forest of trees, each with a self-loop at its root, consisting of arcs (i, P(i)) and nodes i, where 1 ≤ i ≤ n
Output: For each node i, the root S[i]
begin
  for 1 ≤ i ≤ n pardo
    S[i] := P[i]
    while S[i] ≠ S[S[i]] do
      S[i] := S[S[i]]
end
T(n) = O(log h), with h the maximum height of the trees
W(n) = O(n log h)
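
A sequential sketch of one possible pointer-jumping loop is shown below (hypothetical helper find_roots, 0-based nodes, roots pointing to themselves); each round rewrites every pointer to its grandparent, so pointer chains shrink roughly by half per round.

# Minimal sequential simulation of pointer jumping: in every round, each
# node's pointer S[i] is replaced by S[S[i]].

def find_roots(P):
    """P[i] is the parent of node i (roots point to themselves, 0-based)."""
    S = list(P)
    # One round corresponds to one synchronous PRAM step over all nodes.
    while any(S[i] != S[S[i]] for i in range(len(S))):
        S = [S[S[i]] for i in range(len(S))]     # all jumps happen "at once"
    return S

# Example forest: 0 <- 1 <- 2 <- 3 and 4 <- 5 (0 and 4 are roots)
print(find_roots([0, 0, 1, 2, 4, 4]))            # [0, 0, 0, 0, 4, 4]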

PRAM Model Summary
- The PRAM removes algorithmic details concerning synchronization and communication, allowing the algorithm designer to focus on problem properties
- A PRAM algorithm includes an explicit understanding of the operations performed at each time unit and an explicit allocation of processors to jobs at each time unit
- PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models
- A SIMD network model considers the communication diameter, bisection width, and scalability properties of the network topology of a parallel machine such as a mesh or hypercube

Further Reading
An Introduction to Parallel Algorithms, by J. JaJa, 1992