Performance metrics for parallelism




Performance metrics for parallelism, 8 November 2013

Sources: Rob H. Bisseling, Parallel Scientific Computing, Oxford University Press; Grama, Gupta, Karypis, Kumar, Parallel Computing, Addison Wesley.

Definition (Parallel overhead). Let T_seq be the time taken by a sequential algorithm, and let T_p be the time taken by a parallelisation of that algorithm using p processes. Then the parallel overhead T_o is given by T_o = p T_p - T_seq. (Effort is proportional to the number of workers multiplied by the duration of their work, that is, equal to p T_p.) Best case: T_o = 0, so that T_p = T_seq / p.
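As a quick numerical illustration (the timings below are made up, not measurements from the lecture), the overhead follows directly from T_seq, T_p, and p:

    # Hypothetical timings in seconds; illustrative only.
    T_seq = 10.0   # time of the (best) sequential algorithm
    T_p = 3.0      # time of the parallel algorithm on p processes
    p = 4

    T_o = p * T_p - T_seq            # parallel overhead
    print(f"T_o = {T_o:.2f} s")      # 2.00 s; T_o = 0 would mean T_p = T_seq / p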

Definition (Speedup). Let T_seq, p, and T_p be as before. Then the speedup S is given by S(p) = T_seq / T_p. Target: S = p (no overhead; T_o = 0). Best case: S > p (superlinear speedup). Worst case: S < 1 (slowdown).
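A small sketch (with hypothetical timings) that classifies a measured speedup according to the cases above:

    def classify_speedup(t_seq, t_p, p):
        """Return the speedup S = t_seq / t_p and the regime it falls in."""
        s = t_seq / t_p
        if s > p:
            return s, "superlinear speedup (S > p)"
        if s < 1.0:
            return s, "slowdown (S < 1)"
        return s, "speedup between 1 and p"

    # Hypothetical example: 10 s sequentially, 2 s on 8 processes.
    print(classify_speedup(10.0, 2.0, 8))   # (5.0, 'speedup between 1 and p')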

What is T_seq? There are many sequential algorithms solving the same problem. When determining the speedup S, compare against the best sequential algorithm (that is available on your architecture). When determining the overhead T_o, compare against the most similar algorithm (maybe even take T_seq = T_1).

Definition (strong scaling). S(p) = T_seq / T_p = Ω(p) as p → ∞ (i.e., lim sup_{p→∞} S(p) / p > 0).

[Figure: MulticoreBSP for Java, FFT on a Sun UltraSparc T2; log2 speedup versus log2 p, comparing perfect scaling with 512k-length and 32M-length vectors.]

Question: is it reasonable to expect strong scalability for (good) parallel algorithms? Answer: not as p → ∞. You cannot efficiently clean a table with 50 people, or paint a wall with 500 painters.
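A minimal strong-scaling experiment in the spirit of the figure above can be sketched with Python's multiprocessing module; the workload (a sum of squares), the problem size, and the process counts are placeholders, not the lecture's FFT benchmark:

    import time
    from multiprocessing import Pool

    def partial_sum(bounds):
        lo, hi = bounds
        return sum(i * i for i in range(lo, hi))

    def timed_run(p, n):
        """Time a parallel sum of squares over 0..n-1 using p worker processes."""
        chunks = [(k * n // p, (k + 1) * n // p) for k in range(p)]
        start = time.perf_counter()
        with Pool(p) as pool:
            sum(pool.map(partial_sum, chunks))
        return time.perf_counter() - start

    if __name__ == "__main__":
        n = 10_000_000                   # fixed problem size: strong scaling
        t_seq = timed_run(1, n)          # crude stand-in for the best sequential time
        for p in (1, 2, 4, 8):
            print(f"p = {p}: S(p) = {t_seq / timed_run(p, n):.2f}")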

Definition (weak scaling). S(n) = T_seq(n) / T_p(n) = Ω(1) as n → ∞, with p fixed.

[Figure: BSPlib FFT on two different parallel machines; speedup versus log2 n, comparing perfect speedup, Lynx (distributed-memory), and an HP DL980 (shared-memory).]

For large enough problems, we do expect to make maximum use of our parallel computer.
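For the weak-scaling view, p stays fixed while n grows; the illustrative (made-up) timing table below just shows how S(n) = T_seq(n) / T_p(n) would be evaluated:

    # Hypothetical timings in seconds for fixed p = 4 and growing n; not real data.
    timings = {
        2**20: (0.9, 0.40),    # n: (T_seq(n), T_4(n))
        2**22: (4.1, 1.30),
        2**24: (18.0, 5.10),
    }
    for n, (t_seq, t_p) in sorted(timings.items()):
        print(f"log2 n = {n.bit_length() - 1}: S(n) = {t_seq / t_p:.1f}")
    # Weak scalability asks that S(n) stays bounded away from zero, i.e. Omega(1).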

Example: MulticoreBSP for Java, FFT on a Sun UltraSparc T2.

[Figure: speedup versus log2 n; perfect scaling compared with runs on 64 and on 32 processes.]

This processor advertises 64 processors, but only has 32 FPUs; with 64 processes we oversubscribed them. Be careful with oversubscription (including hyperthreading)!
Q: would you say this algorithm scales on the Sun UltraSparc T2?

A: if the speedup stabilises around 16x, then yes, since the relative efficiency is stable.

Definition (Parallel efficiency). Let T_seq, p, T_p, and S be as before. The parallel efficiency E equals E = (T_seq / p) / T_p = T_seq / (p T_p) = S / p.
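Plugging in the numbers just discussed, a speedup that stabilises around 16 on 32 processes (the absolute times below are made up):

    t_seq, p = 32.0, 32      # hypothetical sequential time and process count
    t_p = 2.0                # hypothetical parallel time, giving S = 16
    S = t_seq / t_p
    E = S / p                # equivalently (t_seq / p) / t_p
    print(f"S = {S:.0f}, E = {E:.2f}")   # S = 16, E = 0.50: stable relative efficiency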

If there is no overhead (T_o = p T_p - T_seq = 0), the efficiency E = 1; decreasing the overhead increases the efficiency. Weak scalability: what happens if the problem size increases? But what is a sensible definition of the problem size? Consider the following applications: inner-product calculation; binary search; sorting (quicksort).

Problem          Size         Run-time
Inner product    Θ(n) bytes   Θ(n) flops
Binary search    Θ(n) bytes   Θ(log2 n) comparisons
Sorting          Θ(n) bytes   Θ(n log2 n) swaps
FFT              Θ(n) bytes   Θ(n log2 n) flops

Hence the problem size is best identified by T_seq.
Question: How should the ratio T_o / T_seq behave as T_seq → ∞, for the algorithm to scale in a weak sense?

If T_o / T_seq = c, with c ∈ R≥0 constant, then (p T_p - T_seq) / T_seq = p / S - 1 = c, so S = p / (c + 1). Note that here, E = S / p = 1 / (c + 1), which is constant when p is fixed.
Question: How should the ratio T_o / T_seq behave as p → ∞, for the algorithm to scale in a strong sense? Answer: exactly the same! If T_o / T_seq stays constant as p → ∞, then E = 1 / (c + 1) is still constant. Both strong and weak scalability are iso-efficiency constraints (E remains constant).

Definition (iso-efficiency). Let E be as before, and suppose T_o = f(T_seq, p) is a known function. Then the iso-efficiency relation is given by T_seq = f(T_seq, p) / (E^(-1) - 1). This follows from the definition of E: E^(-1) = p T_p / T_seq = (p T_p - T_seq + T_seq) / T_seq = 1 + T_o / T_seq, so T_seq (E^(-1) - 1) = T_o.
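A hypothetical worked example (the overhead function below is assumed, not taken from the slides): suppose T_o = f(T_seq, p) = c · p · log2 p; the iso-efficiency relation then says how fast T_seq must grow to keep E fixed:

    from math import log2

    c, E = 0.5, 0.8                    # assumed overhead constant and target efficiency
    for p in (2, 4, 8, 16, 32):
        # iso-efficiency relation: T_seq = f(T_seq, p) / (E^(-1) - 1)
        T_seq = (c * p * log2(p)) / (1.0 / E - 1.0)
        print(f"p = {p:2d}: T_seq must grow to {T_seq:6.1f} to keep E = {E}")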

Questions: Does this scale? Yes! (It is again an FFT.)

[Figure: execution time (logarithmic) versus log2 n; measured running time against the theoretical (asymptotic) optimum.]

Better to use speedups when investigating scalability.

[Figure: speedup versus log2 n on the HP DL980.]

In summary... A BSP algorithm is scalable when its running time satisfies T = O(T_seq / p + p). This considers scalability of the speedup and includes the parallel overhead. It does not include memory scalability: M = O(M_seq / p + p), where M is the memory taken by one BSP process and M_seq is the memory requirement of the best sequential algorithm.
Question: does this definition of a scalable algorithm make sense?

It does indeed make sense. If T_o = Θ(p) (say, T_o = p), then E^(-1) = p T_p / T_seq = 1 + T_o / T_seq, so E = 1 / (1 + p / T_seq) = T_seq / (T_seq + p). Note that lim_{p→∞} E = 0 (strong scalability; Amdahl's Law) and lim_{T_seq→∞} E = 1 (weak scalability). For iso-efficiency, T_seq has to change appropriately with p, so that the ratio p / T_seq remains constant.
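The two limits can be checked numerically (taking T_o = p exactly, as in the derivation above):

    def efficiency(T_seq, p):
        return T_seq / (T_seq + p)       # E = T_seq / (T_seq + p) when T_o = p

    print([round(efficiency(1000.0, p), 3) for p in (10, 100, 1000, 10000)])   # tends to 0 as p grows
    print([round(efficiency(T, 100.0), 3) for T in (1e2, 1e3, 1e4, 1e5)])      # tends to 1 as T_seq grows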

Practical session (coming Thursday), see http://people.cs.kuleuven.be/~albert-jan.yzelman/education/parco13/
Subjects: 1. Amdahl's Law; 2. Parallel sorting; 3. Parallel grid computations.

Insights from the practical session. This intuition leads to a definition of how parallel certain algorithms are.

Definition (Parallelism). Consider a parallel algorithm that runs in T_p time, and let T_seq be the time taken by the best sequential algorithm that solves the same problem. Then the parallelism is given by T_seq / T_∞ = lim_{p→∞} T_seq / T_p, where T_∞ denotes the parallel time on an unbounded number of processes. This kind of analysis is fundamental for fine-grained parallelisation schemes.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. SIGPLAN Notices 30(8), August 1995, pp. 207-216.
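A small work/span sketch of this definition (the reduction example is assumed, not from the slides): summing n numbers with a balanced tree of additions has work T_seq = n - 1 and span T_∞ = ceil(log2 n), so the parallelism grows roughly like n / log2 n:

    from math import ceil, log2

    # Parallel reduction: work = n - 1 additions, span = ceil(log2 n) additions
    # on the critical path; parallelism = work / span.
    for n in (2**10, 2**20, 2**30):
        work, span = n - 1, ceil(log2(n))
        print(f"n = 2^{int(log2(n))}: parallelism = {work / span:,.0f}")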