Performance metrics for parallel systems




Performance metrics for parallel systems
S.S. Kadam, C-DAC, Pune
sskadam@cdac.in
C-DAC/SECG/2006

Purpose
- To determine the best parallel algorithm
- To evaluate hardware platforms
- To examine the benefits obtained from parallelism

Execution time
- Serial runtime of a program: time elapsed between the beginning and the end of its execution on a sequential computer; usually denoted by T_S
- Parallel runtime: time elapsed from the start of the parallel computation to the end of execution by the last processing element (PE); usually denoted by T_P

Total parallel overhead
- Total time spent by all the PEs over and above that required by the fastest known sequential algorithm to solve the same problem on a single PE
- Represented by the overhead function T_O
- Total time spent solving a problem with p PEs is p*T_P; the time spent performing useful work is T_S
- The remainder is the overhead: T_O = p*T_P - T_S
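The definition above can be sketched directly in code. The timings below are made-up illustrative values, not from the slides:

```python
def total_overhead(t_s, t_p, p):
    """Overhead function T_O = p*T_P - T_S: the total time spent by all
    p PEs beyond the useful work T_S of the best sequential algorithm."""
    return p * t_p - t_s

# Illustrative timings: T_S = 100 s serial, T_P = 30 s on p = 4 PEs.
# Total work is 4 * 30 = 120 s, of which 100 s is useful, so T_O = 20 s.
print(total_overhead(100, 30, 4))  # 20
```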

Speedup
- Measures the performance gain achieved by parallelizing a given application over its sequential implementation
- Captures the relative benefit of solving a problem in parallel
- Defined as the ratio of the time taken to solve a problem on a single PE to the time required to solve the same problem on a parallel computer with p PEs: S = T_S / T_P
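As a minimal sketch of the ratio, again with made-up illustrative timings:

```python
def speedup(t_s, t_p):
    """S = T_S / T_P: performance gain of the parallel run
    over the sequential run."""
    return t_s / t_p

# Illustrative timings: 100 s sequential, 30 s parallel.
print(round(speedup(100, 30), 2))  # 3.33
```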

Speedup: some observations
- The serial runtime of the best sequential algorithm should be used when computing the speedup ratio
- The p PEs used are assumed to be identical to the one used by the sequential algorithm
- In theory, speedup cannot exceed the number of PEs, p
- A speedup greater than p would require each PE to spend less than T_S / p time solving the problem
- A single PE could then be time-sliced among the p sub-tasks to obtain a faster serial program
- This contradicts the assumption that the fastest serial program is the basis for the speedup

Super-linear speedup
- A phenomenon where a speedup greater than p is nevertheless observed in certain parallel applications
- Happens when:
  - The work performed by the serial algorithm is greater than that of its parallel formulation
  - Hardware features put the serial implementation at a disadvantage, e.g. the data for a problem is too large to fit into the cache of a single PE
  - In the parallel implementation, the individual data partitions may be small enough to fit into the individual caches of the PEs

Efficiency
- Measures the fraction of time for which a PE is usefully employed
- Defined as the ratio of the speedup to the number of PEs: E = S/p = T_S / (p * T_P)
- In an ideal parallel system, the speedup equals p and the efficiency equals one
- In practice, the speedup is less than p and the efficiency is between zero and one
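Continuing the same style of sketch, with illustrative timings:

```python
def efficiency(t_s, t_p, p):
    """E = S/p = T_S / (p * T_P): fraction of time each PE
    spends doing useful work."""
    return t_s / (p * t_p)

# Illustrative timings: T_S = 100 s, T_P = 30 s on p = 4 PEs.
# Each PE is usefully busy about 83% of the time.
print(round(efficiency(100, 30, 4), 2))  # 0.83
```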

Scalability of parallel systems
- The efficiency of a parallel program can be written as E = T_S / (p * T_P) = T_S / (T_S + T_O), or equivalently E = 1 / (1 + T_O / T_S)
- The total overhead function T_O is an increasing function of p
- For a given problem size, the value of T_S remains constant; as we increase the number of PEs, T_O increases
- The overall efficiency of the parallel program therefore goes down
- This is the case for all parallel programs
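The falling efficiency can be seen numerically. The overhead model T_O(p) = 8*p*log2(p) below is an assumption chosen purely for illustration, not a formula from the slides:

```python
import math

def efficiency_from_overhead(t_s, t_o):
    """E = T_S / (T_S + T_O) = 1 / (1 + T_O / T_S)."""
    return t_s / (t_s + t_o)

# Fixed problem size (T_S constant), assumed overhead T_O(p) = 8*p*log2(p):
# efficiency drops steadily as p grows.
t_s = 1024.0
for p in (2, 4, 8, 16):
    t_o = 8 * p * math.log2(p)
    print(p, round(efficiency_from_overhead(t_s, t_o), 3))
```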

Scaling characteristics of parallel programs
- The total overhead function T_O is a function of both the problem size (characterized by T_S) and the number of processing elements p
- In many cases, T_O grows sub-linearly with respect to T_S
- In such cases, efficiency increases if the problem size is increased while keeping the number of PEs constant
- One can then simultaneously increase the problem size and the number of processors so as to keep the efficiency constant
- Such parallel systems are called scalable parallel systems
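A common illustration of this (used in the Grama et al. text that these slides reference) is adding n numbers on p PEs, with T_S = n and T_P roughly n/p + 2*log2(p). The model and the constants below are assumptions for the sketch:

```python
import math

def efficiency_add_n(n, p):
    """Adding n numbers on p PEs (illustrative model):
    T_S = n, T_P = n/p + 2*log2(p), so E = T_S / (p * T_P)."""
    t_p = n / p + 2 * math.log2(p)
    return n / (p * t_p)

for p in (4, 16, 64):
    n_scaled = 8 * p * math.log2(p)              # grow the problem with p
    print(p,
          round(efficiency_add_n(512, p), 2),      # fixed n: efficiency falls
          round(efficiency_add_n(n_scaled, p), 2)) # scaled n: efficiency holds at 0.8
```

With n fixed at 512, E drops from about 0.97 at p = 4 to 0.4 at p = 64; scaling n in proportion to p*log2(p) holds E constant, which is exactly the scalable-system behaviour described above.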

Amdahl's law
- If f is the fraction of the program that is sequential, Amdahl's law states that the speedup is bounded by S <= 1 / (f + (1 - f)/p), where f is the sequential fraction, S the speedup, and p the number of processors
- Example: for f = 0.1 and p = 10 PEs, S = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.26
- As p → ∞, S → 1/f = 10
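The bound and the worked example can be checked directly:

```python
def amdahl_speedup(f, p):
    """Amdahl's law bound: S = 1 / (f + (1 - f)/p),
    where f is the sequential fraction of the program."""
    return 1.0 / (f + (1.0 - f) / p)

# The slide's example: f = 0.1, p = 10 gives 1/0.19.
print(round(amdahl_speedup(0.1, 10), 2))     # 5.26
# For very large p, S approaches the limit 1/f = 10.
print(round(amdahl_speedup(0.1, 10**6), 2))
```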

Observations on Amdahl's law
- A small number of sequential operations can significantly limit the speedup achievable by a parallel computer
- This is one of the stronger arguments against parallel computers
- Amdahl's argument serves as a way of determining whether an algorithm is a good candidate for parallelisation
- The argument does not take the problem size into account
- In most applications, as the data size increases, the sequential part diminishes

Reference
- Introduction to Parallel Computing, Ananth Grama et al., Pearson Education Ltd., 2nd Edition, 2004