Performance metrics for parallelism

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Performance metrics for parallelism"

Transcription

1 Performance metrics for parallelism 8th of November, 2013

2 Sources Rob H. Bisseling; Parallel Scientific Computing, Oxford Press. Grama, Gupta, Karypis, Kumar; Parallel Computing, Addison Wesley.

3 Definition (Parallel overhead) let T seq be the time taken by a sequential algorithm; let T p be the time taken by a parallelisation of that algorithm, using p processes. Then, the parallel overhead T o is given by T o = pt p T s. (Effort is proportional to the number of workers multiplied with the duration of their work, that is, equal to pt p.) Best case: T o = 0, such that T p = T seq /p.

4 Definition (Speedup) Let T seq, p, and T p be as before. Then, the speedup S is given by S(p) = T seq /T p.

5 Definition (Speedup) Let T seq, p, and T p be as before. Then, the speedup S is given by S(p) = T seq /T p. Target: S = p (no overhead; T o = 0). Best case: S > p (superlinear speedup). Worst case: S < 1 (slowdown).

6 What is T seq? Many sequential algorithms solving the same problem.

7 What is T seq? Many sequential algorithms solving the same problem. When determining the speedup S, compare against the best sequential algorithm (that is available on your architecture).

8 What is T seq? Many sequential algorithms solving the same problem. When determining the speedup S, compare against the best sequential algorithm (that is available on your architecture). When determining the overhead T o, compare against the most similar algorithm (maybe even take T seq = T 1 ).

9 Definition (strong scaling) S(p) = T seq /T p = Ω(p) (i.e., lim sup p S(p)/p > 0) 6 5 MulticoreBSP for Java: FFT on a Sun UltraSparc T2 Perfect scaling 512k-length vector log 2 speedup log 2 p Question: is it reasonable to expect strong scalability for (good) parallel algorithms?

10 Definition (strong scaling) S(p) = T seq /T p = Ω(p) (i.e., lim sup p S(p)/p > 0) 6 5 MulticoreBSP for Java: FFT on a Sun UltraSparc T2 Perfect scaling 512k-length vector 32M-length vector log 2 speedup log 2 p Answer: not as p. You cannot efficiently clean a table with 50 people, or paint a wall with 500 painters.

11 Definition (weak scaling) S(n) = T seq (n)/t p (n) = Ω(1), with p fixed BSPlib: FFT on two different parallel machines Perfect speedup Lynx (distributed-memory) HP DL980 (shared-memory) speedup log 2 n For large enough problems, we do expect to make maximum use of our parallel computer.

12 : example MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes log 2 n This processor advertises 64 processors,

13 : example MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes log 2 n This processor advertises 64 processors, but only has 32 FPUs. We oversubscribed!

14 MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes 32 processes log 2 n This processor advertises 64 processors, but only has 32 FPUs. Be careful with oversubscription! (Including hyperthreading!)

15 MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes 32 processes log 2 n This processor advertises 64 processors, but only has 32 FPUs. Q: would you say this algorithm scales on the Sun Ultrasparc T2?

16 A: if the speedup stabilises around 16x, then yes, since the relative efficiency is stable. Definition (Parallel efficiency) Let T seq, p, T p, and S as before. The parallel efficiency E equals E = T seq p /T p = T seq /pt p = S/p.

17 If there is no overhead (T o = pt p T seq = 0), the efficiency E = 1; decreasing the overhead increases the efficiency. Weak scalability: what happens if the problem size increases?

18 If there is no overhead (T o = pt p T seq = 0), the efficiency E = 1; decreasing the overhead increases the efficiency. Weak scalability: what happens if the problem size increases? But what is a sensible definition of the problem size? Consider the following applications: inner-product calculation; binary search; sorting (quicksort).

19 Problem Size Run-time Inner-product Θ(n) bytes Θ(n) flops Binary search Θ(n) bytes Θ(log 2 n) comparisons Sorting Θ(n) bytes Θ(n log 2 n) swaps FFT Θ(n) bytes Θ(n log 2 n) flops Hence the problem size is best identified by T seq. Question: How should the ratio T o /T seq behave as T seq, for the algorithm to scale in a weak sense?

20 If T o /T seq = c, with c R 0 constant, then pt p T seq T seq = ps 1 1 = c, so S = Note that here, E = S/p = 1 c+1 p, which is constant when p is fixed. c + 1 Question: How should the ratio T o /T seq behave as p, for the algorithm to scale in a strong sense?

21 If T o /T seq = c, with c R 0 constant, then pt p T seq T seq = ps 1 1 = c, so S = Note that here, E = S/p = 1 c+1 p c + 1. which is still constant! Answer: Exactly the same! Both strong and weak scalability are iso-efficiency constraints (E remains constant).

22 Definition (iso-efficiency) Let E be as before. Suppose T o = f (T seq, p) is a known function. Then the iso-efficiency relation is given by T seq = 1 1 E 1f (T seq, p). This follows from the definition of E: E 1 = pt p /T seq + 1 T seq T seq = 1 + pt p T seq T seq = 1 + T o T seq, so T seq (E 1 1) = T o.

23 Questions Does this scale? Yes! (It s again an FFT) Time Problem size

24 Questions Does this scale? Yes! (It s again an FFT) 10 Measured running time Theoretical optimum (asymptotic) Execution time (logarithmic) log 2 n

25 Questions Better use speedups when investigating scalability. 70 HP DL speedup log 2 n

26 In summary... A BSP algorithm is scalable when T = O(T seq /p + p). This considers scalability of the speedup and includes parallel overhead.

27 In summary... A BSP algorithm is scalable when T = O(T seq /p + p). This considers scalability of the speedup and includes parallel overhead. It does not include memory scalability: M = O(M seq /p + p), where M is the memory taken by one BSP process and M seq the memory requirement of the best sequential algorithm. Question: does this definition of a scalable algorithm make sense?

28 In summary... It does indeed make sense: If T o = Θ(p), then E 1 = pt p T s E = = 1 + pt p T s T s = 1 + T o T s p/t s = T s T s + p. Note lim p E = 0 (strong scalability, Amdahl s Law) and lim Ts = 1 (weak scalability). For iso-efficiency, the ratio T o and p has to change appropriately.

29 Practical session Subjects: Coming Thursday, see ~albert-jan.yzelman/education/parco13/ 1 Amdahl s Law 2 Parallel Sorting 3 Parallel Grid computations

30 Insights from the practical session This intuition leads to a definition of how parallel certain algorithms are: Definition (Parallelism) Consider a parallel algorithm that runs in T p time. Let T seq the time taken by the best sequential algorithm that solves the same problem. Then the parallelism is given by T seq T T seq = lim. p T p This kind of analysis is fundamental for fine-grained parallelisation schemes. Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou Cilk: an efficient multithreaded runtime system. SIGPLAN Not. 30, 8 (August 1995), pp

Performance metrics for parallel systems

Performance metrics for parallel systems Performance metrics for parallel systems S.S. Kadam C-DAC, Pune sskadam@cdac.in C-DAC/SECG/2006 1 Purpose To determine best parallel algorithm Evaluate hardware platforms Examine the benefits from parallelism

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Formal Metrics for Large-Scale Parallel Performance

Formal Metrics for Large-Scale Parallel Performance Formal Metrics for Large-Scale Parallel Performance Kenneth Moreland and Ron Oldfield Sandia National Laboratories, Albuquerque, NM 87185, USA {kmorel,raoldfi}@sandia.gov Abstract. Performance measurement

More information

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Institute for Multimedia and Software Engineering Conduction of Exercises: Institute for Multimedia eda and Software Engineering g BB 315c, Tel: 379-1174 E-mail: marius.rosu@uni-due.de

More information

Distributed Load Balancing for Machines Fully Heterogeneous

Distributed Load Balancing for Machines Fully Heterogeneous Internship Report 2 nd of June - 22 th of August 2014 Distributed Load Balancing for Machines Fully Heterogeneous Nathanaël Cheriere nathanael.cheriere@ens-rennes.fr ENS Rennes Academic Year 2013-2014

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Performance Metrics for Parallel Programs. 8 March 2010

Performance Metrics for Parallel Programs. 8 March 2010 Performance Metrics for Parallel Programs 8 March 2010 Content measuring time towards analytic modeling, execution time, overhead, speedup, efficiency, cost, granularity, scalability, roadblocks, asymptotic

More information

CS 575 Parallel Processing

CS 575 Parallel Processing CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5

More information

{emery,browne}@cs.utexas.edu ABSTRACT. Keywords scalable, load distribution, load balancing, work stealing

{emery,browne}@cs.utexas.edu ABSTRACT. Keywords scalable, load distribution, load balancing, work stealing Scalable Load Distribution and Load Balancing for Dynamic Parallel Programs E. Berger and J. C. Browne Department of Computer Science University of Texas at Austin Austin, Texas 78701 USA 01-512-471-{9734,9579}

More information

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri SWARM: A Parallel Programming Framework for Multicore Processors David A. Bader, Varun N. Kanade and Kamesh Madduri Our Contributions SWARM: SoftWare and Algorithms for Running on Multicore, a portable

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information

Performance Evaluation for Parallel Systems: A Survey

Performance Evaluation for Parallel Systems: A Survey Performance Evaluation for Parallel Systems: A Survey Lei Hu and Ian Gorton Department of Computer Systems School of Computer Science and Engineering University of NSW, Sydney 2052, Australia E-mail: {lei,

More information

Using Mining@Home for Distributed Ensemble Learning

Using Mining@Home for Distributed Ensemble Learning Using Mining@Home for Distributed Ensemble Learning Eugenio Cesario 1, Carlo Mastroianni 1, and Domenico Talia 1,2 1 ICAR-CNR, Italy {cesario,mastroianni}@icar.cnr.it 2 University of Calabria, Italy talia@deis.unical.it

More information

DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS

DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS Simon Ostermann, Radu Prodan and Thomas Fahringer Institute of Computer Science, University of Innsbruck Technikerstrasse 21a, Innsbruck, Austria

More information

Various Schemes of Load Balancing in Distributed Systems- A Review

Various Schemes of Load Balancing in Distributed Systems- A Review 741 Various Schemes of Load Balancing in Distributed Systems- A Review Monika Kushwaha Pranveer Singh Institute of Technology Kanpur, U.P. (208020) U.P.T.U., Lucknow Saurabh Gupta Pranveer Singh Institute

More information

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms Thierry Garcia and David Semé LaRIA Université de Picardie Jules Verne, CURI, 5, rue du Moulin Neuf 80000 Amiens, France, E-mail:

More information

Dynamic Load Balancing using Graphics Processors

Dynamic Load Balancing using Graphics Processors I.J. Intelligent Systems and Applications, 2014, 05, 70-75 Published Online April 2014 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijisa.2014.05.07 Dynamic Load Balancing using Graphics Processors

More information

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case. Algorithms Efficiency of algorithms Computational resources: time and space Best, worst and average case performance How to compare algorithms: machine-independent measure of efficiency Growth rate Complexity

More information

Analysis of Binary Search algorithm and Selection Sort algorithm

Analysis of Binary Search algorithm and Selection Sort algorithm Analysis of Binary Search algorithm and Selection Sort algorithm In this section we shall take up two representative problems in computer science, work out the algorithms based on the best strategy to

More information

Scalable Computing in the Multicore Era

Scalable Computing in the Multicore Era Scalable Computing in the Multicore Era Xian-He Sun, Yong Chen and Surendra Byna Illinois Institute of Technology, Chicago IL 60616, USA Abstract. Multicore architecture has become the trend of high performance

More information

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

More information

Chapter 2 Parallel Architecture, Software And Performance

Chapter 2 Parallel Architecture, Software And Performance Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas CHAPTER Dynamic Load Balancing 35 Using Work-Stealing Daniel Cederman and Philippas Tsigas In this chapter, we present a methodology for efficient load balancing of computational problems that can be easily

More information

Pipelining and load-balancing in parallel joins on distributed machines

Pipelining and load-balancing in parallel joins on distributed machines NP-PAR 05 p. / Pipelining and load-balancing in parallel joins on distributed machines M. Bamha bamha@lifo.univ-orleans.fr Laboratoire d Informatique Fondamentale d Orléans (France) NP-PAR 05 p. / Plan

More information

Introduction to Cluster Computing

Introduction to Cluster Computing Introduction to Cluster Computing Brian Vinter vinter@diku.dk Overview Introduction Goal/Idea Phases Mandatory Assignments Tools Timeline/Exam General info Introduction Supercomputers are expensive Workstations

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

2: Computer Performance

2: Computer Performance 2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12

More information

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Hardware Solution Evolution of Computer Architectures Micro-Scopic View Clock Rate Limits Have Been Reached

More information

Relating Empirical Performance Data to Achievable Parallel Application Performance

Relating Empirical Performance Data to Achievable Parallel Application Performance Published in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Vol. III, Las Vegas, Nev., USA, June 28-July 1, 1999, pp. 1627-1633.

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors

Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 18, 1037-1048 (2002) Short Paper Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors PANGFENG

More information

Garbage Collection in the Java HotSpot Virtual Machine

Garbage Collection in the Java HotSpot Virtual Machine http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot

More information

Resource Allocation and the Law of Diminishing Returns

Resource Allocation and the Law of Diminishing Returns esource Allocation and the Law of Diminishing eturns Julia and Igor Korsunsky The law of diminishing returns is a well-known topic in economics. The law states that the output attributed to an individual

More information

Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

More information

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Linda Shapiro Winter 2015

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Linda Shapiro Winter 2015 CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis Linda Shapiro Today Registration should be done. Homework 1 due 11:59 pm next Wednesday, January 14 Review math essential

More information

Keys to node-level performance analysis and threading in HPC applications

Keys to node-level performance analysis and threading in HPC applications Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

Resource Models: Batch Scheduling

Resource Models: Batch Scheduling Resource Models: Batch Scheduling Last Time» Cycle Stealing Resource Model Large Reach, Mass Heterogeneity, complex resource behavior Asynchronous Revocation, independent, idempotent tasks» Resource Sharing

More information

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline

More information

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2) Sorting revisited How did we use a binary search tree to sort an array of elements? Tree Sort Algorithm Given: An array of elements to sort 1. Build a binary search tree out of the elements 2. Traverse

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

NAMD2- Greater Scalability for Parallel Molecular Dynamics. Presented by Abel Licon

NAMD2- Greater Scalability for Parallel Molecular Dynamics. Presented by Abel Licon NAMD2- Greater Scalability for Parallel Molecular Dynamics Laxmikant Kale, Robert Steel, Milind Bhandarkar,, Robert Bunner, Attila Gursoy,, Neal Krawetz,, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan,,

More information

32 K. Stride 8 32 128 512 2 K 8 K 32 K 128 K 512 K 2 M

32 K. Stride 8 32 128 512 2 K 8 K 32 K 128 K 512 K 2 M Evaluating a Real Machine Performance Isolation using Microbenchmarks Choosing Workloads Evaluating a Fixed-size Machine Varying Machine Size Metrics All these issues, plus more, relevant to evaluating

More information

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3 A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, ranjit.jnu@gmail.com

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

A new binary floating-point division algorithm and its software implementation on the ST231 processor

A new binary floating-point division algorithm and its software implementation on the ST231 processor 19th IEEE Symposium on Computer Arithmetic (ARITH 19) Portland, Oregon, USA, June 8-10, 2009 A new binary floating-point division algorithm and its software implementation on the ST231 processor Claude-Pierre

More information

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun CS550 Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun Email: sun@iit.edu, Phone: (312) 567-5260 Office hours: 2:10pm-3:10pm Tuesday, 3:30pm-4:30pm Thursday at SB229C,

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms Introduction to Parallel Computing George Karypis Parallel Programming Platforms Elements of a Parallel Computer Hardware Multiple Processors Multiple Memories Interconnection Network System Software Parallel

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

A graph-oriented task manager for small multiprocessor systems.

A graph-oriented task manager for small multiprocessor systems. A graph-oriented task manager for small multiprocessor systems. Xavier Verians 1, Jean-Didier Legat 1, Jean-Jacques Quisquater 1 and Benoit Macq 2 1 Microelectronics Laboratory, Université Catholique de

More information

Notation: - active node - inactive node

Notation: - active node - inactive node Dynamic Load Balancing Algorithms for Sequence Mining Valerie Guralnik, George Karypis fguralnik, karypisg@cs.umn.edu Department of Computer Science and Engineering/Army HPC Research Center University

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

CS/COE 1501 http://cs.pitt.edu/~bill/1501/

CS/COE 1501 http://cs.pitt.edu/~bill/1501/ CS/COE 1501 http://cs.pitt.edu/~bill/1501/ Lecture 01 Course Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithm Analysis Page 1 BFH-TI: Softwareschule Schweiz Algorithm Analysis Dr. CAS SD01 Algorithm Analysis Page 2 Outline Course and Textbook Overview Analysis of Algorithm Pseudo-Code and Primitive Operations

More information

Paul Brebner, Senior Researcher, NICTA, Paul.Brebner@nicta.com.au

Paul Brebner, Senior Researcher, NICTA, Paul.Brebner@nicta.com.au Is your Cloud Elastic Enough? Part 2 Paul Brebner, Senior Researcher, NICTA, Paul.Brebner@nicta.com.au Paul Brebner is a senior researcher in the e-government project at National ICT Australia (NICTA,

More information

Mesh Generation and Load Balancing

Mesh Generation and Load Balancing Mesh Generation and Load Balancing Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee April 04, 2012 CS 594 04/04/2012 Slide 1 / 19 Outline Motivation Reliable

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

An HPC Application Deployment Model on Azure Cloud for SMEs

An HPC Application Deployment Model on Azure Cloud for SMEs An HPC Application Deployment Model on Azure Cloud for SMEs Fan Ding CLOSER 2013, Aachen, Germany, May 9th,2013 Rechen- und Kommunikationszentrum (RZ) Agenda Motivation Windows Azure Relevant Technology

More information

Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures. Centralized Systems Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

SMock A Test Platform for the Evaluation of Monitoring Tools

SMock A Test Platform for the Evaluation of Monitoring Tools SMock A Test Platform for the Evaluation of Monitoring Tools User Manual Ruth Mizzi Faculty of ICT University of Malta June 20, 2013 Contents 1 Introduction 3 1.1 The Architecture and Design of SMock................

More information

Program Optimization for Multi-core Architectures

Program Optimization for Multi-core Architectures Program Optimization for Multi-core Architectures Sanjeev K Aggarwal (ska@iitk.ac.in) M Chaudhuri (mainak@iitk.ac.in) R Moona (moona@iitk.ac.in) Department of Computer Science and Engineering, IIT Kanpur

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Interactive comment on A parallelization scheme to simulate reactive transport in the subsurface environment with OGS#IPhreeqc by W. He et al.

Interactive comment on A parallelization scheme to simulate reactive transport in the subsurface environment with OGS#IPhreeqc by W. He et al. Geosci. Model Dev. Discuss., 8, C1166 C1176, 2015 www.geosci-model-dev-discuss.net/8/c1166/2015/ Author(s) 2015. This work is distributed under the Creative Commons Attribute 3.0 License. Geoscientific

More information

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms

More information

System Models for Distributed and Cloud Computing

System Models for Distributed and Cloud Computing System Models for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Classification of Distributed Computing Systems

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

Successful Implementation of L3B: Low Level Load Balancer

Successful Implementation of L3B: Low Level Load Balancer Successful Implementation of L3B: Low Level Load Balancer Kiril Cvetkov, Sasko Ristov, Marjan Gusev University Ss Cyril and Methodius, FCSE, Rugjer Boshkovic 16, Skopje, Macedonia Email: kiril cvetkov@yahoo.com,

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

Hardware Assisted Virtualization

Hardware Assisted Virtualization Hardware Assisted Virtualization G. Lettieri 21 Oct. 2015 1 Introduction In the hardware-assisted virtualization technique we try to execute the instructions of the target machine directly on the host

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

Large Scale Simulation on Clusters using COMSOL 4.2

Large Scale Simulation on Clusters using COMSOL 4.2 Large Scale Simulation on Clusters using COMSOL 4.2 Darrell W. Pepper 1 Xiuling Wang 2 Steven Senator 3 Joseph Lombardo 4 David Carrington 5 with David Kan and Ed Fontes 6 1 DVP-USAFA-UNLV, 2 Purdue-Calumet,

More information

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE 1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

Load balancing. Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008

Load balancing. Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008 Load balancing Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008 1 Today s sources CS 194/267 at UCB (Yelick/Demmel) Intro

More information

The Piranha computer algebra system. introduction and implementation details

The Piranha computer algebra system. introduction and implementation details : introduction and implementation details Advanced Concepts Team European Space Agency (ESTEC) Course on Differential Equations and Computer Algebra Estella, Spain October 29-30, 2010 Outline A Brief Overview

More information

3 Extending the Refinement Calculus

3 Extending the Refinement Calculus Building BSP Programs Using the Refinement Calculus D.B. Skillicorn? Department of Computing and Information Science Queen s University, Kingston, Canada skill@qucis.queensu.ca Abstract. We extend the

More information

Notes from Week 1: Algorithms for sequential prediction

Notes from Week 1: Algorithms for sequential prediction CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

Instruction scheduling

Instruction scheduling Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.

More information

Performance Evaluation and Analysis of Parallel Computers Workload

Performance Evaluation and Analysis of Parallel Computers Workload , pp.127-134 http://dx.doi.org/10.14257/ijgdc.2016.9.1.13 Performance Evaluation and Analysis of Parallel Computers Workload M.Narayana Moorthi 1 and R.Manjula 2 1 Assistant Professor, (SG), SCOPE-VIT

More information

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Sudarsanam P Abstract G. Singaravel Parallel computing is an base mechanism for data process with scheduling task,

More information

A Comparison of Distributed Systems: ChorusOS and Amoeba

A Comparison of Distributed Systems: ChorusOS and Amoeba A Comparison of Distributed Systems: ChorusOS and Amoeba Angelo Bertolli Prepared for MSIT 610 on October 27, 2004 University of Maryland University College Adelphi, Maryland United States of America Abstract.

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information