Performance metrics for parallelism

Size: px
Start display at page:

Download "Performance metrics for parallelism"

Transcription

1 Performance metrics for parallelism 8th of November, 2013

2 Sources Rob H. Bisseling; Parallel Scientific Computing, Oxford Press. Grama, Gupta, Karypis, Kumar; Parallel Computing, Addison Wesley.

3 Definition (Parallel overhead) let T seq be the time taken by a sequential algorithm; let T p be the time taken by a parallelisation of that algorithm, using p processes. Then, the parallel overhead T o is given by T o = pt p T s. (Effort is proportional to the number of workers multiplied with the duration of their work, that is, equal to pt p.) Best case: T o = 0, such that T p = T seq /p.

4 Definition (Speedup) Let T seq, p, and T p be as before. Then, the speedup S is given by S(p) = T seq /T p.

5 Definition (Speedup) Let T seq, p, and T p be as before. Then, the speedup S is given by S(p) = T seq /T p. Target: S = p (no overhead; T o = 0). Best case: S > p (superlinear speedup). Worst case: S < 1 (slowdown).

6 What is T seq? Many sequential algorithms solving the same problem.

7 What is T seq? Many sequential algorithms solving the same problem. When determining the speedup S, compare against the best sequential algorithm (that is available on your architecture).

8 What is T seq? Many sequential algorithms solving the same problem. When determining the speedup S, compare against the best sequential algorithm (that is available on your architecture). When determining the overhead T o, compare against the most similar algorithm (maybe even take T seq = T 1 ).

9 Definition (strong scaling) S(p) = T seq /T p = Ω(p) (i.e., lim sup p S(p)/p > 0) 6 5 MulticoreBSP for Java: FFT on a Sun UltraSparc T2 Perfect scaling 512k-length vector log 2 speedup log 2 p Question: is it reasonable to expect strong scalability for (good) parallel algorithms?

10 Definition (strong scaling) S(p) = T seq /T p = Ω(p) (i.e., lim sup p S(p)/p > 0) 6 5 MulticoreBSP for Java: FFT on a Sun UltraSparc T2 Perfect scaling 512k-length vector 32M-length vector log 2 speedup log 2 p Answer: not as p. You cannot efficiently clean a table with 50 people, or paint a wall with 500 painters.

11 Definition (weak scaling) S(n) = T seq (n)/t p (n) = Ω(1), with p fixed BSPlib: FFT on two different parallel machines Perfect speedup Lynx (distributed-memory) HP DL980 (shared-memory) speedup log 2 n For large enough problems, we do expect to make maximum use of our parallel computer.

12 : example MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes log 2 n This processor advertises 64 processors,

13 : example MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes log 2 n This processor advertises 64 processors, but only has 32 FPUs. We oversubscribed!

14 MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes 32 processes log 2 n This processor advertises 64 processors, but only has 32 FPUs. Be careful with oversubscription! (Including hyperthreading!)

15 MulticoreBSP for Java: FFT on a Sun UltraSparc T speedup Perfect scaling 64 processes 32 processes log 2 n This processor advertises 64 processors, but only has 32 FPUs. Q: would you say this algorithm scales on the Sun Ultrasparc T2?

16 A: if the speedup stabilises around 16x, then yes, since the relative efficiency is stable. Definition (Parallel efficiency) Let T seq, p, T p, and S as before. The parallel efficiency E equals E = T seq p /T p = T seq /pt p = S/p.

17 If there is no overhead (T o = pt p T seq = 0), the efficiency E = 1; decreasing the overhead increases the efficiency. Weak scalability: what happens if the problem size increases?

18 If there is no overhead (T o = pt p T seq = 0), the efficiency E = 1; decreasing the overhead increases the efficiency. Weak scalability: what happens if the problem size increases? But what is a sensible definition of the problem size? Consider the following applications: inner-product calculation; binary search; sorting (quicksort).

19 Problem Size Run-time Inner-product Θ(n) bytes Θ(n) flops Binary search Θ(n) bytes Θ(log 2 n) comparisons Sorting Θ(n) bytes Θ(n log 2 n) swaps FFT Θ(n) bytes Θ(n log 2 n) flops Hence the problem size is best identified by T seq. Question: How should the ratio T o /T seq behave as T seq, for the algorithm to scale in a weak sense?

20 If T o /T seq = c, with c R 0 constant, then pt p T seq T seq = ps 1 1 = c, so S = Note that here, E = S/p = 1 c+1 p, which is constant when p is fixed. c + 1 Question: How should the ratio T o /T seq behave as p, for the algorithm to scale in a strong sense?

21 If T o /T seq = c, with c R 0 constant, then pt p T seq T seq = ps 1 1 = c, so S = Note that here, E = S/p = 1 c+1 p c + 1. which is still constant! Answer: Exactly the same! Both strong and weak scalability are iso-efficiency constraints (E remains constant).

22 Definition (iso-efficiency) Let E be as before. Suppose T o = f (T seq, p) is a known function. Then the iso-efficiency relation is given by T seq = 1 1 E 1f (T seq, p). This follows from the definition of E: E 1 = pt p /T seq + 1 T seq T seq = 1 + pt p T seq T seq = 1 + T o T seq, so T seq (E 1 1) = T o.

23 Questions Does this scale? Yes! (It s again an FFT) Time Problem size

24 Questions Does this scale? Yes! (It s again an FFT) 10 Measured running time Theoretical optimum (asymptotic) Execution time (logarithmic) log 2 n

25 Questions Better use speedups when investigating scalability. 70 HP DL speedup log 2 n

26 In summary... A BSP algorithm is scalable when T = O(T seq /p + p). This considers scalability of the speedup and includes parallel overhead.

27 In summary... A BSP algorithm is scalable when T = O(T seq /p + p). This considers scalability of the speedup and includes parallel overhead. It does not include memory scalability: M = O(M seq /p + p), where M is the memory taken by one BSP process and M seq the memory requirement of the best sequential algorithm. Question: does this definition of a scalable algorithm make sense?

28 In summary... It does indeed make sense: If T o = Θ(p), then E 1 = pt p T s E = = 1 + pt p T s T s = 1 + T o T s p/t s = T s T s + p. Note lim p E = 0 (strong scalability, Amdahl s Law) and lim Ts = 1 (weak scalability). For iso-efficiency, the ratio T o and p has to change appropriately.

29 Practical session Subjects: Coming Thursday, see ~albert-jan.yzelman/education/parco13/ 1 Amdahl s Law 2 Parallel Sorting 3 Parallel Grid computations

30 Insights from the practical session This intuition leads to a definition of how parallel certain algorithms are: Definition (Parallelism) Consider a parallel algorithm that runs in T p time. Let T seq the time taken by the best sequential algorithm that solves the same problem. Then the parallelism is given by T seq T T seq = lim. p T p This kind of analysis is fundamental for fine-grained parallelisation schemes. Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou Cilk: an efficient multithreaded runtime system. SIGPLAN Not. 30, 8 (August 1995), pp

Performance metrics for parallel systems

Performance metrics for parallel systems Performance metrics for parallel systems S.S. Kadam C-DAC, Pune [email protected] C-DAC/SECG/2006 1 Purpose To determine best parallel algorithm Evaluate hardware platforms Examine the benefits from parallelism

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Distributed Load Balancing for Machines Fully Heterogeneous

Distributed Load Balancing for Machines Fully Heterogeneous Internship Report 2 nd of June - 22 th of August 2014 Distributed Load Balancing for Machines Fully Heterogeneous Nathanaël Cheriere [email protected] ENS Rennes Academic Year 2013-2014

More information

CS 575 Parallel Processing

CS 575 Parallel Processing CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5

More information

Parallel Algorithm for Dense Matrix Multiplication

Parallel Algorithm for Dense Matrix Multiplication Parallel Algorithm for Dense Matrix Multiplication CSE633 Parallel Algorithms Fall 2012 Ortega, Patricia Outline Problem definition Assumptions Implementation Test Results Future work Conclusions Problem

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri SWARM: A Parallel Programming Framework for Multicore Processors David A. Bader, Varun N. Kanade and Kamesh Madduri Our Contributions SWARM: SoftWare and Algorithms for Running on Multicore, a portable

More information

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

More information

Performance Evaluation for Parallel Systems: A Survey

Performance Evaluation for Parallel Systems: A Survey Performance Evaluation for Parallel Systems: A Survey Lei Hu and Ian Gorton Department of Computer Systems School of Computer Science and Engineering University of NSW, Sydney 2052, Australia E-mail: {lei,

More information

Various Schemes of Load Balancing in Distributed Systems- A Review

Various Schemes of Load Balancing in Distributed Systems- A Review 741 Various Schemes of Load Balancing in Distributed Systems- A Review Monika Kushwaha Pranveer Singh Institute of Technology Kanpur, U.P. (208020) U.P.T.U., Lucknow Saurabh Gupta Pranveer Singh Institute

More information

Using Mining@Home for Distributed Ensemble Learning

Using Mining@Home for Distributed Ensemble Learning Using Mining@Home for Distributed Ensemble Learning Eugenio Cesario 1, Carlo Mastroianni 1, and Domenico Talia 1,2 1 ICAR-CNR, Italy {cesario,mastroianni}@icar.cnr.it 2 University of Calabria, Italy [email protected]

More information

DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS

DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS Simon Ostermann, Radu Prodan and Thomas Fahringer Institute of Computer Science, University of Innsbruck Technikerstrasse 21a, Innsbruck, Austria

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms Thierry Garcia and David Semé LaRIA Université de Picardie Jules Verne, CURI, 5, rue du Moulin Neuf 80000 Amiens, France, E-mail:

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

2: Computer Performance

2: Computer Performance 2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12

More information

Analysis of Binary Search algorithm and Selection Sort algorithm

Analysis of Binary Search algorithm and Selection Sort algorithm Analysis of Binary Search algorithm and Selection Sort algorithm In this section we shall take up two representative problems in computer science, work out the algorithms based on the best strategy to

More information

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case. Algorithms Efficiency of algorithms Computational resources: time and space Best, worst and average case performance How to compare algorithms: machine-independent measure of efficiency Growth rate Complexity

More information

Dynamic Load Balancing using Graphics Processors

Dynamic Load Balancing using Graphics Processors I.J. Intelligent Systems and Applications, 2014, 05, 70-75 Published Online April 2014 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijisa.2014.05.07 Dynamic Load Balancing using Graphics Processors

More information

Chapter 2 Parallel Architecture, Software And Performance

Chapter 2 Parallel Architecture, Software And Performance Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program

More information

Introduction to Cluster Computing

Introduction to Cluster Computing Introduction to Cluster Computing Brian Vinter [email protected] Overview Introduction Goal/Idea Phases Mandatory Assignments Tools Timeline/Exam General info Introduction Supercomputers are expensive Workstations

More information

Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

More information

Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors

Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 18, 1037-1048 (2002) Short Paper Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors PANGFENG

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

Garbage Collection in the Java HotSpot Virtual Machine

Garbage Collection in the Java HotSpot Virtual Machine http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot

More information

Resource Allocation and the Law of Diminishing Returns

Resource Allocation and the Law of Diminishing Returns esource Allocation and the Law of Diminishing eturns Julia and Igor Korsunsky The law of diminishing returns is a well-known topic in economics. The law states that the output attributed to an individual

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, [email protected] Abstract 1 Interconnect quality

More information

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb Binary Search Trees Lecturer: Georgy Gimel farb COMPSCI 220 Algorithms and Data Structures 1 / 27 1 Properties of Binary Search Trees 2 Basic BST operations The worst-case time complexity of BST operations

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL [email protected] Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

CS/COE 1501 http://cs.pitt.edu/~bill/1501/

CS/COE 1501 http://cs.pitt.edu/~bill/1501/ CS/COE 1501 http://cs.pitt.edu/~bill/1501/ Lecture 01 Course Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

A new binary floating-point division algorithm and its software implementation on the ST231 processor

A new binary floating-point division algorithm and its software implementation on the ST231 processor 19th IEEE Symposium on Computer Arithmetic (ARITH 19) Portland, Oregon, USA, June 8-10, 2009 A new binary floating-point division algorithm and its software implementation on the ST231 processor Claude-Pierre

More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: [email protected] 2 IBM India Research Lab, New Delhi. email: [email protected]

More information

Paul Brebner, Senior Researcher, NICTA, [email protected]

Paul Brebner, Senior Researcher, NICTA, Paul.Brebner@nicta.com.au Is your Cloud Elastic Enough? Part 2 Paul Brebner, Senior Researcher, NICTA, [email protected] Paul Brebner is a senior researcher in the e-government project at National ICT Australia (NICTA,

More information

Keys to node-level performance analysis and threading in HPC applications

Keys to node-level performance analysis and threading in HPC applications Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Linda Shapiro Winter 2015

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Linda Shapiro Winter 2015 CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis Linda Shapiro Today Registration should be done. Homework 1 due 11:59 pm next Wednesday, January 14 Review math essential

More information

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2) Sorting revisited How did we use a binary search tree to sort an array of elements? Tree Sort Algorithm Given: An array of elements to sort 1. Build a binary search tree out of the elements 2. Traverse

More information

Resource Models: Batch Scheduling

Resource Models: Batch Scheduling Resource Models: Batch Scheduling Last Time» Cycle Stealing Resource Model Large Reach, Mass Heterogeneity, complex resource behavior Asynchronous Revocation, independent, idempotent tasks» Resource Sharing

More information

SMock A Test Platform for the Evaluation of Monitoring Tools

SMock A Test Platform for the Evaluation of Monitoring Tools SMock A Test Platform for the Evaluation of Monitoring Tools User Manual Ruth Mizzi Faculty of ICT University of Malta June 20, 2013 Contents 1 Introduction 3 1.1 The Architecture and Design of SMock................

More information

NAMD2- Greater Scalability for Parallel Molecular Dynamics. Presented by Abel Licon

NAMD2- Greater Scalability for Parallel Molecular Dynamics. Presented by Abel Licon NAMD2- Greater Scalability for Parallel Molecular Dynamics Laxmikant Kale, Robert Steel, Milind Bhandarkar,, Robert Bunner, Attila Gursoy,, Neal Krawetz,, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan,,

More information

Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures. Centralized Systems Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Hardware Assisted Virtualization

Hardware Assisted Virtualization Hardware Assisted Virtualization G. Lettieri 21 Oct. 2015 1 Introduction In the hardware-assisted virtualization technique we try to execute the instructions of the target machine directly on the host

More information

32 K. Stride 8 32 128 512 2 K 8 K 32 K 128 K 512 K 2 M

32 K. Stride 8 32 128 512 2 K 8 K 32 K 128 K 512 K 2 M Evaluating a Real Machine Performance Isolation using Microbenchmarks Choosing Workloads Evaluating a Fixed-size Machine Varying Machine Size Metrics All these issues, plus more, relevant to evaluating

More information

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3 A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, [email protected]

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],

More information

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms Introduction to Parallel Computing George Karypis Parallel Programming Platforms Elements of a Parallel Computer Hardware Multiple Processors Multiple Memories Interconnection Network System Software Parallel

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun CS550 Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun Email: [email protected], Phone: (312) 567-5260 Office hours: 2:10pm-3:10pm Tuesday, 3:30pm-4:30pm Thursday at SB229C,

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

3 Extending the Refinement Calculus

3 Extending the Refinement Calculus Building BSP Programs Using the Refinement Calculus D.B. Skillicorn? Department of Computing and Information Science Queen s University, Kingston, Canada [email protected] Abstract. We extend the

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

Mesh Generation and Load Balancing

Mesh Generation and Load Balancing Mesh Generation and Load Balancing Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee April 04, 2012 CS 594 04/04/2012 Slide 1 / 19 Outline Motivation Reliable

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

Discuss the size of the instance for the minimum spanning tree problem.

Discuss the size of the instance for the minimum spanning tree problem. 3.1 Algorithm complexity The algorithms A, B are given. The former has complexity O(n 2 ), the latter O(2 n ), where n is the size of the instance. Let n A 0 be the size of the largest instance that can

More information

An HPC Application Deployment Model on Azure Cloud for SMEs

An HPC Application Deployment Model on Azure Cloud for SMEs An HPC Application Deployment Model on Azure Cloud for SMEs Fan Ding CLOSER 2013, Aachen, Germany, May 9th,2013 Rechen- und Kommunikationszentrum (RZ) Agenda Motivation Windows Azure Relevant Technology

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

Program Optimization for Multi-core Architectures

Program Optimization for Multi-core Architectures Program Optimization for Multi-core Architectures Sanjeev K Aggarwal ([email protected]) M Chaudhuri ([email protected]) R Moona ([email protected]) Department of Computer Science and Engineering, IIT Kanpur

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Parallelization: Binary Tree Traversal

Parallelization: Binary Tree Traversal By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

More information

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms

More information

Operating System Components

Operating System Components Lecture Overview Operating system software introduction OS components OS services OS structure Operating Systems - April 24, 2001 Operating System Components Process management Memory management Secondary

More information

System Models for Distributed and Cloud Computing

System Models for Distributed and Cloud Computing System Models for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Classification of Distributed Computing Systems

More information

Successful Implementation of L3B: Low Level Load Balancer

Successful Implementation of L3B: Low Level Load Balancer Successful Implementation of L3B: Low Level Load Balancer Kiril Cvetkov, Sasko Ristov, Marjan Gusev University Ss Cyril and Methodius, FCSE, Rugjer Boshkovic 16, Skopje, Macedonia Email: kiril [email protected],

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do

More information

Large Scale Simulation on Clusters using COMSOL 4.2

Large Scale Simulation on Clusters using COMSOL 4.2 Large Scale Simulation on Clusters using COMSOL 4.2 Darrell W. Pepper 1 Xiuling Wang 2 Steven Senator 3 Joseph Lombardo 4 David Carrington 5 with David Kan and Ed Fontes 6 1 DVP-USAFA-UNLV, 2 Purdue-Calumet,

More information

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE 1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information