Parallel Scalable Algorithms- Performance Parameters

Save this PDF as:

Size: px
Start display at page:

Transcription

1 Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain

2 Overview Sources of Overhead in Parallel Programs Performance Metrics for Parallel Systems Scalability of Parallel Systems and Algorithms Analysis of Parallel Programs

3 Parameters A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size). The asymptotic runtime of a sequential program is identical on any serial platform. The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine. An algorithm must therefore be analyzed in the context of the underlying platform. A parallel system is a combination of a parallel algorithm and an underlying platform.

4 Parameters A number of performance measures are intuitive. Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed of the program is ported to another machine altogether? How much faster is the parallel version? This begs the obvious follow up question what is the baseline serial version with which we compare?

5 Sources of Overhead in Parallel Programs If I use two processors, shouldn t my program run twice as fast? No - a number of overheads, including wasted computation, communication, idling, and contention cause degradation in performance. The execution profile of a hypothetical parallel program executing on eight processing elements. Profile indicates times spent performing computation (both essential and excess), communication, and idling.

6 Sources of Overheads in Parallel Programs Interprocess interactions: Processors working on any nontrivial parallel problem will need to talk to each other. Idling: Processes may idle because of load imbalance, synchronization, or serial components. Excess Computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication.

7 Performance Metrics for Parallel Systems: Execution Time Serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution. We denote the serial runtime by and the parallel runtime by TP.

8 Performance Metrics for Parallel Systems: Total Parallel Overhead Let Tall be the total time collectively spent by all the processing elements. TS is the serial time. Observe that Tall - TS is then the total time spend by all processors combined in non-useful work. This is called the total overhead. The total time collectively spent by all the processing elements Tall = p TP (p is the number of processors). The overhead function (To) is therefore given by To = p TP - TS (1)

9 Performance Metrics for Parallel Systems: Speedup What is the benefit from parallelism? Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements. Speedup Sp=Ts/Tp

10 Performance Metrics: Example Consider the problem of adding n numbers by using n processing elements. If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.

11 Performance Metrics: Example Computing the global sum of 16 partial sums using 16 processing elements. Σji denotes the sum of numbers with consecutive labels from i to j.

12 Performance Metrics: Speedup For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees. For the purpose of computing speedup, we always consider the best sequential program as the baseline.

13 Performance Metrics: Example (continued) If an addition takes constant time, say, tc and communication of a single word takes time ts + tw, we have the parallel time TP = Θ (log n) We know that TS = Θ (n) Speedup S is given by S = Θ (n / log n)

14 Performance Metrics: Speedup Bounds Speedup can be as low as 0 (the parallel program never terminates). Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use times as many resources. Usually 0< Sp < p A speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem. In this case, a single processor could be timeslided to achieve a faster serial program, which contradicts our assumption of fastest serial program as basis for speedup.

15 Performance Metrics: Efficiency Efficiency is a measure of the fraction of time for which a processing element is usefully employed Mathematically, it is given by = (2) 0 < E < 1 Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.

16 Performance Metrics: Efficiency Example The speedup of adding numbers on processors is given by Efficiency is given by =

17 Effect of Granularity on Performance Often, using fewer processors improves performance of parallel systems. Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system. A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled down processors. Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p. The communication cost should not increase by this factor since some of the virtual processors assigned to a physical processors might talk to each other. This is the basic reason for the improvement from building granularity.

18 Granularity: Example Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2. Use the parallel algorithm for n processors, except, in this case, we think of them as virtual processors. Each of the p processors is now assigned n / p virtual processors. The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements. Subsequent log n - log p steps do not require any communication.

19 Granularity: Example (continued) The overall parallel execution time now is Θ ( (n / p) log p). The cost is Θ (n log p), which is asymptotically higher than the Θ (n) cost of adding n numbers sequentially. Therefore, the parallel (algorithm) system is not costoptimal.

20 Granularity: Example (continued) Can we build granularity in the example in a cost-optimal fashion? Each processing element locally adds its n / p numbers in time Θ (n / p). The p partial sums on p processing elements can be added in time Θ(n /p). A cost-optimal way of computing the sum of 16 numbers using four processing elements.

21 Granularity: Example (continued) The parallel runtime of this algorithm is (3) The cost is

22 Scaling Characteristics of Parallel Programs The efficiency of a parallel program can be written as: or (4) The total overhead function To is an increasing function of p.

23 Scaling Characteristics of Parallel Programs For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, To increases. The overall efficiency of the parallel program goes down. This is the case for all parallel programs.

24 Scaling Characteristics of Parallel Programs: Example Consider the problem of adding numbers on processing elements. We have seen that: = (5) = (6) = (7)

25 Scaling Characteristics of Parallel Programs: Example (continued) Plotting the speedup for various input sizes gives us: Speedup versus the number of processing elements for adding a list of numbers. Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law

26 Scaling Characteristics of Parallel Programs Total overhead function To is a function of both problem size Ts and the number of processing elements p. In many cases, To grows sub-linearly with respect to Ts. In such cases, the efficiency increases if the problem size is increased keeping the number of processing elements constant. For such systems, we can simultaneously increase the problem size and number of processors to keep efficiency constant. We call such systems scalable parallel systems.

27 Scalability For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down for all systems. For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.

28 Scalability What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed? This rate determines the scalability of the system. The slower this rate, the better. Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem.

29 Isoefficiency Metric of Scalability We can write parallel runtime as: (8) The resulting expression for speedup is (9) Finally, we write the expression for efficiency as

30 Isoefficiency Metric of Scalability For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To / W is maintained at a constant value. For a desired value E of efficiency, (11) If K = E / (1 E) is a constant depending on the efficiency to be maintained, since To is a function of W and p, we have (12)

31 Isoefficiency Metric of Scalability The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant. This function is called the isoefficiency function. This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements

32 Other Scalability Metrics A number of other metrics have been proposed, dictated by specific needs of applications. For real-time applications, the objective is to scale up a system to accomplish a task in a specified time bound. In memory constrained environments, metrics operate at the limit of memory and estimate performance under this problem growth rate.

33 Amdahl s Law - Fixed problem size In many practical applications the computational workload is fixed: Two parts for the problem with size W : sequential and parallel part W = αw + (1- α)w Sp = W / (αw + (1- α)(w/p)) = p / (1 + (p - 1) α) 1 / α as p Speedup is limited by 1 / α

Performance metrics for parallel systems

Performance metrics for parallel systems S.S. Kadam C-DAC, Pune sskadam@cdac.in C-DAC/SECG/2006 1 Purpose To determine best parallel algorithm Evaluate hardware platforms Examine the benefits from parallelism

Performance Metrics for Parallel Programs. 8 March 2010

Performance Metrics for Parallel Programs 8 March 2010 Content measuring time towards analytic modeling, execution time, overhead, speedup, efficiency, cost, granularity, scalability, roadblocks, asymptotic

Performance metrics for parallelism

Performance metrics for parallelism 8th of November, 2013 Sources Rob H. Bisseling; Parallel Scientific Computing, Oxford Press. Grama, Gupta, Karypis, Kumar; Parallel Computing, Addison Wesley. Definition

Advanced Computer Architecture Institute for Multimedia and Software Engineering Conduction of Exercises: Institute for Multimedia eda and Software Engineering g BB 315c, Tel: 379-1174 E-mail: marius.rosu@uni-due.de

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

Resource Allocation and the Law of Diminishing Returns

esource Allocation and the Law of Diminishing eturns Julia and Igor Korsunsky The law of diminishing returns is a well-known topic in economics. The law states that the output attributed to an individual

PERFORMANCE MODELS FOR APACHE ACCUMULO:

Securely explore your data PERFORMANCE MODELS FOR APACHE ACCUMULO: THE HEAVY TAIL OF A SHAREDNOTHING ARCHITECTURE Chris McCubbin Director of Data Science Sqrrl Data, Inc. I M NOT ADAM FUCHS But perhaps

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

Contributions to Gang Scheduling

CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

Interactive comment on A parallelization scheme to simulate reactive transport in the subsurface environment with OGS#IPhreeqc by W. He et al.

Geosci. Model Dev. Discuss., 8, C1166 C1176, 2015 www.geosci-model-dev-discuss.net/8/c1166/2015/ Author(s) 2015. This work is distributed under the Creative Commons Attribute 3.0 License. Geoscientific

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

Scalable Computing in the Multicore Era

Scalable Computing in the Multicore Era Xian-He Sun, Yong Chen and Surendra Byna Illinois Institute of Technology, Chicago IL 60616, USA Abstract. Multicore architecture has become the trend of high performance

Control 2004, University of Bath, UK, September 2004

Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3

A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, ranjit.jnu@gmail.com

Chapter 1 - Web Server Management and Cluster Topology

Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management

Parallelization: Binary Tree Traversal

By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

Optimal Load Balancing in a Beowulf Cluster. Daniel Alan Adams. A Thesis. Submitted to the Faculty WORCESTER POLYTECHNIC INSTITUTE

Optimal Load Balancing in a Beowulf Cluster by Daniel Alan Adams A Thesis Submitted to the Faculty of WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Master

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element

System Models for Distributed and Cloud Computing

System Models for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Classification of Distributed Computing Systems

NAMD2- Greater Scalability for Parallel Molecular Dynamics. Presented by Abel Licon

NAMD2- Greater Scalability for Parallel Molecular Dynamics Laxmikant Kale, Robert Steel, Milind Bhandarkar,, Robert Bunner, Attila Gursoy,, Neal Krawetz,, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan,,

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability

Hardware: Input, Processing, and Output Devices

Hardware: Input, Processing, and Output Devices Computer Systems Hardware Components Execution of an Instruction Processing Characteristics and Functions Physical Characteristics of CPU Memory Characteristics

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

Performance Evaluation for Parallel Systems: A Survey

Performance Evaluation for Parallel Systems: A Survey Lei Hu and Ian Gorton Department of Computer Systems School of Computer Science and Engineering University of NSW, Sydney 2052, Australia E-mail: {lei,

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

Physical Data Organization

Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

A Comparison of Distributed Systems: ChorusOS and Amoeba

A Comparison of Distributed Systems: ChorusOS and Amoeba Angelo Bertolli Prepared for MSIT 610 on October 27, 2004 University of Maryland University College Adelphi, Maryland United States of America Abstract.

OPC COMMUNICATION IN REAL TIME

OPC COMMUNICATION IN REAL TIME M. Mrosko, L. Mrafko Slovak University of Technology, Faculty of Electrical Engineering and Information Technology Ilkovičova 3, 812 19 Bratislava, Slovak Republic Abstract

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms Thierry Garcia and David Semé LaRIA Université de Picardie Jules Verne, CURI, 5, rue du Moulin Neuf 80000 Amiens, France, E-mail:

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals...2 1.2 Data model and the hierarchical namespace...3 1.3 Nodes and ephemeral nodes...

Windows Server Performance Monitoring

Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

Load Balancing. Load Balancing 1 / 24

Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

Energy Efficient MapReduce

Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

Using In-Memory Computing to Simplify Big Data Analytics

SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

Parallelism and Cloud Computing

Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

Benchmarking Hadoop & HBase on Violin

Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

A CDN-P2P Hybrid Architecture for Cost-Effective Streaming Media Distribution

A CDN-P2P Hybrid Architecture for Cost-Effective Streaming Media Distribution Dongyan Xu, Sunil Suresh Kulkarni, Catherine Rosenberg, Heung-Keung Chai Department of Computer Sciences School of Electrical

Operation Count; Numerical Linear Algebra

10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

Overview of High Performance Computing

Overview of High Performance Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu http://geco.mines.edu/workshop 1 This tutorial will cover all three time slots. In the first session we will discuss the

All ju The State of Software Development Today: A Parallel View. June 2012

All ju The State of Software Development Today: A Parallel View June 2012 2 What is Parallel Programming? When students study computer programming, the normal approach is to learn to program sequentially.

With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

HP Smart Array Controllers and basic RAID performance factors

Technical white paper HP Smart Array Controllers and basic RAID performance factors Technology brief Table of contents Abstract 2 Benefits of drive arrays 2 Factors that affect performance 2 HP Smart Array

Measuring Parallel Processor Performance

ARTICLES Measuring Parallel Processor Performance Alan H. Karp and Horace P. Flatt Many metrics are used for measuring the performance of a parallel algorithm running on a parallel processor. This article

Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

enabling Ultra-High Bandwidth Scalable SSDs with HLnand

www.hlnand.com enabling Ultra-High Bandwidth Scalable SSDs with HLnand May 2013 2 Enabling Ultra-High Bandwidth Scalable SSDs with HLNAND INTRODUCTION Solid State Drives (SSDs) are available in a wide

An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

Driving force. What future software needs. Potential research topics

Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

Development of Parallel Distributed Computing System for ATPG Program

Development of Parallel Distributed Computing System for ATPG Program ABSTRACT K. Han Electrical and Computer Engineering Auburn University I present an approach about how to increase the speed-up ratio

Basic Concepts in Parallelization

1 Basic Concepts in Parallelization Ruud van der Pas Senior Staff Engineer Oracle Solaris Studio Oracle Menlo Park, CA, USA IWOMP 2010 CCS, University of Tsukuba Tsukuba, Japan June 14-16, 2010 2 Outline

Formal Metrics for Large-Scale Parallel Performance

Formal Metrics for Large-Scale Parallel Performance Kenneth Moreland and Ron Oldfield Sandia National Laboratories, Albuquerque, NM 87185, USA {kmorel,raoldfi}@sandia.gov Abstract. Performance measurement

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

SAN Conceptual and Design Basics

TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

Partitioning and Divide and Conquer Strategies

and Divide and Conquer Strategies Lecture 4 and Strategies Strategies Data partitioning aka domain decomposition Functional decomposition Lecture 4 and Strategies Quiz 4.1 For nuclear reactor simulation,

Advances in Virtualization In Support of In-Memory Big Data Applications

9/29/15 HPTS 2015 1 Advances in Virtualization In Support of In-Memory Big Data Applications SCALE SIMPLIFY OPTIMIZE EVOLVE Ike Nassi Ike.nassi@tidalscale.com 9/29/15 HPTS 2015 2 What is the Problem We

CS 575 Parallel Processing

CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5

PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts

PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts Workshop on Computer Architecture Education 2015 Dan Connors, Kyle Dunn, Ryan Bueter Department of Electrical Engineering University

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

CHAPTER 5 WLDMA: A NEW LOAD BALANCING STRATEGY FOR WAN ENVIRONMENT

81 CHAPTER 5 WLDMA: A NEW LOAD BALANCING STRATEGY FOR WAN ENVIRONMENT 5.1 INTRODUCTION Distributed Web servers on the Internet require high scalability and availability to provide efficient services to

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

Find-The-Number. 1 Find-The-Number With Comps

Find-The-Number 1 Find-The-Number With Comps Consider the following two-person game, which we call Find-The-Number with Comps. Player A (for answerer) has a number x between 1 and 1000. Player Q (for questioner)

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

Six Strategies for Building High Performance SOA Applications

Six Strategies for Building High Performance SOA Applications Uwe Breitenbücher, Oliver Kopp, Frank Leymann, Michael Reiter, Dieter Roller, and Tobias Unger University of Stuttgart, Institute of Architecture

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

Capacity Planning Use Case: Mobile SMS How one mobile operator uses BMC Capacity Management to avoid problems with a major revenue stream

SOLUTION WHITE PAPER Capacity Planning Use Case: Mobile SMS How one mobile operator uses BMC Capacity Management to avoid problems with a major revenue stream Table of Contents Introduction...................................................

FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

SCALABILITY AND AVAILABILITY

SCALABILITY AND AVAILABILITY Real Systems must be Scalable fast enough to handle the expected load and grow easily when the load grows Available available enough of the time Scalable Scale-up increase

PowerPC Microprocessor Clock Modes

nc. Freescale Semiconductor AN1269 (Freescale Order Number) 1/96 Application Note PowerPC Microprocessor Clock Modes The PowerPC microprocessors offer customers numerous clocking options. An internal phase-lock

FPGA area allocation for parallel C applications

1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University

on an system with an infinite number of processors. Calculate the speedup of

1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

An Optimized Load-balancing Scheduling Method Based on the WLC Algorithm for Cloud Data Centers

Journal of Computational Information Systems 9: 7 (23) 689 6829 Available at http://www.jofcis.com An Optimized Load-balancing Scheduling Method Based on the WLC Algorithm for Cloud Data Centers Lianying

Communicating with devices

Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

CS/COE 1501 http://cs.pitt.edu/~bill/1501/

CS/COE 1501 http://cs.pitt.edu/~bill/1501/ Lecture 01 Course Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge

Relating Empirical Performance Data to Achievable Parallel Application Performance

Published in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Vol. III, Las Vegas, Nev., USA, June 28-July 1, 1999, pp. 1627-1633.

Importance of Data locality

Importance of Data Locality - Gerald Abstract Scheduling Policies Test Applications Evaluation metrics Tests in Hadoop Test environment Tests Observations Job run time vs. Mmax Job run time vs. number

Load Balancing In Concurrent Parallel Applications

Load Balancing In Concurrent Parallel Applications Jeff Figler Rochester Institute of Technology Computer Engineering Department Rochester, New York 14623 May 1999 Abstract A parallel concurrent application

Multi-Core Programming

Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

CHAPTER 1 INTRODUCTION

1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level. -ORACLE TIMESTEN 11gR1

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level -ORACLE TIMESTEN 11gR1 CASE STUDY Oracle TimesTen In-Memory Database and Shared Disk HA Implementation

Motion Graphs. It is said that a picture is worth a thousand words. The same can be said for a graph.

Motion Graphs It is said that a picture is worth a thousand words. The same can be said for a graph. Once you learn to read the graphs of the motion of objects, you can tell at a glance if the object in

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu

Data Structures. Algorithm Performance and Big O Analysis

Data Structures Algorithm Performance and Big O Analysis What s an Algorithm? a clearly specified set of instructions to be followed to solve a problem. In essence: A computer program. In detail: Defined