Measuring Parallel Processor Performance
Alan H. Karp and Horace P. Flatt

Many metrics are used for measuring the performance of a parallel algorithm running on a parallel processor. This article introduces a new metric that has some advantages over the others. Its use is illustrated with data from the Linpack benchmark report and the winners of the Gordon Bell Award.

There are many ways to measure the performance of a parallel algorithm running on a parallel processor. The most commonly used measurements are the elapsed time, price/performance, the speed-up, and the efficiency. This article defines another metric that reveals aspects of the performance not easily discerned from the other metrics.

The elapsed time to run a particular job on a given machine is the most important metric. A Cray Y-MP/8 solves the order 1,000 linear system in 2.17 seconds compared to 445 seconds for a Sequent Balance with 30 processors [2]. If you can afford the Cray and you spend most of your time factoring large matrices, then you should buy a Cray.

Price/performance of a parallel system is simply the elapsed time for a program multiplied by the cost of the machine that ran the job. It is important if there is a group of machines that are fast enough. Given a fixed amount of money, it may be to your advantage to buy a number of slow machines rather than one fast machine. This is particularly true if you have many jobs to run and a limited budget. In the previous example, the Sequent Balance is a better price/performer than the Cray if it costs less than 0.5 percent as much. On the other hand, if you can't wait 7 minutes for the answer, the Sequent is not a good buy even if it wins on price/performance.

These two measurements are used to help you decide what machine to buy. Once you have bought the machine, speed-up and efficiency are the measurements often used to let you know how effectively you are using it. The speed-up is generally measured by running the same program on a varying number of processors. The speed-up is then the elapsed time needed by 1 processor divided by the time needed on p processors,

s = T(1)/T(p).    (1)

(Of course, the correct time for the uniprocessor run would be the time for the best serial algorithm, but almost nobody bothers to write two programs.) If you are interested in studying algorithms for parallel processors, and system A gives a higher speed-up than system B for the same program, then you would say that system A provides better support for parallelizing this program than does system B. An example of such support is the presence of more processors. A Sequent with 30 processors will almost certainly produce a higher speed-up than a Cray with only 8 processors.

The issue of efficiency is related to that of price/performance. It is usually defined as

e = s/p = T(1)/(p T(p)).

Efficiency close to unity means that you are using your hardware effectively; low efficiency means that you are wasting resources. As a practical matter, you may buy a system with 100 processors, each of which takes 100 times longer than you are willing to wait to solve your problem. If you can code your problem to run at high efficiency, you'll be happy. Of course, if you have 200 processors, you may not be unhappy with 50 percent efficiency, particularly if the 200 processors cost less than other machines that you can use.

Each of these metrics has disadvantages. In fact, there is important information that cannot be obtained even by looking at all of them.
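To make these definitions concrete, here is a minimal Python sketch (not from the original article) that computes the speed-up and efficiency from measured elapsed times; the timing values below are hypothetical.

```python
def speedup(t1: float, tp: float) -> float:
    """Speed-up s = T(1)/T(p) from one-processor and p-processor elapsed times."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency e = s/p = T(1)/(p*T(p))."""
    return speedup(t1, tp) / p

# Hypothetical timings: 100 s on one processor, 14.5 s on 8 processors.
t1, tp, p = 100.0, 14.5, 8
s = speedup(t1, tp)
e = efficiency(t1, tp, p)
print(f"s = {s:.2f}, e = {e:.3f}")   # s = 6.90, e = 0.862
```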
It is obvious that adding processors should reduce the elapsed time, but by how much? That is where speed-up comes in. Speed-up close to linear is good news, but how close to linear is good enough? Well, efficiency will tell you how close you are getting to the best your hardware can do, but if your efficiency is not particularly high, why? The new metric defined in the following section is intended to answer these questions.

NEW METRIC

We will now derive our new metric, the experimentally determined serial fraction, and show why it is useful. We will start with Amdahl's Law [1], which in its simplest form says that

T(p) = T_s + T_p/p,    (2)
where T_s is the time taken by the part of the program that must be run serially and T_p is the time in the parallelizable part. Obviously, T(1) = T_s + T_p. If we define the fraction serial, f = T_s/T(1), then equation (2) can be written as

T(p) = T(1)f + T(1)(1 - f)/p,    (3)

or, in terms of the speed-up s,

1/s = f + (1 - f)/p.    (4)

We can now solve for the serial fraction, namely

f = (1/s - 1/p) / (1 - 1/p).    (5)

The experimentally determined serial fraction is our new metric. While this quantity is mentioned in a large percentage of papers on parallel algorithms, it is virtually never used as a diagnostic tool the way speed-up and efficiency are. It is our purpose to correct this situation.

The value of f is useful because equation (2) is incomplete. First, it assumes that all processors compute for the same amount of time, i.e., the work is perfectly load balanced. If some processors take longer than others, the measured speed-up will be reduced, giving a larger measured serial fraction. Second, there is a term missing that represents the overhead of synchronizing processors.

Load-balancing effects are likely to result in an irregular change in f as p increases. For example, if you have 12 pieces of work to do that take the same amount of time, you will have perfect load balancing for 2, 3, 4, 6, and 12 processors, but less than perfect load balancing for other values of p (a short code sketch below illustrates this effect). Since a larger load imbalance results in a larger increase in f, you can identify problems not apparent from speed-up or efficiency.

The overhead of synchronizing processors is a monotonically increasing function of p, typically assumed to increase either linearly in p or as log p. Since increasing overhead decreases the speed-up, this effect results in a smooth increase in the serial fraction f as p increases. Smoothly increasing f is a warning that the granularity of the parallel tasks is too fine.

A third effect is the potential reduction of vector lengths for certain parallelizations of a particular algorithm. Vector processor performance normally increases as vector length increases, except for vector lengths slightly larger than the length of the vector registers. If the parallelization breaks up long vectors into shorter vectors, the time to execute the job can increase. This effect then also leads to a smooth increase in the measured serial fraction as the number of processors increases. However, large jobs usually have very long vectors, vector processors usually have only a few processors (the Intel iPSC-VX is an exception), and there are usually parallelizations that keep the vector lengths fixed. Thus, the reduction in vector length is rarely a problem and can often be avoided entirely.
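The load-balancing effect is easy to reproduce. The following Python sketch (an illustration, not from the original article) uses the 12 equal pieces of work from the example above, each taken to cost one time unit, and computes f from equation (5) when the work is spread as evenly as possible over p processors. The measured f vanishes exactly when p divides 12 and jumps irregularly otherwise, even though the program contains no serial work at all.

```python
import math

def serial_fraction(s: float, p: int) -> float:
    """Experimentally determined serial fraction, f = (1/s - 1/p)/(1 - 1/p)."""
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p)

# 12 equal tasks, each taking one time unit; the most heavily loaded
# processor determines the elapsed time, so T(p) = ceil(12 / p).
t1 = 12.0
for p in range(2, 13):
    tp = math.ceil(12 / p)          # elapsed time with p processors
    s = t1 / tp                     # measured speed-up
    print(f"p={p:2d}  s={s:5.2f}  f={serial_fraction(s, p):6.3f}")
# f is 0.000 at p = 2, 3, 4, 6, 12 and positive (irregularly so) elsewhere,
# even though there is no serial code: pure load imbalance.
```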
Table I. Summary of Linpack report Table 2. Columns: p (number of processors), time in seconds, s (speed-up), e (efficiency), and f (serial fraction), for the Cray Y-MP/8, several IBM 3090 S VF configurations, the Alliant FX/8 and FX/80, the Sequent Balance, and the Convex C-2 series. (The numeric entries did not survive transcription.)

In order to see the advantage of the serial fraction over the other metrics, look at Table I, which is extracted from Table 2 of the Linpack report [2]. The Cray Y-MP shows speed-ups ranging from 1.95 to 6.96. Is that good? Even if you look at the efficiency, which ranges from 0.975 to 0.870, you still don't know. Why does the efficiency fall so rapidly? Is there a lot of overhead when using 8 processors? The serial fraction, f, answers the question: it is nearly constant for all values of p. The loss of efficiency is due to the limited parallelism of the program. The single data point for the Sequent Balance shows that f performs even better as a metric as the number of processors grows. The efficiency of the 30-processor Sequent is only 83 percent. Is that good? Yes, it is, since the serial fraction is only 0.007.

The data for the Alliant FX/8 shows something different. Here the speed-up ranges from 1.90 to 3.22. Neither the speed-up nor the efficiency tells much, but the steady growth of f with p does: there is some overhead that is increasing as the number of processors grows. We can't tell what this overhead is due to (synchronization cost, memory contention, or something else), but at least we know it is there. The effect on the FX/80 is much smaller, although there is a slight increase in f, especially for fewer than 5 processors.

Even relatively subtle effects can be seen. The IBM 3090 has a noticeably smaller serial fraction for 2 and 3 processors than for 4 or more. Here the reason is most likely the machine configuration; each set of 3 processors is in a single frame and shares a memory management unit. Overhead increases slightly when two of these units must coordinate their activities. This effect also shows up on the 3090-280S, which has two processors in two frames. Its run has twice the serial fraction of the corresponding two-processor run in a single frame. None of the other metrics would have revealed this effect.

Another subtle effect shows up on the Convex. The 4-processor C-240 shows a smaller serial fraction than does the 2-processor C-220. Since the same code was presumably run on both machines, the actual serial fraction must be the same. How can the measured value decrease? This appears to be similar to the superlinear speed-ups reported on some machines. As in those cases, adding processors adds cache and memory bandwidth, which reduces overhead. Perhaps that is the case here.

Care must be used when comparing different machines. For example, the serial fraction on the Cray is 3 times larger than on the Sequent. Is the Cray really that inefficient? The answer is no. Since almost all the parallel work can be vectorized, the Cray spends relatively less time in the parallel part of the code than does the Sequent, which has no vector unit. Since the parallelizable part speeds up more than the serial part, which has less vector content, the fraction of the time spent in serial code is increased.
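The reading rules used above (a nearly constant f means limited inherent parallelism, a smooth rise means growing overhead, an erratic pattern suggests load imbalance) are mechanical enough to sketch in code. The helper below is illustrative only; the tolerance is an arbitrary choice, not something from the article.

```python
def diagnose(f_values, tol=0.002):
    """Classify a series of serial fractions f(p), measured at increasing p.

    Heuristic reading of the serial-fraction metric: constant f points to
    limited parallelism, a steady rise points to synchronization or other
    overhead, and irregular jumps point to load imbalance.
    """
    diffs = [b - a for a, b in zip(f_values, f_values[1:])]
    if all(abs(d) <= tol for d in diffs):
        return "f nearly constant: efficiency loss reflects limited parallelism"
    if all(d >= -tol for d in diffs):
        return "f increasing smoothly: overhead is growing with p"
    return "f varies irregularly: suspect load imbalance"

# Hypothetical serial fractions measured at p = 2, 4, 8, 16:
print(diagnose([0.021, 0.021, 0.022, 0.021]))  # limited parallelism
print(diagnose([0.010, 0.015, 0.022, 0.031]))  # growing overhead
print(diagnose([0.000, 0.040, 0.005, 0.038]))  # load imbalance
```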
The Linpack benchmark report measures the performance of a computational kernel running on machines with no more than 30 processors. The results in Table II are taken from the work of the winners of the Gordon Bell Award [5]. Three applications are shown with maximum speed-ups of 639, 519, and 351, and efficiencies ranging from 0.34 to 0.62. We know this is good work since they won the award, but how good a job did they do? The serial fraction is never more than a few thousandths, indicating that they did a very good job, indeed.

The serial fraction reveals an interesting point. On all three problems, there is a significant reduction in the serial fraction in going from 8 to 16 processors (from 16 to 64 for the wave motion problem). As with the Convex results, these numbers indicate something akin to superlinear speed-up. Perhaps the 8-processor run sends longer messages than does the 16-processor run, and these longer messages are too long for the system to handle efficiently. At any rate, the serial fraction has pointed up an inconsistency that needs further study.

Table II. Summary of Bell Award winning performance. For each of the three applications (Wave Motion, Fluid Dynamics, Beam Stress) the table lists p (number of processors), s (speed-up), e (efficiency), and f (serial fraction). (The numeric entries did not survive transcription.)

SCALED SPEED-UP

All the analysis presented so far refers to problems of fixed size. Gustafson [4] has argued that this is not how parallel processors are used. He argues that users will increase their problem size to keep the elapsed time of the run more or less constant. As the problem size grows, we should find that the fraction of the time spent executing serial code decreases, leading us to predict a decrease in the measured serial fraction. If we assume that the serial time and overhead are independent of problem size, neither of which is fully justified [3],

T(p, k) = T_s + k T_p / p,    (6)

where T(p, k) is the time to run the program on p processors for a problem needing k times more arithmetic. Here k is the scaling factor, and k = 1 when p = 1. Flatt [3] points out that the scaling factor k must count arithmetic, not some more convenient measure such as memory size. Our definition of speed-up must now account for the additional arithmetic that must be done to solve our larger problem.
We can use

s_k = k T(1, 1) / T(p, k).    (7)

The scaled efficiency is then e_k = s_k / p, and the scaled serial fraction becomes

f_k = (1/s_k - 1/p) / (1 - 1/p).    (8)

By our previous definitions we see that f = k f_k, which under ideal circumstances would remain constant as p increases.

The scaled results are shown in Table III. Although these runs take constant time as the problem size grows, the larger problems were run with shorter time steps. A better scaling would be one in which the time integration is continued to a specific value. If the run time were still held constant, the problems would scale more slowly than linearly in p: in these examples the Courant condition limits the step size, so the correct scaling k would grow more slowly than p. The scaling chosen for the problems of Table III, however, is k = p.

As predicted [3], the efficiency decreases, barely, as p increases even for scaled problems. The scaled serial fraction, on the other hand, decreases smoothly. This fact tells us that the decreasing efficiency is not caused by a growing serial fraction. Instead, it tells us the problem size is not growing fast enough to completely counter the loss of efficiency as more processors are added.

The variation in f_k as the problem size grows makes it difficult to interpret. If we allow for the fact that ideally the increase in the problem size does not affect the serial work, we again have a metric that should remain constant as the problem size grows. Table III shows how k f_k varies with problem size. We see that this quantity grows slowly for the wave motion problem, but that there is virtually no increase for the fluid dynamics problem. These results indicate that the serial work increases for the wave motion problem as the problem size grows but not for the fluid dynamics problem. The irregular behavior of k f_k for the beam stress problem warrants further study.

Table III. Bell Award scaled problems, k = p. For each application (Wave Motion, Fluid Dynamics, Beam Stress) the table lists p (number of processors), s_k (scaled speed-up), e_k (scaled efficiency), f_k (scaled serial fraction), and k f_k. (The numeric entries did not survive transcription.)
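Here is a minimal Python sketch of the scaled metrics under the model above; it is an illustration, with hypothetical run times chosen so that the elapsed time creeps up slightly as p and k = p grow. It computes s_k, e_k, and f_k from equations (7) and (8), along with the quantity k f_k.

```python
def scaled_metrics(t11: float, tpk: float, p: int, k: float):
    """Scaled speed-up, efficiency, and serial fraction (equations 7 and 8)."""
    sk = k * t11 / tpk                           # s_k = k*T(1,1)/T(p,k)
    ek = sk / p                                  # scaled efficiency
    fk = (1.0 / sk - 1.0 / p) / (1.0 - 1.0 / p)  # scaled serial fraction
    return sk, ek, fk, k * fk

# Hypothetical measurements: problem scaled so that k = p, with the
# elapsed time nearly constant but creeping up slightly with p.
t11 = 100.0                                      # T(1,1), uniprocessor time
runs = [(2, 2.0, 101.0), (4, 4.0, 102.5), (8, 8.0, 104.0)]  # (p, k, T(p,k))
for p, k, tpk in runs:
    sk, ek, fk, kfk = scaled_metrics(t11, tpk, p, k)
    print(f"p={p}  s_k={sk:.2f}  e_k={ek:.3f}  f_k={fk:.5f}  k*f_k={kfk:.4f}")
# f_k decreases smoothly while k*f_k slowly grows, the pattern the article
# discusses for the wave motion problem in Table III.
```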
SUMMARY

We have shown that the measured serial fraction, f, provides information not revealed by other commonly used metrics. The metric, properly defined, may also be useful if the problem size is allowed to increase with the number of processors.

What makes the experimentally determined serial fraction such a good diagnostic tool of potential performance problems? While elapsed time, speed-up, and efficiency vary as the number of processors increases, the serial fraction would remain constant in an ideal system. Small variations from perfect behavior are much easier to detect from something that should be constant than from something that varies. Since k f_k for scaled problems shares this property, it, too, is a useful tool.

Ignoring the fact that p takes on only integer values makes it easy to show that

d(1/e)/dp = f,    (9)

if we ignore the overhead and assume that the serial fraction is independent of p. Thus, we see that the serial fraction is a measure of the rate of change of the efficiency. Even in the ideal case, 1/e increases linearly as the number of processors increases. Any deviation of 1/e from linearity is a sign of lost parallelism. The fraction serial is a particularly convenient measure of this deviation from linearity.

The case of problem sizes that grow as the number of processors is increased is only slightly more complicated. In this case,

d(1/e_k)/dp = f/k.    (10)

If k = 1, i.e., there is no scaling, equation (10) reduces to equation (9). If k = p, then 1/e_k has a slope of f/p, which is a very slow loss of efficiency. Other scalings lie between these two curves.

It is easy to read too much into the numerical values. When the efficiency is near unity, 1/s is close to 1/p, which leads to a loss of precision when subtracting in the numerator of equation (5). If care is not taken, variations may appear that are mere round-off noise. All entries in the tables were computed from

f = (T(p)/T(1) - 1/p) / (1 - 1/p).    (11)

Since both s and p are considerably greater than unity and neither 1/s nor 1/p is near the precision of the floating-point arithmetic, there is only one place where precision can be lost. Rounding the result to the significance of the reported times guarantees that the results are accurate. Similarly,

f_k = (T(p, k)/(k T(1, 1)) - 1/p) / (1 - 1/p)    (12)

is used for the scaled serial fraction.

Although our numerical examples come from rather simple cases, one of us (Alan Karp) has successfully used this metric to find a performance bug in one of his applications. Noting an irregularity in the behavior of f led him to examine the way the IBM Parallel Fortran prototype compiler was splitting up the work in the loops. Due to an oversight, the compiler truncated the result of dividing the number of loop iterations by the number of processors. This error meant that one processor had to do one extra pass to finish the work in the loop. For some values of p this remainder was small; for others, it was large. The solution was to increase his problem size from 350, which gives perfect load balancing on his six-processor IBM 3090-600S only for 2 and 5 processors, to 360, which always balances perfectly. (This error was reported to the IBM Parallel Fortran compiler group.)
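To close, a small Python sketch of the two checks just described: computing f directly from measured times, and verifying that 1/e grows linearly in p with slope f, as equation (9) predicts. The data are synthetic, constructed to satisfy the ideal model with f = 0.02.

```python
def f_from_times(t1: float, tp: float, p: int) -> float:
    """Serial fraction computed directly from measured times (equation 11)."""
    return (tp / t1 - 1.0 / p) / (1.0 - 1.0 / p)

def slope_of_inverse_efficiency(runs, t1):
    """Least-squares slope of 1/e versus p; equation (9) says it equals f."""
    pts = [(p, p * tp / t1) for p, tp in runs]   # 1/e = p*T(p)/T(1)
    n = len(pts)
    mx = sum(p for p, _ in pts) / n
    my = sum(y for _, y in pts) / n
    num = sum((p - mx) * (y - my) for p, y in pts)
    den = sum((p - mx) ** 2 for p, _ in pts)
    return num / den

# Synthetic runs obeying T(p) = T(1)*(f + (1-f)/p) with f = 0.02.
t1, f = 100.0, 0.02
runs = [(p, t1 * (f + (1 - f) / p)) for p in (2, 4, 8, 16)]
print([round(f_from_times(t1, tp, p), 4) for p, tp in runs])  # all 0.02
print(round(slope_of_inverse_efficiency(runs, t1), 4))        # ~0.02
```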
ABOUT THE AUTHORS:

ALAN KARP is a staff member at IBM's Palo Alto Scientific Center. He has worked on problems of radiative transfer in moving stellar matter and in planetary atmospheres, hydrodynamics problems in pulsating stars and in enhanced oil recovery, and numerical methods for parallel processors. He is currently studying the interface between programmers and parallel processors with special attention to debugging parallel algorithms.

HORACE P. FLATT is manager of IBM's Palo Alto Scientific Center. He received a Ph.D. in mathematics from Rice University, subsequently becoming manager of the applied mathematics group of Atomics International, Inc. He joined IBM and has primarily worked in management assignments in applied research in computer systems and applications.

Authors' Present Address: IBM Scientific Center, Page Mill Road, Palo Alto, CA.
